# Feature Engineering Pipeline with Feature Store

## Overview
In this project I use the Hopsworks Feature Store. Hopsworks is an open-source, data-intensive AI platform for developing and operating machine learning models at scale. Its Feature Store provides the HSFS API, which lets clients:

- write features to feature groups in the feature store, and
- read features from feature views.

![The Hopsworks Architecture](../images/fs_architecture.jpg "The Hopsworks Architecture")

## Feature Pipelines
The Feature Pipeline is the foundation of the FTI (Feature, Training, Inference) architecture. It transforms raw data into engineered features that are ready for both training and inference. This involves:

- **Data Extraction**: Retrieving raw data from various sources, such as relational databases, APIs, or data lakes. This part can be separated from the Feature Pipeline.
- **Feature Engineering**: Performing transformations like aggregations, scaling, encoding, and computing derived metrics.
- **Feature Storage**: Saving the processed features in a feature store (e.g., Feast, Hopsworks) for reuse during training and inference.
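
The three stages above can be sketched in plain Python. This is only a minimal illustration, not the pipeline used in this project: the sample rows, the `AQI_LEVELS` ordering, the function names, and the dictionary standing in for a feature store are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical raw readings, shaped like the air-quality rows used in this notebook.
RAW_ROWS = [
    {"row_id": 0, "pm2_5": 64.62, "aqi_range": "Very Poor"},
    {"row_id": 1, "pm2_5": 93.95, "aqi_range": "Very Poor"},
    {"row_id": 2, "pm2_5": 136.28, "aqi_range": "Very Poor"},
]

# Assumed ordering of the categorical labels, for illustration only.
AQI_LEVELS = {"Good": 1, "Fair": 2, "Moderate": 3, "Poor": 4, "Very Poor": 5}


def extract():
    """Data extraction: stand-in for reading from a database, API, or data lake."""
    return [dict(row) for row in RAW_ROWS]


def engineer(rows, window=2):
    """Feature engineering: add a trailing mean of pm2_5 and an ordinal label encoding."""
    features = []
    for i, row in enumerate(rows):
        recent = [r["pm2_5"] for r in rows[max(0, i - window + 1): i + 1]]
        features.append({
            **row,
            "pm2_5_trailing_mean": round(mean(recent), 2),
            "aqi_level": AQI_LEVELS[row["aqi_range"]],
        })
    return features


def store(features, feature_store):
    """Feature storage: stand-in for a feature group write, upserting by primary key."""
    for row in features:
        feature_store[row["row_id"]] = row
    return feature_store


fake_store = store(engineer(extract()), {})
```

Keeping extraction as its own function reflects the note above that this step can be separated from the rest of the pipeline.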


![The ETL Architecture](../images/ETL_architecture.png "The ETL Architecture")

In [3]:
import os
import sys
from pathlib import Path
import time
from dotenv import load_dotenv
import hopsworks
from confluent_kafka import Producer
import pandas as pd

sys.path.insert(0, str(Path().resolve().parent / "src"))

from paths import TRANSFORMED_DATA_DIR


# load environment
load_dotenv()


HOPSWORK_LOGIN_API_KEY = os.getenv("HOPSWORK_LOGIN_API_KEY")


In [4]:
# hopsworks version
hopsworks.__version__

'4.1.8'

In [34]:
# Login to the Hopsworks feature store
connection = hopsworks.login(
    host='c.app.hopsworks.ai',                 # DNS of your Feature Store instance
    port=443,  
    project='air_quality_project', 
    engine="python",
    api_key_value=HOPSWORK_LOGIN_API_KEY
)


2025-03-09 00:29:31,473 INFO: Closing external client and cleaning up certificates.


Connection closed.
2025-03-09 00:29:31,557 INFO: Initializing external client
2025-03-09 00:29:31,558 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-03-09 00:29:36,823 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1214615


In [22]:
# connect to feature store
project_name = "air_quality_project"

try:
    feature_store = connection.get_feature_store(name=project_name)
    print(f"✅ Successfully Connected to {feature_store.project_name}")
except Exception as e:
    # include the underlying error so the failure can be diagnosed
    print(f"❌ Feature store not available: {e}")

✅ Successfully Connected to air_quality_project


In [23]:
# Get or Create a feature group

fg = feature_store.get_or_create_feature_group(
    name="air_quality_historical_data_2020_to_2025",
    version=1,
    description="Historical Data of Air Quality in Lagos",
    primary_key=['row_id'],
    event_time='timestamp',
    online_enabled=True,
)

In [24]:
# dataframe
df = pd.read_csv(f"{TRANSFORMED_DATA_DIR}/weather_20200101_to_20250201.csv")
df.head()

|   | row_id | aqi | co | no | no2 | o3 | so2 | pm2_5 | pm10 | nh3 | timestamp | date | time | aqi_range |
|---|--------|-----|----|----|-----|----|-----|-------|------|-----|-----------|------|------|-----------|
| 0 | 0 | 5 | 1682.28 | 0.13 | 18.85 | 12.88 | 8.82 | 64.62 | 90.85 | 17.48 | 2020-11-25 01:00:00 | 2020-11-25 | 01:00:00 | Very Poor |
| 1 | 1 | 5 | 2109.53 | 0.36 | 21.94 | 9.3 | 10.37 | 93.95 | 127.43 | 21.03 | 2020-11-25 02:00:00 | 2020-11-25 | 02:00:00 | Very Poor |
| 2 | 2 | 5 | 2750.4 | 1.41 | 26.39 | 4.16 | 12.52 | 136.28 | 181.39 | 25.59 | 2020-11-25 03:00:00 | 2020-11-25 | 03:00:00 | Very Poor |
| 3 | 3 | 5 | 3337.86 | 4.81 | 28.45 | 0.78 | 14.07 | 175.09 | 233.2 | 28.63 | 2020-11-25 04:00:00 | 2020-11-25 | 04:00:00 | Very Poor |
| 4 | 4 | 5 | 3738.4 | 10.95 | 28.45 | 0.1 | 15.26 | 200.27 | 262.51 | 30.91 | 2020-11-25 05:00:00 | 2020-11-25 | 05:00:00 | Very Poor |


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36311 entries, 0 to 36310
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   row_id     36311 non-null  int64  
 1   aqi        36311 non-null  int64  
 2   co         36311 non-null  float64
 3   no         36311 non-null  float64
 4   no2        36311 non-null  float64
 5   o3         36311 non-null  float64
 6   so2        36311 non-null  float64
 7   pm2_5      36311 non-null  float64
 8   pm10       36311 non-null  float64
 9   nh3        36311 non-null  float64
 10  timestamp  36311 non-null  object 
 11  date       36311 non-null  object 
 12  time       36311 non-null  object 
 13  aqi_range  36311 non-null  object 
dtypes: float64(8), int64(2), object(4)
memory usage: 3.9+ MB


In [28]:
# Convert timestamp column to datetime
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Convert 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Parse 'time' column as datetime (the string has no date, so the date part defaults to 1900-01-01)
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
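
Parsing a bare time string with the format `%H:%M:%S` produces a full datetime, not a time-of-day value. The standard-library sketch below shows the effect with `datetime.strptime`; `pd.to_datetime` with the same format is expected to fill in the same default date, though that is worth confirming on the pandas version in use. The sample string is taken from the first row shown earlier.

```python
from datetime import datetime

# Parse a time-only string, as in the 'time' column.
parsed = datetime.strptime("01:00:00", "%H:%M:%S")

# The missing date fields fall back to 1900-01-01.
print(parsed.date())

# If only the time of day is wanted, it can be extracted explicitly.
time_only = parsed.time()
print(time_only)
```

Whether the feature store should hold the full datetime or only the time of day is a design choice; the `timestamp` column already carries both.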

In [29]:
# save dataframe into feature group
start_time = time.time()

try:
    fg.save(df, write_options={"wait_for_job": False})
except Exception as err:
    # report the actual error instead of guessing at the cause, then re-raise
    print(f"Saving to feature group {fg.name} failed: {err}")
    raise

print("Upload time %s seconds ---" % (time.time() - start_time))
print('✅ Done!')

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1214615/fs/1202247/fg/1403746


Uploading Dataframe: 100.00% |██████████| Rows 36311/36311 | Elapsed Time: 00:26 | Remaining Time: 00:00


Launching job: air_quality_historical_data_2020_to_2025_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1214615/jobs/named/air_quality_historical_data_2020_to_2025_1_offline_fg_materialization/executions
Upload time 50.216068983078 seconds ---
✅ Done!
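
The cell above measures elapsed time with `time.time()`. If the same timing is needed around other steps, a small context manager built on `time.perf_counter()` (a monotonic clock intended for measuring durations) is one option. `timed` is a hypothetical helper, and `time.sleep` stands in for the actual upload call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took; optionally record it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} took {elapsed:.2f} seconds")


timings = {}
with timed("upload", timings):
    time.sleep(0.01)  # stand-in for the feature group upload
```

Because the timing runs in a `finally` clause, the duration is reported even when the wrapped step raises.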


In [36]:
# updates the feature description
feature_descriptions = [
    {"name": "row_id", "description": "Unique identifier for each record."},
    {"name": "aqi", "description": "Air Quality Index (AQI) value indicating the pollution level."},
    {"name": "co", "description": "Carbon Monoxide (CO) concentration in µg/m³."},
    {"name": "no", "description": "Nitric Oxide (NO) concentration in µg/m³."},
    {"name": "no2", "description": "Nitrogen Dioxide (NO₂) concentration in µg/m³."},
    {"name": "o3", "description": "Ozone (O₃) concentration in µg/m³."},
    {"name": "so2", "description": "Sulfur Dioxide (SO₂) concentration in µg/m³."},
    {"name": "pm2_5", "description": "Fine Particulate Matter (PM2.5) concentration in µg/m³."},
    {"name": "pm10", "description": "Coarse Particulate Matter (PM10) concentration in µg/m³."},
    {"name": "nh3", "description": "Ammonia (NH₃) concentration in µg/m³."},
    {"name": "timestamp", "description": "Timestamp of the recorded measurement (YYYY-MM-DD HH:MM:SS)."},
    {"name": "date", "description": "The date of the recorded measurement (YYYY-MM-DD)."},
    {"name": "time", "description": "The time of the recorded measurement (HH:MM:SS)."},
    {"name": "aqi_range", "description": "Categorical label describing the AQI level."}
]

for desc in feature_descriptions: 
    fg.update_feature_description(desc["name"], desc["description"])

RestAPIError: Metadata operation error: (url: https://c.app.hopsworks.ai/hopsworks-api/api/project/1214615/featurestores/1202247/featuregroups/1403746). Server response: 
HTTP code: 500, HTTP reason: Internal Server Error, body: b'{"errorCode":120003,"usrMsg":"Transaction marked for rollback.","errorMsg":"The last transaction did not complete as expected"}', error code: 120003, error msg: The last transaction did not complete as expected, user msg: Transaction marked for rollback.
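
A loop like the one above assumes that every entry has a `name` and a `description` key and that each name matches a column in the feature group. A local pre-check, sketched below, can catch such mistakes before any request is sent. `validate_descriptions` is a hypothetical helper, the sample data is shortened for illustration, and the check makes no claim about the cause of the server error shown above.

```python
def validate_descriptions(descriptions, columns):
    """Return a list of problems found; an empty list means the input looks consistent."""
    problems = []
    seen = set()
    for i, entry in enumerate(descriptions):
        missing = {"name", "description"} - set(entry)
        if missing:
            problems.append(f"entry {i}: missing key(s) {sorted(missing)}")
            continue
        if entry["name"] not in columns:
            problems.append(f"entry {i}: unknown feature {entry['name']!r}")
        seen.add(entry["name"])
    for col in columns:
        if col not in seen:
            problems.append(f"no description for {col!r}")
    return problems


# Shortened, illustrative column list and description entries.
columns = ["row_id", "aqi", "aqi_range"]
good = [
    {"name": "row_id", "description": "Unique identifier for each record."},
    {"name": "aqi", "description": "Air Quality Index value."},
    {"name": "aqi_range", "description": "Categorical AQI label."},
]
bad = [
    {"name": "row_id", "description": "Unique identifier for each record."},
    {"name": "aqi", "description": "Air Quality Index value."},
    {"label": "aqi_range", "description": "Entry without a 'name' key."},
]

print(validate_descriptions(good, columns))
print(validate_descriptions(bad, columns))
```

In the notebook, `columns` could be taken from `df.columns`, and the update loop would run only when the returned list is empty.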

## Create Feature View

A feature view is a logical view over (or interface to) a set of features that may come from different feature groups. You create a feature view by joining together features from existing feature groups. 

Feature views can include:

- the label for the supervised ML problem
- transformation functions that should be applied to specified features consistently between training and serving
- the ability to create training data
- the ability to retrieve a feature vector with the most recent feature values
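
Conceptually, a feature view selects a subset of features and marks one of them as the label. The plain-Python sketch below illustrates that idea only; it does not use the Hopsworks API, and `drop_columns`, `split_label`, and the sample rows are hypothetical names and data chosen for the example.

```python
def drop_columns(rows, excluded):
    """Keep every column except the excluded ones."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def split_label(rows, label):
    """Separate the label column from the remaining feature columns."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


# Two illustrative rows with a reduced set of columns.
rows = [
    {"row_id": 0, "aqi": 5, "co": 1682.28, "pm2_5": 64.62, "aqi_range": "Very Poor"},
    {"row_id": 1, "aqi": 5, "co": 2109.53, "pm2_5": 93.95, "aqi_range": "Very Poor"},
]

selected = drop_columns(rows, {"row_id", "aqi"})
X, y = split_label(selected, "aqi_range")
```

The label has to be among the selected columns; excluding it from the selection and then naming it as the label would fail.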

In [37]:
# This feature view only uses one feature group, so the query is trivial

# select all features except row_id, timestamp, date, time, aqi
query = fg.select_except(["row_id","timestamp","date","time", "aqi"])

try:
    # create feature view if it doesn't exist yet
    feature_view = feature_store.create_feature_view(
        name='air_quality_view',
        description="Features from Air Quality Data",
        labels=["aqi_range"],  # label column as it appears in the dataframe
        query=query,
    )
except Exception as e:
    # creation can fail for reasons other than "already exists", so show the error
    print(f'Feature view was not created (it may already exist): {e}')

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1214615/fs/1202247/fv/air_quality_view/version/1


In [38]:
# get feature view
feature_view = feature_store.get_feature_view(
    name="air_quality_view"
)

