In [1]:
import sys
from pathlib import Path

def is_google_colab() -> bool:
    if "google.colab" in str(get_ipython()):
        return True
    return False

def clone_repository() -> None:
    !git clone https://github.com/featurestorebook/mlfs-book.git
    %cd mlfs-book

def install_dependencies() -> None:
    !pip install --upgrade uv
    !uv pip install --all-extras --system --requirement pyproject.toml

if is_google_colab():
    clone_repository()
    install_dependencies()
    root_dir = str(Path().absolute())
    print("Google Colab environment")
else:
    root_dir = Path().absolute()
    # Strip notebooks/airquality from the path if the notebook was started in one of these subdirectories
    if root_dir.parts[-1:] == ('airquality',):
        root_dir = Path(*root_dir.parts[:-1])
    if root_dir.parts[-1:] == ('notebooks',):
        root_dir = Path(*root_dir.parts[:-1])
    root_dir = str(root_dir) 
    print("Local environment")

# Add the root directory to `sys.path` so the `mlfs` Python module can be imported from the notebook.
if root_dir not in sys.path:
    sys.path.append(root_dir)
print(f"Added the following directory to the PYTHONPATH: {root_dir}")
    
# Set the environment variables from the file <root_dir>/.env
from mlfs import config
settings = config.HopsworksSettings(_env_file=f"{root_dir}/.env")

Local environment
Added the following directory to the PYTHONPATH: c:\Users\lulev\Desktop\KTH\mlfs-book
HopsworksSettings initialized!
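
The directory-stripping logic in the local branch of the cell above can be read as a small pure function. The sketch below is illustrative only: `resolve_root` is a hypothetical helper name (not part of the `mlfs` package), the `/home/user/...` paths are made up, and `PurePosixPath` is used so the behaviour can be checked without touching the file system.

```python
from pathlib import PurePosixPath

def resolve_root(start: PurePosixPath) -> PurePosixPath:
    """Walk up from notebooks/airquality (or notebooks) to the repository root."""
    root = start
    # Strip trailing components in the same order as the notebook cell above.
    for trailing in ("airquality", "notebooks"):
        if root.name == trailing:
            root = root.parent
    return root

# Hypothetical example paths.
print(resolve_root(PurePosixPath("/home/user/mlfs-book/notebooks/airquality")))
print(resolve_root(PurePosixPath("/home/user/mlfs-book")))
```

A path that is already the repository root is returned unchanged, so the cell behaves the same whether the notebook is started from the root, from `notebooks`, or from `notebooks/airquality`.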


# <span style="font-weight:bold; font-size: 3rem; color:#333;">Training Pipeline</span>

## 🗒️ This notebook is divided into the following sections:

1. Select features for the model and create a Feature View with the selected features
2. Create training data using the feature view
3. Train model
4. Evaluate model performance
5. Save model to model registry

### <span style='color:#ff5f27'> 📝 Imports </span>

In [2]:
import os
from datetime import datetime, timedelta
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.metrics import mean_squared_error, r2_score
import hopsworks
from mlfs.airquality import util
import json

import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connect to Hopsworks Feature Store </span>

In [3]:
# Use HOPSWORKS_API_KEY from the settings (environment variable or <root_dir>/.env) if it is set
if settings.HOPSWORKS_API_KEY is not None:
    api_key = settings.HOPSWORKS_API_KEY.get_secret_value()
    os.environ['HOPSWORKS_API_KEY'] = api_key
project = hopsworks.login(engine="python")
fs = project.get_feature_store() 

secrets = hopsworks.get_secrets_api()
location_str = secrets.get_secret("SENSOR_LOCATION_JSON").value
city_dict = json.loads(location_str)

2025-11-15 12:06:39,509 INFO: Initializing external client
2025-11-15 12:06:39,510 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-11-15 12:06:41,284 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1279155


In [4]:
# Retrieve feature groups
air_quality_fg = fs.get_feature_group(
    name='air_quality',
    version=1,
)
weather_fg = fs.get_feature_group(
    name='weather',
    version=1,
)

--- 

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

In [5]:
# Select features for training data.
selected_features = air_quality_fg.select(['pm25', 'date', 'city']).join(weather_fg.select_features(), on=['city'])

2025-11-15 12:06:43,590 INFO: Using ['temperature_2m_mean', 'precipitation_sum', 'wind_speed_10m_max', 'wind_direction_10m_dominant'] from feature group `weather` as features for the query. To include primary key and event time use `select_all`.


### Feature Views

`Feature Views` are selections of features from different **Feature Groups** that make up the input and output API (or schema) for a model. A **Feature View** can create **Training Data** and can also be used at inference time to retrieve inference data.

A Feature View defines its schema as a query (optionally with filters), together with the model's target feature/label and optional transformation functions (declarative feature encoding).

To create a Feature View, use the `FeatureStore.get_or_create_feature_view()` method.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `description` - free-text description of the feature view.

- `labels` - our target variable.

- `transformation_functions` - declarative feature encoding (not used here)

- `query` - selected features/labels for the model
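
To make the role of `labels` concrete, the following plain-Python sketch mimics what a feature view's schema implies: each joined row is split into model inputs and the target. It does not call Hopsworks; `split_features_and_labels` is a hypothetical helper and the sample row is made up, with column names borrowed from the query above.

```python
from typing import Any

def split_features_and_labels(
    rows: list[dict[str, Any]], labels: list[str]
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate label columns from feature columns, row by row."""
    features = [{k: v for k, v in row.items() if k not in labels} for row in rows]
    targets = [{k: row[k] for k in labels} for row in rows]
    return features, targets

# Illustrative row only; the values are made up, not read from the feature store.
rows = [{"city": "bodo", "date": "2025-01-01", "temperature_2m_mean": 1.5, "pm25": 9.0}]
X, y = split_features_and_labels(rows, labels=["pm25"])
print(X)
print(y)
```

This mirrors why `X_train` in this notebook has no `pm25` column while `y_train` contains only `pm25`.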

In [6]:
feature_view = fs.get_or_create_feature_view(
    name='air_quality_fv',
    description="weather features with air quality as the target",
    version=1,
    labels=['pm25'],
    query=selected_features,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1279155/fs/1265766/fv/air_quality_fv/version/1


## <span style="color:#ff5f27;">🪝 Split the training data into train/test data sets </span>

We use a time-series split here: rows before `start_date_test_data` form the training set, and rows after it form the test set.
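
The idea behind the time-based split can be sketched in plain Python as follows. This illustrates the concept rather than the Hopsworks implementation: `time_series_split` is a hypothetical helper, the sample rows are made up, and placing rows dated exactly on the cutoff in the test set is an assumption.

```python
from datetime import datetime

def time_series_split(rows, cutoff):
    """Rows dated before `cutoff` go to train; the rest go to test (assumed boundary rule)."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

cutoff = datetime.strptime("2025-10-15", "%Y-%m-%d")
# Made-up sample rows for illustration.
rows = [
    {"date": datetime(2025, 10, 13), "pm25": 6.0},
    {"date": datetime(2025, 10, 15), "pm25": 9.0},
    {"date": datetime(2025, 10, 20), "pm25": 13.0},
]
train, test = time_series_split(rows, cutoff)
print(len(train), len(test))
```

Splitting by time rather than at random keeps later observations out of the training set, so the test error is a fairer estimate of how the model behaves on dates it has not seen.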

In [7]:
start_date_test_data = "2025-10-15"
# Convert string to datetime object
test_start = datetime.strptime(start_date_test_data, "%Y-%m-%d")

In [8]:
X_train, X_test, y_train, y_test = feature_view.train_test_split(
    test_start=test_start
)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (1.39s) 


In [9]:
X_train

Unnamed: 0,date,city,temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant
0,2020-01-09 00:00:00+00:00,moirana,1.439583,9.900001,23.263912,246.766129
1,2020-01-10 00:00:00+00:00,moirana,1.181250,12.100000,9.028754,235.425720
2,2020-01-11 00:00:00+00:00,moirana,1.039583,23.799997,8.669949,355.277649
3,2020-01-12 00:00:00+00:00,moirana,1.850000,11.099999,15.696165,221.092133
4,2020-01-13 00:00:00+00:00,moirana,1.185417,14.700000,12.727921,224.097076
...,...,...,...,...,...,...
4173,2025-10-10 00:00:00+00:00,bodo,5.876667,32.100002,24.627789,224.267975
4174,2025-10-11 00:00:00+00:00,bodo,6.351667,23.200001,22.910259,269.655243
4175,2025-10-12 00:00:00+00:00,bodo,6.149584,1.000000,16.198000,343.743927
4176,2025-10-13 00:00:00+00:00,bodo,5.395416,7.299999,17.902534,112.315926


In [10]:
X_features = X_train.drop(columns=['date'])
X_test_features = X_test.drop(columns=['date'])

The `Feature View` is now saved in Hopsworks and you can retrieve it using `FeatureStore.get_feature_view(name='...', version=1)`.

---

## <span style="color:#ff5f27;">🧬 Modeling</span>

We will train one XGBoost regression model per city to predict `pm25` from our 4 weather features (`temperature_2m_mean`, `precipitation_sum`, `wind_speed_10m_max`, `wind_direction_10m_dominant`).

In [11]:
models = {}
for city in city_dict:
    # Creating an instance of the XGBoost Regressor
    models[city] = XGBRegressor()
    # Fitting the XGBoost Regressor to the training data
    mask = X_features["city"]==city
    models[city].fit(X_features.loc[mask].drop(columns=["city"]), y_train.loc[mask])


In [12]:
y_test

Unnamed: 0,pm25
2061,9.0
2062,10.0
2063,5.0
2064,6.0
2065,6.0
2066,13.0
2067,24.0
2068,20.0
2069,19.0
2070,29.0


In [13]:
metrics_dict = {}
for city in city_dict:
    # Predicting target values on the test set
    mask = X_test_features["city"]==city
    y_pred = models[city].predict(X_test_features.loc[mask].drop(columns=["city"]))

    # Calculating Mean Squared Error (MSE) using sklearn
    mse = mean_squared_error(y_test.loc[mask].iloc[:,0], y_pred)
    print(f"{city} MSE: {mse}")

    # # Calculating R squared using sklearn (restricted to the current city's rows)
    # r2 = r2_score(y_test.loc[mask].iloc[:,0], y_pred)
    # print(f"{city} R squared: {r2}")

    metrics_dict[city] = {
        "mse": str(mse)
    }
metrics_dict

bodo MSE: 48.666263580322266
moirana MSE: 174.13275146484375


{'bodo': {'mse': '48.666264'}, 'moirana': {'mse': '174.13275'}}
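
MSE is expressed in squared units of the target, so it is often easier to interpret after taking the square root (RMSE), which is in the same units as `pm25`. The sketch below computes MSE by hand on made-up numbers and then converts the two MSE values printed above. `manual_mse` and `rmse_from_mse` are hypothetical helper names, not the sklearn calls used in the notebook.

```python
import math

def manual_mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse_from_mse(value):
    """Square root of an MSE value, in the same units as the target."""
    return math.sqrt(value)

# Made-up targets and predictions, for illustration only.
example_mse = manual_mse([9.0, 10.0, 5.0], [8.0, 12.0, 5.0])
print(example_mse)

# MSE values reported by the evaluation cell above.
reported = {"bodo": 48.666263580322266, "moirana": 174.13275146484375}
for city, value in reported.items():
    print(f"{city} RMSE: {rmse_from_mse(value):.2f}")
```

Read this way, the typical prediction error for `moirana` is roughly twice that for `bodo` on this test window.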

In [14]:
# df = y_test
# df['predicted_pm25'] = y_pred

In [15]:
# df['date'] = X_test['date']
# df = df.sort_values(by=['date'])
# df.head(5)

In [16]:
# Creating directories for the model artifacts and images if they don't exist
model_dir = "air_quality_model"
images_dir = model_dir + "/images"
os.makedirs(images_dir, exist_ok=True)

In [17]:
# file_path = images_dir + "/pm25_hindcast.png"
# plt = util.plot_air_quality_forecast(city, street, df, file_path, hindcast=True) 
# plt.show()

In [18]:
# Plotting feature importances using the plot_importance function from XGBoost
# plot_importance(xgb_regressor)
# feature_importance_path = images_dir + "/feature_importance.png"
# plt.savefig(feature_importance_path)
# plt.show()

---

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [19]:
# Saving each city's XGBoost regressor as a JSON file in the model directory
for city in city_dict:
    models[city].save_model(f"{model_dir}/{city}_model.json")

In [20]:
mr = project.get_model_registry()

# Creating one Python model per city in the model registry, named 'air_quality_xgboost_model_<city>'

for city in city_dict:
    aq_model = mr.python.create_model(
        name=f"air_quality_xgboost_model_{city}", 
        metrics= metrics_dict[city],
        feature_view=feature_view,
        description="Air Quality (PM2.5) predictor",
    )

    # Uploading the contents of the local 'air_quality_model' directory to the model registry
    aq_model.save(model_dir)

  0%|          | 0/6 [00:00<?, ?it/s]

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\air_quality_model/bodo_model.json: 0.000%|…

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\air_quality_model/bodö_model.json: 0.000%|…

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\air_quality_model/model.json: 0.000%|     …

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\air_quality_model/moirana_model.json: 0.00…

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\air_quality_model\images/feature_importanc…

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\air_quality_model\images/pm25_hindcast.png…

Uploading c:\Users\lulev\Desktop\KTH\mlfs-book\notebooks\airquality\model_schema.json: 0.000%|          | 0/67…

Model created, explore it at https://c.app.hopsworks.ai:443/p/1279155/models/air_quality_xgboost_model_bodo/1


Model created, explore it at https://c.app.hopsworks.ai:443/p/1279155/models/air_quality_xgboost_model_moirana/1
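
The upload log above lists both `bodo_model.json` and `moirana_model.json` (plus other files) for each registered model, which suggests that the whole `air_quality_model` directory is uploaded every time. If one artifact per model is preferred, one option is to stage each city's file in its own directory and pass that directory to `save`. The following is a minimal sketch, assuming the `{city}_model.json` naming from the cell above; `stage_city_artifacts` is a hypothetical helper, and the demo uses a temporary directory with placeholder files instead of real models.

```python
import shutil
import tempfile
from pathlib import Path

def stage_city_artifacts(model_dir: str, city: str) -> Path:
    """Copy <model_dir>/<city>_model.json into <model_dir>/<city>/ and return that directory."""
    source = Path(model_dir) / f"{city}_model.json"
    target_dir = Path(model_dir) / city
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target_dir / source.name)
    return target_dir

# Demo with placeholder files in a temporary directory (no Hopsworks calls).
demo_dir = tempfile.mkdtemp()
for city in ("bodo", "moirana"):
    (Path(demo_dir) / f"{city}_model.json").write_text("{}")

staged = stage_city_artifacts(demo_dir, "bodo")
print(sorted(p.name for p in staged.iterdir()))
```

In the registration loop, calling `aq_model.save(str(stage_city_artifacts(model_dir, city)))` instead of `aq_model.save(model_dir)` would then be expected to upload only that city's file.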


---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference</span>

In the following notebook you will use your model for Batch Inference.
