# <span style="font-weight:bold; font-size: 3rem; color:#333;">Training Pipeline</span>

## 🗒️ This notebook is divided into the following sections:

1. Create Feature Views
2. Train model
3. Validate model
4. Save model to model registry

### <span style='color:#ff5f27'> 📝 Imports</span>

In [1]:
import os
import datetime
import time
import json
import pickle
import joblib

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings("ignore")

2024-02-18 14:22:32,794 INFO: generated new fontManager


## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/5240


In [3]:
# Retrieve feature groups
air_quality_fg = fs.get_feature_group(
    name='air_quality',
    version=1,
)
weather_fg = fs.get_feature_group(
    name='weather',
    version=1,
)

--- 

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

In [6]:
# Select features for training data.
selected_features = air_quality_fg.select(['city', 'pm25']).join(
    weather_fg.select_all(),  on=['city', 'date'],
)

In [7]:
# Preview the first rows of the selected features
selected_features.show(5)

Finished: Reading data from Hopsworks, using ArrowFlight (0.92s) 


| | city | pm25 | date | temperature_2m_mean | precipitation_sum | wind_speed_10m_max | wind_direction_10m_dominant |
|---|---|---|---|---|---|---|---|
| 0 | stockholm | 11.0 | 2017-10-23 00:00:00+00:00 | 5.299833 | 0.0 | 14.77755 | 81.049316 |
| 1 | stockholm | 10.0 | 2017-11-13 00:00:00+00:00 | -1.093917 | 0.0 | 12.599998 | 293.791504 |
| 2 | stockholm | 12.0 | 2017-11-14 00:00:00+00:00 | -0.91475 | 0.2 | 13.698934 | 248.167618 |
| 3 | stockholm | 9.0 | 2017-12-02 00:00:00+00:00 | -0.062667 | 0.0 | 19.85556 | 238.739838 |
| 4 | stockholm | 14.0 | 2017-12-08 00:00:00+00:00 | 4.033167 | 0.0 | 23.675304 | 204.792206 |


A **Feature View** sits between **Feature Groups** and a **Training Dataset**. By combining **Feature Groups** you create a **Feature View**, which stores metadata about the selected data. From a **Feature View** you can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters), names the model's target feature (label), and can attach additional transformation functions.
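As a rough local analogy (plain pandas, not the Hopsworks API), the query behind a feature view behaves like a join between two tables followed by separating the label column. The rows below are made up for illustration; only the column names follow the feature groups used in this notebook, and the join type is simply the pandas default.

```python
import pandas as pd

# Toy stand-ins for the two feature groups (values are illustrative only)
air_quality = pd.DataFrame({
    "city": ["stockholm", "stockholm"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "pm25": [11.0, 9.0],
})
weather = pd.DataFrame({
    "city": ["stockholm", "stockholm"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "temperature_2m_mean": [-1.5, 0.3],
    "wind_speed_10m_max": [14.8, 19.9],
})

# Join on the shared keys, mirroring on=['city', 'date'] in the query.
# pandas defaults to an inner join; the Hopsworks default may differ by version.
joined = air_quality.merge(weather, on=["city", "date"])

# Separate the label from the features, mirroring labels=['pm25']
labels = joined[["pm25"]]
features = joined.drop(columns=["pm25"])

print(features.columns.tolist())
print(labels.columns.tolist())
```

The feature view itself stores only the query and label metadata; the data is read when a training dataset is requested.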

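To create a Feature View, use the `FeatureStore.create_feature_view()` method, or `FeatureStore.get_or_create_feature_view()`, which returns the existing feature view if one with the given name and version already exists.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - the target variable(s).

- `transformation_functions` - functions used to transform the features.

- `query` - the query object that selects the data.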

In [9]:
# Get or create the 'air_quality_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='air_quality_fv',
    version=1,
    labels=['pm25'],
    query=selected_features,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/5240/fs/5188/fv/air_quality_fv/version/1


In [14]:
td_version, td_job = feature_view.create_train_test_split(test_size=0.2)

Training dataset job started successfully, you can follow the progress at 
https://snurran.hops.works/p/5240/jobs/named/air_quality_fv_1_create_fv_td_18022024161536/executions




(2, <hsfs.core.job.Job at 0x7fbf10599b70>)

In [19]:
X_train, X_test, y_train, y_test = feature_view.train_test_split(test_size=0.2)

Finished: Reading data from Hopsworks, using ArrowFlight (2.00s) 


Data type could not be inferred for column 'pm25'. Defaulting to 'String'
Data type could not be inferred for column 'temperature_2m_mean'. Defaulting to 'String'
Data type could not be inferred for column 'precipitation_sum'. Defaulting to 'String'
Data type could not be inferred for column 'wind_speed_10m_max'. Defaulting to 'String'
Data type could not be inferred for column 'wind_direction_10m_dominant'. Defaulting to 'String'
Data type could not be inferred for column 'date'. Defaulting to 'String'


TypeError: Object of type Timestamp is not JSON serializable

The Feature View is now saved in Hopsworks, and you can retrieve it with `FeatureStore.get_feature_view()`.
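A minimal sketch of that retrieval in a later session, assuming `fs` is the feature store handle returned by `project.get_feature_store()`. The helper name `load_feature_view` is hypothetical; the name and version defaults match the feature view created above.

```python
def load_feature_view(fs, name="air_quality_fv", version=1):
    """Fetch an existing feature view from a connected feature store handle."""
    return fs.get_feature_view(name=name, version=version)

# Usage, once connected to Hopsworks:
# feature_view = load_feature_view(fs)
```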

In [15]:

td_version=2
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(td_version)


In [16]:
X_train

| | city | date | temperature_2m_mean | precipitation_sum | wind_speed_10m_max | wind_direction_10m_dominant |
|---|---|---|---|---|---|---|
| 0 | stockholm | 2020-07-23 | 14.412334 | 0.0 | 21.659916 | 285.032990 |
| 1 | stockholm | 2022-10-17 | 10.293582 | 5.8 | 25.420181 | 261.859436 |
| 2 | stockholm | 2022-04-26 | 6.668583 | 0.1 | 15.990646 | 343.442841 |
| 3 | stockholm | 2018-12-27 | -0.508500 | 0.0 | 11.928989 | 286.460083 |
| 4 | stockholm | 2019-05-29 | 10.593583 | 0.0 | 26.052162 | 263.531799 |
| ... | ... | ... | ... | ... | ... | ... |
| 1795 | stockholm | 2021-04-04 | 7.143584 | 0.0 | 29.795301 | 239.564911 |
| 1796 | stockholm | 2018-08-23 | 19.756083 | 0.0 | 20.073025 | 210.017105 |
| 1797 | stockholm | 2023-01-19 | -1.064750 | 0.0 | 11.200571 | 218.105850 |
| 1798 | stockholm | 2021-10-08 | 12.126919 | 0.0 | 14.904173 | 222.140091 |


In [17]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   city                         1800 non-null   object        
 1   date                         1800 non-null   datetime64[ns]
 2   temperature_2m_mean          1800 non-null   float32       
 3   precipitation_sum            1800 non-null   float32       
 4   wind_speed_10m_max           1800 non-null   float32       
 5   wind_direction_10m_dominant  1800 non-null   float32       
dtypes: datetime64[ns](1), float32(4), object(1)
memory usage: 56.4+ KB


---

## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
# Creating a LabelEncoder object
label_encoder = LabelEncoder()

# Fitting the encoder to the 'city' column of the training features
label_encoder.fit(X_train['city'])

In [None]:
# Preparing the train and test features in exactly the same way
def prepare_features(df):
    df = df.copy()
    # Adding the label-encoded city as a numeric column
    df['city_encoded'] = label_encoder.transform(df['city'])
    # Dropping columns the model cannot use directly
    return df.drop(columns=['date', 'city'])

X_train = prepare_features(X_train)
X_test = prepare_features(X_test)

In [None]:
# 'pm25' was declared as the feature view label, so it already lives in
# y_train and y_test; extracting it as a float Series for the regressor
y_train = y_train['pm25'].astype(float)
y_test = y_test['pm25'].astype(float)

In [None]:
# The feature view already produced the train/test split (test_size=0.2),
# so train_test_split is not called again; checking the resulting shapes
print(X_train.shape, X_test.shape)
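To make the label-encoding step concrete, here is a small self-contained sketch with made-up city names. `LabelEncoder` sorts the distinct values and maps each one to its index in that sorted list; a value that was not seen during `fit` raises a `ValueError` at transform time, which is why the encoder should be fitted on data that covers every city the model will encounter.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the notebook's data contains a single city
cities = ["stockholm", "berlin", "stockholm", "paris"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Classes are stored in sorted order; each code is an index into that list
print(encoder.classes_.tolist())  # ['berlin', 'paris', 'stockholm']
print(codes.tolist())             # [2, 0, 2, 1]

# The mapping can be reversed to recover the original strings
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```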

### <span style='color:#ff5f27'> ⚖️ Model Validation</span>

In [None]:
# Storing the current time as the start time of the cell execution
start_of_cell = time.time()

# Creating an instance of the XGBoost Regressor
xgb_regressor = XGBRegressor()

# Fitting the XGBoost Regressor to the training data
xgb_regressor.fit(X_train, y_train)

# Predicting target values on the test set
y_pred = xgb_regressor.predict(X_test)

# Calculating Mean Squared Error (MSE) using sklearn
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)

# Calculating Root Mean Squared Error (RMSE) as the square root of the MSE
# (the 'squared' argument of mean_squared_error is gone in recent sklearn releases)
rmse = np.sqrt(mse)
print("RMSE:", rmse)

# Calculating R squared using sklearn
r2 = r2_score(y_test, y_pred)
print("R squared:", r2)

# Storing the current time as the end time of the cell execution
end_of_cell = time.time()

# Printing information about the execution, including the time taken
print(f"Took {round(end_of_cell - start_of_cell, 2)} sec.\n")
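The three metrics reported above can be reproduced by hand, which helps when interpreting them. The sketch below uses four made-up values rather than the notebook's data: MSE is the mean of the squared errors, RMSE is its square root (in the same unit as the target), and R squared compares the model's squared error with that of always predicting the mean.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0])

errors = y_true - y_pred

# Mean Squared Error and its square root
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("R squared:", r2)
```

Here every prediction is off by exactly 1, so MSE and RMSE are both 1.0, and R squared is 1 - 4/20 = 0.8.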

In [None]:
# Creating a DataFrame 'df_' to store true and predicted values for evaluation
df_ = pd.DataFrame({
    "y_true": y_test,
    "y_pred": y_pred,
})

In [None]:
# Creating a residual plot using Seaborn
# (residplot regresses y on x and plots the residuals of that fit)
residplot = sns.residplot(data=df_, x="y_true", y="y_pred", color='orange')

# Adding title, xlabel, and ylabel to the residual plot
plt.title('Model Residuals')
plt.xlabel('True PM2.5')
plt.ylabel('Residual')

# Keeping a handle to the figure so it can be saved with the model later
fig = residplot.get_figure()

# Displaying the residual plot
plt.show()
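Note that `sns.residplot` fits its own regression of `y` on `x` and plots the residuals of that fit, which is not the same thing as the model's prediction errors. A minimal sketch of computing the prediction errors directly, using made-up values and a hypothetical `df_demo` frame with the same two columns as `df_`:

```python
import pandas as pd

# Made-up values for illustration, same layout as df_
df_demo = pd.DataFrame({
    "y_true": [10.0, 12.0, 14.0, 16.0],
    "y_pred": [11.0, 11.0, 15.0, 15.0],
})

# Prediction error per observation: positive means the model under-predicted
df_demo["residual"] = df_demo["y_true"] - df_demo["y_pred"]

print(df_demo)
print("Mean residual:", df_demo["residual"].mean())
```

A mean residual far from zero would suggest the model is systematically over- or under-predicting.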

In [None]:
# Plotting feature importances using the plot_importance function from XGBoost
# 'xgb_regressor' is the trained XGBoost Regressor
# Setting 'max_num_features' to 25 to display the top 25 most important features
plot_importance(xgb_regressor, max_num_features=25)

---

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# Retrieve the model registry
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Creating input and output schemas using the 'Schema' class for the training features and target
input_schema = Schema(X_train)
output_schema = Schema(y_train)

# Creating a model schema using 'ModelSchema' with the input and output schemas
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

# Converting the model schema to a dictionary representation
schema_dict = model_schema.to_dict()
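To get a feel for what a schema inferred from training examples captures, the sketch below lists each column's name and dtype for a small made-up DataFrame. This is only an illustration in plain pandas: `describe_columns` is a hypothetical helper, and its output is not the `hsml` implementation or its exact dictionary format.

```python
import pandas as pd

def describe_columns(df):
    """Hypothetical helper: list each column's name and pandas dtype."""
    return [{"name": name, "type": str(dtype)} for name, dtype in df.dtypes.items()]

# Made-up rows with column names similar to the prepared features
features = pd.DataFrame({
    "temperature_2m_mean": pd.Series([5.3, -1.1], dtype="float32"),
    "precipitation_sum": pd.Series([0.0, 0.2], dtype="float32"),
    "city_encoded": pd.Series([0, 0], dtype="int64"),
})

for column in describe_columns(features):
    print(column)
```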

In [None]:
# Creating a directory for the model artifacts if it doesn't exist
model_dir = "air_quality_model"
os.makedirs(model_dir, exist_ok=True)

# Saving the label encoder and XGBoost regressor as joblib files in the model directory
joblib.dump(label_encoder, model_dir + '/label_encoder.pkl')
joblib.dump(xgb_regressor, model_dir + '/xgboost_regressor.pkl')

# Saving the residual plot figure as an image in the model directory
fig.savefig(model_dir + "/residplot.png")
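Before uploading, it can be worth checking that a saved artifact loads back correctly. The sketch below does a `joblib` round trip for a small label encoder in a temporary directory; the city names are made up and the temporary path stands in for the model directory.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# A small encoder fitted on illustrative values
encoder = LabelEncoder().fit(["stockholm", "berlin"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_encoder.pkl")

    # Serialize the fitted encoder, then load it back from disk
    joblib.dump(encoder, path)
    restored = joblib.load(path)

# The restored encoder carries the same fitted classes
print(restored.classes_.tolist())
```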

In [None]:
# Creating a Python model in the model registry named 'air_quality_xgboost_model'
aq_model = mr.python.create_model(
    name="air_quality_xgboost_model", 
    metrics={
        "RMSE": rmse,
        "MSE": mse,
        "R squared": r2,
    },
    model_schema=model_schema,
    input_example=X_test.sample().values, 
    description="Air Quality (PM2.5) predictor",
)

# Uploading the contents of the local 'air_quality_model' directory to the model registry
aq_model.save(model_dir)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference</span>

In the following notebook you will use your model for Batch Inference.
