# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>


## 🗒️ This notebook is divided into 6 main sections:
1. Feature Selection.
2. Feature preprocessing.
3. Training datasets creation.
4. Loading the training data.
5. Train the model.
6. Register model to Hopsworks model registry.

![02_training-dataset](../../images/02_training-dataset.png)

### <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
import os

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
import xgboost as xgb

from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

In [None]:
# Retrieve Feature Groups
citibike_usage_fg = fs.get_or_create_feature_group(
    name="citibike_usage",
    version=1,
)

us_holidays_fg = fs.get_or_create_feature_group(
    name="us_holidays",
    version=1,
)

meteorological_measurements_fg = fs.get_or_create_feature_group(
    name="meteorological_measurements",
    version=1,
)

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

Let's start by selecting all the features you want to include for model training/inference.

In [None]:
# Select features for training data
selected_features = meteorological_measurements_fg.select_except(["timestamp"])\
                          .join(
                                us_holidays_fg.select_except(["timestamp"]),
                                on="date", join_type="left"
                          )\
                          .join(
                              citibike_usage_fg.select_except(["timestamp"]),
                              on="date", join_type="left"
                          )

In [None]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)
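
To picture what the two left joins on `date` produce, here is a minimal pandas sketch. It is only an analogy: the toy tables and their values are made up, and `DataFrame.merge` stands in for the Hopsworks query, which is evaluated by the feature store rather than in pandas.

```python
import pandas as pd

# Toy stand-ins for the three feature groups (all values are made up)
weather = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "temp": [3.5, 1.0, -2.0],
})
holidays = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "holiday": [1, 0],
})
usage = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "station_id": ["A", "B", "A"],
    "users_count": [12, 7, 9],
})

# A left join keeps every row of the left-hand table;
# dates without a match on the right get NaN in the right-hand columns
joined = (
    weather
    .merge(holidays, on="date", how="left")
    .merge(usage, on="date", how="left")
)
print(joined)
```

In this sketch the date with no usage rows survives the join with missing values, which is one possible reason rows with missing values have to be handled before training.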

`Feature Views` sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters), the model's target feature/label, and any additional transformation functions.

To create a Feature View, we can use the `FeatureStore.get_or_create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object that selects the data.

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='citibike_fv',
    query=selected_features,
    labels=["users_count"],
    version=1,   
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset, you will use the `FeatureView.train_test_split()` method.

Here are some important things to know:

- It inherits the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- You can choose the format with the **data_format** parameter.

- Use **start_time** and **end_time** to restrict the dataset to a specific time range.

- You can create **train, test** splits using `train_test_split()`. 

- You can create **train, validation, test** splits using `train_validation_test_split()`.

- For each split, specify either the desired ratio or, as in the cell below, an explicit start and end time.

In [None]:
X_train, X_test, y_train, y_test = feature_view.train_test_split(
    train_start="2023-01-01",
    train_end="2023-05-01",
    test_start="2023-05-02",
    test_end="2023-05-31",
)
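
The time-range split above can be pictured with a pandas-only sketch. The helper `split_by_time` is hypothetical and the data is made up; treating both boundaries as inclusive is an assumption of this sketch, not a statement about how Hopsworks handles them.

```python
import pandas as pd

# Toy daily data covering January to May 2023 (values are made up)
df = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-05-31", freq="D")})
df["users_count"] = range(len(df))


def split_by_time(frame, train_start, train_end, test_start, test_end, time_col="date"):
    """Hypothetical helper: return (train, test) rows whose time_col falls
    inside each range, with both boundaries treated as inclusive."""
    ts = frame[time_col]
    train = frame[(ts >= pd.Timestamp(train_start)) & (ts <= pd.Timestamp(train_end))]
    test = frame[(ts >= pd.Timestamp(test_start)) & (ts <= pd.Timestamp(test_end))]
    return train, test


train, test = split_by_time(
    df,
    train_start="2023-01-01",
    train_end="2023-05-01",
    test_start="2023-05-02",
    test_end="2023-05-31",
)
print(len(train), len(test))
```

Because the test range starts after the training range ends, no test row is older than any training row, which avoids leaking future information into training for time-dependent data.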

In [None]:
# Set the multi-level index for the training set using 'date' and 'station_id' columns
X_train = X_train.set_index(["date", "station_id"])

# Set the multi-level index for the test set using 'date' and 'station_id' columns
X_test = X_test.set_index(["date", "station_id"])

# Convert every column except the first and the last to float type
# (assigning by column name, so the new dtype is actually applied)
float_cols = X_train.columns[1:-1]
X_train[float_cols] = X_train[float_cols].astype(float)
X_test[float_cols] = X_test[float_cols].astype(float)

print(f'⛳️ X_train shape: {X_train.shape}')
print(f'⛳️ y_train shape: {y_train.shape}')

In [None]:
# Keep only rows where both the features and the labels are complete,
# so that X and y stay aligned row by row
train_mask = X_train.notna().all(axis=1).values & y_train.notna().all(axis=1).values
X_train, y_train = X_train[train_mask], y_train[train_mask]

test_mask = X_test.notna().all(axis=1).values & y_test.notna().all(axis=1).values
X_test, y_test = X_test[test_mask], y_test[test_mask]

# Display the first three rows of the training set
X_train.head(3)
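
Features and labels live in separate frames here, so rows have to be removed from both in step. The toy sketch below (made-up values, hypothetical variable names) contrasts dropping missing rows from each frame independently with using one joint mask.

```python
import numpy as np
import pandas as pd

# Toy features and labels that share the same row order (values are made up)
X = pd.DataFrame({"temp": [1.0, np.nan, 3.0, 4.0], "holiday": [0, 1, 0, 1]})
y = pd.DataFrame({"users_count": [10.0, 20.0, np.nan, 40.0]})

# Independent dropna: each frame loses a different row,
# so the remaining rows no longer pair up even though the lengths match
X_bad, y_bad = X.dropna(), y.dropna()
print(list(X_bad.index), list(y_bad.index))

# Joint mask: keep a row only if both its features and its label are complete
keep = X.notna().all(axis=1).to_numpy() & y.notna().all(axis=1).to_numpy()
X_clean, y_clean = X[keep], y[keep]
print(list(X_clean.index), list(y_clean.index))
```

A model fitted on the independently filtered frames would silently pair some feature rows with the wrong labels, so the joint mask is the safer default.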

---
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
# Create an XGBoost Regressor
regressor = xgb.XGBRegressor()

# Fit the model using the training set
regressor.fit(X_train, y_train)

In [None]:
# Predict using the trained XGBoost model
y_pred = regressor.predict(X_test)

# Calculate and print the R2 score for the XGBoost model
# (r2_score expects the true values first, then the predictions)
r2_xgb = r2_score(y_test.values, y_pred)
print("🎯 R2 score for XGBoost model:", r2_xgb)
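
The order of the arguments to an R2 function matters, because the total sum of squares is computed around the mean of the true values. A minimal numpy sketch follows; the helper `r2` is hypothetical, written to mirror the standard definition, and the numbers are made up.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot,
    where SS_tot is measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

score_correct = r2(y_true, y_pred)  # true values first
score_swapped = r2(y_pred, y_true)  # arguments swapped
print(score_correct, score_swapped)
```

With these toy values the correctly ordered call gives about 0.6, while the swapped call gives -1.0, so swapping the arguments can change the reported metric substantially.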

In [None]:
# Create a DataFrame with true and predicted values
df_ = pd.DataFrame({
    "y_true": np.hstack(y_test.values),
    "y_pred": y_pred,
})

# Create a residual plot using Seaborn
# (residplot regresses y_pred on y_true and plots the residuals of that fit)
residplot = sns.residplot(data=df_, x="y_true", y="y_pred", color='#613F75')

# Set plot titles and labels
plt.title('Model Residuals')
plt.xlabel('True users count')
plt.ylabel('Residual')

# Show the plot
plt.show()

In [None]:
# Get the figure from the residual plot
fig = residplot.get_figure()
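
Note that `sns.residplot` regresses `y` on `x` and plots the residuals of that linear fit. If the model's own errors are of interest, they can also be computed directly. A minimal numpy sketch with made-up values (the variable names are illustrative only):

```python
import numpy as np

# Toy true values and predictions (made up for illustration)
y_true = np.array([10.0, 12.0, 7.0, 15.0])
y_pred = np.array([11.0, 10.0, 8.0, 15.0])

# Residual = true value minus prediction, one per observation
residuals = y_true - y_pred

# Two common summaries of the residuals
mae = np.mean(np.abs(residuals))         # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # root mean squared error
print(residuals, mae, rmse)
```

Plotting `residuals` against `y_true` or against the observation index would then show the model errors themselves rather than the residuals of a secondary linear fit.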

---
### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Create input and output schemas using the provided training data
input_schema = Schema(X_train)
output_schema = Schema(y_train)

# Create a model schema with the input and output schemas
model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

# Convert the model schema to a dictionary
model_schema.to_dict()

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# Create a directory for the model if it does not exist
model_dir = "citibike_xgb_model"
os.makedirs(model_dir, exist_ok=True)

# Save the XGBoost regressor model as a JSON file in the model directory
regressor.save_model(os.path.join(model_dir, "model.json"))

# Save the residual plot figure as an image in the model directory
fig.savefig(os.path.join(model_dir, "residplot.png"))

In [None]:
# Get the model registry for the project
mr = project.get_model_registry()

# Create a Python model in the model registry
citibike_model = mr.python.create_model(
    name="citibike_xgb_model", 
    metrics={"r2_score": r2_xgb},
    model_schema=model_schema,
    input_example=X_train.sample(), 
    description="Citibike users per station Predictor",
)

# Save the model directory to the model registry
citibike_model.save(model_dir)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference </span>

In the next notebook you will use your registered model to make predictions on batch data.
