# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>

## 🗒️ This notebook is divided into 5 main sections:
1. Feature selection.
2. Feature transformations.
3. Training datasets creation.
4. Train the model.
5. Register model to Hopsworks model registry.

![02_training-dataset](../../images/02_training-dataset.png)

In [None]:
!pip install xgboost

In [None]:
import os
import joblib

import pandas as pd

from sklearn.metrics import (
    mean_absolute_error, 
    r2_score,
)
import xgboost as xgb

import matplotlib.pyplot as plt
import seaborn as sns

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connecting to the Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

In [None]:
# Retrieve feature groups
rides_fg = fs.get_feature_group(
    name="nyc_taxi_rides",
    version=1,
)

fares_fg = fs.get_feature_group(
    name="nyc_taxi_fares",
    version=1,
)

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieval </span>

First, build a query object from the features you want to use.

In [None]:
# Select features for training data
selected_features = fares_fg.select(['total_fare', "tolls"])\
                .join(rides_fg.select_except(['taxi_id', "driver_id", "pickup_datetime",
                                              "pickup_longitude", "pickup_latitude",
                                              "dropoff_longitude", "dropoff_latitude"]),
                      on=['ride_id'])

# Uncomment this if you would like to view your selected features
# selected_features.show(5)

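As a rough mental model, the query above behaves like a keyed join between two column selections. The sketch below mimics it on plain Python dictionaries. The sample rows, the `tip`, `distance` and `passenger_count` columns, the `select`, `select_except` and `join` helpers, and the inner-join behaviour are all illustrative assumptions, not the Hopsworks implementation.

```python
# Hypothetical sample rows standing in for the two feature groups.
fares = [
    {"ride_id": 1, "total_fare": 12.5, "tolls": 0.0, "tip": 2.0},
    {"ride_id": 2, "total_fare": 30.0, "tolls": 5.5, "tip": 4.0},
]
rides = [
    {"ride_id": 1, "taxi_id": 7, "distance": 3.1, "passenger_count": 1},
    {"ride_id": 2, "taxi_id": 9, "distance": 11.4, "passenger_count": 2},
]

def select(rows, columns, key="ride_id"):
    """Keep only the requested columns, plus the join key (an assumption of this sketch)."""
    return [{c: r[c] for c in [key, *columns]} for r in rows]

def select_except(rows, excluded):
    """Keep every column that is not explicitly excluded."""
    return [{c: v for c, v in r.items() if c not in excluded} for r in rows]

def join(left, right, on):
    """Inner join two row lists on a shared key column (assumed semantics)."""
    index = {r[on]: r for r in right}
    return [{**l, **index[l[on]]} for l in left if l[on] in index]

joined = join(
    select(fares, ["total_fare", "tolls"]),
    select_except(rides, ["taxi_id"]),
    on="ride_id",
)
print(joined[0])
```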
A **Feature View** sits between **Feature Groups** and a **Training Dataset**. By combining **Feature Groups** you create a **Feature View**, which stores metadata about your data. From a **Feature View** you can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters), the model's target feature (label), and any additional transformation functions.

To create a Feature View, use the `FeatureStore.create_feature_view()` method, or `FeatureStore.get_or_create_feature_view()` as in the cell below, which reuses the feature view if it already exists.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - the target variable(s).

- `transformation_functions` - functions to transform the features.

- `query` - the query object that selects the data.

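To illustrate what a transformation function does, here is a minimal min-max scaling sketch in plain Python. The `fit_min_max` and `apply_min_max` helpers and the sample values are hypothetical. The point is only that the statistics come from the training split and are then reused unchanged on other data.

```python
def fit_min_max(values):
    """Compute the statistics needed for min-max scaling from training values."""
    return min(values), max(values)

def apply_min_max(values, stats):
    """Scale values relative to the training minimum and maximum."""
    lo, hi = stats
    span = hi - lo
    # Guard against a constant column, where the span would be zero.
    return [0.0 if span == 0 else (v - lo) / span for v in values]

train_tolls = [0.0, 2.0, 4.0, 10.0]   # hypothetical training values
test_tolls = [5.0, 12.0]              # hypothetical unseen values

stats = fit_min_max(train_tolls)
scaled_train = apply_min_max(train_tolls, stats)
scaled_test = apply_min_max(test_tolls, stats)

print(scaled_train)  # [0.0, 0.2, 0.4, 1.0]
print(scaled_test)   # [0.5, 1.2]
```

Values outside the training range, such as the second test value here, land outside the interval from 0 to 1.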
In [None]:
# Get or create the 'nyc_taxi_fares_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='nyc_taxi_fares_fv',
    version=1,
    query=selected_features,
    labels=["total_fare"],
)

---

## <span style="color:#ff5f27;">🏋️ Training Dataset Creation</span>
    
In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional on-disk snapshot of the data returned by the query.

A Training Dataset may contain splits such as:

- Training set - the subset of training data used to train a model.
- Validation set - the subset of training data used to evaluate hyperparameters when training a model.
- Test set - the holdout subset of training data used to evaluate a model.

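The three splits can be pictured with a small, library-free sketch. The `three_way_split` helper, the split fractions, and the seed are assumptions made for illustration; Hopsworks performs its own splitting.

```python
import random

def three_way_split(rows, validation_size=0.2, test_size=0.2, seed=42):
    """Shuffle rows and cut them into train, validation, and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    n_val = int(len(shuffled) * validation_size)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

rows = list(range(100))  # stand-in for 100 ride records
train, validation, test = three_way_split(rows)
print(len(train), len(validation), len(test))  # 60 20 20
```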
A training dataset with train and test splits is created using the `feature_view.train_test_split` method.

In [None]:
X_train, X_test, y_train, y_test = feature_view.train_test_split(
    description='NYC taxi fares dataset',
    test_size=0.2,
)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
# List of columns to drop from X_train and X_test DataFrames
cols_to_drop = ['ride_id']

# Drop specified columns from X_train DataFrame
X_train = X_train.drop(cols_to_drop, axis=1)

# Drop specified columns from X_test DataFrame
X_test = X_test.drop(cols_to_drop, axis=1)

---
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
# Create an instance of XGBRegressor
regressor = xgb.XGBRegressor()

# Train the regressor using the training data
regressor.fit(X_train, y_train)

In [None]:
# Make predictions on the test data using the trained XGBoost regressor
y_pred = regressor.predict(X_test)

# Calculate the Mean Absolute Error (MAE)
xgb_mae = mean_absolute_error(y_test, y_pred)

# Print the calculated XGBRegressor MAE
print("XGBRegressor MAE:", xgb_mae)

# Store the calculated metrics in a dictionary
metrics = {
    'mae': xgb_mae
}

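For reference, MAE is the average of the absolute differences between true and predicted values. The sketch below computes it by hand on made-up numbers, alongside the coefficient of determination (imported above as `r2_score`). The helper names and sample values are purely illustrative.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true_example = [10.0, 20.0, 30.0, 40.0]   # made-up fares
y_pred_example = [12.0, 18.0, 33.0, 37.0]   # made-up predictions

print(mean_absolute_error_manual(y_true_example, y_pred_example))  # 2.5
print(r2_score_manual(y_true_example, y_pred_example))             # about 0.948
```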
### Remember, the data is random, so the results are not accurate at all.

In [None]:
# Create a DataFrame containing true and predicted values
df_ = pd.DataFrame({
    "y_true": y_test.total_fare.tolist(),
    "y_pred": y_pred,
})

# Create a residual plot using Seaborn
residplot = sns.residplot(
    data=df_, 
    x="y_true", 
    y="y_pred", 
    color='#613F75',
)

# Set plot title and axis labels
# (the x-axis shows the true values, the y-axis the residuals)
plt.title('Model Residuals')
plt.xlabel('True total fare')
plt.ylabel('Residual')

# Display the residual plot
plt.show()

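Note that `sns.residplot` does not plot `y_true - y_pred` directly: it regresses `y` on `x` and plots the residuals of that fit against `x`. A minimal sketch of that computation with an ordinary least-squares line follows. The `linear_fit_residuals` helper and the sample numbers are hypothetical.

```python
def linear_fit_residuals(x, y):
    """Fit y = a + b*x by ordinary least squares and return the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

true_fares = [10.0, 20.0, 30.0, 40.0]        # made-up true values
predicted_fares = [12.0, 18.0, 33.0, 37.0]   # made-up predictions

residuals = linear_fit_residuals(true_fares, predicted_fares)
# Residuals of a least-squares fit with an intercept sum to (numerically) zero.
print(abs(sum(residuals)) < 1e-9)
```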
In [None]:
# Get the figure from the Seaborn residual plot
fig = residplot.get_figure()

# Display the figure
fig.show()

---
### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Define an input schema using the features (X_train)
input_schema = Schema(X_train)

# Define an output schema using the target variable (y_train)
output_schema = Schema(y_train)

# Create a model schema using the defined input and output schemas
model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

# Convert the model schema to a dictionary representation
model_schema_dict = model_schema.to_dict()

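Conceptually, generating a schema from training examples amounts to recording each column's name and type. The sketch below does this for a list of dictionaries. The `infer_schema` helper, the sample columns, and the output layout are hypothetical and do not reproduce the format produced by `hsml`.

```python
def infer_schema(rows):
    """Map each column name to the Python type name of its first value."""
    first = rows[0]
    return [{"name": name, "type": type(value).__name__} for name, value in first.items()]

# Made-up training examples.
example_rows = [
    {"tolls": 0.0, "distance": 3.1, "passenger_count": 1},
    {"tolls": 5.5, "distance": 11.4, "passenger_count": 2},
]

schema = infer_schema(example_rows)
print(schema)
```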
## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

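Comparing versions usually comes down to ranking them by a stored metric. A minimal sketch, assuming each version is represented by a dictionary with a `metrics` entry like the one built above. The version records and the `best_version` helper are made up for illustration and are not registry API calls.

```python
def best_version(versions, metric="mae", direction="min"):
    """Return the version record with the best value for the given metric."""
    pick = min if direction == "min" else max
    return pick(versions, key=lambda v: v["metrics"][metric])

# Made-up records for three registered versions of one model.
versions = [
    {"version": 1, "metrics": {"mae": 4.2}},
    {"version": 2, "metrics": {"mae": 3.7}},
    {"version": 3, "metrics": {"mae": 3.9}},
]

print(best_version(versions)["version"])  # 2
```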
In [None]:
# Specify the directory for saving the model artifacts
model_dir = "nyc_taxi_fares_model"

# Create the directory if it does not already exist
os.makedirs(model_dir, exist_ok=True)

# Save the trained XGBoost regressor to a JSON file in the specified directory
regressor.save_model(os.path.join(model_dir, "model.json"))

# Save the residual plot figure as an image file in the specified directory
fig.savefig(os.path.join(model_dir, "residplot.png"))

With the schema in place, you can finally register your model.

In [None]:
# Get the model registry
mr = project.get_model_registry()

# Create a Python model in the model registry
nyc_model = mr.python.create_model(
    name="nyc_taxi_fares_model",
    metrics=metrics,
    model_schema=model_schema,
    input_example=X_train.sample().values,
    description="NYC taxi fares predictor",
)

# Upload the model artifacts from the local directory to the model registry
nyc_model.save(model_dir)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference </span>

In the next notebook you will use your registered model to predict batch data.
