# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Dataset and Modeling</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/citibike/3_training_dataset_and_modeling.ipynb)



## 🗒️ This notebook is divided into 6 main sections:
1. Feature Selection.
2. Feature preprocessing.
3. Training datasets creation.
4. Loading the training data.
5. Train the model.
6. Register model to Hopsworks model registry.

![02_training-dataset](../../images/02_training-dataset.png)

### <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
import pandas as pd
import numpy as np

from functions import *

from sklearn.model_selection import train_test_split
import xgboost as xgb

from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

In [None]:
citibike_usage_fg = fs.get_or_create_feature_group(
    name="citibike_usage",
    version=1
)

In [None]:
citibike_stations_info_fg = fs.get_or_create_feature_group(
    name="citibike_stations_info",
    version=1
)

In [None]:
us_holidays_fg = fs.get_or_create_feature_group(
    name="us_holidays",
    version=1
)

In [None]:
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name="meteorological_measurements",
    version=1
)

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieval </span>

Let's start by selecting all the features you want to include for model training/inference.

In [None]:
# Select features for training data.
query = meteorological_measurements_fg.select_except(["timestamp"])\
                                      .join(
                                            us_holidays_fg.select_except(["timestamp"]),
                                            on="date", join_type="left"
                                      )\
                                      .join(
                                          citibike_usage_fg.select_except(["timestamp"]),
                                          on="date", join_type="left"
                                      )

In [None]:
# # uncomment and run cell below if you want to see some rows from this query
# # but you will have to wait some time

# query.read()
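
To make the join semantics concrete, here is a minimal pandas sketch of two chained left joins on `date`. The toy frames and their values are invented for illustration and are not the real feature groups; the Hopsworks query itself is executed by the feature store, not by pandas.

```python
import pandas as pd

# Toy stand-ins for the three feature groups (all values are made up).
weather = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "temp": [1.5, 3.0, -2.0],
})
holidays = pd.DataFrame({
    "date": ["2022-01-01"],
    "holiday": [1],
})
usage = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02"],
    "station_id": ["A", "A"],
    "users_count": [10, 25],
})

# A left join keeps every row of the left-hand frame; dates that are
# missing on the right-hand side yield NaN instead of dropping the row.
joined = (
    weather
    .merge(holidays, on="date", how="left")
    .merge(usage, on="date", how="left")
)
print(joined)
```

Because the joins are left joins, rows with no matching holiday or usage record survive with missing values, which is worth keeping in mind before training.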

A **Feature View** sits between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines a schema in the form of a query with filters, and can also define the model's target feature (label) and additional transformation functions.

To create a Feature View we can use the `FeatureStore.get_or_create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable(s).

- `transformation_functions` - functions to transform our features.

- `query` - query object that selects the data.
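
As a rough illustration of what `labels` means: the label column is treated as the target and kept apart from the features when training data is read back. The pandas sketch below mimics that separation on an invented frame; the column values are assumptions, and this is not how Hopsworks implements it internally.

```python
import pandas as pd

# Hypothetical joined data; only the column name "users_count" comes
# from the notebook, the values are made up.
df = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02"],
    "temp": [1.5, 3.0],
    "users_count": [10, 25],
})

labels = ["users_count"]

# Features are everything except the label columns; the target keeps
# only the label columns.
X = df.drop(columns=labels)
y = df[labels]

print(X.columns.tolist())
print(y.columns.tolist())
```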

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='citibike_fv',
    query=query,
    labels=["users_count"],
    version=1    
)

In [None]:
feature_view

This `Feature View` is now saved in Hopsworks, and we can retrieve it with the `FeatureStore.get_feature_view()` method.

In [None]:
feature_view = fs.get_feature_view(
    name='citibike_fv',
    version=1    
)

In [None]:
feature_view

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset we use the `FeatureView.create_training_data()` method.

Here are some important things to know:

- It will inherit the name of the Feature View.

- The feature store currently supports the following data formats for
training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format with the **data_format** parameter.

- We can pass **start_time** and **end_time** to restrict the dataset to a specific time range.

- We can create **train, test** splits using `create_train_test_split()`.

- We can create **train, validation, test** splits using `create_train_validation_test_split()`.

- For random splits we specify the desired split ratio; for time-based splits, as used in this notebook, we pass explicit start and end times for each split.
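
A time-based split can be pictured with a small pandas sketch. The frame below is invented, the `time_range` helper is hypothetical, and treating both boundaries as inclusive is an assumption of this sketch rather than a statement about how Hopsworks handles them.

```python
import pandas as pd

# Made-up daily observations.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-15", "2022-03-10", "2022-05-01",
        "2022-05-02", "2022-05-20", "2022-06-05",
    ]),
    "users_count": [12, 30, 18, 22, 27, 40],
})

def time_range(frame, start, end):
    """Return rows whose date lies in [start, end] (inclusive here)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    mask = (frame["date"] >= start) & (frame["date"] <= end)
    return frame.loc[mask]

train = time_range(df, "2022-01-01", "2022-05-01")
test = time_range(df, "2022-05-02", "2022-05-31")

print(len(train), len(test))
```

Splitting by time rather than at random keeps the test period strictly after the training period, which avoids leaking future information into the model.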

In [None]:
version, job = feature_view.create_train_test_split(
    train_start="2022-01-01",
    train_end="2022-05-01",
    test_start="2022-05-02",
    test_end="2022-05-31",
    write_options = {'wait_for_job': False},
)

---

## <span style="color:#ff5f27;">🪝 Training Dataset Retrieval</span>

In [None]:
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
    training_dataset_version=version
)

In [None]:
X_train.iloc[:, 1:-1] = X_train.iloc[:, 1:-1].astype(float)
X_test.iloc[:, 1:-1] = X_test.iloc[:, 1:-1].astype(float)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_train = X_train.set_index(["date", "station_id"])
X_test = X_test.set_index(["date", "station_id"])

In [None]:
X_train.head(3)

---
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
regressor = xgb.XGBRegressor()
 
# Fitting the model
regressor.fit(X_train, y_train)
 
# Predict on the test set
y_pred = regressor.predict(X_test)
 
# r2_score expects (y_true, y_pred), in that order
r2_xgb = r2_score(y_test.values, y_pred)
print("R2 score for XGBoost model:", r2_xgb)
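
The R² score is not symmetric in its arguments: scikit-learn's `r2_score` takes `(y_true, y_pred)`, and swapping them changes the result. The NumPy sketch below computes R² from its definition, 1 - SS_res / SS_tot, on made-up numbers to show the difference; the `r2` helper is written for illustration only.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variance of the true values
    return 1.0 - ss_res / ss_tot

# Invented values, not taken from the Citibike data.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

correct = r2(y_true, y_pred)
swapped = r2(y_pred, y_true)
print(correct, swapped)
```

The denominator uses the spread of whatever is passed first, so the score only has its usual meaning when the true values come first.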

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


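df_ = pd.DataFrame({
    "y_true": np.ravel(y_test),  # flatten the label frame to 1-D
    "y_pred": y_pred
})

# Residuals of y_pred regressed on y_true, plotted against y_true
residplot = sns.residplot(data=df_, x="y_true", y="y_pred", color='#613F75')
plt.title('Model Residuals')
plt.xlabel('True users count')
plt.ylabel('Residual')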

plt.show()

In [None]:
import os

# Create the output directory if it does not exist yet
os.makedirs("assets", exist_ok=True)


fig = residplot.get_figure()
fig.savefig("assets/residplot.png") 
fig.show()

---
### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()
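
Conceptually, a schema inferred from training examples records each column's name and type. The sketch below mimics that idea in plain pandas; the `infer_columns` helper is hypothetical and its output is not the hsml schema format.

```python
import pandas as pd

def infer_columns(frame):
    """List each column's name and dtype as plain strings (illustrative only)."""
    return [
        {"name": str(name), "type": str(dtype)}
        for name, dtype in frame.dtypes.items()
    ]

# Invented example features with explicit dtypes.
X_example = pd.DataFrame({
    "temp": pd.Series([1.5, 3.0], dtype="float64"),
    "holiday": pd.Series([0, 1], dtype="int64"),
})

columns = infer_columns(X_example)
print(columns)
```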

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
import joblib
import os
import shutil


# The 'citibike_xgb_model' directory will be saved to the model registry
model_dir = "citibike_xgb_model"
os.makedirs(model_dir, exist_ok=True)

joblib.dump(regressor, model_dir + '/citibike_xgb_model.pkl')

shutil.copyfile("assets/residplot.png", model_dir + "/residplot.png")
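
Before uploading, it can be worth checking that a serialized artifact loads back correctly. The sketch below shows a `joblib` round trip on a stand-in object inside a temporary directory; `model_stub` and the file name are placeholders, not the trained regressor or its real path.

```python
import os
import tempfile

import joblib

# Placeholder object standing in for a trained model.
model_stub = {"weights": [0.1, 0.2, 0.3], "bias": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pkl")
    joblib.dump(model_stub, path)   # serialize to disk
    restored = joblib.load(path)    # read it back

print(restored == model_stub)
```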

In [None]:
mr = project.get_model_registry()

citibike_model = mr.python.create_model(
    name="citibike_xgb_model", 
    metrics={"r2_score": r2_xgb},
    model_schema=model_schema,
    input_example=X_train.sample(), 
    description="Citibike users per station Predictor")

citibike_model.save(model_dir)

---

### <span style="color:#ff5f27;">🥳 <b> Next Steps  </b> </span>
Congratulations, you've now completed the Citibike tutorial for Managed Hopsworks.

Check out our other tutorials on ➡ https://github.com/logicalclocks/hopsworks-tutorials

Or documentation at ➡ https://docs.hopsworks.ai