# <span style="font-width:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-width:bold; font-size: 3rem; color:#333;">- Part 03: Training Dataset and Modeling</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/nyc_taxi_fares/3_training_dataset_and_modeling.ipynb)


## 🗒️ This notebook is divided into 5 main sections:
1. Feature selection.
2. Feature transformations.
3. Training datasets creation.
4. Train the model.
5. Register model to Hopsworks model registry.

![02_training-dataset](../../images/02_training-dataset.png)

In [None]:
!pip install -U hopsworks --quiet

!pip install xgboost

In [None]:
import os
import joblib

import pandas as pd

from sklearn.metrics import mean_absolute_error, r2_score
import xgboost as xgb

import matplotlib.pyplot as plt
import seaborn as sns

import warnings

# Mute warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connecting to the Hopsworks Feature Store </span>

In [None]:
import hopsworks


project = hopsworks.login()
fs = project.get_feature_store()

In [None]:
# Retrieve feature groups
rides_fg = fs.get_or_create_feature_group(
    name="nyc_taxi_rides",
    version=1
)

fares_fg = fs.get_or_create_feature_group(
    name="nyc_taxi_fares",
    version=1
)

In [None]:
rides_fg.read().head()

---

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieval </span>

First, build a query object from the features you want to use.

In [None]:
# Select features for training data.
query = fares_fg.select(['total_fare', "tolls"])\
                            .join(rides_fg.select_except(['taxi_id', "driver_id", "pickup_datetime",
                                                          "pickup_longitude", "pickup_latitude",
                                                          "dropoff_longitude", "dropoff_latitude"]),
                                  on=['ride_id'])

# query.show(2)
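
To make the selection and join above more concrete, here is a minimal pandas sketch of the same idea on toy data. It is not how Hopsworks executes the query. The `distance` and `tip` columns, all values, and the choice of an inner join are assumptions made for illustration; with fully matching keys, as here, the join type does not change the result.

```python
import pandas as pd

# Toy stand-ins for the two feature groups. Column names loosely follow the
# tutorial; `tip`, `distance`, and every value are made up for illustration.
fares = pd.DataFrame({
    "ride_id": [1, 2, 3],
    "total_fare": [12.5, 30.0, 8.75],
    "tolls": [0.0, 5.5, 0.0],
    "tip": [2.0, 4.0, 1.0],
})
rides = pd.DataFrame({
    "ride_id": [1, 2, 3],
    "taxi_id": [10, 11, 12],
    "driver_id": [100, 101, 102],
    "distance": [2.1, 9.4, 1.3],
})

# Analogue of select([...]): keep only the listed columns.
# ride_id is kept as well so the toy join has a key to match on.
fares_selected = fares[["ride_id", "total_fare", "tolls"]]

# Analogue of select_except([...]): keep everything but the listed columns.
rides_selected = rides.drop(columns=["taxi_id", "driver_id"])

# Analogue of join(..., on=['ride_id']): match rows sharing the same ride_id.
joined = fares_selected.merge(rides_selected, on="ride_id", how="inner")
print(joined)
```

The result has one row per ride, with the fare columns followed by the remaining ride columns.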

A **Feature View** sits between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** you create a **Feature View**, which stores metadata about your data. From a **Feature View** you can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters), names the model's target feature/label, and can attach transformation functions.

To create a Feature View, use the `FeatureStore.create_feature_view()` method, or `FeatureStore.get_or_create_feature_view()` to reuse an existing one.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - the target variable(s).

- `transformation_functions` - functions to transform the features.

- `query` - query object that selects the data.

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='nyc_taxi_fares_fv',
    version=1,
    query=query,
    labels=["total_fare"]
)

---

## <span style="color:#ff5f27;">🏋️ Training Dataset Creation</span>
    
In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

A Training Dataset may contain splits such as:

- Training set - the subset of training data used to train a model.
- Validation set - the subset of training data used to evaluate hyperparameters when training a model.
- Test set - the holdout subset of training data used to evaluate a model.

This notebook uses a train/test split, created with the `FeatureView.create_train_test_split()` method. To also get a validation set, use `FeatureView.create_train_validation_test_split()`.
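
As a rough illustration of what such splits look like, the sketch below partitions a toy DataFrame into train, validation, and test sets with plain pandas, then separates the label from the features. This is only a sketch of the concept, not the Hopsworks implementation. The `distance` column, the split sizes, and the seed are assumptions chosen for the example.

```python
import pandas as pd

# Toy data; `distance` is a hypothetical feature, `total_fare` is the label.
data = pd.DataFrame({
    "distance": [float(i) for i in range(10)],
    "total_fare": [2.5 * i + 3.0 for i in range(10)],
})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

test_size, validation_size = 0.2, 0.2
n_test = int(len(shuffled) * test_size)
n_val = int(len(shuffled) * validation_size)

# Slice the shuffled rows into three disjoint subsets.
test = shuffled.iloc[:n_test]
validation = shuffled.iloc[n_test:n_test + n_val]
train = shuffled.iloc[n_test + n_val:]

# Separate the label column from the features, which is the role
# the `labels` argument plays for a feature view.
toy_X_train = train.drop(columns=["total_fare"])
toy_y_train = train[["total_fare"]]

print(len(train), len(validation), len(test))
```

Every row lands in exactly one subset, so the test set stays unseen during training and hyperparameter tuning.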

In [None]:
td_version, td_job = feature_view.create_train_test_split(
    description = 'NYC taxi fares dataset',
    data_format = 'csv',
    test_size = 0.2,
    write_options = {'wait_for_job': True},
    coalesce = True,
)

In [None]:
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
    training_dataset_version=td_version
)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
cols_to_drop = ['ride_id']

X_train = X_train.drop(cols_to_drop, axis=1)
X_test = X_test.drop(cols_to_drop, axis=1)

---
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
regressor = xgb.XGBRegressor()

regressor.fit(X_train.values, y_train.values)

In [None]:
y_pred = regressor.predict(X_test)

xgb_mae = mean_absolute_error(y_test, y_pred)

print("XGBRegressor MAE:", xgb_mae)

metrics = {
    'mae': xgb_mae
}


### Remember: the data is random, so the results are not meaningful.
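
The notebook imports both `mean_absolute_error` and `r2_score` but reports only the MAE. As a small aid for reading these metrics, the sketch below computes both by hand with NumPy on made-up numbers. The arrays and the `toy_` names are assumptions for illustration and are unrelated to the tutorial's data.

```python
import numpy as np

# Made-up observations and predictions, purely for illustration.
toy_true = np.array([10.0, 20.0, 30.0, 40.0])
toy_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Mean absolute error: the average size of the prediction errors.
toy_mae = np.mean(np.abs(toy_true - toy_pred))

# R^2: one minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of the observations.
ss_res = np.sum((toy_true - toy_pred) ** 2)
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)
toy_r2 = 1.0 - ss_res / ss_tot

print("MAE:", toy_mae)
print("R2:", toy_r2)
```

MAE is expressed in the same units as the fare, while R² is unitless and compares the model against always predicting the mean.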

In [None]:
df_ = pd.DataFrame({
    "y_true": y_test.total_fare.tolist(),
    "y_pred": y_pred
})

residplot = sns.residplot(data=df_, x="y_true", y="y_pred", color='#613F75')
plt.title('Model Residuals')
plt.xlabel('True fare')
plt.ylabel('Error')

plt.show()

In [None]:
fig = residplot.get_figure()
fig.show()
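
One detail worth knowing when reading this plot: `sns.residplot` regresses `y` on `x` and draws the residuals of that fit, which is not the same as the model error `y_true - y_pred`. The sketch below contrasts the two quantities on made-up numbers; the arrays and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up observations and predictions, purely for illustration.
toy_true = np.array([10.0, 20.0, 30.0, 40.0])
toy_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Direct model residuals: observed minus predicted.
residuals = toy_true - toy_pred

# Residuals of a least-squares line fitted through (y_true, y_pred),
# which is the kind of quantity a residual plot of y_pred on y_true shows.
slope, intercept = np.polyfit(toy_true, toy_pred, deg=1)
fit_residuals = toy_pred - (slope * toy_true + intercept)

print("model residuals:", residuals)
print("fit residuals:  ", fit_residuals)
```

If the goal is to inspect the model's own errors, plotting `residuals` against the true values directly may be the simpler choice.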

---
### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()
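
To give a feel for what "generated from training examples" can mean, here is a minimal sketch that lists column names and dtypes from toy DataFrames. It is not the `hsml` implementation, and the dictionary layout is an assumption that may differ from what `model_schema.to_dict()` returns. `describe_columns` is a hypothetical helper, and the `distance` and `passengers` columns are made up.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame) -> list:
    """Hypothetical helper: list each column's name and dtype as strings."""
    return [
        {"name": str(name), "type": str(dtype)}
        for name, dtype in df.dtypes.items()
    ]


# Toy features and labels; dtypes are set explicitly to keep them predictable.
toy_features = pd.DataFrame({
    "distance": pd.Series([2.1, 9.4], dtype="float64"),
    "passengers": pd.Series([1, 3], dtype="int64"),
})
toy_labels = pd.DataFrame({
    "total_fare": pd.Series([12.5, 30.0], dtype="float64"),
})

toy_schema = {
    "input_schema": describe_columns(toy_features),
    "output_schema": describe_columns(toy_labels),
}
print(toy_schema)
```

A description like this lets a consumer of the model check that incoming data has the expected columns and types before calling `predict`.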

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# The 'nyc_taxi_fares_model' directory will be saved to the model registry
model_dir = "nyc_taxi_fares_model"
os.makedirs(model_dir, exist_ok=True)

# Serialize the trained regressor into the model directory
joblib.dump(regressor, os.path.join(model_dir, 'nyc_taxi_fares_model.pkl'))

# Save the residual plot alongside the model
fig.savefig(os.path.join(model_dir, "residplot.png"))


With the schema in place, you can finally register your model.

In [None]:
mr = project.get_model_registry()

nyc_model = mr.python.create_model(
    name="nyc_taxi_fares_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=X_train.sample().values, 
    description="NYC taxi fares predictor.")

nyc_model.save(model_dir)

---

## <span style="color:#ff5f27;"> 📮 Retrieving model from Model Registry </span>

In [None]:
retrieved_model = mr.get_model(
    name="nyc_taxi_fares_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
retrieved_xgboost_model = joblib.load(saved_model_dir + "/nyc_taxi_fares_model.pkl")
retrieved_xgboost_model

## <span style="color:#ff5f27;"> 🤖 Making the predictions </span>

In [None]:
predictions = retrieved_xgboost_model.predict(X_test)
predictions[:10]

It's important to know that every time you save a model with the same name, a new version of the model will be saved, so nothing will be overwritten. In this way, you can compare several versions of the same model - or create a model with a new name, if you prefer that.
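
The versioning behaviour described above can be pictured with a toy in-memory registry. This is only a sketch of the idea, not how the Hopsworks model registry is implemented. `ToyRegistry`, its methods, the artifact strings, and the metric values are all hypothetical.

```python
from collections import defaultdict


class ToyRegistry:
    """Hypothetical in-memory registry; not the Hopsworks implementation."""

    def __init__(self):
        # Maps model name -> {version: (artifact, metrics)}.
        self._models = defaultdict(dict)

    def save(self, name, artifact, metrics):
        # Saving under an existing name adds a new version instead of overwriting.
        version = max(self._models[name], default=0) + 1
        self._models[name][version] = (artifact, metrics)
        return version

    def best(self, name, metric):
        # Lowest value wins, which suits an error metric such as MAE.
        versions = self._models[name]
        return min(versions, key=lambda v: versions[v][1][metric])


registry = ToyRegistry()
v1 = registry.save("nyc_taxi_fares_model", "model-a", {"mae": 3.2})
v2 = registry.save("nyc_taxi_fares_model", "model-b", {"mae": 2.7})
best_version = registry.best("nyc_taxi_fares_model", "mae")
print(v1, v2, best_version)
```

Because every save keeps its own metrics, comparing versions reduces to a lookup over the stored values.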

---

### <span style="color:#ff5f27;">🥳 <b> Next Steps  </b> </span>
Congratulations! You've now completed the NYC Taxi Fares tutorial for Managed Hopsworks.

Check out our other tutorials on ➡ https://github.com/logicalclocks/hopsworks-tutorials

Or documentation at ➡ https://docs.hopsworks.ai