# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 03: Training Dataset and Modeling</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/air_quality/3_training_dataset_and_modeling.ipynb)

<span style="font-width:bold; font-size: 1.4rem;">This notebook explains how to read from a feature group, create a feature view and a training dataset within the feature store, train a model, and register it in the model registry.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training, validation and test splits.
5. Loading the training data.
6. Train the model.
7. Register model in Hopsworks model registry.

![part2](../../images/02_training-dataset.png) 

## <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
import os
import joblib

from sklearn.metrics import mean_absolute_error, r2_score
import xgboost as xgb

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks


project = hopsworks.login()
fs = project.get_feature_store() 

In [None]:
air_quality_fg = fs.get_or_create_feature_group(
    name = 'air_quality_fg',
    version = 1
)

weather_fg = fs.get_or_create_feature_group(
    name = 'weather_fg',
    version = 1
)

--- 

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieving </span>

In [None]:
query = air_quality_fg.select_all().join(weather_fg.select_all())

query_5_records = query.show(5)
col_names = query_5_records.columns

query_5_records

### <span style="color:#ff5f27;">🧑🏻‍🔬 Transformation functions</span>

Hopsworks Feature Store provides functionality to attach transformation functions to training datasets.

Hopsworks Feature Store also comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.
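
To make two of these names concrete, here is a minimal plain-Python sketch of what min-max scaling and label encoding compute. It is not the Hopsworks implementation: the helper names `min_max_scale` and `label_encode` are hypothetical, and the sorted category order is an assumption borrowed from scikit-learn's `LabelEncoder`.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no range; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(values):
    """Map each distinct category to an integer code (sorted order is an assumption)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Toy inputs, not taken from the air quality data.
scaled = min_max_scale([10.0, 20.0, 30.0])
codes, mapping = label_encode(["Paris", "Berlin", "Paris", "London"])

print(scaled)
print(codes, mapping)
```

The practical point is that the mapping learned on the training data has to be reused unchanged at inference time, which is why the functions are attached to the feature view rather than applied ad hoc.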

In [None]:
[t_func.name for t_func in fs.get_transformation_functions()]

You can retrieve the transformation function you need by name.

To attach transformation functions to a training dataset, provide them as a dict in which each key is a feature name and each value is the transformation function to apply to that feature.

The training dataset must also be created from a Query object. Once attached, the transformation functions are applied whenever the `save`, `insert` and `get_serving_vector` methods are called on the training dataset object.

In [None]:
category_cols = ['city','conditions']

le = fs.get_transformation_function(name='label_encoder') 

transformation_functions = {
    col: le
    for col 
    in category_cols
}

**Feature Views** sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create **Feature Views**, which store metadata about our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines its schema as a query (optionally with filters), the model's target feature/label, and any additional transformation functions.

To create a Feature View, use the `FeatureStore.get_or_create_feature_view()` method.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object with data.

In [None]:
feature_view = fs.get_or_create_feature_view(name="air_quality_fv",
                                             version=1,
                                             transformation_functions=transformation_functions,
                                             query=query)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset, we use the `FeatureView.create_training_data()` method.

Here are some important things to know:

- It will inherit the name of the FeatureView.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format with the **data_format** parameter.

- We can pass **start_time** and **end_time** to restrict the dataset to a specific time range.

- We can create **train, test** splits using `create_train_test_split()`.

- We can create **train, validation, test** splits using `create_train_validation_test_split()`.

- For the split methods, we only need to specify the desired ratio of each split.
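
To illustrate what a split ratio means, the sketch below partitions a list of rows positionally. It only shows the arithmetic behind the ratios; Hopsworks performs its own splitting inside the feature store, and the helper `split_by_ratio` and its parameter names are hypothetical.

```python
def split_by_ratio(rows, validation_size=0.1, test_size=0.2):
    """Split rows into train/validation/test parts by position.

    The sizes are fractions of the total row count; whatever is left
    after the validation and test parts becomes the training part.
    """
    n = len(rows)
    n_test = int(n * test_size)
    n_validation = int(n * validation_size)
    n_train = n - n_validation - n_test
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_validation]
    test = rows[n_train + n_validation:]
    return train, validation, test


# Ten toy rows stand in for a training dataset.
rows = list(range(10))
train, validation, test = split_by_ratio(rows, validation_size=0.1, test_size=0.2)

print(len(train), len(validation), len(test))
```

With time series such as daily air quality readings, ordering the rows by date before a positional split keeps the most recent observations in the test part.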

In [None]:
td_version, td_job = feature_view.create_training_data(
    description='Air Quality Project dataset',
    data_format='csv',
    write_options={'wait_for_job': True},
    coalesce=True,
)

---
## <span style="color:#ff5f27;">🪝 Training Dataset Retrieval</span>

In [None]:
data = feature_view.get_training_data(
    training_dataset_version=td_version
)

In [None]:
X, _ = data

In [None]:
X

---
## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
split_line = int(X.shape[0] * 0.75)

X_train = X.iloc[:split_line]
X_test = X.iloc[split_line:]

In [None]:
X_train = X_train.sort_values(by=["date", 'city'], ascending=[False, True]).reset_index(drop=True)

In [None]:
X_train.head()

In [None]:
X_train = X_train.sort_values(by=["date", 'city'], ascending=[False, True]).reset_index(drop=True)
X_train["aqi_next_day"] = X_train.groupby('city')['aqi'].shift(1)

X_test = X_test.sort_values(by=["date", 'city'], ascending=[False, True]).reset_index(drop=True)
X_test["aqi_next_day"] = X_test.groupby('city')['aqi'].shift(1)
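
The target construction above can be checked on a toy frame. This is a minimal sketch with made-up values: it applies the same sort and per-city `shift(1)` and shows that, with dates in descending order, each row receives the `aqi` of the following day, while the most recent row per city has no target. Dropping those rows, as done at the end, is an alternative to filling them with 0 and is shown here only as an assumption about what may be preferable.

```python
import pandas as pd

# Toy data: two cities, three consecutive days (values are invented).
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03",
        "2023-01-01", "2023-01-02", "2023-01-03",
    ]),
    "city": ["A", "A", "A", "B", "B", "B"],
    "aqi": [10, 20, 30, 40, 50, 60],
})

# Newest dates first, then city, exactly as in the cells above.
toy = toy.sort_values(by=["date", "city"], ascending=[False, True]).reset_index(drop=True)

# Within each city the previous row is the following day, so shift(1) yields next-day aqi.
toy["aqi_next_day"] = toy.groupby("city")["aqi"].shift(1)

# Rows for the latest date have no known next-day value.
labelled = toy.dropna(subset=["aqi_next_day"])

print(toy)
```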

In [None]:
X_train = X_train.drop(columns=["date"]).fillna(0)
y_train = X_train.pop("aqi_next_day")

X_test = X_test.drop(columns=["date"]).fillna(0)
y_test = X_test.pop("aqi_next_day")

In [None]:
regressor = xgb.XGBRegressor()

regressor.fit(X_train.values, y_train.values)

### <span style='color:#ff5f27'> 📐 Model Validation</span>

In [None]:
y_pred = regressor.predict(X_test)

xgb_mae = mean_absolute_error(y_test, y_pred)

print("XGBRegressor MAE:", xgb_mae)

metrics = {
    'mae': xgb_mae
}
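
For reference, the mean absolute error reported above, and the coefficient of determination that `r2_score` (imported at the top) would report, can be written out by hand. The sketch below uses invented numbers and hypothetical helper names; it mirrors the standard definitions rather than scikit-learn's internals.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented targets and predictions, not results from the air quality model.
y_true_toy = [3.0, 5.0, 2.0, 7.0]
y_pred_toy = [2.5, 5.0, 4.0, 8.0]

mae_toy = mean_absolute_error_manual(y_true_toy, y_pred_toy)
r2_toy = r2_score_manual(y_true_toy, y_pred_toy)

print("MAE:", mae_toy)
print("R2:", r2_toy)
```

MAE is expressed in the same units as the target, whereas R² compares the model against always predicting the mean, so the two are useful side by side.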

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))

x_ax = range(len(y_test))
ax.scatter(x_ax, y_test, s=5, color="blue", label="original")
ax.plot(x_ax, y_pred, lw=0.8, color="red", label="predicted")

ax.legend()
ax.set_title('Regression Quality', fontsize=18)

plt.show()

---
## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.
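
As a rough intuition for what "generated from training examples" means, the sketch below lists the column names and dtypes of a toy feature frame and target. It only approximates the kind of information a schema records; it does not reproduce the output of `hsml`, and the helper `describe_columns` and the column `temp` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the model inputs and the target (values are invented).
features = pd.DataFrame({
    "city": pd.Series([0, 1], dtype="int64"),
    "temp": pd.Series([3.5, 7.1], dtype="float64"),
})
target = pd.Series([42.0, 55.0], dtype="float64", name="aqi_next_day")


def describe_columns(frame):
    """Return one {name, type} entry per column, in column order."""
    return [{"name": name, "type": str(dtype)} for name, dtype in frame.dtypes.items()]


input_columns = describe_columns(features)
output_columns = describe_columns(target.to_frame())

print(input_columns)
print(output_columns)
```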

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

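# Use the frames the model was actually trained on: features without "date", and the target
input_schema = Schema(X_train)
output_schema = Schema(y_train)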
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

In [None]:
model_dir = "air_quality_model"
os.makedirs(model_dir, exist_ok=True)  # create the directory if it does not exist yet

model_path = os.path.join(model_dir, "air_quality_model.pkl")
fig.savefig(os.path.join(model_dir, "regression_quality.png"))  # save the plot as well
joblib.dump(regressor, model_path)

In [None]:
model = mr.python.create_model(
    name="air_quality_model",
    metrics=metrics,
    description="XGBoost Regressor.",
    input_example=X_test.sample(),
    model_schema=model_schema
)

model.save(model_path)

## <span style="color:#ff5f27;"> 📮 Retrieving model from Model Registry </span>

In [None]:
retrieved_model = mr.get_model(
    name="air_quality_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
retrieved_xgboost_model = joblib.load(saved_model_dir + "/air_quality_model.pkl")
retrieved_xgboost_model

## <span style="color:#ff5f27;"> 🤖 Making the predictions </span>

### <span style="color:#ff5f27;"> ✨ Load Batch Data</span>

In [None]:
feature_view.init_batch_scoring(training_dataset_version=td_version)
batch_data = feature_view.get_batch_data()
batch_data.head()

In [None]:
X_batch = batch_data.sort_values(by=["date", 'city'], ascending=[False, True]).reset_index(drop=True)
X_batch["aqi_next_day"] = X_batch.groupby('city')['aqi'].shift(1)
X_batch = X_batch.drop(columns=["date"]).fillna(0)
y_batch = X_batch.pop("aqi_next_day")


In [None]:
predictions = retrieved_xgboost_model.predict(X_batch)
predictions[:10]

---

### <span style="color:#ff5f27;">🥳 <b> Next Steps  </b> </span>
Congratulations, you've now completed the Air Quality tutorial for Managed Hopsworks.

Check out our other tutorials on ➡ https://github.com/logicalclocks/hopsworks-tutorials

Or documentation at ➡ https://docs.hopsworks.ai