# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Modeling</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/fraud_batch/2_training_dataset_and_modeling.ipynb)

<span style="font-width:bold; font-size: 1.4rem;">This notebook explains how to read from a feature group, create a training dataset within the feature store, train a model, and save it to the model registry.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training and test splits.
5. Train the model.
6. Register model in Hopsworks Model Registry.
7. Load batch data.
8. Predict using model from Model Registry.

![part2](../images/02_training-dataset.png) 

In [None]:
!pip install -U xgboost --quiet

In [None]:
import joblib
import os

import pandas as pd
import numpy as np
from matplotlib import pyplot
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

import warnings

# Mute warnings
warnings.filterwarnings("ignore")

In [None]:
!pip install -U hopsworks --quiet

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

### <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [None]:
# Load feature groups.
trans_fg = fs.get_feature_group('transactions_fraud_batch_fg', version=1)
window_aggs_fg = fs.get_feature_group('transactions_4h_aggs_fraud_batch_fg', version=1)

In [None]:
# Select features for training data.
query = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["cc_num"]))

In [None]:
# Uncomment the next line to preview the query results.
# query.show(5)

Recall that you computed the features in `transactions_4h_aggs_fraud_batch_fg` using 4-hour aggregates. If you had created multiple feature groups with identical schemas for different window lengths and wanted to include them in the join, you would need to pass a prefix argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
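
To see why a prefix helps, here is a minimal pandas sketch. It is an analogy rather than the Hopsworks API: the two tables `aggs_4h` and `aggs_12h`, the column `trans_freq`, the prefix `w12h_`, and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical aggregates with identical schemas for two window lengths.
aggs_4h = pd.DataFrame({"cc_num": [1, 2], "trans_freq": [3.0, 1.0]})
aggs_12h = pd.DataFrame({"cc_num": [1, 2], "trans_freq": [7.0, 4.0]})

# Prefix every non-key column of the second table before joining,
# so that the two "trans_freq" columns stay distinguishable.
key = "cc_num"
aggs_12h_prefixed = aggs_12h.rename(
    columns={c: f"w12h_{c}" for c in aggs_12h.columns if c != key}
)

joined = aggs_4h.merge(aggs_12h_prefixed, on=key, how="inner")
print(list(joined.columns))
```

Without the rename step, both tables would contribute a column called `trans_freq`, and the result would no longer say which window each value came from.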

---

### <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>


You will preprocess the data using *label encoding* on the categorical feature `category`; numerical features can be handled in the same way, for example with *min-max scaling*. To do this, you define a mapping between features and transformation functions. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which prevents data leakage.

In [None]:
# Load transformation functions.
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder
}
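
The idea of fitting a transformation on the training data only can be sketched in plain Python. This is a simplified illustration, not the implementation of the built-in Hopsworks transformation functions; the helper names and the toy values are hypothetical.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the given (training) values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Scale values using statistics that were learned elsewhere."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


# Hypothetical toy data, not taken from the feature store.
train_amount, test_amount = [10.0, 20.0, 30.0], [25.0, 40.0]
train_category = ["Grocery", "Clothing", "Grocery"]

lo, hi = fit_min_max(train_amount)                # fitted on the training split only
scaled_train = min_max_scale(train_amount, lo, hi)
scaled_test = min_max_scale(test_amount, lo, hi)  # reuses the training statistics

encoder = fit_label_encoder(train_category)
encoded_train = [encoder[c] for c in train_category]
```

Because the test values are scaled with the training minimum and maximum, a test value can fall outside the range [0, 1]. That is expected, and it is the price of not letting test data influence the preprocessing.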

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View lets you define a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.
You can create a Feature View with `fs.create_feature_view()`. Here you use `fs.get_or_create_feature_view()`, which returns the feature view if it already exists and creates it otherwise.

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='transactions_view_fraud_batch_fv',
    version=1,
    query=query,
    labels=["fraud_label"],
    transformation_functions=transformation_functions
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
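
As a rough illustration of the three splits listed above, the following sketch shuffles row indices and slices them into training, validation, and test sets. It uses only the standard library. The function `three_way_split` and its parameters are hypothetical and do not reflect how Hopsworks splits data internally.

```python
import random


def three_way_split(n_rows, validation_size, test_size, seed=0):
    """Return shuffled row indices for the train, validation and test splits."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    n_test = round(n_rows * test_size)
    n_val = round(n_rows * validation_size)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(
    100, validation_size=0.1, test_size=0.2
)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each row index ends up in exactly one split, which keeps the validation and test rows unseen during training.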

Here the training dataset is created with the `feature_view.create_train_test_split()` method, which produces a training set and a test set.

In [None]:
TEST_SIZE = 0.2

td_version, td_job = feature_view.create_train_test_split(
    description = 'transactions fraud batch training dataset',
    data_format = 'csv',
    test_size = TEST_SIZE,
    write_options = {'wait_for_job': True}
)


The feature view and training dataset are now visible in the UI.

![fg-overview](../images/fv_overview.gif)

In [None]:
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(td_version)

In [None]:
# Sort the training rows by time and realign the labels with the new row order.
X_train = X_train.sort_values("datetime")
y_train = y_train.reindex(X_train.index)

In [None]:
X_test = X_test.sort_values("datetime")
y_test = y_test.reindex(X_test.index)

In [None]:
# Drop the timestamp column, which is not used as a model input.
X_train.drop(["datetime"], axis=1, inplace=True)
X_test.drop(["datetime"], axis=1, inplace=True)

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. You should therefore address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. The simplest of these is to supply a class weight parameter to the learning algorithm. The class weight affects how much importance is attached to each class, which here means placing higher importance on positive (fraudulent) samples.
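
A common starting point for such a weight is the ratio of negative to positive samples. The sketch below computes it with the standard library. The helper `positive_class_weight` and the toy labels are hypothetical; in XGBoost, a ratio like this is typically passed as the `scale_pos_weight` parameter of `XGBClassifier`.

```python
from collections import Counter


def positive_class_weight(labels):
    """Ratio of negative (0) to positive (1) samples."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("labels contain no positive samples")
    return counts[0] / counts[1]


# Hypothetical labels: 1 = fraud, 0 = normal.
toy_labels = [0] * 990 + [1] * 10
weight = positive_class_weight(toy_labels)
print(weight)  # 99.0
```

Treat the ratio as a first guess rather than a final value; it is usually worth checking the resulting precision and recall on held-out data.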

---

## <span style="color:#ff5f27;"> 🧬 Modeling</span>

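Next you will train a model. The cell below fits an `XGBClassifier` with its default settings, so no class weight is applied yet; XGBoost's `scale_pos_weight` parameter is where a larger weight for the positive class would be set.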

In [None]:
clf = xgb.XGBClassifier()

clf.fit(X_train.values, y_train)

In [None]:
# Train Predictions
y_pred_train = clf.predict(X_train.values)

# Test Predictions
y_pred_test = clf.predict(X_test.values)

In [None]:
# Compute f1 score
metrics = {"f1_score": f1_score(y_test, y_pred_test, average='macro')}
metrics

In [None]:
results = confusion_matrix(y_test, y_pred_test)
print(results)
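
To make the link between the confusion matrix and the macro-averaged F1 score concrete, here is a pure-Python sketch. It assumes the 2x2 layout `[[tn, fp], [fn, tp]]` (rows are true labels, columns are predictions). The helper names and the counts in `toy_cm` are made up and are not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score of one class from its TP, FP and FN counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_binary(cm):
    """Macro F1 for a 2x2 confusion matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    f1_pos = f1_from_counts(tp, fp, fn)
    # For the negative class the roles swap: its "true positives" are tn,
    # its false positives are fn, and its false negatives are fp.
    f1_neg = f1_from_counts(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


toy_cm = [[90, 2], [3, 5]]  # made-up counts
toy_macro_f1 = macro_f1_binary(toy_cm)
print(round(toy_macro_f1, 4))
```

Macro averaging gives both classes equal weight, so a weak F1 on the rare fraud class pulls the overall score down even when almost all normal transactions are classified correctly.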

In [None]:
df_cm = pd.DataFrame(
    results,
    index=['True Normal', 'True Fraud'],
    columns=['Pred Normal', 'Pred Fraud'],
)

cm = sns.heatmap(df_cm, annot=True)

fig = cm.get_figure()
fig.show()

---

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train.values)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# The 'fraud_batch_model' directory will be saved to the model registry
model_dir = "fraud_batch_model"
os.makedirs(model_dir, exist_ok=True)

joblib.dump(clf, model_dir + '/xgboost_fraud_batch_model.pkl')

fig.savefig(model_dir + "/confusion_matrix.png") 

In [None]:
mr = project.get_model_registry()

fraud_model = mr.python.create_model(
    name="xgboost_fraud_batch_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=X_train.sample(), 
    description="Fraud Batch Predictor")

fraud_model.save(model_dir)

---

## <span style='color:#ff5f27'>🚀 Fetch and test the model</span>

Finally, you can start making predictions with your model. First, retrieve it from the Hopsworks model registry.

In [None]:
retrieved_model = mr.get_model(
    name="xgboost_fraud_batch_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
retrieved_xgboost_model = joblib.load(saved_model_dir + "/xgboost_fraud_batch_model.pkl")
retrieved_xgboost_model

---
## <span style="color:#ff5f27;">🔮  Batch Prediction </span>


In [None]:
feature_view.init_batch_scoring(td_version)

batch_data = feature_view.get_batch_data()

batch_data.drop(["datetime"], axis=1, inplace=True)

batch_data.head()

In [None]:
predictions = retrieved_xgboost_model.predict(batch_data)

predictions[:5]

---