# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Modeling</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/fraud_online/2_training_dataset_and_modeling.ipynb)

<span style="font-weight:bold; font-size: 1.4rem;">This notebook explains how to read from a feature group, create a training dataset within the feature store, train a model and save it to the model registry.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Define Transformation functions.
3. Create Feature Views.
4. Create Training Dataset with training, validation and test splits.
5. Train the model.
6. Register model in Hopsworks model registry.
7. Load batch data.
8. Predict using model from Model Registry.

![part2](../images/02_training-dataset.png) 

In [None]:
!pip install -U xgboost --quiet

In [None]:
import joblib
import os
import shutil

import pandas as pd
import numpy as np
from matplotlib import pyplot
import seaborn as sns

import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

import warnings

# Mute warnings
warnings.filterwarnings("ignore")

In [None]:
!pip install -U hopsworks --quiet

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

---

## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [None]:
# Load feature groups.
trans_fg = fs.get_feature_group('transactions_fraud_online_fg', version=1)
profile_online_fg = fs.get_feature_group('profile_fraud_online_fg', version=1)

query = trans_fg.select_all().join(profile_online_fg.select_all())

In [None]:
## uncomment this if you would like to view query results
#query.show(5)

Recall that you computed the features in `transactions_fraud_online_fg`. If you had created multiple feature groups with identical schemas for different window lengths and wanted to include them in the join, you would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
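
As a rough sketch of the prefix argument mentioned above, the helper below joins two feature groups that share feature names. The function name, the two feature group names and the `w12h_` prefix are hypothetical and do not exist in this tutorial; only `get_feature_group()`, `select_all()` and `join(..., prefix=...)` are taken from the Hopsworks API.

```python
def join_window_feature_groups(fs):
    """Join two feature groups with identical schemas, prefixing the right side.

    `fs` is a feature store handle such as the one returned by
    project.get_feature_store(). The feature group names are hypothetical.
    """
    fg_4h = fs.get_feature_group("transactions_4h_aggs_fg", version=1)    # hypothetical name
    fg_12h = fs.get_feature_group("transactions_12h_aggs_fg", version=1)  # hypothetical name

    # Features selected from the right-hand feature group get the prefix,
    # so identically named features no longer clash in the joined query.
    return fg_4h.select_all().join(fg_12h.select_all(), prefix="w12h_")
```

Calling `join_window_feature_groups(fs)` would return a query that can be passed to a feature view in the same way as `query` above.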

---

### <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>


You will preprocess the data by applying *label encoding* to the categorical features `country` and `gender`; numerical features could be handled in the same way with *min-max scaling*. To do this, you simply define a mapping between features and transformation functions. Because the mapping is attached to the feature view, transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which ensures that there is no data leakage.
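
To illustrate why fitting on the training split alone avoids leakage, here is a minimal plain-Python sketch. The helper functions and the toy values are assumptions made for illustration; they are not the Hopsworks built-in transformation functions, whose implementation may differ.

```python
def fit_min_max(values):
    """Learn the scaling statistics from the training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously learned statistics.

    Unseen data can fall outside [0, 1] because its range was never observed.
    """
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_label_encoder(categories):
    """Map each distinct training category to an integer code."""
    return {cat: code for code, cat in enumerate(sorted(set(categories)))}


# Toy data (hypothetical values).
train_amounts = [10.0, 20.0, 50.0]
test_amounts = [30.0, 70.0]

# Statistics come from the training split and are reused for the test split.
lo, hi = fit_min_max(train_amounts)
train_scaled = apply_min_max(train_amounts, lo, hi)  # [0.0, 0.25, 1.0]
test_scaled = apply_min_max(test_amounts, lo, hi)    # [0.5, 1.5]

country_codes = fit_label_encoder(["US", "SE", "US", "DE"])  # {'DE': 0, 'SE': 1, 'US': 2}
```

The test value `70.0` maps to `1.5` because the maximum was learned from the training data only; information from the test split never influences the fitted statistics.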

In [None]:
# Load the transformation functions.
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformation functions.
transformation_functions = {
    "country": label_encoder,
    "gender": label_encoder
}


## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View lets you define a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.
To create or get a Feature View, you can use `fs.get_or_create_feature_view()`.

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='transactions_fraud_online_fv',
    version=1,
    query=query,
    labels=["fraud_label"],
    transformation_functions=transformation_functions
)

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent FeatureView, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

A dataset with train/test splits can be created using the `feature_view.create_train_test_split()` method.
A dataset with train/validation/test splits can be created using the `feature_view.create_train_validation_test_split()` method.

**You can use event time filters like `train_start`, `train_end`, `validation_start`, `validation_end`... Values can be Unix timestamps, strings, `datetime.datetime` objects...**

**Or, use `validation_size` and `test_size` parameters.**
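
The idea behind event time filters can be sketched in plain Python as follows. This is an illustration only: the `time_split` helper and the toy rows are hypothetical, and the half-open `[start, end)` boundary handling is an assumption that may differ from how Hopsworks treats the boundaries.

```python
from datetime import datetime


def time_split(rows, train_start, train_end, test_start, test_end, fmt="%Y/%m/%d"):
    """Split rows into train/test sets by event time.

    Assumption: each interval is half-open, i.e. start <= event time < end.
    """
    t0, t1 = datetime.strptime(train_start, fmt), datetime.strptime(train_end, fmt)
    s0, s1 = datetime.strptime(test_start, fmt), datetime.strptime(test_end, fmt)
    train = [r for r in rows if t0 <= r["datetime"] < t1]
    test = [r for r in rows if s0 <= r["datetime"] < s1]
    return train, test


# Toy transactions (hypothetical values).
rows = [
    {"tid": 1, "datetime": datetime(2022, 1, 15)},
    {"tid": 2, "datetime": datetime(2022, 3, 9)},
    {"tid": 3, "datetime": datetime(2022, 3, 10)},
    {"tid": 4, "datetime": datetime(2022, 3, 30)},
    {"tid": 5, "datetime": datetime(2022, 4, 2)},  # outside both ranges
]

train, test = time_split(rows, "2022/01/01", "2022/03/10", "2022/03/10", "2022/03/31")
```

Splitting by time rather than at random keeps every test transaction later than every training transaction, which mirrors how a fraud model is used in production.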

In [None]:
# Training/Test splits, datasets creation. Using timerange arguments.
train_start = "2022/01/01"
train_end = "2022/03/10"
test_start = "2022/03/10"
test_end = "2022/03/31"

td_version, td_job = feature_view.create_train_test_split(
    train_start=train_start,
    train_end=train_end,
    test_start=test_start,
    test_end=test_end,
    data_format="csv",
    coalesce=True,
    write_options={'wait_for_job': True},
)


# Read back the training dataset version that was just created.
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(td_version)

The feature view and training dataset are now visible in the UI.

![fg-overview](../images/fv_overview.gif)

In [None]:
X_train = X_train.sort_values("datetime")
y_train = y_train.reindex(X_train.index)

In [None]:
X_test = X_test.sort_values("datetime")
y_test = y_test.reindex(X_test.index)

In [None]:
test_sample = X_test.cc_num.values[0]

In [None]:
X_train.drop(["tid", "cc_num", "datetime"], axis=1, inplace=True)
X_test.drop(["tid", "cc_num","datetime"], axis=1, inplace=True)

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus you should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, you will use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class, which in our case means that higher importance will be placed on positive (fraudulent) samples.
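
One common way to derive such a class weight is the ratio of negative to positive samples in the training labels. The sketch below assumes binary labels; the helper name and the toy labels are hypothetical.

```python
def positive_class_weight(labels):
    """Return the ratio of negative to positive samples in binary labels."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("No positive samples; the weight is undefined.")
    return n_neg / n_pos


# Toy labels (hypothetical): 2% of the transactions are fraudulent.
labels = [0] * 98 + [1] * 2
weight = positive_class_weight(labels)  # 49.0
```

With XGBoost, a value like this could be passed as `xgb.XGBClassifier(scale_pos_weight=weight)`, so that each fraudulent sample counts roughly as much as 49 normal ones in the loss.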

---

## <span style="color:#ff5f27;"> 🧬 Modeling</span>

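Next you will train a model. Note that the cell below trains an `XGBClassifier` with default hyperparameters; to place a larger weight on the positive class, XGBoost provides the `scale_pos_weight` parameter.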

In [None]:
clf = xgb.XGBClassifier()

clf.fit(X_train.values, y_train)

In [None]:
# Train Predictions
y_pred_train = clf.predict(X_train.values)

# Test Predictions
y_pred_test = clf.predict(X_test.values)

In [None]:
y_pred_test

In [None]:
X_test

In [None]:
# Compute f1 score
metrics = {"f1_score": f1_score(y_test, y_pred_test, average='macro')}
metrics

In [None]:
results = confusion_matrix(y_test, y_pred_test, labels=[False, True])
print(results)

In [None]:
# Create the output directory if it does not exist yet.
os.makedirs("assets", exist_ok=True)

df_cm = pd.DataFrame(results, ['True Normal', 'True Fraud'],['Pred Normal', 'Pred Fraud'])

cm = sns.heatmap(df_cm, annot=True)

fig = cm.get_figure()
fig.savefig("assets/confusion_matrix.png") 
fig.show()
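
To connect the confusion matrix with the macro-averaged F1 score computed earlier, the sketch below derives one from the other for the binary case. It assumes the same layout as the `labels=[False, True]` call above, that is `[[TN, FP], [FN, TP]]`; the `macro_f1` helper and the toy counts are hypothetical.

```python
def macro_f1(cm):
    """Compute the macro-averaged F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    f1_fraud = f1(tp, fp, fn)    # fraud treated as the positive class
    f1_normal = f1(tn, fn, fp)   # normal treated as the positive class
    return (f1_fraud + f1_normal) / 2


# Toy counts (hypothetical): 8 normal and 4 fraudulent transactions.
score = macro_f1([[6, 2], [2, 2]])  # 0.625
```

Because the macro average weights both classes equally, poor performance on the rare fraud class lowers the score much more than it would lower plain accuracy.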

---

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train.values)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# The 'fraud_online_model' directory will be saved to the model registry
model_dir = "fraud_online_model"
os.makedirs(model_dir, exist_ok=True)

joblib.dump(clf, model_dir + '/xgboost_fraud_online_model.pkl')

shutil.copyfile("assets/confusion_matrix.png", model_dir + "/confusion_matrix.png")

In [None]:
mr = project.get_model_registry()

fraud_model = mr.python.create_model(
    name="xgboost_fraud_online_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=4700702588013561, 
    description="Fraud Online Predictor")

fraud_model.save(model_dir)

---

## <a class="anchor" id="1.5_bullet" style="color:#ff5f27"> 🚀 Model Deployment</a>


### About Model Serving
Models can be served via KFServing or "default" serving, which means a Docker container exposing a Flask server. For KFServing models, or models written in TensorFlow, you do not need to write a prediction file (see the section below). However, for sklearn models using default serving, you do need to write a prediction file.

In order to use KFServing, you must have Kubernetes installed and enabled on your cluster.

### <span style="color:#ff5f27;">📎 Predictor script for Python models</span>


Scikit-learn and XGBoost models are deployed as Python models, in which case you need to provide a **Predict** class that implements the **predict** method. The **predict()** method invokes the model on the inputs and returns the prediction as a list.

The **\_\_init\_\_()** method is run when the predictor is loaded into memory. It loads the model from the local directory it is materialized to, *ARTIFACT_FILES_PATH*.
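
The contract described above can be tried out locally with stand-ins before deploying anything. In the sketch below, `StubModel`, `StubPredict`, the in-memory feature lookup and all values are hypothetical; they replace the trained model and the feature view lookup only to show the shape of the inputs and outputs.

```python
import json


class StubModel:
    """Hypothetical stand-in for the trained classifier."""

    def predict(self, rows):
        # Toy rule: flag a row as fraud when its first feature exceeds 100.
        return [int(row[0] > 100.0) for row in rows]


class StubPredict:
    def __init__(self):
        # Runs once when the predictor is loaded. A real predictor would
        # load the model from ARTIFACT_FILES_PATH at this point.
        self.model = StubModel()
        # Hypothetical replacement for the online feature lookup, keyed by card number.
        self.online_features = {1111: [250.0, 3.0], 2222: [12.5, 1.0]}

    def predict(self, inputs):
        # `inputs` is the list sent in the request; the first entry is the key.
        feature_vector = self.online_features[inputs[0]]
        # Return a plain list so that the response is JSON serializable.
        return list(self.model.predict([feature_vector]))


predictor = StubPredict()
response = json.dumps(predictor.predict([1111]))  # '[1]'
```

Keeping the expensive setup in the constructor means that each request only pays for the feature lookup and the model call.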

The `%%writefile` directive writes the contents of the cell below to the given Python file. You will use the **predict_example.py** file to create a deployment for your model.

In [None]:
%%writefile predict_example.py
import os
import numpy as np
import hsfs
import joblib

class Predict(object):

    def __init__(self):
        """ Initializes the serving state, reads a trained model"""        
        # get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()
        
        # get feature views
        self.fv = self.fs.get_feature_view("transactions_fraud_online_fv", 1)
        
        # initialise serving
        self.fv.init_serving(1)

        # load the trained model
        self.model = joblib.load(os.environ["ARTIFACT_FILES_PATH"] + "/xgboost_fraud_online_model.pkl")
        print("Initialization Complete")

    def predict(self, inputs):
        """ Serves a prediction request using a trained model"""
        feature_vector = self.fv.get_feature_vector({"cc_num": inputs[0]})
        x = feature_vector[3:] # get rid of 'tid', 'datetime', 'cc_num'
    
        # Numpy Arrays are not JSON serializable
        return self.model.predict(np.asarray(x).reshape(1, -1)).tolist()
    

If you wonder where the model files end up, it is useful to know that the Data Sets tab in the Hopsworks UI lets you browse the files in the project. Registered models are found under the Models directory. Since you saved your model with the name `xgboost_fraud_online_model`, that is the directory to look in, and `1` is the version of the model you want to deploy.

This script needs to be put into a known location in the Hopsworks file system. Let's call the file predict_example.py and put it in the Models directory.

In [None]:
dataset_api = project.get_dataset_api()

uploaded_file_path = dataset_api.upload("predict_example.py", "Models", overwrite=True)
predictor_script_path = os.path.join("/Projects", project.name, uploaded_file_path)

### Create the deployment
Here, you fetch the model you want from the model registry and define a configuration for the deployment. For the configuration, you need to specify the serving type (default or KFServing).

In [None]:
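try:
    # Reuse the deployment if it already exists.
    ms = project.get_model_serving()
    deployment = ms.get_deployment("fraudonlinemodeldeployment")
except Exception:
    # Otherwise create it from the registered model.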
    deployment = fraud_model.deploy(
        name="fraudonlinemodeldeployment",
        serving_tool="DEFAULT",
        script_file=predictor_script_path
    )

In [None]:
print("Deployment: " + deployment.name)
deployment.describe()

#### The deployment has now been registered. However, to start it you need to run the following command:

In [None]:
state = deployment.get_state()

if state.status != "Running":
    deployment.start()
    deployment.describe()
else:
    print("Deployment already running")

---

## <span style='color:#ff5f27'>🔮 Predicting using deployment</span>


Finally you can start making predictions with your model!

Send inference requests to the deployed model as follows:

In [None]:
data = {
    "inputs": [fraud_model.input_example]
}

In [None]:
# to troubleshoot you can use `get_logs()` method
deployment.get_logs()

In [None]:
prediction = deployment.predict(data)

In [None]:
prediction

### Stop Deployment
To stop the deployment you simply run:

In [None]:
deployment.stop()

---

## <span style="color:#ff5f27;">👾 Streamlit App</span>


If you want to see interactive dashboards, use a **Streamlit App**.

Type the following commands in a terminal to run the Streamlit App:

`cd {%path_to_hopsworks_tutorials%/bitcoin/}`

`python -m streamlit run streamlit_app.py`

### <span style="color:#ff5f27;"> 👓  Exploration</span>
In the Hopsworks feature store, the metadata allows for multiple levels of exploration and review. Here you will explore a few of those capabilities.

### <span style="color:#ff5f27;"> 🔎 Search</span>
Using the search function in the UI, you can query any aspect of the feature groups, feature views and training data that were previously created.

### <span style="color:#ff5f27;"> 📊 Statistics</span>
You can also enable statistics in one or all the feature groups.

In [None]:
trans_fg = fs.get_or_create_feature_group("transactions_fraud_online_fg", version=1)
trans_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

trans_fg.update_statistics_config()
trans_fg.compute_statistics()

![fg-statistics](../images/fg_statistics.gif)


### ⛓️ <b> Lineage </b> 
For every feature group and feature view, you can inspect how the abstractions relate to each other: which feature group was used to create which training dataset, and which model that training dataset was used in.
This allows for a clear understanding of the pipeline in relation to each element.