![hopsworks_logo](../images/hopsworks_logo.png)

# Part 03: Model training & UI Exploration
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/fraud_batch/3_model_training.ipynb)

**Note**: you may get an error when installing hopsworks on Colab, and it is safe to ignore it.

In this last notebook, we will train a model on the dataset we created in the previous tutorial. We will train our model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch. We will also show some of the exploration that can be done in Hopsworks, notably the search functions and the lineage.

## 🗒️ This notebook is divided into 3 main sections:
1. **Loading the training data**
2. **Train the model**
3. **Explore feature groups and views** via the UI.

![tutorial-flow](../images/03_model.png)

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

---

## <span style="color:#ff5f27;"> ✨ Load Training Data </span>

First, we'll need to fetch the training dataset that we created in the previous notebook. We will use January - February data for training and testing.

In [None]:
feature_view = fs.get_feature_view("transactions_fraud_online_fv", 1)

In [None]:
X_train, y_train = feature_view.get_training_data(1)
X_test, y_test = feature_view.get_training_data(2)

We will train a model to predict `fraud_label` given the rest of the features.

Let's check the distribution of our target label.

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus we should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, we'll use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class, which in our case means that higher importance will be placed on positive (fraudulent) samples.
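
As a rough illustration of how class weights can be derived from the label distribution, the sketch below computes "balanced" weights, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`. The label list is made-up toy data, not the tutorial's dataset, and the helper name is hypothetical; the model trained in this notebook uses fixed weights instead.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 95 legitimate transactions and 5 fraudulent ones.
toy_labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer a class is, the larger its weight, so mistakes on fraudulent samples cost more during training.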

---

## <span style="color:#ff5f27;"> ⚜️ Weights and Biases </span>

[Weights and Biases](https://wandb.ai/) is a free Python library that allows you to track, compare, and visualize ML experiments so that you can build better models faster.

In our case we will use **W&B** to track Data Lineage using **Artifacts**, find the best hyperparameters using **Sweeps**, and visualize model performance.

To begin with, let's install the `wandb` library.

In [None]:
!pip install wandb --quiet

Sign up for a free account by going to the [sign up page](https://wandb.ai/home).

After that, log in.

In [None]:
import wandb

wandb.login()

In [None]:
PROJECT_NAME = 'fraud_online'

---

## <span style="color:#ff5f27;"> 🗃 W&B Artifacts </span>

Use W&B Artifacts for dataset versioning, model versioning, and tracking dependencies and results across machine learning pipelines. Think of an artifact as a versioned folder of data. You can store entire datasets directly in artifacts, or use artifact references to point to data in other systems like S3, GCP, or your own system. 

You can also visualize Data Lineage to better understand the project pipeline.

In [None]:
# create a run in W&B
run = wandb.init(
    project=PROJECT_NAME,
    job_type="upload_feature_view",
    name='metadata'
)

# create an artifact for all the raw data
raw_data_at = wandb.Artifact(
    "transactions_view_fraud_online_fv_version1", 
    type="feature_view",
    metadata={key: repr(value) for key, value in feature_view.to_dict().items()}
)

# save artifact to W&B
run.log_artifact(raw_data_at)

In [None]:
run = wandb.init(
    project=PROJECT_NAME,
    name="train_validation_test_split",
    job_type='split'
)

data_at = run.use_artifact("transactions_view_fraud_online_fv_version1:latest")
data_dir = data_at.download()

artifacts = {}

for split in ['train','validation','test']:
    artifacts[split] = wandb.Artifact(f'{split}_split', type="split")  
    
for split, artifact in artifacts.items():
    run.log_artifact(artifact)

To check the Data Lineage, follow these steps:

1. Go to the [W&B main page](https://wandb.ai/home).

2. Select the **"fraud_online"** project.

3. Select the **"Artifacts"** icon in the left sidebar.

4. Inspect the `transactions_view_fraud_online_fv_version1` artifact (type `feature_view`).

5. Go to the **"Lineage"** tab and then press **"Explode"**.

At this point the Data Lineage should look like this:

![image.png](../images/wandb_fraud_online_lineage.png)

---
## <span style="color:#ff5f27;"> 🏋️‍♀️ Define Model Training Function</span>

Next, we define the function that the Sweep agent will call for each run.

In this function we:

- Set default hyperparameters for the model.

- Initialize a new W&B Run using `wandb.init`.

- Register all hyperparameters through `wandb.config`.

- Create a RandomForestClassifier with a set of hyperparameters.

- Fit a RandomForestClassifier.

- Predict and evaluate.

- Log all metrics using `wandb.log`.

- Draw diagnostic plots using `wandb.sklearn.plot_classifier`.

#### <span style="color:#ff5f27;"> 📝 Importing Libraries</span>

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix, f1_score

In [None]:
def train_model(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):

    config_defaults = {
        'n_estimators': 100, 
        'criterion': 'gini',
        'min_samples_split': 10
    }
    features = X_train.columns

    wandb.init(config=config_defaults)
    config = wandb.config

    model = RandomForestClassifier(
        n_estimators=config.n_estimators, 
        criterion=config.criterion,
        min_samples_split=config.min_samples_split,
        max_features='sqrt',
        n_jobs=-1,
        random_state=42,
        class_weight={0: 0.1, 1: 0.9}
    )
    model.fit(X_train, y_train)

    y_preds = model.predict(X_test)
    y_probas = model.predict_proba(X_test)
  
    score = f1_score(y_test, y_preds)
    print(f"F1_score: {round(score, 4)}")

    wandb.log({"F1_score": score})
    
    wandb.sklearn.plot_classifier(model, X_train, X_test, y_train, y_test, 
                                y_preds, y_probas, features, model_name='RandomForestClassifier')

---

## <span style="color:#ff5f27;"> 🔧 Define Sweep Configurations</span>

The next step is to define configurations for Sweep.

**Weights & Biases Sweeps** are used to automate hyperparameter optimization and explore the space of possible models.

You define a Sweep in the form of a dictionary.

It should include the following keys:

- `method`: the search strategy (**Bayesian**, **Grid** or **Random** search).

- `metric`: the name and goal (maximize or minimize) of the metric to optimize. **Example**: *name: MSE, goal: minimize*.

- `parameters`: a dictionary whose keys are the hyperparameter names and whose values describe what to search over, here a list of candidate values stored under `values`.

In [None]:
sweep_configs = {
    "method": "grid",
    "metric": {
        "name": "F1_score",  # must match the key passed to wandb.log
        "goal": "maximize"
    },
    "parameters": {
        "n_estimators": {
            "values": [75,150]
        },
        "criterion": {
            "values": ['gini','entropy']
        },
        "min_samples_split": {
            "values": [5,15]
        }
    }
}
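
To get a feel for the size of this search space, the sketch below enumerates every combination a grid search over these values would cover. It uses only the standard library and a local copy of the candidate values (named `grid_parameters` here purely for illustration); it does not call W&B.

```python
from itertools import product

# Local copy of the candidate values from the sweep configuration.
grid_parameters = {
    "n_estimators": [75, 150],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [5, 15],
}

names = list(grid_parameters)
combinations = [
    dict(zip(names, values))
    for values in product(*(grid_parameters[name] for name in names))
]

print(len(combinations))
for combo in combinations:
    print(combo)
```

With two candidates for each of three hyperparameters there are 2 × 2 × 2 = 8 combinations, so an agent limited to fewer runs than that would not cover the whole grid.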

Then we initialize the sweep and run the sweep agent.

In [None]:
sweep_id = wandb.sweep(
    sweep=sweep_configs,
    project=PROJECT_NAME
)

In [None]:
wandb.agent(
    sweep_id=sweep_id,
    function=train_model,
    count=5
)

# If this cell crashes, restart the kernel and don't re-run the artifact cells above;
# run only the Sweep cells instead.

### <span style="color:#ff5f27;">🎉 Great! 📈</span>

Now you can go to the **Weights & Biases** UI to look at the results.

There you will find plots that help you monitor the model development process, such as **Feature Importance**, **ROC Curve**, **Confusion Matrix** and others.

---

## <span style="color:#ff5f27;">🧬 Modeling</span>

![plots.gif](../images/plots.gif)

You can also explore how different hyperparameters affect the model.

In addition, you can sort all runs by any column. Here, sort by the `F1_score` metric to find the best set of hyperparameters.

![hyperparams.gif](../images/hyperparams.gif)

---
## <span style="color:#ff5f27;">🚀 Fit the best model</span>

In [None]:
# Train model.
pos_class_weight = 0.9

clf = RandomForestClassifier(
    n_estimators=150,
    criterion='entropy',
    min_samples_split=15,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42,
    # weight the positive (fraud) class more heavily than the negative class
    class_weight={0: 1 - pos_class_weight, 1: pos_class_weight}
)

clf.fit(X_train, y_train)

In [None]:
# Test Predictions
y_pred_test = clf.predict(X_test)
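
`predict` returns hard class labels. Another of the imbalance remedies mentioned earlier, modifying the decision threshold, works on predicted probabilities instead. The sketch below applies two thresholds to made-up fraud probabilities (in this notebook they could come from the second column of `clf.predict_proba(X_test)`); the helper name and the numbers are illustrative only.

```python
def apply_threshold(fraud_probabilities, threshold=0.5):
    """Label a transaction as fraud (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in fraud_probabilities]


# Hypothetical predicted probabilities of the positive (fraud) class.
toy_probabilities = [0.05, 0.20, 0.35, 0.55, 0.90]

default_labels = apply_threshold(toy_probabilities)         # threshold 0.5
sensitive_labels = apply_threshold(toy_probabilities, 0.3)  # lower threshold flags more transactions

print(default_labels)
print(sensitive_labels)
```

Lowering the threshold trades more false alarms for fewer missed frauds; which balance is right depends on the cost of each kind of error.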

In [None]:
cm = confusion_matrix(y_test, y_pred_test)
sns.heatmap(cm, annot=True, fmt="d")  # fmt="d" prints the counts as plain integers
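
For a binary problem with labels 0 and 1, scikit-learn's `confusion_matrix` returns `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). The sketch below, using made-up counts rather than results from this notebook, shows how precision, recall and the F1 score follow from those four numbers. The helper name is hypothetical.

```python
def scores_from_confusion_matrix(cm):
    """Compute precision, recall and F1 for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 990 true negatives, 2 false positives,
# 4 false negatives and 4 true positives.
toy_cm = [[990, 2], [4, 4]]
precision, recall, f1 = scores_from_confusion_matrix(toy_cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that the large number of true negatives does not enter any of the three scores, which is why F1 is a more telling metric than accuracy on skewed data.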

---

### <span style="color:#ff5f27;">🔮 Saving Model in W&B</span>

In [None]:
import joblib

run = wandb.init(project=PROJECT_NAME, job_type="model_building", name="rfclassifier")

data_at = run.use_artifact("train_split:latest")
data_dir = data_at.download()

model_artifact = wandb.Artifact(
    "RandomForestClassifier",
    type="model",
    description=(
        "This model is trained on the data from a `Hopsworks Feature View`. "
        "You can check it on https://app.hopsworks.ai. "
        "Just log in, go to the `Feature Views` page and find **transactions_fraud_online_fv**."
    ),
    metadata=dict(sweep_configs),
)

joblib.dump(clf, "model.joblib")

model_artifact.add_file("model.joblib")

wandb.save("model.joblib")

run.log_artifact(model_artifact)

---

## <span style="color:#ff5f27;">📝 Register model in Hopsworks</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

Let's connect to the model registry using the [HSML library](https://docs.hopsworks.ai/machine-learning-api/latest) from Hopsworks.

In [None]:
mr = project.get_model_registry()

In [None]:
import joblib
joblib.dump(clf, 'model.pkl')

---

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

In [None]:
from sklearn.metrics import f1_score
# Compute the F1 score. Note: with average='micro' on binary labels this equals accuracy.
metrics = {"fscore": f1_score(y_test, y_pred_test, average='micro')}
metrics
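
One thing worth knowing about `average='micro'`: for a binary problem it pools the counts of both classes, which makes the result equal to plain accuracy. The sketch below checks this on made-up confusion-matrix counts (not results from this notebook) and contrasts it with the F1 score of the positive class alone.

```python
# Hypothetical counts for an imbalanced binary problem.
tn, fp, fn, tp = 990, 2, 4, 4
total = tn + fp + fn + tp

# Micro averaging pools TP, FP and FN over both classes.
# With class 0 as "positive": TP=tn, FP=fn, FN=fp.
# With class 1 as "positive": TP=tp, FP=fp, FN=fn.
pooled_tp = tn + tp
pooled_fp = fn + fp
pooled_fn = fp + fn
micro_f1 = 2 * pooled_tp / (2 * pooled_tp + pooled_fp + pooled_fn)

accuracy = (tn + tp) / total
positive_class_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"micro F1 = {micro_f1:.4f}")
print(f"accuracy = {accuracy:.4f}")
print(f"positive-class F1 = {positive_class_f1:.4f}")
```

On skewed data the micro-averaged value can look excellent while the positive-class F1 is much lower, so it helps to keep in mind which of the two a registered metric reports.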

In [None]:
test_credit_card = [4467360740682089]
model = mr.sklearn.create_model(
    name="transactions_fraud_online_model",
    metrics=metrics,
    description="Random forest fraud detection model",
    input_example = test_credit_card,
    model_schema=model_schema
)

model.save('model.pkl')

---

## <span style="color:#ff5f27;">🚀 Deploy model</span>
### About Model Serving
Models can be served via KFServing (since renamed KServe) or "default" serving, which means a Docker container exposing a Flask server. For KFServing models, or models written in TensorFlow, you do not need to write a prediction file (see the section below). However, for sklearn models using default serving, you do need to write a prediction file.

In order to use KFServing, you must have Kubernetes installed and enabled on your cluster.

### Create the Prediction File
In order to deploy a model, you need to write a Python file containing the logic to return a prediction from the model. Don't worry, this is usually a matter of just modifying some paths in a template script. An example can be seen in the code block below, where we have taken this Scikit-learn template script and changed two paths (see comments).

In [None]:
%%writefile predict_example.py
import os
import numpy as np
import hsfs
import joblib

class Predict(object):

    def __init__(self):
        """ Initializes the serving state, reads a trained model"""        
        # get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()
        
        # get feature views
        self.fv = self.fs.get_feature_view("transactions_fraud_online_fv", 1)
        
        # initialise serving
        self.fv.init_serving(1)

        # load the trained model
        self.model = joblib.load(os.environ["ARTIFACT_FILES_PATH"] + "/model.pkl")
        print("Initialization Complete")

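    def predict(self, inputs):
        """ Serves a prediction request using a trained model"""
        # look up the feature vector for the given credit card number
        feature_vector = self.fv.get_feature_vector({"cc_num": inputs[0]})
        # the model expects a 2D array with a single row
        features = np.asarray(feature_vector).reshape(1, -1)
        return self.model.predict(features).tolist()  # NumPy arrays are not JSON serializable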


The script loads `model.pkl` from the directory given by the `ARTIFACT_FILES_PATH` environment variable, so no model path is hard-coded. If you want to see where the registered files live, the Data Sets tab in the Hopsworks UI lets you browse the files in the project. Registered models are found underneath the Models directory. Since we saved our model with the name `transactions_fraud_online_model`, that is the directory to look in, and `1` is the version of the model we want to deploy.

This script needs to be put into a known location in the Hopsworks file system. Let's call the file `predict_example.py` and put it in the Models directory.

In [None]:
import os
dataset_api = project.get_dataset_api()

uploaded_file_path = dataset_api.upload("predict_example.py", "Models", overwrite=True)

predictor_script_path = os.path.join("/Projects", project.name, uploaded_file_path)

In [None]:
predictor_script_path

**Note**: on Windows, `predictor_script_path` may come out with backslashes, for example:
`'/Projects\\romankah\\Models/predict_example.py'`
That is incorrect; rewrite it with forward slashes:
`'/Projects/romankah/Models/predict_example.py'`
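
One way to sidestep this, sketched under the assumption that the upload call returns a relative path such as `Models/predict_example.py`, is to build the path with `posixpath`, which always joins with forward slashes regardless of the operating system. The helper name is hypothetical and the project name is just the example value from the note above.

```python
import posixpath


def hopsworks_path(project_name, relative_path):
    """Join path parts with forward slashes, also normalising stray backslashes."""
    return posixpath.join("/Projects", project_name, relative_path.replace("\\", "/"))


# Example values (project name taken from the note above).
fixed = hopsworks_path("romankah", "Models/predict_example.py")
print(fixed)

# An already-built Windows-style path can be repaired the same way.
broken = "/Projects\\romankah\\Models/predict_example.py"
repaired = broken.replace("\\", "/")
print(repaired)
```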

---

## 📡 Create the deployment
Here, we fetch the model we want from the model registry and define a configuration for the deployment. For the configuration, we need to specify the serving type (default or KFServing). In this case, since we use default serving and an sklearn model, we need to give the location of the prediction script.

In [None]:
# Use the name of the model registered above.
model = mr.get_model("transactions_fraud_online_model", version=1)

In [None]:
# Give it any name you want
deployment = model.deploy(
    name="fraudonlinemodeldeployment", 
    model_server="PYTHON",
    script_file=predictor_script_path,
#     serving_tool = "KSERVE"
)

In [None]:
print("Deployment: " + deployment.name)
deployment.describe()

#### The deployment has now been registered. However, to start it you need to run:

In [None]:
deployment.start()

In [None]:
deployment.get_logs()

---

## <span style='color:#ff5f27'>🔮 Predicting using deployment</span>
Let's use the input example that we registered together with the model to query the deployment.

In [None]:
model.input_example

In [None]:
data = {
    "inputs": model.input_example
}

deployment.predict(data)
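
The request body is an ordinary dictionary. Assuming the client serialises it as JSON (the predictor script's comment about JSON-serializable outputs suggests so), the sketch below shows what the payload looks like once encoded and how the script's `predict` method would presumably pick the card number out of `inputs`. The card number is the input example registered earlier.

```python
import json

# Same shape as the request built above; the value is the registered input example.
payload = {"inputs": [4467360740682089]}

encoded = json.dumps(payload)
print(encoded)

# On the serving side, the predictor is assumed to receive the list under "inputs"
# and to use its first element as the cc_num lookup key.
decoded = json.loads(encoded)
cc_num = decoded["inputs"][0]
print(cc_num)
```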

In [None]:
deployment.get_logs()

In [None]:
deployment.stop()

---

## <span style="color:#ff5f27;"> 🎁  Wrapping things up </span>

We have now trained a simple model on training data that we created in the feature store. This concludes the first module and the introduction to the core aspects of the feature store. In the second module we will introduce streaming and external feature groups for a similar fraud use case.