## Model training and registration

In this notebook we will:

- Train a model using the best hyperparameters from the previous notebook.
- Register the model in the model registry.
- Fetch the best-performing model from the model registry.

Along the way, we introduce the `hsml` (**H**opsworks **M**achine **L**earning) library, which keeps track of models and deploys them.

In [None]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

### Load Best Hyperparameters

We will train a model using the best hyperparameters found in the previous notebook.
Recall that we saved them as a pickled dictionary and uploaded it to our cluster.
If you don't have the file locally, you can download it with the *hopsworks* library.

In [None]:
import hopsworks

hopsworks_conn = hopsworks.connection()
project = hopsworks_conn.get_project()
dataset_api = project.get_dataset_api()

uploaded_file_path = "Resources/best_params.pickle"
dataset_api.download(uploaded_file_path)

Retrieve the best hyperparameters from this file.

In [None]:
import pickle

with open("best_params.pickle", "rb") as f:
    best_params = pickle.load(f)

pos_class_weight = best_params["pos_class_weight"]

print(pos_class_weight)
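For reference, here is a minimal sketch of the pickle round trip that this step relies on. The value `0.9` and the temporary directory are assumptions made for illustration; the real dictionary comes from the previous notebook.

```python
import os
import pickle
import tempfile

# Hypothetical tuning result; the real value comes from the previous notebook.
best_params = {"pos_class_weight": 0.9}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_params.pickle")

    # Write the dictionary in binary mode, as pickle requires.
    with open(path, "wb") as f:
        pickle.dump(best_params, f)

    # Read it back in binary mode, mirroring the loading step.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["pos_class_weight"])
```

Only unpickle files that you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.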

### Load Training Data & Train Model

Next, we'll train a model in the same way as the previous notebook.

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Load data.
td = fs.get_training_dataset("transactions_dataset_splitted")
X_train = td.read("train")
X_val = td.read("validation")

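# One-hot encode categorical feature "category".
# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output` (`sparse` is removed in 1.4).
enc = OneHotEncoder(sparse=False)
# Reuse the original indexes so that pd.concat aligns the rows correctly.
one_hot_train = pd.DataFrame(enc.fit_transform(X_train[["category"]]), index=X_train.index)
one_hot_val = pd.DataFrame(enc.transform(X_val[["category"]]), index=X_val.index)
X_train = pd.concat([X_train.drop(columns="category"), one_hot_train], axis=1)
X_val = pd.concat([X_val.drop(columns="category"), one_hot_val], axis=1)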

# Separate target feature from input features.
target = td.label[0]  # "fraud_label"
y_train = X_train.pop(target)
y_val = X_val.pop(target)

# Train model.
clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')
clf.fit(X_train, y_train)
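To see what the `class_weight` dictionary does, the sketch below computes a class-weighted log loss by hand. It is a simplification: it ignores the regularization term and the exact scaling used by the `liblinear` solver. The labels, predicted probabilities, and the function name `weighted_log_loss` are made up for illustration.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_class_weight):
    """Average the per-sample log losses, each scaled by its class weight."""
    # Same weighting scheme as in the training cell.
    class_weight = {0: 1.0 - pos_class_weight, 1: pos_class_weight}
    total = 0.0
    for y, p in zip(y_true, p_pred):
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * sample_loss
    return total / len(y_true)


# Hypothetical data: three legitimate transactions and one poorly predicted fraud.
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.2, 0.1, 0.3]

balanced = weighted_log_loss(y_true, p_pred, pos_class_weight=0.5)
fraud_heavy = weighted_log_loss(y_true, p_pred, pos_class_weight=0.9)

print(balanced, fraud_heavy)
```

With a larger positive class weight, the missed fraud case contributes more to the loss, so the optimizer is pushed harder to classify the rare fraud examples correctly.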

We will also compute some evaluation metrics, which we will register together with the model.

In [None]:
from sklearn.metrics import precision_recall_fscore_support, classification_report

preds = clf.predict(X_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    'precision': precision,
    'recall': recall,
    'fscore': fscore
}

print(classification_report(y_val, preds))
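As a quick illustration of what these three numbers mean for the positive (fraud) class, here is a minimal hand-rolled version in plain Python. The helper name `binary_precision_recall_f1` and the toy labels are assumptions for illustration; in the notebook itself we rely on scikit-learn.

```python
def binary_precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

toy_precision, toy_recall, toy_f1 = binary_precision_recall_f1(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

Precision is the share of predicted frauds that are real, recall is the share of real frauds that were caught, and the F1-score is their harmonic mean.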


### Register model

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

Let's connect to the model registry using the [HSML library](https://docs.hopsworks.ai/machine-learning-api/latest) from Hopsworks.

In [None]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()

Before registering the model we will export it as a pickle file using joblib.

In [None]:
import joblib

joblib.dump(clf, 'model.pkl')

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

With the schema in place, we can finally register our model.

In [None]:
model = mr.sklearn.create_model(
    name="fraud_tutorial_model",
    metrics=metrics,
    description="Logistic regression model trained with class weights.",
    input_example=X_train.sample(),
    model_schema=model_schema
)

model.save('model.pkl')

Here we have also saved an input example from the training data, which can be helpful for testing.

Every time you save a model under an existing name, the registry stores it as a new version rather than overwriting the previous one. This lets you compare several versions of the same model. You can also register the model under a new name if you prefer.
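The versioning behaviour can be pictured with a toy in-memory registry. This is only a sketch of the idea, not how Hopsworks implements it; the class `ToyRegistry`, its methods, and the metric values are hypothetical.

```python
class ToyRegistry:
    """Toy stand-in for a model registry that never overwrites a version."""

    def __init__(self):
        # Maps a model name to the list of metrics saved for each version.
        self._versions = {}

    def save(self, name, metrics):
        """Store the metrics as a new version and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(dict(metrics))
        return len(versions)  # versions are numbered from 1

    def get(self, name, version):
        """Return the metrics stored for a given version."""
        return self._versions[name][version - 1]


registry = ToyRegistry()
first = registry.save("fraud_tutorial_model", {"fscore": 0.71})
second = registry.save("fraud_tutorial_model", {"fscore": 0.78})

print(first, second)
print(registry.get("fraud_tutorial_model", 1))
```

Saving twice under the same name yields versions 1 and 2, and the first version is still available afterwards.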

#### Finding the best-performing model

Let's imagine you have trained and registered several versions of the same model. You can now query the model registry for the best version according to a metric of your choice, in our case the F1-score.

The `direction` argument states whether the metric should be maximized (`"max"`) or minimized (`"min"`). A higher F1-score is better, so we use `"max"`.

In [None]:
best_model = mr.get_best_model(name="fraud_tutorial_model", metric="fscore", direction="max")
best_model.to_dict()
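Conceptually, such a query compares the stored metric across versions and keeps the one with the highest or lowest value. The sketch below illustrates this in plain Python. The function `pick_best` and the records are hypothetical and say nothing about how the registry is implemented.

```python
def pick_best(models, metric, direction):
    """Return the record with the highest ("max") or lowest ("min") metric."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    chooser = max if direction == "max" else min
    return chooser(models, key=lambda model: model["metrics"][metric])


# Hypothetical registered versions with made-up metric values.
registered = [
    {"version": 1, "metrics": {"fscore": 0.71, "loss": 0.40}},
    {"version": 2, "metrics": {"fscore": 0.78, "loss": 0.35}},
    {"version": 3, "metrics": {"fscore": 0.74, "loss": 0.31}},
]

best_by_fscore = pick_best(registered, metric="fscore", direction="max")
best_by_loss = pick_best(registered, metric="loss", direction="min")

print(best_by_fscore["version"], best_by_loss["version"])
```

The two metrics pick different versions here, which is why it pays to decide on the selection criterion before comparing models.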

### Next Steps

In the next notebook, we'll look at model serving for the model we just registered to the Model Registry.