## Model training and registration

In this notebook we will:

- Load a training dataset from the feature store
- Train a model on the dataset
- Register the model in the model registry

This will introduce a new library, `hsml`, which contains functionality to keep track of models and deploy them.

We will train the model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow and PyTorch.

In [1]:
import hsfs
import pandas as pd

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Load training data from feature store

First, we'll need to fetch the training dataset that we created in the previous notebook.

<!-- As you might remember, the feature store has mutable feature groups and immutable training datasets. The feature groups can get continuously updated, but a training dataset is "frozen" once created, including potential training validation splits.   -->

In [4]:
td = fs.get_training_dataset("transactions_dataset_splitted", version=1)
train_df = td.read('train')
val_df = td.read('validation')

train_df.head()



The next cell checks that we are working with pandas DataFrames. If this notebook is ever executed as a Hopsworks Job, it may run with a PySpark kernel, in which case `train_df` and `val_df` would be PySpark DataFrames at this point, which would not work when training a Scikit-learn model.

In [7]:
# Convert to pandas if the data was read as PySpark DataFrames.
if not isinstance(train_df, pd.DataFrame):
    train_df = train_df.toPandas()
    val_df = val_df.toPandas()
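
The same guard can be written as a small helper that converts only objects exposing a `toPandas()` method, which is the PySpark DataFrame method used in the cell above. This is a minimal sketch: the helper name `ensure_pandas` and the stub class `FakeSparkFrame` are hypothetical and exist only so that the example runs without a Spark session.

```python
def ensure_pandas(df):
    """Return `df` converted via toPandas() if it offers that method, else unchanged."""
    to_pandas = getattr(df, "toPandas", None)
    return to_pandas() if callable(to_pandas) else df


class FakeSparkFrame:
    """Stand-in for a PySpark DataFrame (assumption, for demonstration only)."""

    def __init__(self, rows):
        self._rows = rows

    def toPandas(self):
        # A real PySpark DataFrame would return a pandas DataFrame here.
        return list(self._rows)


local_rows = [{"amount": 10.0}]
unchanged = ensure_pandas(local_rows)                  # no toPandas(): returned as-is
converted = ensure_pandas(FakeSparkFrame(local_rows))  # converted via toPandas()
print(unchanged is local_rows, converted == local_rows)
```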

We will train a model to predict `fraud_label` given the rest of the features.

In [8]:
target = 'fraud_label'
# Keep the original column order so the feature order is the same on every run.
features = [col for col in train_df.columns if col != target]

X_train, y_train = train_df[features], train_df[target]
X_val, y_val = val_df[features], val_df[target]

Let's check the distribution of our target label.

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. We should therefore address the class imbalance. There are many approaches, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example we use the simplest method, which is to supply a class weight parameter to the learning algorithm. The class weight determines how much importance is attached to each class; in our case, more importance is placed on positive (fraudulent) samples.
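
To make the effect of class weighting concrete, the sketch below computes weights inversely proportional to class frequency, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy label list are illustrative assumptions and not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 998 legitimate transactions and 2 fraudulent ones.
toy_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy labels, each fraudulent sample counts 499 times as much as a legitimate one, which is the ratio of the two class counts.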

### Train Logistic Regression Model

We will train a simple logistic regression model on the training split and assess performance on the validation split. Here, the focus will not be on training a good model; there are many ways to try to train one that has better performance than the one shown here. The emphasis is rather on showing how to train models and track them in the model registry.

In [6]:
from sklearn.linear_model import LogisticRegression

# Weight each class inversely proportional to its frequency in the training labels.
class_weight = 'balanced'

clf = LogisticRegression(class_weight=class_weight, solver='liblinear')

<!-- Another option is to manually specify numeric weights for each label by ourselves. -->

Let's go ahead and fit the model on our training data.

In [9]:
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Next, we evaluate the model performance on the validation data. Since our dataset is imbalanced, we will use *precision* and *recall* as evaluation metrics rather than *accuracy*.

In [10]:
from sklearn.metrics import precision_score, recall_score, classification_report

preds = clf.predict(X_val)
pos_precision = precision_score(y_true=y_val, y_pred=preds)
pos_recall = recall_score(y_true=y_val, y_pred=preds)

print(classification_report(y_true=y_val, y_pred=preds))

              precision    recall  f1-score   support

           0       1.00      0.91      0.95    208437
           1       0.02      0.80      0.03       413

    accuracy                           0.90    208850
   macro avg       0.51      0.85      0.49    208850
weighted avg       1.00      0.90      0.95    208850



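We can see that the model catches most fraud cases (recall 0.80) but does so with very low precision (0.02) on the positive (fraud) class, which is the one we are most concerned about: the vast majority of transactions flagged as fraud are actually legitimate.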
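
As a sanity check on the report, the sketch below recomputes precision and recall for the fraud class from confusion-matrix counts. The counts are not printed anywhere in the notebook; they are back-calculated here from the reported support (413) and the `positive_precision` and `positive_recall` values stored with the model later on, so treat them as an assumption.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed counts, back-calculated from the reported metrics (not notebook output).
tp = 332     # fraudulent transactions flagged as fraud
fn = 81      # fraudulent transactions that were missed (413 - 332)
fp = 19779   # legitimate transactions flagged as fraud

precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.4f} recall={recall:.4f}")
```

Under these assumed counts, roughly 60 legitimate transactions are flagged for every real fraud that is caught, which is why the high recall alone paints too rosy a picture.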

### Register model

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

Let's connect to the model registry using the [HSML library](https://docs.hopsworks.ai/machine-learning-api/latest) from Hopsworks.

In [12]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


Before registering the model we will export it as a pickle file using joblib.

In [None]:
import joblib
import os

os.makedirs('tmp_model', exist_ok=True)  # does not fail if the cell is re-run
joblib.dump(clf, 'tmp_model/model.pkl')

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model. In our case, the inputs are the feature columns and the output is `fraud_label`.

A Model Schema can be automatically generated from training examples, as shown below.

In [13]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

Exported model fraud_tutorial_model with version 1


<hsml.sklearn.model.Model at 0x7f1f48189c10>

With the schema in place we can finally register our model.

In [None]:
model = mr.sklearn.create_model("fraud_tutorial_model", 
                                metrics={'positive_precision': pos_precision, 'positive_recall': pos_recall},
                                input_example=X_train,
                                model_schema=model_schema)

model.save('tmp_model')

Note that every time you save a model under the same name, a new version is created, so nothing is overwritten. This lets you compare several versions of the same model. You can also create a model with a new name if you prefer.

#### Finding the best performing model

Let's imagine you have trained and registered several versions of the same model. Now you can query the model registry for the best model according to your preferred criterion, say positive recall in our case.

The `direction` option indicates whether the metric should be maximized (`max`) or minimized (`min`); in our case it should be maximized.

In [14]:
best_model = mr.get_best_model(name="fraud_tutorial_model", metric="positive_recall", direction="max")
best_model.to_dict()

{'id': 'fraud_tutorial_model_1',
 'experimentId': None,
 'projectName': 'clean_up',
 'experimentProjectName': 'clean_up',
 'name': 'fraud_tutorial_model',
 'modelSchema': {'href': 'https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/125/dataset/Projects/clean_up/Models/fraud_tutorial_model/1/model_schema.json',
  'zip_state': 'NONE'},
 'version': 1,
 'description': 'A collection of models for fraud_tutorial_model',
 'inputExample': {'href': 'https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/125/dataset/Projects/clean_up/Models/fraud_tutorial_model/1/input_example.json',
  'zip_state': 'NONE'},
 'framework': 'SKLEARN',
 'metrics': {'positive_precision': '0.016508378499328725',
  'positive_recall': '0.8038740920096852'},
 'trainingDataset': None,
 'environment': ['/Projects/clean_up/Models/fraud_tutorial_model/1/environment.yml'],
 'program': 'Models/fraud_tutorial_model/1/program.ipynb'}
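
The versioning and `direction` semantics described above can be pictured with a toy in-memory registry. This is only an illustrative stand-in: the `ToyRegistry` class and its methods are hypothetical and do not reflect how `hsml` is implemented. Metrics are stored as strings because that is how they appear in the `to_dict()` output above, so they are converted to floats before comparison.

```python
class ToyRegistry:
    """Hypothetical in-memory registry; not the hsml implementation."""

    def __init__(self):
        self._models = {}  # model name -> list of {"version", "metrics"} dicts

    def save(self, name, metrics):
        """Store a new version under `name`; earlier versions are kept."""
        versions = self._models.setdefault(name, [])
        entry = {
            "version": len(versions) + 1,
            "metrics": {key: str(value) for key, value in metrics.items()},
        }
        versions.append(entry)
        return entry

    def get_best_model(self, name, metric, direction):
        """Return the version whose `metric` is highest ("max") or lowest ("min")."""
        if direction not in ("max", "min"):
            raise ValueError("direction must be 'max' or 'min'")
        pick = max if direction == "max" else min
        return pick(self._models[name], key=lambda m: float(m["metrics"][metric]))


registry = ToyRegistry()
# Metric values below are made up for illustration.
registry.save("fraud_tutorial_model", {"positive_recall": 0.80})
registry.save("fraud_tutorial_model", {"positive_recall": 0.85})
registry.save("fraud_tutorial_model", {"positive_recall": 0.78})

best = registry.get_best_model("fraud_tutorial_model", "positive_recall", "max")
print(best["version"], best["metrics"])
```

Saving three times under the same name yields versions 1 to 3, and asking for the maximum positive recall selects the second one in this made-up example.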

### Next Steps

In the next notebook, we'll look at hyperparameter optimization!