# Model training and registration

In this notebook we will

- Load a training dataset from the feature store
- Train a model
- Register the model in the model registry

This introduces a new library, `hsml`, which contains functionality for keeping track of models and deploying them.

In this notebook, we will train a model using standard Python and scikit-learn. It could also have been done with, for example, PySpark, TensorFlow or PyTorch.

## (1) Load training data from feature store

First, we need to connect to our project's feature store and fetch the training dataset that we created in the previous step of the tutorial. As you might remember, the feature store has mutable feature groups and immutable training datasets. Feature groups can be updated continuously, but a training dataset is "frozen" once created, including any train/validation splits.

Start by connecting to the feature store.

In [1]:
import hsfs
import pandas as pd

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


In [2]:
td = fs.get_training_dataset("transactions_dataset_splitted", version=2)
train_df = td.read('train')
val_df = td.read('validation')



Look briefly at the data to make sure everything looks all right (in particular, we should not have any categorical features left here, only numerical features).

In [None]:
train_df.head()
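
Besides eyeballing the first rows, the "only numerical features" check can be done programmatically. The following is a minimal sketch on a made-up data frame; `toy_df` and its column names are assumptions standing in for `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the column names are illustrative only.
toy_df = pd.DataFrame({
    "amount": [12.5, 300.0, 7.2],
    "age_at_transaction": [34.1, 51.0, 22.8],
    "fraud_label": [0, 0, 1],
})

# Columns whose dtype is not numeric (e.g. object/string or categorical columns).
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

assert not non_numeric_cols, f"Unexpected non-numeric columns: {non_numeric_cols}"
print(toy_df.dtypes)
```

Running the same two lines against `train_df` would raise an `AssertionError` naming any column that still needs encoding.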

## (2) Train logistic regression model

Here we will train a predictive model on the training split and assess performance on the validation split. The focus will not be on training a good model; there are many ways to try to train one that has better performance than the one shown here. The emphasis is rather on showing how to train models and track them in the model registry.

With that being said, notice that the dataset is very skewed in terms of outcome, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus we should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data or modifying the decision threshold. In this example, we'll use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class.

The `class_weight` parameter can either be set to "balanced", in which case each class is weighted inversely proportional to its frequency in the training labels, or we can specify numeric weights for each label ourselves. Let's first use the "balanced" option.
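
To make the "balanced" option concrete, here is a small sketch of the heuristic that scikit-learn documents for it, `n_samples / (n_classes * class_count)`. The helper `balanced_class_weights` and the toy labels are made up for illustration; in practice scikit-learn computes these weights internally.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (made up for illustration): 990 legitimate, 10 fraudulent.
toy_labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy labels the rare class gets a weight of 50.0 and the common class roughly 0.505. A dictionary of this shape is also what you would pass as `class_weight` when specifying numeric weights by hand.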

When we save the trained model, we'll also want to attach a performance metric to it. In this scenario with unbalanced classes, it is more useful to look at precision and recall rather than accuracy.
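
A quick way to see why accuracy is misleading here is to score two useless "models" on imbalanced labels. The sketch below uses made-up labels and a hypothetical helper `binary_metrics`; it is not part of the tutorial pipeline.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Return (accuracy, precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Convention chosen here: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels: 990 legitimate transactions, 10 fraudulent ones.
toy_true = [0] * 990 + [1] * 10

# A "model" that never flags fraud: high accuracy, but it catches nothing.
never_acc, never_prec, never_rec = binary_metrics(toy_true, [0] * 1000)

# A "model" that flags everything: perfect recall, but almost all alerts are false.
always_acc, always_prec, always_rec = binary_metrics(toy_true, [1] * 1000)

print(never_acc, never_prec, never_rec)
print(always_acc, always_prec, always_rec)
```

The never-flag model reaches 99% accuracy with zero recall, while the always-flag model has full recall with 1% precision. Precision and recall for the fraud class expose both failure modes, which accuracy alone does not.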

In [18]:
# from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score

The next cell is just a type check to make sure we have Pandas data frames. The reason for the check is that if this notebook ever gets executed as a Hopsworks Job, it may be executed with a PySpark kernel and `train_df` and `val_df` would be PySpark data frames at this point, which would not work with the `clf.fit()` function below.

In [4]:
if not isinstance(train_df, pd.DataFrame):
    train_df = train_df.toPandas()
    val_df = val_df.toPandas()

Separate the predictive features from the label to prepare for model fitting.

In [5]:
target = 'fraud_label'
# Keep the original column order; a set difference would make the order arbitrary.
features = [col for col in train_df.columns if col != target]

X_train, y_train = train_df[features], train_df[target]
X_val, y_val = val_df[features], val_df[target]

Fit the model.

In [10]:
clf = LogisticRegression(class_weight='balanced', solver='liblinear')
# clf = RandomForestClassifier(class_weight='balanced', n_estimators=500)
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Evaluate model performance on the validation data. 

A `classification_report` gives a readable per-class summary. Note that it rounds to two decimals, so a very small precision is displayed as 0.00.

In [11]:
preds = clf.predict(X_val)

print(classification_report(y_true=y_val, y_pred=preds))

              precision    recall  f1-score   support

           0       1.00      0.57      0.72    207677
           1       0.00      0.85      0.01       456

    accuracy                           0.57    208133
   macro avg       0.50      0.71      0.37    208133
weighted avg       1.00      0.57      0.72    208133



For recording model performance in registered models, we can define some useful metrics for this particular problem. 

Since we are mostly interested in the rare positive class (fraud, i.e. label 1), the precision score (# true positives / # predicted positives) for the positive class seems like a good metric. Let's also use the recall for the same class. 

In [21]:
pos_precision = precision_score(y_true=y_val, y_pred=preds, pos_label=1)
pos_recall = recall_score(y_true=y_val, y_pred=preds, pos_label=1)

print(f'Fraud class precision: {pos_precision}, recall: {pos_recall}')

Fraud class precision: 0.004303430575376412, recall: 0.8530701754385965


## (3) Register model

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

To talk to the model registry, we need to use the HSML library from Hopsworks. It should be pre-installed if you work through this tutorial in a Hopsworks Jupyter session; otherwise, it can be installed with pip.

Let's connect to the model registry.

In [22]:
import hsml

conn = hsml.connection()

mr = conn.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


To prepare for registering the model, we will export the classifier as a pickle file using joblib and then save it as a model. The model needs to be set up with a Schema, but this can be obtained automatically from training examples, as shown below.

It's important to know that every time you save a model with the same name, a new version of the model will be saved, so nothing will be overwritten. In this way, you can compare several versions of the same model - or create a model with a new name, if you prefer that.

In [33]:
import joblib
import os

from hsml.schema import Schema
from hsml.model_schema import ModelSchema

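# exist_ok=True lets the cell be re-run without failing on an existing directory.
os.makedirs('tmp_model', exist_ok=True)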
joblib.dump(clf, 'tmp_model/model.pkl')

MODEL_NAME = "fraud_tutorial_model"

input_schema = Schema(X_train)
output_schema = Schema(y_train)

model = mr.sklearn.create_model(MODEL_NAME, 
                                metrics={'positive_precision': pos_precision, 'positive_recall': pos_recall},
                                input_example=X_train,
                                model_schema=ModelSchema(input_schema=input_schema, output_schema=output_schema))

model.save('tmp_model')

Exported model fraud_tutorial_model with version 2


<hsml.sklearn.model.Model at 0x7f50a5dcce80>
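
Before uploading a pickled classifier, it can be worth checking that it survives a save/load round trip. The following is a minimal sketch on synthetic data; `X_toy`, `y_toy` and `toy_clf` are made-up stand-ins for the tutorial's training data and `clf`, and a temporary directory replaces `tmp_model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset, made up for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

toy_clf = LogisticRegression(class_weight="balanced", solver="liblinear").fit(X_toy, y_toy)

# Dump the fitted model to disk and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(toy_clf, model_path)
    restored_clf = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored_clf.predict(X_toy) == toy_clf.predict(X_toy)).all())
print(same_predictions)
```

Keep in mind that pickled scikit-learn models are generally only safe to load with the same library version that produced them, so the serving environment should match the training environment.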

### Finding the best performing model

Let's imagine you have trained and registered several versions of the same model. Now you can query the model registry for the best model according to your preferred criterion, say positive recall in our case.

The `direction` option is used to indicate if the metric should be high or low (max or min); in our case it should be high (max).

In [34]:
best_model = mr.get_best_model(name="fraud_tutorial_model", metric="positive_recall", direction="max")
best_model.to_dict()

{'id': 'fraud_tutorial_model_1',
 'experimentId': None,
 'projectName': 'fraud_tutorial',
 'experimentProjectName': 'fraud_tutorial',
 'name': 'fraud_tutorial_model',
 'modelSchema': {'href': 'https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/120/dataset/Projects/fraud_tutorial/Models/fraud_tutorial_model/1/model_schema.json',
  'zip_state': 'NONE'},
 'version': 1,
 'description': 'A collection of models for fraud_tutorial_model',
 'inputExample': {'href': 'https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/120/dataset/Projects/fraud_tutorial/Models/fraud_tutorial_model/1/input_example.json',
  'zip_state': 'NONE'},
 'framework': 'SKLEARN',
 'metrics': {'positive_precision': '0.004303430575376412',
  'positive_recall': '0.8530701754385965'},
 'trainingDataset': None,
 'environment': ['/Projects/fraud_tutorial/Models/fraud_tutorial_model/1/environment.yml'],
 'program': 'Models/fraud_tutorial_model/1/program.ipynb'}
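
Conceptually, a best-model query boils down to picking the version whose metric is largest or smallest. The sketch below illustrates that selection logic in plain Python; it is not how HSML implements it, and `pick_best` and the registry contents are hypothetical. The metrics are stored as strings here to mirror the `to_dict()` output above.

```python
def pick_best(versions, metric, direction="max"):
    """Return the entry whose `metric` is largest ("max") or smallest ("min")."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    chooser = max if direction == "max" else min
    # Metrics are converted with float() because they may be stored as strings.
    return chooser(versions, key=lambda v: float(v["metrics"][metric]))

# Hypothetical registry contents; the numbers are invented for illustration.
toy_versions = [
    {"version": 1, "metrics": {"positive_recall": "0.85", "positive_precision": "0.004"}},
    {"version": 2, "metrics": {"positive_recall": "0.62", "positive_precision": "0.110"}},
    {"version": 3, "metrics": {"positive_recall": "0.79", "positive_precision": "0.020"}},
]

best_by_recall = pick_best(toy_versions, "positive_recall", direction="max")
best_by_precision = pick_best(toy_versions, "positive_precision", direction="max")
print(best_by_recall["version"], best_by_precision["version"])
```

In this toy registry the choice of metric changes the winner: version 1 has the best recall, version 2 the best precision. That is why it pays to decide on the selection criterion before comparing versions.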

## Next chapter

In the next chapter, we'll look at hyperparameter optimization!