In this notebook you will train a model on the dataset you created in the previous tutorial. You will train the model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch. You will also perform some of the exploration that can be done in Hopsworks, notably the search functions and the lineage.

## This notebook is divided into 3 main sections:
1. **Load the training data**
2. **Train the model**
3. **Explore feature groups and views** via the UI.

In [1]:
!pip install -U hopsworks --quiet


In [4]:
import hopsworks

project = hopsworks.login(api_key_value="<api key>")

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/28868
Connected. Call `.close()` to terminate connection gracefully.


---
## Load Training Data

First, you will need to fetch the training dataset that you created in the previous notebook.

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.filterwarnings('ignore')

# Load data.
feature_view = fs.get_feature_view(name = 'churn_feature_view', version = 1)

X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(
    training_dataset_version = 1
)

X_train.drop('customerid', axis = 1, inplace = True)
X_val.drop('customerid', axis = 1, inplace = True)
X_test.drop('customerid', axis = 1, inplace = True)

In [8]:
X_train.head()

Unnamed: 0,contract,tenure,paymentmethod,paperlessbilling,monthlycharges,totalcharges,gender,seniorcitizen,dependents,partner,deviceprotection,onlinebackup,onlinesecurity,internetservice,multiplelines,phoneservice,techsupport,streamingmovies,streamingtv
0,0,0.013889,0,0,0.016418,0.002291,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0.013889,0,0,0.016915,0.002297,1,0,1,1,0,0,0,0,0,0,0,0,0
2,0,0.013889,0,0,0.026866,0.002412,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0.013889,0,0,0.121891,0.003512,0,0,1,1,1,1,1,1,1,1,1,1,1
4,0,0.013889,0,0,0.257711,0.005084,0,0,0,0,1,1,2,1,0,0,1,1,1


In [9]:
y_train.head()

Unnamed: 0,churn
0,0
1,1
2,0
3,1
4,0


In [10]:
y_train.value_counts(normalize=True)

churn
0        0.733077
1        0.266923
dtype: float64

Notice that the distribution is skewed, which is good news for the company, since customers at risk of churning make up the smaller part of the customer base. However, as a data scientist you should address this class imbalance. There are many approaches, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, you will use the simplest method, which is to supply a class weight parameter to the learning algorithm. The class weight determines how much importance is attached to each class, which in this case means that higher importance is placed on positive (churn) samples.

Next, you will train a model and assign the larger class weight to the positive class.

In [11]:
# Train model.
pos_class_weight = 0.9

clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')

clf.fit(X_train, y_train)
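
For comparison with the hand-picked weights above, the sketch below computes the "balanced" heuristic, which weights each class by 1 / (n_classes × class fraction); this is the rule scikit-learn applies for `class_weight='balanced'`. The class fractions are copied from the `value_counts` output earlier, and the helper name `balanced_class_weights` is hypothetical, introduced only for illustration.

```python
def balanced_class_weights(class_fractions):
    """Weight each class by 1 / (n_classes * fraction of samples in that class)."""
    n_classes = len(class_fractions)
    return {
        label: 1.0 / (n_classes * fraction)
        for label, fraction in class_fractions.items()
    }


# Class fractions copied from y_train.value_counts(normalize=True).
fractions = {0: 0.733077, 1: 0.266923}

balanced = balanced_class_weights(fractions)

# Manual weights mirroring the training cell (pos_class_weight = 0.9).
manual = {0: 1.0 - 0.9, 1: 0.9}

# How many times more a churn sample counts than a non-churn sample.
balanced_ratio = balanced[1] / balanced[0]
manual_ratio = manual[1] / manual[0]

print(f"balanced weights: {balanced}")
print(f"balanced ratio (churn : no churn): {balanced_ratio:.2f}")
print(f"manual ratio   (churn : no churn): {manual_ratio:.2f}")
```

The manual weights put roughly three times more relative emphasis on churn samples than the balanced heuristic would, which tends to push the model toward higher recall on the churn class at the cost of precision.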

Let's see how well it performs on our validation data.

In [12]:
from sklearn.metrics import precision_recall_fscore_support, classification_report

preds = clf.predict(X_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    'precision': precision,
    'recall': recall,
    'fscore': fscore
}

print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.96      0.47      0.63      1039
           1       0.38      0.94      0.54       357

    accuracy                           0.59      1396
   macro avg       0.67      0.70      0.58      1396
weighted avg       0.81      0.59      0.61      1396
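
To make the report easier to read, the sketch below recomputes a few of its summary figures from the per-class rows. It uses the rounded values printed above, so it only reproduces the report to two decimals; the names `report_rows` and `f1_from` are hypothetical and chosen so that they do not overwrite variables from the notebook.

```python
# Per-class rows copied (rounded) from the classification report above.
report_rows = {
    0: {"precision": 0.96, "recall": 0.47, "support": 1039},
    1: {"precision": 0.38, "recall": 0.94, "support": 357},
}


def f1_from(precision_value, recall_value):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision_value * recall_value / (precision_value + recall_value)


total_support = sum(row["support"] for row in report_rows.values())

# Per-class F1 scores.
f1_by_class = {
    label: f1_from(row["precision"], row["recall"])
    for label, row in report_rows.items()
}

# Macro average: unweighted mean over classes.
macro_precision = sum(row["precision"] for row in report_rows.values()) / len(report_rows)

# Weighted average: mean over classes, weighted by support.
weighted_precision = sum(
    row["precision"] * row["support"] for row in report_rows.values()
) / total_support

# Support-weighted recall equals overall accuracy.
weighted_recall = sum(
    row["recall"] * row["support"] for row in report_rows.values()
) / total_support

print({label: round(value, 2) for label, value in f1_by_class.items()})
print(round(macro_precision, 2), round(weighted_precision, 2), round(weighted_recall, 2))
```

The churn class (1) reaches high recall but low precision: most churners are caught, but many customers flagged as churners do not actually churn.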



---
## Model Registry

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance.


In [13]:
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


The model needs to be set up with a Model Schema, which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [14]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

{'input_schema': {'columnar_schema': [{'name': 'contract', 'type': 'int64'},
   {'name': 'tenure', 'type': 'float64'},
   {'name': 'paymentmethod', 'type': 'int64'},
   {'name': 'paperlessbilling', 'type': 'int64'},
   {'name': 'monthlycharges', 'type': 'float64'},
   {'name': 'totalcharges', 'type': 'float64'},
   {'name': 'gender', 'type': 'int64'},
   {'name': 'seniorcitizen', 'type': 'int64'},
   {'name': 'dependents', 'type': 'int64'},
   {'name': 'partner', 'type': 'int64'},
   {'name': 'deviceprotection', 'type': 'int64'},
   {'name': 'onlinebackup', 'type': 'int64'},
   {'name': 'onlinesecurity', 'type': 'int64'},
   {'name': 'internetservice', 'type': 'int64'},
   {'name': 'multiplelines', 'type': 'int64'},
   {'name': 'phoneservice', 'type': 'int64'},
   {'name': 'techsupport', 'type': 'int64'},
   {'name': 'streamingmovies', 'type': 'int64'},
   {'name': 'streamingtv', 'type': 'int64'}]},
 'output_schema': {'columnar_schema': [{'name': 'churn', 'type': 'int64'}]}}

In [15]:
import joblib

pkl_file_name = "churnmodel.pkl"

joblib.dump(clf, pkl_file_name)

model = mr.sklearn.create_model(
    name="churnmodel",
    metrics=metrics,  # validation metrics computed above, so model versions can be compared
    description="Churn Model",
    input_example=X_train.sample().to_numpy(),
    model_schema=model_schema
)

model.save(pkl_file_name)


Model created, explore it at https://c.app.hopsworks.ai:443/p/28868/models/churnmodel/1


Model(name: 'churnmodel', version: 1)

---

## Fetch and test the model

Finally, you can start making predictions with your model! To identify customers at risk of churn, let's retrieve your churn prediction model from the Hopsworks model registry.


In [16]:
import os

model = mr.get_model("churnmodel", version = 1)

# Download the model artifacts and load the pickled estimator.
model_dir = model.download()
model = joblib.load(os.path.join(model_dir, "churnmodel.pkl"))

---
## Use trained model to identify customers at risk of churn


In [17]:
def transform_preds(predictions):
    return ['Churn' if pred == 1 else 'Not Churn' for pred in predictions]

In [18]:
batch_data = feature_view.get_batch_data()

batch_data.head()

Unnamed: 0,contract,tenure,paymentmethod,paperlessbilling,monthlycharges,totalcharges,gender,seniorcitizen,dependents,partner,customerid,deviceprotection,onlinebackup,onlinesecurity,internetservice,multiplelines,phoneservice,techsupport,streamingmovies,streamingtv
0,0,0.444444,0,0,0.429353,0.214703,0,0,1,1,5061-PBXFW,1,2,1,1,0,0,2,1,1
1,0,0.333333,1,1,0.568657,0.201254,0,0,0,1,8155-IBNHG,1,1,2,2,2,0,1,1,1
2,2,0.694444,1,1,0.5199,0.401466,1,0,1,1,0263-FJTQO,1,1,1,1,2,0,2,1,2
3,2,0.638889,0,0,0.0199,0.102518,0,0,1,1,3190-ITQXP,0,0,0,0,0,0,0,0,0
4,1,0.777778,2,1,0.793532,0.606876,0,0,0,1,6284-KMNUF,1,1,1,2,2,0,1,2,2


Let's make predictions for all customer data and then visualize them.

In [19]:
batch_data.drop('customerid',axis = 1, inplace = True)

predictions = model.predict(batch_data)
predictions = transform_preds(predictions)
predictions[:5]

['Not Churn', 'Churn', 'Not Churn', 'Not Churn', 'Churn']
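
As a quick text summary of predictions like these, the labels can be tallied with `collections.Counter`. The sketch below is a minimal illustration that uses only the five labels printed above; `sample_predictions` is a hypothetical name, and in the notebook you would pass the full `predictions` list instead.

```python
from collections import Counter

# The first five predicted labels, copied from the output above.
sample_predictions = ['Not Churn', 'Churn', 'Not Churn', 'Not Churn', 'Churn']

# Count how often each label occurs.
label_counts = Counter(sample_predictions)
total = sum(label_counts.values())

# Share of each label among all predictions.
label_shares = {label: count / total for label, count in label_counts.items()}

for label, count in label_counts.most_common():
    print(f"{label}: {count} ({label_shares[label]:.0%})")
```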