# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Model training & UI Exploration</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/churn/3_model_training.ipynb)


**Note**: You may see an error when installing hopsworks on Colab; it is safe to ignore.


In this notebook you will train a model on the dataset you created in the previous tutorial. You will train the model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch. You will also perform some of the exploration that can be done in Hopsworks, notably the search functions and the lineage.

## 🗒️ This notebook is divided into 3 main sections:
1. **Loading the training data**
2. **Training the model**
3. **Exploring feature groups and views** via the UI.

In [None]:
!pip install -U hopsworks --quiet

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

---
## <span style="color:#ff5f27;"> ✨ Load Training Data </span>

First, you will need to fetch the training dataset that you created in the previous notebook.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.filterwarnings('ignore')

# Load data.
feature_view = fs.get_feature_view(
    name = 'churn_feature_view',
    version = 1
)

X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(
    training_dataset_version = 1
)

X_train.drop('customerid', axis = 1, inplace = True)
X_val.drop('customerid', axis = 1, inplace = True)
X_test.drop('customerid', axis = 1, inplace = True)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is skewed, which is good news for the company, since customers at risk of churning make up a smaller part of the customer base. However, as a data scientist you should address this class imbalance. There are many approaches, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, you will use the simplest method, which is to supply a class weight parameter to the learning algorithm. The class weight affects how much importance is attached to each class, which in this case means that higher importance will be placed on positive (churn) samples.
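
As a rough sketch of how class weights could be derived from the label distribution rather than picked by hand, the snippet below computes inverse-frequency weights. The helper name `balanced_class_weights` and the toy labels are hypothetical; the formula, `n_samples / (n_classes * class_count)`, is the same heuristic scikit-learn applies when `class_weight='balanced'` is used.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (an assumption for illustration): 8 non-churners (0) and 2 churners (1).
toy_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_labels)

# The rarer class receives the larger weight.
print(weights)
```

A dictionary of this shape can be passed as the `class_weight` argument of `LogisticRegression`, in the same way as the hand-picked weights used in the next section.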

---
## <span style="color:#ff5f27;"> 🏃 Train Model</span>

Next, you will train a model and assign a larger class weight to the positive class.

In [None]:
# Train model.
pos_class_weight = 0.9

clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')

clf.fit(X_train, y_train)

Let's see how well it performs on our validation data.

In [None]:
from sklearn.metrics import precision_recall_fscore_support, classification_report

preds = clf.predict(X_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    'precision': precision,
    'recall': recall,
    'fscore': fscore
}

print(classification_report(y_val, preds))
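
To make the reported metrics concrete, here is a minimal sketch that computes precision, recall, and F1 for the positive class directly from prediction counts. The helper `binary_metrics` and the toy labels are hypothetical and only illustrate the definitions; in the notebook itself, `precision_recall_fscore_support` does this work.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example (assumed values): 3 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

toy_precision, toy_recall, toy_f1 = binary_metrics(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

With an imbalanced target such as churn, recall on the positive class is often the number to watch, since a missed churner is usually more costly than a false alarm.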

---
## <span style="color:#ff5f27;"> Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance.


In [None]:
mr = project.get_model_registry()

The model needs to be set up with a Model Schema, which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

In [None]:
import joblib

pkl_file_name = "churnmodel.pkl"

joblib.dump(clf, pkl_file_name)

model = mr.sklearn.create_model(
    name="churnmodel",
    description = "Churn Model",
    input_example = X_train.sample().to_numpy(),
    model_schema = model_schema
)

model.save(pkl_file_name)

---

## <span style='color:#ff5f27'>🚀 Fetch and test the model</span>

Finally, you can start making predictions with your model! To identify customers at risk of churn, let's retrieve your churn prediction model from the Hopsworks model registry.


In [None]:
model = mr.get_model("churnmodel", version = 1)

model_dir = model.download()
model = joblib.load(model_dir + "/churnmodel.pkl")

---
## <span style="color:#ff5f27;">🔮  Use trained model to identify customers at risk of churn </span>


In [None]:
def transform_preds(predictions):
    return ['Churn' if pred == 1 else 'Not Churn' for pred in predictions]

In [None]:
batch_data = feature_view.get_batch_data()

batch_data.head()

Let's predict churn for all customers and then visualize the predictions.

In [None]:
batch_data.drop('customerid',axis = 1, inplace = True)

predictions = model.predict(batch_data)
predictions = transform_preds(predictions)
predictions[:5]

---
## <span style="color:#ff5f27;">👨🏻‍🎨 Prediction Visualisation</span>

Now you have your predictions, but you would also like to explain them in order to make informed decisions. Let's visualise them and look at the important features that influence the risk of churning.

In [None]:
import inspect

# Recall that you applied transformation functions, such as the min-max scaler and label encoder.
# Now you want to transform the features back to a human-readable format.
df_all = batch_data.copy()
td_transformation_functions = feature_view._batch_scoring_server._transformation_functions
for feature_name in td_transformation_functions:
    td_transformation_function = td_transformation_functions[feature_name]
    sig = inspect.signature(td_transformation_function.transformation_fn)
    # Collect every parameter of the transformation function that has a default value.
    param_dict = {
        param.name: param.default
        for param in sig.parameters.values()
        if param.default is not inspect.Parameter.empty
    }
    if td_transformation_function.name == "label_encoder":
        # Invert the value -> index mapping to recover the original labels.
        rev_dict = {v: k for k, v in param_dict["value_to_index"].items()}
        df_all[feature_name] = df_all[feature_name].map(lambda x: rev_dict[x])
    if td_transformation_function.name == "min_max_scaler":
        # Undo min-max scaling: x * (max - min) + min.
        df_all[feature_name] = df_all[feature_name].map(lambda x: x*(param_dict["max_value"]-param_dict["min_value"])+param_dict["min_value"])

df_all['Churn'] = predictions
df_all.head()
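
The two inverse transformations used above can be sketched in isolation as follows. The function names, the scaling bounds, and the `value_to_index` mapping are hypothetical values chosen for illustration; they are not taken from the actual feature view.

```python
def inverse_min_max(scaled, min_value, max_value):
    """Undo min-max scaling: map a value in [0, 1] back to [min_value, max_value]."""
    return scaled * (max_value - min_value) + min_value


def inverse_label_encode(index, value_to_index):
    """Undo label encoding by inverting the value -> index mapping."""
    index_to_value = {idx: value for value, idx in value_to_index.items()}
    return index_to_value[index]


# Assumed scaling bounds for a charge-like numeric feature.
original_charge = inverse_min_max(0.5, min_value=18.25, max_value=118.75)

# Assumed encoding for a categorical feature.
value_to_index = {"DSL": 0, "Fiber optic": 1, "No": 2}
original_category = inverse_label_encode(1, value_to_index)

print(original_charge, original_category)
```

Applied column by column with `Series.map`, these functions do the same job as the two `if` branches in the loop above.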

Let's plot the feature importance.

In [None]:
import math

feature_names = batch_data.columns

feature_importance = pd.DataFrame(feature_names, columns = ["feature"])
feature_importance["importance"] = pow(math.e, model.coef_[0])
feature_importance = feature_importance.sort_values(by = ["importance"], ascending = False)

plt.figure(figsize = (16,6)) 
sns.barplot(x = 'feature', y = 'importance', data = feature_importance)

plt.title('Feature Importance Plot', fontsize = 20)
plt.xticks(rotation = 20)
plt.xlabel('Feature', fontsize = 13)
plt.ylabel('Importance', fontsize = 13)

plt.show()    

As you can see `internetservice` has the biggest effect on the risk of churn.
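
The importance plotted above is the exponential of each logistic regression coefficient, which can be read as an odds ratio. The minimal sketch below, using a hypothetical intercept and coefficient for a single feature, shows why: increasing the feature by one unit multiplies the odds of the positive class by `exp(coef)`.

```python
import math

# Hypothetical intercept and coefficient for a single feature (illustration only).
intercept = -1.0
coef = 0.7


def odds(x):
    """Odds of the positive class, p / (1 - p), under a logistic model."""
    log_odds = intercept + coef * x
    return math.exp(log_odds)


# Moving the feature from 0 to 1 multiplies the odds by exp(coef),
# regardless of the intercept.
odds_ratio = odds(1) / odds(0)
print(odds_ratio, math.exp(coef))
```

An odds ratio above 1 raises the odds of churn and one below 1 lowers them, so when comparing the strength of effects in both directions it can help to also look at the absolute value of the coefficients.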

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'internetservice',
    hue = 'Churn'
)

plt.title('Churn rate according to internet service subscription', fontsize = 20)
plt.xlabel("internetservice", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

Let's visualise a couple more important features, such as `streamingtv` and `streamingmovies`.

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'streamingtv',
    hue = 'Churn'
)

plt.title('Churn rate according to streaming TV subscription', fontsize = 20)
plt.xlabel("streamingtv", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'streamingmovies',
    hue = 'Churn'
)

plt.title('Churn rate according to streaming movies service subscription', fontsize = 20)
plt.xlabel("streamingmovies", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'gender',
    hue = 'Churn'
)

plt.title('Churn rate according to Gender', fontsize = 20)
plt.xlabel("Gender", fontsize = 13)
plt.ylabel('Count', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.histplot(
    data = df_all,
    x = 'totalcharges',
    hue = 'Churn'
)

plt.title('Distribution of total charges by churn prediction', fontsize = 20)
plt.xlabel("Total charges", fontsize = 13)
plt.ylabel('Count', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'paymentmethod',
    hue = 'Churn'
)

plt.title('Number of customers per payment method', fontsize = 20)
plt.xlabel("Payment Method", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'partner',
    hue = 'Churn'
)

plt.title('Effect of having a partner on churn', fontsize = 20)
plt.xlabel("Has a partner", fontsize = 13)
plt.ylabel('Count', fontsize = 13)

plt.show()

---
## <span style="color:#ff5f27;">🧑🏻‍🔬 Streamlit App </span>

If you want an **interactive dashboard**, you can use a Streamlit app.

Use the following commands in a terminal to run the Streamlit app:

> `cd {%path_to_hopsworks_tutorials%}/`  <br>
> `conda activate ./miniconda/envs/hopsworks` <br>
> `python -m streamlit run churn/streamlit_app.py`<br>

**⚠️** If you are running on Colab, you will need to follow a different procedure, as highlighted in this [notebook](https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Create_streamlit_app.ipynb).

---
## <span style="color:#ff5f27;"> 👓  Exploration</span>
In the Hopsworks feature store, the metadata allows for multiple levels of exploration and review. Here you will see a few of those capabilities.

### <span style="color:#ff5f27;">🔎 <b>Search</b></span> 
Using the search function in the UI, you can query any aspect of the feature groups, feature views, and training data that were previously created.

### <span style="color:#ff5f27;">📊 <b>Statistics</b> </span>
You can also enable statistics for one or all of the feature groups.

In [None]:
customer_info_fg = fs.get_feature_group("customer_info", version = 1)
customer_info_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

customer_info_fg.update_statistics_config()
customer_info_fg.compute_statistics()

![fg-statistics](../churn/images/churn_statistics.gif)


### <span style="color:#ff5f27;">⛓️ <b> Lineage </b> </span>
For every feature group and feature view, you can inspect the relationships between these abstractions: which feature group produced which training dataset, and which model uses that dataset.
This gives a clear understanding of how each element fits into the pipeline.

---

### <span style="color:#ff5f27;">🥳 <b> Next Steps  </b> </span>
Congratulations, you've now completed the churn risk prediction tutorial for Managed Hopsworks.

Check out our other tutorials on ➡ https://github.com/logicalclocks/hopsworks-tutorials

Or documentation at ➡ https://docs.hopsworks.ai