# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Training Dataset and Modeling</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/churn/2_training_dataset_and_modeling.ipynb)

This is the second part of the quick start series of tutorials about predicting customers that are at risk of churning with the Hopsworks Feature Store.

This notebook explains how to read from a feature group and create a training dataset within the feature store.

You will train the model using XGBoost, although it could just as well be trained with other machine learning frameworks such as Scikit-learn, PySpark, TensorFlow, and PyTorch. You will also perform some of the exploration that can be done in Hopsworks, notably the search functions and the lineage.

## 🗒️ This notebook is divided into the following sections:
1. Select the features you want to train the model on.
2. Preprocess the features.
3. Create a dataset split for training and validation data.
4. Load the training data.
5. Train the model.
6. Explore feature groups and views via the UI.

![tutorial-flow](../images/03_model.png)

In [None]:
import joblib
import os
from PIL import Image

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import confusion_matrix
from xgboost import plot_importance

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

---
## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [None]:
# Load feature groups
customer_info_fg = fs.get_feature_group(
    name="customer_info",
    version=1
)

demography_fg = fs.get_feature_group(
    name="customer_demography_info",
    version=1
)

subscriptions_fg = fs.get_feature_group(
    name="customer_subscription_info",
    version=1
)

In [None]:
# Select features for training data
query = (
    customer_info_fg.select_except(["customerid"])
    .join(demography_fg.select_except(["customerid"]))
    .join(subscriptions_fg.select_all())
)

# uncomment this if you would like to view query result
# query.show(5)

Recall that you created three feature groups in the previous notebook. If you had created multiple feature groups with identical schemas and wanted to include them in the join, you would need to pass a `prefix` argument to the join to avoid feature name clashes. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
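
As a rough sketch of what such a prefixed join could look like, the helper below wraps the `join` call with its `prefix` argument. The helper name `join_with_prefix` is hypothetical and only serves as an illustration; it assumes both arguments are feature groups that expose `select_all()`.

```python
def join_with_prefix(left_fg, right_fg, prefix):
    """Join two feature groups, prefixing the right-hand feature names.

    `left_fg` and `right_fg` are assumed to be Hopsworks feature groups
    (any object exposing `select_all()` works for this sketch).
    """
    # Passing `prefix` to `join` is what avoids the clash between
    # identically named features coming from the two feature groups.
    return left_fg.select_all().join(right_fg.select_all(), prefix=prefix)
```

For example, `join_with_prefix(customer_info_fg, other_fg, prefix="other_")` would build such a query, where `other_fg` stands for a second feature group with the same schema (not part of this tutorial).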

---
## <span style="color:#ff5f27;">🤖 Transformation Functions </span>

You will preprocess the data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, you simply define a mapping between features and transformation functions. With this approach, transformation functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), so there is no data leakage.

In [None]:
# Load transformation functions
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

numerical_features = ["tenure", "monthlycharges", "totalcharges"]
categorical_features = [
    "multiplelines", "internetservice", "onlinesecurity", "onlinebackup",
    "deviceprotection", "techsupport", "streamingmovies", "streamingtv",
    "phoneservice", "paperlessbilling", "contract", "paymentmethod", "gender", 
    "dependents", "partner"
]

# Map features to transformations
transformation_functions = {}
for feature in numerical_features:
    transformation_functions[feature] = min_max_scaler

for feature in categorical_features:
    transformation_functions[feature] = label_encoder
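
To make the "fit on training data only" idea concrete, here is a small plain-Python sketch. It illustrates the concept and is not Hopsworks' implementation; the helper functions and the toy values are invented for this example.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)


def min_max_scale(value, min_value, max_value):
    """Scale a value using statistics learned on the training split."""
    return (value - min_value) / (max_value - min_value)


def fit_label_encoder(values):
    """Map each distinct training category to an integer index."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


# Toy data, invented for illustration.
train_tenure = [1, 12, 24, 72]
test_tenure = [6, 80]
train_contract = ["Month-to-month", "One year", "Two year", "One year"]

# Both transformations are fitted on the training split only.
min_value, max_value = fit_min_max(train_tenure)
value_to_index = fit_label_encoder(train_contract)

# The same fitted statistics are then applied to every split.
scaled_train = [min_max_scale(v, min_value, max_value) for v in train_tenure]
scaled_test = [min_max_scale(v, min_value, max_value) for v in test_tenure]
encoded_train = [value_to_index[c] for c in train_contract]
```

Note that a test value larger than the training maximum (80 here) is scaled to a value above 1. That is expected: the test split must not influence the fitted statistics.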

---
## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A feature view defines the schema of your model's input data in the form of a query (with optional filters), the target feature/label, and any additional transformation functions.
To create a feature view, you can use `fs.get_or_create_feature_view()`.

In [None]:
feature_view = fs.get_or_create_feature_view(
    name='churn_feature_view',
    version=1,
    labels=["churn"],
    transformation_functions=transformation_functions,
    query=query,
)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, a training dataset is a query whose projection (set of features) is determined by the parent feature view, with an optional on-disk snapshot of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

A dataset with train/test splits can be created using the `feature_view.create_train_test_split()` method.
A dataset with train/validation/test splits can be created using the `feature_view.create_train_validation_test_split()` method.

**You can use event time filters such as `train_start`, `train_end`, `validation_start`, `validation_end`... Values can be given as a Unix timestamp, a string, or a datetime object.**

**Or, use the `validation_size` and `test_size` parameters.**

In [None]:
td_version, td_job = feature_view.create_train_validation_test_split(
    description='churn_training_dataset_random_splitted',
    data_format='csv',
    validation_size=0.2,
    test_size=0.1,
    write_options={'wait_for_job': True},
    coalesce=True,
)

The feature view and training dataset are now visible in the UI.

![fv-overview](../churn/images/churn_tutofv.gif)
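
The `validation_size` and `test_size` fractions can be pictured as cuts on a shuffled list of rows. The following is only a plain-Python sketch of that idea, not how Hopsworks implements the split; the `random_split` helper and its `seed` argument are hypothetical.

```python
import random


def random_split(rows, validation_size, test_size, seed=42):
    """Shuffle rows and cut them into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_validation = round(len(shuffled) * validation_size)
    n_test = round(len(shuffled) * test_size)
    n_train = len(shuffled) - n_validation - n_test

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# 1000 toy row ids, split with the same fractions as in this notebook.
train, validation, test = random_split(range(1000), validation_size=0.2, test_size=0.1)
```

With these fractions, 70% of the rows end up in the training split, and every row belongs to exactly one split.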

---
## <span style="color:#ff5f27;"> ✨ Load Training Data </span>

Now you need to fetch the training dataset that you created.

In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(
    training_dataset_version=td_version
)

X_train.drop('customerid', axis=1, inplace=True)
X_val.drop('customerid', axis=1, inplace=True)
X_test.drop('customerid', axis=1, inplace=True)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is skewed, which is good news for the company, since customers at risk of churning make up the smaller part of the customer base. However, as a data scientist you should address this class imbalance. There are many approaches, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, you will use the simplest method, which is to supply a class weight parameter to the learning algorithm. The class weight determines how much importance is attached to each class, which in this case means that higher importance is placed on positive (churn) samples.
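
One common heuristic for choosing the positive class weight (for instance, for XGBoost's `scale_pos_weight` parameter) is the ratio of negative to positive samples. The sketch below shows that calculation on toy labels; the helper name is hypothetical, and it assumes the labels are encoded as 0 (no churn) and 1 (churn).

```python
def estimate_scale_pos_weight(labels, positive_label=1):
    """Return the ratio of negative to positive samples."""
    n_positive = sum(1 for label in labels if label == positive_label)
    n_negative = len(labels) - n_positive
    if n_positive == 0:
        raise ValueError("no positive samples in labels")
    return n_negative / n_positive


# Toy labels, invented for illustration: 3 churners among 12 customers.
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
weight = estimate_scale_pos_weight(toy_labels)  # 9 negatives / 3 positives
```

On real data you could pass `y_train` values to the same helper and treat the result as a starting point to tune, rather than a final setting.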

---
## <span style="color:#ff5f27;"> 🏃 Train Model</span>

Next, you will train a model and set a larger class weight for the positive class.

In [None]:
import xgboost as xgb

classifier = xgb.XGBClassifier(scale_pos_weight=3)

classifier.fit(X_train,y_train)

---
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Model Evaluation</span>

### <span style="color:#ff5f27;"> 📝 Confusion Matrix</span>

In [None]:
conf_matrix = confusion_matrix(y_test, classifier.predict(X_test)).astype(int)

df_cm = pd.DataFrame(conf_matrix, ['Non Churn', 'Churn'],
                     ['Non Churn', 'Churn'])

figure_cm = plt.figure(figsize = (10,7))
figure_cm = sns.heatmap(df_cm, annot=True, annot_kws={"size": 14}, fmt='.10g')

plt.title('Confusion Matrix',fontsize=17)
plt.show()
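
The confusion matrix can also be summarised as precision and recall for the churn class. This is a minimal sketch on a toy matrix, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, index 1 is the positive class); the helper name and the numbers are invented for illustration.

```python
def precision_recall(conf_matrix):
    """Compute precision and recall for the positive (churn) class.

    Assumes rows are true labels, columns are predicted labels,
    and index 1 is the positive class.
    """
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy matrix, invented for illustration.
toy_matrix = [[80, 20], [5, 15]]
precision, recall = precision_recall(toy_matrix)
```

For churn prediction, recall (the share of actual churners that the model catches) is often the number to watch, since a missed churner is usually more costly than a false alarm.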

---
## <span style="color:#ff5f27;">🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance.

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

In [None]:
model_dir="churn_model"
os.makedirs(model_dir, exist_ok=True)

pkl_file_name = model_dir + '/churnmodel.pkl'

joblib.dump(classifier, pkl_file_name)

figure_cm.figure.savefig(model_dir + '/confusion_matrix.png')

model = mr.python.create_model(
    name="churnmodel",
    description = "Churn Model",
    input_example = X_train.sample(),
    model_schema = model_schema
)

model.save(model_dir)

---

## <span style='color:#ff5f27'>🚀 Fetch and test the model</span>

Finally, you can start making predictions with your model! To identify customers at risk of churn, let's retrieve your churn prediction model from the Hopsworks model registry.

In [None]:
retrieved_model = mr.get_model(
    name="churnmodel",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
retrieved_xgboost_model = joblib.load(saved_model_dir + "/churnmodel.pkl")
retrieved_xgboost_model

---
## <span style="color:#ff5f27;">🔮  Use trained model to identify customers at risk of churn </span>


In [None]:
def transform_preds(predictions):
    return ['Churn' if pred == 1 else 'Not Churn' for pred in predictions]

In [None]:
feature_view.init_batch_scoring(td_version)

batch_data = feature_view.get_batch_data()

batch_data.head()

Let's predict churn for all customers and then visualize the predictions.

In [None]:
batch_data.drop('customerid',axis = 1, inplace = True)

predictions = retrieved_xgboost_model.predict(batch_data)
predictions = transform_preds(predictions)
predictions[:5]
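
Modifying the decision threshold is another way to trade precision against recall. The sketch below labels customers from churn probabilities with an adjustable cut-off; the helper name, the toy probabilities, and the threshold values are assumptions made for illustration.

```python
def label_with_threshold(churn_probabilities, threshold=0.5):
    """Label a customer 'Churn' when the churn probability reaches the threshold."""
    return ['Churn' if p >= threshold else 'Not Churn' for p in churn_probabilities]


# Toy probabilities, invented for illustration.
toy_probabilities = [0.10, 0.35, 0.62, 0.48]

default_labels = label_with_threshold(toy_probabilities)
cautious_labels = label_with_threshold(toy_probabilities, threshold=0.3)
```

With the model from this notebook, the probabilities could be obtained with something like `retrieved_xgboost_model.predict_proba(batch_data)[:, 1]`. Lowering the threshold flags more customers as at risk, at the cost of more false alarms.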

---
## <span style="color:#ff5f27;">👨🏻‍🎨 Prediction Visualisation</span>

Now you have your predictions, but you would also like to explain them in order to make informed decisions. Let's visualise the predictions and examine the important features that influence the risk of churning.

In [None]:
import inspect

# Recall that you applied transformation functions, such as the min-max scaler and the label encoder.
# Now you want to transform the features back to a human-readable format.
# Note: `_batch_scoring_server` is a private attribute and may change between Hopsworks versions.
df_all = batch_data.copy()
td_transformation_functions = feature_view._batch_scoring_server._transformation_functions
for feature_name in td_transformation_functions:
    td_transformation_function = td_transformation_functions[feature_name]
    sig = inspect.signature(td_transformation_function.transformation_fn)
    # Read the fitted parameters from the default argument values of the transformation function
    param_dict = {
        param.name: param.default
        for param in sig.parameters.values()
        if param.default is not inspect.Parameter.empty
    }
    if td_transformation_function.name == "label_encoder":
        # Invert the category -> index mapping
        rev_dict = {v: k for k, v in param_dict["value_to_index"].items()}
        df_all[feature_name] = df_all[feature_name].map(lambda x: rev_dict[x])
    if td_transformation_function.name == "min_max_scaler":
        # Undo the scaling: x * (max - min) + min
        min_value, max_value = param_dict["min_value"], param_dict["max_value"]
        df_all[feature_name] = df_all[feature_name] * (max_value - min_value) + min_value

df_all['Churn'] = predictions
df_all.head()
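
The two inverse transformations can be checked in isolation with a small round trip. This is a self-contained sketch with invented values; the helper names are hypothetical and the parameter names simply mirror `min_value`, `max_value`, and `value_to_index`.

```python
def inverse_min_max(scaled_value, min_value, max_value):
    """Undo min-max scaling: map a scaled value back to the original range."""
    return scaled_value * (max_value - min_value) + min_value


def inverse_label_encoding(index, value_to_index):
    """Undo label encoding by looking up the category for an index."""
    index_to_value = {i: v for v, i in value_to_index.items()}
    return index_to_value[index]


# Round trip on toy values, invented for illustration.
original = 29.0
scaled = (original - 1.0) / (72.0 - 1.0)
restored = inverse_min_max(scaled, min_value=1.0, max_value=72.0)
category = inverse_label_encoding(1, {"No": 0, "Yes": 1})
```

The restored value matches the original up to floating-point rounding, and the index 1 maps back to the category `"Yes"`.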

Let's plot the feature importance.

In [None]:
figure_imp = plot_importance(classifier, max_num_features=10, importance_type='weight')
plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'internetservice',
    hue = 'Churn'
)

plt.title('Churn rate according to internet service subscription', fontsize = 20)
plt.xlabel("internetservice", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

Let's visualise a couple more important features, such as `streamingtv` and `streamingmovies`.

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'streamingtv',
    hue = 'Churn'
)

plt.title('Churn rate according to streaming TV subscription', fontsize = 20)
plt.xlabel("streamingtv", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'streamingmovies',
    hue = 'Churn'
)

plt.title('Churn rate according to streaming movies service subscription', fontsize = 20)
plt.xlabel("streamingmovies", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'gender',
    hue = 'Churn'
)

plt.title('Churn rate according to Gender', fontsize = 20)
plt.xlabel("Gender", fontsize = 13)
plt.ylabel('Count', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.histplot(
    data = df_all,
    x = 'totalcharges',
    hue = 'Churn'
)

plt.title('Distribution of total charges by churn prediction', fontsize = 20)
plt.xlabel("Charge Value", fontsize = 13)
plt.ylabel('Count', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'paymentmethod',
    hue = 'Churn'
)

plt.title('Number of customers per payment method', fontsize = 20)
plt.xlabel("Payment Method", fontsize = 13)
plt.ylabel('Number of customers', fontsize = 13)

plt.show()

In [None]:
plt.figure(figsize = (13,6))

sns.countplot(
    data = df_all,
    x = 'partner',
    hue = 'Churn'
)

plt.title('Effect of having a partner on churn', fontsize = 20)
plt.xlabel("Have a partner", fontsize = 13)
plt.ylabel('Count', fontsize = 13)

plt.show()

---
## <span style="color:#ff5f27;">🧑🏻‍🔬 Streamlit App </span>

If you want an **interactive dashboard**, you can use a Streamlit app.

Use the following commands in a terminal to run the Streamlit app:

> `cd {%path_to_hopsworks_tutorials%}/`  <br>
> `conda activate ./miniconda/envs/hopsworks` <br>
> `python -m streamlit run churn/streamlit_app.py`<br>

**⚠️** If you are running on Colab, you will need to follow a different procedure, as described in this [notebook](https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Create_streamlit_app.ipynb).

---
## <span style="color:#ff5f27;"> 👓  Exploration</span>
In the Hopsworks feature store, the metadata allows for multiple levels of exploration and review. Here we will show a few of those capabilities.

### <span style="color:#ff5f27;">🔎 <b>Search</b></span> 
Using the search function in the UI, you can query any aspect of the feature groups, feature view, and training data that you created previously.

### <span style="color:#ff5f27;">📊 <b>Statistics</b> </span>
You can also enable statistics for one or all of the feature groups.

In [None]:
customer_info_fg = fs.get_feature_group("customer_info", version = 1)
customer_info_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

customer_info_fg.update_statistics_config()
customer_info_fg.compute_statistics()

![fg-statistics](../churn/images/churn_statistics.gif)


### <span style="color:#ff5f27;">⛓️ <b> Lineage </b> </span>
For every feature group and feature view, you can inspect the relations between the abstractions: which feature group produced which training dataset, and which model that training dataset was used in.
This gives a clear understanding of how each element fits into the pipeline.

---

### <span style="color:#ff5f27;">🥳 <b> Next Steps  </b> </span>
Congratulations, you've now completed the churn risk prediction tutorial for Managed Hopsworks.

Check out our other tutorials on ➡ https://github.com/logicalclocks/hopsworks-tutorials

Or documentation at ➡ https://docs.hopsworks.ai