# Machine Learning Models
We will now apply several machine learning models to our data. First, we import the Python packages needed for model building and validation.

* [Data Import and Preparation](#Fetch-the-data)
* Data Exploration (see notebooks [churn-1](churn-1-exploration.ipynb) and [churn-2](churn-2-exploration-II.ipynb))
* [Feature Selection](#Feature-Selection) and Engineering
* ...
* [Exercise](#Exercise): It will be your task to finish the pipeline.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Fetch the data
You can now choose between different sets of clients, each with different issues to solve. We will start with a very basic sample; you can then try out the others.

In [None]:
input_file = '../../.assets/data/churn/churn_persona.pkl.zip'
try:
    df = pd.read_pickle(input_file)
    print('SUCCESS: Everything seems fine, we are good to go.')
except FileNotFoundError:
    print(f'ERROR: File {input_file} not found. Did you forget to run the create_churn_persona notebook first?')

In [None]:
#print the columns in the dataset
df.columns

## Data Preparation

This is to ensure data quality. Due to some operations on the datasets there might be some NaN values (e.g. from divide-by-zero operations). We have to get rid of them, as they might confuse our machine learning algorithms.

In [None]:
df.loc[np.isnan(df.mail_r), 'mail_r'] = 0
df.loc[np.isnan(df.mail_s), 'mail_s'] = 0
df.loc[np.isnan(df.bank_r), 'bank_r'] = 0
df.loc[np.isnan(df.bank_s), 'bank_s'] = 0
df.loc[np.isnan(df.contacts_r), 'contacts_r'] = 0
df.loc[np.isnan(df.contacts_s), 'contacts_s'] = 0
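
The same clean-up can be written more compactly with `fillna`. The following is only a sketch on a toy frame: `demo` and its values are made up for illustration, only the column names are borrowed from the dataset. It also assumes that infinite values (another possible result of a division by zero) should be treated like NaN, which is a choice and not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (values are invented for illustration)
demo = pd.DataFrame({
    'mail_r': [0.5, np.nan, 0.1],
    'mail_s': [np.nan, 0.2, np.inf],
    'bank_r': [np.nan, np.nan, 1.0],
})

ratio_columns = ['mail_r', 'mail_s', 'bank_r']

# Assumption: +/-inf is treated like a missing value
demo = demo.replace([np.inf, -np.inf], np.nan)

# Replace all missing ratios with 0 in one step
demo[ratio_columns] = demo[ratio_columns].fillna(0)

print(demo)
```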

## Feature Selection
We have had a very close look into our data. You can select the relevant features from our dataset here. In this case, you might choose to take them all into account. In reality, you might want to select only the most important ones, as real-life data is nearly infinite and resources are limited.

In [None]:
# Just comment/uncomment the lines you like to select. 
# Keep the "churn" variable. It is needed for the training.

training_features = [
    'age',
    'amount',
    'churn', # we will delete it later from our data, as we want to predict it
    'contacts',
    'd_amount',
    'd_pay',
    'pay',
    'size',
    'year',
    #'bank_r',
    #'bank_s',
    'bank_n',
    #'mail_r',
    #'mail_s',
    'mail_n',
    #'contacts_r',
    #'contacts_s',
    'contacts_n'
]

## Variables and results
We now split our dataset into the variables used for our predictive model and the result that should be predicted (our churn state). We call the variables X and the results y.

In the first line of this block, all rows containing a NaN value are dropped.

In [None]:
X = df[training_features].dropna()
y = X.churn
X.drop('churn', axis=1, inplace=True)

# Exercise
Set up the machine learning pipeline.

1. Prepare the dataset for validation by performing a reasonable `train-test-split`. [Go to solution](#Train-Test-Split)
2. Define the ML model you want to use and set some standard hyperparameters. [Go to solution](#Model)
3. Perform the training by fitting the model to your train data. Try to find a way to add a `sample_weight` in this step. [Go to solution](#Fitting)
4. Do a proper validation by using a [hypothesis test](#Hypothesis-Test), [ROC curves](#ROC-Curves), a [confusion matrix](#Confusion-Matrix), [scores](#Scores-and-deduced-performance-indicators) and [feature importance](#Feature-Importance).
5. Save your model to disk. [Go to solution](#Import-and-export-trained-model)

# Sample solution

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
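
If the churn rate should be (almost) identical in both parts, `train_test_split` can be asked to stratify by the target. The sketch below uses synthetic data: `X_demo`, `y_demo` and the 10% churn rate are assumptions for illustration, not the notebook's dataset. Setting `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (roughly 10% churners, invented values)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'age': rng.integers(18, 80, size=1000),
    'amount': rng.normal(100, 20, size=1000),
})
y_demo = pd.Series((rng.random(1000) < 0.1).astype(int), name='churn')

# stratify keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f'all: {y_demo.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}')
```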

Is the data mixed well?

In [None]:
df['churn'].value_counts()

In [None]:
# Share of churners in the full data, computed like the two splits below
ratio_all_data = y.sum()/len(y)
ratio_train_data = y_train.sum()/len(y_train)
ratio_test_data = y_test.sum()/len(y_test)

print(f'{ratio_all_data:.3}')
print(f'{ratio_train_data:.3}')
print(f'{ratio_test_data:.3}')

## Model
We choose a `GradientBoostingClassifier` as our model. Feel free to play around with its parameters or to try one of the alternatives that are commented out below.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier



# Defining the model with key parameters
#model = RandomForestClassifier(n_estimators=50, max_depth=7)
#model = MLPClassifier(hidden_layer_sizes=(20,), activation='relu')
model = GradientBoostingClassifier(n_estimators=50, max_depth=4)
#model = AdaBoostClassifier(n_estimators=50)
#model = GaussianNB()
#model = KNeighborsClassifier(n_neighbors=300)
#model = SVC(kernel='rbf', probability=True) # takes a long time, reduce the dataset first!


## Fitting

In [None]:
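# Churning customers (y_train > 0) get weight 11, staying customers weight 1
sample_weight = 1 + (y_train > 0) * 10
try:
    model.fit(X_train, y_train, sample_weight=sample_weight)
except TypeError:
    # Raised when fit() does not accept sample_weight (e.g. KNeighborsClassifier
    # and, depending on the scikit-learn version, MLPClassifier)
    print('Exception: sample weights are not supported')
    print('Exception: using no weights for fit')
    model.fit(X_train, y_train)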
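
As a possible alternative to a hand-picked factor, scikit-learn can derive weights from the class frequencies, and it offers a helper to check up front whether a model accepts `sample_weight`. This is a sketch on a made-up label vector `y_demo`; whether "balanced" weights suit the churn problem better than the factor used above is an open choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.utils.validation import has_fit_parameter

# Made-up labels: 8 staying customers, 2 churners
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hand-picked weights as in the cell above
manual = 1 + (y_demo > 0) * 10

# 'balanced' gives n_samples / (n_classes * class_count) per sample
balanced = compute_sample_weight(class_weight='balanced', y=y_demo)

print(manual)
print(balanced)

# Check whether fit() takes sample_weight instead of catching an exception
gbc_ok = has_fit_parameter(GradientBoostingClassifier(), 'sample_weight')
knn_ok = has_fit_parameter(KNeighborsClassifier(), 'sample_weight')
print(gbc_ok, knn_ok)
```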

In [None]:
model

## Results
The actual machine learning training is done. Let's have a look at our results and measure how well our model performs on the training and the test data sets. If we see the same performance on both sets, we can take this as a strong indicator for a valid model. If the model performs much better on the training data set, something is wrong (-> **overfitting**, also called overtraining)!

### Hypothesis Test
There is an easy way to check the results by visualization. Each chart shows the distribution of the predicted probability of belonging to one class (staying or terminating) over all test samples. In addition, the color gives the true membership. A good classifier will show a good split: ideally, all samples of the current class will be on the right side and all others on the left.

In [None]:
# Getting the probability for a given data set
y_proba_test = model.predict_proba(X_test)
y_proba_train = model.predict_proba(X_train)

In [None]:
y_proba_test

In [None]:
y_proba_test.shape

In [None]:
for i in [0,1]:
    y_proba_test_i = y_proba_test[:,i]
    do_normalize = True
    
    plt.figure(figsize=(10, 3))
    
    plt.hist(y_proba_test_i[(y_test == 0).values], bins=30, alpha=0.5, density=do_normalize, label='Clients staying')
    plt.hist(y_proba_test_i[(y_test == 1).values], bins=30, alpha=0.5, density=do_normalize, label='Clients terminating')
    plt.legend()
    
    if i == 0:
        plt.title('Hypothesis: Client will stay with probability p')
    elif i == 1:
        plt.title('Hypothesis: Client will terminate with probability p')
    plt.xlabel('Probability p')
    
    #plt.yscale('log')
    plt.show();

### ROC Curves

The [**Receiver Operating Characteristics (ROC)**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) curve is a slightly more condensed way to validate a model. A ROC curve shows the **true positive rate** (TPR, $\frac{TP}{P} = \frac{TP}{TP+FN}$) as a function of the **false positive rate** (FPR, $\frac{FP}{N} = \frac{FP}{FP+TN}$) for each class. The curve is traced by sweeping the decision threshold over the predicted probability of that class; each threshold yields one (FPR, TPR) point. Given a certain hypothesis and an acceptable false positive rate, we can read off which fraction of the samples that truly fit the hypothesis we can select. Typically the ROC curve rises quickly and then flattens towards (1,1). The diagonal reflects a *random guess*. Keep in mind that both axes show rates, so the absolute numbers of samples behind them can differ significantly. In addition, the ROC curve can be used to compare within one condensed plot 
* the performance of different data sets (e.g. training and test data set),
* different sets of hyper-parameters of one model,
* different models.

Here, we show the results for the train and test data sets in comparison to detect deviations. Significant deviations could be an indication of overfitting to the train data! In addition, we plot a line along which the true positives and the false positives correspond to the same absolute number of customers.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
#y_proba_test = model.predict_proba(X_test)
#y_proba_train = model.predict_proba(X_train)
   
for i in [0,1]:
    y_proba_test_i = y_proba_test[:,i]
    y_proba_train_i = y_proba_train[:,i]
    
    plt.figure(figsize=(6, 6))
    plt.plot(*roc_curve(y_test == i, y_proba_test_i)[:2], label='test')
    plt.plot(*roc_curve(y_train == i, y_proba_train_i)[:2], label='train')
    
    # Add line where true and false positives have the same absolute number of customers
    if True:
        if i == 0:
            m = (y_test == True).sum()/(y_test == False).sum()
        else:
            m = (y_test == False).sum()/(y_test == True).sum()
        x = np.linspace(0,1,21)
        plt.plot(x, m*x, linestyle= '--', color='red', label='same number \nof customers')
        
    plt.plot([0, 1],[0, 1], color='black', linestyle=':')
    
    if i == 0:
        plt.title(f'Clients staying')
    if i == 1:
        plt.title(f'Clients terminating')
        
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.ylim(-0.1,1.1)
    plt.legend(loc='best')
    plt.show();   
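
To make the "acceptable false positive rate" reading of the curve concrete, the following sketch picks the threshold with the highest TPR whose FPR stays within a budget. The labels, scores and the budget `max_fpr` are invented for illustration; `drop_intermediate=False` is set so that every threshold is kept.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented example: true classes and predicted probabilities for class 1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)

# Assumed budget: at most 25% of the true negatives may be flagged
max_fpr = 0.25
allowed = fpr <= max_fpr

# Among the allowed points, take the first one with the highest TPR
best = np.argmax(tpr[allowed])
best_fpr = fpr[allowed][best]
best_tpr = tpr[allowed][best]
best_threshold = thresholds[allowed][best]

auc = roc_auc_score(y_true, scores)
print(f'threshold={best_threshold:.2f}  FPR={best_fpr:.2f}  TPR={best_tpr:.2f}  AUC={auc:.4f}')
```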

### Confusion Matrix

The [**Confusion Matrix**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) (Table of Confusion) gives for each class how many samples are classified correctly (principal diagonal) and how many are misclassified. In addition, it shows to which wrong class the samples were assigned. In our case we get a 2x2 matrix. Each row sums to either all customers who do not churn (*Negatives*) or all customers who churn (*Positives*). A _perfect_ classifier would have entries only on the principal diagonal. Keep in mind that each sample is assigned to the class with the highest probability, regardless of how high that probability is (worst case with two classes: just above 50%).

The sum of the first row is the number of all false members (**Negatives**, N), consisting of the **True Negatives** (TN) and **False Positives** (FP). The sum of the second row is the number of all true members (**Positives**, P), consisting of the **False Negatives** (FN) and **True Positives** (TP).

Actual class | classified as Negatives | classified as Positives
-|-|-
**Negatives (N)** | True Negatives (TN) | False Positives (FP)
**Positives (P)** | False Negatives (FN) | True Positives (TP)

In [None]:
from sklearn.metrics import confusion_matrix

y_pred_test = model.predict(X_test)
truth = y_test 
cm = confusion_matrix(truth,y_pred_test)

pd.DataFrame(data=cm, columns=['Predict as staying', 'Predict as terminating'], index=['Stays', 'Terminating'])
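
The layout of the matrix can be checked on a tiny hand-made example. The arrays `y_true` and `y_pred` below are invented; the sketch only illustrates that scikit-learn puts the true classes in the rows and the predicted classes in the columns, so that `ravel()` yields TN, FP, FN, TP in this order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented example: 6 staying customers (0) and 4 churners (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
```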

## Scores and deduced performance indicators

There are several performance indicators which only reflect single rates. For example, the **True Positive Rate** (TPR, Sensitivity, Hit Rate, Recall) is the ratio of True Positives to Positives. Its counterpart is the **True Negative Rate** (TNR, Specificity).

* True Positive Rate (TPR, Sensitivity, Recall) : $\frac{TP}{P}$

* True Negative Rate (TNR, Specificity) : $\frac{TN}{N}$ 

We should always take both rates into account, since either one alone can look good for a poor classifier. In addition, the **Accuracy** (ACC) can give a hint, as it covers both Positives and Negatives.

* Accuracy (ACC): $\frac{TP + TN}{P + N}$ 

There are other meaningful indicators, such as the **Positive Predictive Value** (PPV, Precision), which is the ratio of True Positives to all samples classified as positive (TP+FP). The **F1 score** is the harmonic mean of Precision and Sensitivity.

* Positive Predictive Value (PPV, Precision) : $\frac{TP}{TP + FP}$


* F1 score (harmonic mean of PPV & TPR) : $\frac{2 \cdot PPV \cdot TPR}{PPV + TPR} = \frac{2 TP}{2 TP + FP + FN}$

* Area under the ROC Curve (**AUC**).
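
The formulas above can be verified by hand on a small example and compared with the corresponding functions in `sklearn.metrics`. The arrays are invented for illustration; the TNR is obtained here as the recall of class 0, which is one possible way to compute it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented example: 6 staying customers (0) and 4 churners (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Count the four outcomes directly
tp = int(((y_true == 1) & (y_pred == 1)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

# Indicators as defined in the list above
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
acc = (tp + tn) / len(y_true)
ppv = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

# The same indicators from scikit-learn
print(f'TPR {tpr:.3f} vs {recall_score(y_true, y_pred):.3f}')
print(f'TNR {tnr:.3f} vs {recall_score(y_true, y_pred, pos_label=0):.3f}')
print(f'ACC {acc:.3f} vs {accuracy_score(y_true, y_pred):.3f}')
print(f'PPV {ppv:.3f} vs {precision_score(y_true, y_pred):.3f}')
print(f'F1  {f1:.3f} vs {f1_score(y_true, y_pred):.3f}')
```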

In [None]:
from sklearn.metrics import classification_report

y_pred_test = model.predict(X_test)
target_names = ['loyal customers', 'terminating customers']
report = classification_report(y_test, y_pred_test, target_names=target_names)
print(report)

y_pred_train = model.predict(X_train)
report = classification_report(y_train, y_pred_train, target_names=target_names)
print(report)

In [None]:
data=[]
i = 0
y_proba_test_i = y_proba_test[:,i]
y_proba_train_i = y_proba_train[:,i]

data.append(roc_auc_score(y_test.values == i, y_proba_test_i))
data.append(roc_auc_score(y_train.values == i, y_proba_train_i))
    
# Displaying
pd.DataFrame(np.array(data), columns=['AUC'], index = ['test', 'train'])

In [None]:
print(f'Mean Accuracy of train: {model.score(X_train, y_train):.3f}')
print(f'Mean Accuracy of test: {model.score(X_test, y_test):.3f}')

### More functionality
There are other tools in [**sklearn.metrics**](http://scikit-learn.org/stable/modules/model_evaluation.html) to perform general performance and validation analyses. One of them is the [**classification_report**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) used above, which summarizes precision, recall and F1 score per class. Another often applied strategy is [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html), in which the train-test-split is performed several times on a data set. This yields averaged performance indicators and some hints on how stable the system is.
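
A cross-validation could look roughly like the sketch below. It runs on synthetic data from `make_classification` (the sample size, class weights and the names `X_demo`, `y_demo`, `model_demo` are assumptions for illustration), uses five stratified folds and reports the AUC per fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)

model_demo = GradientBoostingClassifier(n_estimators=50, max_depth=4, random_state=0)

# Five stratified folds, shuffled reproducibly
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring='roc_auc')

print(scores)
print(f'mean AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```

A small spread between the folds suggests that the performance does not depend strongly on one particular split.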

## Feature Importance

Several machine learning models return a score for the importance of each feature within the classifier. This can be used to perform more training steps to improve the model, to reduce computing time, or to feed back into the initial data acquisition. If we detect that one feature is very important for the classifier, it may be a good idea to improve the quality of this feature or to engineer equivalent features. In addition, this step can highlight features which were not expected to be important and can lead to a rethinking of strategies.

In [None]:
# Only works for models exposing feature_importances_,
# e.g. GradientBoostingClassifier or RandomForestClassifier
try:
    importances = model.feature_importances_
except AttributeError:
    print("Model does not support feature importances in this example.")
else:
    print(importances)
    plt.figure(figsize=(10, 10))
    plt.barh(range(len(X.columns)), importances)
    plt.yticks(range(len(X.columns)), X.columns)
    plt.show()
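
For models without a `feature_importances_` attribute, permutation importance is one model-agnostic alternative: it measures how much the score drops when a single feature is shuffled. The sketch below is not part of the original solution; it uses synthetic data and a `KNeighborsClassifier` with invented settings purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 5 features, only 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# KNeighborsClassifier has no feature_importances_ attribute
knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the score drop
result = permutation_importance(knn, X_te, y_te, n_repeats=5, random_state=0)

for idx, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f'feature {idx}: {mean:.3f} +/- {std:.3f}')
```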

## Import and export trained model
It is possible to save a trained model and load it again later. This lets the user distribute models or compare models of different states or types in another instance. More details are [available online](http://scikit-learn.org/stable/modules/model_persistence.html).

In [None]:
import pickle

# Context managers make sure the file handles are closed again
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    load_model = pickle.load(f)
load_model

In [None]:
import joblib
joblib.dump(model, 'model.pkl') 
load_model = joblib.load('model.pkl')
load_model
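
After loading, it can be worth checking that the restored model behaves like the original one. The following sketch trains a small model on synthetic data, writes it to a temporary directory and compares the predicted probabilities; the names `model_demo` and `model_demo.joblib` are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic model standing in for the trained churn model
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model_demo = GradientBoostingClassifier(
    n_estimators=20, max_depth=3, random_state=0
).fit(X_demo, y_demo)

# Save to a temporary directory and load the model again
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.joblib')
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# The restored model should give the same probabilities
same = np.allclose(model_demo.predict_proba(X_demo), restored.predict_proba(X_demo))
print(same)
```

Keep in mind that pickled models should only be loaded from trusted sources and ideally with the same scikit-learn version that was used for saving.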

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_