# Capstone proposal by Mirko Salomon 

# BEVERAGE MACHINE CHURN PREDICTION

* ## [1) Introduction and preparation](#Introduction) 

    *   
        #### Data Load
        
        #### Hyperparameters
        
* ## [2) Random Forest model](#RF)
    
    *   
        #### Train the model
        
        #### Test data with best parameters 
        
        #### Confusion Matrix
        
        #### Tree Plot
        
        #### Prediction of churn

## 1) Introduction and preparation<a class="anchor" id="Introduction"></a>

Random forests are an ensemble learning method that builds a multitude of decision trees at training time. For classification, the forest outputs the class chosen by most of the individual trees (the mode); for regression, it outputs the mean of the individual trees' predictions.
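To make the voting idea concrete, here is a minimal sketch in plain Python. The votes are made up for illustration and do not come from the churn data. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this only illustrates the textbook definition.

```python
from collections import Counter

# Hypothetical votes of five trees for three machines (True = churn)
tree_votes = [
    [True, False, False],
    [True, False, True],
    [False, False, True],
    [True, False, True],
    [True, True, False],
]

def majority_vote(votes_per_tree):
    """Return the most common class per sample across all trees."""
    per_sample = zip(*votes_per_tree)  # regroup the votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

forest_prediction = majority_vote(tree_votes)

# Share of trees voting for churn, a rough analogue of a class probability
churn_share = [sum(votes) / len(votes) for votes in zip(*tree_votes)]

print(forest_prediction)
print(churn_share)
```

The share of trees voting for a class is also the intuition behind the probabilities that a forest can report for each sample.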

Based on the scikit-learn algorithm flowchart, a Random Forest model is a sensible candidate to try.

This model is more of a black box than a regression or a single decision tree.

It offers predict_proba, which predicts class probabilities for X. Random forests are also reported to cope well with categorical features and to maintain accuracy when some data is missing.

It seems to perform well on imbalanced datasets, and the class weights can be adjusted.

It works differently from a regression model and usually performs better than a single decision tree.

I will tune the number of estimators, the maximum tree depth and the split criterion. I will also tune the class weight and the bootstrap setting.

In [1]:
from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd
import os

%matplotlib inline
import matplotlib.pyplot as plt

import pickle

import xlrd

import datetime as dt
from datetime import datetime

import collections
from collections import Counter

# Import seaborn
import seaborn as sns


In [2]:
#conda update -c conda-forge scikit-learn

In [3]:
#pip install --upgrade scikit-learn

In [4]:
#libraries
from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import ParameterGrid

from sklearn import metrics 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#from sklearn.metrics import plot_confusion_matrix

In [5]:
# Load the pickle file
with open('BM_noTickets_preprocess.p', 'rb') as file:
    BM_noTickets_preprocess = pickle.load(file)

In [6]:
# Load the pickle file
with open('X.p', 'rb') as file:
    X = pickle.load(file)

# Load the pickle file
with open('y.p', 'rb') as file:
    y = pickle.load(file)

# Load the pickle file
with open('X_tr.p', 'rb') as file:
    X_tr = pickle.load(file)

# Load the pickle file
with open('y_tr.p', 'rb') as file:
    y_tr = pickle.load(file)

# Load the pickle file
with open('X_val.p', 'rb') as file:
    X_val = pickle.load(file)

# Load the pickle file
with open('y_val.p', 'rb') as file:
    y_val = pickle.load(file)

# Load the pickle file
with open('X_te.p', 'rb') as file:
    X_te = pickle.load(file)

# Load the pickle file
with open('y_te.p', 'rb') as file:
    y_te = pickle.load(file)

In [7]:
BM_noTickets_preprocess.head()

| | TA Contract Installation Date | Depreciation Start | TA Contract Start Date | TA Contract End Date | Churn | Service Category_Installation | Service Category_Removal | Service Category_Replacement | INCIDENT_CATEGORY_DESCRIPTION_Customer relocation | INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales | ... | Generation_Gen. 2 | Generation_Legacy | Blueprint Throughput_%23-N/A | Blueprint Throughput_High | Blueprint Throughput_Low | Blueprint Throughput_Medium | IP Ownership_Exclusive | IP Ownership_Non-Proprietary | IP Ownership_Propr. Comp. | IP Ownership_Proprietary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1751.0 | 2716.0 | 1744.0 | 103.19015 | False | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1751.0 | 485.0 | 1744.0 | 103.19015 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1751.0 | 2191.0 | 1744.0 | 103.19015 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1751.0 | 2191.0 | 1744.0 | 103.19015 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1751.0 | 2191.0 | 1744.0 | 103.19015 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |


In [8]:
BM_noTickets_preprocess.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 230786 entries, 0 to 230785
Columns: 1350 entries, TA Contract Installation Date to IP Ownership_Proprietary
dtypes: bool(1), float64(103), int32(2), int64(2), object(1), uint8(1241)
memory usage: 463.5+ MB


### Hyperparameters

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

---

The number of trees in the forest. I will try some.

n_estimators = range(20,200,20)

---

The maximum depth of the tree. I will try some.

max_depth = (20, 50, 100, None)

---

The function to measure the quality of a split. I will try both.

criterion = ('gini', 'entropy')
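As a rough illustration of what the two criteria measure, the sketch below computes Gini impurity and entropy for a few hypothetical class mixes in a node. The node proportions are invented for the example; both measures are zero for a pure node and largest for a 50/50 mix.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Hypothetical nodes: a pure node, a mixed node and a perfectly mixed node
nodes = [("pure", [1.0, 0.0]), ("mixed", [0.8, 0.2]), ("50/50", [0.5, 0.5])]

for label, node in nodes:
    print(f"{label:>6}: gini={gini(node):.3f}  entropy={entropy(node):.3f}")
```

Both criteria rank splits in a similar way most of the time, which is why trying both in the grid is cheap insurance rather than a fundamental modelling choice.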

---

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
I will try both.

bootstrap = (True, False)
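A small sketch of what sampling with replacement implies, using only the standard library. The row count of 10,000 is an arbitrary assumption for the illustration: a bootstrap sample of the same size as the data contains only about 63% of the distinct rows, so each tree sees a slightly different dataset.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n_rows = 10_000  # hypothetical number of training rows

# Draw n_rows row indices with replacement, as a bootstrap sample does
bootstrap_sample = [random.randrange(n_rows) for _ in range(n_rows)]

unique_fraction = len(set(bootstrap_sample)) / n_rows
expected = 1 - (1 - 1 / n_rows) ** n_rows  # approaches 1 - 1/e, about 0.632

print(f"unique rows in the sample: {unique_fraction:.3f} (expected about {expected:.3f})")
```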

---

Weights associated with classes in the form {class_label: weight}. I will try 'balanced' and no weighting (None).

class_weight = ('balanced', None)
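For intuition, the 'balanced' mode weights each class by n_samples / (n_classes * class_count), as described in the scikit-learn documentation. The sketch below applies that formula by hand. The class counts are taken from the test-set classification report further down, so the weights on the training split may differ slightly; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """Weights computed as n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts of the test set, used here for illustration only
counts = {False: 30425, True: 15733}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The rarer churn class gets the larger weight, so that both classes contribute equally to the total weighted sample count.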


## 2) Random Forest model<a class="anchor" id="RF"></a>

### Train the model

In [9]:
y_tr.shape

(129239,)

In [10]:
X_tr.shape

(129239, 1349)

In [11]:
np.shape(X_te)

(46158, 1349)

In [12]:
# Define a set of reasonable values
#n_estimators = range(20,160,20) # I tried up to 200 but it was not in the top results, so I reduced to 160
n_estimators = range(20,120,60) # Currently only 20 and 80

#max_depth = (20, 50, 10, None) # I tried with 200, 300, but results were not better
max_depth = (20, 50, None) # I tried with 200, 300, but results were not better

criterion = ('gini', 'entropy')
bootstrap = (True, False) # Must be a sequence of values; a bare (True) is just the boolean True
class_weight = ('balanced', None) # I tried with 'balanced_subsample', but I had lower scores.


# Define a parameter grid of values
grid = ParameterGrid({'rf__n_estimators' : n_estimators,
                    'rf__max_depth' : max_depth,
                    'rf__criterion' :  criterion,
                    'rf__bootstrap' : bootstrap,
                    'rf__class_weight' : class_weight 
                   }
                  )
 

# Create pipeline, random forest classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=1))
    ]
)
   
# Collect the validation scores of each parameter combination
test_scores = []

for params_dict in grid:
    # Set parameters
    pipe.set_params(**params_dict)

    # Fit the random forest pipeline on the training set
    pipe.fit(X_tr, y_tr)

    # Predict the validation instances
    y_pred = pipe.predict(X_val)

    # Save accuracy and macro F1 score on the validation set
    params_dict['accuracy'] = metrics.accuracy_score(y_val, y_pred)
    params_dict['f1_macro'] = metrics.f1_score(y_val, y_pred, average='macro')
    
    # Save result
    test_scores.append(params_dict)
    
# Create DataFrame with validation scores
scores_df = pd.DataFrame(test_scores)

# Top five scores
scores_df.sort_values(by='f1_macro', ascending=False).head()

KeyboardInterrupt: 
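The interrupted run hints at the cost of an exhaustive search: every combination means one full fit on 129,239 rows and 1,349 columns. As a rough sketch, the number of fits can be counted before starting. `grid_size` is a hypothetical helper, and the wider grid corresponds to the commented-out values in the cell above.

```python
from itertools import product

def grid_size(grid):
    """Number of parameter combinations in a grid given as {name: values}."""
    return len(list(product(*grid.values())))

reduced_grid = {
    "n_estimators": range(20, 120, 60),  # yields 20 and 80
    "max_depth": (20, 50, None),
    "criterion": ("gini", "entropy"),
    "bootstrap": (True, False),
    "class_weight": ("balanced", None),
}

# Same grid with the wider, commented-out ranges for trees and depth
wider_grid = dict(
    reduced_grid,
    n_estimators=range(20, 160, 20),
    max_depth=(20, 50, 10, None),
)

print("fits with the reduced grid:", grid_size(reduced_grid))
print("fits with the wider grid:  ", grid_size(wider_grid))
```

Counting the fits first makes it easier to decide whether to shrink the grid or to switch to a randomized search.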

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rf = RandomForestClassifier(random_state=1)
rf.fit(X_tr, y_tr)
y_pred = rf.predict(X_te) # Predictions
y_true = y_te # True values

from sklearn.metrics import accuracy_score
# Use the fitted model rf (the name rfc was never defined)
print("Train accuracy:", np.round(accuracy_score(y_tr, 
                                                 rf.predict(X_tr)), 2))
print("Test accuracy:", np.round(accuracy_score(y_true, y_pred), 2))

from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y_true, y_pred)
print("\nTest confusion_matrix")
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)

# List the tunable parameter names of the pipeline (they carry the 'rf__' prefix)
pipe.get_params().keys()
param = {'rf__n_estimators' : n_estimators,
         'rf__max_depth' : max_depth,
         'rf__criterion' :  criterion,
         'rf__bootstrap' : (True, False),
         'rf__class_weight' : class_weight 
        }


# Let's measure execution time too
import time
start = time.time()

# Define a set of reasonable values
n_estimators = range(20,160,20) # I tried till 200 but it was not in the top results, so I reduced to 160
max_depth = (20, 50, 10, None) # I tried with 200, 300, but results were not better
criterion = ('gini', 'entropy')
bootstrap = (True, False)
class_weight = ('balanced', None) # I tried with  'balanced_subsample', but I had lower scores.


# Defining 3-dimensional hyperparameter space as a Python dictionary
#hyperparameter_space = {'rf__n_estimators' : n_estimators, 'rf__max_depth' : max_depth,
#                    'rf__criterion' :  criterion,
#                    'rf__bootstrap' : (True, False),
#                    'rf__class_weight' : class_weight 

#param = {'rf__n_estimators' : n_estimators,
#         'rf__max_depth' : max_depth,
#         'rf__max_depth' : max_depth,
#         'rf__criterion' :  criterion,
#         'rf__bootstrap' : (True, False),
#         'rf__class_weight' : class_weight 
#        }

random_grid = {'bootstrap': [True, False],
               'max_depth': [10, None],
               'min_samples_leaf': [1, 4],
               'min_samples_split': [2, 5],
               'n_estimators': [130, 230]}

from sklearn.model_selection import RandomizedSearchCV
#rs = RandomizedSearchCV(rfc, param_distributions=hyperparameter_space,
#                        n_iter=10, scoring= 'f1_macro', random_state=1,
#                        n_jobs=-1, cv=10, return_train_score=True)


#rs = RandomizedSearchCV(rf, param, n_iter =10, cv=9)
# random_grid only has 32 combinations, so scikit-learn caps n_iter=100 at 32 (with a warning)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, verbose=2, random_state=42, n_jobs = -1)

rf_random.fit(X_tr, y_tr)


#rs.fit(X_tr, y_tr)
print("Optimal hyperparameter combination:", rf_random.best_params_)
print()
print("Mean cross-validated accuracy score:",
      rf_random.best_score_)
# best_estimator_ is already refitted on the whole training set (refit=True by default)
y_pred = rf_random.best_estimator_.predict(X_te) # Predictions
y_true = y_te # True values

print("Test accuracy:", np.round(accuracy_score(y_true, y_pred), 2))
cf_matrix = confusion_matrix(y_true, y_pred)
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')

end = time.time()
diff = end - start
print('Execution time of Random Search (in Seconds):', diff)
print()

# Mean cross-validated accuracy of every sampled combination
cv_results = rf_random.cv_results_
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(mean_score, params)
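To illustrate what a randomized search does with a grid of lists, the sketch below enumerates all combinations of random_grid and draws a fixed budget of them without replacement. It only mimics the idea with the standard library and does not reproduce scikit-learn's own sampling order; the budget of 10 is a hypothetical value.

```python
import random
from itertools import product

random_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, None],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 5],
    "n_estimators": [130, 230],
}

# Every combination in the grid, as a list of {parameter: value} dicts
all_combos = [dict(zip(random_grid, values))
              for values in product(*random_grid.values())]

n_iter = 10  # hypothetical budget, smaller than the number of combinations
rng = random.Random(42)
sampled = rng.sample(all_combos, k=min(n_iter, len(all_combos)))

print(len(all_combos), "combinations in total,", len(sampled), "sampled")
```

With only 32 combinations in this grid, a budget above 32 brings no benefit over an exhaustive grid search.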

In [12]:
params_dict = {'rf__n_estimators': 80, 
 'rf__max_depth': 50, 
 'rf__criterion': 'entropy', 
 'rf__bootstrap': False, 
 'rf__class_weight': 'balanced'}

# Create pipeline, random forest classifier
pipe = Pipeline([
    #('oversample', SmoteSample_model),
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=1))
    ]
)
  
# Save scores on the test set
test_scores = []

# Set parameters
pipe.set_params(**params_dict)

# Fit the random forest pipeline on the training set
pipe.fit(X_tr, y_tr)

# Save accuracy on the test set
params_dict['accuracy'] = pipe.score(X_te, y_te)
# Predict the test instances and save the macro F1 score
y_pred = pipe.predict(X_te)
params_dict['f1_macro'] = metrics.f1_score(y_te, y_pred, average='macro')
    
# Save result
test_scores.append(params_dict)
    
# Create DataFrame with test scores
scores_df = pd.DataFrame(test_scores)

# Show the scores
scores_df.sort_values(by='f1_macro', ascending=False).head()

| | rf__n_estimators | rf__max_depth | rf__criterion | rf__bootstrap | rf__class_weight | accuracy | f1_macro |
|---|---|---|---|---|---|---|---|
| 0 | 80 | 50 | entropy | False | balanced | 0.818991 | 0.797082 |


We have an F1 macro score of 79.7% and an accuracy of 81.9% on the test data, which is close to the validation data results.

In [13]:
# F1_score
RF_F1Macro = params_dict['f1_macro']

# Accuracy
RF_accuracy = params_dict['accuracy']

In [14]:
# Load the npz file for results
with np.load('Results.npz', allow_pickle=False) as npz_file:
    F1Score = npz_file['test_F1Score']
    accuracy = npz_file['test_accuracy']
    models = npz_file['models']
    
# Fill the calculated result value
F1Score[4] = RF_F1Macro
accuracy[4] = RF_accuracy

print('F1_Results:', F1Score)
print('Accuracy:', accuracy)

F1_Results: [0.39728138 0.         0.         0.         0.79708213 0.
 0.         0.        ]
Accuracy: [0.65914901 0.         0.         0.         0.81899129 0.
 0.         0.        ]


In [15]:
# Modify the Numpy array
#Model = models #unchanged
Result = F1Score
Accuracy = accuracy

# Store the changes in the results npz file
np.savez('Results.npz', models = models, test_F1Score = Result,  test_accuracy = Accuracy)

In [16]:
# Check the refreshed results
# Load the npz file for results
with np.load('Results.npz', allow_pickle=False) as npz_file:
    # It's a dictionary-like object    
    print(list(npz_file.keys()))
    # Load the arrays
    print('Models:', npz_file['models'])
    print('F1_Results:', npz_file['test_F1Score'])
    print('Accuracy:', npz_file['test_accuracy'])

['models', 'test_F1Score', 'test_accuracy']
Models: ['Baseline' 'Logistic Regression' 'KNeighbors' 'Decision Tree'
 'Random Forest' 'XGBoost' 'SelectedModel_wTickets'
 'SelectedModel_wTickets&Telemetry']
F1_Results: [0.39728138 0.         0.         0.         0.79708213 0.
 0.         0.        ]
Accuracy: [0.65914901 0.         0.         0.         0.81899129 0.
 0.         0.        ]


### Confusion Matrix

In [17]:
#Classification reports and confusion matrices are commonly used for reporting the performance of classifiers when working with imbalanced datasets.

# Print the confusion matrix
print(metrics.confusion_matrix(y_te, y_pred))

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_te, y_pred, digits=3))

[[26485  3940]
 [ 4415 11318]]
              precision    recall  f1-score   support

       False      0.857     0.871     0.864     30425
        True      0.742     0.719     0.730     15733

    accuracy                          0.819     46158
   macro avg      0.799     0.795     0.797     46158
weighted avg      0.818     0.819     0.818     46158



In [18]:
# Confusion matrix as a labelled DataFrame, transposed so that rows are predictions and columns are true conditions
ConfMat_df = pd.DataFrame(metrics.confusion_matrix(y_te, y_pred).T, columns=['False Condition', 'True Condition'],index=['Predicted False', 'Predicted True'])
ConfMat_df

| | False Condition | True Condition |
|---|---|---|
| Predicted False | 26485 | 4415 |
| Predicted True | 3940 | 11318 |


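The F1 score is good for the False condition (0.864) but noticeably lower for the True condition (0.730), so the model identifies machines that stay more reliably than machines that churn.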
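The per-class F1 scores, the macro F1 and the accuracy can all be recomputed by hand from the confusion matrix printed above. The following sketch does that in plain Python, as a cross-check of the reported numbers; `per_class_f1` is a hypothetical helper.

```python
# Confusion matrix from the report: rows are true classes, columns are predictions
conf_mat = [[26485, 3940],
            [4415, 11318]]
labels = [False, True]

def per_class_f1(matrix, k):
    """F1 of class k, computed as 2*TP / (2*TP + FP + FN)."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # truly k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp   # predicted k, truly otherwise
    return 2 * tp / (2 * tp + fp + fn)

f1_scores = [per_class_f1(conf_mat, k) for k in range(len(labels))]
macro_f1 = sum(f1_scores) / len(f1_scores)    # unweighted mean over classes
accuracy = sum(conf_mat[k][k] for k in range(len(labels))) / sum(map(sum, conf_mat))

for label, score in zip(labels, f1_scores):
    print(f"F1 for class {label}: {score:.3f}")
print(f"macro F1: {macro_f1:.4f}, accuracy: {accuracy:.4f}")
```

Because the macro F1 is an unweighted mean, the weaker score on the churn class pulls it below the accuracy, which is dominated by the larger False class.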

In [19]:
# Save the Dataframe into a pickle file
with open('Random Forest_ConfMat_df.p', 'wb') as file:
    pickle.dump(ConfMat_df, file)

### Prediction of churn

Churn prediction from the model

In [20]:
# Need the one in the same format, so with removed Sales Org
# Load the pickle file
with open('BeverageMachine_withSerial.p', 'rb') as file:
    BeverageMachine_withSerial = pickle.load(file)

In [21]:
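# Column names for the two class probabilities
name = ['False', 'True']
# Use the Serial ID of each machine as row index
No=BeverageMachine_withSerial['Serial ID']
# Predicted class probabilities for all machines
predictions = pipe.predict_proba(X)
# One row per machine, one probability column per class
df2 = pd.DataFrame(predictions, index=No ,columns = name) 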

In [22]:
df3 = df2.reset_index(level=None)
df3

| | Serial ID | False | True |
|---|---|---|---|
| 0 | MYBMB25820 | 0.858197 | 0.141803 |
| 1 | 22O0025413 | 0.955200 | 0.044800 |
| 2 | 22O0023817 | 0.601167 | 0.398833 |
| 3 | 22O0023735 | 0.695519 | 0.304481 |
| 4 | 22O0023729 | 0.640347 | 0.359653 |
| ... | ... | ... | ... |
| 230781 | 20O0017858 | 0.021987 | 0.978013 |
| 230782 | 20O0017859 | 0.006570 | 0.993430 |
| 230783 | 20O0017862 | 0.091315 | 0.908685 |
| 230784 | 20O0017861 | 0.066315 | 0.933685 |
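Once the probabilities are available per machine, one possible next step is to rank the machines by churn probability. The sketch below uses a few rows copied from the output above; the 0.5 cut-off is an assumption for illustration, not a value chosen in this notebook.

```python
# A few (Serial ID, churn probability) rows copied from the output above
churn_probability = {
    "MYBMB25820": 0.141803,
    "22O0025413": 0.044800,
    "22O0023817": 0.398833,
    "22O0023735": 0.304481,
    "20O0017858": 0.978013,
    "20O0017859": 0.993430,
}

THRESHOLD = 0.5  # hypothetical cut-off for flagging a machine

# Rank machines from highest to lowest churn probability
ranked = sorted(churn_probability.items(), key=lambda item: item[1], reverse=True)
at_risk = [serial for serial, p in ranked if p >= THRESHOLD]

for serial, p in ranked:
    flag = "at risk" if p >= THRESHOLD else ""
    print(f"{serial}  {p:.3f}  {flag}")
```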


In [23]:
df3.reset_index(inplace=True)

In [24]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230786 entries, 0 to 230785
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   index      230786 non-null  int64  
 1   Serial ID  230786 non-null  object 
 2   False      230786 non-null  float64
 3   True       230786 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 7.0+ MB


In [25]:
df21 = pd.DataFrame(BeverageMachine_withSerial[['Serial ID','Churn', 'Sales Organisation']]).reset_index(level=None)
df21

| | index | Serial ID | Churn | Sales Organisation |
|---|---|---|---|---|
| 0 | 0 | MYBMB25820 | False | Malaysia |
| 1 | 1 | 22O0025413 | False | Nestlé India |
| 2 | 2 | 22O0023817 | False | Nestlé India |
| 3 | 3 | 22O0023735 | False | Nestlé India |
| 4 | 4 | 22O0023729 | False | Nestlé India |
| ... | ... | ... | ... | ... |
| 230781 | 230781 | 20O0017858 | True | Singapore |
| 230782 | 230782 | 20O0017859 | True | Singapore |
| 230783 | 230783 | 20O0017862 | True | Singapore |
| 230784 | 230784 | 20O0017861 | True | Singapore |


In [26]:
df21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230786 entries, 0 to 230785
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   index               230786 non-null  int64 
 1   Serial ID           230786 non-null  object
 2   Churn               230786 non-null  bool  
 3   Sales Organisation  230786 non-null  object
dtypes: bool(1), int64(1), object(2)
memory usage: 5.5+ MB


In [27]:
df3['Serial ID'] = df3['Serial ID'].astype('str')
df21['Serial ID'] = df21['Serial ID'].astype('str')
df3['index'] = df3['index'].astype('str')
df21['index'] = df21['index'].astype('str')

In [28]:
df3['KeyIndSer'] = df3['index'] + "-" + df3['Serial ID']
df21['KeyIndSer'] = df21['index'] + "-" + df21['Serial ID']

In [29]:
df4 = pd.merge(df3, df21, how='left', left_on = ['KeyIndSer'], right_on = ['KeyIndSer'])
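A left merge duplicates rows whenever the key is not unique on the right-hand side. The sketch below, with made-up serial IDs, shows why a composite key built from the row index and the Serial ID stays unique even if a Serial ID were to appear more than once. Whether duplicates actually occur in this dataset is not checked here, and `duplicated_keys` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical serial IDs; "A2" appears twice on purpose
serial_ids = ["A1", "A2", "A2", "A3"]

def duplicated_keys(keys):
    """Return the keys that occur more than once, sorted."""
    return sorted(key for key, n in Counter(keys).items() if n > 1)

# Composite key in the same "index-serial" form as KeyIndSer
composite_keys = [f"{i}-{serial}" for i, serial in enumerate(serial_ids)]

print("duplicated serial IDs:    ", duplicated_keys(serial_ids))
print("duplicated composite keys:", duplicated_keys(composite_keys))
```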

In [30]:
df4

| | index_x | Serial ID_x | False | True | KeyIndSer | index_y | Serial ID_y | Churn | Sales Organisation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MYBMB25820 | 0.858197 | 0.141803 | 0-MYBMB25820 | 0 | MYBMB25820 | False | Malaysia |
| 1 | 1 | 22O0025413 | 0.955200 | 0.044800 | 1-22O0025413 | 1 | 22O0025413 | False | Nestlé India |
| 2 | 2 | 22O0023817 | 0.601167 | 0.398833 | 2-22O0023817 | 2 | 22O0023817 | False | Nestlé India |
| 3 | 3 | 22O0023735 | 0.695519 | 0.304481 | 3-22O0023735 | 3 | 22O0023735 | False | Nestlé India |
| 4 | 4 | 22O0023729 | 0.640347 | 0.359653 | 4-22O0023729 | 4 | 22O0023729 | False | Nestlé India |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230781 | 230781 | 20O0017858 | 0.021987 | 0.978013 | 230781-20O0017858 | 230781 | 20O0017858 | True | Singapore |
| 230782 | 230782 | 20O0017859 | 0.006570 | 0.993430 | 230782-20O0017859 | 230782 | 20O0017859 | True | Singapore |
| 230783 | 230783 | 20O0017862 | 0.091315 | 0.908685 | 230783-20O0017862 | 230783 | 20O0017862 | True | Singapore |
| 230784 | 230784 | 20O0017861 | 0.066315 | 0.933685 | 230784-20O0017861 | 230784 | 20O0017861 | True | Singapore |


In [31]:
df4.to_csv(r'C:\Users\msalomo\predictions-Churn-RandomForest2.csv', index = False, header=True)

'#months of data', 'Depreciation Start',
       'Industry (EC ID)_0614 InStore Food Service',
       'End Date in Local Time Zone_x', 
       'G/R/M TB_MTB (Market)', 'Position_#', 'Position_LOAN',
       'Incident Category_New Customer / Installation Point',
       'User Status_Installed', 'Model Group_Other',
       'User Status_To be removed', 'Service Category_Installation.',
       'Last_visit_diff_months', 'Model Vendor_SAI Vending',
       'Trading Partner_Direct', 'End Date in Local Time Zone_y',
       'Industry (EC ID)_0614 Convenience OOH',
       'System Brands_Nescafé Branded'

### Comments

Random Forest generally outperforms a Decision Tree on accuracy, which is also the case here. 
My focus is on the F1_macro score, and here too the Random Forest outperforms the Decision Tree.