# Capstone proposal by Mirko Salomon 

# BEVERAGE MACHINE CHURN PREDICTION

* ## [1) Introduction and preparation](#Introduction) 

    *   
        #### Data Load
        
        #### Hyperparameters

    
* ## [2) KNeighbors model](#Knn)
    
    *   
        #### Train the model
        
        #### Test data with best parameters 
        
        #### Confusion Matrix
        
        #### Prediction of churn

## 1) Introduction and preparation<a class="anchor" id="Introduction"></a>

Classification based on k-nearest neighbors.

The k-NN algorithm assumes that similar things exist in close proximity. Source: Towards Data Science

K-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, normalizing the training data can improve its accuracy dramatically. Source: Wikipedia
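
To make the point about normalization concrete, here is a minimal sketch on made-up numbers (the toy array below is hypothetical and not taken from the beverage machine data). Column 0 stands for a large-scale numeric feature and column 1 for a 0/1 dummy variable; the nearest neighbor of the first row changes once both columns are standardized.

```python
import numpy as np

# Hypothetical toy data: column 0 is a large-scale numeric feature,
# column 1 is a 0/1 dummy variable. Row 0 is the query point.
X_toy = np.array([
    [1000.0, 0.0],   # row 0: query
    [1000.0, 1.0],   # row 1: same numeric value, different dummy
    [1020.0, 0.0],   # row 2: slightly different numeric value, same dummy
    [1500.0, 1.0],
    [500.0, 0.0],
])

def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

# Standardize each column (zero mean, unit variance), as StandardScaler does
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

raw_nn = nearest_to_first(X_toy)
scaled_nn = nearest_to_first(X_scaled)
print('Nearest row on raw data:', raw_nn)
print('Nearest row on standardized data:', scaled_nn)
```

On the raw data, the dummy flip (a difference of 1) looks smaller than a 20-unit change in the numeric column, so row 1 wins. After standardization, 20 units are a small fraction of a standard deviation, so row 2 becomes the nearest neighbor.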

Based on the scikit-learn algorithm flowchart, a KNeighbors classifier is a reasonable model to try: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

This model is easy to understand and interpret, and it is fast to implement.
Because k-NN is a lazy learner, fitting only stores the training data, so new data can be added without retraining; the cost is paid at prediction time instead.
The two main parameters to tune are the number of neighbors and the distance metric (e.g. Euclidean or Manhattan).

It may perform poorly with a large number of dimensions, and with categorical features it becomes difficult to compute a meaningful distance in each dimension.


In [1]:
# Standard library
import os
import pickle
import collections
import datetime as dt
from datetime import datetime
from collections import Counter

# Data handling
import numpy as np
import pandas as pd
import xlrd

# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Model selection
from sklearn.model_selection import GridSearchCV


In [2]:
#libraries
from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

from sklearn.model_selection import ParameterGrid

from sklearn import metrics 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### Data Load

In [3]:
# Load the pickle file
with open('BM_noTickets_preprocess.p', 'rb') as file:
    BM_noTickets_preprocess = pickle.load(file)

In [4]:
# Load the pickle file
with open('X.p', 'rb') as file:
    X = pickle.load(file)

# Load the pickle file
with open('y.p', 'rb') as file:
    y = pickle.load(file)

# Load the pickle file
with open('X_tr.p', 'rb') as file:
    X_tr = pickle.load(file)

# Load the pickle file
with open('y_tr.p', 'rb') as file:
    y_tr = pickle.load(file)

# Load the pickle file
with open('X_val.p', 'rb') as file:
    X_val = pickle.load(file)

# Load the pickle file
with open('y_val.p', 'rb') as file:
    y_val = pickle.load(file)

# Load the pickle file
with open('X_te.p', 'rb') as file:
    X_te = pickle.load(file)

# Load the pickle file
with open('y_te.p', 'rb') as file:
    y_te = pickle.load(file)

In [5]:
BM_noTickets_preprocess.head()

| Unnamed: 0 | TA Contract Installation Date | Depreciation Start | TA Contract Start Date | TA Contract End Date | #months of data | Churn | Service Category_Installation | Service Category_Installation. | Service Category_Removal | Service Category_Removal. | ... | IP Ownership_Exclusive | IP Ownership_Non-Proprietary | IP Ownership_Propr. Comp. | IP Ownership_Proprietary | Trading Partner_%23-Unknown | Trading Partner_Direct | Trading Partner_EVS | G/R/M TB_GTB (Global) | G/R/M TB_MTB (Market) | G/R/M TB_RTB (Regional) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 990.0 | 457.0 | 952.0 | -79.931807 | 9.95503 | False | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 990.0 | 1158.0 | 952.0 | -79.931807 | 7.950882 | False | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 990.0 | 1188.0 | 952.0 | -79.931807 | 7.950882 | False | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 990.0 | 1188.0 | 952.0 | -79.931807 | 7.950882 | False | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 990.0 | 1188.0 | 952.0 | -79.931807 | 7.950882 | False | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |


### Hyperparameters

Signature from the scikit-learn documentation: `class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)`

---

n_neighbors: try several values for the number of neighbors, from 1 to 20.
k_values = np.arange(1, 21)

---

weights: the weight function used in prediction.

‘uniform’: uniform weights. All points in each neighborhood are weighted equally.

‘distance’: weight points by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors which are further away.

weights_functions = ['uniform', 'distance']
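
As a small illustration of the two weighting schemes, the sketch below votes over one hypothetical neighborhood by hand (the distances and labels are made up for the example; scikit-learn does the equivalent internally). With uniform weights the majority class wins; with inverse-distance weights a single very close neighbor can outvote two distant ones.

```python
import numpy as np

# Hypothetical neighborhood of one query point: 3 neighbors,
# their distances to the query and their class labels
distances = np.array([0.5, 2.0, 2.5])
labels = np.array([1, 0, 0])

def vote(weights):
    """Return the class with the largest total weight among the neighbors."""
    totals = np.bincount(labels, weights=weights)
    return int(np.argmax(totals))

uniform_vote = vote(np.ones_like(distances))  # every neighbor counts 1
distance_vote = vote(1.0 / distances)         # closer neighbors count more

print('uniform :', uniform_vote)
print('distance:', distance_vote)
```

Here the uniform vote picks class 0 (two votes against one), while the distance-weighted vote picks class 1 (weight 2.0 against 0.5 + 0.4).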

---

p: power parameter of the Minkowski metric. p = 1 is equivalent to the Manhattan distance (l1) and p = 2 to the Euclidean distance (l2).

distance_types = [1, 2]
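
The effect of p can be checked by hand. The helper below is a hypothetical illustration written for this notebook (it is not a scikit-learn function); it computes the Minkowski distance of order p between two vectors.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print('p = 1 (Manhattan):', manhattan)
print('p = 2 (Euclidean):', euclidean)
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of p can change which neighbors are considered closest.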

## 2) KNeighbors model<a class="anchor" id="Knn"></a>

### Train the model

In [6]:
# Define a set of reasonable values
k_values = np.arange(1, 21) # 1, 2, 3, .., 20
weights_functions = ['uniform', 'distance']
distance_types = [1, 2] #When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2

# Define a grid of values
grid = ParameterGrid({
    'knn__n_neighbors': k_values,
    'knn__weights': weights_functions,
    'knn__p': distance_types
})

In [7]:
# Print the number of combinations
print('Number of combinations:', len(grid))

# Iterate through each combination of parameters
for params_dict in grid:
    print(params_dict)

Number of combinations: 80
{'knn__n_neighbors': 1, 'knn__p': 1, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 1, 'knn__p': 1, 'knn__weights': 'distance'}
{'knn__n_neighbors': 1, 'knn__p': 2, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 1, 'knn__p': 2, 'knn__weights': 'distance'}
{'knn__n_neighbors': 2, 'knn__p': 1, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 2, 'knn__p': 1, 'knn__weights': 'distance'}
{'knn__n_neighbors': 2, 'knn__p': 2, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 2, 'knn__p': 2, 'knn__weights': 'distance'}
{'knn__n_neighbors': 3, 'knn__p': 1, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 3, 'knn__p': 1, 'knn__weights': 'distance'}
{'knn__n_neighbors': 3, 'knn__p': 2, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 3, 'knn__p': 2, 'knn__weights': 'distance'}
{'knn__n_neighbors': 4, 'knn__p': 1, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 4, 'knn__p': 1, 'knn__weights': 'distance'}
{'knn__n_neighbors': 4, 'knn__p': 2, 'knn__weights': 'uniform'}
{'knn_

In [None]:
# Create k-NN classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Collect the validation scores of each parameter combination
test_scores = []

for params_dict in grid:
    # Set parameters
    pipe.set_params(**params_dict)

    # Fit a k-NN classifier on the training set
    pipe.fit(X_tr, y_tr)

    # Save accuracy on validation set
    params_dict['accuracy'] = pipe.score(X_val, y_val)

    # Predict the validation instances and save the macro F1 score
    y_pred = pipe.predict(X_val)
    params_dict['f1_macro'] = metrics.f1_score(y_val, y_pred, average='macro')

    # Save result
    test_scores.append(params_dict)

In [None]:
# Create DataFrame with the validation scores
scores_df = pd.DataFrame(test_scores)

# Top five combinations by macro F1 score
scores_df.sort_values(by='f1_macro', ascending=False).head()
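
For comparison, the same kind of search over a fixed train/validation split could also be written with `GridSearchCV` and `PredefinedSplit`. This is only a sketch under assumptions: it runs on synthetic data from `make_classification` rather than the notebook's features, all `*_demo` names are made up for the example, and `refit=False` is chosen so that the candidates are only ranked, not refitted on train plus validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# -1 marks rows that are only ever used for training, 0 marks the single validation fold
fold = np.concatenate([np.full(len(X_tr_demo), -1),
                       np.zeros(len(X_val_demo), dtype=int)])
X_all = np.vstack([X_tr_demo, X_val_demo])
y_all = np.concatenate([y_tr_demo, y_val_demo])

pipe_demo = Pipeline([('scaler', StandardScaler()),
                      ('knn', KNeighborsClassifier())])

search = GridSearchCV(
    pipe_demo,
    param_grid={
        'knn__n_neighbors': np.arange(1, 21),
        'knn__weights': ['uniform', 'distance'],
        'knn__p': [1, 2],
    },
    scoring='f1_macro',
    cv=PredefinedSplit(fold),
    refit=False,  # only rank the candidates
)
search.fit(X_all, y_all)

print('Best parameters:', search.best_params_)
print('Best validation macro F1:', round(search.best_score_, 3))
```

The manual loop gives more control over which scores are stored for each combination; the `GridSearchCV` version is shorter and keeps all results in `search.cv_results_`.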

### Test data with best parameters 

In [None]:
params_dict = {'knn__n_neighbors': 2, 
 'knn__p': 1, 
 'knn__weights': 'uniform'}

# Create pipeline, kNN classifier
pipe = Pipeline([
    #('oversample', SmoteSample_model),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
    ]
)
   
# Collect the scores on the test set
test_scores = []

# Set parameters
pipe.set_params(**params_dict)

# Fit a k-NN classifier on the training set
pipe.fit(X_tr, y_tr)

# Save accuracy on test set
params_dict['accuracy'] = pipe.score(X_te, y_te)

# Predict the test instances and save the macro F1 score
y_pred = pipe.predict(X_te)
params_dict['f1_macro'] = metrics.f1_score(y_te, y_pred, average='macro')

# Save result
test_scores.append(params_dict)

# Create DataFrame with test scores
scores_df = pd.DataFrame(test_scores)

# Show the test scores for the selected parameters (a single row)
scores_df

On the test data, the model reaches a macro F1 score of 82% and an accuracy of 93%, which is slightly better than the results on the validation data.

In [None]:
# F1_score
knn_F1Macro = params_dict['f1_macro']

# Accuracy
knn_accuracy = params_dict['accuracy']

In [None]:
# Load the npz file for results
with np.load('Results.npz', allow_pickle=False) as npz_file:
    F1Score = npz_file['test_F1Score']
    accuracy = npz_file['test_accuracy']
    models = npz_file['models']
    
# Fill the calculated result value
F1Score[2] = knn_F1Macro
accuracy[2] = knn_accuracy

print('F1_Results:', F1Score)
print('Accuracy:', accuracy)

In [None]:
# Modify the Numpy array
#Model = models #unchanged
Result = F1Score
Accuracy = accuracy

# Store the changes in the results npz file
np.savez('Results.npz', models = models, test_F1Score = Result,  test_accuracy = Accuracy)

In [None]:
# Check the refreshed results
# Load the npz file for results
with np.load('Results.npz', allow_pickle=False) as npz_file:
    # It's a dictionary-like object    
    print(list(npz_file.keys()))
    # Load the arrays
    print('Models:', npz_file['models'])
    print('F1_Results:', npz_file['test_F1Score'])
    print('Accuracy:', npz_file['test_accuracy'])

### Confusion Matrix

In [None]:
#Classification reports and confusion matrices are commonly used for reporting the performance of classifiers when working with imbalanced datasets.

# Print the confusion matrix
print(metrics.confusion_matrix(y_te, y_pred))

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_te, y_pred, digits=3))

In [None]:
# Confusion matrix as a DataFrame, transposed so that rows are predictions and columns are true conditions
ConfMat_df = pd.DataFrame(metrics.confusion_matrix(y_te, y_pred).T, columns=['False Condition', 'True Condition'], index=['Predicted False', 'Predicted True'])
ConfMat_df

The F1 score is good for the False condition but weaker for the True condition. Because the macro F1 score is the unweighted mean of the per-class F1 scores, this explains why it is lower than the accuracy, which is dominated by the majority class.
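
The relationship between accuracy and macro F1 can be worked through on a small example. The confusion matrix below is hypothetical (it is not the matrix obtained above); it uses the layout of `sklearn.metrics.confusion_matrix`, with true classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical confusion matrix (rows: true class, columns: predicted class)
cm = np.array([[85, 5],
               [6, 4]])

tp = np.diag(cm)              # correct predictions per class
fp = cm.sum(axis=0) - tp      # predicted as the class but actually another one
fn = cm.sum(axis=1) - tp      # belonging to the class but predicted as another one

f1_per_class = 2 * tp / (2 * tp + fp + fn)
accuracy = tp.sum() / cm.sum()
f1_macro = f1_per_class.mean()

print('F1 per class:', f1_per_class.round(3))
print('Accuracy:', round(accuracy, 3))
print('Macro F1:', round(f1_macro, 3))
```

In this made-up case the accuracy is 0.89, but the minority class has an F1 of only about 0.42, which pulls the macro F1 down to about 0.68.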

In [None]:
# Save the Dataframe into a pickle file
with open('KNeighbors_ConfMat_df.p', 'wb') as file:
    pickle.dump(ConfMat_df, file)

### Prediction of churn

Let's predict the churn probability for each machine.

In [None]:
# Load the dataset in the same format as X (Sales Org removed);
# it provides the Serial ID of each machine
with open('BM_noTicketsWOSO.p', 'rb') as file:
    BM_noTicketsWOSO = pickle.load(file)

In [None]:
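# Column names for the two classes (assumed to follow the order of pipe.classes_)
name = ['False', 'True']
# Serial ID of each machine, used as the index
No = BM_noTicketsWOSO['Serial ID']
# Predicted class probabilities for every machine
predictions = pipe.predict_proba(X)
# One row per machine, one probability column per class
df2 = pd.DataFrame(predictions, index=No, columns=name)
df2.head()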
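
The columns returned by `predict_proba` follow the order of the fitted `classes_` attribute, so the column names can be derived from it instead of being typed by hand. A minimal sketch on made-up data (the toy arrays, the `pipe_toy` pipeline and the serial IDs are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature and a boolean churn label
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([False, False, False, True, True, True])
serial_ids = pd.Index(['A1', 'A2'], name='Serial ID')  # made-up identifiers

pipe_toy = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=3))])
pipe_toy.fit(X_toy, y_toy)

# Probabilities for two new machines; columns are ordered like pipe_toy.classes_
proba = pipe_toy.predict_proba(np.array([[0.05], [1.15]]))
proba_df = pd.DataFrame(proba, index=serial_ids,
                        columns=[str(c) for c in pipe_toy.classes_])
print(proba_df)
```

With k neighbors and uniform weights, each probability is simply the share of the k nearest neighbors belonging to that class, so the values come in steps of 1/k.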