# Capstone proposal by Mirko Salomon 

# BEVERAGE MACHINE CHURN PREDICTION

* ## [1) Introduction and preparation](#Introduction) 

    *   
        #### Data Load
        
        #### Hyperparameters

    
* ## [2) Logistic Regression model](#LogReg)
    
    *   
        #### Train the model
        
        #### Test data with best parameters 
        
        #### Confusion Matrix
        
        #### Prediction of churn

## 1) Introduction and preparation<a class="anchor" id="Introduction"></a>

In the course we saw that the output of a linear model can be converted into a probability using the logistic function, also called the sigmoid function.

This function maps large negative values of x to probabilities close to zero and large positive values to probabilities close to one.
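
As a small illustration of this behaviour, the sketch below evaluates the sigmoid at a few points with NumPy. The function name sigmoid and the sample inputs are chosen here for illustration and are not taken from the course material.

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs give values near 0, large positive inputs near 1,
# and an input of 0 gives exactly 0.5.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x))
```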

The model uses the logistic function to relate the input variables to the probability of the output class.

It can therefore be used to model the probability that a certain class or event occurs.

It is easier to explain and understand than many other models: the model coefficients can be interpreted as indicators of feature importance.

I need a churn prediction for each machine, and logistic regression returns one and is effective for binary classification tasks.

Logistic regression is sensitive to the scale of the features, so the features are standardized before fitting.

I want at least one version of the model based on regression.

In [1]:
import os

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

import pickle

import xlrd

import datetime as dt
from datetime import datetime

import collections
from collections import Counter

# Import seaborn
import seaborn as sns

In [2]:
#libraries
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import ParameterGrid

from sklearn import metrics 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

### Data Load

In [3]:
# Load the pickle file
with open('BM_noTickets_preprocess.p', 'rb') as file:
    BM_noTickets_preprocess = pickle.load(file)

In [4]:
# Load the pickle file
with open('X.p', 'rb') as file:
    X = pickle.load(file)

# Load the pickle file
with open('y.p', 'rb') as file:
    y = pickle.load(file)

# Load the pickle file
with open('X_tr.p', 'rb') as file:
    X_tr = pickle.load(file)

# Load the pickle file
with open('y_tr.p', 'rb') as file:
    y_tr = pickle.load(file)

# Load the pickle file
with open('X_val.p', 'rb') as file:
    X_val = pickle.load(file)

# Load the pickle file
with open('y_val.p', 'rb') as file:
    y_val = pickle.load(file)

# Load the pickle file
with open('X_te.p', 'rb') as file:
    X_te = pickle.load(file)

# Load the pickle file
with open('y_te.p', 'rb') as file:
    y_te = pickle.load(file)

I have already preprocessed the data in Notebook 03 - Base and features importance.

In [5]:
BM_noTickets_preprocess.head()

Unnamed: 0,TA Contract Installation Date,Depreciation Start,TA Contract Start Date,TA Contract End Date,#months of data,Churn,Service Category_Installation,Service Category_Installation.,Service Category_Removal,Service Category_Removal.,...,IP Ownership_Exclusive,IP Ownership_Non-Proprietary,IP Ownership_Propr. Comp.,IP Ownership_Proprietary,Trading Partner_%23-Unknown,Trading Partner_Direct,Trading Partner_EVS,G/R/M TB_GTB (Global),G/R/M TB_MTB (Market),G/R/M TB_RTB (Regional)
0,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,1.0,0.0,0.0,...,1,0,0,0,0,0,1,1,0,0
1,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,0.0,0.0,...,0,0,1,0,0,0,1,1,0,0
2,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,0.0,0.0,...,0,0,1,0,0,0,1,1,0,0
3,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,0.0,0.0,...,0,0,1,0,0,0,1,1,0,0
4,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,0.0,0.0,...,0,0,1,0,0,0,1,1,0,0


### Hyperparameters

Logistic Regression model and its hyperparameters:

class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

---


A linear regression model that uses L1 regularization is called Lasso regression, and one that uses L2 regularization is called Ridge regression. The same two penalties are available for logistic regression. I will try both.

'LnR__penalty' : ['l1', 'l2'], 

---

The LogisticRegression object has a regularization parameter C, the inverse of the regularization strength: smaller values mean stronger regularization. I will try different values.

'LnR__C' : np.logspace(-4, 4, num=10),
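
To illustrate the effect of C, the sketch below fits an L1-penalized model on synthetic data (generated with make_classification, not the beverage machine data), once with a small C and once with a large C, and counts the non-zero coefficients. The small C is expected to zero out more coefficients. The dataset size and the two C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's X_tr / y_tr)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

def count_nonzero_coefs(C):
    """Fit an L1-penalized logistic regression and count non-zero coefficients."""
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear', random_state=0)
    model.fit(X_demo, y_demo)
    return int(np.count_nonzero(model.coef_))

n_strong = count_nonzero_coefs(0.001)   # small C: strong regularization
n_weak = count_nonzero_coefs(100.0)     # large C: weak regularization
print('Non-zero coefficients with C=0.001:', n_strong)
print('Non-zero coefficients with C=100:', n_weak)
```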

---

I will use the one-versus-rest strategy.

'LnR__multi_class' : ['ovr'],

---


The liblinear solver supports both the L1 and the L2 penalty.

'LnR__solver' : ['liblinear'],

---

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data. I will try both "balanced" and None.

'LnR__class_weight' : ['balanced', None]
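
The balanced weights follow the formula n_samples / (n_classes * np.bincount(y)) given in the scikit-learn documentation. A minimal sketch on a toy label vector (8 negatives and 2 positives, chosen only for illustration) compares the manual formula with the compute_class_weight helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption: illustration only, not the notebook's y)
y_toy = np.array([0] * 8 + [1] * 2)

# Manual formula: n_samples / (n_classes * count per class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

# Same computation through scikit-learn
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy
)

print(manual)    # weight for class 0, then weight for class 1
print(weights)
```

The rarer class receives the larger weight, so its errors count more during fitting.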

## 2) Logistic Regression model<a class="anchor" id="LogReg"></a>

### Train the model

In [6]:
grid = ParameterGrid([
    #First with one-vs-rest (OvR) and liblinear
    {
        #'classifier' : [LogisticRegression()],
        'LnR__penalty' : ['l1', 'l2'],
        # LogisticRegression object has a C regularization parameter
        'LnR__C' : np.logspace(-4, 4, num=10),
        'LnR__multi_class' : ['ovr'],
        'LnR__solver' : ['liblinear'],
        'LnR__class_weight' : ['balanced', None]
    }
])

# Reduced grid: this second assignment replaces the grid defined above
grid = ParameterGrid([
    #First with one-vs-rest (OvR) and liblinear
    {
        #'classifier' : [LogisticRegression()],
        'LnR__penalty' : ['l1', 'l2'],
        # LogisticRegression object has a C regularization parameter
        'LnR__C' : np.logspace(-4, 4, num=4),
        'LnR__multi_class' : ['ovr'],
        'LnR__solver' : ['liblinear'],
        'LnR__class_weight' : [None]
    }
])

Logistic Regression was slower than the other models, so I did not keep all my trials. Below are some of the trials that I removed:

    #,
    #Results were not as good as with ovr strategy for multi_class
    #{
    #    #Second to test multinomial
    #    'LnR__penalty' : ['l2'],
    #    'LnR__C' : np.logspace(-4, 4, num=10),
    #    'LnR__multi_class' : ['multinomial'],
    #    'LnR__solver' : ['lbfgs'],
    #    #'LnR__class_weight' : ['balanced', None] 
    #    #results were worse with balanced class weight, they were removed to speed up the process
    #    'LnR__class_weight' : [None]
    #}
    
    #Results were bad and the program was really slow, so I did not keep this possibility
    #,
    #{    
    #    #Elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso 
    #    #and ridge methods. But I get an error because it is only available from scikit-learn 0.21 and 
    #    #I have version 0.20.3. I will keep the course version so that we stay aligned
    #    'LnR__C' : np.logspace(-4, 4, num=10),
    #    'LnR__multi_class' : ['auto'],
    #    'LnR__solver' : ['saga'],
    #    'LnR__class_weight' : ['balanced', None]
    #}

# Print the number of combinations
print('Number of combinations:', len(grid))

# Iterate through each combination of parameters
for params_dict in grid:
    print(params_dict)
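
The number of combinations is the product of the lengths of the lists in the grid. For the reduced grid (2 penalties, 4 values of C, and a single value for each remaining parameter) this gives 2 x 4 x 1 x 1 x 1 = 8 fits. A self-contained check that rebuilds the same dictionary under the illustrative name demo_grid:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

demo_grid = ParameterGrid([
    {
        'LnR__penalty': ['l1', 'l2'],
        'LnR__C': np.logspace(-4, 4, num=4),
        'LnR__multi_class': ['ovr'],
        'LnR__solver': ['liblinear'],
        'LnR__class_weight': [None],
    }
])

# 2 penalties * 4 C values * 1 * 1 * 1
n_combinations = len(demo_grid)
print('Number of combinations:', n_combinations)
```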

I got this warning:

    143: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
    
The model predicted no samples for one label, so the F-score for that label cannot be computed and is set to 0.0.
Since I requested an average of the scores, that 0 is included in the calculation, which is why scikit-learn shows this warning.
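
A toy case reproduces the situation: when a classifier never predicts one of the labels, the precision for that label is undefined and its F-score is set to 0, which pulls the macro average down. The sketch below uses made-up labels. The zero_division argument of f1_score exists in recent scikit-learn releases, which may be newer than the version used in this notebook; it makes the choice explicit and silences the warning.

```python
from sklearn.metrics import f1_score

# Made-up labels: the classifier never predicts class 1
y_true_demo = [0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0]

# Per-class F1: class 0 has precision 0.5 and recall 1.0, class 1 gets 0
per_class = f1_score(y_true_demo, y_pred_demo, average=None, zero_division=0)
macro = f1_score(y_true_demo, y_pred_demo, average='macro', zero_division=0)

print('Per-class F1:', per_class)
print('Macro F1:', macro)
```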

I also got this warning:

    977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
    
Increasing the number of iterations sometimes solved it, but made the fit even slower.

When an optimization algorithm does not converge, it is usually because the problem is not well-conditioned.

The warning does not appear in every trial, so some parameter combinations converge and others do not. If I keep only the top results and the final fit evaluated on the test data does not raise the warning, the result should be reliable.

To avoid seeing the warning repeatedly, I can do this:
    #import warnings
    #warnings.filterwarnings('ignore')  # "error", "ignore", "always", "default", "module" or "once"
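
A narrower alternative, sketched here, silences only the convergence warning and only around a single fit, so other warnings remain visible. The helper name fit_quietly is hypothetical.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

def fit_quietly(estimator, X, y):
    """Fit an estimator while ignoring ConvergenceWarning only."""
    with warnings.catch_warnings():
        # The filter is restored when the with-block exits
        warnings.simplefilter('ignore', category=ConvergenceWarning)
        return estimator.fit(X, y)
```

It could be called as fit_quietly(pipe, X_tr, y_tr) in place of pipe.fit(X_tr, y_tr).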

clf = LogisticRegression(random_state=0)
#.fit(X_tr, y_tr)

In [None]:
# Create pipeline, Logistic regression classifier
pipe = Pipeline([
    #('oversample', SmoteSample_model),
    ('scaler', StandardScaler()),
    ('LnR', LogisticRegression(random_state=0        
        # Trying to solve convergence error
        ,max_iter =2000
    ))
    ]
)
   
# Collect the scores obtained on the validation set
test_scores = []

for params_dict in grid:
    # Set parameters
    pipe.set_params(**params_dict)

    # Fit the logistic regression pipeline
    pipe.fit(X_tr, y_tr)

    # Save accuracy on validation set
    params_dict['accuracy'] = pipe.score(X_val, y_val)
    # Predict the validation instances and save the macro F1 score
    y_pred = pipe.predict(X_val)
    params_dict['f1_macro'] = metrics.f1_score(y_val, y_pred, average='macro')
        
    # Save result
    test_scores.append(params_dict)
    
# Create DataFrame with validation scores
scores_df = pd.DataFrame(test_scores)

# Top five scores
scores_df.sort_values(by='f1_macro', ascending=False).head()

We have an F1 Macro score of 81% and an accuracy of 93% with the validation data.

### Test data with best parameters 

In [7]:
params_dict = {'LnR__C': 0.359381, 
 'LnR__class_weight': None, 
 'LnR__multi_class': 'ovr', 
 'LnR__penalty': 'l1', 
 'LnR__solver': 'liblinear'}

pipe = Pipeline([
    #('oversample', SmoteSample_model),
    ('scaler', StandardScaler()),
    ('LnR', LogisticRegression(random_state=0
    ))
    ]
)
   
# Collect the scores obtained on the test set
test_scores = []


pipe.set_params(**params_dict)

# Fit the logistic regression pipeline
pipe.fit(X_tr, y_tr)

# Save accuracy on test set
params_dict['accuracy'] = pipe.score(X_te, y_te)
# Predict the test instances and save the macro F1 score
y_pred = pipe.predict(X_te)
params_dict['f1_macro'] = metrics.f1_score(y_te, y_pred, average='macro')
        
# Save result
test_scores.append(params_dict)
    
# Create DataFrame with test scores
scores_df = pd.DataFrame(test_scores)

# Display the test scores
scores_df.sort_values(by='f1_macro', ascending=False)

Unnamed: 0,LnR__C,LnR__class_weight,LnR__multi_class,LnR__penalty,LnR__solver,accuracy,f1_macro
0,0.359381,,ovr,l1,liblinear,0.849443,0.816219


On the test data we have an F1 macro score of about 82% and an accuracy of about 85%, as shown in the table above. The F1 macro score is close to the validation result.

In [8]:
# F1_score
LogReg_F1Macro = params_dict['f1_macro']

# Accuracy
LogReg_accuracy = params_dict['accuracy']

In [9]:
# Load the npz file for results
with np.load('Results.npz', allow_pickle=False) as npz_file:
    F1Score = npz_file['test_F1Score']
    accuracy = npz_file['test_accuracy']
    models = npz_file['models']
    
# Fill the calculated result value
F1Score[1] = LogReg_F1Macro
accuracy[1] = LogReg_accuracy

print('F1_Results:', F1Score)
print('Accuracy:', accuracy)

F1_Results: [0.44459684 0.81621901 0.         0.         0.98837332 0.
 0.         0.        ]
Accuracy: [0.71013115 0.84944262 0.         0.         0.99029508 0.
 0.         0.        ]


In [10]:
# Modify the Numpy array
#Model = models #unchanged
Result = F1Score
Accuracy = accuracy

# Store the changes in the results npz file
np.savez('Results.npz', models = models, test_F1Score = Result,  test_accuracy = Accuracy)

In [11]:
# Check the refreshed results
# Load the npz file for results
with np.load('Results.npz', allow_pickle=False) as npz_file:
    # It's a dictionary-like object    
    print(list(npz_file.keys()))
    # Load the arrays
    print('Models:', npz_file['models'])
    print('F1_Results:', npz_file['test_F1Score'])
    print('Accuracy:', npz_file['test_accuracy'])

['models', 'test_F1Score', 'test_accuracy']
Models: ['Baseline' 'Logistic Regression' 'KNeighbors' 'Decision Tree'
 'Random Forest' 'XGBoost' 'SelectedModel_wTickets'
 'SelectedModel_wTickets&Telemetry']
F1_Results: [0.44459684 0.81621901 0.         0.         0.98837332 0.
 0.         0.        ]
Accuracy: [0.71013115 0.84944262 0.         0.         0.99029508 0.
 0.         0.        ]


### Confusion Matrix

In [12]:
#Classification reports and confusion matrices are commonly used for reporting the performance of classifiers when working with imbalanced datasets.

# Print the confusion matrix
print(metrics.confusion_matrix(y_te, y_pred))

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_te, y_pred, digits=3))

[[19438  1936]
 [ 2656  6470]]
              precision    recall  f1-score   support

       False      0.880     0.909     0.894     21374
        True      0.770     0.709     0.738      9126

    accuracy                          0.849     30500
   macro avg      0.825     0.809     0.816     30500
weighted avg      0.847     0.849     0.848     30500



In [13]:
#Classification reports and confusion matrices are commonly used for reporting the performance of classifiers when working with imbalanced datasets.
ConfMat_df = pd.DataFrame(metrics.confusion_matrix(y_te, y_pred).T, columns=['False Condition', 'True Condition'],index=['Predicted False', 'Predicted True'])
ConfMat_df

Unnamed: 0,False Condition,True Condition
Predicted False,19438,2656
Predicted True,1936,6470


We have a good F1 score for the False condition (0.894) but a lower one for the True condition (0.738). This explains why the F1 macro score (0.816) is lower than the accuracy (0.849).
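
The reported scores can be recomputed directly from the confusion matrix printed above, which makes the link between the per-class F1 scores, the macro average and the accuracy explicit. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix from the test set (rows: actual, columns: predicted)
cm = np.array([[19438, 1936],
               [2656, 6470]])

# Per-class precision, recall and F1
precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)

# Macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = f1.mean()
accuracy_check = np.trace(cm) / cm.sum()

print('F1 per class (False, True):', f1.round(3))
print('Macro F1:', round(macro_f1, 3))
print('Accuracy:', round(accuracy_check, 3))
```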

In [14]:
# Save the Dataframe into a pickle file
with open('Logistic Regression_ConfMat_df.p', 'wb') as file:
    pickle.dump(ConfMat_df, file)

### Prediction of churn

Let's predict the churn probability for each machine.

In [15]:
# Need the one in the same format, so with removed Sales Org
# Load the pickle file
with open('BeverageMachine_withSerial.p', 'rb') as file:
    BeverageMachine_withSerial = pickle.load(file)

In [16]:
name = ['False', 'True']
No=BeverageMachine_withSerial['Serial ID']
predictions = pipe.predict_proba(X)
# With two column indices, values same  
# as dictionary keys 
df2 = pd.DataFrame(predictions, index=No ,columns = name) 
df2.head()

Unnamed: 0_level_0,False,True
Serial ID,Unnamed: 1_level_1,Unnamed: 2_level_1
HK10020983,0.99967,0.00033
120604679,0.441897,0.558103
120604691,0.441897,0.558103
120604690,0.424862,0.575138
120604688,0.441897,0.558103
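
The column names 'False' and 'True' are hard-coded in the cell above. predict_proba returns its columns in the order of the fitted estimator's classes_ attribute, so the labels can also be read from the model. A minimal sketch on synthetic boolean labels (the data and the names demo_pipe and proba_df are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and y (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0   # boolean labels

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('LnR', LogisticRegression(random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# Column order of predict_proba follows demo_pipe.classes_
proba_df = pd.DataFrame(
    demo_pipe.predict_proba(X_demo),
    columns=[str(c) for c in demo_pipe.classes_],
)
print(proba_df.head())
```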


In [17]:
X

array([[1522.0, 2178.0, 1522.0, ..., 1, 0, 0],
       [1522.0, 2178.0, 1522.0, ..., 1, 0, 0],
       [1522.0, 2178.0, 1522.0, ..., 1, 0, 0],
       ...,
       [2167.0, 2178.0, 2167.0, ..., 1, 0, 0],
       [2407.0, 2178.0, 2566.0, ..., 1, 0, 0],
       [2040.0, 2178.0, 2040.0, ..., 1, 0, 0]], dtype=object)

In [18]:
predictions

array([[9.99670179e-01, 3.29821446e-04],
       [4.41896527e-01, 5.58103473e-01],
       [4.41896527e-01, 5.58103473e-01],
       ...,
       [1.93969285e-11, 1.00000000e+00],
       [2.78317196e-03, 9.97216828e-01],
       [1.35167499e-08, 9.99999986e-01]])

In [19]:
df3 = df2.reset_index(level=None)
df3

Unnamed: 0,Serial ID,False,True
0,HK10020983,9.996702e-01,0.000330
1,120604679,4.418965e-01,0.558103
2,120604691,4.418965e-01,0.558103
3,120604690,4.248624e-01,0.575138
4,120604688,4.418965e-01,0.558103
...,...,...,...
152494,143236219,4.441164e-09,1.000000
152495,121410696,3.750712e-07,1.000000
152496,131410911,1.939693e-11,1.000000
152497,114736158,2.783172e-03,0.997217


In [20]:
BeverageMachine_withSerial.iloc[:,:200]

Unnamed: 0,Serial ID,Machine Status Groupings,TA Contract Installation Date,Depreciation Start,TA Contract Start Date,TA Contract End Date,#months of data,Churn,Service Category_Installation,Service Category_Installation.,...,TA Usage Indicator_5 Monthly Rental,TA Usage Indicator_7 Annual / Periodic,TA Usage Indicator_Not assigned,TA Usage Indicator_Trial / Evaluation,Account ABC Classification (Account ID)_01 Planned Shopping,Account ABC Classification (Account ID)_02 Ad Hoc Convenience,Account ABC Classification (Account ID)_03 Spec Food & Drink,Account ABC Classification (Account ID)_04 Spec Non Food,Account ABC Classification (Account ID)_05 Service Types,Account ABC Classification (Account ID)_06 Out of Home
0,HK10020983,Deployed,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,1.0,...,0,0,1,0,0,0,0,0,0,1
1,120604679,Deployed,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,...,0,0,1,0,0,0,0,0,0,0
2,120604691,Deployed,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,...,0,0,1,0,0,0,0,0,0,0
3,120604690,Deployed,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,...,0,0,1,0,0,0,0,0,0,0
4,120604688,Deployed,1522.0,2178.0,1522.0,412.785965,0.0,False,0.0,0.0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156994,143236219,Deployed,2458.0,2178.0,2457.0,-3652.000000,0.0,True,0.0,0.0,...,1,0,0,0,0,1,0,0,0,0
156995,121410696,Deployed,1729.0,2178.0,2321.0,-3652.000000,0.0,True,0.0,0.0,...,1,0,0,0,0,1,0,0,0,0
156996,131410911,Deployed,2167.0,2178.0,2167.0,-3652.000000,0.0,True,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
156997,114736158,Deployed,2407.0,2178.0,2566.0,580.000000,0.0,True,0.0,0.0,...,1,0,0,0,0,0,0,1,0,0


In [21]:
df21 = pd.DataFrame(BeverageMachine_withSerial[['Serial ID','Churn']]).reset_index(level=None)
df21

Unnamed: 0,index,Serial ID,Churn
0,0,HK10020983,False
1,1,120604679,False
2,2,120604691,False
3,3,120604690,False
4,4,120604688,False
...,...,...,...
152494,156994,143236219,True
152495,156995,121410696,True
152496,156996,131410911,True
152497,156997,114736158,True


In [22]:
df4 = pd.merge(df3, df21, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
df4

Unnamed: 0,Serial ID,False,True,index,Churn
0,HK10020983,9.996702e-01,0.000330,0,False
1,120604679,4.418965e-01,0.558103,1,False
2,120604691,4.418965e-01,0.558103,2,False
3,120604690,4.248624e-01,0.575138,3,False
4,120604688,4.418965e-01,0.558103,4,False
...,...,...,...,...,...
152572,143236219,4.441164e-09,1.000000,156994,True
152573,121410696,3.750712e-07,1.000000,156995,True
152574,131410911,1.939693e-11,1.000000,156996,True
152575,114736158,2.783172e-03,0.997217,156997,True
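
The merged table ends at row index 152575 while df3 ends at 152497, so the left join produced more rows than df3 contains. One possible explanation, not verified here, is that some Serial ID values appear more than once, in which case a merge on that key multiplies the matching rows. The toy frames below (made-up IDs) show the effect and one way to detect it:

```python
import pandas as pd

# Toy frames with a repeated key 'B' on both sides (made-up values)
left = pd.DataFrame({'Serial ID': ['A', 'B', 'B'], 'True': [0.1, 0.6, 0.7]})
right = pd.DataFrame({'Serial ID': ['A', 'B', 'B'], 'Churn': [False, True, True]})

# Each left 'B' row matches both right 'B' rows
merged = pd.merge(left, right, how='left', on='Serial ID')
print(len(left), len(merged))

# Detect repeated keys before merging
n_duplicated = int(left['Serial ID'].duplicated().sum())
print('Repeated Serial IDs in left:', n_duplicated)
```

pd.merge also accepts a validate argument (for example validate='one_to_one') that raises an error when the keys are not unique.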


In [23]:
df4.to_csv(r'C:\Users\msalomo\predictions-Churn-LogReg.csv', index = False, header=True)