# JupyterHub Notebook

### This notebook server is hosted on the OpenShift platform, which provides a separate server for each user. The platform takes care of provisioning the server and allocating its storage.

### First, install and import the required libraries and watermark our file to show which libraries and versions we're using. Then define utility functions to integrate with our object storage and the _Verta_ visualisation server.

In [1]:
import os
# os.environ["MODIN_ENGINE"] = "ray"


In [2]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# import modin.pandas as pd
import seaborn as sns
import watermark

from datetime import datetime

from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

import verta.integrations.sklearn
from verta import Client
from minio import Minio
from minio.error import ResponseError

# import tools as tools
%matplotlib inline
%load_ext watermark

In [3]:
%watermark -n -v -m -g -iv


verta      0.16.0
watermark  2.0.2
pandas     1.1.5
numpy      1.19.4
seaborn    0.11.0
matplotlib 3.3.3
Wed Dec 16 2020 

CPython 3.6.8
IPython 7.16.1

compiler   : GCC 8.3.1 20191121 (Red Hat 8.3.1-5)
system     : Linux
release    : 4.18.0-193.29.1.el8_2.x86_64
machine    : x86_64
processor  : x86_64
CPU cores  : 32
interpreter: 64bit
Git hash   : df2a872e605c3b4fb5ba00b5e1363464b6f84b06


### In this next section, on the third line, change experiment_name by appending your username to _customerchurn_, e.g., if your username is user1: 
#### experiment_name = "customerchurn"+"user1"

In [4]:
dateTimeObj = datetime.now()
timestampStr = dateTimeObj.strftime("%d%Y%H%M%S%f")
experiment_name = "customerchurn"+"user29"
experiment_id = experiment_name + timestampStr

def get_s3_server():
    minioClient = Minio('minio-ml-workshop:9000',
                    access_key='minio',
                    secret_key='minio123',
                    secure=False)

    return minioClient

def get_verta():
    client = Client("http://v1-webapp:3000")
    return client

def get_meta_store():
    client = get_verta()
    proj = client.set_project("ml-workshop")
    client.set_experiment(experiment_name)
    run = client.set_experiment_run(experiment_id)
    return run
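
The connection details above are hardcoded for the workshop. As a minimal sketch of an alternative, the settings could be read from environment variables with the workshop values as defaults. The helper `load_connection_settings` and the variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY` and `VERTA_URL` are hypothetical and not part of the workshop setup.

```python
import os


def load_connection_settings(env=None):
    """Return connection settings, falling back to the workshop defaults.

    The environment variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "minio_endpoint": env.get("MINIO_ENDPOINT", "minio-ml-workshop:9000"),
        "minio_access_key": env.get("MINIO_ACCESS_KEY", "minio"),
        "minio_secret_key": env.get("MINIO_SECRET_KEY", "minio123"),
        "verta_url": env.get("VERTA_URL", "http://v1-webapp:3000"),
    }


# Override only the endpoint; everything else keeps its default.
settings = load_connection_settings({"MINIO_ENDPOINT": "localhost:9000"})
print(settings["minio_endpoint"])
print(settings["verta_url"])
```

The returned values could then be passed to `Minio(...)` and `Client(...)` in place of the string literals.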




### In this next section, on the second line, insert the value you retrieved from MinIO object storage earlier. It is the fully qualified name of your CSV file in MinIO, pushed by the data engineer in the format: full_data_csv{USERNAME}/{FILENAME}.csv.
#### In my case this value is: full_data_csvuser29/part-00000-59149e08-583c-46a5-bfa0-0b3abecbf1a3-c000.csv (yours will be different)
### We refer to this fully qualified name in the GitHub instructions as CSV-FILE
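
To make the naming format concrete, here is a small sketch that assembles the object name from a username and a file name. The helper `build_csv_object_name` is hypothetical; it only illustrates the `full_data_csv{USERNAME}/{FILENAME}.csv` pattern described above.

```python
def build_csv_object_name(username, filename):
    """Build an object name of the form full_data_csv{USERNAME}/{FILENAME}.csv."""
    if not filename.endswith(".csv"):
        filename += ".csv"
    return f"full_data_csv{username}/{filename}"


csv_file = build_csv_object_name(
    "user29", "part-00000-59149e08-583c-46a5-bfa0-0b3abecbf1a3-c000"
)
print(csv_file)
```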

In [5]:
minioClient = get_s3_server()
data_file = minioClient.fget_object("data", "full_data_csvuser29/part-00000-59149e08-583c-46a5-bfa0-0b3abecbf1a3-c000.csv", "/tmp/data.csv")
data_file_version = data_file.version_id
data = pd.read_csv('/tmp/data.csv')
data.head(5)


Unnamed: 0,customerID,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,...,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,gender,SeniorCitizen,Partner,Dependents,tenure
0,148,Yes,No,DSL,No,No,No,No,No,No,...,Yes,Electronic check,45.65,45.65,Yes,Male,0,No,No,1
1,463,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,...,No,Electronic check,101.15,385.9,Yes,Male,0,Yes,Yes,4
2,471,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,...,No,Mailed check,20.65,330.6,No,Female,1,No,No,17
3,496,No,No phone service,DSL,No,Yes,Yes,No,No,Yes,...,Yes,Bank transfer (automatic),43.75,903.6,Yes,Male,0,No,No,22
4,833,Yes,No,DSL,Yes,Yes,Yes,Yes,No,Yes,...,No,Credit card (automatic),74.1,5222.3,No,Female,0,Yes,Yes,70


### Use pandas.DataFrame functions
- _shape_ to return the dimensionality
- _info_ to print a concise summary of the DataFrame
- _describe_ to generate descriptive statistics of the DataFrame's columns
- _isnull().sum()_ to sum the empty values
- finally, convert _Churn_ to a numeric value and fill in the missing _TotalCharges_ values
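
The inspection and clean-up steps in the following cells can be tried on a small toy frame first. This is a sketch on made-up data, assuming a blank string stands in for a missing _TotalCharges_ value; the frame `toy` and its contents are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: one TotalCharges value is a blank string instead of a number.
toy = pd.DataFrame({
    "Churn": ["Yes", "No", "Yes"],
    "TotalCharges": ["10.0", " ", "30.0"],
})

print(toy.shape)  # (rows, columns)

# Map the Yes/No label to 1/0, turn blanks into NaN, then parse the numbers.
toy["Churn"] = toy["Churn"].map({"Yes": 1, "No": 0})
toy = toy.replace(" ", np.nan)
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"])
print(toy.isnull().sum())  # missing values per column

# Fill only the affected column with its own mean.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())
print(toy)
```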


In [6]:
data.shape

(7043, 21)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   int64  
 1   PhoneService      7043 non-null   object 
 2   MultipleLines     7043 non-null   object 
 3   InternetService   7043 non-null   object 
 4   OnlineSecurity    7043 non-null   object 
 5   OnlineBackup      7043 non-null   object 
 6   DeviceProtection  7043 non-null   object 
 7   TechSupport       7043 non-null   object 
 8   StreamingTV       7043 non-null   object 
 9   StreamingMovies   7043 non-null   object 
 10  Contract          7043 non-null   object 
 11  PaperlessBilling  7043 non-null   object 
 12  PaymentMethod     7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
 16  gender            7043 non-null   object 


In [8]:
data.describe()

| | customerID | MonthlyCharges | TotalCharges | SeniorCitizen | tenure |
|---|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7032.0 | 7043.0 | 7043.0 |
| mean | 3522.0 | 64.761692 | 2283.300441 | 0.162147 | 32.371149 |
| std | 2033.283305 | 30.090047 | 2266.771362 | 0.368612 | 24.559481 |
| min | 1.0 | 18.25 | 18.8 | 0.0 | 0.0 |
| 25% | 1761.5 | 35.5 | 401.45 | 0.0 | 9.0 |
| 50% | 3522.0 | 70.35 | 1397.475 | 0.0 | 29.0 |
| 75% | 5282.5 | 89.85 | 3794.7375 | 0.0 | 55.0 |
| max | 7043.0 | 118.75 | 8684.8 | 1.0 | 72.0 |


In [9]:
data.isnull().sum()

customerID           0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
dtype: int64

In [10]:
# Convert the binary Churn label to numeric so that plotting is easier and we can take its mean later
data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

In [11]:
data.replace(" ", np.nan, inplace=True)

In [12]:
data.isna().sum()

customerID           0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
dtype: int64

In [13]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])

In [14]:
# TotalCharges is the only column with NaN values; fill them with the column mean
mean = data['TotalCharges'].mean()
data['TotalCharges'] = data['TotalCharges'].fillna(mean)
data.isna().sum()

customerID          0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
dtype: int64

## Feature Engineering pipeline
### Use category_encoders' ordinal encoding, which represents the classes of each column as a single column of integers. Fit the encoder to the two-dimensional data imported earlier, pickle it, and use it to transform the data. Then apply one-hot (or dummy) encoding to the remaining categorical features, producing one binary feature per category.
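
The difference between the two encodings can be seen on a toy frame. This sketch uses plain pandas as a stand-in for category_encoders, so the generated column names differ from the ones in the outputs below (for example `Contract_One year` here versus numbered suffixes such as `MultipleLines_1` there). The frame `toy` is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Ordinal encoding: a single integer column, numbered from 1 in order of first appearance.
partner_codes = {value: i + 1 for i, value in enumerate(toy["Partner"].unique())}
toy["Partner"] = toy["Partner"].map(partner_codes)

# One-hot encoding: one binary column per category.
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded)
```

Ordinal codes keep the frame narrow but imply an ordering between classes, whereas one-hot columns avoid that at the cost of more features.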


In [15]:
import category_encoders as ce
import joblib

names = ['gender', 'Partner', 'Dependents', 'PhoneService', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']
# for column in names:
#     labelencoder(column)

enc = ce.ordinal.OrdinalEncoder(cols=names)
enc.fit(data)
joblib.dump(enc, 'enc.pkl')
labelled_set = enc.transform(data)
labelled_set.tail(5)



Unnamed: 0,customerID,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,...,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,gender,SeniorCitizen,Partner,Dependents,tenure
7038,6490,1,No,No,No internet service,No internet service,No internet service,No internet service,3,3,...,1,Mailed check,18.85,18.85,1,1,0,1,1,1
7039,6634,1,Yes,Fiber optic,No,No,No,No,1,1,...,1,Electronic check,74.5,74.5,1,2,0,1,1,1
7040,6638,1,No,DSL,Yes,Yes,No,No,1,1,...,1,Credit card (automatic),53.65,3804.4,2,1,0,2,1,69
7041,6721,1,Yes,DSL,No,Yes,Yes,Yes,2,2,...,1,Electronic check,84.1,5979.7,2,1,0,2,2,70
7042,6819,1,No,DSL,Yes,Yes,No,Yes,1,2,...,2,Mailed check,71.1,213.35,2,2,0,1,1,3


In [16]:

names = ['MultipleLines', 'InternetService', 'Contract', 'PaymentMethod', 'OnlineSecurity', 'OnlineBackup',
         'DeviceProtection', 'TechSupport']

ohe = ce.OneHotEncoder(cols=names)
ohe.fit(labelled_set)
joblib.dump(ohe, 'ohe.pkl')
final_set = ohe.transform(labelled_set)
final_set.tail(5)



Unnamed: 0,customerID,PhoneService,MultipleLines_1,MultipleLines_2,MultipleLines_3,InternetService_1,InternetService_2,InternetService_3,OnlineSecurity_1,OnlineSecurity_2,...,PaymentMethod_3,PaymentMethod_4,MonthlyCharges,TotalCharges,Churn,gender,SeniorCitizen,Partner,Dependents,tenure
7038,6490,1,1,0,0,0,0,1,0,1,...,0,0,18.85,18.85,1,1,0,1,1,1
7039,6634,1,0,1,0,0,1,0,1,0,...,0,0,74.5,74.5,1,2,0,1,1,1
7040,6638,1,1,0,0,1,0,0,0,0,...,0,1,53.65,3804.4,2,1,0,2,1,69
7041,6721,1,0,1,0,1,0,0,1,0,...,0,0,84.1,5979.7,2,1,0,2,2,70
7042,6819,1,1,0,0,1,0,0,0,0,...,0,0,71.1,213.35,2,2,0,1,1,3


### Now we use scikit-learn's 'train_test_split' function to randomly split our data into training and testing sets. Then remove the _Churn_ and _customerID_ fields from our training and testing datasets and output the shape of our data.
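
As a variation on the cell below, the label and identifier columns can be dropped before splitting, and the split can be made repeatable. This is a sketch on made-up data: `random_state=42` and `stratify=labels` are choices made for this example and are not used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "customerID": range(10),
    "tenure": [1, 4, 17, 22, 70, 3, 69, 2, 8, 45],
    "Churn": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Separate features from the label before splitting.
features = toy.drop(columns=["Churn", "customerID"])
labels = toy["Churn"]

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
print(X_tr.shape, X_te.shape)
```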

In [17]:
labels = final_set['Churn']
X_train, X_test, y_train, y_test = train_test_split(final_set, labels, test_size=0.2)
X_train.pop('Churn')
X_train.pop('customerID')
X_test.pop('Churn')
X_test.pop('customerID')
print ('Training Data Shape',X_train.shape, y_train.shape)
print ('Testing Data Shape',X_test.shape, y_test.shape)

Training Data Shape (5634, 36) (5634,)
Testing Data Shape (1409, 36) (1409,)


In [18]:

# Data for cross-validation and GridSearch: the full feature matrix X and label vector Y
Y = final_set['Churn']
X = final_set.drop(['Churn', 'customerID'], axis=1)
print ('Features Shape', X.shape)
print ('Labels Shape', Y.shape)

Features Shape (7043, 36)
Labels Shape (7043,)


### Create a DecisionTreeClassifier object, define the candidate hyperparameter values, and then let GridSearchCV find the best model among the combinations

In [19]:
# Create decision tree object
DT = DecisionTreeClassifier()
# List of parameters
# entropy
criterion = ['gini']
max_depth = [5,10,15]
min_samples_split = [2,4,6]
min_samples_leaf = [4,5,6,8]
# Save all the lists in the variable
hyperparameters = dict(max_depth=max_depth, criterion=criterion,min_samples_leaf = min_samples_leaf ,min_samples_split = min_samples_split)
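
To get a feel for how much work the grid search does, the candidate combinations of the grid above can be enumerated by hand. This sketch uses only `itertools` and copies the lists from the cell above; the names `param_grid`, `candidates` and `cv_folds` are illustrative.

```python
from itertools import product

param_grid = {
    "criterion": ["gini"],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [4, 5, 6, 8],
}

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
cv_folds = 5

print(len(candidates))             # 36 parameter combinations
print(len(candidates) * cv_folds)  # 180 fits during cross-validation
```

With the default `refit=True`, GridSearchCV fits once more on the full data using the best combination.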

In [20]:
model = GridSearchCV(DT, hyperparameters, cv=5, verbose=0)
best_model = model.fit(X,Y)

In [21]:
# Mean cross validated score
print('Mean Cross-Validated Score: ',best_model.best_score_)
print('Best Parameters',best_model.best_params_)
# You can also print the best criterion and max_depth individually from best_model.best_estimator_.get_params()
print('Best criteria:', best_model.best_estimator_.get_params()['criterion'])
print('Best depth:', best_model.best_estimator_.get_params()['max_depth'])

Mean Cross-Validated Score:  0.7922772235305504
Best Parameters {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 8, 'min_samples_split': 2}
Best criteria: gini
Best depth: 5


### Use the K-Folds cross-validator to split the data into train/test sets. Create a dictionary of hyperparameters, train a DecisionTreeClassifier with them, assess the results, then print and store the hyperparameters and accuracy and tag the run with 'DecisionTreeClassifier'
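
The following sketch shows how `KFold(n_splits=3)` would divide a data set of 7043 rows, the size reported by `data.shape` earlier. It assumes the default `shuffle=False`; the lists `fold_sizes` and `first_test_indices` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 7043  # number of rows in the churn data set
kfold = KFold(n_splits=3)

fold_sizes = []
first_test_indices = []
for train_idx, test_idx in kfold.split(np.arange(n_samples)):
    fold_sizes.append(len(test_idx))
    first_test_indices.append(int(test_idx[0]))

print(fold_sizes)          # [2348, 2348, 2347]
print(first_test_indices)  # [0, 2348, 4696]
```

Without shuffling, each test fold is a contiguous block of rows, so any ordering in the source file carries over into the folds.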


In [22]:
kfold = KFold(n_splits = 3)
hyperparameters = dict(max_depth=5, criterion='gini', min_samples_leaf=3, min_samples_split=10)
model = DecisionTreeClassifier(**hyperparameters)
model = model.fit(X_train, y_train)
joblib.dump(model, 'dct.pkl')
results = model_selection.cross_val_score(model,X,Y,cv = kfold)
print(results)
print('Accuracy',results.mean()*100)
store = get_meta_store()
store.log_hyperparameters(hyperparameters)
store.log_model(model)
store.log_metric('Accuracy',results.mean()*100)
store.log_tag("DecisionTreeClassifier")
# get_meta_store().log_dataset_version("raw_data", dataset_version)

[0.7802385  0.8032368  0.78909246]
Accuracy 79.08559188612233
connection successfully established
got existing Project: ml-workshop
created new Experiment: customerchurnuser29
created new ExperimentRun: customerchurnuser29162020184616830646
upload complete (custom_modules)
upload complete (model.pkl)
upload complete (model_api.json)
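
The run name in the output above is the experiment name followed by the timestamp string built with `strftime("%d%Y%H%M%S%f")`, which contains the day and year but no month. The sketch below reproduces it for an assumed run time, reconstructed from the watermark date and the logged run name, and shows a sortable alternative format that is a suggestion rather than part of the workshop.

```python
from datetime import datetime

# Assumed run time, reconstructed from the watermark date and the logged run name.
run_time = datetime(2020, 12, 16, 18, 46, 16, 830646)

notebook_format = run_time.strftime("%d%Y%H%M%S%f")    # day, year, time: no month
sortable_format = run_time.strftime("%Y%m%d%H%M%S%f")  # year, month, day, time

print("customerchurnuser29" + notebook_format)
print("customerchurnuser29" + sortable_format)
```

The second format sorts chronologically as plain text, which can make runs easier to find in a long list.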


### Like before, in this next section, on the third line, change experiment_name by appending your username to _customerchurn_, e.g., if your username is user1: 
#### experiment_name = "customerchurn"+"user1"
### Create a RandomForestClassifier object, define the candidate hyperparameter values, and then let GridSearchCV find the best model

In [23]:
dateTimeObj = datetime.now()
timestampStr = dateTimeObj.strftime("%d%Y%H%M%S%f")
experiment_name = "customerchurn"+"user29"
experiment_id = experiment_name + timestampStr


# Create random forest object
RF = RandomForestClassifier()
n_estimators = [18,22]
criterion = ['gini', 'entropy']
# Create a list of all of the parameters
max_depth = [30,40,50]
min_samples_split = [6,8]
min_samples_leaf = [8,10,12]
# Merge the list into the variable
hyperparameters = dict(n_estimators = n_estimators,max_depth=max_depth, criterion=criterion,min_samples_leaf = min_samples_leaf ,min_samples_split = min_samples_split)
# Fit your model using gridsearch
model = GridSearchCV(RF, hyperparameters, cv=5, verbose=0)
best_model = model.fit(X, Y)

### Extract the best score, parameters, criterion, depth and number of estimators from our model.

In [24]:
# Mean cross validated score
print('Mean Cross-Validated Score: ',best_model.best_score_)
print('Best Parameters',best_model.best_params_)
# You can also print the best criterion, max_depth and n_estimators individually from best_model.best_estimator_.get_params()
print('Best criteria:', best_model.best_estimator_.get_params()['criterion'])
print('Best depth:', best_model.best_estimator_.get_params()['max_depth'])
print('Best estimator:', best_model.best_estimator_.get_params()['n_estimators'])


Mean Cross-Validated Score:  0.8020744281889154
Best Parameters {'criterion': 'entropy', 'max_depth': 40, 'min_samples_leaf': 10, 'min_samples_split': 8, 'n_estimators': 22}
Best criteria: entropy
Best depth: 40
Best estimator: 22
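
Rather than retyping the winning values, the `best_params_` dictionary can be unpacked straight into a new estimator. This is a sketch on synthetic data with a deliberately tiny grid; `X_toy`, `y_toy`, `search` and `final_model` are illustrative names, and the notebook would use `X`, `Y` and `best_model` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the churn features and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10], "max_depth": [3, 5]},
    cv=3,
)
search.fit(X_toy, y_toy)

# Unpack the winning combination instead of retyping each value.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X_toy, y_toy)
print(search.best_params_)
```

With the default `refit=True`, `search.best_estimator_` is already a model trained with these parameters on the full data passed to `fit`.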


### As above, use the K-Folds cross-validator to split the data into train/test sets. Create a dictionary of hyperparameters, train a RandomForestClassifier with them, assess the results, then print and store the hyperparameters and accuracy and tag the run with 'RandomForestClassifier'

In [25]:
kfold = KFold(n_splits = 3)
hyperparameters = dict(max_depth=40, criterion='gini', min_samples_leaf=12, min_samples_split=8, n_estimators=22)
model = RandomForestClassifier(**hyperparameters)
model = model.fit(X_train, y_train)
joblib.dump(model, 'rft.pkl')
results = model_selection.cross_val_score(model,X,Y,cv = kfold)
print(results)
print('Accuracy',results.mean()*100)
store = get_meta_store()
store.log_hyperparameters(hyperparameters)
store.log_model(model)
store.log_metric('Accuracy',results.mean()*100)
store.log_tag("RandomForestClassifier")
store.log_attribute("data_file_location", "data/full_data_csv/a.csv")
store.log_attribute("data_file_version", data_file_version)

[0.79727428 0.80962521 0.79590967]
Accuracy 80.09363869494493
connection successfully established
got existing Project: ml-workshop
got existing Experiment: customerchurnuser29
created new ExperimentRun: customerchurnuser29162020190049967833
upload complete (custom_modules)
upload complete (model.pkl)
upload complete (model_api.json)


In [26]:
print('Notebook complete')

Notebook complete
