## Modeling phase


In this notebook, we create a model to hand over to the production team. The model estimates the probability that a customer will churn, given several details about that customer. The data set is loosely based on a Kaggle dataset ([original version](https://www.kaggle.com/blastchar/telco-customer-churn?select=WA_Fn-UseC_-Telco-Customer-Churn.csv)).

The elements we will track are the following:

* The context of the notebook
* Information about the data sources, including their file location, schema, and quality metrics
* The lineage between those data sources
* The trained models and their metrics

### Needed libraries and requirements

In [None]:
import pandas as pd
import numpy as np
import sklearn
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import math
%matplotlib inline

For this example, we use the Kensu public library, which lets us trace the data.

In [None]:
from kensu_public import *

In this notebook, we train our model on a moving time window to check whether the results remain reliable over time.

In [None]:
# simulation of an iteration on the churn model every year
from os import remove
from os import path
iter_file_name = "iter.txt"
global_fraction = 50
iter_fraction = 5
if not path.exists(iter_file_name):
    with open(iter_file_name, "w") as iter_file:
        iter_file.write("0")
else:
    with open(iter_file_name, "r") as iter_file:
        current_iter = int(iter_file.readline()) + 1
        global_fraction = global_fraction + current_iter * iter_fraction
    remove(iter_file_name)
    if global_fraction < 100:
        with open(iter_file_name, "w") as iter_file:
            iter_file.write(str(current_iter))
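
To make the counter logic easier to follow, here is a minimal sketch that replays it in memory instead of going through `iter.txt`. The function name `simulate_runs` and its parameters are hypothetical and only serve the illustration; the arithmetic mirrors the cell above.

```python
def simulate_runs(n_runs, base_fraction=50, step=5):
    """Replay the iteration counter in memory and return the fraction used by each run."""
    stored = None  # stands in for the content of iter.txt; None means the file is absent
    fractions = []
    for _ in range(n_runs):
        fraction = base_fraction
        if stored is None:
            # First run (or run after a reset): the file is created with "0"
            stored = 0
        else:
            current_iter = stored + 1
            fraction = base_fraction + current_iter * step
            # The file is only rewritten while the window has not reached 100%
            stored = current_iter if fraction < 100 else None
        fractions.append(fraction)
    return fractions


fractions = simulate_runs(12)
print(fractions)
```

Under these assumptions, the fraction grows by 5 points per run until it reaches 100, after which the counter file is gone and the next run starts again from 50.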

### Data Extraction

We provide a context to our notebook such as the process name, the project in which it fits, and the environment where the notebook is running.

In [None]:
process_name = "churn_train_12-24"
project_name = "Churn New Customers"
environment = "Lab"

We load a part of the dataset at each iteration. We will take about 10 years of data.

In [None]:
df = pd.read_csv('customer_data.csv', parse_dates=["date"], index_col='Unnamed: 0')

# Keep a window covering 50% of the rows, ending at global_fraction percent of the data
data_range_min = int(df.shape[0]*(global_fraction-50)/100)
data_range_max = int(df.shape[0]*global_fraction/100)
df = df.iloc[data_range_min:data_range_max]
# Use the latest date in the window as the run time, in epoch milliseconds
timestamp = df['date'].max()
current_runtime = int(timestamp.timestamp()*1000)
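
The window arithmetic can be checked in isolation. The sketch below wraps it in a hypothetical helper, `window_bounds`, and uses an assumed row count of 7000 purely for illustration; it is not the size of the actual file.

```python
def window_bounds(n_rows, global_fraction, window=50):
    """Return the (start, stop) row positions of a window ending at global_fraction percent."""
    start = int(n_rows * (global_fraction - window) / 100)
    stop = int(n_rows * global_fraction / 100)
    return start, stop


# Hypothetical data set of 7000 rows
for fraction in (50, 55, 100):
    print(fraction, window_bounds(7000, fraction))
```

The window always spans half of the rows and slides forward by 5% of the data at each iteration.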

We report the context of the notebook: the timestamp, project name, process name, and environment of the script. All of this information is written to the file `oreilly.log`, which is created in the working directory.

In [None]:
report_context(current_runtime,process_name,project_name,environment)

We also report the data source metadata, such as its format and schema, to the log file.

In [None]:
report_data_source(df,'customer_data.csv')

### Data Preparation

Let's check whether some customers have churned in our data set.

In [None]:
#Get the number of customers that churned
df['Churn'].value_counts()

In [None]:
#Visualize the count of customer churn
sns.countplot(x='Churn', data=df)

In [None]:
#What percentage of customers are leaving?
retained = df[df.Churn == 'No']
churned = df[df.Churn == 'Yes']
num_retained = retained.shape[0]
num_churned = churned.shape[0]
#Print the percentage of customers that stayed
print(num_retained / (num_retained + num_churned) * 100, "% of customers stayed with the company.")
#Print the percentage of customers that left
print(num_churned / (num_retained + num_churned) * 100, "% of customers left the company.")
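
The same percentages can also be obtained in a single call with `value_counts(normalize=True)`. The sketch below uses a small toy frame with made-up values, not the notebook's data.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No"]})

# normalize=True returns shares instead of counts; multiply by 100 for percentages
shares = toy["Churn"].value_counts(normalize=True) * 100
print(shares.round(1))
```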

Can we see a pattern in the churn rate as a function of gender?

In [None]:
#Visualize the churn count for both Males and Females
sns.countplot(x='gender', hue='Churn',data = df)

We can also explore whether the churn rate shows a pattern against tenure (the number of months the customer has stayed with the company) or MonthlyCharges (the amount charged to the customer each month).

In [None]:
numerical_features = ['tenure', 'MonthlyCharges']
fig, ax = plt.subplots(1, 2, figsize=(28, 8))
df[df.Churn == 'No'][numerical_features].hist(bins=20, color="blue", alpha=0.5, ax=ax)
df[df.Churn == 'Yes'][numerical_features].hist(bins=20, color="orange", alpha=0.5, ax=ax)

We will remove the unnecessary columns: customerID and date

In [None]:
cleaned_df = df = df.drop(['customerID','date'], axis=1)

We convert all the non-numeric columns to numerical data types

In [None]:
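for column in cleaned_df.columns:
    # Skip columns that are already numeric; comparing a dtype to np.number is unreliable
    if pd.api.types.is_numeric_dtype(cleaned_df[column]):
        continue

    # Replace each distinct value by an integer code
    cleaned_df[column] = LabelEncoder().fit_transform(cleaned_df[column])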
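
As an illustration of what label encoding does, here is a variant of the encoding loop on a toy frame. The column values are made up, and keeping the fitted encoders in a dictionary (`encoders`) is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only
toy = pd.DataFrame({
    "Contract": ["Two year", "Month-to-month", "One year", "Month-to-month"],
    "tenure": [24, 1, 12, 3],
})

encoders = {}
for column in toy.columns:
    # Numeric columns are left untouched
    if pd.api.types.is_numeric_dtype(toy[column]):
        continue
    encoder = LabelEncoder()
    # Classes are sorted, then each value is replaced by its position in that order
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(toy)
print(list(encoders["Contract"].classes_))
```

Keeping each fitted encoder makes it possible to map the integer codes back to the original labels later with `inverse_transform`.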

We now save the dataset to a CSV file. Because we create a new file in the filesystem, we must register its provenance (its lineage). The file `cleaned_data.csv` is created from the file `customer_data.csv`, from which we removed the columns date and customerID. For all those data sources, we send metadata, such as data statistics, in the context of the process.

In [None]:
df.to_csv('cleaned_data.csv')

In [None]:
report_data_source(df,'cleaned_data.csv')
report_link(['customer_data'],'cleaned_data',current_runtime=current_runtime)

Starting from the cleaned data, we can now create the feature matrix X and the target y. Then, we split the data into a training set and a test set.

In [None]:
X = cleaned_df.drop('Churn', axis = 1) 
X.to_csv('Xmatrix.csv')
report_data_source(X,'Xmatrix.csv')
report_link(['cleaned_data'],'Xmatrix',current_runtime=current_runtime)
y = cleaned_df['Churn']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

report_data_source(x_train,'Xtrain.csv')
report_link(['Xmatrix'],'Xtrain',current_runtime=current_runtime)
report_data_source(x_test,'Xtest.csv')
report_link(['Xmatrix'],'Xtest',current_runtime=current_runtime)
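
The links reported so far form a small graph. The sketch below represents it as a plain dictionary to show how the ancestors of a data source can be traced; it is only an illustration, and neither the `lineage` structure nor the `upstream` helper reflects how the Kensu library stores or queries lineage.

```python
# Each data source maps to the sources it was derived from (hypothetical structure)
lineage = {
    "cleaned_data": ["customer_data"],
    "Xmatrix": ["cleaned_data"],
    "Xtrain": ["Xmatrix"],
    "Xtest": ["Xmatrix"],
}


def upstream(name, links):
    """Return every ancestor of `name`, nearest first."""
    ancestors = []
    for parent in links.get(name, []):
        ancestors.append(parent)
        ancestors.extend(upstream(parent, links))
    return ancestors


print(upstream("Xtrain", lineage))
```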

### Modeling

We will first create a Logistic Regression model

In [None]:
#Create the model
model = LogisticRegression(max_iter=len(x_train))
#Train the model
model.fit(x_train, y_train)

In [None]:
predictions = model.predict(x_test)
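
The predictions can be summarized with `classification_report`, which is imported at the top of the notebook. The sketch below is self-contained and runs on synthetic data from `make_classification`, so its numbers say nothing about the churn model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LogisticRegression(max_iter=1000).fit(xd_train, yd_train)
demo_predictions = demo_model.predict(xd_test)

# Text report with precision, recall and F1 per class
print(classification_report(yd_test, demo_predictions))
# The same metrics as a dictionary, convenient for logging
report = classification_report(yd_test, demo_predictions, output_dict=True)
```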

We save the model as a joblib file. Because we save a new element, we must register it in the data lineage. The metadata of a model include its schema, hyperparameters, and performance metrics.

In [None]:
from joblib import dump, load
dump(model, 'logisticreg.joblib') 
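
A saved model can be read back with `load`. The sketch below checks, on a toy model and in a temporary directory, that the restored model predicts the same values as the original; the file name `toy_model.joblib` is hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(toy_model, model_path)    # serialize the fitted model
    restored = load(model_path)    # read it back from disk

same_predictions = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same_predictions)
```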

In [None]:
report_model("Xtrain","Xtest",x_test,y_test,model,'logisticreg.joblib',current_runtime=current_runtime)

Let's do the same with a Random Forest model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier()

In [None]:
model.fit(x_train,y_train)

In [None]:
predictions = model.predict(x_test)
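
A random forest is built with randomness, so two runs of `RandomForestClassifier()` can produce slightly different models. If reproducible iterations matter, one option is to fix `random_state`, as in this sketch on synthetic data; the value 42 and the other settings are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_rf = rng.normal(size=(120, 4))
y_rf = (X_rf[:, 0] + X_rf[:, 1] > 0).astype(int)

# Two forests trained with the same seed on the same data
first = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_rf, y_rf)
second = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_rf, y_rf)

identical = bool((first.predict_proba(X_rf) == second.predict_proba(X_rf)).all())
print(identical)
```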

In [None]:
from joblib import dump, load
dump(model, 'randomforest.joblib') 
report_model("Xtrain","Xtest",x_test,y_test,model,'randomforest.joblib',current_runtime=current_runtime)

You can find the created logs in the following file:

In [None]:
! cat oreilly.log