#### H.D.Nuwan Sameera ( Dissertation - MSc. in Data Science ( Batch 03 ) / Cardiff Met, ICBT )

# Develop more robust and precise models that can identify potential churners early and provide actionable insights for retention strategies.

## The steps of the process of prediction used in this project

- <b>Data collection</b>:

The dataset contains customer behavior, interactions, and historical churn.

- <b>Data preprocessing</b>:

Clean and preprocess the data by handling missing values and outliers and by encoding categorical variables, so that the dataset is suitable for machine learning.
     
     
- <b>Feature engineering</b>:

Create meaningful features that can help the model make accurate predictions.
    This involves feature scaling, normalization, or generating new features based on domain knowledge.
    
    
- <b>Data splitting</b>:

Split the dataset into training, validation, and test sets.
    A common split is 70% for training, 15% for validation, and 15% for testing.
    The validation set is used for hyperparameter tuning.
    
    
- <b>Model selection</b>:

Choose the appropriate machine learning algorithms for churn prediction.
    Common choices include K-nearest neighbors, support vector machines, logistic regression, random forests, decision trees, AdaBoost, gradient boosting, voting classifiers, and neural networks.
    Consider using ensemble methods or stacking multiple models for better performance.


- <b>Model training</b>:

Train the selected models using the training dataset.
    Tune hyperparameters to optimize model performance on the validation set.
    This may involve techniques like cross-validation.


- <b>Model evaluation</b>:

Evaluate the model's performance using appropriate metrics such as accuracy, precision, recall,
    F1-score, ROC AUC, or customer-centric metrics like customer lifetime value (CLV).
    Compare the performance of different models and
    choose the one that best aligns with your business objectives.

- <b>Finalizing The Most Suitable Model</b>
    
- <b>Feature Importance Analysis</b>:

Understand which features are most important for making churn predictions.
    Feature importance analysis can help refine the model and provide insights into customer behavior.
    
    
- <b>Model deployment</b>:

Once the model with satisfactory performance is selected, it can be deployed to a production environment where it can make real-time predictions. Consider using APIs or containerization for deployment.
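
The 70% / 15% / 15% split mentioned under Data splitting can be sketched with two calls to scikit-learn's `train_test_split`. This is a minimal illustration on synthetic data rather than the split applied later in this notebook; the helper name `split_70_15_15` and the arrays `X_demo` and `y_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_70_15_15(X, y, seed=5):
    """Split into 70% train, 15% validation and 15% test (hypothetical helper)."""
    # First hold out 30% of the rows; stratify keeps the class ratio similar in both parts
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y)
    # Then split the held-out 30% in half: 15% validation and 15% test
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=seed, stratify=y_rest)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic data: 200 rows, 3 features, roughly 30% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (rng.random(200) < 0.3).astype(int)

parts = split_70_15_15(X_demo, y_demo)
sizes = [len(p) for p in parts[:3]]
print(sizes)  # sizes of the train, validation and test feature sets
```

Keeping a separate validation set means hyperparameters can be tuned without touching the test set, which is then used only once for the final estimate.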

## Further steps to be followed

- <b>Monitoring and Maintenance</b>:

Continuously monitor the model's performance in a production environment. Re-train the model periodically with new data to ensure it remains accurate as customer behavior changes over time.
    
    
- <b>Interpretability and Explainability</b>:

Understand how the model is making predictions. Use techniques like SHAP values or LIME to interpret and explain model decisions to stakeholders.


- <b>Feedback loop</b>:

Incorporate feedback from business stakeholders, customer support, and other relevant sources to improve the model over time.
    
    
- <b>Scale and Iterate</b>:

As your business evolves and gathers more data, consider scaling the model and iterating on the process to improve prediction accuracy and reduce churn.

## Libraries installed

- pandas
- numpy
- matplotlib
- plotly
- ydata-profiling (!pip install -U ydata-profiling)
- seaborn
- scikit-learn (imported as sklearn)
- jupyter

## Special Notes

##### Run the command below to upgrade the '<b>threadpoolctl</b>' library if training the <b>KNN</b> model throws an exception.

    - Mac: !pip install threadpoolctl --upgrade   
    - Windows: pip install threadpoolctl --upgrade   


##### Run the command below to install the ydata-profiling library
    - Mac: !pip install -U ydata-profiling
    - Windows: pip install -U ydata-profiling

## Feature Identification

## =======================================================================

In [None]:
# Libraries related to date and time
from datetime import datetime

In [None]:
# Create directories if not exist
import os

listFolders = ['fig', 'boxplots']

for f in listFolders:
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(f, exist_ok=True)

## Global Variables

In [None]:
# @title Global variables

# Initialize the program started time
mainStartTime = datetime.now()

# Initialize the ML start and end time.
startTime = datetime.now()
endTime = datetime.now()

# List of the column names
colNames = []

# Run visualization codes (1=Show Plots and other graphs)
showViz = 1

##  < Data Collection >

#### Two datasets are used throughout the process.

    - telcomModelData.csv - The dataset used to train and test the models
    - telcomHoldout.csv   - The holdout dataset used for predictions with the most accurate and appropriate model

#### Importing the required libraries for data manipulation and visualization

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Reading the dataset into pandas DataFrame

In [None]:
df = pd.read_csv('telcomModelData.csv')

### Data exploration

In [None]:
df

In [None]:
# Show the number of columns and rows
df.shape

In [None]:
# Observing data types
df.dtypes

In [None]:
# Check for unique values in dtype 'object' (categorical)

def showUniqueValues(df):
  colNames.clear()
  for column in df:
    colNames.append(column)
    if df[column].dtype =='object':
      print(f'{column} : {df[column].unique()}')

showUniqueValues(df)

In [None]:
# List of column names
colNames

## < Data preprocessing >


In [None]:
# HandsetPrice should be numeric, but some observations hold the string 'Unknown'.
# Count the observations whose HandsetPrice cannot be parsed as a number.
pd.to_numeric(df.HandsetPrice, errors='coerce').isnull().sum()

In [None]:
# More than half of the observations have the value 'Unknown', so these observations cannot simply be dropped.
# Show the shape of the subset whose HandsetPrice cannot be parsed as a number.
df[pd.to_numeric(df.HandsetPrice, errors='coerce').isnull()].shape

In [None]:
df[df.HandsetPrice == 'Unknown'].HandsetPrice.count()
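
The counting step above relies on `pd.to_numeric` with `errors='coerce'`, which turns every value it cannot parse into `NaN`. A minimal sketch on a toy column follows; the values in `prices` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column mixing numeric strings with the placeholder 'Unknown' (illustrative values)
prices = pd.Series(['30', 'Unknown', '150', 'Unknown', '80'])

# errors='coerce' converts anything that cannot be parsed into NaN
numeric_prices = pd.to_numeric(prices, errors='coerce')

n_unparsed = int(numeric_prices.isnull().sum())
print(n_unparsed)               # number of values that became NaN
print(numeric_prices.median())  # median of the parsable values only
```

Because the median skips `NaN` by default, it can later serve as a fill value for the coerced entries.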

#### Function to check for missing data or NA values

In [None]:
def checkNA():
  for x in df.columns:
      if df[x].isna().sum() != 0:
          print(x, df[x].isna().sum())

In [None]:
# Check NAs
checkNA()

### Drop columns

In [None]:
def dropColumns(df):

    # The 'CustomerID' is not important.
    df.drop('CustomerID', axis=1, inplace=True)

In [None]:
dropColumns(df)

In [None]:
# Show unique values
showUniqueValues(df)

### NA imputation

In [None]:
def imputeNA(df):

    # Results are assigned back to the column, because fillna(inplace=True)
    # on a selected column may act on a temporary copy in recent pandas versions.

    # Median imputation
    for col in ['MonthlyRevenue', 'MonthlyMinutes', 'TotalRecurringCharge', 'CurrentEquipmentDays']:
        df[col] = df[col].fillna(df[col].median())

    # Zero imputation
    for col in ['AgeHH1', 'AgeHH2', 'PercChangeRevenues', 'PercChangeMinutes',
                'RoamingCalls', 'OverageMinutes', 'DirectorAssistedCalls']:
        df[col] = df[col].fillna(0)

    # Mode imputation
    #df['ServiceArea'] = df['ServiceArea'].fillna(df['ServiceArea'].mode()[0])
    for col in ['Handsets', 'HandsetModels']:
        df[col] = df[col].fillna(df[col].mode()[0])

    # Replace 'Unknown' with NaN and convert 'HandsetPrice' to numeric first,
    # so that the median is computed on numbers rather than on strings.
    df['HandsetPrice'] = pd.to_numeric(df['HandsetPrice'].replace('Unknown', np.nan))
    df['HandsetPrice'] = df['HandsetPrice'].fillna(df['HandsetPrice'].median())

    return df

In [None]:
imputeNA(df)

In [None]:
# Check NAs
checkNA()

#### Show categorical unique values of the columns except 'ServiceArea'

In [None]:
showUniqueValues(df)

## Descriptive analysis

### Data distribution - Pie charts
#### Churn visualization

In [None]:
import plotly.offline as po
import plotly.graph_objects as go

if showViz == 1:
    churn_key = df['Churn'].value_counts().keys().tolist()
    churn_value = df['Churn'].value_counts().values.tolist()

    plot_data = [
        go.Pie(labels=churn_key, values=churn_value, marker=dict(colors=['Teal','Gray'], line=dict(color='white', width=1.5)),
        rotation=90,
        hoverinfo="label+value+text",
        hole=0.6)
    ]

    plot_layout = go.Layout(dict(title="Customer Churn", plot_bgcolor='rgb(243, 243,243)', paper_bgcolor='rgb(243, 243, 243)',))

    fig = go.Figure(data=plot_data, layout=plot_layout)
    po.iplot(fig)

#### Categorical type features distribution analysis

In [None]:
# The following function is used to visualize the distribution of the other columns

# Function to visualize the distribution
def distributionPie(column):
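  # Take labels and values from the same value_counts() result so that they stay aligned
  counts = df[column].value_counts()
  labels = counts.index
  values = counts.values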

  fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0.6, rotation=90)])
  fig.update_layout(title_text=f"{column} Distribution", plot_bgcolor='rgb(243, 243,243)')
  fig.show()

if showViz == 1:
    # Loop the column names (categorical)
#     for column in df.drop(['ServiceArea', 'Churn', 'HandsetPrice'], axis=1):
    
    for column in df.drop(['Churn', 'HandsetPrice'], axis=1):
        if df[column].dtype =='object':
            distributionPie(column)
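
When pie labels and slice sizes are built from a column, they are easiest to keep aligned when both come from a single `value_counts()` result, because `unique()` returns values in order of first appearance while `value_counts()` sorts by frequency. A small sketch with made-up category values:

```python
import pandas as pd

# Toy categorical column (illustrative values)
s = pd.Series(['Town', 'Rural', 'Rural', 'Other', 'Rural', 'Other'])

# unique() keeps the order of first appearance
appearance_order = list(s.unique())

# value_counts() sorts by frequency; its index and values belong together
counts = s.value_counts()
labels = counts.index.tolist()
values = counts.values.tolist()

print(appearance_order)
print(labels, values)
```

Pairing `appearance_order` with `values` here would attach the count of 'Rural' to the label 'Town'.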

### Pandas profiling report

##### This report is an interactive data analysis report for quickly gaining insights into a dataset. The report provides a wide range of statistical and visual summaries of the data, helping data analysts and data scientists to understand the dataset's characteristics, identify potential issues, and make informed decisions about data preprocessing and analysis.

#### Generating the profiling report and saving as an HTML file.

In [None]:
if showViz == 1:
    # ProfileReport started time
    startTime = datetime.now()

    from ydata_profiling import ProfileReport

    # Create pandas profiling report
    profReport = ProfileReport(df)

    # Save the profiling report in HTML format
    profReport.to_file('Row Data Analysis.html')

    endTime = datetime.now()

    print(f'Profile Report processing time : {endTime-startTime}')

#### View the profiling report.

In [None]:
# View pandas profiling report (This will take a few minutes to load the report)
# Since this takes a considerable amount of time to load, the report is saved as 'Row Data Analysis.html' in the directory.
# Open the exported html file to view the plots and other statistics instead.

# Uncomment the below code to view the report along with the code.

#profReport

# Open profiling report in browser
import webbrowser
url = "Row Data Analysis.html"
webbrowser.open(url)

#### Visualizing the distribution of numeric features by 'Churn'

In [None]:
import seaborn as sns

if showViz == 1:

    for column in df.drop('Churn', axis=1):
        
        if df[column].dtype !='object':
            fig, ax = plt.subplots(figsize=(6, 3))
            sns.set_context("paper",font_scale=1.1)
            ax = sns.kdeplot(df[column][(df["Churn"] == 'No') ],
                            color="Red", fill=True);
            ax = sns.kdeplot(df[column][(df["Churn"] == 'Yes') ],
                            ax =ax, color="Blue", fill=True);
            ax.legend(["Not Churn","Churn"],loc='upper right');
            ax.set_ylabel('Density');
            ax.set_xlabel(column);
            ax.set_title(f'Distribution of {column} by churn');

#### Comparing the features statistically against 'Churn' values 'Yes' and 'No'

In [None]:
import plotly.express as px

def compareStatsWithChurn(colName):

  fig = px.box(df, x='Churn', y = colName)

  # Update yaxis properties
  fig.update_yaxes(title_text=colName, row=1, col=1)
  # Update xaxis properties
  fig.update_xaxes(title_text='Churn', row=1, col=1)

  # Update size and title
  fig.update_layout(autosize=True, width=750, height=600,
      title_font=dict(size=25, family='Courier'),
      title=f'<b>{colName} vs Churn</b>',
  )

  fig.show()

if showViz == 1 :
    for c in colNames:
        compareStatsWithChurn(c)

#### Analyzing 'Churn' vs other features

In [None]:
# Function to analyze customer churn vs other features

def churnHist(colName):
  fig, ax = plt.subplots(figsize=(6, 4))

  churn_yes = df[df.Churn=='Yes'][colName]
  churn_no = df[df.Churn=='No'][colName]

  ax.hist([churn_yes, churn_no], color=['red','purple'], label=['Yes','No'])
  ax.legend()

  ax.set(title=f'Customer churn vs {colName} analysis', xlabel=colName, ylabel='Number of customers')

  # Use the function argument for the file name rather than a global loop variable
  plt.savefig(f'fig/{colName}.png', dpi=100);

In [None]:
# Customer churn vs other features

if showViz == 1:
    for column in df.drop('Churn', axis=1):
      churnHist(column)

### Encoding

#### Encoding Type 1 : Label encoding

In [None]:
def labelEncoding(df):

    # The 'Homeownership' column has the two unique values 'Known' and 'Unknown', so it is label encoded separately.
    df['Homeownership'] = df['Homeownership'].replace({'Known':1, 'Unknown':0})

    # List of the columns whose values are 'Yes' and 'No' only.
    yes_no_columns = ['ChildrenInHH', 'HandsetRefurbished', 'HandsetWebCapable', 'TruckOwner', 'RVOwner', 'BuysViaMailOrder', 'RespondsToMailOffers', 'OptOutMailings',
                      'NonUSTravel' , 'OwnsComputer', 'HasCreditCard', 'NewCellphoneUser', 'NotNewCellphoneUser', 'OwnsMotorcycle', 'MadeCallToRetentionTeam']

    # Label encode all the columns with values 'Yes' and 'No'.
    # The result is assigned back to the column instead of replacing in place on a selection.

    for col in yes_no_columns:
      df[col] = df[col].replace({'Yes':1, 'No':0})

    return df

In [None]:
df = labelEncoding(df)

In [None]:
# Check shape of the dataframe
df.shape

In [None]:
# Show categorical unique values of the columns except 'ServiceArea'

showUniqueValues(df)

#### Encoding Type 2 : One hot encoding

In [None]:
def oneHotEncoding(df):
    # One hot encoding is done to the relevant columns at once.
    df = pd.get_dummies(data=df, columns=['CreditRating', 'PrizmCode', 'Occupation', 'MaritalStatus'], dtype=float)
    return df

df = oneHotEncoding(df)

# Check shape of the dataframe
df.shape
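
To make the effect of one hot encoding concrete, the following sketch applies `pd.get_dummies` to a tiny made-up frame. The frame `toy` and its values are assumptions for illustration; only the call pattern mirrors the function above.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values)
toy = pd.DataFrame({'MaritalStatus': ['Yes', 'No', 'Unknown', 'Yes'],
                    'Handsets': [1, 2, 1, 3]})

# Each category becomes its own 0.0/1.0 column; untouched columns are kept as they are
encoded = pd.get_dummies(data=toy, columns=['MaritalStatus'], dtype=float)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each row has exactly one 1.0 among the new `MaritalStatus_*` columns, which is why the column count grows with the number of categories.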

In [None]:
# View the dataframe
df

In [None]:
# Show unique values of the columns
showUniqueValues(df)

In [None]:
# Get the column names
colNames.clear()
for column in df:
    colNames.append(column)

In [None]:
# Show some statistics

df.describe()

In [None]:
df['Churn'].value_counts()

In [None]:
len(colNames)

# < Feature Engineering >

### Treating Outliers

#### Function to show outliers

In [None]:
# Boxplots
# Since the number of features is 73, the boxplots are grouped as 5 features per group.

def showBoxplots():
  # Step through colNames in chunks of 5; the last chunk may hold fewer than 5 columns
  for start in range(0, len(colNames), 5):
    group = colNames[start:start+5]

    fig, ax = plt.subplots()
    df.boxplot(column=group, ax=ax)
    plt.xticks(rotation=45)
    plt.savefig(f'boxplots/{group}.png', dpi=100);
    plt.show()
    print('\n')

#### Show boxplots - Before removing outliers

In [None]:
colNames.remove('Churn')
showBoxplots()

#### Removing outliers

The outliers are replaced with the median

In [None]:
for column in colNames:

    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)

    iqr = q3 - q1
    upperLevel = q3 + (iqr * 1.5)
    lowerLevel = q1 - (iqr * 1.5)

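    # Replace values outside the IQR fences with the column median.
    # A single .loc assignment avoids chained indexing, which may modify a temporary copy.
    outliers = (df[column] < lowerLevel) | (df[column] > upperLevel)
    df.loc[outliers, column] = df[column].median()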
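
As a worked example of the IQR rule used in this step, the sketch below runs the same arithmetic on a six-value toy column. The frame `toy` and its values are assumptions chosen so that the fences are easy to verify by hand.

```python
import pandas as pd

# Toy column with one obvious outlier (illustrative values)
toy = pd.DataFrame({'MonthlyMinutes': [10.0, 12.0, 11.0, 13.0, 12.0, 200.0]})
col = 'MonthlyMinutes'

q1 = toy[col].quantile(0.25)   # 11.25
q3 = toy[col].quantile(0.75)   # 12.75
iqr = q3 - q1                  # 1.5
lower = q1 - 1.5 * iqr         # 9.0
upper = q3 + 1.5 * iqr         # 15.0

# The median is taken before any value is replaced
median = toy[col].median()     # 12.0

# Flag values outside the fences and overwrite them in one .loc assignment
outliers = (toy[col] < lower) | (toy[col] > upper)
toy.loc[outliers, col] = median

print(toy[col].tolist())
```

Only the value 200.0 falls outside the interval from 9.0 to 15.0, so it is the only one replaced by the median.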

#### Show boxplots - After removing outliers

In [None]:
if showViz == 1:
    #colNames.remove('Churn')
    showBoxplots()

### Scaling

In [None]:
# Check for features with type 'object'

for column in df:
  if df[column].dtype =='object':
    print(column)

#### Min_Max scaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [None]:
def minMaxScale(df):
    # Get all the column names except the first one, which holds the target 'Churn'.
    scaleCols = df.columns.tolist()[1:]

    # Apply Min-Max scaler
    df[scaleCols] = scaler.fit_transform(df[scaleCols])

#### Apply Min-Max scaler

In [None]:
# Call min-max scaling function
minMaxScale(df)

# View sample
df.sample(5)
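
Min-Max scaling maps each feature to the range 0 to 1 using (x - min) / (max - min), computed per column. The sketch below checks this formula against `MinMaxScaler` on a toy array; `X_toy` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy feature with three observations (illustrative values)
X_toy = np.array([[10.0], [20.0], [40.0]])

# Scaling with scikit-learn
scaled = MinMaxScaler().fit_transform(X_toy)

# The same result computed by hand, column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the minimum and maximum drive the result, a single extreme value can compress the remaining observations toward zero, which is one reason to treat outliers before scaling.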

###### Show some statistics

In [None]:
df.describe()

#### Show boxplots - After scaling

In [None]:
if showViz == 1:
    #colNames.remove('Churn')
    showBoxplots()

### The boxplots generated after feature scaling indicate that the values of some features are concentrated very close to zero.

## < Data Splitting >

### Train & Test Splits


In [None]:
from sklearn.metrics import recall_score, confusion_matrix, precision_score, f1_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# X_train = X_test = df.drop('Churn', axis=1)
# y_train = y_test = df['Churn']

X_train = X_test = df.drop(df.index)
y_train = y_test = df.drop(df.index)

# X_train.drop(X_train.index, inplace=True)
# X_test.drop(X_test.index, inplace=True)
# y_train.drop(y_train.index, inplace=True)
# y_test.drop(y_test.index, inplace=True)

def splitDataset(df):
    # All the columns except 'Churn'
    X = df.drop('Churn', axis=1)

    # 'Churn' column
    y = df['Churn']

    # Access the dataframes declared globally
    global X_train, X_test, y_train, y_test
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
X_train

In [None]:
splitDataset(df)

In [None]:
X_train.shape

In [None]:
y_test

## < Model Selection >

The following supervised machine learning algorithms are used to analyze the customer churn dataset.

- #### K-Nearest Neighbors (KNN)
- #### Support Vector Machine (SVM)
- #### Random Forest
- #### Logistic Regression
- #### Decision Tree Classifier
- #### Ada Boost Classifier
- #### Gradient Boosting Classifier
- #### Voting Classifier

### Customized functions used to calculate important parameters for ML selection

#### Calculating and storing the duration spent for ML processes

In [None]:
# DataFrame to select the appropriate ML algorithm for this customer churn prediction dataset
df_ml_eval = pd.DataFrame({'ATTEMPT':[], 'ALGORITHM':[], 'ALGOINDEX':[], 'TIME':[]})

attempt = 0

algoList = ['KNN', 'SVM', 'Random Forest', 'Logistic Regression', 'Decision Tree', 'Ada Boost', 'Gradient Boosting', 'Voting']

# Function to calculate the ML processing time and store a record in the 'df_ml_eval' DataFrame for analysis
def calculateTimeML(t, nameML=None):
  global startTime
  global endTime

  if t == 1:
     startTime = datetime.now()
  elif t == 2:
     endTime = datetime.now()


     # Print duration
     print(f'{nameML} time : {(endTime-startTime).total_seconds()}')

     # Store time in the DataFrame
     df_ml_eval.loc[len(df_ml_eval.index)+1] = [attempt, nameML, algoList.index(nameML), (endTime-startTime).total_seconds()]
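
The start/stop pattern used by the timing function can also be expressed as a context manager, which guarantees that the end time is recorded even if the timed block raises. This is only a sketch of an alternative design using the standard library; the names `timed` and `timings` are hypothetical and are not used elsewhere in this notebook.

```python
import time
from contextlib import contextmanager

# Collected durations in seconds, keyed by a label (hypothetical structure)
timings = {}

@contextmanager
def timed(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike
        timings[name] = time.perf_counter() - start

# Usage: wrap any block whose duration should be recorded
with timed('demo'):
    total = sum(range(100_000))

print(f"demo time : {timings['demo']}")
```

`time.perf_counter()` is a monotonic clock intended for measuring short durations, so the result is not affected by system clock adjustments.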

#### Function to generate Confusion Matrix Graph

In [None]:
def generateConfusionMatrixGraph(algo, y_test, pred):

    plt.figure(figsize=(4,3))
    sns.heatmap(confusion_matrix(y_test, pred), cmap="Blues",
                    annot=True,fmt = "d",linecolor="k",linewidths=3)

    plt.title(f'{algo} Confusion Matrix' ,fontsize=14)
    plt.show()

#### Function to generate Confusion Matrix Array

In [None]:
def generateConfusionMatrixArray(prediction):
    # Assigned the actual and predicted values to a dictionary.
    dvalues = {'y_actual': y_test, 'y_predicted': prediction}

    # Create a dataframe from dvalues dictionary.
    dfcm = pd.DataFrame(dvalues)

    # Create the confusion matrix using the dfcm dataframe.
    cm = pd.crosstab(dfcm['y_actual'], dfcm['y_predicted'], rownames=['Actual'], colnames=['Predicted'])

    return cm
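
The layout of a `pd.crosstab` confusion matrix is easy to misread, so the sketch below builds one from eight made-up labels and picks out each cell by name. Rows are actual labels and columns are predicted labels; the toy lists and the choice of 1 as the positive class are assumptions for illustration.

```python
import pandas as pd

# Toy labels: 1 = churn (positive class), 0 = no churn (illustrative values)
y_actual    = [0, 0, 0, 1, 1, 1, 1, 0]
y_predicted = [0, 1, 0, 1, 0, 1, 1, 0]

cm_toy = pd.crosstab(pd.Series(y_actual, name='Actual'),
                     pd.Series(y_predicted, name='Predicted'))

# .loc[actual, predicted] makes the row/column roles explicit
tn = int(cm_toy.loc[0, 0])  # actual 0, predicted 0
fp = int(cm_toy.loc[0, 1])  # actual 0, predicted 1
fn = int(cm_toy.loc[1, 0])  # actual 1, predicted 0
tp = int(cm_toy.loc[1, 1])  # actual 1, predicted 1

print(cm_toy)
print(tp, tn, fp, fn)
```

Note that plain bracket indexing such as `cm_toy[1][0]` selects the column first (predicted 1) and then the row (actual 0), so it returns the false positives.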

#### Replacing 'Yes' and 'No' values with 1 and 0 in y datasets

In [None]:
def replaceYesNo():
    # Number of model training attempts
    global attempt, y_train, y_test

    attempt = attempt + 1

    # Replace 'Yes' and 'No' values with 1 and 0 so that the labels are numeric.
    # The results are reassigned instead of replacing in place on the split Series.
    y_test = y_test.replace({'Yes':1, 'No':0})
    y_train = y_train.replace({'Yes':1, 'No':0})

replaceYesNo()

## < Model Training >

### 01 : K-Nearest Neighbors (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

def trainModel_knn():
    calculateTimeML(1)

    model_knn = KNeighborsClassifier(n_neighbors = 11)
    model_knn.fit(X_train,y_train)
    prediction_knn = model_knn.predict(X_test)
    accuracy_knn = model_knn.score(X_test,y_test)
    print("KNN accuracy :",accuracy_knn)

    calculateTimeML(2, 'KNN')

    print('\nClassification Report')
    print(classification_report(y_test, prediction_knn))
    
    return (model_knn, prediction_knn)

### 02 : Support Vector Machine - SVM

##### Since the Support Vector Machine takes a considerable amount of time compared to the other algorithms, SVM is not considered in this research from here onwards.

In [None]:
# from sklearn.svm import SVC

# calculateTimeML(1)

# model_svc = SVC(random_state = 1, probability=True)
# model_svc.fit(X_train,y_train)
# prediction_svc = model_svc.predict(X_test)
# accuracy_svc = model_svc.score(X_test,y_test)
# print('SVM accuracy :',accuracy_svc)

# calculateTimeML(2, 'SVM')

# print('\nClassification Report')
# print(classification_report(y_test, prediction_svc))

### 03 : Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

def trainModel_rf():
    calculateTimeML(1)

    model_rf = RandomForestClassifier(n_estimators=500 , oob_score = True, n_jobs = -1,
                                      random_state =50, max_features = "sqrt",
                                      max_leaf_nodes = 30)
    model_rf.fit(X_train, y_train)
    prediction_rf = model_rf.predict(X_test)
    accuracy_rf = metrics.accuracy_score(y_test, prediction_rf)
    print('Random Forest accuracy :', accuracy_rf)

    calculateTimeML(2, 'Random Forest')

    print("\nClassification Report")
    print(classification_report(y_test, prediction_rf))
    
    return (model_rf, prediction_rf)

###  04 : Logistic Regression


In [None]:
from sklearn.linear_model import LogisticRegression

def trainModel_lr():
    calculateTimeML(1)

    model_lr = LogisticRegression()
    model_lr.fit(X_train,y_train)
    prediction_lr = model_lr.predict(X_test)
    accuracy_lr = model_lr.score(X_test,y_test)
    print("Logistic Regression accuracy :",accuracy_lr)

    calculateTimeML(2, 'Logistic Regression')

    print("\nClassification Report")
    print(classification_report(y_test,prediction_lr))
    
    return (model_lr, prediction_lr)

### 05 : Decision Tree Classifier


In [None]:
from sklearn.tree import DecisionTreeClassifier

def trainModel_dt():

    calculateTimeML(1)

    model_dt = DecisionTreeClassifier()
    model_dt.fit(X_train,y_train)
    prediction_dt = model_dt.predict(X_test)
    accuracy_dt = model_dt.score(X_test,y_test)
    print("Decision Tree accuracy is :",accuracy_dt)

    calculateTimeML(2, 'Decision Tree')

    print('\nClassification Report')
    print(classification_report(y_test, prediction_dt))
    
    return (model_dt, prediction_dt)

### 06 : Ada Boost Classifier


In [None]:
from sklearn.ensemble import AdaBoostClassifier

def trainModel_abc():

    calculateTimeML(1)

    model_abc = AdaBoostClassifier()
    model_abc.fit(X_train,y_train)
    prediction_abc = model_abc.predict(X_test)
    accuracy_abc = metrics.accuracy_score(y_test, prediction_abc)
    print("Ada Boost Classifier accuracy : ", accuracy_abc)

    calculateTimeML(2, 'Ada Boost')

    print('\nClassification Report')
    print(classification_report(y_test, prediction_abc))
    
    return (model_abc, prediction_abc)

### 07 : Gradient Boosting Classifier


In [None]:
from sklearn.ensemble import GradientBoostingClassifier

def trainModel_gbc():
    calculateTimeML(1)

    model_gbc = GradientBoostingClassifier()
    model_gbc.fit(X_train, y_train)
    prediction_gbc = model_gbc.predict(X_test)
    accuracy_gbc = accuracy_score(y_test, prediction_gbc)
    print("Gradient Boosting Classifier accuracy : ", accuracy_gbc)

    calculateTimeML(2, 'Gradient Boosting')

    print('\nClassification Report')
    print(classification_report(y_test, prediction_gbc))
    
    return (model_gbc, prediction_gbc)

### 08 : Voting Classifier


In [None]:
from sklearn.ensemble import VotingClassifier

def trainModel_vc():

    calculateTimeML(1)

    clf_gbc = GradientBoostingClassifier()
    clf_lr = LogisticRegression()
    clf_abc = AdaBoostClassifier()
    model_vc = VotingClassifier(estimators=[('gbc', clf_gbc), ('lr', clf_lr), ('abc', clf_abc)], voting='soft')
    model_vc.fit(X_train, y_train)
    prediction_vc = model_vc.predict(X_test)
    accuracy_vc = accuracy_score(y_test, prediction_vc)
    print(f"Final Accuracy Score {accuracy_vc}")

    calculateTimeML(2, 'Voting')

    print('\nClassification Report')
    print(classification_report(y_test, prediction_vc))
    
    return (model_vc, prediction_vc)
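
With `voting='soft'`, the voting classifier averages the class probabilities of its estimators and predicts the class with the highest average, whereas hard voting takes the majority of the individual class predictions. The NumPy sketch below illustrates the difference; the probability arrays are made-up values, not outputs of the models above.

```python
import numpy as np

# Hypothetical class probabilities [P(no churn), P(churn)] for two customers
proba_gbc = np.array([[0.10, 0.90], [0.70, 0.30]])
proba_lr  = np.array([[0.55, 0.45], [0.60, 0.40]])
proba_abc = np.array([[0.55, 0.45], [0.80, 0.20]])
all_proba = [proba_gbc, proba_lr, proba_abc]

# Soft voting: average the probabilities, then take the most probable class
avg_proba = np.mean(all_proba, axis=0)
soft_vote = avg_proba.argmax(axis=1)

# Hard voting: each model votes for its own most probable class, majority wins
individual_votes = np.array([p.argmax(axis=1) for p in all_proba])
hard_vote = (individual_votes.sum(axis=0) > len(all_proba) / 2).astype(int)

print(soft_vote)
print(hard_vote)
```

For the first customer, one confident model outweighs two marginal ones under soft voting, while hard voting follows the two marginal votes.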

#### Function to run model training

In [None]:
def runModelTrainFunctions():
    
    model_knn, prediction_knn = trainModel_knn()
#     model_svc, prediction_knn = trainModel_knn()
    model_rf, prediction_rf = trainModel_rf()
    model_lr, prediction_lr = trainModel_lr()
    model_dt, prediction_dt = trainModel_dt()
    model_abc, prediction_abc = trainModel_abc()
    model_gbc, prediction_gbc = trainModel_gbc()
    model_vc, prediction_vc = trainModel_vc()
    
    return model_knn, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc, prediction_knn, prediction_lr, prediction_rf, prediction_dt, prediction_abc, prediction_gbc, prediction_vc

#### Calling models training function

In [None]:
model_knn, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc, prediction_knn, prediction_lr, prediction_rf, prediction_dt, prediction_abc, prediction_gbc,prediction_vc = runModelTrainFunctions()

#### The total processing time

In [None]:
print(f'Total processing time : {(endTime-mainStartTime)}')

## < Model Evaluation >

### Evaluation Type 1 - Confusion Matrix

#### Function to generate confusion matrix graphs

In [None]:
def generateConfusionMatrixGraphs_All():
    
   # global prediction_knn, prediction_rf, prediction_lr, prediction_dt, prediction_abc, prediction_gbc, prediction_vc

    generateConfusionMatrixGraph('KNN', y_test, prediction_knn)
    # generateConfusionMatrixGraph('SVM', y_test, prediction_svc)
    generateConfusionMatrixGraph('Random Forest', y_test, prediction_rf)
    generateConfusionMatrixGraph('Logistic Regression', y_test, prediction_lr)
    generateConfusionMatrixGraph('Decision Tree', y_test, prediction_dt)
    generateConfusionMatrixGraph('Ada Boost', y_test, prediction_abc)
    generateConfusionMatrixGraph('Gradient Boosting Classifier', y_test, prediction_gbc)
    generateConfusionMatrixGraph('Final', y_test, prediction_vc)   

#### Generate Confusion Matrix graphs

In [None]:
generateConfusionMatrixGraphs_All() 

#### Function to create Confusion Matrix Arrays - Accuracy, Sensitivity, Specificity, Recall & F1_Score

In [None]:
    # Confusion Matrix Arrays of each ML model

def confusionMatrixArrays():   
        
        # K-Nearest Neighbors
        cm_knn = generateConfusionMatrixArray(prediction_knn)

        # Support vector machine
        # cm_svc = generateConfusionMatrixArray(prediction_svc)
        
        # Random forest
        cm_rf = generateConfusionMatrixArray(prediction_rf)
        
        # Logistic regression
        cm_lr = generateConfusionMatrixArray(prediction_lr)

        # Decision tree
        cm_dt = generateConfusionMatrixArray(prediction_dt)

        # Ada boost
        cm_abc = generateConfusionMatrixArray(prediction_abc)

        # Gradient boosting
        cm_gbc = generateConfusionMatrixArray(prediction_gbc)

        # Voting
        cm_vc = generateConfusionMatrixArray(prediction_vc)

        return (cm_knn, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc)

#### Call function to create Confusion Matrix Arrays - Accuracy, Sensitivity, Specificity, Recall & F1_Score

In [None]:
cm_knn, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc = confusionMatrixArrays()

#### Function to append the TP, TN, FP and FN counts to the 'df_ml_eval' DataFrame

In [None]:
def appendConfusionMatrixResults():
    
    # Each cm is a crosstab with 'Actual' rows and 'Predicted' columns, so cm[p][a] is the count
    # for predicted label p and actual label a. Churn (1) is treated as the positive class.
    df_ml_eval['TP'] = [cm_knn[1][1], cm_rf[1][1], cm_lr[1][1], cm_dt[1][1], cm_abc[1][1], cm_gbc[1][1], cm_vc[1][1]]
    df_ml_eval['TN'] = [cm_knn[0][0], cm_rf[0][0], cm_lr[0][0], cm_dt[0][0], cm_abc[0][0], cm_gbc[0][0], cm_vc[0][0]]
    df_ml_eval['FP'] = [cm_knn[1][0], cm_rf[1][0], cm_lr[1][0], cm_dt[1][0], cm_abc[1][0], cm_gbc[1][0], cm_vc[1][0]]
    df_ml_eval['FN'] = [cm_knn[0][1], cm_rf[0][1], cm_lr[0][1], cm_dt[0][1], cm_abc[0][1], cm_gbc[0][1], cm_vc[0][1]]

#### Call function to append the TP, TN, FP and FN counts to the 'df_ml_eval' DataFrame

In [None]:
appendConfusionMatrixResults()

#### Function to calculate accuracy, sensitivity, specificity, recall & F1_score

In [None]:
def calculateAccSensiSpeciRecallF1Score():

    # Accuracy : correct predictions (TP + TN) as a percentage of all predictions
    df_ml_eval['ACCURACY'] = round(((df_ml_eval['TP']+df_ml_eval['TN'])/(df_ml_eval['TP'] + df_ml_eval['TN'] + df_ml_eval['FP']+df_ml_eval['FN'])*100), 2)

    # 'SENSITIVITY' column : TP / (TP + FP), the share of predicted positives that are truly positive.
    # This ratio is usually called precision; sensitivity in the strict sense is the same as recall.
    df_ml_eval['SENSITIVITY'] = round((df_ml_eval['TP']/(df_ml_eval['TP']+df_ml_eval['FP'])*100), 2)

    # 'SPECIFICITY' column : TN / (TN + FN), the share of predicted negatives that are truly negative.
    # This ratio is usually called negative predictive value; specificity in the strict sense is TN / (TN + FP).
    df_ml_eval['SPECIFICITY'] = round((df_ml_eval['TN']/(df_ml_eval['TN']+df_ml_eval['FN'])*100), 2)

    # Recall : TP / (TP + FN), the share of actual positives that are predicted as positive
    df_ml_eval['RECALL'] = round(((df_ml_eval['TP'])/(df_ml_eval['TP']+df_ml_eval['FN'])*100), 2)

    # F1 score : harmonic mean of the 'SENSITIVITY' column (precision) and recall
    df_ml_eval['F1_SCORE'] = ((df_ml_eval['SENSITIVITY'] * df_ml_eval['RECALL']) / (df_ml_eval['SENSITIVITY'] + df_ml_eval['RECALL']))*2


#### Call function for calculating accuracy, sensitivity, specificity, recall & F1_score

In [None]:
calculateAccSensiSpeciRecallF1Score()

df_ml_eval
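
For reference, the standard definitions of the metrics derived from the four confusion matrix counts can be written as one small function. This is a sketch with a hypothetical helper name (`confusion_metrics`) and made-up counts; it returns fractions rather than the percentages stored in the table above.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)      # share of predicted positives that are correct
    recall = tp / (tp + fn)         # share of actual positives that are found
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': recall,      # sensitivity is another name for recall
        'specificity': tn / (tn + fp),  # share of actual negatives that are found
        'precision': precision,
        'npv': tn / (tn + fn),      # negative predictive value
        'f1': 2 * precision * recall / (precision + recall),
    }

# Made-up counts for illustration
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f'{name:12s}: {value:.4f}')
```

Sensitivity and specificity are both conditioned on the actual class, while precision and negative predictive value are conditioned on the predicted class, so the two pairs answer different questions.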

#### Function for plotting accuracy, sensitivity, specificity, recall, F1_score & time

In [None]:
def modelComparisionPlot(fName):

    fig, ((ax0, ax1), (ax2, ax3), (ax4, ax5)) = plt.subplots(ncols=2, nrows=3, figsize=(15, 12), sharex=True)
    fig.tight_layout(pad=5.0)

    bar_container0 = ax0.bar(df_ml_eval['ALGORITHM'], df_ml_eval['ACCURACY'])
    ax0.bar_label(bar_container0)
    ax0.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['ACCURACY'])
    ax0.plot(df_ml_eval['ALGORITHM'], df_ml_eval['ACCURACY'])
    ax0.set(title='Accuracy Graph');
    ax0.grid(True)

    bar_container1 = ax1.bar(df_ml_eval['ALGORITHM'], df_ml_eval['SENSITIVITY'])
    ax1.bar_label(bar_container1)
    ax1.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['SENSITIVITY'])
    ax1.plot(df_ml_eval['ALGORITHM'], df_ml_eval['SENSITIVITY'])
    ax1.set(title='Sensitivity Graph')
    ax1.grid(True)

    bar_container2 = ax2.bar(df_ml_eval['ALGORITHM'], df_ml_eval['SPECIFICITY'])
    ax2.bar_label(bar_container2)
    ax2.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['SPECIFICITY'])
    ax2.plot(df_ml_eval['ALGORITHM'], df_ml_eval['SPECIFICITY'])
    ax2.set(title='Specificity Graph')
    ax2.grid(True)

    bar_container3 = ax3.bar(df_ml_eval['ALGORITHM'], df_ml_eval['RECALL'])
    ax3.bar_label(bar_container3)
    ax3.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['RECALL'])
    ax3.plot(df_ml_eval['ALGORITHM'], df_ml_eval['RECALL'])
    ax3.set(title='Recall Graph')
    ax3.grid(True)

    bar_container4 = ax4.bar(df_ml_eval['ALGORITHM'], df_ml_eval['F1_SCORE'])
    ax4.bar_label(bar_container4)
    ax4.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['F1_SCORE'])
    ax4.plot(df_ml_eval['ALGORITHM'], df_ml_eval['F1_SCORE'])
    ax4.set(title='F1-Score Graph')
    ax4.grid(True)

    bar_container5 = ax5.bar(df_ml_eval['ALGORITHM'], df_ml_eval['TIME'])
    ax5.bar_label(bar_container5, fmt='%.2f')
    ax5.set(title='Processed Time', xlabel='', ylabel='Time (seconds)')
    ax5.grid(True)

    # With sharex=True only the bottom row shows tick labels, so rotate them on both bottom axes
    for ax in (ax4, ax5):
        ax.tick_params(axis='x', labelrotation=70)

    fig.suptitle(fName, fontsize = 16, weight = 'extra bold', y=1)
    plt.savefig(f'{fName}.png', dpi=100)

#### Call function for plotting accuracy, sensitivity, specificity, recall, F1_score & time

In [None]:
modelComparisionPlot('Bar Graphs - Before Columns Removed')

### Evaluation Type 2 : Receiver Operating Characteristic (ROC)

#### Function for generating ROC graphs

In [None]:
from sklearn.metrics import roc_curve

def generateROCgraphs(fName):

    fig, axes = plt.subplots(ncols=2, nrows=4, figsize=(10, 12))
    fig.tight_layout(pad=5.0)
    axes = axes.ravel()  # flatten the 4x2 grid so it can be filled row by row

    models = [model_knn, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc]
    modelN = ['KNN', 'Random Forest', 'Logistic Regression', 'Decision Tree', 'Ada Boost', 'Gradient Boosting', 'VC']

    # One subplot per model
    for ax, model, name in zip(axes, models, modelN):

        # Predicted probability of the positive class (churn)
        y_pred_prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

        ax.plot([0, 1], [0, 1], 'k--')  # chance line
        ax.plot(fpr, tpr, label=name, color="r")
        ax.set(title=f'{name} ROC Curve', xlabel='False Positive Rate', ylabel='True Positive Rate')

    # Seven models on a 4x2 grid leave the last subplot empty, so hide it
    for ax in axes[len(models):]:
        ax.set_visible(False)
    
    fig.suptitle(fName, fontsize = 16, weight = 'extra bold', y=1)
    plt.savefig(f'{fName}.png', dpi=100)

#### Call function to generate ROC graphs

In [None]:
generateROCgraphs('ROC Curves - Before Columns Removed')
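
Each ROC plot can be summarized by a single number, the ROC AUC listed among the evaluation metrics. The sketch below computes it in plain Python, using the fact that the area under the ROC curve equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative only; inside the notebook, `sklearn.metrics.roc_auc_score(y_test, y_pred_prob)` would be the usual call.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('Both classes must be present to compute the AUC.')

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))

# Toy data: 1 = churn, 0 = no churn (not taken from the project dataset)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plots, and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of rows, so it is meant for illustration rather than for the full test set.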

# ===========================================================

# Retrain the models after removing some features: Analyzed

#### Columns to be removed

In [None]:
unwantedColumns = ['ThreewayCalls', 'CallForwardingCalls', 'HandsetRefurbished', 'HandsetWebCapable', 'TruckOwner', 
                   'RVOwner', 'Homeownership', 'OptOutMailings', 'NonUSTravel', 'OwnsComputer', 'RetentionCalls', 
                   'RetentionOffersAccepted', 'NewCellphoneUser', 'NotNewCellphoneUser', 'ReferralsMadeBySubscriber', 
                   'OwnsMotorcycle', 'AdjustmentsToCreditRating', 'HandsetPrice', 'MadeCallToRetentionTeam', 
                   'CreditRating_1-Highest', 'CreditRating_3-Good', 'CreditRating_4-Medium', 'CreditRating_5-Low', 
                   'CreditRating_6-VeryLow', 'CreditRating_7-Lowest', 'PrizmCode_Rural', 'PrizmCode_Town', 
                   'Occupation_Clerical', 'Occupation_Crafts', 'Occupation_Homemaker', 'Occupation_Professional', 
                   'Occupation_Retired', 'Occupation_Self', 'Occupation_Student', 'MaritalStatus_No']

#### Remove columns

In [None]:
df.drop(unwantedColumns, axis=1, inplace=True)

#### Machine learning results dataframe re-initializing

In [None]:
df_ml_eval = pd.DataFrame({'ATTEMPT':[], 'ALGORITHM':[], 'ALGOINDEX':[], 'TIME':[]})

#### Re-initializing the X_train, X_test, y_train, y_test objects

In [None]:
X_train = X_test = df.drop(df.index)
y_train = y_test = df.drop(df.index)

#### Splitting the new dataframe

In [None]:
splitDataset(df)
replaceYesNo()

#### Calling models training function

In [None]:
(model_knn, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc,
 prediction_knn, prediction_lr, prediction_rf, prediction_dt,
 prediction_abc, prediction_gbc, prediction_vc) = runModelTrainFunctions()

#### Generate Confusion Matrix graphs

In [None]:
generateConfusionMatrixGraphs_All()

#### Call function to create Confusion Matrix Arrays - Accuracy, Sensitivity, Specificity, Recall & F1_Score

In [None]:
cm_knn, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc = confusionMatrixArrays()

#### Call function to append 'df_ml_eval' DataFrame with sensitivity and specificity data

In [None]:
appendConfusionMatrixResults()

#### Call function for calculating accuracy, sensitivity, specificity, recall & F1_score

In [None]:
calculateAccSensiSpeciRecallF1Score()

#### Call function for plotting accuracy, sensitivity, specificity, recall, F1_score & time

In [None]:
modelComparisionPlot('BarGraph_AfterColumnsRemoved')

#### Call function to generate ROC graphs

In [None]:
generateROCgraphs('ROC_AfterColumnsRemoved')

# ===========================================================

## < Finalizing The Most Suitable Model >

## < Model Deployment >

### Importing the 'joblib' library

In [None]:
import joblib

### Save the model as a file

The exported trained model can be used to predict customer churn in the telecom organization.

In [None]:
joblib.dump(model_lr, 'model.pkl')

### Load the trained model

In [None]:
model = joblib.load('model.pkl')

### Function to predict

In [None]:
def predict(df):

    # Drop the 'Churn' column from the dataset passed in (not from the global dfHoldout)
    df.drop('Churn', axis=1, inplace=True)

    # Apply the custom functions to prepare the dataset for the deployed ML model
    dropColumns(df)
    imputeNA(df)
    labelEncoding(df)
    df = oneHotEncoding(df)
    minMaxScale(df)
    df.drop(unwantedColumns, axis=1, inplace=True)

    # Make predictions
    predictions = model.predict(df)

    # Store the predictions in a new column 'Churn'
    df['Churn'] = predictions

    # Replace 0 and 1 with 'No' and 'Yes' (assign back instead of a chained in-place replace)
    df['Churn'] = df['Churn'].replace({0: 'No', 1: 'Yes'})

    return df
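
One point to check when preparing the holdout data is feature scaling. The body of `minMaxScale` is not shown in this section, so the following is only a sketch under the assumption that min-max scaling is used: the minimum and maximum should come from the training data and be reused on new data, otherwise the same raw value is mapped to a different scaled value than the model saw during training. The helper names `fit_min_max` and `apply_min_max` and the sample values are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling range from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned range."""
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature values, not taken from the project dataset
train_values = [0.0, 50.0, 100.0]
holdout_values = [25.0, 75.0]

# Reuse the training range on the holdout data
lo, hi = fit_min_max(train_values)
scaled_holdout = apply_min_max(holdout_values, lo, hi)

# Refitting on the holdout data alone gives different numbers for the same rows
refit_lo, refit_hi = fit_min_max(holdout_values)
refit_holdout = apply_min_max(holdout_values, refit_lo, refit_hi)

print(scaled_holdout)  # [0.25, 0.75]
print(refit_holdout)   # [0.0, 1.0]
```

If a scikit-learn scaler is used, the same idea applies: fit it once on the training set, save it next to the model, and only call its `transform` method on the holdout data.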

### Read the data to be predicted

In [None]:
# The holdout dataset contains 20,000 records to be predicted
dfHoldout = pd.read_csv('telcomHoldout.csv')

### Call the 'predict' function and assign the new dataset with the new 'Churn' column

In [None]:
dfHoldout = predict(dfHoldout)

### View the predicted churn column of the dataset

In [None]:
dfHoldout.Churn

In [None]:
print(f'Total time : {(datetime.now()-mainStartTime)}')