# Develop more robust and precise models that can identify potential churners early and provide actionable insights for retention strategies.

## The steps of the process of prediction used in this project

- <b> < Data Collection > </b>:

The dataset contains customer behavior, interactions, and historical churn.

- <b> < Data Preprocessing > </b>:

Clean and preprocess the data by handling missing values and outliers and by encoding categorical variables, making the dataset suitable for ML.
     
     
- <b> < Feature Engineering ></b>:

Create meaningful features that can help the model make accurate predictions.
    This involves feature scaling, normalization, or generating new features based on domain knowledge.
    
    
- <b> < Data Splitting > </b>:

Split the dataset into training, validation, and test sets.
    A common split is 80% for training, 20% for testing. The validation set is taken as a separate dataset and is used for hyperparameter tuning.
    
    
- <b> < Model Selection > </b>:

Choose the appropriate machine learning algorithms for churn prediction.
    Common choices include K-nearest neighbors, support vector machine, logistic regression, random forests, decision trees, ada boost, gradient boosting, and voting.
    Consider using ensemble methods or stacking multiple models for better performance.


- <b> < Model Training > </b>:

Train the selected models using the training dataset.
    Tune hyperparameters to optimize model performance on the validation set.
    This may involve techniques like cross-validation.


- <b> < Model Evaluation > </b>:

Evaluate the model's performance using appropriate metrics such as accuracy, precision, recall,
    F1-score, ROC AUC, or customer-centric metrics like customer lifetime value (CLV).
    Compare the performance of different models and
    choose the one that best aligns with the business objectives.

- <b> < Finalizing The Most Suitable Model > </b>:

Finalize the best model based on the metrics calculated.
    
- <b> < Feature Importance Analysis > </b>:

Understand which features are most important for making churn predictions.
    Feature importance analysis can help refine the model and provide insights into customer behavior.
    
    
- <b> < Model Deployment > </b>:

Once the model with satisfactory performance is selected, it can be deployed to a production environment where it can make real-time predictions. Consider using APIs or containerization for deployment.


- <b> < Predicting the churn using the hold-out dataset > </b>:

Feed the hold-out dataset to the deployed model after applying all the preprocessing activities that were applied during model training.
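
The feature importance step listed above can be sketched with permutation importance from scikit-learn. This is only an illustration under stated assumptions: the data is synthetic, the feature names are hypothetical, and a random forest stands in for whichever model is finally selected.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (hypothetical): only the first feature drives the label
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

for name, score in zip(['feature_0', 'feature_1', 'feature_2'], result.importances_mean):
    print(f'{name}: {score:.3f}')
```

A feature whose shuffling barely changes the score contributes little to the predictions, which makes it a candidate for removal or closer inspection.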

## Further steps to be followed after the model deployment

- <b>Monitoring and Maintenance</b>:

Continuously monitor the model's performance in a production environment. Re-train the model periodically with new data to ensure it remains accurate as customer behavior changes over time.
    
    
- <b>Interpretability and Explainability</b>:

Understand how the model is making predictions. Use techniques like SHAP values or LIME to interpret and explain model decisions to stakeholders.


- <b>Feedback loop</b>:

Incorporate feedback from business stakeholders, customer support, and other relevant sources to improve the model over time.
    
    
- <b>Scale and Iterate</b>:

As your business evolves and gathers more data, consider scaling the model and iterating on the process to improve prediction accuracy and reduce churn.

## Libraries installed

- pandas
- numpy
- matplotlib
- plotly
- kaleido
- ydata-profiling (!pip install -U ydata-profiling)
- seaborn
- scikit-learn (imported as sklearn)
- jupyter

## Special Notes

##### Run the below code to upgrade the '<b>threadpoolctl</b>' library if the training code of the <b>KNN</b> model throws an exception.

    - In a notebook cell: !pip install threadpoolctl --upgrade
    - In a terminal: pip install threadpoolctl --upgrade


##### Run the below code to install the ydata-profiling library
    - In a notebook cell: !pip install -U ydata-profiling
    - In a terminal: pip install -U ydata-profiling
    
##### Install kaleido to export plotly plots as static images
    - pip install -U kaleido

## Feature Identification

## =======================================================================

In [None]:
# Libraries related to date and time
from datetime import datetime

## Global Variables

In [None]:
import os

# Initialize the program started time
mainStartTime = datetime.now()

# Initialize the ML start and end time.
startTime = datetime.now()
endTime = datetime.now()

# List of the column names
colNames = []

# Run visualization codes (1=Show Plots and other graphs)
showViz = 1

# Create directories if not exist

listFolders = ['distribution_bar', 'distribution_pie']

for f in listFolders:

    isExist = os.path.exists(f)

    if not isExist:
        os.makedirs(f)

##  < Data Collection >

#### Two datasets are used throughout the process.

    - CustomerChurnDataset_TrainTest.csv - The dataset used to train and test the models
    - CustomerChurnDataset_Holdout.csv   - The hold-out dataset used for prediction with the most accurate and appropriate model

#### Importing the required libraries for data manipulation and visualization

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Reading the dataset into pandas DataFrame

In [None]:
df = pd.read_csv('CustomerChurnDataset_TrainTest.csv')

### Data exploration

In [None]:
# Show the number of columns and rows
df.shape

In [None]:
# View random sample of the dataset (5 observations)
df.sample(5)

#### Statistics

In [None]:
df.describe()

In [None]:
# View columns
df.columns

In [None]:
# Data types
def showDataTypes():

    dfDtypes = pd.DataFrame({'Data Type': [], 'Count': []})
    count = 0
    lDtypes = []
    lCount = []

    for x in df.dtypes.unique():
        for y in df.dtypes:
            if x == y:
                count = count + 1
        lDtypes.append(x)
        lCount.append(count)

        count = 0

    dfDtypes['Data Type'] = lDtypes
    dfDtypes['Count'] = lCount

    return dfDtypes

In [None]:
# Observing data types
showDataTypes()

#### Columns to be dropped

In [None]:
# Check 'City' column
df['City'].nunique()

In [None]:
# Since there are 1129 unique categorical values in the 'City' column, one-hot encoding is impractical.
# In order to avoid complications in the process, the 'City' column should be dropped.

In [None]:
# 'CustomerID', 'Churn Score' and 'Churn Reason' are not considered independent variables (predictors)
df.drop(['CustomerID', 'City', 'Churn Score', 'Churn Reason'], axis=1, inplace=True)
df

In [None]:
# Observing data types
showDataTypes()

In [None]:
# Check the unique values in 'Churn Label' and 'Churn Value' features 

In [None]:
df['Churn Label'].unique()

In [None]:
df['Churn Value'].unique()

In [None]:
# Validate 'Yes' and 'No' values in 'Churn Label' match with 1 and 0 in 'Churn Value'

In [None]:
# If the total = 7043 (the total number of observations), one of the two features can be dropped. (In this case, 'Churn Label' is dropped)

count1 = len(df.loc[(df['Churn Label'] == 'Yes') & (df['Churn Value'] == 1)])
count2 = len(df.loc[(df['Churn Label'] == 'No') & (df['Churn Value'] == 0)])

count1 + count2

In [None]:
# Treat 'Churn Value' as the dependent variable, i.e. the value to be predicted. For the sake of ease, the
# column is renamed to 'Churn'

df.rename(columns = {'Churn Value':'Churn'}, inplace = True)

# Drop the 'Churn Label'
df.drop('Churn Label', axis=1, inplace=True)

In [None]:
# Check for unique values in dtype 'object' (categorical)

def showUniqueValues(df):
  colNames.clear()
  for column in df:
    colNames.append(column)
    if df[column].dtype =='object':
      print(f'{column} : {df[column].unique()}')

showUniqueValues(df)

In [None]:
# List of column names
colNames

## < Data preprocessing >


#### Function to check for missing data or NA values

In [None]:
null_rows_selector = df.isnull().any(axis=1)
null_row_count = df[null_rows_selector].shape[0]

df_null = df.isnull().groupby(df['Churn']).sum().transpose()
df_null['total'] = df.isnull().sum()
df_null['percent'] = (df_null['total']/len(df))*100
df_null = df_null[df_null.total!=0]

print("rows with null values:",null_row_count,", {:.2f}%".format((null_row_count/len(df))*100))
print('columns with null values:',df_null.shape[0])

df_null

### Check the correlation with the 'Churn' 

In [None]:
# Chi square independence test to see if the difference in distributions is statistically significant

contingency_table = pd.concat([df['Churn'].value_counts().rename("Overall"), df[null_rows_selector]['Churn'].value_counts().rename("within_null_rows")],axis=1).transpose()
contingency_table

#### Check for Null values

In [None]:
for x in df.columns:
    if df[x].isna().sum() != 0:
        print(x, df[x].isna().sum())

In [None]:
# Only 'Churn Reason' column contains the null values. By observing the above results, it is identified that 
# the number of null values equals the number of rows where 'Churn Value' = 0. Since the 'Churn Reason' is not considered
# in this research, there are no null values in the dataset after dropping the 'Churn Reason' column.

In [None]:
from scipy.stats import chi2_contingency,ttest_ind

# Chi-square independence test
# Null hypothesis H0: the distribution of Churn is independent of the presence of null values

alpha = 0.05 # significance level for the test

if null_row_count == 0:
    # Without null rows the contingency table has an empty row and the test is undefined
    print('No rows with null values; chi-square test skipped')
else:
    stat, p, dof, expected = chi2_contingency(contingency_table.values)

    # Interpret the p-value
    print('p value is ' + str(p))
    print('Dependent (reject H0)' if p <= alpha else 'Independent (H0 holds true)')
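
To make the chi-square independence test more concrete, here is a small worked sketch on a 2x2 table. The counts are hypothetical and are not taken from the customer churn dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = churned / retained, columns = two customer groups
observed = np.array([[30, 70],
                     [60, 40]])

# 'expected' holds the counts we would see if rows and columns were independent
stat, p, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(f'chi2 = {stat:.2f}, p = {p:.4f}, dof = {dof}')
print(expected)
print('Dependent (reject H0)' if p <= alpha else 'Independent (H0 holds true)')
```

The further the observed counts are from the expected ones, the larger the statistic and the smaller the p-value.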

In [None]:
lowCorrelationFeatures = []
# Categorical low correlation features

lowCorrelationFeaturesC = []

# Numerical low correlation features
lowCorrelationFeaturesN = []

In [None]:
# Chi-squared independence test between the categorical variables and Churn. This helps identify the important variables on which Churn depends.
# Based on the test, we can recommend dropping or including them during training
def chi2test(X,y,alpha=0.05):
    '''
        X : dataframe 
        y : series
    '''
    target = y.name
    print('chi2test with alpha',alpha)
    test_df = []
    for index,col in X.select_dtypes(include=['object']).columns.to_series().items():
        df = pd.concat([y,X[col]],axis=1)
        contingency_table = df.value_counts().rename("counts").reset_index().pivot(index=target,columns=col,values='counts').fillna(0)
        stat, p, dof, expected = chi2_contingency(contingency_table.values)
        test_df.append([target,col,stat,p,'Dependent (reject H0)' if p <= alpha else 'Independent (H0 holds true)','include' if p <= alpha else 'drop'])
        
    test_df = pd.DataFrame(test_df,columns=["variable1","variable2","chi2-stat","p-value","result","recommendation"])
    return test_df

df_chi = chi2test(df.drop('Churn',axis=1),df['Churn'])

lowCorrelationFeaturesC.extend(df_chi[df_chi.recommendation == 'drop']['variable2'].to_list())

df_chi

In [None]:
# Print values of lowCorrelationFeatures in one line.

def fnPrintLowCorrelationFeatures(l1):

    # Join the names with commas instead of checking for the last element in a loop
    print(', '.join(str(i) for i in l1))

In [None]:
# View the list of categorical features with low correlation
fnPrintLowCorrelationFeatures(lowCorrelationFeaturesC)

In [None]:
# Welch's t-test to check whether the mean of each numerical variable differs significantly between churned and retained customers.

def t_test(X,y,alpha=0.05):   
    target = y.name
    print('t_test with alpha',alpha)
    test_df = []
    for index,col in X.select_dtypes(exclude=['object']).columns.to_series().items():
        df = pd.concat([y,X[col]],axis=1)
        ttest_df = df.set_index(target,drop=True).fillna(0)
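        # Select the column so that ttest_ind returns scalar values rather than one-element arrays
        stat, p = ttest_ind(ttest_df.loc[1, col],ttest_df.loc[0, col],equal_var=False)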
        test_df.append([target,col,stat,p,'Dependent (reject H0)' if p <= alpha else 'Independent (H0 holds true)','include' if p <= alpha else 'drop'])
        
    test_df = pd.DataFrame(test_df,columns=["variable1","variable2","t-stat","p-value","result","recommendation"])
    return test_df

df_t_test = t_test(df.drop('Churn',axis=1),df['Churn'])

lowCorrelationFeaturesN.extend(df_t_test[df_t_test.recommendation == 'drop']['variable2'].to_list())

df_t_test
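
As a minimal illustration of the t-test used above, the sketch below compares two synthetic samples. The group names and the distributions are assumptions made for the example only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic monthly charges for two hypothetical groups of customers
rng = np.random.default_rng(1)
charges_churned = rng.normal(loc=80, scale=10, size=200)
charges_retained = rng.normal(loc=60, scale=20, size=300)

# equal_var=False requests Welch's t-test, which does not assume equal variances
stat, p = ttest_ind(charges_churned, charges_retained, equal_var=False)

alpha = 0.05
print(f't = {stat:.2f}, p = {p:.4g}')
print('Dependent (reject H0)' if p <= alpha else 'Independent (H0 holds true)')
```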

In [None]:
# View the list of columns to be dropped (still empty at this point; it is filled when the two lists are merged below)
lowCorrelationFeatures

In [None]:
# View the list of numerical features with low correlation
fnPrintLowCorrelationFeatures(lowCorrelationFeaturesN)

In [None]:
# Merge categorical and numerical features with low correlation
lowCorrelationFeatures = lowCorrelationFeaturesC + lowCorrelationFeaturesN

In [None]:
# View the list of all the features with low correlation
fnPrintLowCorrelationFeatures(lowCorrelationFeatures)

#### Drop columns

In [None]:
df.drop(lowCorrelationFeatures, axis=1, inplace=True)

In [None]:
df

In [None]:
# Analyze categorical type variables
showUniqueValues(df)

In [None]:
# Replace 'No internet service' and 'No phone service' values with 'No' in the entire dataset
df.replace('No internet service', 'No', inplace=True)
df.replace('No phone service', 'No', inplace=True)

In [None]:
# Analyze categorical type variables
showUniqueValues(df)

### NA imputation

In [None]:
# NA imputation is not required for this dataset because there are no null values except in
# the 'Churn Reason' column, which has already been dropped

## Descriptive analysis

### Five-Number Summary

In [None]:
df.describe()

### Data distribution - Pie charts
#### Churn visualization

In [None]:
import plotly.offline as po
import plotly.graph_objects as go
import plotly.io as pio

if showViz == 1:
    churn_key = df['Churn'].value_counts().keys().tolist()
    churn_value = df['Churn'].value_counts().values.tolist()

    plot_data = [
        go.Pie(labels=churn_key, values=churn_value, marker=dict(colors=['Teal','Gray'], line=dict(color='white', width=1.5)),
        rotation=90,
        hoverinfo="label+value+text",
        hole=0.6)
    ]

    plot_layout = go.Layout(dict(title="Customer Churn", plot_bgcolor='rgb(243, 243,243)', paper_bgcolor='rgb(243, 243, 243)',))

    fig = go.Figure(data=plot_data, layout=plot_layout)
    
    po.plot(fig, filename = f'{listFolders[1]}/Customer Churn.html', auto_open=False)
    
    po.iplot(fig)

#### Categorical type features distribution analysis

In [None]:
# The following function is used to visualize the distribution of the other columns

# Function to visualize the distribution
def distributionPie(column):
  # Take labels and values from the same value_counts() result so that they stay aligned
  counts = df[column].value_counts()
  labels = counts.index
  values = counts.values

  fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0.6, rotation=90)])
  fig.update_layout(title_text=f"{column} Distribution", plot_bgcolor='rgb(243, 243,243)')
  po.plot(fig, filename = f'{listFolders[1]}/{column}.html', auto_open=False)
  fig.show()

if showViz == 1:
    # Loop the column names (categorical)
#     for column in df.drop(['ServiceArea', 'Churn', 'HandsetPrice'], axis=1):
    
#     for column in df.drop(['Churn', 'HandsetPrice'], axis=1):
#         if df[column].dtype =='object':
#             distributionPie(column)
            
    for column in df.drop(['Churn'], axis=1):
        if df[column].dtype =='object':
            distributionPie(column)

### Pandas profiling report

##### This report is an interactive data analysis report for quickly gaining insights into a dataset. The report provides a wide range of statistical and visual summaries of the data, helping data analysts and data scientists to understand the dataset's characteristics, identify potential issues, and make informed decisions about data preprocessing and analysis.

#### Generating the profiling report and saving as an HTML file.

In [None]:
import webbrowser

def runProfilingReport(fName):
    
    # ProfileReport started time
    startTime = datetime.now()

    from ydata_profiling import ProfileReport

    # Create pandas profiling report
    profReport = ProfileReport(df)

    # Download pandas profiling report in html format
    profReport.to_file(f'{fName}.html')

    endTime = datetime.now()
    
    # Open the html file
    filename = 'file:///'+os.getcwd()+'/' + f'{fName}.html'
    webbrowser.open(filename , new=2)

    print(f'Profile Report processing time : {endTime-startTime}')

In [None]:
if showViz == 1:
    runProfilingReport('Row Data Analysis')

#### View the profiling report.

In [None]:
# View pandas profiling report (This will take a few minutes to load the report)
# Since this takes a considerable amount of time to load, the report is saved as 'Row Data Analysis.html' in the directory.
# Open the exported html file to view the plots and other statistics instead.

# Uncomment the below code to view the report along with the code.

#profReport

#### Visualizing the distribution of numerical features by 'Churn' 

In [None]:
import seaborn as sns

if showViz == 1:

    for column in df.drop('Churn', axis=1):
        
        if df[column].dtype !='object':
            fig, ax = plt.subplots(figsize=(6, 3))
            sns.set_context("paper",font_scale=1.1)
            ax = sns.kdeplot(df[column][(df["Churn"] == 0) ],
                            color="Red", fill=True);
            ax = sns.kdeplot(df[column][(df["Churn"] == 1) ],
                            ax =ax, color="Blue", fill=True);
            ax.legend(["Not Churn","Churn"],loc='upper right');
            ax.set_ylabel('Density');
            ax.set_xlabel(column);
            ax.set_title(f'Distribution of {column} by churn');

#### Comparing the features statistically against 'Churn' values 'Yes' and 'No'

In [None]:
import plotly.express as px

def compareStatsWithChurn(colName):

  fig = px.box(df, x='Churn', y = colName)

  # Update yaxis properties
  fig.update_yaxes(title_text=colName, row=1, col=1)
  # Update xaxis properties
  fig.update_xaxes(title_text='Churn', row=1, col=1)

  # Update size and title
  fig.update_layout(autosize=True, width=750, height=600,
      title_font=dict(size=25, family='Courier'),
      title=f'<b>{colName} vs Churn</b>',
  )

  fig.show()

if showViz == 1 :
    for c in colNames:
        compareStatsWithChurn(c)

#### Analyzing 'Churn' vs other features

In [None]:
# Function to analyze customer churn vs other features

def churnHist(colName):
  fig, ax = plt.subplots(figsize=(6, 4))

  churn_yes = df[df.Churn==1][colName]
  churn_no = df[df.Churn==0][colName]

  ax.hist([churn_yes, churn_no], color=['red','purple'], label=['Yes','No'])
  ax.legend()

  ax.set(title=f'Customer churn vs {colName} analysis', xlabel=colName, ylabel='Number of customers')

  # Use the function argument rather than the global loop variable 'column'
  plt.savefig(f'{listFolders[0]}/{colName}.png', dpi=100);

In [None]:
# Customer churn vs other features

if showViz == 1:
    for column in df.drop('Churn', axis=1):
      churnHist(column)

### Encoding

In [None]:
# Show categorical unique values
showUniqueValues(df)

#### Encoding Type 1 : Label encoding

In [None]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

labelEncodedColumns = ['Senior Citizen','Partner','Dependents','Multiple Lines','Online Security',
                       'Online Backup','Device Protection','Tech Support','Streaming TV','Streaming Movies',
                       'Contract','Paperless Billing']

def labelEncoding(df):
    
    # 'Contract' holds ordinal categorical values. LabelEncoder assigns codes in sorted
    # (alphabetical) order, so 'Month-to-month', 'One year' and 'Two year' become 0, 1 and 2,
    # which matches the explicit mapping kept below for reference.

#     df['Contract'] = df['Contract'].map({'Month-to-month':0, 'One year':1, 'Two year':2})

    # labelEncodedColumns (defined above) lists the columns to be label encoded.
    
    # Loop to do label encoding
    for col in labelEncodedColumns:
        df[col]= label_encoder.fit_transform(df[col])  

    return df

In [None]:
labelEncoding(df)

In [None]:
# Show categorical unique values
showUniqueValues(df)

In [None]:
# Check shape of the dataframe
df.shape

In [None]:
for c in labelEncodedColumns:
    print(f'{c} : ', df[c].unique())
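
The short sketch below shows how `LabelEncoder` assigns integer codes. It uses a hypothetical list of contract values rather than the real column and relies on the fact that the encoder sorts the classes before numbering them.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of 'Contract' values
contracts = ['Two year', 'Month-to-month', 'One year', 'Month-to-month']

encoder = LabelEncoder()
codes = encoder.fit_transform(contracts)

# classes_ lists the categories in the order of their codes (0, 1, 2, ...)
print(list(encoder.classes_))
print(list(codes))
```

Because the codes follow the sorted order of the labels, they only reflect a meaningful ranking when the alphabetical order happens to match it, as it does for these three contract types. For other ordinal columns an explicit mapping may be safer.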

#### Encoding Type 2 : One hot encoding

In [None]:
def oneHotEncoding(df):
    # One hot encoding is done to the relevant columns at once.
    df = pd.get_dummies(data=df, columns=['Internet Service', 'Payment Method'], dtype=float)
    return df

df = oneHotEncoding(df)

In [None]:
# Check shape of the dataframe
df.shape

In [None]:
# View the dataframe
df

In [None]:
# Show unique values of the columns
showUniqueValues(df)

In [None]:
# Get the column names
colNames.clear()
for column in df:
    colNames.append(column)

In [None]:
# Show some statistics

df.describe()

In [None]:
df['Churn'].value_counts()

In [None]:
len(colNames)

### Treating Outliers

#### Function to visualize outliers

In [None]:
# Boxplots
# When there are many features, the boxplots are grouped as 20 features per group.
import math

def showBoxplots2():
    
  l1 = []
  j = 0

  # Make sure the output directory exists before saving the figures
  os.makedirs('boxplots', exist_ok=True)

  for i in range(math.ceil(len(colNames)/20)):
    l1.append(colNames[i*20:(i+1)*20])

  for i in l1:
    j = j + 1
    fig = plt.subplots()
    b_plot = df.boxplot(column=i, vert=False)
    b_plot.plot()
    plt.savefig(f'boxplots/{j}.png', dpi=100);
    plt.show()
    print('\n')

In [None]:
def showBoxplots(title):
    
    fig = plt.subplots(figsize=(10, 8))
    b_plot = df.drop('Churn', axis=1).boxplot(vert=False)
    b_plot.plot()
    plt.title(f'Boxplots - {title}', fontsize = 16, weight = 'extra bold')
    #plt.savefig(f'Boxplots - {title}.png', dpi=200);
    plt.show()

#### Show boxplots - Before treating outliers

In [None]:
if showViz == 1:
    showBoxplots('Before Treating Outliers')

In [None]:
df.dtypes

#### Removing outliers

Outliers are replaced with the median of the column.
This dataset does not appear to contain outliers.

In [None]:
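colNames.remove('Churn')

for column in colNames:

    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)

    iqr = q3 - q1
    upperLevel = q3 + (iqr * 1.5)
    lowerLevel = q1 - (iqr * 1.5)

    # Replace values outside the IQR fences with the column median.
    # A single .loc assignment avoids chained indexing, which pandas may apply to a copy.
    # Note: for binary (0/1) columns the IQR can be 0, so the minority value is treated as an outlier.
    isOutlier = (df[column] < lowerLevel) | (df[column] > upperLevel)
    df.loc[isOutlier, column] = df[column].median()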
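
The IQR rule applied above can be checked by hand on a tiny example. The values below are hypothetical and chosen so that exactly one of them falls outside the fences.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)

# Replace flagged values with the median of the original series
cleaned = values.mask(is_outlier, values.median())

print(lower, upper)
print(cleaned.tolist())
```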

#### Show boxplots - After treating outliers

In [None]:
if showViz == 1:
    #colNames.remove('Churn')
    showBoxplots('After Treating Outliers')

In [None]:
#df.fillna(df.median(), inplace=True)

In [None]:
df.dtypes

### Lasso Coefficient

#### Function to generate lasso coefficient plot

In [None]:
from sklearn.linear_model import Lasso

# df['Churn'].replace({'Yes':1, 'No':0}, inplace=True)

def lassoCoefficientGraph():
    
#     df['Churn'].replace({'Yes':1, 'No':0}, inplace=True)

    # Separate the features from the target so that 'Churn' is not used to predict itself
    X = df.drop('Churn', axis=1)
    y = df['Churn']

    # Create a Lasso regression model with a specific alpha (regularization strength)
    lasso_model = Lasso(alpha=1.0)

    lasso_model.fit(X, y)

    # Access the Lasso coefficients
    lasso_coefficients = lasso_model.coef_
    
    fig, ax = plt.subplots(figsize=(30, 10))

    ax.bar(X.columns, lasso_coefficients)
    ax.set(title='Lasso Coefficient Graph', xlabel='Features', ylabel='Coefficient')
    
    ax.grid(True)
    plt.xticks(rotation=70);

#### Lasso coefficient graph

In [None]:
lassoCoefficientGraph()

### Pearson Correlation Coefficient

In [None]:
# List of categoricals
categoricals = list()
for x in df.columns:
    if df[x].dtype == 'object':
        categoricals.append(x)
df[categoricals].nunique()

In [None]:
numericals = [x for x in df.columns if x not in categoricals]

plt.figure(figsize=(15,8))
df[numericals].corr(method='pearson')['Churn'].sort_values(ascending = False).plot(kind='bar');

## < Feature Engineering >

### Scaling

In [None]:
# Check for features with type 'object'

for column in df:
  if df[column].dtype =='object':
    print(column)

#### Min_Max scaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [None]:
def minMaxScale(df):
    # Get all the column names to a list.
    colNames = df.columns.tolist()

    # Apply Min-Max scaler
    df[colNames] = scaler.fit_transform(df[colNames])

#### Apply Min-Max scaler

In [None]:
# Call min-max scaling function
minMaxScale(df)

# View sample
df.sample(5)
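
Min-max scaling maps each value to `(x - min) / (max - min)`. The notebook scales the whole DataFrame before splitting. A common alternative, sketched here on synthetic data with hypothetical variable names, is to learn the minimum and maximum from the training rows only and reuse them on the test rows, so that the test set does not influence the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and labels (hypothetical stand-ins for the churn data)
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=5)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min and max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those values on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach the scaled test values can fall slightly outside the 0 to 1 range, which is expected.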

###### Show some statistics

In [None]:
df.describe()

#### Show boxplots - After scaling

In [None]:
if showViz == 1:
    #colNames.remove('Churn')
    showBoxplots('After Scaling')

In [None]:
df.columns

### The boxplots generated after feature scaling indicate that the values of some features are concentrated near zero. To better understand the contribution of these columns, the first round of model training is done with all the columns.

## < Data Splitting >

### Train & Test Splits


In [None]:
from sklearn.metrics import recall_score, confusion_matrix, precision_score, f1_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X_train = X_test = df.drop(df.index)
y_train = y_test = df.drop(df.index)

def splitDataset(df):
    # All the columns except 'Churn'
    X = df.drop('Churn', axis=1)

    # 'Churn' column
    y = df['Churn']

    # Access dataframes declared globally
    global X_train, X_test, y_train, y_test
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
splitDataset(df)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test
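
The process overview mentions a validation set for hyperparameter tuning, while the split above produces only training and test sets. One possible way to obtain all three, sketched here on synthetic data with hypothetical variable names, is to call `train_test_split` twice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, balanced binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: hold out 20% of the rows for the final test
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y)

# Step 2: take 25% of the remainder (20% of the total) for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5, stratify=y_rest)

print(len(X_fit), len(X_val), len(X_hold))
```

Passing `stratify` keeps the churn ratio similar in every subset, which can matter when the classes are imbalanced.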

## < Model Selection >

The following supervised machine learning algorithms are used to analyze the customer churn dataset.

- #### K-Nearest Neighbors (KNN)
- #### Support Vector Machine (SVM)
- #### Random Forest
- #### Logistic Regression
- #### Decision Tree Classifier
- #### Ada Boost Classifier
- #### Gradient Boosting Classifier
- #### Voting Classifier

### Customized functions used to calculate important parameters for ML selection

#### Calculating and storing the duration spent for ML processes

In [None]:
# DataFrame to select the appropriate ML algorithm for this customer churn prediction dataset
df_ml_eval = pd.DataFrame({'ATTEMPT':[], 'ALGORITHM':[], 'TIME':[]})

attempt = 0

algoList = ['KNN', 'SVM', 'Random Forest', 'Logistic Regression', 'Decision Tree', 'Ada Boost', 'Gradient Boosting', 'Voting']

# Function to calculate ML processing time and store the records in the 'df_ml_eval' DataFrame for analysis
def calculateTimeML(t, nameML=None, accuracy=None):
  global startTime
  global endTime

  if t == 1:
     startTime = datetime.now()
  elif t == 2:
     endTime = datetime.now()


     # Print duration
     print(f'{nameML} time : {(endTime-startTime).total_seconds()}')

     # Store time in the DataFrame
     df_ml_eval.loc[len(df_ml_eval.index)+1] = [attempt, nameML, (endTime-startTime).total_seconds()]

#### Function to generate Confusion Matrix Graph

In [None]:
def generateConfusionMatrixGraph(algo, y_test, pred, attempt):

    fig, ax = plt.subplots(figsize=(4,3))
#     plt.figure(figsize=(4,3))
    sns.heatmap(confusion_matrix(y_test, pred, labels=[0, 1]), cmap="Blues",
                    annot=True,fmt = "d",linecolor="k",linewidths=3)

    # confusion_matrix puts actual labels on rows (y-axis) and predicted labels on columns (x-axis)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set(title=f'{algo} Confusion Matrix')
    
    plt.savefig(f'Confusion Matrix {attempt} - {algo}.png', dpi=200)
    
    
    plt.show()

#### Function to generate Confusion Matrix Array

In [None]:
def generateConfusionMatrixArray(prediction):
    # Assigned the actual and predicted values to a dictionary.
    dvalues = {'y_actual': y_test, 'y_predicted': prediction}

    # Create a dataframe from dvalues dictionary.
    dfcm = pd.DataFrame(dvalues)

    # Create the confusion matrix using the dfcm dataframe.
    cm = pd.crosstab(dfcm['y_actual'], dfcm['y_predicted'], rownames=['Actual'], colnames=['Predicted'])

    return cm
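
Besides the confusion matrix, the overview lists precision, recall, F1-score and ROC AUC as evaluation metrics. The sketch below computes them on a small set of hypothetical labels and predicted probabilities, with a 0.5 threshold assumed for turning probabilities into class predictions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical true labels and predicted churn probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.2, 0.7, 0.1]

# Turn probabilities into class predictions with a 0.5 threshold
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

precision = precision_score(y_true, y_pred)  # share of predicted churners who really churned
recall = recall_score(y_true, y_pred)        # share of real churners who were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)          # ranking quality, computed from probabilities

print(precision, recall, f1, round(auc, 4))
```

ROC AUC is computed from the probabilities rather than the thresholded predictions, so it does not depend on the chosen threshold.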

#### Replacing 'Yes' and 'No' values with 1 and 0 in y datasets

In [None]:
def replaceYesNo():
    # Number of model training attempts
    global attempt
    
    attempt = attempt + 1

    # Replacing 'Yes' and 'No' values with 1 and 0 to avoid python errors
    y_test.replace({'Yes':1, 'No':0}, inplace=True)
    y_train.replace({'Yes':1, 'No':0}, inplace=True)

replaceYesNo()

## < Model Training >

### 01 : K-Nearest Neighbors (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

def trainModel_knn():
    
    calculateTimeML(1)

    model_knn = KNeighborsClassifier(n_neighbors = 11)
    model_knn.fit(X_train,y_train)
    prediction_knn = model_knn.predict(X_test)
    accuracy_knn = model_knn.score(X_test,y_test)
    print("KNN accuracy :",accuracy_knn)

    calculateTimeML(2, 'KNN', accuracy_knn)

    print('\nClassification Report')
    print(classification_report(y_test, prediction_knn))
    
    return (model_knn, prediction_knn)
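
The KNN model above uses a fixed `n_neighbors = 11`. The overview mentions hyperparameter tuning with cross-validation. A minimal sketch of that idea with `GridSearchCV` follows. The synthetic data, the candidate values and the F1 scoring choice are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical stand-in for the churn features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values for the number of neighbors
param_grid = {'n_neighbors': [5, 11, 21]}

# 5-fold cross-validation, scoring each candidate by F1
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1')
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other classifiers by swapping the estimator and the parameter grid.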

### 02 : Support Vector Machine - SVM

##### Since the Support Vector Machine takes a considerable amount of time compared to the other algorithms, SVM is not considered in this research from here onwards.

In [None]:
from sklearn.svm import SVC

def trainModel_svc():
    
    calculateTimeML(1)

    model_svc = SVC(random_state = 1, probability=True)
    model_svc.fit(X_train,y_train)
    prediction_svc = model_svc.predict(X_test)
    accuracy_svc = model_svc.score(X_test,y_test)
    print('SVM accuracy :',accuracy_svc)

    calculateTimeML(2, 'SVM', accuracy_svc)

    print('\nClassification Report')
    print(classification_report(y_test, prediction_svc))

    return (model_svc, prediction_svc)

### 03 : Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

def trainModel_rf():
    calculateTimeML(1)

    model_rf = RandomForestClassifier(n_estimators=500 , oob_score = True, n_jobs = -1,
                                      random_state =50, max_features = "sqrt",
                                      max_leaf_nodes = 30)
    model_rf.fit(X_train, y_train)
    prediction_rf = model_rf.predict(X_test)
    accuracy_rf = metrics.accuracy_score(y_test, prediction_rf)
    print('Random Forest accuracy :', accuracy_rf)

    calculateTimeML(2, 'Random Forest', accuracy_rf)

    print("\nClassification Report")
    print(classification_report(y_test, prediction_rf))
    
    return (model_rf, prediction_rf)

###  04 : Logistic Regression


In [None]:
from sklearn.linear_model import LogisticRegression

def trainModel_lr():
    calculateTimeML(1)

    model_lr = LogisticRegression()
    model_lr.fit(X_train,y_train)
    prediction_lr = model_lr.predict(X_test)
    accuracy_lr = model_lr.score(X_test,y_test)
    print("Logistic Regression accuracy :",accuracy_lr)

    calculateTimeML(2, 'Logistic Regression', accuracy_lr)

    print("\nClassification Report")
    print(classification_report(y_test,prediction_lr))
    
    return (model_lr, prediction_lr)

### 05 : Decision Tree Classifier


In [None]:
from sklearn.tree import DecisionTreeClassifier

def trainModel_dt():

    calculateTimeML(1)

    model_dt = DecisionTreeClassifier()
    model_dt.fit(X_train,y_train)
    prediction_dt = model_dt.predict(X_test)
    accuracy_dt = model_dt.score(X_test,y_test)
    print("Decision Tree accuracy is :",accuracy_dt)

    calculateTimeML(2, 'Decision Tree', accuracy_dt)

    print('\nClassification Report')
    print(classification_report(y_test, prediction_dt))
    
    return (model_dt, prediction_dt)

### 06 : Ada Boost Classifier


In [None]:
from sklearn.ensemble import AdaBoostClassifier

def trainModel_abc():

    calculateTimeML(1)

    model_abc = AdaBoostClassifier()
    model_abc.fit(X_train,y_train)
    prediction_abc = model_abc.predict(X_test)
    accuracy_abc = metrics.accuracy_score(y_test, prediction_abc)
    print("Ada Boost Classifier accuracy : ", accuracy_abc)

    calculateTimeML(2, 'Ada Boost', accuracy_abc)

    print('\nClassification Report')
    print(classification_report(y_test, prediction_abc))
    
    return (model_abc, prediction_abc)

### 07 : Gradient Boosting Classifier


In [None]:
from sklearn.ensemble import GradientBoostingClassifier

def trainModel_gbc():
    calculateTimeML(1)

    model_gbc = GradientBoostingClassifier()
    model_gbc.fit(X_train, y_train)
    prediction_gbc = model_gbc.predict(X_test)
    accuracy_gbc = accuracy_score(y_test, prediction_gbc)
    print("Gradient Boosting Classifier accuracy : ", accuracy_gbc)

    calculateTimeML(2, 'Gradient Boosting', accuracy_gbc)

    print('\nClassification Report')
    print(classification_report(y_test, prediction_gbc))
    
    return (model_gbc, prediction_gbc)

### 08 : Voting Classifier


In [None]:
from sklearn.ensemble import VotingClassifier

def trainModel_vc():

    calculateTimeML(1)

    clf_gbc = GradientBoostingClassifier()
    clf_lr = LogisticRegression()
    clf_abc = AdaBoostClassifier()
    model_vc = VotingClassifier(estimators=[('gbc', clf_gbc), ('lr', clf_lr), ('abc', clf_abc)], voting='soft')
    model_vc.fit(X_train, y_train)
    prediction_vc = model_vc.predict(X_test)
    accuracy_vc = accuracy_score(y_test, prediction_vc)
    print(f"Final Accuracy Score {accuracy_vc}")

    calculateTimeML(2, 'Voting', accuracy_vc)

    print('\nClassification Report')
    print(classification_report(y_test, prediction_vc))
    
    return (model_vc, prediction_vc)
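
With `voting='soft'`, scikit-learn's `VotingClassifier` averages the class probabilities predicted by each estimator and picks the class with the highest average, rather than counting one vote per estimator. The sketch below illustrates the difference in plain Python; the helper names `soft_vote` and `hard_vote` and the probability values are hypothetical and are not part of the notebook.

```python
from collections import Counter

def soft_vote(probabilities):
    """Average per-model class probabilities; return (averages, winning class index)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted

def hard_vote(probabilities):
    """Let each model vote for its most likely class; return the majority class index."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical [P(class 0), P(class 1)] outputs of three models for one customer
model_probs = [
    [0.90, 0.10],  # model 1 is very confident in class 0
    [0.40, 0.60],  # model 2 leans towards class 1
    [0.45, 0.55],  # model 3 leans slightly towards class 1
]

averaged, soft_prediction = soft_vote(model_probs)
hard_prediction = hard_vote(model_probs)
print('Averaged probabilities :', averaged)
print('Soft vote :', soft_prediction, '| Hard vote :', hard_prediction)
```

In this made-up case the two schemes disagree: the single confident model outweighs the two lukewarm ones under soft voting, which is why soft voting needs every base estimator to expose `predict_proba`.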

#### Function to run model training

In [None]:
def runModelTrainFunctions():
    
    model_knn, prediction_knn = trainModel_knn()
    model_svc, prediction_svc = trainModel_svc()
    model_rf, prediction_rf = trainModel_rf()
    model_lr, prediction_lr = trainModel_lr()
    model_dt, prediction_dt = trainModel_dt()
    model_abc, prediction_abc = trainModel_abc()
    model_gbc, prediction_gbc = trainModel_gbc()
    model_vc, prediction_vc = trainModel_vc()
    
    return model_knn, model_svc, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc, prediction_knn, prediction_svc, prediction_lr, prediction_rf, prediction_dt, prediction_abc, prediction_gbc, prediction_vc


#### Calling models training function

In [None]:
model_knn, model_svc, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc, prediction_knn, prediction_svc, prediction_lr, prediction_rf, prediction_dt, prediction_abc, prediction_gbc,prediction_vc = runModelTrainFunctions()

#### The total processing time

In [None]:
print(f'Total processing time : {(endTime-mainStartTime)}')

In [None]:
df_ml_eval

## < Model Evaluation >

### Evaluation Type 1 - Confusion Matrix

#### Function to generate confusion matrix graphs

In [None]:
def generateConfusionMatrixGraphs_All(attempt):
    
   # global prediction_knn, prediction_rf, prediction_lr, prediction_dt, prediction_abc, prediction_gbc, prediction_vc

    generateConfusionMatrixGraph('KNN', y_test, prediction_knn, attempt)
    generateConfusionMatrixGraph('SVM', y_test, prediction_svc, attempt)
    generateConfusionMatrixGraph('Random Forest', y_test, prediction_rf, attempt)
    generateConfusionMatrixGraph('Logistic Regression', y_test, prediction_lr, attempt)
    generateConfusionMatrixGraph('Decision Tree', y_test, prediction_dt, attempt)
    generateConfusionMatrixGraph('Ada Boost', y_test, prediction_abc, attempt)
    generateConfusionMatrixGraph('Gradient Boosting Classifier', y_test, prediction_gbc, attempt)
    generateConfusionMatrixGraph('VC Classifier', y_test, prediction_vc, attempt)   

#### Generate Confusion Matrix graphs

In [None]:
if showViz == 1:
    generateConfusionMatrixGraphs_All(attempt) 

#### Function to create Confusion Matrix Arrays - Accuracy, Sensitivity, Specificity, Recall & F1_Score

In [None]:
# Confusion Matrix Arrays of each ML model

def confusionMatrixArrays():   
        
        # K-Nearest Neighbors
        cm_knn = generateConfusionMatrixArray(prediction_knn)

        # Support vector machine
        cm_svc = generateConfusionMatrixArray(prediction_svc)
        
        # Random forest
        cm_rf = generateConfusionMatrixArray(prediction_rf)
        
        # Logistic regression
        cm_lr = generateConfusionMatrixArray(prediction_lr)

        # Decision tree
        cm_dt = generateConfusionMatrixArray(prediction_dt)

        # Ada boost
        cm_abc = generateConfusionMatrixArray(prediction_abc)

        # Gradient boosting
        cm_gbc = generateConfusionMatrixArray(prediction_gbc)

        # Voting
        cm_vc = generateConfusionMatrixArray(prediction_vc)

        return (cm_knn, cm_svc, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc)

#### Call function to create Confusion Matrix Arrays - Accuracy, Sensitivity, Specificity, Recall & F1_Score

In [None]:
cm_knn, cm_svc, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc = confusionMatrixArrays()

#### Function to append the confusion matrix counts (TP, TN, FP, FN) to the 'df_ml_eval' DataFrame

In [None]:
def appendConfusionMatrixResults():
    
    df_ml_eval['TP'] = [cm_knn[0][0], cm_svc[0][0], cm_rf[0][0], cm_lr[0][0], cm_dt[0][0], cm_abc[0][0], cm_gbc[0][0], cm_vc[0][0]]
    df_ml_eval['TN'] = [cm_knn[1][1], cm_svc[1][1], cm_rf[1][1], cm_lr[1][1], cm_dt[1][1], cm_abc[1][1], cm_gbc[1][1], cm_vc[1][1]]
    df_ml_eval['FP'] = [cm_knn[1][0], cm_svc[1][0], cm_rf[1][0], cm_lr[1][0], cm_dt[1][0], cm_abc[1][0], cm_gbc[1][0], cm_vc[1][0]]
    df_ml_eval['FN'] = [cm_knn[0][1], cm_svc[0][1], cm_rf[0][1], cm_lr[0][1], cm_dt[0][1], cm_abc[0][1], cm_gbc[0][1], cm_vc[0][1]]

#### Call function to append the confusion matrix counts (TP, TN, FP, FN) to the 'df_ml_eval' DataFrame

In [None]:
appendConfusionMatrixResults()

In [None]:
df_ml_eval

#### Function to calculate accuracy, sensitivity, specificity, recall & F1_score

In [None]:
def calculateAccSensiSpeciRecallF1Score():

    # Accuracy: correct predictions (TP + TN) divided by all predictions
    df_ml_eval['ACCURACY'] = round(((df_ml_eval['TP']+df_ml_eval['TN'])/(df_ml_eval['TP'] + df_ml_eval['TN'] + df_ml_eval['FP']+df_ml_eval['FN'])*100), 2)

    # Sensitivity (recall): true positives divided by all actual positives (TP + FN)
    df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED'] = round((df_ml_eval['TP']/(df_ml_eval['TP']+df_ml_eval['FN']))*100, 2)

    # Specificity: true negatives divided by all actual negatives (TN + FP)
    df_ml_eval['SPECIFICITY-CHURN_YES_IDENTIFIED'] = round((df_ml_eval['TN']/(df_ml_eval['TN']+df_ml_eval['FP']))*100, 2)
    
    # Precision: true positives divided by all predicted positives (TP + FP)
    df_ml_eval['PRECISION'] = round(df_ml_eval['TP'] / (df_ml_eval['TP'] + df_ml_eval['FP']) * 100, 2)

    # F1-score: harmonic mean of precision and recall (sensitivity)
    df_ml_eval['F1_SCORE'] = round(((df_ml_eval['PRECISION'] * df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED']) / (df_ml_eval['PRECISION'] + df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED']))*2, 2)


#### Call function for calculating accuracy, sensitivity, specificity, recall & F1_score

In [None]:
calculateAccSensiSpeciRecallF1Score()

df_ml_eval
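
As a sanity check on the columns above, the standard definitions can be worked through by hand for a single model. The following is a minimal sketch in plain Python, assuming hypothetical counts; `metrics_from_counts` is an illustrative helper, not a function from this notebook. As in the notebook, the "positive" class is whichever class the TP count refers to.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return percentage metrics computed from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity: share of actual positives found
    specificity = tn / (tn + fp)   # share of actual negatives found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        'accuracy': round(accuracy * 100, 2),
        'recall': round(recall * 100, 2),
        'specificity': round(specificity * 100, 2),
        'precision': round(precision * 100, 2),
        'f1': round(f1 * 100, 2),
    }

# Hypothetical counts for one model
example = metrics_from_counts(tp=80, tn=50, fp=20, fn=10)
print(example)
```

Note that the F1-score depends only on precision and recall, so it does not change when the number of true negatives changes.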

#### Function for plotting accuracy, sensitivity, specificity, recall, F1_score & time

In [None]:
def modelComparisionPlot(fName):

    fig, ((ax0, ax1), (ax2, ax3), (ax4, ax5)) = plt.subplots(ncols=2, nrows=3, figsize=(25, 13), sharex=True)
    fig.tight_layout(pad=5.0)

    bar_container0 = ax0.bar(df_ml_eval['ALGORITHM'], df_ml_eval['ACCURACY'])
    ax0.bar_label(bar_container0)
    ax0.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['ACCURACY'])
    ax0.plot(df_ml_eval['ALGORITHM'], df_ml_eval['ACCURACY'])
    ax0.set(title='Accuracy Graph');
    ax0.grid(True)

    bar_container1 = ax1.bar(df_ml_eval['ALGORITHM'], df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED'])
    ax1.bar_label(bar_container1)
    ax1.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED'])
    ax1.plot(df_ml_eval['ALGORITHM'], df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED'])
    ax1.set(title='Sensitivity Graph - Churn NO Identified' )
    ax1.grid(True)

    bar_container2 = ax2.bar(df_ml_eval['ALGORITHM'], df_ml_eval['SPECIFICITY-CHURN_YES_IDENTIFIED'])
    ax2.bar_label(bar_container2)
    ax2.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['SPECIFICITY-CHURN_YES_IDENTIFIED'])
    ax2.plot(df_ml_eval['ALGORITHM'], df_ml_eval['SPECIFICITY-CHURN_YES_IDENTIFIED'])
    ax2.set(title='Specificity Graph - Churn YES Identified')
    ax2.grid(True)

    bar_container3 = ax3.bar(df_ml_eval['ALGORITHM'], df_ml_eval['PRECISION'])
    ax3.bar_label(bar_container3)
    ax3.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['PRECISION'])
    ax3.plot(df_ml_eval['ALGORITHM'], df_ml_eval['PRECISION'])
    ax3.set(title='Precision Graph')
    ax3.grid(True)

    bar_container4 = ax4.bar(df_ml_eval['ALGORITHM'], df_ml_eval['F1_SCORE'])
    ax4.bar_label(bar_container4)
    ax4.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['F1_SCORE'])
    ax4.plot(df_ml_eval['ALGORITHM'], df_ml_eval['F1_SCORE'])
    ax4.set(title='F1-Score Graph')
    ax4.grid(True)

    bar_container5 = ax5.bar(df_ml_eval['ALGORITHM'], df_ml_eval['TIME'])
    #ax5.bar_label(bar_container5)
    ax5.set(title='Processed Time', xlabel='', ylabel='Time (seconds)')
    ax5.grid(True)
    plt.xticks(rotation=70);

    fig.suptitle(fName, fontsize = 16, weight = 'extra bold', y=1)
    plt.savefig(f'{fName}.png', dpi=100)

#### Call function for plotting accuracy, sensitivity, specificity, recall, F1_score & time

In [None]:
modelComparisionPlot('Bar Graphs - Before Columns Removed')

In [None]:
df_ml_eval['CUMULATIVE'] = (df_ml_eval['ACCURACY']+df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED']+
                                      df_ml_eval['SPECIFICITY-CHURN_YES_IDENTIFIED']+df_ml_eval['PRECISION']+
                                      df_ml_eval['F1_SCORE'])/5

In [None]:
df_ml_eval

In [None]:

def cumulativeModelAccuracyGraph():

    fig, ax = plt.subplots(figsize=(15,8))
    bar_container = ax.bar(df_ml_eval['ALGORITHM'], df_ml_eval['CUMULATIVE'])
    ax.bar_label(bar_container)
    ax.scatter(df_ml_eval['ALGORITHM'], df_ml_eval['CUMULATIVE'])
    ax.plot(df_ml_eval['ALGORITHM'], df_ml_eval['CUMULATIVE'])
    ax.set(title='Cumulative Graph')
    ax.grid(True)

In [None]:
cumulativeModelAccuracyGraph()

#### Function to create heatmap

In [None]:
def fnGenerateHeatMap():
    
    plt.rcParams["figure.figsize"] = (30,20)
    sns.heatmap(df[:].corr(),annot = True,fmt='.1g',linecolor='white',cmap="YlGnBu",linewidths=.5)
    plt.title("Heatmap",fontsize= 18)
    plt.show()

In [None]:
fnGenerateHeatMap()

### Evaluation Type 2 : Receiver Operating Characteristic (ROC)

#### Function to generate ROC graphs

In [None]:
from sklearn.metrics import roc_curve

def generateROCgraphs(fName):

    fig, ax = plt.subplots(ncols=2, nrows=4, figsize=(10, 12))
    fig.tight_layout(pad=5.0)

    models = [model_knn, model_svc, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc]
    modelN = ['KNN', 'SVM', 'Random Forest', 'Logistic Regression', 'Decision Tree', 'Ada Boost', 'Gradient Boosting', 'VC']

    j = 0
    k = 0
    
    colors = ['red','green','blue','yellow', 'brown', 'black', 'orange', 'pink']

    for i in range(len(models)):

        y_pred_prob = models[i].predict_proba(X_test)[:,1]
        fpr_rf, tpr_rf, thresholds = roc_curve(y_test, y_pred_prob)

        ax[j][k].plot([0, 1], [0, 1], 'k--' )
        ax[j][k].plot(fpr_rf, tpr_rf, label=modelN[i],color = "r")
        ax[j][k].set(title=f'{modelN[i]} ROC Curve', xlabel='False Positive Rate', ylabel='True Positive Rate')

        if k == 1:
            j = j + 1
            k = 0
            continue

        k = k + 1

    
    fig.suptitle(fName, fontsize = 16, weight = 'extra bold', y=0.9)
    plt.savefig(f'{fName}.png', dpi=100)
 
    fig.tight_layout(pad=5.0)

    l = 0    

    fig, ax = plt.subplots(figsize=(15, 10))
    
    for i in range(len(models)):

        y_pred_prob = models[i].predict_proba(X_test)[:,1]
        fpr_rf, tpr_rf, thresholds = roc_curve(y_test, y_pred_prob)

        ax.plot([0, 1], [0, 1], 'k--' )
        ax.plot(fpr_rf, tpr_rf, label=modelN[i],color = colors[l])
        ax.set(xlabel='False Positive Rate', ylabel='True Positive Rate')       
        ax.legend()

        l = l + 1
    
    fig.suptitle('All ROC Curves', fontsize = 16, weight = 'extra bold', y=1)
    # Save under a separate name so the per-model grid saved earlier is not overwritten
    plt.savefig(f'{fName} - All.png', dpi=100)

#### Call function to generate ROC graphs

In [None]:
generateROCgraphs('ROC Curves')
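
The ROC curves can also be summarized by a single number, the area under the curve (ROC AUC). One way to read it: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python on made-up labels and scores; `roc_auc` is an illustrative helper, and in practice `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print('ROC AUC :', auc)
```

A value of 0.5 corresponds to the dashed diagonal in the plots (random ranking), and 1.0 to a perfect ranking.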

#### Selecting the most important features

In [None]:
# Calculate feature importances
feature_importances = model_abc.feature_importances_

# Sort and rank features by importance
sorted_indices = np.argsort(feature_importances)[::-1]

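# Columns with zero importance, to be removed from the dataset
colNamesToBeRemoved = []

# Collect and print the features whose importance is exactly zero
# NOTE: this lookup assumes df.columns is in the same order as the features the model was trained on
for idx in sorted_indices:
    if feature_importances[idx] == 0.0:
        colNamesToBeRemoved.append(df.columns[idx])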
        print(f"Feature: {df.columns[idx]}, Importance: {feature_importances[idx]}")
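
The selection step pairs each importance value with a feature name by position, so the names must come from the same column order the model was fitted on. The sketch below shows the same idea in plain Python with hypothetical feature names and importance values; `zero_importance_features` is an illustrative helper, not part of the notebook.

```python
def zero_importance_features(feature_names, importances):
    """Rank features by importance and list those with zero importance."""
    if len(feature_names) != len(importances):
        raise ValueError('feature names and importances must have the same length')
    ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
    to_remove = [name for name, importance in ranked if importance == 0.0]
    return ranked, to_remove

# Hypothetical training columns, in the order used to fit the model
names = ['tenure', 'monthly_charges', 'partner', 'gender']
importances = [0.45, 0.55, 0.0, 0.0]

ranked, to_remove = zero_importance_features(names, importances)
for name, importance in ranked:
    print(f'Feature: {name}, Importance: {importance}')
print('Columns to remove :', to_remove)
```

The length check is a cheap guard against the most common mismatch, where the name list still contains the target column.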

#### Based on the feature importances, the columns below can be removed from the dataset before building the models a second time.

In [None]:
colNamesToBeRemoved

In [None]:
df.drop(colNamesToBeRemoved, axis=1, inplace=True)

In [None]:
fnGenerateHeatMap()

#### Dropping columns that have low correlation

In [None]:
# colNamesToBeRemoved.clear()
# colNamesToBeRemoved = ['CallForwardingCalls', 'RVOwner', 'ReferralsMadeBySubscriber', 'MadeCallToRetentionTeam',
#                        'PrizmCode_Rural', 'Occupation_Homemaker', 'MaritalStatus_No']

# df.drop(colNamesToBeRemoved, axis=1, inplace=True)

In [None]:
df.columns

# ===========================================================

# Retrain the models after removing some features : Analyzed

#### View final profiling report

In [None]:
if showViz == 1:
    runProfilingReport('Final Data Analysis')

#### Machine learning results dataframe re-initializing

In [None]:
df_ml_eval.drop(df_ml_eval.index, inplace=True)

In [None]:
df_ml_eval = pd.DataFrame({'ATTEMPT':[], 'ALGORITHM':[], 'TIME':[]})

#### Re-initializing the X_train, X_test, y_train, y_test objects

In [None]:
X_train = X_test = df.drop(df.index)
y_train = y_test = df.drop(df.index)

#### Splitting the new dataframe

In [None]:
splitDataset(df)
replaceYesNo()

#### Calling models training function

In [None]:
model_knn, model_svc, model_rf, model_lr, model_dt, model_abc, model_gbc, model_vc, prediction_knn, prediction_svc, prediction_lr, prediction_rf, prediction_dt, prediction_abc, prediction_gbc, prediction_vc = runModelTrainFunctions()


#### Generate Confusion Matrix graphs

In [None]:
generateConfusionMatrixGraphs_All(attempt)

#### Call function to create Confusion Matrix Arrays - Accuracy, Sensitivity, Specificity, Recall & F1_Score

In [None]:
cm_knn, cm_svc, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc = confusionMatrixArrays()

In [None]:
cm_knn, cm_svc, cm_rf, cm_lr, cm_dt, cm_abc, cm_gbc, cm_vc

#### Call function to append the confusion matrix counts (TP, TN, FP, FN) to the 'df_ml_eval' DataFrame

In [None]:
appendConfusionMatrixResults()

#### Call function for calculating accuracy, sensitivity, specificity, recall & F1_score

In [None]:
calculateAccSensiSpeciRecallF1Score()

#### Call function for plotting accuracy, sensitivity, specificity, recall, F1_score & time

In [None]:
modelComparisionPlot('Bar Graph - After Columns Removed')

In [None]:
df_ml_eval['CUMULATIVE'] = (df_ml_eval['ACCURACY']+df_ml_eval['SENSITIVITY-RECALL-CHURN_NO_IDENTIFIED']+df_ml_eval['SPECIFICITY-CHURN_YES_IDENTIFIED']+df_ml_eval['PRECISION']+df_ml_eval['F1_SCORE'])/5
df_ml_eval

In [None]:
cumulativeModelAccuracyGraph()

#### Call function to generate ROC graphs

In [None]:
generateROCgraphs('ROC Curves - After Columns Removed')

# Analyzing the correlations of the final dataset

In [None]:
# List of categoricals
categoricals = list()
for x in df.columns:
    if df[x].dtype == 'object':
        categoricals.append(x)
df[categoricals].nunique()

### Pearson Correlation Coefficient

In [None]:
numericals = [x for x in df.columns if x not in categoricals]

plt.figure(figsize=(15,8))
df[numericals].corr(method='pearson')['Churn'].sort_values(ascending = False).plot(kind='bar');
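
For readers who want to see what `corr(method='pearson')` computes for each column against `Churn`, the coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch in plain Python follows, using made-up values; `pearson` is an illustrative helper, not part of the notebook.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)

# Hypothetical feature values and churn flags (1 = churned) for five customers
tenure = [1, 2, 3, 4, 5]
churn = [1, 1, 0, 1, 0]

r = pearson(tenure, churn)
print(f'Pearson r : {r:.4f}')
```

The result lies between -1 and 1. A negative value such as the one here means that, in this toy data, longer tenure goes together with less churn.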

# *****************************************************************************************

## < Finalizing The Most Suitable Model >

## < Model Deployment >

### Importing the 'joblib' library

In [None]:
import joblib

### Save the model as a file

The exported trained model can be used to predict customer churn in the telecom organization. Comparing the evaluation results above, 
the 'Ada Boost Classifier' model performed well on this dataset. Therefore, 'model_abc' is exported as the selected model.

In [None]:
joblib.dump(model_abc, 'model.pkl')

# *****************************************************************************************

# Validation : Predicting the Churns of a Given Dataset Using the Deployed Model

### Load the trained model

In [None]:
model = joblib.load('model.pkl')

### Function to predict

In [None]:
def fnPredict():

    global dfHoldout

    # Drop the 'Churn Value' column from the dataset
    dfHoldout.drop('Churn Value', axis=1, inplace=True)

    # Drop 'City' column
    dfHoldout.drop('City', axis=1, inplace=True)
    
    # Apply the custom functions to prepare the dataset for the deployed ML model
    
    # Encoding
    dfHoldout = labelEncoding(dfHoldout)
    dfHoldout = oneHotEncoding(dfHoldout)
    
    # Drop Low correlation features
    dfHoldout.drop(['Churn Label','CustomerID','Country','State','Lat Long','Gender','Phone Service','Total Charges',
                    'Churn Reason','Count','Zip Code','Latitude','Longitude', 'Churn Score'], axis=1, inplace=True)
    
    # Low importance features
    dfHoldout.drop(['Partner',  'Dependents',  'Payment Method_Electronic check', 'Online Backup', 'Device Protection',
                    'Internet Service_Fiber optic', 'Internet Service_No', 'Payment Method_Bank transfer (automatic)', 
                    'Senior Citizen'], axis=1, inplace=True)
    # Scaling
    minMaxScale(dfHoldout)

    # Make predictions
    predictions = model.predict(dfHoldout)

    # Store the predictions in a new column 'Churn'
    dfHoldout['Churn'] = predictions

### Read the data to be predicted

In [None]:
# The dataset to be predicted contains 502 records
dfHoldout = pd.read_csv('CustomerChurnDataset_Holdout.csv')

In [None]:
df.columns

In [None]:
dfHoldout.columns

### Call the 'fnPredict' function and assign the new dataset with the new 'Churn' column

In [None]:
fnPredict()

In [None]:
df.columns

In [None]:
dfHoldout.columns

### View the predicted churn column of the dataset

In [None]:
dfHoldout

In [None]:
print(f'Total time : {(datetime.now()-mainStartTime)}')

# END OF THE CODE