# Problem Statement

Customer churn is a pressing problem for telecom companies. Almost every telecom company pays a premium to bring a customer on board, so customer churn directly impacts the company's revenue.
  
In this case study, we simulate one such case of customer churn, working with data on post-paid customers with a contract. The data has information about customer usage behaviour, contract details, and payment details. It also indicates which customers cancelled their service.  
  
Based on this past data, perform an EDA and build a model that can predict whether a customer will cancel their service in the future.

# Data Dictionary

* <b>Churn</b> - 1 if customer cancelled service, 0 if not
* <b>AccountWeeks</b> - number of weeks customer has had active account
* <b>ContractRenewal</b> - 1 if customer recently renewed contract, 0 if not
* <b>DataPlan</b> - 1 if customer has data plan, 0 if not
* <b>DataUsage</b> - gigabytes of monthly data usage
* <b>CustServCalls</b> - number of calls into customer service
* <b>DayMins</b> - average daytime minutes per month
* <b>DayCalls</b> - average number of daytime calls
* <b>MonthlyCharge</b> - average monthly bill
* <b>OverageFee</b> - largest overage fee in last 12 months
* <b>RoamMins</b> - average number of roaming minutes


In [None]:
#Import all necessary modules
import pandas as pd  
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.preprocessing import scale

In [None]:
cell_df = pd.read_excel("Cellphone.xlsx")

## EDA

In [None]:
cell_df.head()

In [None]:
cell_df.info()

There are missing values in some columns.  
All variables are of numeric type and do not contain any data inconsistencies (such as special characters in the data, which would cause numeric variables to be read as object).  
Churn is the target variable.   
Churn, ContractRenewal and DataPlan are binary variables.

In [None]:
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']].describe()

### Check for Missing values

In [None]:
cell_df.isnull().sum()

### Imputing missing values

Since ContractRenewal and DataPlan are binary, we cannot substitute mean values for these two variables. We will impute them with their respective modal values.

In [None]:
# Compute the mode for the 'ContractRenewal' and 'DataPlan' columns and impute the missing values
cols = ['ContractRenewal','DataPlan']
for column in cols:
    
    
    
cell_df.isnull().sum()
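
One possible way to do the modal imputation is sketched below. The small `toy_df` frame and its values are made up for illustration and only stand in for `cell_df`; the same loop body should apply to the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cell_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    'ContractRenewal': [1, 1, np.nan, 0, 1],
    'DataPlan': [0, np.nan, 0, 1, 0],
})

cols = ['ContractRenewal', 'DataPlan']
for column in cols:
    # mode() returns a Series (there can be ties); take its first entry
    mode_value = toy_df[column].mode()[0]
    toy_df[column] = toy_df[column].fillna(mode_value)

print(toy_df.isnull().sum())
```

Assigning the result back to the column avoids relying on in-place modification of a column selection.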

Now let us impute the remaining continuous variables with the median. For that we are going to use the `SimpleImputer` class from the `sklearn.impute` module.

In [None]:
from sklearn.impute import SimpleImputer
SI = SimpleImputer(strategy='median')

In [None]:
#Now we need to fit and transform our respective data set to fill the missing values with the corresponding 'median' values
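
A minimal sketch of the fit-and-transform step, assuming a toy frame with made-up values in place of `cell_df`. Note that `fit_transform` returns a NumPy array, so the sketch wraps the result back into a DataFrame to keep the column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the continuous columns of cell_df (hypothetical values)
toy_df = pd.DataFrame({
    'DayMins': [100.0, np.nan, 300.0, 200.0],
    'RoamMins': [10.0, 12.0, np.nan, 8.0],
})

SI = SimpleImputer(strategy='median')

# Learn each column's median and fill the gaps in one call
toy_imputed = pd.DataFrame(SI.fit_transform(toy_df),
                           columns=toy_df.columns,
                           index=toy_df.index)

print(toy_imputed)
```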



### Checking for Duplicates

In [None]:
# Are there any duplicates ?
dups = cell_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

### Proportion in the Target classes

In [None]:
cell_df.Churn.value_counts(normalize=True)

### Distribution Check using Histograms

In [None]:
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']].hist();

### Outlier Check using boxplots

In [None]:
cols=['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins'];

for i in cols:
    sns.boxplot(x=cell_df[i])
    plt.show()


### Bi-Variate Analysis with Target variable

<b>Account Weeks and Churn</b>

In [None]:
sns.boxplot(x='Churn',y='AccountWeeks',data=cell_df)

<b>Data Usage against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='DataUsage',data=cell_df)

<b>DayMins against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='DayMins',data=cell_df)

<b>DayCalls against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='DayCalls',data=cell_df)

<b>MonthlyCharge against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='MonthlyCharge',data=cell_df)

<b>OverageFee against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='OverageFee',data=cell_df)

<b>RoamMins against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='RoamMins',data=cell_df)

<b>CustServCalls against Churn</b>

In [None]:
sns.boxplot(x='Churn',y='CustServCalls',data=cell_df)

<b>Contract Renewal against Churn</b>

In [None]:
sns.countplot(x='ContractRenewal',hue='Churn',data=cell_df)

<b>Data Plan against Churn</b>

In [None]:
sns.countplot(x='DataPlan',hue='Churn',data=cell_df)

### Train (70%) - Test (30%) Split

In [None]:
# Creating a copy of the original data frame
df = cell_df.copy()

In [None]:
df.head()

In [None]:
X = df.drop('Churn',axis=1)
Y = df.pop('Churn')

In [None]:
#Train Test Split

''' <<<< Replace Code Here >>>>>  '''
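
A sketch of a 70/30 split with `train_test_split`. The names `toy_X` and `toy_Y` and their random contents are hypothetical stand-ins for `X` and `Y`; `random_state=1` and `stratify` are assumptions added here for reproducibility and to preserve the churn proportion in both sets, not settings taken from the case study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X and Y: 100 rows, 20% churners
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(100, 3)),
                     columns=['DayMins', 'OverageFee', 'CustServCalls'])
toy_Y = pd.Series([0] * 80 + [1] * 20, name='Churn')

# test_size=0.30 keeps 70% of the rows for training
X_train, X_test, Y_train, Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.30, random_state=1, stratify=toy_Y)

print(X_train.shape, X_test.shape)
```

Stratifying on the target is a common choice when the classes are imbalanced, as churn data usually is.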

In [None]:
print('Number of rows and columns of the training set for the independent variables:',X_train.shape)
print('Number of rows and columns of the training set for the dependent variable:',Y_train.shape)
print('Number of rows and columns of the test set for the independent variables:',X_test.shape)
print('Number of rows and columns of the test set for the dependent variable:',Y_test.shape)

### LDA Model

In [None]:
#Build LDA Model and fit the data

''' <<<< Replace Code Here >>>>>  '''
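
A minimal sketch of building and fitting the LDA model. Since the real file is not loaded here, the data comes from `make_classification`, which is purely an assumption for illustration; with the notebook's own `X_train` and `Y_train` only the last two statements would be needed.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)

# Build the LDA classifier with default settings and fit it on the training set
clf = LinearDiscriminantAnalysis()
model = clf.fit(X_train, Y_train)

print(model.classes_)
```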



### Prediction

In [None]:
# Training Data Class Prediction with a cut-off value of 0.5 as default
''' <<<< Replace Code Here >>>>>  '''
# Test Data Class Prediction with a cut-off value of 0.5 as default
''' <<<< Replace Code Here >>>>>  '''
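
A sketch of class prediction on both sets. For a two-class problem, `predict` picks the class with the higher posterior probability, which corresponds to the default 0.5 cut-off. The synthetic data is again an assumption standing in for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)

# Class labels (0 or 1) for the training and test data
pred_class_train = model.predict(X_train)
pred_class_test = model.predict(X_test)

print(pred_class_train[:10])
```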

### Training Data and Test Data Confusion Matrix Comparison

In [None]:
#Plotting the confusion matrix for the Training Data

''' <<<< Replace Code Here >>>>>  '''
#Plotting the confusion matrix for the Test Data

''' <<<< Replace Code Here >>>>>  '''
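
A sketch of computing the two confusion matrices with `metrics.confusion_matrix`, on synthetic data assumed here in place of the churn data. The matrices are printed rather than plotted to keep the example short; rows are the actual classes and columns the predicted classes.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)

# Actual labels first, predicted labels second
cm_train = metrics.confusion_matrix(Y_train, model.predict(X_train))
cm_test = metrics.confusion_matrix(Y_test, model.predict(X_test))

print('Training Data\n', cm_train)
print('Test Data\n', cm_test)
```

Either matrix can be passed to `sns.heatmap(..., annot=True, fmt='d')` to draw it as a plot.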

### Training Data and Test Data Classification Report Comparison

In [None]:
''' <<<< Replace Code Here >>>>>  '''
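
A sketch of the classification reports for both sets using `metrics.classification_report`. The data is synthetic and only assumed for illustration, so the printed numbers say nothing about the real churn data.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)

# Precision, recall and F1 per class, for each data set
report_train = metrics.classification_report(Y_train, model.predict(X_train))
report_test = metrics.classification_report(Y_test, model.predict(X_test))

print('Classification Report of the training data:\n', report_train)
print('Classification Report of the test data:\n', report_test)
```

With an imbalanced target, the recall of class 1 (the churners) is usually the figure to watch, since overall accuracy can look good even when few churners are caught.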

### Probability prediction for the training and test data

In [None]:
# Training Data Probability Prediction
''' <<<< Replace Code Here >>>>>  '''

# Test Data Probability Prediction
''' <<<< Replace Code Here >>>>>  '''
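
A sketch of the probability prediction with `predict_proba`, on synthetic data assumed in place of the churn data. The columns of the result follow `model.classes_`, so column 1 holds the predicted probability of class 1 (churn).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)

# One row per observation, one column per class
pred_prob_train = model.predict_proba(X_train)
pred_prob_test = model.predict_proba(X_test)

print(np.round(pred_prob_train[:5], 3))
```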

### AUC-ROC 

In [None]:
# AUC and ROC for the training data

# calculate AUC
''' <<<< Replace Code Here >>>>>  '''
print('AUC for the Training Data: %.3f' % auc)

#  calculate roc curve
''' <<<< Replace Code Here >>>>>  '''
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label = 'Training Data')


# AUC and ROC for the test data

# calculate AUC
''' <<<< Replace Code Here >>>>>  '''
print('AUC for the Test Data: %.3f' % auc)

#  calculate roc curve
''' <<<< Replace Code Here >>>>>  '''
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label='Test Data')
# show the plot
plt.legend(loc='best')
plt.show()
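
A sketch of the AUC and ROC calculations that the plotting code relies on, using `metrics.roc_auc_score` and `metrics.roc_curve` on synthetic data (an assumption for illustration). The names `auc_train` and `auc_test` are used here only to keep the two results apart; plotting is left out to keep the example short.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)

# Both functions take the probability of the positive class (column 1)
prob_train = model.predict_proba(X_train)[:, 1]
prob_test = model.predict_proba(X_test)[:, 1]

auc_train = metrics.roc_auc_score(Y_train, prob_train)
auc_test = metrics.roc_auc_score(Y_test, prob_test)

# roc_curve returns false positive rates, true positive rates and thresholds
fpr, tpr, thresholds = metrics.roc_curve(Y_train, prob_train)

print('AUC for the Training Data: %.3f' % auc_train)
print('AUC for the Test Data: %.3f' % auc_test)
```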

### How to change the cut-off values for maximum accuracy?

In [None]:
for j in np.arange(0.1,1,0.1):
    custom_prob = j #defining the cut-off value of our choice
    custom_cutoff_data=[]#defining an empty list
    for i in range(0,len(Y_train)):#defining a loop for the length of the training data
        #issuing a condition for our probability values to be 
            #greater than the custom cutoff value
            #if the probability values are greater than the custom cutoff then the value should be 1
               ''' <<<< Replace Code Here >>>>>  '''
            #if the probability values are less than the custom cutoff then the value should be 0
        #adding either 1 or 0 based on the condition to the end of the list defined by us
    
    print(round(j,3),'\n')
    print('Accuracy Score',''' <<<< Replace Code Here >>>>>  ''')
    print('F1 Score',''' <<<< Replace Code Here >>>>>  ''')
    plt.figure(figsize=(6,4))
    print('Confusion Matrix')
    sns.heatmap(metrics.confusion_matrix(Y_train,custom_cutoff_data),annot=True,fmt='.4g')
    plt.show();
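
A sketch of the cut-off search, assuming synthetic data in place of the churn data and collecting the scores in a hypothetical `results` list instead of plotting a heatmap per cut-off. `zero_division=0` is an assumption added so that cut-offs producing no positive predictions report an F1 score of 0 without a warning.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)
pred_prob_train = model.predict_proba(X_train)

results = []
for j in np.arange(0.1, 1, 0.1):
    custom_cutoff_data = []
    for i in range(0, len(Y_train)):
        # 1 if the churn probability exceeds the cut-off, otherwise 0
        if pred_prob_train[i, 1] > j:
            a = 1
        else:
            a = 0
        custom_cutoff_data.append(a)
    acc = metrics.accuracy_score(Y_train, custom_cutoff_data)
    f1 = metrics.f1_score(Y_train, custom_cutoff_data, zero_division=0)
    results.append((round(j, 3), acc, f1))

for cutoff, acc, f1 in results:
    print(cutoff, 'Accuracy Score', round(acc, 4), 'F1 Score', round(f1, 4))
```

The inner loop mirrors the step-by-step structure of the exercise; `(pred_prob_train[:, 1] > j).astype(int)` would give the same labels in a single vectorized line.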

In [None]:
#Predicting the classes on the test data

data_pred_custom_cutoff=[]
for i in range(0,len(pred_prob_test[:,1])):
    
        ''' <<<< Replace Code Here >>>>>  '''
        
    data_pred_custom_cutoff.append(a)

In [None]:
#Build Confusion Matrix
''' <<<< Replace Code Here >>>>>  '''

In [None]:
#Comparing the Classification reports (default cut off) vs (custom cut-off)
''' <<<< Replace Code Here >>>>>  '''
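
A sketch of the last three steps on the test data: applying a custom cut-off, building the confusion matrix, and comparing the classification reports. The data is synthetic, and `chosen_cutoff = 0.4` is a hypothetical value picked only for illustration; in practice it would be whichever cut-off looked best in the search on the training data.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption, not the real file)
X, Y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
model = LinearDiscriminantAnalysis().fit(X_train, Y_train)
pred_class_test = model.predict(X_test)
pred_prob_test = model.predict_proba(X_test)

chosen_cutoff = 0.4  # hypothetical cut-off, for illustration only

data_pred_custom_cutoff = []
for i in range(0, len(pred_prob_test[:, 1])):
    if pred_prob_test[i, 1] > chosen_cutoff:
        a = 1
    else:
        a = 0
    data_pred_custom_cutoff.append(a)

print(metrics.confusion_matrix(Y_test, data_pred_custom_cutoff))
print('Default cut-off (0.5)\n',
      metrics.classification_report(Y_test, pred_class_test, zero_division=0))
print('Custom cut-off\n',
      metrics.classification_report(Y_test, data_pred_custom_cutoff, zero_division=0))
```

Lowering the cut-off below 0.5 can only keep or increase the number of customers flagged as churners, which typically trades precision for recall on class 1.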