# Problem Statement

Customer churn is a burning problem for telecom companies. Almost every telecom company pays a premium to bring a customer on board, so customer churn directly impacts the company's revenue.
  
In this case study, we simulate one such case of customer churn, working with data on post-paid customers who have a contract. The data covers customer usage behaviour, contract details, and payment details. It also indicates which customers cancelled their service.  
  
Based on this past data, perform an EDA and build a model that predicts whether a customer will cancel their service in the future.

# Data Dictionary

* <b>Churn</b> - 1 if customer cancelled service, 0 if not
* <b>AccountWeeks</b> - number of weeks customer has had active account
* <b>ContractRenewal</b> - 1 if customer recently renewed contract, 0 if not
* <b>DataPlan</b> - 1 if customer has data plan, 0 if not
* <b>DataUsage</b> - gigabytes of monthly data usage
* <b>CustServCalls</b> - number of calls into customer service
* <b>DayMins</b> - average daytime minutes per month
* <b>DayCalls</b> - average number of daytime calls
* <b>MonthlyCharge</b> - average monthly bill
* <b>OverageFee</b> - largest overage fee in last 12 months
* <b>RoamMins</b> - average number of roaming minutes


In [None]:
#Import all necessary modules
import pandas as pd  
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.preprocessing import scale

In [None]:
cell_df = pd.read_excel("Cellphone.xlsx")

## EDA

In [None]:
cell_df.head()

In [None]:
cell_df.info()

There are missing values in some columns.  
All variables are numeric and show no data inconsistencies (such as special characters in the data that would cause a numeric variable to be read as object).  
Churn is the target variable.   
Churn, ContractRenewal and DataPlan are binary variables.

In [None]:
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']].describe()

### Check for Missing values

In [None]:
cell_df.isnull().sum()

### Imputing missing values

Since ContractRenewal and DataPlan are binary, we cannot substitute mean values for these two variables. We will impute them with their respective modal values.

In [None]:
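# Compute the mode of the 'ContractRenewal' and 'DataPlan' columns and impute the missing values
cols = ['ContractRenewal','DataPlan']
for column in cols:
    mode_value = cell_df[column].mode()[0]   # most frequent value in the column
    cell_df[column] = cell_df[column].fillna(mode_value)

cell_df.isnull().sum()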

Now let us impute the remaining continuous variables with the median. For that we are going to use the SimpleImputer class from the sklearn.impute sub-module.

In [None]:
from sklearn.impute import SimpleImputer
SI = SimpleImputer(strategy='median')

In [None]:
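#Now we need to fit and transform our respective data set to fill the missing values with the corresponding 'median' values
num_cols = ['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']
cell_df[num_cols] = SI.fit_transform(cell_df[num_cols])
cell_df.isnull().sum()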

### Checking for Duplicates

In [None]:
# Are there any duplicates ?
dups = cell_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

### Proportion in the Target classes

In [None]:
cell_df.Churn.value_counts(normalize=True)

### Distribution Check using Histograms

In [None]:
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']].hist();

### Outlier Check using boxplots

In [None]:
cols=['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']

# One boxplot per continuous variable; the keyword form works across seaborn versions
for i in cols:
    sns.boxplot(x=cell_df[i])
    plt.show()
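
The points a boxplot draws beyond its whiskers follow the usual 1.5 × IQR rule, so the same check can be done numerically. The sketch below is illustrative only: the series `demo` and its values are made up, not taken from `cell_df`.

```python
import pandas as pd

# Made-up values standing in for one continuous column; 400 sits far from the rest
demo = pd.Series([120, 150, 160, 170, 180, 190, 200, 210, 400], name='DayMins')

# Quartiles and inter-quartile range
q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by a standard boxplot
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are the ones plotted as individual points
outliers = demo[(demo < lower) | (demo > upper)]
print(outliers.tolist())
```

Applying the same steps to each column in `cols` would give an outlier count per variable to read alongside the plots.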


### Bi-Variate Analysis with Target variable

<b>Account Weeks and Churn</b>

In [None]:
sns.boxplot(x='Churn', y='AccountWeeks', data=cell_df)

<b>Data Usage against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='DataUsage', data=cell_df)

<b>DayMins against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='DayMins', data=cell_df)

<b>DayCalls against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='DayCalls', data=cell_df)

<b>MonthlyCharge against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='MonthlyCharge', data=cell_df)

<b>OverageFee against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='OverageFee', data=cell_df)

<b>RoamMins against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='RoamMins', data=cell_df)

<b>CustServCalls against Churn</b>

In [None]:
sns.boxplot(x='Churn', y='CustServCalls', data=cell_df)

<b>Contract Renewal against Churn</b>

In [None]:
sns.countplot(x='ContractRenewal', hue='Churn', data=cell_df)

<b>Data Plan against Churn</b>

In [None]:
sns.countplot(x='DataPlan', hue='Churn', data=cell_df)

### Train (70%) - Test (30%) Split
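
A minimal sketch of a 70/30 split. It uses a small synthetic frame, `demo_df`, in place of `cell_df` so that it runs on its own. The column values, `random_state=1`, and the choice to stratify on `Churn` are assumptions made for illustration, not settings prescribed by this case study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cell_df (values are made up for illustration)
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({
    'AccountWeeks': rng.integers(1, 200, size=200),
    'DayMins': rng.normal(180, 50, size=200),
    'CustServCalls': rng.integers(0, 6, size=200),
    'Churn': np.repeat([0, 1], [170, 30]),  # imbalanced target, 15% churners
})

# Separate the predictors from the target
X = demo_df.drop('Churn', axis=1)
y = demo_df['Churn']

# 70% train, 30% test; stratify keeps the churn proportion similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With an imbalanced target, stratifying avoids a test set that happens to contain very few churners. Dropping `stratify=y` gives a plain random split.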

### LDA Model

In [None]:
#Build LDA Model and fit the data
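
One way this step could look, sketched on synthetic data from `make_classification` so that it runs without `Cellphone.xlsx`. The data, the default LDA settings, and `random_state=1` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the churn predictors and target
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit LDA with its default settings on the training part only
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# For a binary target there is one row of coefficients, one value per predictor
print(lda.coef_)
print(lda.intercept_)
```

On the real data, `X_train` and `y_train` would come from the split of `cell_df`, with `Churn` as the target.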





### Prediction and Evaluation on both Training and Test Set using Confusion Matrix, Classification Report and AUC-ROC.

In [None]:
# Predict it
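
A hedged sketch of the prediction step, again on synthetic data so that it is self-contained. It shows both outputs a fitted LDA model offers: hard class labels from `predict` and class probabilities from `predict_proba`. The data and `random_state=1` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the churn predictors and target
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Hard class labels (0 or 1) for both parts of the data
train_pred = lda.predict(X_train)
test_pred = lda.predict(X_test)

# Predicted probabilities; column 1 holds the probability of class 1
train_prob = lda.predict_proba(X_train)[:, 1]
test_prob = lda.predict_proba(X_test)[:, 1]

print(test_pred[:10])
print(test_prob[:10].round(3))
```

The labels feed the confusion matrix and classification report, while the probabilities are what an AUC-ROC calculation needs.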




In [None]:
# Evaluation
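
A minimal sketch of the evaluation on both the training and test parts, using the three measures named in the heading. The synthetic data, the `results` dictionary, and `random_state=1` are assumptions for illustration.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the churn predictors and target
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

results = {}
for name, X_part, y_part in [('Train', X_train, y_train), ('Test', X_test, y_test)]:
    pred = lda.predict(X_part)               # labels for the matrix and report
    prob = lda.predict_proba(X_part)[:, 1]   # class-1 probabilities for AUC

    cm = metrics.confusion_matrix(y_part, pred)
    auc = metrics.roc_auc_score(y_part, prob)
    results[name] = (cm, auc)

    print(name)
    print(cm)
    print(metrics.classification_report(y_part, pred))
    print('AUC: %.3f' % auc)
```

Comparing the training and test figures side by side is a quick check for overfitting: a large drop from train to test suggests the model does not generalise well.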



