## Telecom Churn Case Study
- The objective is to identify which of the high-value customers are likely to churn.
- The second objective is to identify the factors that drive churn among these customers.

Steps to be followed
- Derive new features from the available data to help identify churn.
- Identify the high-value customers in the provided dataset and remove all other records, leaving about 29.9k records. High-value customers are those whose average recharge in the first 2 months exceeds the 70th percentile.
- Remove the attributes of the churn month from the dataset.
- Preprocess data (convert columns to appropriate formats, handle missing values, etc.)
- Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering).
- Derive new features.
- Reduce the number of variables using PCA.
- Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques).
- Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners accurately than non-churners, so choose an evaluation metric that reflects this business goal.
- Finally, choose a model based on that evaluation metric.
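
A toy example shows why accuracy alone does not fit this goal. The sketch below uses made-up labels (not the telecom data): a model that never predicts churn still reaches 90% accuracy but has zero recall on churners.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10 churners (1) out of 100 customers
y_true = [1] * 10 + [0] * 90
# A useless model that predicts "no churn" for everyone
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)  # share of actual churners that were caught
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

Recall on the churn class is therefore the more informative score when missing a churner is the costly error.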



In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
#from google.colab import files
#files.upload()

In [None]:
#pip install -U pandas_profiling

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn import metrics
from sklearn.metrics import accuracy_score ,confusion_matrix ,f1_score ,recall_score,precision_score ,classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
#import pandas_profiling 

pd.set_option('display.max_columns', 999)


In [None]:
df_org = pd.read_csv("telecom_churn_data.csv",encoding='cp1252')
#df_org = pd.read_csv('/content/sample_data/telecom_churn_data.csv' ,encoding='cp1252')
df_org.info()

## 1 Data preparation
- Missing value treatment
- Manage the categorical date variables
- Create new features such as total recharge
- Identify the high-value customers
- Outliers Treatment

In [None]:
#Drop the columns - last_date_of_month_7,last_date_of_month_6,last_date_of_month_8,last_date_of_month_9 , circle_id
#as these have constant values which is not of much use in modeling
df_org.drop(['last_date_of_month_6','last_date_of_month_7','last_date_of_month_8','last_date_of_month_9','circle_id']
            ,inplace=True
           ,axis=1)


#Replace each recharge date with a flag: 1 if a recharge date is present, 0 otherwise
date_cols = ['date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8','date_of_last_rech_9',
             'date_of_last_rech_data_6','date_of_last_rech_data_7','date_of_last_rech_data_8','date_of_last_rech_data_9']
for c in date_cols:
    df_org[c] = df_org[c].notna().astype(int)

Convert the Date field to an integer since this will only store a 0 or 1 value.

In [None]:
#Convert the Date columns to Int
col_dict = {'date_of_last_rech_6': int,
            'date_of_last_rech_7': int, 
            'date_of_last_rech_8': int,
            'date_of_last_rech_9': int,
            'date_of_last_rech_data_6': int, 
            'date_of_last_rech_data_7': int,
            'date_of_last_rech_data_8': int,
            'date_of_last_rech_data_9': int
           } 

df_org = df_org.astype(col_dict) 


- The data shows that about 74% of customers have no information on data usage (around 40 columns). These customers likely do not use the data service. A few of these data-related columns are needed to identify the high-value customers; those are retained and the others are dropped.
- Some columns have more than 7 percent of their values missing, mostly features for month 9. This could be due to customers who have churned. The null values are imputed as 0. This makes little difference since these columns are dropped later.
- A few columns have around 5.378 percent missing data and are call features for month 8. These are imputed as 0. They could belong to customers who churn in months 8 and 9.
- Features such as 'Last Date of the month' can be dropped as they hold constant data, which is of little use for modelling.
- A few voice-call columns have between 1 and 4 percent missing data. These are also imputed as 0.
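
The drop-or-impute rule described above can be sketched on a toy frame. The column names and values here are hypothetical; only the 70% threshold and the fill value of 0 come from the notebook. Note that the notebook first imputes the few data columns it wants to keep, so that they fall below the threshold.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns and values
df = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],   # 80% missing
    "few_gaps":     [10.0, np.nan, 30.0, 40.0, 50.0],        # 20% missing
    "complete":     [1, 2, 3, 4, 5],
})

pct_missing = df.isnull().mean() * 100
to_drop = pct_missing[pct_missing > 70].index.tolist()

# Drop the sparse columns, then treat the remaining gaps as "no usage"
cleaned = df.drop(columns=to_drop).fillna(0)
print(pct_missing.round(1).to_dict())
print(cleaned)
```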


In [None]:
#Impute 0 in the columns which are needed for calculating the high value customers.
#Restrict the column list to the few features needed for that calculation.
column_name = ['av_rech_amt_data_6','av_rech_amt_data_7','av_rech_amt_data_8','total_rech_data_6','total_rech_data_7','total_rech_data_8','total_rech_data_9']
for col in column_name:
  #Assign the result back; an inplace fillna on a selected column may not update df_org in newer pandas versions
  df_org[col] = df_org[col].fillna(0)

In [None]:
#Compute the percentage of missing values per column and count the columns with more than 70% missing
col_missing = df_org.isnull().sum() * 100 / len(df_org)
df_missing = pd.DataFrame({'column_name': df_org.columns,
                                 'percent_missing': col_missing})

#sort_values returns a new frame, so keep the sorted result
df_missing = df_missing.sort_values('percent_missing', ascending=False)
print(len(df_missing[df_missing.percent_missing > 70]))


The following new features are created from the dataset:
- A new feature holds the total recharge for the customer for months 6 and 7. This is used to identify the high-value customers.
- It is assumed here that 'total_rech_amt_7' and 'total_rech_amt_6' do not include the recharge for data. There are some rows where 'total_rech_amt_6' is less than (average data recharge amount * total number of data recharges).
- Hence the total amount is calculated as `df_org.total_rech_amt_7 + df_org.total_rech_amt_6 + (df_org['av_rech_amt_data_6'] * df_org['total_rech_data_6']) + (df_org['av_rech_amt_data_7'] * df_org['total_rech_data_7'])`
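
As a quick worked example of this formula, the sketch below applies it to two hypothetical customers. The column names follow the dataset; the amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical values for two customers
df = pd.DataFrame({
    "total_rech_amt_6":   [300, 100],
    "total_rech_amt_7":   [200, 150],
    "av_rech_amt_data_6": [50.0, 0.0],
    "total_rech_data_6":  [2.0, 0.0],
    "av_rech_amt_data_7": [25.0, 0.0],
    "total_rech_data_7":  [4.0, 0.0],
})

# Data recharge per month = average data recharge amount * number of data recharges
data_6 = df["av_rech_amt_data_6"] * df["total_rech_data_6"]
data_7 = df["av_rech_amt_data_7"] * df["total_rech_data_7"]
df["Total_Rech_6_7"] = df["total_rech_amt_6"] + df["total_rech_amt_7"] + data_6 + data_7
print(df["Total_Rech_6_7"].tolist())
```

The first customer totals 300 + 200 + 100 + 100 = 700, while the second, who buys no data, totals 250.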

In [None]:
#Create some additional features
df_org['Total_Rech_6_7'] = df_org.total_rech_amt_7 + df_org.total_rech_amt_6 + \
                        (df_org['av_rech_amt_data_6'] * df_org['total_rech_data_6']) + \
                        (df_org['av_rech_amt_data_7'] * df_org['total_rech_data_7'])
df_org['TotalRech_data_6'] = (df_org['av_rech_amt_data_6'] * df_org['total_rech_data_6'])
df_org['TotalRech_data_7'] = (df_org['av_rech_amt_data_7'] * df_org['total_rech_data_7'])
df_org['TotalRech_data_8'] = (df_org['av_rech_amt_data_8'] * df_org['total_rech_data_8'])


In [None]:
#Drop the columns with more than 70% missing data.
#Most of these columns are empty because the customer has not taken the service.
for col in np.array(df_missing[df_missing.percent_missing > 70]['column_name']):
    df_org.drop(col,inplace=True,axis=1)

#Impute the remaining columns with 0
df_org.fillna(value = 0,inplace=True)

To identify the high-value customers, the total recharge for each customer is calculated for months 6 and 7 and saved in the new feature 'Total_Rech_6_7'. Any customer whose combined recharge for months 6 and 7 is above the 70th percentile of this feature is classified as a 'High Value customer'.
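
A small sketch of the percentile filter, using ten hypothetical recharge totals. `Series.quantile` interpolates linearly by default, and the strict `>` comparison mirrors the filter used in this notebook.

```python
import pandas as pd

# Hypothetical two-month recharge totals for ten customers
totals = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
                   name="Total_Rech_6_7")

cutoff = totals.quantile(0.7)          # linear interpolation by default
high_value = totals[totals > cutoff]   # strictly above the cutoff
print(cutoff, high_value.tolist())
```

With a strict comparison, customers sitting exactly on the cutoff are excluded, so the retained share can be slightly under 30% when many customers tie at that value.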

In [None]:
#Take an explicit copy so that later in-place changes do not act on a view of df_org
df_highvalue = df_org[df_org.Total_Rech_6_7 > df_org.Total_Rech_6_7.quantile(0.7)].copy()
#Drop the 'Total_Rech_6_7' column as the high value customers are already identified
df_highvalue.drop('Total_Rech_6_7', axis=1, inplace = True)

In [None]:
df_highvalue.shape

#### The high-value customer set contains 29953 rows

Identify the churn customers in the 9th month by checking the values of four columns: 'total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9' and 'vol_3g_mb_9'.

If the values are zero for all of these columns, the customer belongs to the 'Churn' class.

Create a new feature, churn, and set it to 1 for all churn customers and 0 for non-churn customers.
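
The labelling rule can be illustrated on three hypothetical customers. The four column names are the ones listed above; the usage values are invented. The mean of the 0/1 label then gives the churn rate directly.

```python
import pandas as pd

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]
# Hypothetical month-9 usage for three customers
df = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0],     # no calls, no data -> churn
     [12.5, 0.0, 0.0, 0.0],    # incoming calls only -> active
     [0.0, 0.0, 3.2, 40.0]],   # data only -> active
    columns=usage_cols,
)

# Usage values are non-negative, so a zero sum means every column is zero
df["churn"] = (df[usage_cols].sum(axis=1) == 0).astype(int)
churn_rate = df["churn"].mean()
print(df["churn"].tolist(), round(churn_rate, 3))
```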

In [None]:
#Identify the customers who have already churned
cols_9 = ['total_ic_mou_9','total_og_mou_9','vol_2g_mb_9','vol_3g_mb_9']
df_highvalue['churn'] = df_highvalue[cols_9].sum(axis=1)
#df_highvalue.loc[df_highvalue.churn.nonzero()].churn = 1
df_highvalue.churn = df_highvalue.churn.apply(lambda x:1 if x == 0 else 0) 

In [None]:
#ratio of  churn Vs non Churn in the dataset
df_highvalue[df_highvalue.churn == 1].mobile_number.count() / df_highvalue[df_highvalue.churn == 0].mobile_number.count()

As seen from the Churn / Non Churn ratio, the dataset is highly imbalanced

In [None]:
#remove the attributes related to churn in month 9
cols_9_drp = [c for c in df_highvalue.columns if c[-2:] == '_9']
df_highvalue.drop(cols_9_drp,axis=1, inplace=True)
df_highvalue.drop('sep_vbc_3g',axis=1, inplace=True)
df_highvalue.shape

In [None]:
df_highvalue.drop('mobile_number',axis=1,inplace=True)

In [None]:
#remove the columns which have only 1 unique values. These columns will not matter since they are constant.
for col in df_highvalue.columns:
  if df_highvalue[col].nunique() == 1:
    df_highvalue.drop(col,axis=1,inplace=True)

In [None]:

from pandas_profiling import ProfileReport

profile = ProfileReport(df_highvalue)
profile


In [None]:
df_highvalue.describe(percentiles=[0.01,0.05,0.10,0.20,0.30,0.50,0.75, 0.80, 0.85, 0.90,0.95,0.99])
#ARPU can be negative and hence no treatment is done


## 2 Exploratory data Analysis

In [None]:

#Check the data for ARPU across 3 months
s= sns.boxplot(data=df_highvalue[['arpu_6','arpu_7','arpu_8']])
s.set_yscale("log")
#Almost same across the set for months 6,7,8

In [None]:
#Customer spend for data VS Voice across months 6,7,8
plt.figure(figsize=(10,10))
s = sns.boxplot(data=df_highvalue[['total_rech_amt_6','total_rech_amt_7','total_rech_amt_8','TotalRech_data_6' \
                              ,'TotalRech_data_7','TotalRech_data_8']])
s.set_yscale("log")
#Data revenue is slightly lower than the voice revenue.
#In month 8 the voice revenue has not dropped but the data revenue has.
#This could mean that customers are not happy with the data service.

Finding 1 - The plot above shows a reduction in data recharge in month 8. This is likely driven by the customers who churned in month 9 and had already reduced their usage in month 8. Most of the reduction is in data, which suggests that the data service is a key factor in churn.


In [None]:
#Check for the Voice Recharge across churn and no  churn users in 6 , 7 , 8
a, num = plt.subplots(1, 3)
plt.subplots_adjust(wspace = 0.5)
box1 = sns.boxplot(x='churn' ,y='total_rech_amt_6' ,data=df_highvalue,ax=num[0])
box1.set_yscale("log")
box2 = sns.boxplot(x='churn' ,y='total_rech_amt_7' ,data=df_highvalue,ax=num[1])
box2.set_yscale("log")
box3 = sns.boxplot(x='churn' ,y='total_rech_amt_8' ,data=df_highvalue,ax=num[2])
box3.set_yscale("log")
#The plot suggests that the recharge amount of churn customers dropped in month 8, while the revenue
#for non-churn customers is maintained at the same level

Finding 2 - Potential Churn customers have reduced the usage of the network in month 8 and this could provide an indication of churn.
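
One way to turn this finding into a model input is a "drop versus baseline" feature. The sketch below is an illustration only: `rech_amt_delta_8` is a hypothetical feature name that is not part of the dataset or of this notebook's pipeline, and the amounts are invented.

```python
import pandas as pd

# Hypothetical recharge amounts for two customers
df = pd.DataFrame({
    "total_rech_amt_6": [500, 400],
    "total_rech_amt_7": [520, 380],
    "total_rech_amt_8": [510, 90],
})

# Baseline = average of months 6 and 7; a large negative delta flags reduced activity in month 8
baseline = df[["total_rech_amt_6", "total_rech_amt_7"]].mean(axis=1)
df["rech_amt_delta_8"] = df["total_rech_amt_8"] - baseline
print(df["rech_amt_delta_8"].tolist())
```

The first customer stays at the baseline (delta 0), while the second drops by 300, the kind of pattern the box plots show for churners.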

In [None]:
#Check for the Recharge across churn and no  churn users in 6 , 7 , 8
a, num = plt.subplots(1, 3)
plt.subplots_adjust(wspace = 0.5)
box1 = sns.boxplot(x='churn' ,y='TotalRech_data_6' ,data=df_highvalue,ax=num[0])
box1.set_yscale("log")
box2 = sns.boxplot(x='churn' ,y='TotalRech_data_7' ,data=df_highvalue,ax=num[1])
box2.set_yscale("log")
box3 = sns.boxplot(x='churn' ,y='TotalRech_data_8' ,data=df_highvalue,ax=num[2])
box3.set_yscale("log")
#Significant fall in data recharge in month 8 for the customers who churned in month 9.
#Data recharge for the non-churn customers follows the same pattern as in months 6 and 7.
#Data seems a likely reason for the drop.

Finding 3 - Based on the above plot, the total data recharge by the customer is a significant indicator of churn.


In [None]:
#Check the calls to call center in the months 6,7,8 - local

a, num = plt.subplots(1, 3)
plt.subplots_adjust(wspace = 0.5)
sns.scatterplot(x='churn' ,y='loc_og_t2c_mou_6' ,data=df_highvalue ,ax=num[0])
sns.scatterplot(x='churn' ,y='loc_og_t2c_mou_7' ,data=df_highvalue,ax=num[1])
sns.scatterplot(x='churn' ,y='loc_og_t2c_mou_8' ,data=df_highvalue,ax=num[2])


#Calls to the call center seem to drop in month 8 for the churn customers. However, calls to the call center,
#probably to resolve their issues, are high in month 7 for churn customers. This suggests the field could be a predictor of churn.
#However, this needs to be confirmed in the model.

In [None]:
#Check the calls within network in the months 6,7,8 - local

a, num = plt.subplots(1, 3)
plt.subplots_adjust(wspace = 0.5)
box1 = sns.boxplot(x='churn' ,y='onnet_mou_6' ,data=df_highvalue ,ax=num[0])
box1.set_yscale("log")

box = sns.boxplot(x='churn' ,y='onnet_mou_7' ,data=df_highvalue,ax=num[1])
box.set_yscale("log")

box = sns.boxplot(x='churn' ,y='onnet_mou_8' ,data=df_highvalue,ax=num[2])
box.set_yscale("log")

#On-net calls are lower in month 8. This could be because churn customers make fewer calls in month 8

In [None]:
#Check the calls outside the network (off-net) in the months 6,7,8

a, num = plt.subplots(1, 3)
plt.subplots_adjust(wspace = 0.5)
box1 = sns.boxplot(x='churn' ,y='offnet_mou_6' ,data=df_highvalue ,ax=num[0])
box1.set_yscale("log")

box = sns.boxplot(x='churn' ,y='offnet_mou_7' ,data=df_highvalue,ax=num[1])
box.set_yscale("log")

box = sns.boxplot(x='churn' ,y='offnet_mou_8' ,data=df_highvalue,ax=num[2])
box.set_yscale("log")

#Off-net calls are lower in month 8. This could be because churn customers make fewer calls in month 8

In [None]:
#Check the age on network (aon, in days) for churn vs non-churn customers
box1 = sns.boxplot(x='churn' ,y='aon' ,data=df_highvalue )

Finding 4
- Age on network is generally lower for the customers leaving the network, with many outliers. The difference in medians is small.
- However, the range suggests that most churning customers have been on the network for less than 1000 days. Customers who have stayed beyond 1000 days tend to remain with the network.
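
A claim like this can be checked numerically by bucketing tenure and comparing churn rates. The sketch below uses invented `aon` values and churn labels; only the 1000-day split comes from the finding above.

```python
import pandas as pd

# Hypothetical tenure (days on network) and churn labels
df = pd.DataFrame({
    "aon":   [200, 450, 800, 950, 1200, 1800, 2500, 3100],
    "churn": [1,   0,   1,   0,   0,    0,    1,    0],
})

# Split at 1000 days; right=False puts exactly 1000 days in the upper band
df["tenure_band"] = pd.cut(df["aon"], bins=[0, 1000, float("inf")],
                           labels=["under_1000", "1000_plus"], right=False)
churn_by_band = df.groupby("tenure_band", observed=True)["churn"].mean()
print(churn_by_band)
```

On the real data, the same two lines of `pd.cut` and `groupby` applied to `df_highvalue` would show whether the churn rate really falls after 1000 days.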

In [None]:
#Compare the recharge amount and the number of data recharges for churn vs non-churn customers in months 6,7,8
a, num = plt.subplots(3, 3,figsize=(15,15))
plt.subplots_adjust(wspace = 0.5)
sc_6 = sns.scatterplot(x='total_rech_amt_6' ,y='total_rech_data_6',hue='churn' ,data=df_highvalue ,ax=num[0,0])
sc_7 = sns.scatterplot(x='total_rech_amt_7' ,y='total_rech_data_7',hue='churn' ,data=df_highvalue ,ax=num[0,1])
sc_8 = sns.scatterplot(x='total_rech_amt_8' ,y='total_rech_data_8',hue='churn' ,data=df_highvalue ,ax=num[0,2])

#a, num = plt.subplots(1, 3)
#plt.subplots_adjust(wspace = 0.5)
box_6 = sns.boxplot(x='churn' ,y='total_rech_data_6',data=df_highvalue ,ax=num[1,0])
#box_6.set_yscale("log")
box_7 = sns.boxplot(x='churn' ,y='total_rech_data_7',data=df_highvalue ,ax=num[1,1])
#box_7.set_yscale("log")
box_8 = sns.boxplot(x='churn' ,y='total_rech_data_8',data=df_highvalue ,ax=num[1,2])
#box_8.set_yscale("log")

box_6 = sns.boxplot(y='total_rech_amt_6' ,x='churn',data=df_highvalue ,ax=num[2,0])
box_7 = sns.boxplot(y='total_rech_amt_7' ,x='churn',data=df_highvalue ,ax=num[2,1])
box_8 = sns.boxplot(y='total_rech_amt_8' ,x='churn',data=df_highvalue ,ax=num[2,2])

Finding 5 - The spend of churn customers dropped in month 8, as seen from the scatter plots. There are very few orange points in the plot for month 8.

Finding 6 - The box plots also show lower recharge for both data and amount. For the churn customers, this reduced progressively from month 6 to month 8.

## 3 Splitting Data, correcting imbalance and scaling
- Split the data into Train and Test
- Mechanism to correct the imbalance in data using SMOTE
- Scale the data 
- Create an interpretable model using RFE with ridge (L2) logistic regression
- Check for the key parameters using PCA
- Use multiple models such as LogisticRegression, RandomForestClassifier and SVM with a poly kernel to check for the best result
- Read the confusion matrix with a focus on the churn customers rather than the non-churn customers. Use the recall score for this.
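
The order of these steps matters: splitting first keeps synthetic or duplicated rows out of the test set, and fitting the scaler on the training data only avoids leaking test statistics. The sketch below illustrates that order on synthetic data. It is not the notebook's pipeline: it uses plain random oversampling via `sklearn.utils.resample` in place of SMOTE, and it adds `stratify=y` to the split, which the notebook does not do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced data standing in for the telecom features (10% positives)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([1] * 20 + [0] * 180, name="churn")

# 1. Split first, so the test set keeps its natural class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=100, stratify=y)

# 2. Oversample the minority class in the training set only
#    (random oversampling here; the notebook itself uses SMOTE)
train = pd.concat([X_tr, y_tr], axis=1)
minority, majority = train[train.churn == 1], train[train.churn == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=40)
train_bal = pd.concat([majority, minority_up])

# 3. Fit the scaler on the training data and reuse it for the test data
features = train_bal.drop(columns="churn")
scaler = StandardScaler().fit(features)
X_tr_sc = scaler.transform(features)
X_te_sc = scaler.transform(X_te)
print(train_bal.churn.value_counts().to_dict(), X_te_sc.shape)
```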

In [None]:
y = df_highvalue.churn
X = df_highvalue.drop('churn',axis=1)
#Perform the Train test split in the dataset before the data can be upsampled for the churn class
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=100)

In [None]:
#Handling skew - cap and floor each column at the 5th and 95th percentiles of the training data
for col in X_train.columns:
    p05 = X_train[col].quantile(.05)
    p95 = X_train[col].quantile(.95)
    X_train[col] = X_train[col].clip(lower=p05, upper=p95)
    #The test data is capped with the training percentiles to avoid leakage
    X_test[col] = X_test[col].clip(lower=p05, upper=p95)



In [None]:
sm = SMOTE(random_state=40)
col = X_train.columns
X_train_1, y_train_1 = sm.fit_resample(X_train, y_train)
df_X_train_upsampled = pd.DataFrame(X_train_1,columns=col)

In [None]:
#Scale the Train and test data
sc = StandardScaler()
X_train_scaled = sc.fit_transform(df_X_train_upsampled)
X_test_scaled = sc.transform(X_test)
X_train_upsampled_sc = pd.DataFrame(X_train_scaled,columns=col)
X_test_sc = pd.DataFrame(X_test_scaled,columns=col)

## 4 Using RFE to identify the top 10 and Top 15 features.
These features are then used with a decision tree classifier to measure the performance on the test set.
The top 10 features give a good recall score. Recall drops for the top 15 features, even though the accuracy increases.
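
For readers unfamiliar with RFE, the sketch below shows the mechanics on synthetic data from `make_classification`; the feature counts are arbitrary. RFE repeatedly fits the estimator and discards the weakest feature until `n_features_to_select` remain. In recent scikit-learn versions this argument must be passed by keyword.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 12 features, of which 4 are informative
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=100)

estimator = LogisticRegression(C=0.1, max_iter=500,
                               class_weight="balanced", random_state=100)
selector = RFE(estimator, n_features_to_select=5).fit(X, y)

# ranking_ == 1 marks the selected features; larger ranks were eliminated earlier
selected = [i for i, rank in enumerate(selector.ranking_) if rank == 1]
print(selected)
```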

In [None]:
#Apply RFE using scikit-learn
from sklearn.feature_selection import RFE
# Importing decision tree classifier from sklearn library
from sklearn.tree import DecisionTreeClassifier

for i in [10,15]:
    estimator = LogisticRegression(C=0.1,max_iter=500,class_weight="balanced", random_state=100)
    #n_features_to_select must be passed by keyword in recent scikit-learn versions
    rfe_feat = RFE(estimator, n_features_to_select=i)
    fit = rfe_feat.fit(X_train_upsampled_sc,y_train_1)
    df_rfe = pd.DataFrame(fit.ranking_,columns=['Ranking'])
    df_col = pd.DataFrame(col,columns=['Feature'])
    df_score_rfe = pd.concat([df_rfe,df_col],axis=1)
    df_score_rfe.sort_values('Ranking',ascending=True,inplace=True)
    print(df_score_rfe.head(20))

    #use the RFE from the earlier steps and include those columns in the input for Training and Testing model
    #Fitting the decision tree with default hyperparameters, apart from
    #max_depth which is 5 to avoid overfitting

    dt_default = DecisionTreeClassifier(max_depth=5,random_state=100)
    dt_default.fit(X_train_upsampled_sc[df_score_rfe[df_score_rfe.Ranking ==1].Feature], y_train_1)
    churn_pred = dt_default.predict(X_test_sc[df_score_rfe[df_score_rfe.Ranking ==1].Feature])
    print (confusion_matrix(y_test,churn_pred))
    print ('Accuracy score - {0}'.format(accuracy_score(y_test, churn_pred)))
    print ('F1 score - {0}'.format(f1_score(y_test, churn_pred)))
    print ('Recall score - {0}'.format(recall_score(y_test, churn_pred)))
    print ('Precision score - {0}'.format(precision_score(y_test, churn_pred)))


In [None]:
#Use the random forest classifier to identify the top features influencing the model

featsel_RFC = RandomForestClassifier(class_weight="balanced",random_state=100)
featsel_RFC.fit(X_train_upsampled_sc,y_train_1)
df_features = pd.DataFrame({'Feature': X_train_upsampled_sc.columns, 'Gini': featsel_RFC.feature_importances_})
df_features = df_features.sort_values('Gini', ascending=False)
print (df_features.head(10))

#Fit the decision tree on the top 10 features only, so that the result is comparable with the RFE top 10
top_feats = df_features['Feature'].head(10)
dt_default = DecisionTreeClassifier(max_depth=5,random_state=100)
dt_default.fit(X_train_upsampled_sc[top_feats], y_train_1)
churn_pred = dt_default.predict(X_test_sc[top_feats])
print (confusion_matrix(y_test,churn_pred))
print ('Accuracy score - {0}'.format(accuracy_score(y_test, churn_pred)))
print ('F1 score - {0}'.format(f1_score(y_test, churn_pred)))
print ('Recall score - {0}'.format(recall_score(y_test, churn_pred)))
print ('Precision score - {0}'.format(precision_score(y_test, churn_pred)))


RFE gives a very similar score for the top 10 and the top 15 features, so the **top 10 features** can be taken as the key influencers.

The top 10 features provided by RFE perform better than the features from RandomForestClassifier. This could be because the relationships are mostly linear. The recall score is slightly higher for the features selected by RFE.

Combining the key features from both methods gives the list below:

- loc_ic_t2m_mou_8
- loc_ic_mou_7
- loc_ic_t2t_mou_8
- loc_ic_mou_8
- total_rech_num_8
- total_rech_num_7
- spl_ic_mou_8
- vol_2g_mb_8
- aug_vbc_3g
- roam_og_mou_8
- roam_ic_mou_8
- total_ic_mou_8
- last_day_rch_amt_8
- date_of_last_rech_data_8
- loc_og_t2m_mou_8
- loc_og_mou_8

The next step is to examine the correlation among these features and 'churn', and to derive the key recommendations.




In [None]:
RFE_cols = ['loc_ic_t2m_mou_8','loc_ic_mou_7','loc_ic_t2t_mou_8','loc_ic_mou_8','total_rech_num_8','total_rech_num_7' \
            ,'spl_ic_mou_8','vol_2g_mb_8','aug_vbc_3g','roam_og_mou_8','roam_ic_mou_8','total_ic_mou_8','last_day_rch_amt_8' \
            ,'date_of_last_rech_data_8','loc_og_t2m_mou_8','loc_og_mou_8','churn']
corr = df_highvalue[RFE_cols].corr()
# plot the heatmap
plt.figure(figsize=(10,10))
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns,
        annot=True)


- Set 1 - ```'loc_ic_t2m_mou_8','loc_ic_mou_7','loc_ic_t2t_mou_8','loc_ic_mou_8'``` are highly correlated, as these are all incoming-call features. They are also correlated with ```'total_ic_mou_8'```
- Set 2 - ```'total_rech_num_8','total_rech_num_7'``` are highly correlated
- Set 3 - ```'roam_og_mou_8','roam_ic_mou_8'``` are highly correlated
- Set 4 - ```'vol_2g_mb_8','aug_vbc_3g'``` are correlated with ```'date_of_last_rech_data_8'```
- Set 5 - ```'last_day_rch_amt_8'``` is not correlated with the other features and impacts churn. No recharge in month 8 indicates churn.
- Set 6 - ```'loc_og_t2m_mou_8','loc_og_mou_8'``` are highly correlated with each other and also with the incoming-call features.
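
Instead of reading such groups off the heatmap by eye, the strongly correlated pairs can be listed programmatically. The sketch below uses three synthetic columns, and the 0.8 threshold is an arbitrary choice for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic features: b is a noisy copy of a, c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=500)
df = pd.DataFrame({"a": a,
                   "b": a + 0.1 * rng.normal(size=500),
                   "c": rng.normal(size=500)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(row, col) for row in upper.index for col in upper.columns
         if upper.loc[row, col] > 0.8]   # illustrative threshold
print(pairs)
```

Running the same logic on `df_highvalue[RFE_cols].corr()` would list the pairs behind the sets described above.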


In [None]:
#Set 1 and 6 EDA - 'total_ic_mou_8' can be considered as one of the key predictors

a, num = plt.subplots(2, 4,figsize=(15,5))
plt.subplots_adjust(wspace = 0.5)
#a, num = plt.subplots(1, 3)
#plt.subplots_adjust(wspace = 0.5)
box_1 = sns.boxplot(x='churn' ,y='loc_ic_t2m_mou_8',data=df_highvalue ,ax=num[0,0])
box_1.set_yscale("log")
box_2 = sns.boxplot(x='churn' ,y='loc_ic_mou_7',data=df_highvalue ,ax=num[0,1])
box_2.set_yscale("log")
box_3 = sns.boxplot(x='churn' ,y='loc_ic_t2t_mou_8',data=df_highvalue ,ax=num[0,2])
box_3.set_yscale("log")
box_4 = sns.boxplot(y='loc_ic_mou_8' ,x='churn',data=df_highvalue ,ax=num[0,3])
box_4.set_yscale("log")
box_5 = sns.boxplot(y='total_ic_mou_8' ,x='churn',data=df_highvalue ,ax=num[1,0])
box_5.set_yscale("log")
box_6 = sns.boxplot(y='loc_og_t2m_mou_8' ,x='churn',data=df_highvalue ,ax=num[1,1])
box_6.set_yscale("log")
box_7 = sns.boxplot(y='loc_og_mou_8' ,x='churn',data=df_highvalue ,ax=num[1,2])
box_7.set_yscale("log")


In [None]:
#Set 2 and 3 EDA - 'total_rech_num_8' can be considered as one of the key predictors.
#Churn customers seem to have very high roaming usage in month 8.

a, num = plt.subplots(2, 2,figsize=(10,5))
plt.subplots_adjust(wspace = 0.5)

box_1 = sns.boxplot(x='churn' ,y='total_rech_num_8',data=df_highvalue ,ax=num[0,0])
box_1.set_yscale("log")
box_2 = sns.boxplot(x='churn' ,y='total_rech_num_7',data=df_highvalue ,ax=num[0,1])
box_2.set_yscale("log")
box_3 = sns.boxplot(x='churn' ,y='roam_og_mou_8',data=df_highvalue ,ax=num[1,0])
box_3.set_yscale("log")
box_4 = sns.boxplot(x='churn' ,y='roam_ic_mou_8',data=df_highvalue ,ax=num[1,1])
box_4.set_yscale("log")


In [None]:
#Set 4-5 EDA - Most churn customers have not done a data recharge in month 8.
#We can check whether there is a pattern between data recharge and normal recharge for the churn customers.
#last_day_rch_amt_8 is lower for the churn customers in month 8

a, num = plt.subplots(1, 3,figsize=(10,5))
plt.subplots_adjust(wspace = 0.5)

box_1 = sns.boxplot(x='churn' ,y='vol_2g_mb_8',data=df_highvalue ,ax=num[0])
box_1.set_yscale("log")
box_2 = sns.boxplot(x='churn' ,y='aug_vbc_3g',data=df_highvalue ,ax=num[1])
box_2.set_yscale("log")
box_3 = sns.boxplot(x='churn' ,y='last_day_rch_amt_8' ,data=df_highvalue , ax=num[2])
box_3.set_yscale("log")

## Key messages from the above plots
 

1.   Churn customers are not recharging for data in month 8.
2.   Roaming calls increase for the churn customers in month 8. This may indicate that these customers have moved to other locations, which could be the reason for churn.
3.   Outgoing calls for the churn customers drop drastically in month 8 compared to month 7. This is also related to the very low recharge in month 8.
4.   Local and total incoming calls are lower for churners, which may indicate that the customer already has another mobile connection on a different network in month 8.

## Key recommendation
- Review the data plans and data availability, as these seem to be the pain areas for churning customers. This is indicated by the reduction in data recharge and in Volume Based Cost (3G) in month 8.
- Some customers churn because they move out of their existing circle. This is seen in the increase in roaming calls for churn customers. The telecom company can identify such cases from the existing data and offer suitable plans.
- Customers stop recharging one month before the churn month (month 8 in this case). The telecom company can identify these customers, talk to them about the issues they face and offer a solution.







## 5 Perform PCA and identify a good model for prediction of Churners
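
The grids in this section fix `pca__n_components` at 90 or 50. One common way to arrive at such a number, shown here as a sketch on synthetic data rather than as the method used in this notebook, is to pick the smallest number of components whose cumulative explained variance reaches a target such as 95%.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training matrix
X, _ = make_classification(n_samples=400, n_features=30, n_informative=8,
                           n_redundant=10, random_state=100)
X_sc = StandardScaler().fit_transform(X)

pca_full = PCA(random_state=100).fit(X_sc)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.argmax(cum_var >= 0.95) + 1)
print(n_components, round(float(cum_var[n_components - 1]), 3))
```

The resulting value could then be placed in the `pca__n_components` list of the parameter grid, alongside neighbouring values if tuning time allows.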


In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pca= PCA(random_state=100)
log = LogisticRegression(solver='liblinear' ,class_weight="balanced",penalty='l2',random_state=100)
pipe = Pipeline(steps=[('pca', pca), ('logistic', log)])
param_grid = {
    'pca__n_components': [90],
    'logistic__C': [0.1,0.05],
    'logistic__max_iter' :[500,1000]
}
grid_log = GridSearchCV(pipe, param_grid ,scoring = 'recall')
grid_log.fit(X_train_upsampled_sc, y_train_1)
print('-------Logistic regression Best Hyper Params---------------')
print("Best Recall score=%0.3f:" % grid_log.best_score_)
print(grid_log.best_params_)
print('-----------------------------------------------------------')

rfc = RandomForestClassifier(class_weight="balanced",random_state=100)
pipe_rfc = Pipeline(steps=[('pca', pca), ('rfc', rfc)])
param_rfc = {
    'pca__n_components': [50],
    'rfc__n_estimators': [100],
    'rfc__max_features':[5,10],
    'rfc__min_samples_leaf': [5,10]
}
grid_rfc = GridSearchCV(pipe_rfc, param_rfc ,scoring = 'recall')
grid_rfc.fit(X_train_upsampled_sc, y_train_1)
print('-------RandomForest Best Hyper Params---------------')
print("Best Recall score=%0.3f:" % grid_rfc.best_score_)
print(grid_rfc.best_params_)
print('-----------------------------------------------------------')

svc = SVC(class_weight="balanced",random_state=100)
pipe_svc = Pipeline(steps=[('pca', pca), ('svc', svc)])
param_grid = {
    'pca__n_components': [50],
    'svc__C': [0.1],
    'svc__kernel':['poly'] ,
    'svc__gamma':[0.01]
}
grid_svc = GridSearchCV(pipe_svc, param_grid ,scoring = 'recall')
grid_svc.fit(X_train_upsampled_sc, y_train_1)
print('-------SVM Best Hyper Params---------------')
print("Best Recall score=%0.3f:" % grid_svc.best_score_)
print(grid_svc.best_params_)
print('-----------------------------------------------------------')
#Best recall score 0.971 in poly

In [None]:
#Use the best estimator from each grid search to predict on the test set

grid_best = grid_log.best_estimator_
churn_pred = grid_best.predict(X_test_sc)
print ('------------ Test data result using best Logistic Model-------------')
print (confusion_matrix(y_test,churn_pred,labels=[0,1]))
print ('Accuracy score - {0}'.format(accuracy_score(y_test, churn_pred)))
print ('F1 score - {0}'.format(f1_score(y_test, churn_pred)))
print ('Recall score - {0}'.format(recall_score(y_test, churn_pred)))
print ('Precision score - {0}'.format(precision_score(y_test, churn_pred)))
print (classification_report(y_test, churn_pred))

grid_best = grid_rfc.best_estimator_
churn_pred = grid_best.predict(X_test_sc)
print ('------------ Test data result using best Random forest Model-------------')
print (confusion_matrix(y_test,churn_pred))
print ('Accuracy score - {0}'.format(accuracy_score(y_test, churn_pred)))
print ('F1 score - {0}'.format(f1_score(y_test, churn_pred)))
print ('Recall score - {0}'.format(recall_score(y_test, churn_pred)))
print ('Precision score - {0}'.format(precision_score(y_test, churn_pred)))


#Keep grid_svc as the fitted search object; use a separate name for the best estimator
grid_best = grid_svc.best_estimator_
churn_pred = grid_best.predict(X_test_sc)
print ('------------ Test data result using best SVC Model-------------')
print (confusion_matrix(y_test,churn_pred))
print ('Accuracy score - {0}'.format(accuracy_score(y_test, churn_pred)))
print ('F1 score - {0}'.format(f1_score(y_test, churn_pred)))
print ('Recall score - {0}'.format(recall_score(y_test, churn_pred)))
print ('Precision score - {0}'.format(precision_score(y_test, churn_pred)))
print (classification_report(y_test, churn_pred))
#Overall Recall score is best for SVM. However the accuracy is very low. 
#Logistic regression performs much better here as the accuracy and recall score is high.
#0.8458 recall score

## Result
- SVM with a poly kernel gives a recall score of 0.80 and an accuracy of 0.85 on the test data. However, the model appears to overfit, since there is a large gap between the test score and the train score (0.96).
- Logistic regression with an l2 penalty gives a recall score of 0.85 on the test data and a similar score on the train data. It gives an accuracy of 0.83 on the test data. There is no sign of overfitting since the train and test scores are very similar. This looks like a robust model.
- The RandomForest classifier performs poorly on this dataset. It has good accuracy but a very low recall score.
- The SVM and logistic models are very close, but **logistic regression gives good recall and accuracy and can be used in this case**
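
The overfitting check used in these conclusions, comparing recall on the train and test sets, can be wrapped in a small helper. The sketch below is illustrative: `recall_gap` is a hypothetical helper name, the data is synthetic, and an unpruned decision tree stands in as an easy-to-overfit model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def recall_gap(model, X_tr, y_tr, X_te, y_te):
    """Return (train recall, test recall, difference) for a fitted model."""
    r_tr = recall_score(y_tr, model.predict(X_tr))
    r_te = recall_score(y_te, model.predict(X_te))
    return r_tr, r_te, r_tr - r_te

# Synthetic data with label noise (flip_y), so memorising the training set does not generalise
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=100)

# An unpruned tree memorises the training set, which makes it a handy overfitting example
deep_tree = DecisionTreeClassifier(random_state=100).fit(X_tr, y_tr)
r_tr, r_te, gap = recall_gap(deep_tree, X_tr, y_tr, X_te, y_te)
print(r_tr, r_te, gap)
```

A gap close to zero, as reported for the logistic model, suggests the model generalises; a large positive gap, as reported for the SVM, points to overfitting.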