## Problem Statement
 

### Business Problem Overview

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal.

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. 

In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

### Understanding and Defining Churn:

1. There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services).


2. In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn.


3. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again).

4. Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America.

This project is based on the Indian and Southeast Asian market.

Definitions of Churn:

There are various ways to define churn, such as:

1. Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’.

   The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.

2. Usage-based churn: Customers who have not done any usage, either incoming or outgoing - in terms of calls, internet etc. over a period of time.

   A potential shortcoming of this definition is that when the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator.

In this project, you will use the usage-based definition to define churn.
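
To make the difference between the two definitions concrete, here is a minimal sketch on a hypothetical toy table. The column names (`revenue`, `incoming_min`, `outgoing_min`, `data_mb`) are illustrative assumptions and are not the dataset's own attributes.

```python
import pandas as pd

# Hypothetical one-period usage summary; column names are illustrative only.
usage = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "revenue": [120.0, 0.0, 0.0],
    "incoming_min": [30.0, 45.0, 0.0],
    "outgoing_min": [50.0, 0.0, 0.0],
    "data_mb": [200.0, 0.0, 0.0],
})

# Revenue-based churn: no revenue generated in the period.
usage["revenue_churn"] = (usage["revenue"] == 0).astype(int)

# Usage-based churn: no incoming or outgoing activity of any kind.
activity = usage[["incoming_min", "outgoing_min", "data_mb"]]
usage["usage_churn"] = (activity.sum(axis=1) == 0).astype(int)

print(usage[["customer", "revenue_churn", "usage_churn"]])
```

Customer B only receives calls, so the revenue-based flag marks them as churned while the usage-based flag does not. This is the shortcoming of the revenue-based definition described above.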

High-value Churn:

1. In the Indian and the southeast Asian market, approximately 80% of revenue comes from the top 20% customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage.

2. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.

 

### Understanding the Business Objective and the Data:

1. The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. 


2. The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.


### Understanding Customer Behaviour During Churn:

Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of customer lifecycle :

1. The ‘good’ phase: 
In this phase, the customer is happy with the service and behaves as usual.

2. The ‘action’ phase: 
The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.)

3. The ‘churn’ phase: 
In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.

 
In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.

### Data Dictionary:

The dataset can be downloaded using this link. The data dictionary is provided for download below.

1. Data Dictionary - Telecom Churn
The data dictionary contains meanings of abbreviations. Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc.

 
2. The attributes containing 6, 7, 8, 9 as suffixes imply that those correspond to the months 6, 7, 8, 9 respectively.

### Data Preparation:

The following data preparation steps are crucial for this problem:


1. Derive new features

This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn.

 
2. Filter high-value customers

As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

 After filtering the high-value customers, you should get about 29.9k rows.

3. Tag churners and remove attributes of the churn phase

Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are:

total_ic_mou_9

total_og_mou_9

vol_2g_mb_9

vol_3g_mb_9


After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names).
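
The filtering and tagging steps above can be sketched on a small hypothetical table. The recharge columns `rech_6` and `rech_7` are assumed names for illustration. Only the four `_9` attributes follow the dataset's naming.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for ten customers (values are made up).
toy = pd.DataFrame({
    "rech_6": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "rech_7": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "total_ic_mou_9": [3.0, 0.0, 4.0, 1.0, 0.0, 2.0, 6.0, 0.0, 5.0, 0.0],
    "total_og_mou_9": [1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0],
    "vol_2g_mb_9": [0.0] * 9 + [1.2],
    "vol_3g_mb_9": [0.0] * 10,
})

# Step 2: keep customers at or above the 70th percentile of the
# good-phase (months 6 and 7) average recharge amount.
good_phase_avg = toy[["rech_6", "rech_7"]].mean(axis=1)
cutoff = np.percentile(good_phase_avg, 70)
high_value = toy[good_phase_avg >= cutoff].copy()

# Step 3: churn = 1 when all four churn-phase usage attributes are zero.
churn_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]
high_value["churn"] = (high_value[churn_cols] == 0).all(axis=1).astype(int)

# Finally, discard every churn-phase attribute.
churn_phase_cols = [c for c in high_value.columns if c.endswith("_9")]
high_value = high_value.drop(columns=churn_phase_cols)

print(high_value)
```

Tagging before dropping matters: once the `_9` columns are removed, the churn label can no longer be derived.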

### Modelling:

Build models to predict churn. The predictive model that you’re going to build will serve two purposes:

It will be used to predict whether a high-value customer will churn or not, in near future (i.e. churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc.

It will be used to identify important variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks.

In some cases, both of the above-stated goals can be achieved by a single machine learning model. But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model.

Also, since the rate of churn is typically low (about 5-10%, this is called class-imbalance) - try using techniques to handle class imbalance. 
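
As a rough sketch of how dimensionality reduction and class-imbalance handling can be combined, the example below chains scaling, PCA and a class-weighted logistic regression in one scikit-learn pipeline. It runs on synthetic data from `make_classification`, not the telecom dataset. The 95% variance threshold and the use of `class_weight="balanced"` are assumptions for illustration, and other techniques such as resampling are equally valid.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data: roughly 8% positives.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = make_pipeline(
    StandardScaler(),                 # PCA is sensitive to feature scale
    PCA(n_components=0.95),           # keep components explaining 95% of variance
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)

# Recall on the churn class reflects the goal of catching churners.
churn_recall = recall_score(y_te, model.predict(X_te))
n_components = model.named_steps["pca"].n_components_
print("components kept:", n_components, "churn recall:", round(churn_recall, 3))
```

Fitting the scaler and PCA inside the pipeline means they are learned from the training split only, which avoids leaking test information into the transformation.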


You can take the following suggestive steps to build the model:

1. Preprocess data (convert columns to appropriate formats, handle missing values, etc.)

Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering).

2. Derive new features.

3. Reduce the number of variables using PCA.

4. Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques).

5. Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal.

6. Finally, choose a model based on some evaluation metric.

7. The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret.

8. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity.

9. After identifying important predictors, display them visually - you can use plots, summary tables etc. - whatever you think best conveys the importance of features. 

10. Finally, recommend strategies to manage customer churn based on your observations.
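
For the second, interpretable model (steps 8 and 9), one possible sketch uses a tree-family model and reads off its feature importances. The data here is synthetic and the feature names (`usage_drop`, `noise_a`, `noise_b`) are hypothetical. The label is constructed to depend only on `usage_drop`, so the ranking is easy to sanity-check.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features on their original (non-PCA) scale.
features = pd.DataFrame({
    "usage_drop": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# Synthetic label that depends only on usage_drop.
target = (features["usage_drop"] > 1.0).astype(int)

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0)
forest.fit(features, target)

# Rank features by impurity-based importance for a summary table or bar plot.
importance = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importance)
```

Because this model is trained on the original attributes rather than principal components, each importance maps back to a named variable the business can act on.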
 

Note: Everything has to be submitted in one Jupyter notebook.

 

The evaluation rubrics are mentioned on the next page.

## 1. Data Reading & Understanding:

In [None]:
# Import the Necessary Libraries

import numpy as np
import pandas as pd

# Import Visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import the logistic regression 
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Import the scaler, KMeans etc.,
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
import os

# Suppress the warnings
import warnings
warnings.filterwarnings('ignore')


In [None]:
# Read the CSV file and preview the first few rows and columns of the data.

Churn = pd.read_csv('Downloads//telecom_churn_data.csv')
Churn.head()

In [None]:
# Let us check the data shape
Churn.shape

In [None]:
# We have 99999 rows and 226 columns.
# Let us check the datatypes
Churn.info(verbose=1)

In [None]:
# Let us check the statistics 
Churn.describe()

## 2. Data Cleaning:

In [None]:
# Let us see the percentage of NaN values per column, sorted in descending order
round(Churn.isnull().mean(axis=0).sort_values(ascending=False)*100,2)

In [None]:
# Many columns have more than 70% of their values missing. We need to check the unique values first and deal with them one by one.

In [None]:
# Let us check the unique values in sorting order and check the data / value first.
Churn.nunique().sort_values()

In [None]:
## Missing-value percentage of all the columns, sorted in descending order
round(Churn.isnull().mean(axis=0).sort_values(ascending=False)*100,2)

## 3. Data Imputation :

In [None]:
# More than 40% of the variables with missing values are needed for solving the business problem, so we impute them rather than drop them.
# Let us make a custom function for the data cleaning part.
# The function computes the missing-value percentage per column and returns the columns above the cutoff.
def checknull(per_cutoff):
    missing = round(100*(Churn.isnull().sum()/Churn.shape[0]))
    print("There are {} features having more than {}% missing value".format(len(missing.loc[missing > per_cutoff]),per_cutoff))
    return missing.loc[missing > per_cutoff]

In [None]:
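# Let us also create an impute function so that we can impute the columns group by group.
def imputenan(data,imputeColList=False,missingColList=False):
    # Function to impute NaN with 0
    # imputeColList: base column names; the suffixes _6, _7, _8, _9 are appended to each
    # missingColList: full column names, used when imputeColList is not given
    if imputeColList:
        for column in [x + y for y in ['_6','_7','_8','_9'] for x in imputeColList]:
            # Assign back instead of calling fillna(inplace=True) on a column selection,
            # which may not update the frame in recent pandas versions
            data[column] = data[column].fillna(0)
    else:    
        for column in missingColList:
            data[column] = data[column].fillna(0)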

In [None]:
# Check the missing values again
checknull(70)

In [None]:
# Several KPIs appear in the list above, so we impute them with zero rather than drop them.
imputeCol = ['av_rech_amt_data', 'arpu_2g', 'arpu_3g', 'count_rech_2g', 'count_rech_3g',
             'max_rech_data', 'total_rech_data','fb_user','night_pck_user']
imputenan(Churn,imputeCol)


In [None]:
# Check the data again now.
checknull(70)

In [None]:
# dropping the columns having more than 70% missing values
missingcol = list(checknull(70).index)
Churn.drop(missingcol,axis=1,inplace=True)
Churn.shape


In [None]:
# Check the head again
Churn.head()

In [None]:
# circle_id holds a single unique value, so it carries no information; drop it.
# With inplace=True, drop() returns None, so the result is not assigned to a variable.
Churn.drop('circle_id',axis=1,inplace = True)
Churn.head()

In [None]:
# check again the missing values back in the data sets.
checknull(5)

In [None]:
# There are still 29 features with more than 5% missing values. Let us check them first.
missingcol = list(checknull(5).index)
print ("There are %d customers missing values for %s"%(len(Churn[Churn[missingcol].isnull().all(axis=1)]),missingcol))
Churn[Churn[missingcol].isnull().all(axis=1)][missingcol].head()

In [None]:
# Let us impute these features with zero, since a large number of customers (7745) are affected.
imputenan(Churn,missingColList=missingcol)

In [None]:
# Drop any rows where all of these features are still missing, then check the shape.
Churn=Churn[~Churn[missingcol].isnull().all(axis=1)]
Churn.shape

In [None]:
# Let us check the data set again with more than 2% nan values
checknull(2)

In [None]:
# There are still 89 features with more than 2% missing values. Let us check them first.
missingcol = list(checknull(2).index)
print ("There are %d customers missing values for %s"%(len(Churn[Churn[missingcol].isnull().all(axis=1)]),missingcol))
Churn[Churn[missingcol].isnull().all(axis=1)][missingcol].head()

In [None]:
# Drop the customers for whom all of these features are missing, then check the shape again.
Churn=Churn[~Churn[missingcol].isnull().all(axis=1)]
Churn.shape

In [None]:
# 381 customers were missing all 89 of these features. Impute the remaining missing values with zero, excluding the date columns.
missingcol.remove('date_of_last_rech_8')
missingcol.remove('date_of_last_rech_9')
imputenan(Churn,missingColList=missingcol)


In [None]:
# Let us check the missing nan again
# Let us check the data set again
checknull(0)

In [None]:
# Let us create the new data frame and store these features and run the uniqueness to impute it further .
columns = ['loc_og_t2o_mou','std_og_t2o_mou','loc_ic_t2o_mou','last_date_of_month_8','last_date_of_month_9','date_of_last_rech_6','date_of_last_rech_7', 'date_of_last_rech_8', 'date_of_last_rech_9']
for x in columns: 
    print("Unique values in column %s are %s" % (x,Churn[x].unique()))

In [None]:
# Each of these columns holds a single value, so impute the missing entries with the mode.
columns = ['loc_og_t2o_mou','std_og_t2o_mou','loc_ic_t2o_mou','last_date_of_month_7','last_date_of_month_8','last_date_of_month_9']
for x in columns:
    print(Churn[x].value_counts())
    # Assign back rather than using fillna(inplace=True) on a column selection
    Churn[x] = Churn[x].fillna(Churn[x].mode()[0])
print("All the above features take only one value, so their missing values were imputed with the mode.")

In [None]:
# Check which variables still have missing values after the imputation
checknull(0)

In [None]:

# All these features are missing together
missingcol = list(checknull(0).index)
print ("There are %d rows in total having missing values for these variables."%(len(Churn[Churn[missingcol].isnull().all(axis=1)])))

In [None]:
# Let us impute the missing recharge dates with the month-end dates below.
# .loc assigns in place; chained indexing (df[mask][col] = ...) only modifies a temporary copy.
Churn.loc[Churn['date_of_last_rech_6'].isnull(), 'date_of_last_rech_6'] = '6/30/2014'
Churn.loc[Churn['date_of_last_rech_7'].isnull(), 'date_of_last_rech_7'] = '7/31/2014'
Churn.loc[Churn['date_of_last_rech_8'].isnull(), 'date_of_last_rech_8'] = '8/31/2014'
Churn.loc[Churn['date_of_last_rech_9'].isnull(), 'date_of_last_rech_9'] = '9/30/2014'

In [None]:
# Now we can see that the data is completely cleaned. Let us check the head.
Churn.head()

In [None]:
# Some columns contain only zeros. Let us drop them and proceed.

Single_value =Churn.columns[(Churn == 0).all()]
print ("There are {} features only having zero as values. These features are \n{}".format(len(Single_value),Single_value))

In [None]:
# We can drop these features. 
Churn.drop(Single_value,axis=1,inplace=True)

In [None]:
# Percentage of data left after removing the missing values.
print("Percentage of data remaining after treating missing values: {}%".format(round(Churn.shape[0]/99999 *100,2)))
print ("Number of customers: {}".format(Churn.shape[0]))
print ("Number of features: {}".format(Churn.shape[1]))

In [None]:
Churn.reset_index(inplace=True,drop=True)
# list of all columns which store date
date_columns = list(Churn.filter(regex='date').columns)
date_columns

In [None]:
# Converting dtype of date columns to datetime
for col in date_columns:
    Churn[col] = pd.to_datetime(Churn[col], format='%m/%d/%Y')

In [None]:
# Check the shape of the data after cleaning.
Churn.shape

## 4. Derive New Features:

In [None]:
# Let us create the new features. Also Filter high-value customers
# Defining high-value customers as follows:
#Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).
rech_col = Churn.filter(regex=('count')).columns
Churn[rech_col].head()


In [None]:

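# Creating new feature: avg_rech_amt_6,avg_rech_amt_7,avg_rech_amt_8,avg_rech_amt_9
# Average amount per recharge = total recharge amount / number of recharges.
# Months with zero recharges produce NaN (0/0), which the next cell imputes with 0.
for i in range(6,10):
    Churn['avg_rech_amt_'+str(i)] = round(Churn['total_rech_amt_'+str(i)]/Churn['total_rech_num_'+str(i)],2)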

In [None]:
# impute the NAN values

imputenan(Churn,missingColList=['avg_rech_amt_6','avg_rech_amt_7','avg_rech_amt_8','avg_rech_amt_9'])

In [None]:
# Let us create the total number of data recharges for each month and store it.
# total data recharge count = count of 2G recharges + count of 3G recharges, converted to integer.
for i in range(6,10):
    Churn['total_rech_num_data_'+str(i)] = (Churn['count_rech_2g_'+str(i)]+Churn['count_rech_3g_'+str(i)]).astype(int)

In [None]:
# Total data recharge amount = total number of data recharges * average data recharge amount
for i in range(6,10):
    Churn['total_rech_amt_data_'+str(i)] = Churn['total_rech_num_data_'+str(i)] * Churn['av_rech_amt_data_'+str(i)]

In [None]:
# Another new feature : total month recharge = total recharge amount + total recharge data for each customer each month
for i in range(6,10):
    Churn['total_month_rech_'+str(i)] = Churn['total_rech_amt_'+str(i)]+Churn['total_rech_amt_data_'+str(i)]
Churn.filter(regex=('total_month_rech')).head()


In [None]:
# calculating the average of the first two months' (good phase) total monthly recharge amount
Good_Phase_avg =(Churn.total_month_rech_6 + Churn.total_month_rech_7)/2
# finding the cutoff which is the 70th percentile of the good phase average recharge amounts
Cut_off = np.percentile(Good_Phase_avg,70)
# Filtering the users whose good phase avg. recharge amount >= the 70th percentile cutoff.
# .copy() gives an independent frame so later column assignments do not act on a slice of Churn.
Highvalu_users = Churn[Good_Phase_avg >= Cut_off].copy()
# Reset the index.
Highvalu_users.reset_index(inplace=True,drop=True)

print("Number of High-Value Customers in the Dataset: %d\n"% len(Highvalu_users ))
# The 2 belongs inside round(); previously it was passed to format() and ignored.
print("Percentage High-value users in data : {}%".format(round(len(Highvalu_users )/Churn.shape[0]*100,2)))

In [None]:
## We need the tag the churners. 
#Tagging Churners. Now tag the churned customers (churn=1, else 0) based on the fourth month as follows:

#Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes we need to use to tag churners are:

#total_ic_mou_9
#total_og_mou_9
#vol_2g_mb_9
#vol_3g_mb_9

def getChurnStatus(data,Churn_Phase_Month=9):
    # Function to tag customers as churners (churn=1, else 0) based on 'vol_2g_mb_','vol_3g_mb_','total_ic_mou_','total_og_mou_'
    # argument: Churn_Phase_Month, the month number used to define churn (default = 9)
    # A customer is a churner when all four usage attributes are zero in that month.
    Churn_features= ['vol_2g_mb_','vol_3g_mb_','total_ic_mou_','total_og_mou_']
    flag = ~data[[s + str(Churn_Phase_Month) for s in Churn_features ]].any(axis=1)
    flag = flag.map({True:1, False:0})
    return flag

In [None]:
# Check the Churn and not churn data for high value customers.
Highvalu_users['Churn'] = getChurnStatus(Highvalu_users,9)
print(" We have {} users tagged as churners out of {} High Value Customers.".format(len(Highvalu_users[Highvalu_users.Churn == 1]),Highvalu_users.shape[0]))
print("High-value Churn Percentage : {}%".format(round(len(Highvalu_users[Highvalu_users.Churn == 1])/Highvalu_users.shape[0] *100,2)))

# The data is highly imbalanced: only 8.09% of the high-value customers are churners.

In [None]:
Highvalu_users.head()
Highvalu_users.shape

In [None]:
# Revenue is one of the important parameters, so let us create total revenue features.
# Average revenue per user (ARPU) = total revenue / average subscribers
# Total revenue = ARPU * average subscribers (assumed here to be 5000)

# New feature: total revenue for each month
for i in range(6,10):
    Highvalu_users['Total_revenue_'+str(i)] = Highvalu_users['arpu_'+str(i)] * 5000


In [None]:
Highvalu_users.filter(regex=('Total_revenue_')).head()

In [None]:
# Total Revenue in the Phase wise manner good Phase - first two months, action phase is third month and fourth month is churn phase.
# Let us see the revenue in the Phase wise manner

Highvalu_users['Good_Phase'] = Highvalu_users['Total_revenue_6'] + Highvalu_users['Total_revenue_7']
Highvalu_users['Action_Phase'] =  Highvalu_users['Total_revenue_8']
Highvalu_users['Chrun_Phase'] = Highvalu_users['Total_revenue_9']

In [None]:
#After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names).
# Collect the names of the churn-phase columns and drop them by label
Chrun_Phase = Highvalu_users.filter(regex='_9', axis=1).columns
Highvalu_users.drop(Chrun_Phase,axis=1,inplace=True)
Highvalu_users.shape

## 5. Data Analysis
### Univariate and Bi-Variate Analysis:

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (20,20))        # Size of the figure
# numeric_only=True skips the datetime columns, which cannot be correlated
sns.heatmap(Highvalu_users.corr(numeric_only=True),annot = True, fmt = ".2f", cmap = "GnBu")

### Inferences:

It is very difficult to read the correlations from this plot. Let us find the variables with a correlation above 0.80 and plot again.

In [None]:
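# correlation matrix (numeric columns only)
corr_matrix = Highvalu_users.corr(numeric_only=True).abs()

# Selecting the upper triangle of the correlation matrix
# (np.bool was removed from NumPy; the built-in bool is used instead)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))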

# feature columns with correlation greater than 0.80
high_corr_feat = [column for column in upper.columns if any(upper[column] > 0.80)]

print("HIGHLY CORRELATED FEATURES IN DATA SET:{}\n\n{}".format(len(high_corr_feat), high_corr_feat))

In [None]:
# Inferences: Around 66 variables are highly correlated with each other (correlation above 0.80).

In [None]:
# Let us check only these variables correlation and find the inferences and proceed for the univarate and bi-variate analysis.
new_corrs = Highvalu_users[['onnet_mou_8', 'loc_og_t2t_mou_8', 'loc_og_t2m_mou_8', 'loc_og_t2f_mou_7', 'loc_og_mou_6', 'loc_og_mou_7', 'loc_og_mou_8', 'std_og_t2t_mou_6', 'std_og_t2t_mou_7', 'std_og_t2t_mou_8', 'std_og_t2m_mou_6', 'std_og_t2m_mou_7', 'std_og_t2m_mou_8', 'isd_og_mou_7', 'isd_og_mou_8', 'total_og_mou_6', 'total_og_mou_7', 'total_og_mou_8', 'loc_ic_t2t_mou_7', 'loc_ic_t2t_mou_8', 'loc_ic_t2m_mou_8', 'loc_ic_mou_6', 'loc_ic_mou_7', 'loc_ic_mou_8', 'std_ic_mou_6', 'std_ic_mou_7', 'std_ic_mou_8', 'total_ic_mou_6', 'total_ic_mou_7', 'total_ic_mou_8', 'total_rech_amt_6', 'total_rech_amt_7', 'total_rech_amt_8', 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'av_rech_amt_data_8', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8', 'sachet_2g_6', 'sachet_2g_7', 'sachet_2g_8', 'monthly_3g_6', 'monthly_3g_7', 'monthly_3g_8', 'sachet_3g_6', 'sachet_3g_7', 'sachet_3g_8', 'avg_rech_amt_7', 'avg_rech_amt_8', 'total_rech_num_data_6', 'total_rech_num_data_7', 'total_rech_num_data_8', 'total_month_rech_6', 'total_month_rech_7', 'total_month_rech_8', 'Total_revenue_6', 'Total_revenue_7', 'Total_revenue_8', 'Good_Phase', 'Action_Phase', 'Chrun_Phase']]
# Let's see the correlation matrix 
plt.figure(figsize = (20,20))        # Size of the figure
sns.heatmap(new_corrs.corr(),annot = True, fmt = ".2f", cmap = "GnBu")

### Inferences :
The total recharge amount for June (month 6) has a positive correlation of 0.9 with the 2G sachet data for June.

The total revenue for June, July and August has a high positive correlation of 0.96 with the total recharge amount.

The good phase is highly correlated with the total recharge amount (0.8), followed by the action phase (0.6) and the churn phase (0.5).


In [None]:
# Check the churn in a graphical way 

# Pass the column as a keyword; recent seaborn versions do not accept it positionally
C_IT = sns.catplot(x="Churn", data = Highvalu_users, aspect=1.5, kind="count", color="b")
C_IT.set_xticklabels(rotation=30)
plt.show()

### Inferences:
The churn rate for the high-value customers is only 8.09%.

In [None]:
Highvalu_users.head()

In [None]:
#check the list of all the columns. 
list(Highvalu_users.columns)

In [None]:
# The date columns and the mobile numbers are not needed, as they hold only unique values. Let us drop them and proceed.
Highvalu_users.drop(['mobile_number','last_date_of_month_6','last_date_of_month_7','last_date_of_month_8','date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8'],axis=1,inplace=True)

In [None]:
# Box plot of every variable except the target to inspect spread and outliers
cont_cols = [col for col in Highvalu_users.columns if col not in ['Churn']]
for col in cont_cols:
    plt.figure(figsize=(5, 5))
    sns.boxplot(y=col, data= Highvalu_users)
    plt.show()

### Inferences : 
We can see that there are a lot of outliers in the dataset.
Variables such as Good Phase, Action Phase, Churn Phase, the total revenue for each month, and all the other variables appear normally distributed with outliers.


In [None]:
# Let us check Churn as the target against the other variables
cont_cols = [col for col in Highvalu_users.columns if col not in ['Churn']]
for col in cont_cols:
    plt.figure(figsize=(5, 5))
    sns.barplot(x='Churn', y=col, data=Highvalu_users)
    plt.show()

### Inferences :

In the good phase, churn = 1 is higher compared to the action phase.

For the recharge days, churn is higher in June (_6) than in the other months.

However, the total revenue for June is higher than in the other phases.

The total recharge amount is lower in the churn phase than in June, July and August.

Roaming among churners is higher in June than in the other months.

3G data usage is lower in June than in the other months.


In [None]:
# Plot age on network (aon) against the other variables, except Churn.
cont_cols = [col for col in Highvalu_users.columns if col not in ['Churn', 'aon']]
for col in cont_cols:
    # jointplot creates its own figure, so size it with `height` instead of plt.figure
    sns.jointplot(x='aon',y = col,data= Highvalu_users, height=5)
    plt.show()

### Inferences : 
    
AON (age on network) is positively correlated with onnet/offnet minutes of usage and with roaming incoming/outgoing calls.

It is also positively correlated with 2G usage, calls from operator T to other operators' mobile networks, calls from operator T to fixed lines and its own call centres, and STD calls.

AON shows no correlation with 'outgoing call others'.

In [None]:
# Distribution of age on network (aon) with a rug plot.
# Plotting only the first 200 points since the rug takes a long while.
# histplot + rugplot are used because distplot is deprecated in recent seaborn releases.
aon_sample = Highvalu_users['aon'][:200]
sns.histplot(aon_sample, kde=True)
sns.rugplot(aon_sample)
plt.show()

In [None]:
### Inferences : The data is normally distributed with outliers with respect to age on network.

## 7. Model Building:

In [None]:
# Separate the features and the target, then split the data into train and test sets.
# Import the scikit-learn utilities needed for model building.
from sklearn.model_selection import train_test_split
import sklearn.preprocessing
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
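#putting feature variables in X
# Chrun_Phase is derived from month-9 revenue, so it is dropped along with the target to avoid leaking churn-phase data
X = Highvalu_users.drop(['Churn', 'Chrun_Phase'], axis=1)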

#putting response variables in Y
y = Highvalu_users['Churn']    

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100)

In [None]:
### Scaling before PCA 

In [None]:
#Rescaling the features before PCA as it is sensitive to the scales of the features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# fitting and transforming the scaler on train
X_train = scaler.fit_transform(X_train)
# transforming the test set using the already fitted scaler
X_test = scaler.transform(X_test)


## 8. Imbalanced Data Handling:

Before proceeding, note that the data is imbalanced, so we need to handle the class imbalance.

1. Standard classifier algorithms like Decision Tree and Logistic Regression have a bias towards classes which have a larger number of instances.

2. They tend to only predict the majority class data. The features of the minority class are treated as noise and are often ignored. Thus, there is a high probability of misclassification of the minority class as compared to the majority class.

### Informed Over Sampling: Synthetic Minority Over-sampling Technique

This technique is followed to avoid overfitting which occurs when exact replicas of minority instances are added to the main dataset. A subset of data is taken from the minority class as an example and then new synthetic similar instances are created. These synthetic instances are then added to the original dataset. The new dataset is used as a sample to train the classification models.

### Advantages

1. Mitigates the problem of overfitting caused by random oversampling as synthetic examples are generated rather than replication of instances
2. No loss of useful information
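
To illustrate how synthetic instances are created, here is a simplified SMOTE-style sketch. It is not the imbalanced-learn implementation. The function name `smote_like` and its parameters are hypothetical, and it only shows the core idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, idx = nn.kneighbors(minority)
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        base = rng.integers(len(minority))
        neighbour = idx[base, rng.integers(1, k + 1)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[row] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_rows = smote_like(minority, n_new=10)
print(new_rows.shape)
```

Each synthetic row lies on a line segment between two existing minority points, which is why the new points resemble the minority class without being exact replicas.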

In [None]:
# Check the sample before for 0 and 1 :
print("Before sample the counts of label '1': {}".format(sum(y_train==1)))
print("Before sample counts of label '0': {} \n".format(sum(y_train==0)))
print("Before sample churn event rate : {}% \n".format(round(sum(y_train==1)/len(y_train)*100,2)))

In [None]:
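# Import SMOTE and joblib to address the class imbalance.
# sklearn.externals.joblib was removed from scikit-learn, so joblib is imported directly.
import joblib
from imblearn.over_sampling import SMOTE
from scipy import sparse


# Recent imbalanced-learn releases use `sampling_strategy` and `fit_resample`
# in place of the older `ratio` and `fit_sample`.
sm = SMOTE(random_state=12, sampling_strategy=1.0)
X_train_re, y_train_re = sm.fit_resample(X_train, y_train)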

In [None]:
# Check the sample after for 0 and 1 :
print("After sample the counts of label '1': {}".format(sum(y_train_re==1)))
print("After sample counts of label '0': {} \n".format(sum(y_train_re==0)))
print("After sample churn event rate : {}% \n".format(round(sum(y_train_re==1)/len(y_train_re)*100,2)))

In [None]:
# The two classes are now balanced (50% churn). Let us also do the outlier analysis and proceed further.

## 7. Outlier Analysis:

In [None]:
# Let us check for outliers in the data before modelling, then treat them and proceed with the modelling.

prev_box = list(Highvalu_users.columns)
for i in Highvalu_users[prev_box]:
    plt.figure(1,figsize=(15,5))
    sns.boxplot(x=Highvalu_users[i])
    plt.xticks(rotation = 90,fontsize =10)
    plt.show()


## Inferences:
We can see that almost all the variables have outliers present. We need to treat them.

In [None]:
# Treat the outliers by capping them rather than dropping rows, so that we avoid losing data.
# The fences are built from the 5th and 90th percentiles (instead of the usual quartiles).
test_box_Df2 = list(Highvalu_users.columns) 
new_copy = Highvalu_users[test_box_Df2].copy()  # explicit copy, so the original frame stays untouched
for i in new_copy.columns:
    Q1 = new_copy[i].quantile(0.05)  # lower reference percentile (5th)
    Q3 = new_copy[i].quantile(0.90)  # upper reference percentile (90th)

    IQR = Q3 - Q1
    
    lower_fence = Q1 - 1.5*IQR
    upper_fence = Q3 + 1.5*IQR

    # Cap values outside the fences (avoids chained assignment)
    new_copy[i] = new_copy[i].clip(lower=lower_fence, upper=upper_fence)
    
    print("Outliers:",i,lower_fence,upper_fence)
    
    plt.figure(1,figsize=(10,5))
    sns.boxplot(x=new_copy[i])
    plt.xticks(rotation =90,fontsize =10)
    plt.show()
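
The capping rule used above can also be written as a small standalone function. The sketch below assumes NumPy only; the function name `cap_with_fences` and the toy column are hypothetical and serve just to show the arithmetic of the fences.

```python
import numpy as np

def cap_with_fences(values, low_q=0.05, high_q=0.90, k=1.5):
    """Clip values to [q_low - k*spread, q_high + k*spread]."""
    q_low, q_high = np.quantile(values, [low_q, high_q])
    spread = q_high - q_low
    lower_fence = q_low - k * spread
    upper_fence = q_high + k * spread
    return np.clip(values, lower_fence, upper_fence), lower_fence, upper_fence

# Toy column with one extreme value (not the project data).
usage = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])
capped, lo, hi = cap_with_fences(usage)
print(lo, hi)
print(capped)
```

Only the extreme value is pulled back to the upper fence; the other values are unchanged, so no rows are lost.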

In [None]:
## Inferences : We have treated the outliers and can now proceed to modelling. Wide percentile limits were used for the fences to avoid loss of data.

In [None]:
# We have stored the cleaned data in new_copy. Let us check the head again
new_copy.head()

In [None]:
# check the data shape again before modelling.
new_copy.shape

## 9. PCA 

In [None]:
# Let us do PCA to reduce the dimensionality of the data, then start with Logistic Regression and move on to more complex models

# Importing the PCA module
from sklearn.decomposition import PCA
pca = PCA(svd_solver='randomized', random_state=42)

In [None]:
# apply the PCA
pca.fit(X_train_re)

In [None]:
# List of PCA components. There are as many components as there are variables
pca.components_

In [None]:
#Let's check the variance ratios
pca.explained_variance_ratio_[:50]

In [None]:
# Import matplotlib and visualise the PCA variance ratios in a bar graph.
import matplotlib.pyplot as plt
plt.figure(figsize = (6,5))        # Size of the figure

plt.bar(range(1,len(pca.explained_variance_ratio_[:50])+1), pca.explained_variance_ratio_[:50])

In [None]:
# Most of the variance lies in the first couple of components. Let us look at the cumulative variance ratio.
var_cumu = np.cumsum(pca.explained_variance_ratio_)

In [None]:
# Most of the variance is captured by roughly the first 10 components.
# Make a clear scree plot for choosing the number of principal components
plt.figure(figsize=(8,6))
plt.title('Scree plot')
plt.xlabel('No. of Components')
plt.ylabel('Cumulative explained variance')

plt.plot(range(1,len(var_cumu)+1), var_cumu)
plt.show()

In [None]:
# It looks like about 35 components describe 95% of the variance in the dataset.
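
Instead of reading the number of components off the plot, it can also be computed from the cumulative variance. The sketch below uses hypothetical variance ratios (not the values from this notebook) to show the lookup; with the fitted `pca` object, `var_cumu` would take the place of the toy array.

```python
import numpy as np

# Hypothetical explained-variance ratios (they sum to 1); not the project's values.
explained_variance_ratio = np.array([0.50, 0.25, 0.12, 0.06, 0.04, 0.02, 0.01])

var_cumu = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative share reaches the target.
target = 0.95
n_components = int(np.searchsorted(var_cumu, target) + 1)
print(n_components, var_cumu[n_components - 1])
```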

In [None]:
#Using incremental PCA for efficiency - saves a lot of time on larger datasets
from sklearn.decomposition import IncrementalPCA
pca_final = IncrementalPCA(n_components=40)

In [None]:
# let us fit the data
X_train_pca = pca_final.fit_transform(X_train_re)
X_train_pca.shape

In [None]:
#creating correlation matrix for the principal components
corrmat = np.corrcoef(X_train_pca.transpose())
# 1s ----> 0s in diagonals
corrmat_nodiag = corrmat - np.diagflat(corrmat.diagonal())
print("max corr:",corrmat_nodiag.max(), ", min corr: ", corrmat_nodiag.min(),)
# we see that correlations are indeed very close to 0

In [None]:
# There is no correlation between any two of the principal components, so we do not have any multicollinearity.
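
Why the principal components come out uncorrelated can be checked on a small example. The sketch below is an assumption-laden illustration: it uses toy data and an exact PCA via SVD in NumPy, not the IncrementalPCA fitted above, whose correlations are only approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: the second column is strongly correlated with the first.
x = rng.normal(size=500)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=500)])

# PCA via SVD of the centred data.
centred = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt.T  # principal-component scores

corr_before = np.corrcoef(data.T)[0, 1]
corr_after = np.corrcoef(scores.T)[0, 1]
print(corr_before, corr_after)
```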

In [None]:
# Applying the same fitted components to the test data (pca_final was created with n_components=40)
X_test_pca = pca_final.transform(X_test)
X_test_pca.shape

To predict churning customers, we will fit a variety of models and select the one that is the best predictor of churn. The models trained are:

1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Boosting models - Gradient Boosting Classifier and XGBoost Classifier
5. SVM

## 10. Logistic Regression & PCA:

In [None]:
#Import needed libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # used inside model_fit below
from sklearn import metrics

lr = LogisticRegression(class_weight='balanced')
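
What `class_weight='balanced'` amounts to can be sketched with the documented formula `n_samples / (n_classes * np.bincount(y))`. The helper name `balanced_weights` and the label arrays are hypothetical; the point is that after 1:1 oversampling both classes end up with weight 1, so the option has little effect on the resampled data.

```python
import numpy as np

def balanced_weights(y):
    """Per-class weights n_samples / (n_classes * class_count)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

imbalanced = np.array([0] * 90 + [1] * 10)   # toy labels, 10% positives
resampled = np.array([0] * 90 + [1] * 90)    # toy labels after 1:1 oversampling

print(balanced_weights(imbalanced))  # minority class gets the larger weight
print(balanced_weights(resampled))   # both classes get weight 1
```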

In [None]:
#Fit the algorithm on the data
def model_fit(alg, X_train, y_train, performCV=True, cv_folds=5):
    alg.fit(X_train, y_train)
        
    #Predict training set:
    dtrain_predictions = alg.predict(X_train)
    dtrain_predprob = alg.predict_proba(X_train)[:,1]
    
    #Perform cross-validation:
    if performCV:
        cv_score = cross_val_score(alg, X_train, y_train, cv=cv_folds, scoring='roc_auc')
    
    #Print model report:
    print ("\nModel Summary:")
    print ("Accuracy : %.4g" % metrics.accuracy_score(y_train, dtrain_predictions))
    print ("Recall/Sensitivity : %.4g" % metrics.recall_score(y_train, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, dtrain_predprob))
    
    if performCV:
        print ("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score)))

In [None]:
# Let us fit the model. 
model_fit(lr, X_train_pca, y_train_re)

In [None]:
# Define a helper that prints the model metrics from actual and predicted labels.
def Model_metrics(Actual_churn=False,Predict_churn=False):

    confusion = metrics.confusion_matrix(Actual_churn, Predict_churn)

    TP = confusion[1,1] # true positive 
    TN = confusion[0,0] # true negatives
    FP = confusion[0,1] # false positives
    FN = confusion[1,0] # false negatives

    # Note: computed on hard 0/1 predictions, so this equals (sensitivity + specificity) / 2
    print("Roc_auc_score : {}".format(metrics.roc_auc_score(Actual_churn, Predict_churn)))
    # Sensitivity (recall) of the model
    print('Sensitivity/Recall : {}'.format(TP / float(TP+FN)))
    # Let us calculate specificity
    print('Specificity: {}'.format(TN / float(TN+FP)))
    # Calculate false positive rate - predicting churn when the customer has not churned
    print('False Positive Rate: {}'.format(FP/ float(TN+FP)))
    # positive predictive value 
    print('Positive predictive value: {}'.format(TP / float(TP+FP)))
    # Negative predictive value
    print('Negative Predictive value: {}'.format(TN / float(TN+ FN)))
    # sklearn precision score value 
    print('sklearn precision score value: {}'.format(metrics.precision_score(Actual_churn, Predict_churn)))
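
The formulas used in `Model_metrics` can be worked through by hand. The counts below are hypothetical (not results from this notebook). The last line shows why an ROC AUC computed on hard 0/1 predictions is simply the mean of sensitivity and specificity.

```python
# Hypothetical confusion-matrix counts (not results from this notebook).
TP, TN, FP, FN = 80, 850, 50, 20

sensitivity = TP / (TP + FN)          # recall: churners correctly flagged
specificity = TN / (TN + FP)          # non-churners correctly cleared
false_positive_rate = FP / (TN + FP)  # equals 1 - specificity
precision = TP / (TP + FP)            # positive predictive value

# With a single threshold, the ROC "curve" is two straight segments, and the
# area under it is the mean of sensitivity and specificity (balanced accuracy).
balanced_accuracy = (sensitivity + specificity) / 2

print(sensitivity, specificity, false_positive_rate, precision, balanced_accuracy)
```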

In [None]:
# predictions on Test data
pred_probs_test = lr.predict(X_test_pca)
Model_metrics(y_test,pred_probs_test)

In [None]:
# check the accuracy, recall and precision again,
print("Accuracy : {}".format(metrics.accuracy_score(y_test,pred_probs_test)))
print("Recall : {}".format(metrics.recall_score(y_test,pred_probs_test)))
print("Precision : {}".format(metrics.precision_score(y_test,pred_probs_test)))

In [None]:
# Predicted churn probabilities on the train data
pred_probs_train = lr.predict_proba(X_train_pca)[:,1]
print("roc_auc_score(Train) {:2.2}".format(metrics.roc_auc_score(y_train_re, pred_probs_train)))

In [None]:
# Function to find the optimal cutoff for classifying as churn/non-churn
def Optimal_Cutoff(df):
    # Create one column of 0/1 predictions for each probability cutoff (0.0 to 0.9)
    numbers = [float(x)/10 for x in range(10)]
    for i in numbers:
        df[i] = df.churn_Prob.map( lambda x: 1 if x > i else 0)
    # Now let's calculate accuracy, sensitivity and specificity for the various probability cutoffs.
    # Confusion matrix layout: [0,0] = TN, [0,1] = FP, [1,0] = FN, [1,1] = TP
    
    cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
    
    num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
    for i in num:
        cm1 = metrics.confusion_matrix(df.churn, df[i] )
        total1=sum(sum(cm1))
        accuracy = (cm1[0,0]+cm1[1,1])/total1
        
        speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
        sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
        cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
    print(cutoff_df)
    
    # Let's plot accuracy sensitivity and specificity for various probabilities.
    cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
    plt.show()

In [None]:
def predictChurnWithProb(model,X,y,prob):
    # Function to predict churn using the input probability cut-off
    # Input arguments: model instance, X and y to predict using the model, and the cut-off probability
    
    # predict
    pred_probs = model.predict_proba(X)[:,1]
    
    y_df= pd.DataFrame({'churn':y, 'churn_Prob':pred_probs})
    # Creating new column 'final_predicted' with 1 if churn_Prob > prob else 0
    y_df['final_predicted'] = y_df.churn_Prob.map( lambda x: 1 if x > prob else 0)
    # Print the model metrics for the chosen cut-off
    Model_metrics(y_df.churn,y_df.final_predicted)
    return y_df

In [None]:
cut_off_prob=0.5
y_train_df = predictChurnWithProb(lr,X_train_pca,y_train_re,cut_off_prob)
y_train_df.head()

##### Let us draw the ROC curve:

It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).

The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.

The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
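
How the points of an ROC curve arise can be sketched in plain Python. The helper `roc_points`, the labels and the probabilities are hypothetical. The sketch also shows that a curve built from already-thresholded 0/1 labels has only one interior point, so the probabilities are the more informative input.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

# Toy labels and predicted churn probabilities (not the project data).
actual = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

curve_from_probs = roc_points(actual, probs)
# Thresholding first (here at 0.5) leaves only 0/1 values, so the "curve"
# has a single interior point.
hard_labels = [1 if p > 0.5 else 0 for p in probs]
curve_from_labels = roc_points(actual, hard_labels)
print(len(curve_from_probs), len(curve_from_labels))
```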

In [None]:
# Let us define a function that draws the ROC curve
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(6, 6))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds

In [None]:
# Check the ROC curve (pass the predicted probabilities rather than the 0/1 labels, so the curve covers all thresholds)
draw_roc(y_train_df.churn, y_train_df.churn_Prob)

In [None]:
# The closer the ROC curve is to the top-left corner, the better the fit.

In [None]:

#draw_roc(y_pred_final.Churn, y_pred_final.predicted)
print("roc_auc_score : {:2.2f}".format(metrics.roc_auc_score(y_train_df.churn, y_train_df.final_predicted)))

##### Finding the optimal cutoff point:

A trade-off between sensitivity (or recall) and specificity has to be considered here. We need to tune the probability cutoff to get a better balance between the two.

In [None]:

# finding cut-off with the right balance of the metrices
# sensitivity vs specificity trade-off
Optimal_Cutoff(y_train_df)

In [None]:
# Note that the optimal cutoff lies between 0.5 and 0.6. We choose 0.52; at this point accuracy, sensitivity and specificity are balanced.
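
The idea of picking the cutoff where sensitivity and specificity meet can be sketched without pandas. Everything below is illustrative: `sens_spec` is a hypothetical helper, the data is made up, and "closest to each other" is one possible way to define balance.

```python
def sens_spec(actual, probs, cutoff):
    """Sensitivity and specificity when predicting churn for prob > cutoff."""
    predicted = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp / sum(actual), tn / (len(actual) - sum(actual))

def gap(cutoff):
    """Absolute difference between sensitivity and specificity at a cutoff."""
    sensitivity, specificity = sens_spec(actual, probs, cutoff)
    return abs(sensitivity - specificity)

# Toy labels and probabilities (not the project data).
actual = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

cutoffs = [i / 10 for i in range(10)]
# Pick the cutoff where sensitivity and specificity are closest to each other.
best_cutoff = min(cutoffs, key=gap)
print(best_cutoff, sens_spec(actual, probs, best_cutoff))
```

A finer grid (for example steps of 0.01) would allow a value such as 0.52 to be selected directly instead of being read off the plot.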


In [None]:
# predicting with the chosen cut-off on train
cut_off_prob = 0.52
A = predictChurnWithProb(lr,X_train_pca,y_train_re,cut_off_prob)
A.head()

In [None]:
# Let us predict on the test set with the chosen cut-off
B = predictChurnWithProb(lr,X_test_pca,y_test,cut_off_prob)
B.head()

### Conclusion on Logistic regression & PCA 

**A) Train Set :**

Roc_auc_score : 0.82

Sensitivity/Recall : 0.83

Specificity: 0.82

**B) Test Set :**

Roc_auc_score : 0.81

Sensitivity/Recall : 0.80

Specificity: 0.83



## 11. Decision Tree:

In [None]:
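# Let us apply a Decision Tree on the PCA data with some initial hyperparameters and fit it.
from sklearn.tree import DecisionTreeClassifier

Dt = DecisionTreeClassifier(class_weight='balanced',
                             max_features='sqrt',  # same behaviour as the older 'auto' option for classifiers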
                             min_samples_split=100,
                             min_samples_leaf=100,
                             max_depth=6,
                             random_state=10)
model_fit(Dt, X_train_pca, y_train_re)

In [None]:
# make predictions
pred_probs_test = Dt.predict(X_test_pca)
#Let's check the model metrices.
Model_metrics(Actual_churn=y_test,Predict_churn=pred_probs_test)

In [None]:

# Create the parameter grid based on the results of random search 
param_grid = {'max_depth': range(5,15,3),'min_samples_leaf': range(100, 400, 50),'min_samples_split': range(100, 400, 100),
    'max_features': [8,10,15]}
# Create a base model
dt = DecisionTreeClassifier(class_weight='balanced',random_state=10)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = dt, param_grid = param_grid, 
                          cv = 3, n_jobs = 4,verbose = 1,scoring="f1_weighted")

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train_pca, y_train_re)

In [None]:
# printing the best cross-validated score (weighted F1, as set in scoring) and the hyperparameters
print('Best weighted F1 score',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# model with the best hyperparameters
dt_final = DecisionTreeClassifier(class_weight='balanced',
                             max_depth=14,
                             min_samples_leaf=100, 
                             min_samples_split=100,
                             max_features=15,
                             random_state=10)

In [None]:
# fit the model and get the model summary:
model_fit(dt_final,X_train_pca,y_train_re)

In [None]:
# make predictions in the test data
pred_probs_test = dt_final.predict(X_test_pca)
#Let's check the model metrices.
Model_metrics(Actual_churn=y_test,Predict_churn=pred_probs_test)

In [None]:
# classification report
print(classification_report(y_test,pred_probs_test))

In [None]:
# Even after tuning, we capture only 71% of the churn.
# predicting churn with default cut-off 0.5
cut_off_prob = 0.5
y_train_df = predictChurnWithProb(dt_final,X_train_pca,y_train_re,cut_off_prob)
y_train_df.head()


In [None]:
# finding cut-off with the right balance of the metrices
Optimal_Cutoff(y_train_df)

In [None]:
# We can see that the cutoff value is between 0.5 and 0.6. Let us set the cutoff to 0.52 and run again.
cut_off_prob = 0.52
y_train_df = predictChurnWithProb(dt_final,X_train_pca,y_train_re,cut_off_prob)
y_train_df.head()

In [None]:
# At 0.52 we can see a balance between sensitivity, specificity and accuracy.


In [None]:

#Lets see how it performs on test data.
y_test_df= predictChurnWithProb(dt_final,X_test_pca,y_test,cut_off_prob)
y_test_df.head()

### Conclusions:

**A) Train data**

Accuracy score : 0.83

Sensitivity/Recall : 0.83

Specificity: 0.83

**B) Test data**

Accuracy score : 0.76

Sensitivity/Recall : 0.70

Specificity: 0.82



## 12. Random Forest:

In [None]:
# Again, let us apply Random Forest on the PCA data and tune max_depth
parameters = {'max_depth': range(10, 30, 5)}
rfc = RandomForestClassifier()
rfgs = GridSearchCV(rfc, parameters,n_jobs=-1,
                    cv=5, 
                   scoring="f1",
                   return_train_score=True)  # needed to read mean_train_score from cv_results_
rfgs.fit(X_train_pca,y_train_re)

In [None]:
scores = rfgs.cv_results_
# plotting mean train and cross-validation F1 scores against max_depth
plt.figure()
plt.plot(scores["param_max_depth"], 
         scores["mean_train_score"], 
         label="training score")
plt.plot(scores["param_max_depth"], 
         scores["mean_test_score"], 
         label="cross-validation score")
plt.xlabel("max_depth")
plt.ylabel("F1")
plt.legend()
plt.show()

In [None]:
# Beyond max_depth = 20, neither curve shows a significant change.

In [None]:
# Let us tune n_estimators
parameters = {'n_estimators': range(50, 150, 25)}
rf1 = RandomForestClassifier(max_depth=20,n_jobs=-1,random_state=10)
rfgs = GridSearchCV(rf1, parameters, 
                    cv=3, 
                   scoring="recall",
                   return_train_score=True)

In [None]:
# Again, fit the grid search

rfgs.fit(X_train_pca,y_train_re)

In [None]:
# plotting mean train and cross-validation scores against a tuned hyperparameter
def plt_traintest_acc(score,param):
    scores = score
    plt.figure()
    plt.plot(scores["param_"+param], 
    scores["mean_train_score"], 
    label="training score")
    plt.plot(scores["param_"+param], 
    scores["mean_test_score"], 
    label="cross-validation score")
    plt.xlabel(param)
    plt.ylabel("score")  # the metric is whatever 'scoring' was set to in the grid search
    plt.legend()
    plt.show()

In [None]:
# check the plot
plt_traintest_acc(rfgs.cv_results_,'n_estimators')

In [None]:
# The score is best between 70 and 80; let us take 80 as our n_estimators

In [None]:
# Tune max_features with n_estimators = 80
parameters = {'max_features': [4, 8, 14, 20, 24]}
rf3 = RandomForestClassifier(max_depth=20,n_estimators=80,n_jobs=-1,random_state=10)
rfgs = GridSearchCV(rf3, parameters,cv=5,scoring="f1",return_train_score=True)

In [None]:
# Let us fit again and check 
rfgs.fit(X_train_pca,y_train_re)

In [None]:
# check the acc. max features
plt_traintest_acc(rfgs.cv_results_,'max_features')

In [None]:
# The score peaks at around 7.5 max_features and declines afterwards, so we take 8.
# Let us tune min_samples_leaf
parameters = {'min_samples_leaf': range(100, 400, 50)}
rf4 = RandomForestClassifier(max_depth=20,n_estimators=80,max_features=8,n_jobs=-1,random_state=10)
rfgs = GridSearchCV(rf4, parameters,cv=3, scoring="f1",return_train_score=True)


In [None]:
# Let us fit again and plt the curve 
rfgs.fit(X_train_pca,y_train_re)
plt_traintest_acc(rfgs.cv_results_,'min_samples_leaf')

In [None]:
# The best min_samples_leaf is 100.
# Let us tune min_samples_split.
parameters = {'min_samples_split': range(50, 300, 50)}
rf5 = RandomForestClassifier(max_depth=20,n_estimators=80,max_features=8,n_jobs=-1,min_samples_leaf=100,random_state=10)
rfgs = GridSearchCV(rf5, parameters,cv=3,scoring="f1",return_train_score=True)

In [None]:
## Let us fit again and plt the curve 
rfgs.fit(X_train_pca,y_train_re)
plt_traintest_acc(rfgs.cv_results_,'min_samples_split')

In [None]:
# We take min_samples_split = 50; the curve is almost flat after that and declines at 200.
# Final model, tuned on all of the hyperparameters above

rf_final = RandomForestClassifier(max_depth=20,n_estimators=80,n_jobs=-1,max_features=8,min_samples_leaf=100,min_samples_split=50,random_state=10)

In [None]:
# check the train set.
model_fit(rf_final,X_train_pca,y_train_re)

In [None]:
# predict on test data
predictions = rf_final.predict(X_test_pca)

In [None]:
# Check the model metrics in test predictions
Model_metrics(y_test,predictions)

In [None]:
# After fine-tuning, the recall of the final Random Forest is only 73%, with 80% accuracy.

In [None]:
# Let us check the cutoff optimal value as we did for the other models
# predicting churn with default cut-off 0.5
cut_off_prob=0.5
y_train_df = predictChurnWithProb(rf_final,X_train_pca,y_train_re,cut_off_prob)
y_train_df.head()

In [None]:
# Let us see the cut-off with the metrics of accuracy, sensitvity and specificity
Optimal_Cutoff(y_train_df)

In [None]:
# We can see the optimal value is lying in 0.5 to 0.6. let us take as 0.52
cut_off_prob=0.52
A = predictChurnWithProb(rf_final,X_train_pca,y_train_re,cut_off_prob)
A.head()

In [None]:
# Make the prediction on the test:
y_test_df= predictChurnWithProb(rf_final,X_test_pca,y_test,cut_off_prob)
y_test_df.head()

### Conclusions:
Random Forest:

**A) Train set:**

Roc_auc_score : 0.85

Sensitivity/Recall : 0.84

Specificity: 0.87

**B) Test set:**

Roc_auc_score : 0.79

Sensitivity/Recall : 0.71

Specificity: 0.86

## 13.Boosting Models:

In [None]:
# Gradient boosting classifier with PCA and hyperparameter tuning
# Import the needed libraries
from sklearn.ensemble import GradientBoostingClassifier  

# Fitting the default GradientBoostingClassifier
Gb = GradientBoostingClassifier(random_state=10)
model_fit(Gb, X_train_pca, y_train_re)

In [None]:
# Let us tune the n_estimators
param_test_1 = {'n_estimators':range(20,150,10)}
gsearch_1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10), 
param_grid = param_test_1, scoring='f1',n_jobs=4, cv=3)
gsearch_1.fit(X_train_pca, y_train_re)

In [None]:
# Let us check the best n_estimators and its score first
print(gsearch_1.best_params_, gsearch_1.best_score_)

In [None]:
# check the score and n_estimators
print("The best n_estimators: {}".format(gsearch_1.best_params_))
print("The best score: {}".format(gsearch_1.best_score_))


In [None]:
# Let us use n_estimators = 140 and tune max_depth and min_samples_split. The learning rate is kept at 0.1.
param_test_2 = {'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}
gsearch_2 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=140, max_features='sqrt', subsample=0.8, random_state=10), 
param_grid = param_test_2, scoring='f1',n_jobs=4, cv=3)
gsearch_2.fit(X_train_pca, y_train_re)

In [None]:
# check the score and hyper tune parameters
print("The best max depth & min samples split: {}".format(gsearch_2.best_params_))
print("The best score: {}".format(gsearch_2.best_score_))


In [None]:
# Let us tune min_samples_leaf.
param_test_3 = {'min_samples_leaf':range(30,71,10)}
gsearch_3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1,n_estimators=140,max_depth = 15,min_samples_split =200, max_features='sqrt', subsample=0.8, random_state=10), 
param_grid = param_test_3, scoring='f1',n_jobs=4, cv=3)
gsearch_3.fit(X_train_pca, y_train_re)

In [None]:
# check again the score and hyper tune parameter min_samples_leaf
print("The best min_samples_leaf: {}".format(gsearch_3.best_params_))
print("The best score: {}".format(gsearch_3.best_score_))

In [None]:
# Next, tune max_features with another grid search.
param_test_4 = {'max_features':range(7,20,2)}
gsearch_4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1,n_estimators=140,max_depth=15, min_samples_split=200, min_samples_leaf=30, subsample=0.8, random_state=10),
param_grid = param_test_4, scoring='f1',n_jobs=4, cv=3)
gsearch_4.fit(X_train_pca, y_train_re)

In [None]:
# check again the score and hyper tune parameter 
print("The best max_features: {}".format(gsearch_4.best_params_))
print("The best score: {}".format(gsearch_4.best_score_))

In [None]:
# The Final Model for Gradient boosting with max_features =19 
Gb_final = GradientBoostingClassifier(learning_rate=0.1,n_estimators=140,max_features=19,max_depth=15, min_samples_split=200, min_samples_leaf=40, subsample=0.8, random_state=10)
model_fit(Gb_final, X_train_pca, y_train_re)

In [None]:
# predictions on Test data & check the metrics on test data
test_predict = Gb_final.predict(X_test_pca)
Model_metrics(y_test,test_predict)

In [None]:
# Let us predict churn with the default cutoff first and fine-tune it later.
cut_off_prob=0.5
y_train_df = predictChurnWithProb(Gb_final,X_train_pca,y_train_re,cut_off_prob)
y_train_df.head()

In [None]:
# let see the optimal cutoff:
Optimal_Cutoff(y_train_df)

In [None]:
# We can see that the optimal cutoff lies between 0.2 and 0.3. Let us take 0.2 and proceed.
cut_off_prob=0.2
A = predictChurnWithProb(Gb_final,X_train_pca,y_train_re,cut_off_prob)
A.head()

In [None]:
# Let us do predict in the test data:
y_test_df= predictChurnWithProb(Gb_final,X_test_pca,y_test,cut_off_prob)
y_test_df.head()

### Conclusions:

The model is clearly overfitting the training data: sensitivity and ROC AUC are close to 1.0 on the train set, while performance on the test set is markedly lower.

**A)Train set**

Roc_auc_score : 0.99

Sensitivity/Recall : 1.0

Specificity: 0.98

**B)Test set**

Roc_auc_score : 0.79

Sensitivity/Recall : 0.69

Specificity: 0.88

## Let us try XGBoost:

In [None]:
# Import the needed libraries
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
# Fitting the XGBClassifier
XGb = XGBClassifier(learning_rate =0.1,n_estimators=1000,max_depth=5,min_child_weight=1,gamma=0,subsample=0.8,colsample_bytree=0.8,
                    objective= 'binary:logistic',nthread=4,scale_pos_weight=1,seed=27)

In [None]:
# Model fit and performance on Train data
model_fit(XGb, X_train_pca, y_train_re)

In [None]:
# Let us tune the hyperparameters one by one
param_test_1 = {'max_depth':range(3,10,2),'min_child_weight':range(1,6,2)}
gsearch_1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
             min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
            param_grid = param_test_1, scoring='f1',n_jobs=4, cv=3)
gsearch_1.fit(X_train_pca, y_train_re)

In [None]:
# check again the score and hyper tune parameter 
print("The best max_depth & min_child_weight: {}".format(gsearch_1.best_params_))
print("The best score: {}".format(gsearch_1.best_score_))

In [None]:
# Let us tune the hyperparameters one by one
param_test_2 = {'gamma':[i/10.0 for i in range(0,5)]}
gsearch_2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=9,min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
             objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), param_grid = param_test_2, scoring='f1',n_jobs=4, cv=3)
gsearch_2.fit(X_train_pca, y_train_re)

In [None]:
# check again the score and hyper tune parameter 
print("The best gamma: {}".format(gsearch_2.best_params_))
print("The best score: {}".format(gsearch_2.best_score_))

In [None]:
# Final XGBoost model with gamma = 0.2; fit the model.
XGb = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=9,min_child_weight=1, gamma=0.2, subsample=0.8, colsample_bytree=0.8,
     objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27)
model_fit(XGb, X_train_pca, y_train_re)

In [None]:
# Predict on the test data and do the model metrics.
test_predict = XGb.predict(X_test_pca)
Model_metrics(y_test,test_predict)

In [None]:
# Take the default cutoff and check again
# predicting churn with default cut-off 0.5
cut_off_prob=0.5
y_train_df = predictChurnWithProb(XGb,X_train_pca,y_train_re,cut_off_prob)
y_train_df.head()

In [None]:
# Finding optimal cut-off probability
Optimal_Cutoff(y_train_df)

In [None]:
# the optimal cutoff point will be between 0.3 to 0.5, taken as 0.4
cut_off_prob=0.4
A = predictChurnWithProb(XGb,X_train_pca,y_train_re,cut_off_prob)
A.head()

In [None]:
# Let us predict in the test data
y_test_df= predictChurnWithProb(XGb,X_test_pca,y_test,cut_off_prob)
y_test_df.head()

### Conclusions:

**A) Train set**

Roc_auc_score : 0.99

Sensitivity/Recall : 0.99

Specificity: 0.98

**B) Test set**

Roc_auc_score : 0.77

Sensitivity/Recall : 0.63

Specificity: 0.92

## 14. SVM :

In [None]:
# Let us try an SVC model and check. Reference: https://www.geeksforgeeks.org/svm-hyperparameter-tuning-using-gridsearchcv-ml/
# note that we are using cost C=1
from sklearn.svm import SVC 

Sv = SVC(C = 1)

# fit
Sv.fit(X_train_pca, y_train_re)

# predict on train
S_predict = Sv.predict(X_train_pca)


In [None]:
# check the model metrics
Model_metrics(y_train_re,S_predict)

In [None]:
# Predict on test data
S_predict = Sv.predict(X_test_pca)
Model_metrics(y_test,S_predict)

In [None]:
# Let us do the hyperparameter tuning
# specify range of parameters (C) as a list
params = {"C": [0.1, 1, 10, 100, 1000]}

Svm= SVC()

# set up grid search and fit on the train set.

model_cv = GridSearchCV(estimator = Svm, param_grid = params, scoring= 'f1',cv = 5, verbose = 1,n_jobs=4,return_train_score=True) 
model_cv.fit(X_train_pca, y_train_re)

In [None]:
# check again the score and hyper tune parameter 
print("The best params: {}".format(model_cv.best_params_))
print("The best score: {}".format(model_cv.best_score_))

In [None]:
# use the C as 1000 and run again
Sv_final = SVC(C = 1000)

# fit
Sv_final.fit(X_train_pca, y_train_re)

# predict on train
S_predict = Sv_final.predict(X_train_pca)

In [None]:
# check the model metrics
Model_metrics(y_train_re,S_predict)

In [None]:
# Check in the test set
S_predict = Sv_final.predict(X_test_pca)
Model_metrics(y_test,S_predict)

### Conclusions:

**A) Train set:**

Roc_auc_score : 0.85

Sensitivity/Recall : 0.84

Specificity: 0.86

**B) Test set:**

Roc_auc_score : 0.81

Sensitivity/Recall : 0.75

Specificity: 0.86

## 15. Conclusions:

All the models were trained on the same data set. The best model is selected based on Recall on the test data.

For this churn case in particular, Recall/Sensitivity is the most important metric, and the telecom company must work on retaining these customers rather than acquiring new ones.

Acquisition is more costly because of the effort needed to convince customers, advertisements, etc., so it is better to retain customers and improve the services related to the important variables, to avoid the risk of a decline in the company's revenue.

Summary of the Models:

| Model | Train Roc_auc_score | Train Sensitivity/Recall | Train Specificity | Test Roc_auc_score | Test Sensitivity/Recall | Test Specificity |
|---|---|---|---|---|---|---|
| 1. Logistic Regression with PCA | 0.82 | **0.83** | 0.82 | 0.81 | **0.80** | 0.83 |
| 2. Decision Tree (Accuracy score reported in place of Roc_auc_score) | 0.83 | **0.83** | 0.83 | 0.76 | **0.70** | 0.82 |
| 3. Random Forest | 0.85 | **0.84** | 0.87 | 0.79 | **0.71** | 0.86 |
| 4. Boosting Models | 0.99 | **0.99** | 0.98 | 0.77 | **0.63** | 0.92 |
| 5. SVC | 0.85 | **0.84** | 0.86 | 0.81 | **0.75** | 0.86 |



We suggest Logistic Regression with PCA for prediction, as its test Recall/Sensitivity of 0.80 is the highest among the models. SVC is acceptable, with a test recall of 0.75. The boosting models overfit the data.
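
The selection rule (highest test recall) can be expressed in a few lines. The figures below are the test-set numbers from the summary above; the dictionary layout and variable names are only an illustrative assumption.

```python
# Test-set figures copied from the summary above:
# (Roc_auc_score or accuracy score, recall, specificity).
test_results = {
    "Logistic Regression with PCA": (0.81, 0.80, 0.83),
    "Decision Tree": (0.76, 0.70, 0.82),
    "Random Forest": (0.79, 0.71, 0.86),
    "Boosting Models": (0.77, 0.63, 0.92),
    "SVC": (0.81, 0.75, 0.86),
}

# Sort by recall (index 1), highest first.
ranking = sorted(test_results.items(), key=lambda item: item[1][1], reverse=True)
for name, (score, recall, specificity) in ranking:
    print(f"{name}: recall={recall:.2f}, specificity={specificity:.2f}")
```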

In [None]:
# Let us see the top variables which contribute most to churn.

In [None]:
# A Random Forest on the original (non-PCA) features is used to derive the main indicators of churn

# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [8,10,12],
    'min_samples_leaf': range(100, 400, 200),
    'min_samples_split': range(200, 500, 200),
    'n_estimators': [100,200, 300], 
    'max_features': [12, 15, 20]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = 4,verbose = 1)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train_re, y_train_re)

In [None]:
# printing the optimal accuracy score and hyperparameters
print('Accuracy is',grid_search.best_score_,'using',grid_search.best_params_)

In [None]:
# Apply the Random Forest classifier with the tuned hyperparameters and fit the model
rf = RandomForestClassifier(max_depth=12,max_features=20,min_samples_leaf=100,min_samples_split=200,n_estimators=300,random_state=10)
rf.fit(X_train_re, y_train_re)

In [None]:
# Plot the bar graph
plt.figure(figsize=(10,40))
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(len(X.columns)).sort_values().plot(kind='barh', align='center')
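
To list only the strongest indicators instead of plotting every feature, the importances can be sorted and truncated. The sketch below uses hypothetical feature names and importance values, not the output of the model above.

```python
# Hypothetical feature names and importances (not the notebook's output).
importances = {
    "feature_a": 0.05,
    "feature_b": 0.40,
    "feature_c": 0.25,
    "feature_d": 0.30,
}

top_n = 2  # the notebook looks at the top 25; 2 keeps the toy example short
top_features = sorted(importances, key=importances.get, reverse=True)[:top_n]
print(top_features)
```

With the pandas Series built in the cell above, `feat_importances.nlargest(25)` performs the same kind of selection.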

#### We can see the top 25 variables that the telecom company must concentrate on.

In order to improve the service, the company needs to work on these variables to be more competitive in the market, which will help improve the revenue of the organisation.

Also, a decrease in these variables will drastically affect revenue, and customers will be lost if the necessary steps are not taken.

Launch programs to retain existing customers rather than acquiring new ones, as it is more costly to acquire new customers than to retain existing ones.