In [173]:
import jovian
if True:
    pass
if False:
    jovian.commit(project='telecom-churn', filename='telecom_churn.ipynb', files=['telecom_churn_data.csv'])
    pass

<IPython.core.display.Javascript object>

[jovian] Updating notebook "kavurisrikanth/telecom-churn" on https://jovian.ai/[0m
[jovian] Uploading additional files...[0m
[jovian] Committed successfully! https://jovian.ai/kavurisrikanth/telecom-churn[0m


## Telecom Case Study

#### Definitions
* Usage-based churn - Customers who have not had any usage in a 2 month period
* High-value customers - Customers who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

### Data
* 4 months of data available - June (6), July (7), August (8), September (9)
* Predict churn for last month (9th) using data from past 3 months

### Tagging churn
* Tag churn based on the last month (churn period). Customers who have no usage in this month have churned.
    * Customers who have had a "bad" action phase - ???
    * How do we define "bad" action phase?
* **NOTE: Once churn is tagged, remove all attributes corresponding to the last month.**

Steps
1. describe(), info()
2. aggregate column names by column type
    1. Gather counts of columns in each list above
3. Print the percentage of missing values in each column - isnull().sum() / cnt * 100
4. Drop colums that have >= 70% values missing
5. Missing value imputations - ?
    1. Value that makes sense for continuous columns
    2. -1 (or equivalent) for categorica columns
6. Drop all date columns - ?
7. Calculate amount of recharge for months 6 and 7 - Done
8. Identify high-value customers using above step - Done
9. calculate total incoming & outgoing mou for 9th month
10. calculate total 2g & 3g data consumption for 9th month
11. Mark churn using the 2 above values - Done
12. EDA
    1. Univariate - distplot, countplot
    2. Bivariate - regplot
13. calculate diff col (8th - (6th + 7th)/2)
14. Divide x & y
15. Train-test split
    * stratify = yes - ???
16. Scaling
17. PCA
18. Logistic Regression
    * class_weight = balanced - ???
19. Stratified k-fold
    * n_splits=5, shuffle=True, random_state=4
20. GridSearchCV with RandomForest
21. Class imbalance
    * SMOTE - ???

## 1. Read data

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = pd.read_csv('telecom_churn_data.csv')

In [4]:
data.head()

Unnamed: 0,mobile_number,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,last_date_of_month_9,arpu_6,...,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g
0,7000842753,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,197.385,...,0,1.0,1.0,1.0,,968,30.4,0.0,101.2,3.58
1,7001865778,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,34.047,...,0,,1.0,1.0,,1006,0.0,0.0,0.0,0.0
2,7001625959,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,167.69,...,0,,,,1.0,1103,0.0,0.0,4.17,0.0
3,7001204172,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,221.338,...,0,,,,,2491,0.0,0.0,0.0,0.0
4,7000142493,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,261.636,...,0,0.0,,,,1526,0.0,0.0,0.0,0.0


## 2. Data preprocessing

The following could be done:
* Check data formats
* Handle missing values
* Check for outliers?

### Checking data formats

In [5]:
data.mobile_number.is_unique

True

#### mobile_number could be made the index, since it is unique

In [6]:
data.set_index('mobile_number', inplace=True)

In [7]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99999 entries, 7000842753 to 7001905007
Data columns (total 225 columns):
 #    Column                    Dtype  
---   ------                    -----  
 0    circle_id                 int64  
 1    loc_og_t2o_mou            float64
 2    std_og_t2o_mou            float64
 3    loc_ic_t2o_mou            float64
 4    last_date_of_month_6      object 
 5    last_date_of_month_7      object 
 6    last_date_of_month_8      object 
 7    last_date_of_month_9      object 
 8    arpu_6                    float64
 9    arpu_7                    float64
 10   arpu_8                    float64
 11   arpu_9                    float64
 12   onnet_mou_6               float64
 13   onnet_mou_7               float64
 14   onnet_mou_8               float64
 15   onnet_mou_9               float64
 16   offnet_mou_6              float64
 17   offnet_mou_7              float64
 18   offnet_mou_8              float64
 19   offnet_mou_9              floa

All date columns are of type "object". So check those.

### Checking date columns

In [8]:
data.last_date_of_month_6.value_counts()

6/30/2014    99999
Name: last_date_of_month_6, dtype: int64

In [9]:
data.last_date_of_month_7.value_counts()

7/31/2014    99398
Name: last_date_of_month_7, dtype: int64

In [10]:
data.last_date_of_month_8.value_counts()

8/31/2014    98899
Name: last_date_of_month_8, dtype: int64

In [11]:
data.last_date_of_month_9.value_counts()

9/30/2014    98340
Name: last_date_of_month_9, dtype: int64

Some values are missing, but all the remaining values are just showing what the last date of the month is.

This can be inferred from the month itself, so the columns can be dropped.

In [12]:
last_date_cols = ['last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8', 'last_date_of_month_9']

In [13]:
data.shape

(99999, 225)

In [14]:
data.drop(last_date_cols, axis=1, inplace=True)

In [15]:
data.shape

(99999, 221)

#### All other date cols don't need month and year, since they are inferrable.

In [16]:
cols = data.columns

In [17]:
cols

Index(['circle_id', 'loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou',
       'arpu_6', 'arpu_7', 'arpu_8', 'arpu_9', 'onnet_mou_6', 'onnet_mou_7',
       ...
       'sachet_3g_9', 'fb_user_6', 'fb_user_7', 'fb_user_8', 'fb_user_9',
       'aon', 'aug_vbc_3g', 'jul_vbc_3g', 'jun_vbc_3g', 'sep_vbc_3g'],
      dtype='object', length=221)

In [18]:
date_cols = [x for x in cols if 'date' in x]

In [19]:
date_cols

['date_of_last_rech_6',
 'date_of_last_rech_7',
 'date_of_last_rech_8',
 'date_of_last_rech_9',
 'date_of_last_rech_data_6',
 'date_of_last_rech_data_7',
 'date_of_last_rech_data_8',
 'date_of_last_rech_data_9']

In [20]:
other_date_cols = [x for x in date_cols if x not in last_date_cols]

In [21]:
other_date_cols

['date_of_last_rech_6',
 'date_of_last_rech_7',
 'date_of_last_rech_8',
 'date_of_last_rech_9',
 'date_of_last_rech_data_6',
 'date_of_last_rech_data_7',
 'date_of_last_rech_data_8',
 'date_of_last_rech_data_9']

In [22]:
data[other_date_cols].head()

Unnamed: 0_level_0,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,date_of_last_rech_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
7000842753,6/21/2014,7/16/2014,8/8/2014,9/28/2014,6/21/2014,7/16/2014,8/8/2014,
7001865778,6/29/2014,7/31/2014,8/28/2014,9/30/2014,,7/25/2014,8/10/2014,
7001625959,6/17/2014,7/24/2014,8/14/2014,9/29/2014,,,,9/17/2014
7001204172,6/28/2014,7/31/2014,8/31/2014,9/30/2014,,,,
7000142493,6/26/2014,7/28/2014,8/9/2014,9/28/2014,6/4/2014,,,


In [23]:
for col in other_date_cols:
    data.loc[~data[col].isna(), col] = pd.to_datetime(data[~data[col].isna()][col]).dt.day

In [24]:
data[other_date_cols].head()

Unnamed: 0_level_0,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,date_of_last_rech_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
7000842753,21,16,8,28,21.0,16.0,8.0,
7001865778,29,31,28,30,,25.0,10.0,
7001625959,17,24,14,29,,,,17.0
7001204172,28,31,31,30,,,,
7000142493,26,28,9,28,4.0,,,


No other obvious wrong types.

Before checking missing values, we must add a few helper columns.

In [25]:
data['total_rech_amt_data_6'] = data['total_rech_data_6'] * data['av_rech_amt_data_6']
data['total_rech_amt_data_7'] = data['total_rech_data_7'] * data['av_rech_amt_data_7']

total_rech_amt_good_period = data['total_rech_amt_data_6'].fillna(0) + data['total_rech_amt_data_7'].fillna(0) + data['total_rech_amt_6'].fillna(0) + data['total_rech_amt_7'].fillna(0)

In [26]:
helper_cols = ['total_rech_amt_data_6', 'total_rech_amt_data_7']

## Check missing values

In [27]:
def get_missing_data(cols):
    missing_dict = {}
    for x in cols:
        missing_dict[x] = [data[x].isna().sum() * 100.0 / data.shape[0]]
    # print(missing_dict)
    missing_data = pd.DataFrame.from_dict(missing_dict, orient='index').reset_index()
    missing_data.columns = ['col', 'num_missing']
    missing_data.sort_values(by='num_missing', ascending=False, inplace=True)
    missing_data = missing_data[missing_data['num_missing'] != 0]
    return missing_data

In [28]:
missing_data = get_missing_data(data.columns)

In [29]:
missing_data.head(10)

Unnamed: 0,col,num_missing
160,max_rech_data_6,74.846748
168,count_rech_3g_6,74.846748
221,total_rech_amt_data_6,74.846748
152,date_of_last_rech_data_6,74.846748
172,av_rech_amt_data_6,74.846748
156,total_rech_data_6,74.846748
212,fb_user_6,74.846748
188,arpu_2g_6,74.846748
184,arpu_3g_6,74.846748
192,night_pck_user_6,74.846748


In [30]:
missing_data.shape

(165, 2)

In [31]:
missing_data_over_70 = missing_data[missing_data['num_missing'] > 70]

In [32]:
missing_data_over_70.shape

(42, 2)

In [33]:
missing_data_over_70_cols = list(missing_data_over_70.col.values)

Drop columns with more than 70% values missing.

In [34]:
data.drop(missing_data_over_70_cols, axis=1, inplace=True)

In [35]:
data.shape

(99999, 181)

In [36]:
missing_data = missing_data[~missing_data['col'].isin(missing_data_over_70_cols)]

In [37]:
missing_data.head()

Unnamed: 0,col,num_missing
95,loc_ic_mou_9,7.745077
67,isd_og_mou_9,7.745077
71,spl_og_mou_9,7.745077
75,og_others_9,7.745077
83,loc_ic_t2t_mou_9,7.745077


In [38]:
missing_data.shape

(123, 2)

## Impute missing values.

To impute missing values, collect columns and decide the value to impute.

### Collecting columns

In [39]:
id_cols = ['mobile_number', 'circle_id']

In [40]:
# date_cols = ['last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8', 'last_date_of_month_9', 'date_of_last_rech_6', 'date_of_last_rech_7', 'date_of_last_rech_8', 'date_of_last_rech_9', 'date_of_last_rech_data_6', 'date_of_last_rech_data_7', 'date_of_last_rech_data_8', 'date_of_last_rech_data_9']

In [41]:
date_cols = [x for x in data.columns if 'date' in x]

In [42]:
int_cols = ['total_rech_num_6', 'total_rech_num_7', 'total_rech_num_8', 'total_rech_num_9', 'total_rech_amt_6', 'total_rech_amt_7', 'total_rech_amt_8', 'total_rech_amt_9', 'max_rech_amt_6', 'max_rech_amt_7', 'max_rech_amt_8', 'max_rech_amt_9', 'last_day_rch_amt_6', 'last_day_rch_amt_7', 'last_day_rch_amt_8', 'last_day_rch_amt_9', 'monthly_2g_6', 'monthly_2g_7', 'monthly_2g_8', 'monthly_2g_9', 'sachet_2g_6', 'sachet_2g_7', 'sachet_2g_8', 'sachet_2g_9', 'monthly_3g_6', 'monthly_3g_7', 'monthly_3g_8', 'monthly_3g_9', 'sachet_3g_6', 'sachet_3g_7', 'sachet_3g_8', 'sachet_3g_9', 'aon']

In [43]:
float_cols = ['arpu_6', 'arpu_7', 'arpu_8', 'arpu_9', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_3g_9', 'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8', 'arpu_2g_9', 'loc_og_t2o_mou', 'loc_og_t2t_mou_6', 'loc_og_t2t_mou_7', 'loc_og_t2t_mou_8', 'loc_og_t2t_mou_9', 'loc_og_t2m_mou_6', 'loc_og_t2m_mou_7', 'loc_og_t2m_mou_8', 'loc_og_t2m_mou_9', 'loc_og_t2f_mou_6', 'loc_og_t2f_mou_7', 'loc_og_t2f_mou_8', 'loc_og_t2f_mou_9', 'loc_og_t2c_mou_6', 'loc_og_t2c_mou_7', 'loc_og_t2c_mou_8', 'loc_og_t2c_mou_9', 'loc_og_mou_6', 'loc_og_mou_7', 'loc_og_mou_8', 'loc_og_mou_9', 'roam_og_mou_6', 'roam_og_mou_7', 'roam_og_mou_8', 'roam_og_mou_9', 'std_og_t2o_mou',  'std_og_t2t_mou_6', 'std_og_t2t_mou_7', 'std_og_t2t_mou_8', 'std_og_t2t_mou_9', 'std_og_t2m_mou_6', 'std_og_t2m_mou_7', 'std_og_t2m_mou_8', 'std_og_t2m_mou_9', 'std_og_t2f_mou_6', 'std_og_t2f_mou_7', 'std_og_t2f_mou_8', 'std_og_t2f_mou_9', 'std_og_t2c_mou_6', 'std_og_t2c_mou_7', 'std_og_t2c_mou_8', 'std_og_t2c_mou_9', 'std_og_mou_6', 'std_og_mou_7', 'std_og_mou_8', 'std_og_mou_9', 'isd_og_mou_6', 'isd_og_mou_7', 'isd_og_mou_8', 'isd_og_mou_9', 'spl_og_mou_6', 'spl_og_mou_7', 'spl_og_mou_8', 'spl_og_mou_9', 'og_others_6', 'og_others_7', 'og_others_8', 'og_others_9', 'total_og_mou_6', 'total_og_mou_7', 'total_og_mou_8', 'total_og_mou_9', 'loc_ic_t2o_mou', 'loc_ic_t2t_mou_6', 'loc_ic_t2t_mou_7', 'loc_ic_t2t_mou_8', 'loc_ic_t2t_mou_9', 'loc_ic_t2m_mou_6', 'loc_ic_t2m_mou_7', 'loc_ic_t2m_mou_8', 'loc_ic_t2m_mou_9', 'loc_ic_t2f_mou_6', 'loc_ic_t2f_mou_7', 'loc_ic_t2f_mou_8', 'loc_ic_t2f_mou_9', 'loc_ic_mou_6', 'loc_ic_mou_7', 'loc_ic_mou_8', 'loc_ic_mou_9', 'roam_ic_mou_6', 'roam_ic_mou_7', 'roam_ic_mou_8', 'roam_ic_mou_9', 'std_ic_t2t_mou_6', 'std_ic_t2t_mou_7', 'std_ic_t2t_mou_8', 'std_ic_t2t_mou_9', 'std_ic_t2m_mou_6', 'std_ic_t2m_mou_7', 'std_ic_t2m_mou_8', 'std_ic_t2m_mou_9', 'std_ic_t2f_mou_6', 'std_ic_t2f_mou_7', 'std_ic_t2f_mou_8', 'std_ic_t2f_mou_9', 'std_ic_t2o_mou_6', 'std_ic_t2o_mou_7', 'std_ic_t2o_mou_8', 'std_ic_t2o_mou_9', 'std_ic_mou_6', 'std_ic_mou_7', 'std_ic_mou_8', 'std_ic_mou_9', 'total_ic_mou_6', 'total_ic_mou_7', 'total_ic_mou_8', 'total_ic_mou_9', 'spl_ic_mou_6', 'spl_ic_mou_7', 'spl_ic_mou_8', 'spl_ic_mou_9', 'isd_ic_mou_6', 'isd_ic_mou_7', 'isd_ic_mou_8', 'isd_ic_mou_9', 'ic_others_6', 'ic_others_7', 'ic_others_8', 'ic_others_9', 'onnet_mou_6', 'onnet_mou_7', 'onnet_mou_8', 'onnet_mou_9', 'offnet_mou_6', 'offnet_mou_7', 'offnet_mou_8', 'offnet_mou_9', 'total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 'total_rech_data_9', 'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'max_rech_data_9', 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'count_rech_2g_9', 'count_rech_3g_6', 'count_rech_3g_7', 'count_rech_3g_8', 'count_rech_3g_9', 'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'av_rech_amt_data_9', 'vol_2g_mb_6', 'vol_2g_mb_7', 'vol_2g_mb_8', 'vol_2g_mb_9', 'vol_3g_mb_6', 'vol_3g_mb_7', 'vol_3g_mb_8', 'vol_3g_mb_9', 'night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8', 'night_pck_user_9', 'fb_user_6', 'fb_user_7', 'fb_user_8', 'fb_user_9', 'aug_vbc_3g', 'jul_vbc_3g', 'jun_vbc_3g', 'sep_vbc_3g']

In [44]:
data['std_og_t2c_mou_6'].value_counts()

0.0    96062
Name: std_og_t2c_mou_6, dtype: int64

In [45]:
data['std_og_t2c_mou_7'].value_counts()

0.0    96140
Name: std_og_t2c_mou_7, dtype: int64

In [46]:
data['std_og_t2c_mou_8'].value_counts()

0.0    94621
Name: std_og_t2c_mou_8, dtype: int64

In [47]:
data['std_og_t2c_mou_9'].value_counts()

0.0    92254
Name: std_og_t2c_mou_9, dtype: int64

In [48]:
data['std_ic_t2o_mou_6'].value_counts()

0.0    96062
Name: std_ic_t2o_mou_6, dtype: int64

In [49]:
data['std_ic_t2o_mou_7'].value_counts()

0.0    96140
Name: std_ic_t2o_mou_7, dtype: int64

In [50]:
data['std_ic_t2o_mou_8'].value_counts()

0.0    94621
Name: std_ic_t2o_mou_8, dtype: int64

In [51]:
data['std_ic_t2o_mou_9'].value_counts()

0.0    92254
Name: std_ic_t2o_mou_9, dtype: int64

In [52]:
possible_categs = ['monthly_2g_6', 'monthly_2g_7', 'monthly_2g_8', 'monthly_2g_9', 'monthly_3g_6', 'monthly_3g_7', 'monthly_3g_8', 'monthly_3g_9', 'sachet_2g_6', 'sachet_2g_7', 'sachet_2g_8', 'sachet_2g_9', 'sachet_3g_6', 'sachet_3g_7', 'sachet_3g_8', 'sachet_3g_9']

In [53]:
numeric_cols = [x for x in data.columns if x not in id_cols and x not in date_cols and x not in possible_categs]

In [54]:
len(numeric_cols)

160

In [55]:
missing_data = get_missing_data(numeric_cols)

In [56]:
not_mou_cols = list(missing_data.col)

In [57]:
not_mou_cols = [x for x in not_mou_cols if 'mou' not in x]

In [58]:
not_mou_cols

['og_others_9',
 'ic_others_9',
 'ic_others_8',
 'og_others_8',
 'og_others_6',
 'ic_others_6',
 'ic_others_7',
 'og_others_7']

In [59]:
not_mou = missing_data[missing_data['col'].isin(not_mou_cols)]

In [60]:
not_mou

Unnamed: 0,col,num_missing
74,og_others_9,7.745077
130,ic_others_9,7.745077
129,ic_others_8,5.378054
73,og_others_8,5.378054
71,og_others_6,3.937039
127,ic_others_6,3.937039
128,ic_others_7,3.859039
72,og_others_7,3.859039


In [61]:
for x in not_mou.col:
    print(x)
    print(data[x].value_counts())
    print('')

og_others_9
0.00      91832
0.16         17
0.18         11
0.66          8
0.98          7
          ...  
67.38         1
9.20          1
145.16        1
15.06         1
0.31          1
Name: og_others_9, Length: 235, dtype: int64

ic_others_9
0.00     72018
0.06       566
0.10       518
0.08       495
0.13       364
         ...  
27.48        1
25.88        1
22.24        1
11.10        1
20.15        1
Name: ic_others_9, Length: 1923, dtype: int64

ic_others_8
0.00      72892
0.10        831
0.06        771
0.08        676
0.13        486
          ...  
160.68        1
135.06        1
14.59         1
11.54         1
23.41         1
Name: ic_others_8, Length: 1896, dtype: int64

og_others_8
0.00     94210
0.16        23
0.01        13
0.03        11
0.11         9
         ...  
84.91        1
10.59        1
43.51        1
4.36         1
0.96         1
Name: og_others_8, Length: 216, dtype: int64

og_others_6
0.00      79128
0.21        584
0.43        218
0.20        152
0.65    

Missing values in all numeric columns can be imputed with 0.

In [62]:
for x in missing_data.col:
    data[x] = data[x].fillna(0)

Impute any missing values for categorical columns

In [63]:
missing_data = get_missing_data(data.columns)
missing_data

Unnamed: 0,col,num_missing
147,date_of_last_rech_9,4.760048
146,date_of_last_rech_8,3.622036
145,date_of_last_rech_7,1.767018
144,date_of_last_rech_6,1.607016


Since the columns are only dates, value of 0 can be used to represent missing value.

In [64]:
for x in missing_data.col:
    data[x] = data[x].fillna(0)

In [65]:
missing_data = get_missing_data(data.columns)
missing_data

Unnamed: 0,col,num_missing


## Data imbalance

## Check outliers - ?

## Deriving new features

In [66]:
data.loc_og_t2o_mou.describe()

count    99999.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
Name: loc_og_t2o_mou, dtype: float64

In [67]:
data.std_og_t2o_mou.describe()

count    99999.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
Name: std_og_t2o_mou, dtype: float64

In [68]:
data.loc_ic_t2o_mou.describe()

count    99999.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
Name: loc_ic_t2o_mou, dtype: float64

In [69]:
cols

Index(['circle_id', 'loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou',
       'arpu_6', 'arpu_7', 'arpu_8', 'arpu_9', 'onnet_mou_6', 'onnet_mou_7',
       ...
       'sachet_3g_9', 'fb_user_6', 'fb_user_7', 'fb_user_8', 'fb_user_9',
       'aon', 'aug_vbc_3g', 'jul_vbc_3g', 'jun_vbc_3g', 'sep_vbc_3g'],
      dtype='object', length=221)

In [70]:
incoming_cols = [x for x in cols if 'ic' in x]

In [71]:
incoming_cols

['loc_ic_t2o_mou',
 'roam_ic_mou_6',
 'roam_ic_mou_7',
 'roam_ic_mou_8',
 'roam_ic_mou_9',
 'loc_ic_t2t_mou_6',
 'loc_ic_t2t_mou_7',
 'loc_ic_t2t_mou_8',
 'loc_ic_t2t_mou_9',
 'loc_ic_t2m_mou_6',
 'loc_ic_t2m_mou_7',
 'loc_ic_t2m_mou_8',
 'loc_ic_t2m_mou_9',
 'loc_ic_t2f_mou_6',
 'loc_ic_t2f_mou_7',
 'loc_ic_t2f_mou_8',
 'loc_ic_t2f_mou_9',
 'loc_ic_mou_6',
 'loc_ic_mou_7',
 'loc_ic_mou_8',
 'loc_ic_mou_9',
 'std_ic_t2t_mou_6',
 'std_ic_t2t_mou_7',
 'std_ic_t2t_mou_8',
 'std_ic_t2t_mou_9',
 'std_ic_t2m_mou_6',
 'std_ic_t2m_mou_7',
 'std_ic_t2m_mou_8',
 'std_ic_t2m_mou_9',
 'std_ic_t2f_mou_6',
 'std_ic_t2f_mou_7',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_9',
 'std_ic_t2o_mou_6',
 'std_ic_t2o_mou_7',
 'std_ic_t2o_mou_8',
 'std_ic_t2o_mou_9',
 'std_ic_mou_6',
 'std_ic_mou_7',
 'std_ic_mou_8',
 'std_ic_mou_9',
 'total_ic_mou_6',
 'total_ic_mou_7',
 'total_ic_mou_8',
 'total_ic_mou_9',
 'spl_ic_mou_6',
 'spl_ic_mou_7',
 'spl_ic_mou_8',
 'spl_ic_mou_9',
 'isd_ic_mou_6',
 'isd_ic_mou_7',
 'isd_i

In [72]:
local_incoming_cols = [x for x in incoming_cols if 'loc' in x]

In [73]:
local_incoming_cols

['loc_ic_t2o_mou',
 'loc_ic_t2t_mou_6',
 'loc_ic_t2t_mou_7',
 'loc_ic_t2t_mou_8',
 'loc_ic_t2t_mou_9',
 'loc_ic_t2m_mou_6',
 'loc_ic_t2m_mou_7',
 'loc_ic_t2m_mou_8',
 'loc_ic_t2m_mou_9',
 'loc_ic_t2f_mou_6',
 'loc_ic_t2f_mou_7',
 'loc_ic_t2f_mou_8',
 'loc_ic_t2f_mou_9',
 'loc_ic_mou_6',
 'loc_ic_mou_7',
 'loc_ic_mou_8',
 'loc_ic_mou_9']

In [74]:
data[local_incoming_cols].sum(axis=1).head()

mobile_number
7000842753      10.88
7001865778    1409.53
7001625959    1879.59
7001204172    1106.83
7000142493    1905.61
dtype: float64

In [75]:
data[local_incoming_cols].head()

Unnamed: 0_level_0,loc_ic_t2o_mou,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2t_mou_9,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2m_mou_9,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_t2f_mou_9,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,loc_ic_mou_9
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
7000842753,0.0,0.0,0.0,0.16,0.0,0.0,0.0,4.13,0.0,0.0,0.0,1.15,0.0,0.0,0.0,5.44,0.0
7001865778,0.0,1.61,29.91,29.23,116.09,17.48,65.38,375.58,56.93,0.0,8.93,3.61,0.0,19.09,104.23,408.43,173.03
7001625959,0.0,115.69,71.11,67.46,148.23,14.38,15.44,38.89,38.98,99.48,122.29,49.63,158.19,229.56,208.86,155.99,345.41
7001204172,0.0,62.08,19.98,8.04,41.73,113.96,64.51,20.28,52.86,57.43,27.09,19.84,65.59,233.48,111.59,48.18,160.19
7000142493,0.0,105.68,88.49,233.81,154.56,106.84,109.54,104.13,48.24,1.5,0.0,0.0,0.0,214.03,198.04,337.94,202.81


In [76]:
local_incoming_6_cols = [x for x in local_incoming_cols if '6' in x]

In [77]:
local_incoming_6_cols

['loc_ic_t2t_mou_6', 'loc_ic_t2m_mou_6', 'loc_ic_t2f_mou_6', 'loc_ic_mou_6']

In [78]:
data[local_incoming_6_cols].head()

Unnamed: 0_level_0,loc_ic_t2t_mou_6,loc_ic_t2m_mou_6,loc_ic_t2f_mou_6,loc_ic_mou_6
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
7000842753,0.0,0.0,0.0,0.0
7001865778,1.61,17.48,0.0,19.09
7001625959,115.69,14.38,99.48,229.56
7001204172,62.08,113.96,57.43,233.48
7000142493,105.68,106.84,1.5,214.03


In [79]:
x = data[local_incoming_6_cols]

In [80]:
x['loc_ic_mou_6_calc'] = data[['loc_ic_t2t_mou_6', 'loc_ic_t2m_mou_6', 'loc_ic_t2f_mou_6']].sum(axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x['loc_ic_mou_6_calc'] = data[['loc_ic_t2t_mou_6', 'loc_ic_t2m_mou_6', 'loc_ic_t2f_mou_6']].sum(axis=1)


In [81]:
x.head(20)

Unnamed: 0_level_0,loc_ic_t2t_mou_6,loc_ic_t2m_mou_6,loc_ic_t2f_mou_6,loc_ic_mou_6,loc_ic_mou_6_calc
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
7000842753,0.0,0.0,0.0,0.0,0.0
7001865778,1.61,17.48,0.0,19.09,19.09
7001625959,115.69,14.38,99.48,229.56,229.55
7001204172,62.08,113.96,57.43,233.48,233.47
7000142493,105.68,106.84,1.5,214.03,214.02
7000286308,28.73,49.19,0.0,77.93,77.92
7001051193,1857.99,248.64,20.24,2126.89,2126.87
7000701601,58.14,217.56,152.16,427.88,427.86
7001524846,23.84,57.58,0.0,81.43,81.42
7001864400,129.34,132.94,0.4,262.69,262.68


In [82]:
local_incoming_7_cols = [x for x in local_incoming_cols if '7' in x]

In [83]:
data[local_incoming_7_cols].head()

Unnamed: 0_level_0,loc_ic_t2t_mou_7,loc_ic_t2m_mou_7,loc_ic_t2f_mou_7,loc_ic_mou_7
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
7000842753,0.0,0.0,0.0,0.0
7001865778,29.91,65.38,8.93,104.23
7001625959,71.11,15.44,122.29,208.86
7001204172,19.98,64.51,27.09,111.59
7000142493,88.49,109.54,0.0,198.04


In [84]:
local_incoming_8_cols = [x for x in local_incoming_cols if '8' in x]

In [85]:
data[local_incoming_8_cols].head()

Unnamed: 0_level_0,loc_ic_t2t_mou_8,loc_ic_t2m_mou_8,loc_ic_t2f_mou_8,loc_ic_mou_8
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
7000842753,0.16,4.13,1.15,5.44
7001865778,29.23,375.58,3.61,408.43
7001625959,67.46,38.89,49.63,155.99
7001204172,8.04,20.28,19.84,48.18
7000142493,233.81,104.13,0.0,337.94


In [86]:
local_incoming_9_cols = [x for x in local_incoming_cols if '9' in x]

In [87]:
data[local_incoming_9_cols].head()

Unnamed: 0_level_0,loc_ic_t2t_mou_9,loc_ic_t2m_mou_9,loc_ic_t2f_mou_9,loc_ic_mou_9
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
7000842753,0.0,0.0,0.0,0.0
7001865778,116.09,56.93,0.0,173.03
7001625959,148.23,38.98,158.19,345.41
7001204172,41.73,52.86,65.59,160.19
7000142493,154.56,48.24,0.0,202.81


## Filter high-value customers

High-value customers are those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).

In [88]:
avg_total_rech_amt_good_period = total_rech_amt_good_period/2

avg_total_rech_70th = np.percentile(avg_total_rech_amt_good_period, 70)
print('70th percentile of average recharge amount in the good period: {:.2f}'.format(avg_total_rech_70th))

high_value = data[avg_total_rech_amt_good_period >= avg_total_rech_70th]
print('Number of High value customers:', high_value.shape[0])

data = high_value

70th percentile of average recharge amount in the good period: 478.00
Number of High value customers: 30001


## Mark churn

In [89]:
usage_cols = ['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']

In [90]:
data['churn'] = data.apply(lambda x: 1 if x['total_ic_mou_9'] == 0 and x['total_og_mou_9'] == 0 and x['vol_2g_mb_9'] == 0 and x['vol_3g_mb_9'] == 0 else 0, axis=1)

In [91]:
data.head()

Unnamed: 0_level_0,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,arpu_6,arpu_7,arpu_8,arpu_9,onnet_mou_6,onnet_mou_7,...,sachet_3g_6,sachet_3g_7,sachet_3g_8,sachet_3g_9,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g,churn
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7000842753,109,0.0,0.0,0.0,197.385,214.816,213.803,21.1,0.0,0.0,...,0,0,0,0,968,30.4,0.0,101.2,3.58,1
7000701601,109,0.0,0.0,0.0,1069.18,1349.85,3171.48,500.0,57.84,54.68,...,0,0,0,0,802,57.74,19.38,18.74,0.0,1
7001524846,109,0.0,0.0,0.0,378.721,492.223,137.362,166.787,413.69,351.03,...,0,0,0,0,315,21.03,910.65,122.16,0.0,0
7002124215,109,0.0,0.0,0.0,514.453,597.753,637.76,578.596,102.41,132.11,...,0,0,0,0,720,0.0,0.0,0.0,0.0,0
7000887461,109,0.0,0.0,0.0,74.35,193.897,366.966,811.48,48.96,50.66,...,0,0,1,0,604,40.45,51.86,0.0,0.0,0


### Drop all columns for the last month

These columns cannot be used for learning.

In [92]:
month_9_cols = [x for x in data.columns if '_9' in x]

In [93]:
month_9_cols

['arpu_9',
 'onnet_mou_9',
 'offnet_mou_9',
 'roam_ic_mou_9',
 'roam_og_mou_9',
 'loc_og_t2t_mou_9',
 'loc_og_t2m_mou_9',
 'loc_og_t2f_mou_9',
 'loc_og_t2c_mou_9',
 'loc_og_mou_9',
 'std_og_t2t_mou_9',
 'std_og_t2m_mou_9',
 'std_og_t2f_mou_9',
 'std_og_t2c_mou_9',
 'std_og_mou_9',
 'isd_og_mou_9',
 'spl_og_mou_9',
 'og_others_9',
 'total_og_mou_9',
 'loc_ic_t2t_mou_9',
 'loc_ic_t2m_mou_9',
 'loc_ic_t2f_mou_9',
 'loc_ic_mou_9',
 'std_ic_t2t_mou_9',
 'std_ic_t2m_mou_9',
 'std_ic_t2f_mou_9',
 'std_ic_t2o_mou_9',
 'std_ic_mou_9',
 'total_ic_mou_9',
 'spl_ic_mou_9',
 'isd_ic_mou_9',
 'ic_others_9',
 'total_rech_num_9',
 'total_rech_amt_9',
 'max_rech_amt_9',
 'date_of_last_rech_9',
 'last_day_rch_amt_9',
 'vol_2g_mb_9',
 'vol_3g_mb_9',
 'monthly_2g_9',
 'sachet_2g_9',
 'monthly_3g_9',
 'sachet_3g_9']

In [94]:
data.shape

(30001, 182)

In [95]:
data.drop(month_9_cols, axis=1, inplace=True)

In [96]:
data.shape

(30001, 139)

## EDA

### Univariate analysis

In [97]:
non_id_cols = [x for x in data.columns if x not in id_cols and x != 'churn']

In [98]:
data['arpu_6'].value_counts()

0.000       227
213.821      18
213.822      17
213.818      16
213.815      14
           ... 
104.173       1
107.266       1
167.867       1
1340.760      1
322.991       1
Name: arpu_6, Length: 29072, dtype: int64

In [99]:
non_id_cols

['loc_og_t2o_mou',
 'std_og_t2o_mou',
 'loc_ic_t2o_mou',
 'arpu_6',
 'arpu_7',
 'arpu_8',
 'onnet_mou_6',
 'onnet_mou_7',
 'onnet_mou_8',
 'offnet_mou_6',
 'offnet_mou_7',
 'offnet_mou_8',
 'roam_ic_mou_6',
 'roam_ic_mou_7',
 'roam_ic_mou_8',
 'roam_og_mou_6',
 'roam_og_mou_7',
 'roam_og_mou_8',
 'loc_og_t2t_mou_6',
 'loc_og_t2t_mou_7',
 'loc_og_t2t_mou_8',
 'loc_og_t2m_mou_6',
 'loc_og_t2m_mou_7',
 'loc_og_t2m_mou_8',
 'loc_og_t2f_mou_6',
 'loc_og_t2f_mou_7',
 'loc_og_t2f_mou_8',
 'loc_og_t2c_mou_6',
 'loc_og_t2c_mou_7',
 'loc_og_t2c_mou_8',
 'loc_og_mou_6',
 'loc_og_mou_7',
 'loc_og_mou_8',
 'std_og_t2t_mou_6',
 'std_og_t2t_mou_7',
 'std_og_t2t_mou_8',
 'std_og_t2m_mou_6',
 'std_og_t2m_mou_7',
 'std_og_t2m_mou_8',
 'std_og_t2f_mou_6',
 'std_og_t2f_mou_7',
 'std_og_t2f_mou_8',
 'std_og_t2c_mou_6',
 'std_og_t2c_mou_7',
 'std_og_t2c_mou_8',
 'std_og_mou_6',
 'std_og_mou_7',
 'std_og_mou_8',
 'isd_og_mou_6',
 'isd_og_mou_7',
 'isd_og_mou_8',
 'spl_og_mou_6',
 'spl_og_mou_7',
 'spl_og_mou

In [100]:
mou_cols = [x for x in non_id_cols if 'mou' in x]

In [101]:
def plot_cols(cols):
    idx = 0
    for x in cols:
        plt.figure(figsize=(12, 4))
        print('*** ' + x + ' ***')
        sns.countplot(data=data, x=x, hue='churn')
        plt.show()
        print('')

        idx += 1

In [102]:
# plot_cols(mou_cols)

### Bivariate analysis

## Train-test split

In [103]:
def split_x_and_y(df):
    y = df.pop('churn')
    X = df

    return X, y

In [104]:
X, y = split_x_and_y(data)

In [105]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [106]:
X_train.head()

Unnamed: 0_level_0,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,...,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7000763462,109,0.0,0.0,0.0,469.744,758.259,550.074,406.73,742.63,541.44,...,0,0,0,0,0,1047,0.0,0.0,0.0,0.0
7000910526,109,0.0,0.0,0.0,576.239,400.562,701.24,239.41,199.96,162.59,...,0,0,1,0,2,385,499.33,375.23,410.63,0.0
7002151457,109,0.0,0.0,0.0,101.116,155.191,53.24,0.0,0.91,0.0,...,0,0,0,0,0,1590,0.0,0.0,0.0,0.0
7001013312,109,0.0,0.0,0.0,140.406,145.862,129.348,0.0,0.0,0.0,...,0,0,1,1,1,3210,96.09,23.54,113.82,0.0
7002019570,109,0.0,0.0,0.0,493.929,490.526,133.539,160.64,124.01,5.34,...,0,0,0,0,0,2274,0.0,0.0,0.0,0.0


In [107]:
X_test.head()

Unnamed: 0_level_0,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,...,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g
mobile_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7000297339,109,0.0,0.0,0.0,944.344,41.31,2.78,26.81,4.23,0.0,...,0,0,0,0,0,1076,0.0,0.0,0.0,0.0
7001494082,109,0.0,0.0,0.0,544.532,745.897,581.13,151.89,474.46,101.83,...,0,1,0,0,0,3023,10.12,0.0,0.0,0.0
7000365084,109,0.0,0.0,0.0,215.456,608.271,532.28,0.0,45.96,145.91,...,0,0,0,1,0,357,84.44,9.78,0.0,0.0
7001600514,109,0.0,0.0,0.0,623.978,740.704,603.844,145.83,188.19,146.34,...,0,0,0,0,0,2988,0.0,0.0,0.0,0.0
7001323116,109,0.0,0.0,0.0,285.444,276.185,241.623,4.29,2.66,0.2,...,0,0,0,0,2,692,650.7,1038.05,1135.27,181.94


Checking distribution of values in the datasets.

In [108]:
y.value_counts(normalize=True)

0    0.918636
1    0.081364
Name: churn, dtype: float64

In [109]:
y_train.value_counts(normalize=True)

0    0.918619
1    0.081381
Name: churn, dtype: float64

In [110]:
y_test.value_counts(normalize=True)

0    0.918676
1    0.081324
Name: churn, dtype: float64

## Scaling

In [111]:
from sklearn.preprocessing import StandardScaler

In [112]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

In [113]:
X_test = scaler.transform(X_test)

## PCA

## Modeling

### Checking with Logistic Regression

In [114]:
from sklearn.linear_model import LogisticRegression

In [115]:
logreg = LogisticRegression(random_state=42)

In [116]:
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [117]:
y_train_pred_log = logreg.predict(X_train)

Maximize sensitivity to minimize the false negatives.

sensitivity = TP / (TP + FN)

informedness = TPR + TNR - 1

TPR = TP / (TP + FN)

TNR = TN / (TN + FP)

In [118]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score

In [119]:
def evaluate(y, y_pred):
    acc = accuracy_score(y_pred=y_pred, y_true=y)
    print('')
    print(f'*** accuracy: {acc}')

    conf = confusion_matrix(y_pred=y_pred, y_true=y)
    tp = conf[0][0]
    fn = conf[0][1]
    fp = conf[1][0]
    tn = conf[1][1]

    sensitivity = tp / (tp + fn)
    print('')
    print(f'*** sensitivity: {sensitivity}')

    print('')
    print(classification_report(y_pred=y_pred, y_true=y))
    print('')

    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    informedness = tpr + tnr - 1
    print(f'*** informedness: {informedness}')

    roc_auc = roc_auc_score(y_score=y_pred, y_true=y)
    print(f'*** ROC AUC: {roc_auc}')


In [120]:
evaluate(y_train, y_train_pred_log)


*** accuracy: 0.9323809523809524

*** sensitivity: 0.9863666994971748

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     19291
           1       0.68      0.32      0.44      1709

    accuracy                           0.93     21000
   macro avg       0.81      0.65      0.70     21000
weighted avg       0.92      0.93      0.92     21000


*** informedness: 0.30936260353462375
*** ROC AUC: 0.6546813017673118


Even though accuracy and sensitivity are high, the ROC's AUC is not good.

### Decision Tree

In [121]:
from sklearn.tree import DecisionTreeClassifier

In [122]:
dt_class = DecisionTreeClassifier()

In [123]:
data.shape

(30001, 138)

In [124]:
dt_class.fit(X_train, y_train)

In [125]:
y_train_pred_dt = dt_class.predict(X_train)

In [126]:
evaluate(y_train, y_train_pred_dt)


*** accuracy: 1.0

*** sensitivity: 1.0

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19291
           1       1.00      1.00      1.00      1709

    accuracy                           1.00     21000
   macro avg       1.00      1.00      1.00     21000
weighted avg       1.00      1.00      1.00     21000


*** informedness: 1.0
*** ROC AUC: 1.0


The model has most likely overfit. Confirm with test data.

In [127]:
y_test_pred_dt = dt_class.predict(X_test)

In [128]:
evaluate(y_test, y_test_pred_dt)


*** accuracy: 0.9173425174980557

*** sensitivity: 0.9529568267021405

              precision    recall  f1-score   support

           0       0.96      0.95      0.95      8269
           1       0.49      0.52      0.50       732

    accuracy                           0.92      9001
   macro avg       0.72      0.73      0.73      9001
weighted avg       0.92      0.92      0.92      9001


*** informedness: 0.467984149106512
*** ROC AUC: 0.7339920745532561


Accuracy has only dropped slightly, but ROC AUC has dropped significantly. So, test performance is not good. This indicates overfitting.

#### Hyperparameter tuning using cross-validation

In [None]:
from sklearn.model_selection import GridSearchCV

In [129]:
data.shape

(30001, 138)

In [157]:
params = {
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10, 15, 20],
    'max_leaf_nodes': [10, 11, 12, 13, 14, 15]
}

dt_class = DecisionTreeClassifier(random_state=42)

cv = GridSearchCV(
    estimator=dt_class,
    param_grid=params,
    n_jobs=-1,
    return_train_score=True,
    verbose=2
)

cv.fit(X_train, y_train)

In [160]:
dt_best = cv.best_estimator_

In [161]:
dt_best

In [162]:
y_train_pred_dt_best = dt_best.predict(X_train)

In [163]:
evaluate(y_train, y_train_pred_dt_best)


*** accuracy: 0.9443333333333334

*** sensitivity: 0.9824270385153698

              precision    recall  f1-score   support

           0       0.96      0.98      0.97     19291
           1       0.72      0.51      0.60      1709

    accuracy                           0.94     21000
   macro avg       0.84      0.75      0.79     21000
weighted avg       0.94      0.94      0.94     21000


*** informedness: 0.4967629074445683
*** ROC AUC: 0.7483814537222842


Checking with test data

In [164]:
y_test_pred_dt_best = dt_best.predict(X_test)

In [165]:
evaluate(y_test, y_test_pred_dt_best)


*** accuracy: 0.944006221530941

*** sensitivity: 0.9850042326762607

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      8269
           1       0.74      0.48      0.58       732

    accuracy                           0.94      9001
   macro avg       0.85      0.73      0.78      9001
weighted avg       0.94      0.94      0.94      9001


*** informedness: 0.46587854961615127
*** ROC AUC: 0.7329392748080757


### Random Forest

In [166]:
from sklearn.ensemble import RandomForestClassifier

In [167]:
rf = RandomForestClassifier(random_state=42)

In [168]:
rf.fit(X_train, y_train)

In [169]:
y_train_pred_rf = rf.predict(X_train)

In [170]:
evaluate(y_train, y_train_pred_rf)


*** accuracy: 0.9998571428571429

*** sensitivity: 1.0

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19291
           1       1.00      1.00      1.00      1709

    accuracy                           1.00     21000
   macro avg       1.00      1.00      1.00     21000
weighted avg       1.00      1.00      1.00     21000


*** informedness: 0.9982445874780574
*** ROC AUC: 0.9991222937390287


Random Forest with Cross validation

In [172]:
params_rf = {
    'max_depth': [1, 2, 5, 10, 15, 25, 50],
    'min_samples_split': [2, 5, 10, 15, 20],
    'max_leaf_nodes': [1, 2, 5, 10, 15, 20, 25, 50]
}

rf = RandomForestClassifier(random_state=42)

cv_rf = GridSearchCV(
    estimator=rf,
    param_grid=params_rf,
    n_jobs=-1,
    return_train_score=True,
    verbose=2
)

cv_rf.fit(X_train, y_train)

Fitting 5 folds for each of 280 candidates, totalling 1400 fits


KeyboardInterrupt: 

### Evaluation

Since it is important not to misclassify churns, we use <>

What about F1-score? How does it tie into class imbalance, and does it cover not misclassifying churn?