# Data Description 


Identifying likely churners before they abandon a product or service is crucial. In this machine learning assignment, I design a churn prediction model for the telecom industry to forecast which customers are most likely to churn.

The column names in this dataset are built from the following acronyms:

1. MOBILE_NUMBER: Customer phone number
2. CIRCLE_ID: Telecom circle area to which the customer belongs
3. LOC: Local calls, within the same telecom circle
4. STD: STD calls, outside the calling circle
5. IC: Incoming calls
6. OG: Outgoing calls
7. T2T: Operator T to T, i.e. within the same operator (mobile to mobile)
8. T2M: Operator T to other operator mobile
9. T2O: Operator T to other operator fixed line
10. T2F: Operator T to fixed lines of T
11. T2C: Operator T to its own call center
12. ARPU: Average revenue per user
13. MOU: Minutes of usage (voice calls)
14. AON: Age on network, the number of days the customer has been using the operator T network
15. ONNET: All kinds of calls within the same operator network
16. OFFNET: All kinds of calls outside the operator T network
17. ROAM: Indicates that the customer is in a roaming zone during the call
18. SPL: Special calls
19. ISD: ISD calls
20. RECH: Recharge
21. NUM: Number
22. AMT: Amount in local currency
23. MAX: Maximum
24. DATA: Mobile internet
25. 3G: 3G network
26. AV: Average
27. VOL: Mobile internet usage volume (in MB)
28. 2G: 2G network
29. PCK: Prepaid service schemes called PACKS
30. NIGHT: Scheme to use during specific night hours only
31. MONTHLY: Service schemes with validity equivalent to a month
32. SACHET: Service schemes with validity smaller than a month
33. *.6: KPI for the month of June
34. *.7: KPI for the month of July
35. *.8: KPI for the month of August
36. *.9: KPI for the month of September
37. FB_USER: Service scheme to avail services of Facebook and similar social networking sites
38. VBC: Volume-based cost, applied when no specific scheme is purchased and the customer pays per usage
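
These acronyms are joined with underscores to form column names such as `loc_og_t2m_mou_8`. As a minimal sketch, the helper below expands such a name into readable parts. The `ACRONYMS` and `MONTHS` dictionaries and the function name `explain_column` are illustrative assumptions, not part of the original notebook, and only a subset of the glossary is included.

```python
# Illustrative subset of the glossary; extend it with the remaining acronyms as needed.
ACRONYMS = {
    "loc": "local calls",
    "std": "STD calls",
    "ic": "incoming",
    "og": "outgoing",
    "t2t": "operator T to T",
    "t2m": "operator T to other operator mobile",
    "mou": "minutes of usage",
    "arpu": "average revenue per user",
}
MONTHS = {"6": "June", "7": "July", "8": "August", "9": "September"}


def explain_column(name):
    """Expand an underscore-separated column name into readable parts."""
    parts = []
    for token in name.split("_"):
        if token in MONTHS:
            parts.append(f"month: {MONTHS[token]}")
        else:
            # Unknown tokens are kept as-is rather than guessed.
            parts.append(ACRONYMS.get(token, token))
    return parts


print(explain_column("loc_og_t2m_mou_8"))
```

Reading names this way makes it easier to tell, for example, local outgoing minutes apart from STD incoming minutes when scanning the wide column list.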

# Problem Statement

In the telecommunications business, customers can choose from a variety of service providers and actively switch from one to another. In this highly competitive sector, the annual churn rate averages between 15 and 25 percent. Given that acquiring a new customer costs five to ten times as much as retaining an existing one, customer retention is now more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the primary business objective. To reduce churn, telecom companies must predict which customers are at high risk of churning.

# Aim

In this project, I analyze the customer-level data of a large telecommunications company, develop predictive models to identify customers at high risk of churn, and identify the main indicators of churn.


# Tech stack

* Language – Python
* Libraries – NumPy, pandas, Matplotlib, Seaborn, scikit-learn, SciPy, XGBoost

# Approach

1. Importing the required libraries and reading the dataset
    * Understanding the dataset
2. Exploratory Data Analysis (EDA)
3. Filtering High-Value Customers
4. Creating the Target Variable
5. Developing New Features
6. Handling Missing Values
7. Data Visualization: Univariate Analysis
8. Data Visualization: Bivariate Analysis
9. Outlier Detection
10. Data Preparation
11. Data Modeling and Evaluation
12. Non-Interpretable Models
13. Interpretable Models
14. Conclusion

# 1. Understanding the dataset

In [1]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px  black solid !important;
  color: black !important;
}
</style>

## 1.1 Importing Libraries

In [2]:
!pip install graphviz



In [3]:
#Importing Data Reading and Processing Libraries
import pandas as pd
import numpy as np

#Importing Data Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#Importing Data Preparation and Modeling Libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, GridSearchCV,StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier


#Importing Warning Libraries
import warnings
warnings.filterwarnings("ignore")

#Importing Miscellaneous Libraries
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)
pd.set_option('display.width', None)

# Import the StandardScaler()
from sklearn.preprocessing import StandardScaler

#Importing the PCA module
from sklearn.decomposition import PCA

# For Hopkins test
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan

# For clustering 
## using KMeans ##
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import recall_score,roc_auc_score,roc_curve

## using Hierarchical ##
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

# Importing required packages for visualization
from IPython.display import Image  
#from sklearn.externals.six import StringIO  
from sklearn.tree import export_graphviz
import pydot, graphviz


# Other sklearn packages
import sklearn.metrics as metrics
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

from datetime import date,datetime
import math
import multiprocessing

print("Libraries imported successfully.")

Libraries imported successfully.


## 1.2 Dataset Loading

In [4]:
df = pd.read_csv('telecom_churn_data.csv')
print("Dataset loaded successfully.")
df.sample(2)

Dataset loaded successfully.


Unnamed: 0,mobile_number,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,last_date_of_month_9,arpu_6,arpu_7,arpu_8,arpu_9,onnet_mou_6,onnet_mou_7,onnet_mou_8,onnet_mou_9,offnet_mou_6,offnet_mou_7,offnet_mou_8,offnet_mou_9,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_ic_mou_9,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,roam_og_mou_9,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2t_mou_9,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2m_mou_9,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2f_mou_9,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_t2c_mou_9,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,loc_og_mou_9,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2t_mou_9,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2m_mou_9,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2f_mou_9,std_og_t2c_mou_6,std_og_t2c_mou_7,std_og_t2c_mou_8,std_og_t2c_mou_9,std_og_mou_6,std_og_mou_7,std_og_mou_8,std_og_mou_9,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,isd_og_mou_9,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,spl_og_mou_9,og_others_6,og_others_7,og_others_8,og_others_9,total_og_mou_6,total_og_mou_7,total_og_mou_8,total_og_mou_9,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2t_mou_9,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2m_mou_9,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_t2f_mou_9,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,loc_ic_mou_9,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2t_mou_9,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2m_mou_9,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2f_mou_9,std_ic_t2o_mou_6,std_ic_t2o_mou_7,std_ic_t2o_mou_8,std_ic_t2o_mou_9,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,std_ic_mou_9,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,total_ic_mou_9,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,spl_ic_mou_9,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,isd_i
c_mou_9,ic_others_6,ic_others_7,ic_others_8,ic_others_9,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_num_9,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,total_rech_amt_9,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,max_rech_amt_9,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,date_of_last_rech_9,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,last_day_rch_amt_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9,total_rech_data_6,total_rech_data_7,total_rech_data_8,total_rech_data_9,max_rech_data_6,max_rech_data_7,max_rech_data_8,max_rech_data_9,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_2g_9,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,count_rech_3g_9,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,av_rech_amt_data_9,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_2g_mb_9,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,vol_3g_mb_9,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_3g_9,arpu_2g_6,arpu_2g_7,arpu_2g_8,arpu_2g_9,night_pck_user_6,night_pck_user_7,night_pck_user_8,night_pck_user_9,monthly_2g_6,monthly_2g_7,monthly_2g_8,monthly_2g_9,sachet_2g_6,sachet_2g_7,sachet_2g_8,sachet_2g_9,monthly_3g_6,monthly_3g_7,monthly_3g_8,monthly_3g_9,sachet_3g_6,sachet_3g_7,sachet_3g_8,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g
9574,7000122759,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,472.709,429.116,326.263,351.012,82.66,19.93,23.21,23.41,167.13,90.21,65.13,58.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.44,12.14,9.21,12.64,116.33,89.31,59.94,54.66,0.0,0.0,5.18,0.46,0.0,0.0,0.0,2.26,145.78,101.46,74.34,67.78,53.21,7.78,13.99,10.76,47.11,0.9,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.33,8.68,13.99,11.51,0.0,0.0,0.0,0.0,7.51,7.28,14.58,8.61,3.69,0.0,0.0,0.0,257.33,117.43,102.93,87.91,15.74,13.98,16.54,23.69,3390.96,435.73,124.68,139.89,25.61,33.64,49.21,19.93,3432.33,483.36,190.44,183.53,30.99,0.0,0.86,0.0,175.74,9.46,9.44,9.99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,206.74,9.46,10.31,9.99,3648.66,492.83,200.76,193.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.58,0.0,0.0,0.0,17,14,14,17,570,460,370,410,50,50,30,30,6/30/2014,7/30/2014,8/27/2014,9/28/2014,30,30,30,30,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,665,0.0,0.0,0.0,0.0
26343,7000248369,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,263.28,154.144,88.0,86.0,9.19,33.46,65.36,47.81,0.9,5.49,1.4,3.98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.19,33.46,65.36,47.81,0.9,5.49,1.4,3.98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.09,38.96,66.76,51.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.09,38.96,66.76,51.79,486.23,470.99,310.51,249.63,11.53,56.11,20.46,14.66,0.0,0.0,0.0,0.0,497.76,527.11,330.98,264.29,0.0,0.0,0.0,0.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.33,0.0,0.0,497.76,527.44,330.98,264.29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,2,2,0,0,250,250,0,0,250,250,0,6/3/2014,7/15/2014,8/28/2014,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,621,0.0,31.17,26.64,0.0


In [5]:
df.shape

(99999, 226)

`Observation`: The dataset has 99,999 rows and 226 columns.

In [6]:
df.info(verbose=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 226 columns):
 #    Column                    Dtype  
---   ------                    -----  
 0    mobile_number             int64  
 1    circle_id                 int64  
 2    loc_og_t2o_mou            float64
 3    std_og_t2o_mou            float64
 4    loc_ic_t2o_mou            float64
 5    last_date_of_month_6      object 
 6    last_date_of_month_7      object 
 7    last_date_of_month_8      object 
 8    last_date_of_month_9      object 
 9    arpu_6                    float64
 10   arpu_7                    float64
 11   arpu_8                    float64
 12   arpu_9                    float64
 13   onnet_mou_6               float64
 14   onnet_mou_7               float64
 15   onnet_mou_8               float64
 16   onnet_mou_9               float64
 17   offnet_mou_6              float64
 18   offnet_mou_7              float64
 19   offnet_mou_8              float64
 20   offn

In [7]:
#Let's examine the data's distribution.
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mobile_number,99999.0,7001207000.0,695669.38629,7000000000.0,7000606000.0,7001205000.0,7001812000.0,7002411000.0
circle_id,99999.0,109.0,0.0,109.0,109.0,109.0,109.0,109.0
loc_og_t2o_mou,98981.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std_og_t2o_mou,98981.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
loc_ic_t2o_mou,98981.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
arpu_6,99999.0,282.9874,328.43977,-2258.709,93.4115,197.704,371.06,27731.09
arpu_7,99999.0,278.5366,338.156291,-2014.045,86.9805,191.64,365.3445,35145.83
arpu_8,99999.0,279.1547,344.474791,-945.808,84.126,192.08,369.3705,33543.62
arpu_9,99999.0,261.6451,341.99863,-1899.505,62.685,176.849,353.4665,38805.62
onnet_mou_6,96062.0,132.3959,297.207406,0.0,7.38,34.31,118.74,7376.71


# 2. Data Cleaning

The volume-based cost (VBC) columns carry the month as a name prefix (for example `aug_vbc_3g`), whereas the remaining columns use the numeric suffixes 6, 7, 8, and 9 to indicate the month. I rename the inconsistently named columns so that every column follows the numeric convention.

In [8]:
month = ['aug_vbc_3g','jul_vbc_3g','jun_vbc_3g','sep_vbc_3g']
df = df.rename(columns = {'aug_vbc_3g':'vbc_3g_8','jul_vbc_3g':'vbc_3g_7','jun_vbc_3g':'vbc_3g_6',
                          'sep_vbc_3g':'vbc_3g_9'})
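
The rename mapping can also be derived from a month lookup instead of being typed by hand. The sketch below is one possible way to do this; the helper name `build_vbc_rename` and the `MONTH_SUFFIX` dictionary are assumptions introduced for illustration.

```python
import re

# Month abbreviations mapped to the numeric suffixes used by the other columns.
MONTH_SUFFIX = {"jun": "6", "jul": "7", "aug": "8", "sep": "9"}


def build_vbc_rename(columns):
    """Map names like 'aug_vbc_3g' to 'vbc_3g_8'; all other names are skipped."""
    mapping = {}
    for col in columns:
        match = re.fullmatch(r"(jun|jul|aug|sep)_(vbc_3g)", col)
        if match:
            month, base = match.groups()
            mapping[col] = f"{base}_{MONTH_SUFFIX[month]}"
    return mapping


rename_map = build_vbc_rename(
    ["aon", "aug_vbc_3g", "jul_vbc_3g", "jun_vbc_3g", "sep_vbc_3g"]
)
print(rename_map)
# The mapping could then be passed to df.rename(columns=rename_map).
```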


## 2.1 Converting Datetime

In [9]:
#Convert the date columns to datetime format

date_col = [col for col in df.columns if 'date' in col]

for col in date_col:
    df[col] = pd.to_datetime(df[col])
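
The raw date values appear as month/day/year strings such as `6/30/2014`. As a hedged variation on the conversion above, the sketch below passes an explicit `format` and `errors="coerce"` to `pd.to_datetime`. The toy frame is made up for illustration, and the assumption that every date column uses this one layout should be checked against the real data.

```python
import pandas as pd

# Toy frame mimicking the raw month/day/year strings; not the real dataset.
toy = pd.DataFrame({
    "date_of_last_rech_6": ["6/30/2014", "6/3/2014", None],
    "arpu_6": [472.709, 263.28, 10.0],
})

date_cols = [col for col in toy.columns if "date" in col]
for col in date_cols:
    # An explicit format documents the expected layout and skips format inference;
    # errors="coerce" turns unparseable entries into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

print(toy.dtypes)
```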


In [10]:
df.info(verbose=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 226 columns):
 #    Column                    Dtype         
---   ------                    -----         
 0    mobile_number             int64         
 1    circle_id                 int64         
 2    loc_og_t2o_mou            float64       
 3    std_og_t2o_mou            float64       
 4    loc_ic_t2o_mou            float64       
 5    last_date_of_month_6      datetime64[ns]
 6    last_date_of_month_7      datetime64[ns]
 7    last_date_of_month_8      datetime64[ns]
 8    last_date_of_month_9      datetime64[ns]
 9    arpu_6                    float64       
 10   arpu_7                    float64       
 11   arpu_8                    float64       
 12   arpu_9                    float64       
 13   onnet_mou_6               float64       
 14   onnet_mou_7               float64       
 15   onnet_mou_8               float64       
 16   onnet_mou_9               float64     

## 2.2 Removing Single-Valued Columns
Columns that hold a single unique value for every customer carry no information for the model. I therefore drop these zero-variance columns from the dataset.

In [11]:
cols = []
for i in df.columns:
    if df[i].nunique() ==1:
        cols.append(i)
df = df.drop(cols,axis=1)
df.sample(2)

Unnamed: 0,mobile_number,arpu_6,arpu_7,arpu_8,arpu_9,onnet_mou_6,onnet_mou_7,onnet_mou_8,onnet_mou_9,offnet_mou_6,offnet_mou_7,offnet_mou_8,offnet_mou_9,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_ic_mou_9,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,roam_og_mou_9,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2t_mou_9,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2m_mou_9,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2f_mou_9,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_t2c_mou_9,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,loc_og_mou_9,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2t_mou_9,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2m_mou_9,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2f_mou_9,std_og_mou_6,std_og_mou_7,std_og_mou_8,std_og_mou_9,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,isd_og_mou_9,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,spl_og_mou_9,og_others_6,og_others_7,og_others_8,og_others_9,total_og_mou_6,total_og_mou_7,total_og_mou_8,total_og_mou_9,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2t_mou_9,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2m_mou_9,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_t2f_mou_9,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,loc_ic_mou_9,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2t_mou_9,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2m_mou_9,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2f_mou_9,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,std_ic_mou_9,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,total_ic_mou_9,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,spl_ic_mou_9,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,isd_ic_mou_9,ic_others_6,ic_others_7,ic_others_8,ic_others_9,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_num_9,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,total_rech_amt_9,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,max_rech_amt_9,date_of_last_rech_6,dat
e_of_last_rech_7,date_of_last_rech_8,date_of_last_rech_9,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,last_day_rch_amt_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9,total_rech_data_6,total_rech_data_7,total_rech_data_8,total_rech_data_9,max_rech_data_6,max_rech_data_7,max_rech_data_8,max_rech_data_9,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_2g_9,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,count_rech_3g_9,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,av_rech_amt_data_9,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_2g_mb_9,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,vol_3g_mb_9,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_3g_9,arpu_2g_6,arpu_2g_7,arpu_2g_8,arpu_2g_9,night_pck_user_6,night_pck_user_7,night_pck_user_8,night_pck_user_9,monthly_2g_6,monthly_2g_7,monthly_2g_8,monthly_2g_9,sachet_2g_6,sachet_2g_7,sachet_2g_8,sachet_2g_9,monthly_3g_6,monthly_3g_7,monthly_3g_8,monthly_3g_9,sachet_3g_6,sachet_3g_7,sachet_3g_8,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,vbc_3g_8,vbc_3g_7,vbc_3g_6,vbc_3g_9
25886,7001898209,223.495,244.635,142.68,229.555,0.0,0.0,0.0,13.14,29.74,52.81,10.33,13.41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.14,7.23,49.43,8.68,13.41,1.56,1.46,1.65,0.0,0.0,0.0,0.0,0.0,8.79,50.89,10.33,26.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.95,1.91,0.0,0.0,0.0,0.0,0.0,0.0,29.74,52.81,10.33,26.56,32.34,7.39,0.0,15.41,24.71,114.19,23.01,56.73,0.44,85.76,16.04,0.0,57.51,207.36,39.06,72.14,0.0,0.0,0.0,0.0,0.0,0.43,0.0,0.0,8.58,7.34,1.56,0.0,8.58,7.78,1.56,0.0,301.56,383.38,224.04,189.96,0.0,0.0,0.0,0.0,40.03,0.0,0.0,0.0,195.43,168.23,183.41,117.81,6,8,5,4,269,279,178,262,179,202,154,202,2014-06-26,2014-07-26,2014-08-28,2014-09-26,30,0,7,30,2014-06-02,2014-07-03,2014-08-05,2014-09-02,1.0,2.0,1.0,2.0,179.0,202.0,154.0,202.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,179.0,404.0,154.0,404.0,107.29,472.96,864.12,558.79,888.89,438.61,0.0,489.29,166.46,85.01,0.0,85.01,175.67,91.81,0.71,94.91,0.0,0.0,0.0,0.0,0,0,1,0,0,1,0,1,1,0,0,0,0,1,0,1,1.0,1.0,1.0,1.0,2647,0.0,485.6,5.15,0.0
8724,7001078801,346.067,317.799,229.246,144.23,63.81,27.58,53.31,20.94,317.78,342.66,228.36,134.66,10.69,6.28,0.0,0.0,153.88,98.33,0.0,0.0,32.61,14.58,36.18,14.43,134.81,147.91,120.44,111.28,0.0,0.0,3.74,0.0,0.0,0.0,0.0,0.0,167.43,162.49,160.38,125.71,17.76,0.16,17.13,6.51,42.51,109.24,104.16,23.38,0.0,0.0,0.0,0.0,60.28,109.41,121.29,29.89,0.0,0.0,0.0,0.0,0.0,0.0,0.7,0.0,0.0,0.0,0.0,0.0,227.71,271.91,282.38,155.61,158.61,10.64,67.66,37.78,78.84,71.21,122.56,130.69,0.68,1.86,6.98,0.0,238.14,83.73,197.21,168.48,18.29,1.05,13.96,9.76,15.98,24.63,12.11,15.39,0.0,1.25,0.0,0.0,34.28,26.93,26.08,25.16,272.43,110.66,223.29,193.64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9,4,6,5,604,160,321,101,120,65,65,61,2014-06-30,2014-07-24,2014-08-30,2014-09-29,120,65,61,20,NaT,NaT,NaT,NaT,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,3443,0.0,0.0,0.0,0.0


In [12]:
df.shape

(99999, 210)

`Observation`: 16 single-valued columns were dropped, reducing the number of columns from 226 to 210.
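
One detail worth noting is that `nunique()` ignores missing values by default. A column such as `loc_og_t2o_mou`, which is zero wherever it is present but has only 98,981 non-null entries, therefore still counts as single-valued. The sketch below illustrates the difference on a made-up toy frame; the column names in it are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: 'circle' is constant, 't2o' is constant apart from a missing value.
toy = pd.DataFrame({
    "circle": [109, 109, 109],
    "t2o": [0.0, np.nan, 0.0],
    "arpu": [10.5, 20.1, 30.7],
})

# nunique() skips NaN by default, so 't2o' counts as single-valued.
single_valued = [col for col in toy.columns if toy[col].nunique() == 1]

# With dropna=False, NaN counts as its own value and 't2o' is kept.
strictly_constant = [
    col for col in toy.columns if toy[col].nunique(dropna=False) == 1
]

print(single_valued)      # ['circle', 't2o']
print(strictly_constant)  # ['circle']
```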

## 2.3 Create ID for each customer

Drop the mobile number column and create a sequential `cust_id` column to identify each customer.

In [13]:
#drop mobile number
df = df.drop('mobile_number', axis=1)

#Create id
df = df.reset_index()
df = df.rename(columns = {'index':'cust_id'})
df['cust_id'] = df['cust_id']+1
df.sample(2)

Unnamed: 0,cust_id,arpu_6,arpu_7,arpu_8,arpu_9,onnet_mou_6,onnet_mou_7,onnet_mou_8,onnet_mou_9,offnet_mou_6,offnet_mou_7,offnet_mou_8,offnet_mou_9,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_ic_mou_9,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,roam_og_mou_9,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2t_mou_9,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2m_mou_9,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2f_mou_9,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_t2c_mou_9,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,loc_og_mou_9,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2t_mou_9,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2m_mou_9,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2f_mou_9,std_og_mou_6,std_og_mou_7,std_og_mou_8,std_og_mou_9,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,isd_og_mou_9,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,spl_og_mou_9,og_others_6,og_others_7,og_others_8,og_others_9,total_og_mou_6,total_og_mou_7,total_og_mou_8,total_og_mou_9,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2t_mou_9,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2m_mou_9,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_t2f_mou_9,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,loc_ic_mou_9,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2t_mou_9,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2m_mou_9,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2f_mou_9,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,std_ic_mou_9,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,total_ic_mou_9,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,spl_ic_mou_9,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,isd_ic_mou_9,ic_others_6,ic_others_7,ic_others_8,ic_others_9,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_num_9,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,total_rech_amt_9,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,max_rech_amt_9,date_of_last_rech_6,date_of_l
ast_rech_7,date_of_last_rech_8,date_of_last_rech_9,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,last_day_rch_amt_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9,total_rech_data_6,total_rech_data_7,total_rech_data_8,total_rech_data_9,max_rech_data_6,max_rech_data_7,max_rech_data_8,max_rech_data_9,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_2g_9,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,count_rech_3g_9,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,av_rech_amt_data_9,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_2g_mb_9,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,vol_3g_mb_9,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_3g_9,arpu_2g_6,arpu_2g_7,arpu_2g_8,arpu_2g_9,night_pck_user_6,night_pck_user_7,night_pck_user_8,night_pck_user_9,monthly_2g_6,monthly_2g_7,monthly_2g_8,monthly_2g_9,sachet_2g_6,sachet_2g_7,sachet_2g_8,sachet_2g_9,monthly_3g_6,monthly_3g_7,monthly_3g_8,monthly_3g_9,sachet_3g_6,sachet_3g_7,sachet_3g_8,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,vbc_3g_8,vbc_3g_7,vbc_3g_6,vbc_3g_9
37801,37802,311.469,239.701,202.85,216.139,93.03,0.03,38.46,36.83,598.99,472.61,329.43,286.54,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.54,0.03,0.0,0.0,33.14,16.18,16.23,15.21,0.38,0.0,0.0,0.0,0.0,0.0,11.68,0.0,35.08,16.21,16.23,15.21,91.48,0.0,38.46,36.83,565.46,456.43,284.39,270.96,0.0,0.0,0.0,0.0,656.94,456.43,322.86,307.79,0.0,0.0,0.03,0.0,0.0,0.0,28.76,0.36,0.0,0.0,0.0,0.0,692.03,472.64,367.89,323.38,1.16,0.38,0.0,0.0,30.19,4.69,16.06,7.94,0.24,0.85,0.0,0.78,31.61,5.93,16.06,8.73,0.0,6.9,0.0,0.0,16.81,42.59,3.99,6.56,0.0,20.65,0.6,0.0,16.81,70.14,4.59,6.56,54.26,89.26,27.54,15.58,0.0,0.0,0.0,0.0,5.55,13.04,6.71,0.0,0.28,0.13,0.16,0.28,5,6,6,13,456,206,233,244,110,110,130,130,2014-06-29,2014-07-29,2014-08-24,2014-09-30,110,50,50,10,NaT,NaT,2014-08-03,2014-09-20,,,1.0,3.0,,,17.0,17.0,,,1.0,3.0,,,0.0,0.0,,,17.0,48.0,0.0,0.0,81.06,269.82,0.0,0.0,0.0,0.0,,,0.0,0.0,,,0.6,2.3,,,0.0,0.0,0,0,0,0,0,0,1,3,0,0,0,0,0,0,0,0,,,1.0,1.0,369,0.0,0.0,0.0,0.0
98814,98815,180.04,505.767,86.507,255.502,9.13,0.0,0.0,7.26,171.44,53.89,24.78,75.46,46.26,0.0,0.0,0.0,61.26,0.0,0.0,0.0,0.0,0.0,0.0,7.26,63.98,45.66,24.78,74.71,7.34,6.19,0.0,0.75,5.04,1.99,0.0,0.0,71.33,51.86,24.78,82.73,8.95,0.0,0.0,0.0,29.06,0.03,0.0,0.0,0.0,0.0,0.0,0.0,38.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,9.96,5.73,0.6,0.76,0.0,0.0,0.0,0.0,119.31,57.63,25.38,83.49,0.0,0.0,0.0,11.59,20.08,79.98,58.21,129.33,0.0,0.0,0.58,0.55,20.08,79.98,58.79,141.48,0.28,0.0,0.0,0.0,15.09,15.46,11.58,26.54,0.0,0.0,0.0,0.0,15.38,15.46,11.58,26.54,35.46,95.44,70.38,168.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,8,6,6,0,593,132,329,0,179,56,175,2014-06-14,2014-07-29,2014-08-30,2014-09-26,0,30,23,86,NaT,2014-07-28,2014-08-23,2014-09-05,,3.0,1.0,1.0,,179.0,56.0,175.0,,0.0,1.0,1.0,,3.0,0.0,0.0,,503.0,56.0,175.0,0.0,150.78,543.12,617.89,0.0,1065.13,177.03,32.32,,423.43,0.0,-0.04,,430.52,10.01,20.36,,0.0,0.0,0.0,0,0,0,1,0,0,1,0,0,3,0,0,0,0,0,0,,1.0,1.0,1.0,753,378.51,405.92,141.2,0.0


## 2.4 Separate Categorical columns

Check whether the dataset has any categorical columns. These can be found by looking for columns that take only two distinct values, 1 and 0, which correspond to yes and no, respectively.

In [14]:
cat_cols = []

for i in df.columns:
    if df[i].nunique()==2:
        cat_cols.append(i)
cat_cols 

['night_pck_user_6',
 'night_pck_user_7',
 'night_pck_user_8',
 'night_pck_user_9',
 'fb_user_6',
 'fb_user_7',
 'fb_user_8',
 'fb_user_9']
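
A count of two distinct values does not by itself guarantee that those values are 0 and 1. As a minimal sketch, the check below confirms the actual values as well. The helper name `binary_flag_columns` and the toy column `plan_code` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; 'plan_code' is a hypothetical two-valued column that is not a 0/1 flag.
toy = pd.DataFrame({
    "fb_user_6": [1.0, 0.0, np.nan, 1.0],
    "monthly_2g_6": [0, 1, 2, 0],
    "plan_code": [5, 9, 5, 9],
})


def binary_flag_columns(frame):
    """Return the columns whose non-missing values are exactly {0, 1}."""
    flags = []
    for col in frame.columns:
        values = set(frame[col].dropna().unique().tolist())
        if values == {0, 1}:
            flags.append(col)
    return flags


print(binary_flag_columns(toy))
```

On the toy frame only `fb_user_6` qualifies: `monthly_2g_6` has three distinct values, and `plan_code` has two values that are not 0 and 1.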

## 2.5 Some Important Insights

### 2.5.1 Average revenue per user (ARPU)

In [15]:
arpu=[]
arpu_col= [col for col in df.columns if 'arpu' in col]

for i in df[arpu_col]:
    x=df[[i]].min()
    arpu.append(x)
print(arpu)

[arpu_6   -2258.709
dtype: float64, arpu_7   -2014.045
dtype: float64, arpu_8   -945.808
dtype: float64, arpu_9   -1899.505
dtype: float64, arpu_3g_6   -30.82
dtype: float64, arpu_3g_7   -26.04
dtype: float64, arpu_3g_8   -24.49
dtype: float64, arpu_3g_9   -71.09
dtype: float64, arpu_2g_6   -35.83
dtype: float64, arpu_2g_7   -15.48
dtype: float64, arpu_2g_8   -55.83
dtype: float64, arpu_2g_9   -45.74
dtype: float64]


`Observation`
The minimum values of `arpu_6`, `arpu_7`, `arpu_8`, and `arpu_9` are negative, which suggests that some customers are generating a loss for the business. I keep them in the analysis because my criterion for a high-value customer is based on usage-based churn, not revenue-based churn, and removing them could discard useful information. Their significance is examined in the exploratory data analysis section before a final decision is made.
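
To gauge how widespread negative revenue is, the number of affected customers can be counted per month. The sketch below shows the idea on a made-up toy frame; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data with a few negative ARPU values; not the real dataset.
toy = pd.DataFrame({
    "arpu_6": [472.709, -5.0, 263.28],
    "arpu_7": [429.116, 154.144, -12.5],
    "aon": [665, 621, 369],
})

arpu_cols = [col for col in toy.columns if "arpu" in col]

# Number of customers with negative revenue in each month.
negative_counts = (toy[arpu_cols] < 0).sum()

# Number of customers with negative revenue in at least one month.
any_negative = (toy[arpu_cols] < 0).any(axis=1).sum()

print(negative_counts)
print(any_negative)
```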

### 2.5.2 Sachet recharge
Sachet recharges are service plans with a validity of less than one month, so the number of days for a sachet recharge should be fewer than 30. Any value beyond 29 days suggests either that the customer has made a monthly recharge or that the entry is incorrect. I therefore cap values above 29 at the largest value observed below 30 in each of the month 6, 7, and 8 columns.

In [16]:
sachet=[]
sachet_col= [col for col in df.columns if 'sachet' in col]

for i in df[sachet_col]:
    x=df[[i]].max()
    sachet.append(x)
print(sachet)

[sachet_2g_6    42
dtype: int64, sachet_2g_7    48
dtype: int64, sachet_2g_8    44
dtype: int64, sachet_2g_9    40
dtype: int64, sachet_3g_6    29
dtype: int64, sachet_3g_7    35
dtype: int64, sachet_3g_8    41
dtype: int64, sachet_3g_9    49
dtype: int64]


In [17]:
df['sachet_2g_6'] = df['sachet_2g_6'].clip(0,28)
df['sachet_2g_7'] = df['sachet_2g_7'].clip(0,29)
df['sachet_2g_8'] = df['sachet_2g_8'].clip(0,29)
df['sachet_3g_6'] = df['sachet_3g_6'].clip(0,29)
df['sachet_3g_7'] = df['sachet_3g_7'].clip(0,24)
df['sachet_3g_8'] = df['sachet_3g_8'].clip(0,29)
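
The upper bounds used above can also be computed rather than hard-coded. As a minimal sketch, assuming the intended cap is the largest observed value below 30 in each column, the loop below derives it per column. The toy values are made up for illustration.

```python
import pandas as pd

# Toy data; values above 29 should be capped.
toy = pd.DataFrame({
    "sachet_2g_6": [0, 3, 28, 42],
    "sachet_3g_7": [1, 24, 35, 0],
})

for col in toy.columns:
    # Largest observed value that is still below 30.
    cap = toy.loc[toy[col] < 30, col].max()
    toy[col] = toy[col].clip(lower=0, upper=cap)

print(toy.max())
```

Computing the cap from the data keeps the bound in step with the column contents if the dataset is refreshed.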

In [18]:
df.info(verbose=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 210 columns):
 #    Column                    Dtype         
---   ------                    -----         
 0    cust_id                   int64         
 1    arpu_6                    float64       
 2    arpu_7                    float64       
 3    arpu_8                    float64       
 4    arpu_9                    float64       
 5    onnet_mou_6               float64       
 6    onnet_mou_7               float64       
 7    onnet_mou_8               float64       
 8    onnet_mou_9               float64       
 9    offnet_mou_6              float64       
 10   offnet_mou_7              float64       
 11   offnet_mou_8              float64       
 12   offnet_mou_9              float64       
 13   roam_ic_mou_6             float64       
 14   roam_ic_mou_7             float64       
 15   roam_ic_mou_8             float64       
 16   roam_ic_mou_9             float64     

# 3. Focusing on High-Value Customers

In [19]:
#Monthly total: total_rech_amt plus the total_rech_data column; sum(axis=1) skips NaN values
df['total_amt_6'] = df[['total_rech_amt_6', 'total_rech_data_6']].sum(axis=1)
df['total_amt_7'] = df[['total_rech_amt_7', 'total_rech_data_7']].sum(axis=1)
df['total_amt_8'] = df[['total_rech_amt_8', 'total_rech_data_8']].sum(axis=1)
df['total_amt_9'] = df[['total_rech_amt_9', 'total_rech_data_9']].sum(axis=1)
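
The row-wise `sum(axis=1)` used here skips missing values by default, so a customer with no data recharge still receives a total. Plain addition would propagate `NaN` instead. The sketch below contrasts the two on a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy data: the first customer has no data recharge entry.
toy = pd.DataFrame({
    "total_rech_amt_6": [70, 456],
    "total_rech_data_6": [np.nan, 1.0],
})

# Row-wise sum skips NaN by default (skipna=True).
row_sum = toy[["total_rech_amt_6", "total_rech_data_6"]].sum(axis=1)

# Plain addition propagates NaN instead.
plain_add = toy["total_rech_amt_6"] + toy["total_rech_data_6"]

print(row_sum.tolist())    # [70.0, 457.0]
print(plain_add.tolist())  # [nan, 457.0]
```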

In [20]:
#calculate total recharge amount per user for months
df['total_rech_amt_per_user']=df[['total_amt_6','total_amt_7','total_amt_8','total_amt_9']].sum(axis=1)

In [21]:
df.sample(2)

Unnamed: 0,cust_id,arpu_6,arpu_7,arpu_8,arpu_9,onnet_mou_6,onnet_mou_7,onnet_mou_8,onnet_mou_9,offnet_mou_6,offnet_mou_7,offnet_mou_8,offnet_mou_9,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_ic_mou_9,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,roam_og_mou_9,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2t_mou_9,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2m_mou_9,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2f_mou_9,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_t2c_mou_9,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,loc_og_mou_9,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2t_mou_9,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2m_mou_9,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2f_mou_9,std_og_mou_6,std_og_mou_7,std_og_mou_8,std_og_mou_9,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,isd_og_mou_9,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,spl_og_mou_9,og_others_6,og_others_7,og_others_8,og_others_9,total_og_mou_6,total_og_mou_7,total_og_mou_8,total_og_mou_9,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2t_mou_9,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2m_mou_9,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_t2f_mou_9,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,loc_ic_mou_9,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2t_mou_9,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2m_mou_9,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2f_mou_9,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,std_ic_mou_9,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,total_ic_mou_9,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,spl_ic_mou_9,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,isd_ic_mou_9,ic_others_6,ic_others_7,ic_others_8,ic_others_9,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_num_9,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,total_rech_amt_9,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,max_rech_amt_9,date_of_last_rech_6,date_of_l
ast_rech_7,date_of_last_rech_8,date_of_last_rech_9,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,last_day_rch_amt_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9,total_rech_data_6,total_rech_data_7,total_rech_data_8,total_rech_data_9,max_rech_data_6,max_rech_data_7,max_rech_data_8,max_rech_data_9,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_2g_9,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,count_rech_3g_9,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,av_rech_amt_data_9,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_2g_mb_9,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,vol_3g_mb_9,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_3g_9,arpu_2g_6,arpu_2g_7,arpu_2g_8,arpu_2g_9,night_pck_user_6,night_pck_user_7,night_pck_user_8,night_pck_user_9,monthly_2g_6,monthly_2g_7,monthly_2g_8,monthly_2g_9,sachet_2g_6,sachet_2g_7,sachet_2g_8,sachet_2g_9,monthly_3g_6,monthly_3g_7,monthly_3g_8,monthly_3g_9,sachet_3g_6,sachet_3g_7,sachet_3g_8,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,vbc_3g_8,vbc_3g_7,vbc_3g_6,vbc_3g_9,total_amt_6,total_amt_7,total_amt_8,total_amt_9,total_rech_amt_per_user
89817,89818,62.307,37.77,44.769,44.318,9.71,10.94,10.06,18.24,32.13,25.83,30.61,19.51,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.78,0.56,0.53,0.0,19.11,20.99,27.38,9.78,0.0,0.0,0.0,0.0,0.0,4.83,2.48,1.56,24.89,21.56,27.91,9.78,3.93,10.38,9.53,18.24,13.01,0.0,0.75,8.16,0.0,0.0,0.0,0.0,16.94,10.38,10.28,26.41,0.0,0.0,0.0,0.0,3.73,5.26,2.48,2.53,2.54,0.0,0.0,0.0,48.13,37.21,40.68,38.73,26.11,4.18,0.23,0.81,71.86,86.93,84.34,59.94,0.0,0.0,0.0,0.45,97.98,91.11,84.58,61.21,0.81,0.01,0.0,0.06,0.0,0.0,1.33,0.93,0.0,0.0,0.0,0.0,0.81,0.01,1.33,0.99,100.58,92.04,86.11,66.86,0.0,0.91,0.2,1.81,1.18,0.0,0.0,0.0,0.59,0.0,0.0,2.83,9,7,7,5,70,40,50,50,10,10,10,20,2014-06-25,2014-07-22,2014-08-28,2014-09-30,10,10,10,0,NaT,NaT,NaT,NaT,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,397,0.0,0.0,0.0,0.0,70.0,40.0,50.0,50.0,210.0
44657,44658,537.145,476.488,210.044,0.0,138.99,209.99,57.66,,1082.81,883.58,159.08,,0.0,0.0,0.0,,0.0,0.0,0.0,,49.68,48.24,36.71,,304.78,265.39,41.24,,0.0,0.0,0.0,,0.0,0.0,0.0,,354.46,313.64,77.96,,89.31,161.74,20.94,,778.03,618.18,117.83,,0.0,0.0,0.0,,867.34,779.93,138.78,,0.0,0.0,0.0,,0.0,0.0,0.0,,0.0,0.0,0.0,,1221.81,1093.58,216.74,0.0,38.41,59.33,22.43,,180.38,110.23,97.44,,0.75,0.0,1.36,,219.54,169.56,121.24,,0.0,0.83,0.0,,14.78,0.0,0.05,,0.0,0.0,0.0,,14.78,0.83,0.05,,234.48,170.39,121.29,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,,0.15,0.0,0.0,,12,6,3,0,775,561,213,0,128,120,169,0,2014-06-28,2014-07-28,2014-08-06,NaT,120,110,169,0,NaT,NaT,NaT,NaT,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,938,0.0,0.0,0.0,0.0,775.0,561.0,213.0,0.0,1549.0


## 3.1 Filtering high value customers

In [22]:
avg_amt_rech=df['total_rech_amt_per_user'].mean()

In [23]:
high_val_customer = df.loc[df['total_rech_amt_per_user'] >= avg_amt_rech]
high_val_customer.head()

Unnamed: 0,cust_id,arpu_6,arpu_7,arpu_8,arpu_9,onnet_mou_6,onnet_mou_7,onnet_mou_8,onnet_mou_9,offnet_mou_6,offnet_mou_7,offnet_mou_8,offnet_mou_9,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_ic_mou_9,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,roam_og_mou_9,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2t_mou_9,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2m_mou_9,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2f_mou_9,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_t2c_mou_9,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,loc_og_mou_9,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2t_mou_9,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2m_mou_9,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2f_mou_9,std_og_mou_6,std_og_mou_7,std_og_mou_8,std_og_mou_9,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,isd_og_mou_9,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,spl_og_mou_9,og_others_6,og_others_7,og_others_8,og_others_9,total_og_mou_6,total_og_mou_7,total_og_mou_8,total_og_mou_9,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2t_mou_9,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2m_mou_9,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_t2f_mou_9,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,loc_ic_mou_9,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2t_mou_9,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2m_mou_9,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2f_mou_9,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,std_ic_mou_9,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,total_ic_mou_9,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,spl_ic_mou_9,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,isd_ic_mou_9,ic_others_6,ic_others_7,ic_others_8,ic_others_9,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_num_9,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,total_rech_amt_9,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,max_rech_amt_9,date_of_last_rech_6,date_of_l
ast_rech_7,date_of_last_rech_8,date_of_last_rech_9,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,last_day_rch_amt_9,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,date_of_last_rech_data_9,total_rech_data_6,total_rech_data_7,total_rech_data_8,total_rech_data_9,max_rech_data_6,max_rech_data_7,max_rech_data_8,max_rech_data_9,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_2g_9,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,count_rech_3g_9,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,av_rech_amt_data_9,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_2g_mb_9,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,vol_3g_mb_9,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_3g_9,arpu_2g_6,arpu_2g_7,arpu_2g_8,arpu_2g_9,night_pck_user_6,night_pck_user_7,night_pck_user_8,night_pck_user_9,monthly_2g_6,monthly_2g_7,monthly_2g_8,monthly_2g_9,sachet_2g_6,sachet_2g_7,sachet_2g_8,sachet_2g_9,monthly_3g_6,monthly_3g_7,monthly_3g_8,monthly_3g_9,sachet_3g_6,sachet_3g_7,sachet_3g_8,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,vbc_3g_8,vbc_3g_7,vbc_3g_6,vbc_3g_9,total_amt_6,total_amt_7,total_amt_8,total_amt_9,total_rech_amt_per_user
3,4,221.338,251.102,508.054,389.5,99.91,54.39,310.98,241.71,123.31,109.01,71.68,113.54,0.0,54.86,44.38,0.0,0.0,28.09,39.04,0.0,73.68,34.81,10.61,15.49,107.43,83.21,22.46,65.46,1.91,0.65,4.91,2.06,0.0,0.0,0.0,0.0,183.03,118.68,37.99,83.03,26.23,14.89,289.58,226.21,2.99,1.73,6.53,9.99,0.0,0.0,0.0,0.0,29.23,16.63,296.11,236.21,0.0,0.0,0.0,0.0,10.96,0.0,18.09,43.29,0.0,0.0,0.0,0.0,223.23,135.31,352.21,362.54,62.08,19.98,8.04,41.73,113.96,64.51,20.28,52.86,57.43,27.09,19.84,65.59,233.48,111.59,48.18,160.19,43.48,66.44,0.0,129.84,1.33,38.56,4.94,13.98,1.18,0.0,0.0,0.0,45.99,105.01,4.94,143.83,280.08,216.61,53.13,305.38,0.59,0.0,0.0,0.55,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8,10,11,18,14,230,310,601,410,60,50,50,50,2014-06-28,2014-07-31,2014-08-31,2014-09-30,30,50,50,30,NaT,NaT,NaT,NaT,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,2491,0.0,0.0,0.0,0.0,230.0,310.0,601.0,410.0,1551.0
7,8,1069.18,1349.85,3171.48,500.0,57.84,54.68,52.29,,453.43,567.16,325.91,,16.23,33.49,31.64,,23.74,12.59,38.06,,51.39,31.38,40.28,,308.63,447.38,162.28,,62.13,55.14,53.23,,0.0,0.0,0.0,,422.16,533.91,255.79,,4.3,23.29,12.01,,49.89,31.76,49.14,,6.66,20.08,16.68,,60.86,75.14,77.84,,0.0,0.18,10.01,,4.5,0.0,6.5,,0.0,0.0,0.0,,487.53,609.24,350.16,0.0,58.14,32.26,27.31,,217.56,221.49,121.19,,152.16,101.46,39.53,,427.88,355.23,188.04,,36.89,11.83,30.39,,91.44,126.99,141.33,,52.19,34.24,22.21,,180.54,173.08,193.94,,626.46,558.04,428.74,0.0,0.21,0.0,0.0,,2.06,14.53,31.59,,15.74,15.19,15.14,,5,5,7,3,1580,790,3638,0,1580,790,1580,0,2014-06-27,2014-07-25,2014-08-26,2014-09-30,0,0,779,0,NaT,NaT,NaT,NaT,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,802,57.74,19.38,18.74,0.0,1580.0,790.0,3638.0,0.0,6008.0
8,9,378.721,492.223,137.362,166.787,413.69,351.03,35.08,33.46,94.66,80.63,136.48,108.71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,297.13,217.59,12.49,26.13,80.96,70.58,50.54,34.58,0.0,0.0,0.0,0.0,0.0,0.0,7.15,0.0,378.09,288.18,63.04,60.71,116.56,133.43,22.58,7.33,13.69,10.04,75.69,74.13,0.0,0.0,0.0,0.0,130.26,143.48,98.28,81.46,0.0,0.0,0.0,0.0,0.0,0.0,10.23,0.0,0.0,0.0,0.0,0.0,508.36,431.66,171.56,142.18,23.84,9.84,0.31,4.03,57.58,13.98,15.48,17.34,0.0,0.0,0.0,0.0,81.43,23.83,15.79,21.38,0.0,0.58,0.1,0.0,22.43,4.08,0.65,13.53,0.0,0.0,0.0,0.0,22.43,4.66,0.75,13.53,103.86,28.49,16.54,34.91,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19,21,14,15,437,601,120,186,90,154,30,36,2014-06-25,2014-07-31,2014-08-30,2014-09-30,50,0,10,0,NaT,2014-07-31,2014-08-23,NaT,,2.0,3.0,,,154.0,23.0,,,2.0,3.0,,,0.0,0.0,,,177.0,69.0,,0.0,356.0,0.03,0.0,0.0,750.95,11.94,0.0,,0.0,19.83,,,0.0,0.0,,,0.0,0.0,,0,1,0,0,0,1,3,0,0,0,0,0,0,0,0,0,,1.0,1.0,,315,21.03,910.65,122.16,0.0,437.0,603.0,123.0,186.0,1349.0
13,14,492.846,205.671,593.26,322.732,501.76,108.39,534.24,244.81,413.31,119.28,482.46,214.06,23.53,144.24,72.11,136.78,7.98,35.26,1.44,12.78,49.63,6.19,36.01,6.14,151.13,47.28,294.46,108.24,4.54,0.0,23.51,5.29,0.0,0.0,0.49,0.0,205.31,53.48,353.99,119.69,446.41,85.98,498.23,230.38,255.36,52.94,156.94,96.01,0.0,0.0,0.0,0.0,701.78,138.93,655.18,326.39,0.0,0.0,1.29,0.0,0.0,0.0,4.78,0.0,0.0,0.0,0.0,0.0,907.09,192.41,1015.26,446.09,67.88,7.58,52.58,24.98,142.88,18.53,195.18,104.79,4.81,0.0,7.49,8.51,215.58,26.11,255.26,138.29,115.68,38.29,154.58,62.39,308.13,29.79,317.91,151.51,0.0,0.0,1.91,0.0,423.81,68.09,474.41,213.91,968.61,172.58,1144.53,631.86,0.45,0.0,0.0,0.0,245.28,62.11,393.39,259.33,83.48,16.24,21.44,20.31,6,4,11,7,507,253,717,353,110,110,130,130,2014-06-20,2014-07-22,2014-08-30,2014-09-26,110,50,0,0,NaT,NaT,2014-08-30,NaT,,,3.0,,,,23.0,,,,3.0,,,,0.0,,,,69.0,,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,,,0.0,,,,0.2,,,,0.0,,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,,,1.0,,2607,0.0,0.0,0.0,0.0,507.0,253.0,720.0,353.0,1833.0
15,16,31.0,510.465,590.643,510.39,,246.56,280.31,289.79,,839.58,1011.91,642.14,,0.0,0.0,0.0,,0.88,0.0,0.0,,16.13,44.79,48.33,,38.99,92.53,158.11,,5.13,4.83,8.43,,9.78,0.0,0.03,,60.26,142.16,214.88,,230.43,235.51,241.46,,775.66,914.54,475.56,,0.0,0.0,0.0,,1006.09,1150.06,717.03,,0.0,0.0,0.0,,18.89,0.0,0.03,,0.0,0.0,0.0,0.0,1085.26,1292.23,931.94,,16.91,16.19,14.91,,36.94,45.76,50.01,,7.24,9.51,15.54,,61.11,71.48,80.48,,1.76,2.68,30.16,,40.06,14.31,82.46,,0.0,0.0,0.0,,41.83,16.99,112.63,0.0,105.86,89.71,198.28,,0.0,0.61,1.01,,0.0,0.0,0.0,,2.91,0.61,4.14,1,13,11,8,0,686,696,556,0,110,130,130,2014-06-14,2014-07-28,2014-08-30,2014-09-29,0,110,130,0,NaT,NaT,NaT,NaT,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,270,0.0,0.0,0.0,0.0,0.0,686.0,696.0,556.0,1938.0


In [24]:
print("Total number of high value customers: ",len(high_val_customer))

Total number of high value customers:  34958


# 4 Creating Target Variable

In this step, I will tag churned customers with a binary target variable, churn (Yes = 1, No = 0), based on the fourth month (September) of operator T's data.


## 4.1 Find out inactive customers

During the churn period, those who have not made any calls (incoming or outgoing) and have not utilized mobile internet even once are considered to be inactive. The following attributes must be used to identify churners:

* total_ic_mou_6
* total_ic_mou_7
* total_ic_mou_8
* total_ic_mou_9
* total_og_mou_6
* total_og_mou_7
* total_og_mou_8
* total_og_mou_9
* vol_2g_mb_6
* vol_2g_mb_7
* vol_2g_mb_8
* vol_2g_mb_9
* vol_3g_mb_6
* vol_3g_mb_7
* vol_3g_mb_8
* vol_3g_mb_9

In [25]:
# The month of September is used to screen out churned customers.

df['churn'] = df.apply(lambda x: 1 if (x.total_ic_mou_9 == 0 and x.total_og_mou_9 == 0 and x.vol_2g_mb_9 ==0 and x.vol_3g_mb_9==0) else 0, axis=1)
df['churn'] = df['churn'].astype("str")
df.shape

(99999, 216)
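
The row-wise `apply` above evaluates a Python lambda once per customer, which can be slow on roughly 100,000 rows. A minimal vectorized sketch that should give the same labels is shown below on a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the September usage columns (values are made up).
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0, np.nan],
    "total_og_mou_9": [0.0, 0.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 3.2, 0.0],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# A customer is tagged as churned only if every usage column is exactly zero.
# NaN == 0 evaluates to False, which matches the behaviour of the lambda version.
toy["churn"] = (toy[usage_cols] == 0).all(axis=1).astype(int)
print(toy["churn"].tolist())  # [1, 0, 0, 0]
```

Note that a row with a missing usage value is labeled 0 (not churned) under both versions, because a comparison with NaN is never true.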

In [26]:
# number of churned customers
df['churn'].value_counts()

0    89808
1    10191
Name: churn, dtype: int64

In [27]:
#what's the % of churned customers
print("The Percentage of churned customers is:" , round(100*(df.churn.astype("int").sum()/len(df)),2))

The Percentage of churned customers is: 10.19


## 4.2 After labeling churners, delete all attributes belonging to the churn phase (those with '9' in their names)

In [28]:
col_9 = [i for i in df.columns if '9' in i]
df = df.drop(col_9,axis=1)
df.shape

(99999, 163)
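
The test `'9' in i` matches a 9 anywhere in a column name, not only the month suffix. A stricter variant matches the `_9` suffix instead. The sketch below uses a made-up set of column names; `plan_1900` is a hypothetical column invented to show the difference:

```python
import pandas as pd

# Hypothetical column names; "plan_1900" is invented for illustration.
toy = pd.DataFrame(columns=["arpu_8", "arpu_9", "vbc_3g_9", "plan_1900", "churn"])

# Substring match: also catches "plan_1900".
loose = [c for c in toy.columns if "9" in c]

# Suffix match: only columns that end in "_9".
strict = [c for c in toy.columns if c.endswith("_9")]

print(loose)   # ['arpu_9', 'vbc_3g_9', 'plan_1900']
print(strict)  # ['arpu_9', 'vbc_3g_9']

toy = toy.drop(columns=strict)
```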

In [29]:
# let's update our categorical column list
cat_cols = [ele for ele in cat_cols if ele not in col_9]
cat_cols

['night_pck_user_6',
 'night_pck_user_7',
 'night_pck_user_8',
 'fb_user_6',
 'fb_user_7',
 'fb_user_8']

In [30]:
df.info(verbose=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 163 columns):
 #    Column                    Dtype         
---   ------                    -----         
 0    cust_id                   int64         
 1    arpu_6                    float64       
 2    arpu_7                    float64       
 3    arpu_8                    float64       
 4    onnet_mou_6               float64       
 5    onnet_mou_7               float64       
 6    onnet_mou_8               float64       
 7    offnet_mou_6              float64       
 8    offnet_mou_7              float64       
 9    offnet_mou_8              float64       
 10   roam_ic_mou_6             float64       
 11   roam_ic_mou_7             float64       
 12   roam_ic_mou_8             float64       
 13   roam_og_mou_6             float64       
 14   roam_og_mou_7             float64       
 15   roam_og_mou_8             float64       
 16   loc_og_t2t_mou_6          float64     

In [31]:
df.sample(2)

Unnamed: 0,cust_id,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_mou_6,std_og_mou_7,std_og_mou_8,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,og_others_6,og_others_7,og_others_8,total_og_mou_6,total_og_mou_7,total_og_mou_8,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,ic_others_6,ic_others_7,ic_others_8,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,total_rech_data_6,total_rech_data_7,total_rech_data_8,max_rech_data_6,max_rech_data_7,max_rech_data_8,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_
8,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_2g_6,arpu_2g_7,arpu_2g_8,night_pck_user_6,night_pck_user_7,night_pck_user_8,monthly_2g_6,monthly_2g_7,monthly_2g_8,sachet_2g_6,sachet_2g_7,sachet_2g_8,monthly_3g_6,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,vbc_3g_8,vbc_3g_7,vbc_3g_6,total_amt_6,total_amt_7,total_amt_8,total_rech_amt_per_user,churn
50018,50019,2.5,27.368,172.104,0.0,36.74,431.79,4.03,1.64,33.51,0.1,6.79,0.0,4.03,38.39,0.0,0.0,0.0,12.76,0.0,0.0,25.98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.74,0.0,0.0,419.03,0.0,0.0,7.53,0.0,0.0,0.0,0.0,0.0,426.56,0.0,0.0,0.0,0.0,0.0,1.93,0.0,0.0,0.0,0.0,0.0,467.24,0.0,0.0,14.68,0.0,0.0,51.41,0.0,0.0,0.0,0.0,0.0,66.09,0.0,0.0,2.59,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.59,0.0,0.0,68.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,5,9,0,30,305,0,30,130,2014-06-18,2014-07-15,2014-08-29,0,0,130,NaT,NaT,NaT,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,1049,0.0,0.0,0.0,0.0,30.0,305.0,815.0,0
52686,52687,921.935,702.293,641.774,341.11,68.41,102.89,518.74,552.58,494.26,0.0,0.0,10.04,0.0,0.0,0.48,66.91,53.34,54.39,421.21,398.28,307.01,0.0,0.0,0.0,0.0,0.0,2.23,488.13,451.63,361.41,274.19,15.06,48.49,97.53,154.29,186.76,0.0,0.0,0.0,371.73,169.36,235.26,0.0,0.0,0.0,0.0,0.0,2.23,0.0,0.0,0.0,859.86,620.99,598.91,133.99,71.13,74.29,515.81,589.69,481.04,1.56,3.03,9.01,651.38,663.86,564.36,112.63,25.14,15.04,18.39,36.66,26.96,0.4,0.9,0.0,131.43,62.71,42.01,782.81,726.68,606.78,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.4,41,32,28,1060,783,740,30,50,50,2014-06-30,2014-07-31,2014-08-30,30,10,30,NaT,NaT,NaT,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,1470,0.0,0.0,0.0,1060.0,783.0,740.0,3663.0,0


# 5 Developing new features

In [32]:
# AON: age on network
# Convert AON to years
df['aon_yr'] = round((df['aon']/365),1)

In [33]:
df.drop('aon', axis=1, inplace=True)
df.sample(2)

Unnamed: 0,cust_id,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_mou_6,std_og_mou_7,std_og_mou_8,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,og_others_6,og_others_7,og_others_8,total_og_mou_6,total_og_mou_7,total_og_mou_8,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,ic_others_6,ic_others_7,ic_others_8,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,total_rech_data_6,total_rech_data_7,total_rech_data_8,max_rech_data_6,max_rech_data_7,max_rech_data_8,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_
8,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_2g_6,arpu_2g_7,arpu_2g_8,night_pck_user_6,night_pck_user_7,night_pck_user_8,monthly_2g_6,monthly_2g_7,monthly_2g_8,sachet_2g_6,sachet_2g_7,sachet_2g_8,monthly_3g_6,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,vbc_3g_8,vbc_3g_7,vbc_3g_6,total_amt_6,total_amt_7,total_amt_8,total_rech_amt_per_user,churn,aon_yr
7965,7966,273.524,494.252,422.562,95.39,136.28,142.46,225.93,417.94,492.91,11.84,0.0,0.0,27.91,0.0,0.0,77.31,121.44,122.63,183.49,387.36,431.68,0.0,17.08,3.63,0.0,0.0,0.0,260.81,525.89,557.94,7.23,14.83,19.83,25.36,13.49,55.84,0.0,0.0,0.0,32.59,28.33,75.68,0.0,0.0,0.0,0.0,0.0,1.75,0.0,0.0,0.0,293.41,554.23,635.38,80.46,128.13,134.43,183.89,280.13,286.69,33.68,24.71,30.28,298.04,432.98,451.41,0.0,14.58,5.41,2.01,0.0,4.11,0.56,1.18,3.28,2.58,15.76,12.81,300.63,448.74,464.23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,9,5,420,603,437,120,130,144,2014-06-29,2014-07-30,2014-08-27,110,130,144,NaT,NaT,NaT,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,0.0,0.0,0.0,420.0,603.0,437.0,2028.0,0,3.5
79139,79140,355.728,235.28,58.202,77.24,6.51,6.99,272.08,225.36,35.21,0.0,0.0,0.0,0.0,0.0,0.0,77.24,6.51,6.99,270.94,225.36,35.21,1.13,0.0,0.0,0.0,0.0,0.0,349.33,231.88,42.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.36,1.34,3.71,0.0,0.0,353.04,232.24,43.56,15.28,20.03,10.31,156.54,413.83,81.18,2.85,10.88,5.59,174.68,444.74,97.09,0.0,0.0,0.0,0.0,1.21,0.0,2.76,3.18,0.0,2.76,4.39,0.0,180.01,449.74,97.78,0.0,0.0,0.0,0.0,0.0,0.0,2.56,0.59,0.68,13,11,2,390,260,80,50,50,50,2014-06-30,2014-07-28,2014-08-27,30,30,50,NaT,NaT,NaT,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,0.0,0.0,0.0,390.0,260.0,80.0,760.0,0,0.9


Let's bin the age-on-network column, which indicates the number of years a customer has been using the operator T network.

In [34]:
age_range = [ 0,  2,  4,  6,  8, 10, 12]
age_bin = [ 1, 2, 3, 4, 5, 6]
df['age_group'] = pd.cut(df['aon_yr'], age_range, labels=age_bin)
df['age_group'] = df['age_group'].astype(str)
df['age_group'].head()

0    2
1    2
2    2
3    4
4    3
Name: age_group, dtype: object
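
`pd.cut` uses right-closed intervals by default, so with these edges a tenure of exactly 0 years, or one above 12, falls outside every bin and becomes NaN (which `astype(str)` then turns into the string `'nan'`). A small sketch on made-up tenures shows the default behaviour next to `include_lowest=True`:

```python
import pandas as pd

aon_yr = pd.Series([0.0, 0.5, 2.0, 2.1, 11.9, 12.5])  # made-up tenures in years
age_range = [0, 2, 4, 6, 8, 10, 12]
age_bin = [1, 2, 3, 4, 5, 6]

# Default: intervals are (0, 2], (2, 4], ..., (10, 12]; 0.0 and 12.5 get no bin.
default_bins = pd.cut(aon_yr, age_range, labels=age_bin)

# include_lowest=True closes the first interval on the left, so 0.0 lands in bin 1.
with_lowest = pd.cut(aon_yr, age_range, labels=age_bin, include_lowest=True)

print(default_bins.isna().tolist())
print(with_lowest.isna().tolist())
```

Whether this matters depends on whether the rounded `aon_yr` values in the real data ever hit 0.0 or exceed 12, which is worth checking before relying on the bins.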

In [35]:
# let us update our categorical column list
cat_cols.append('age_group')
cat_cols

['night_pck_user_6',
 'night_pck_user_7',
 'night_pck_user_8',
 'fb_user_6',
 'fb_user_7',
 'fb_user_8',
 'age_group']

# 6 Handling Missing values

## 6.1 Checking the null values

In [36]:
null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

arpu_2g_6                   74.85
fb_user_6                   74.85
arpu_3g_6                   74.85
date_of_last_rech_data_6    74.85
night_pck_user_6            74.85
total_rech_data_6           74.85
max_rech_data_6             74.85
count_rech_2g_6             74.85
av_rech_amt_data_6          74.85
count_rech_3g_6             74.85
arpu_2g_7                   74.43
count_rech_3g_7             74.43
fb_user_7                   74.43
date_of_last_rech_data_7    74.43
max_rech_data_7             74.43
count_rech_2g_7             74.43
total_rech_data_7           74.43
arpu_3g_7                   74.43
av_rech_amt_data_7          74.43
night_pck_user_7            74.43
count_rech_3g_8             73.66
night_pck_user_8            73.66
arpu_2g_8                   73.66
fb_user_8                   73.66
date_of_last_rech_data_8    73.66
total_rech_data_8           73.66
arpu_3g_8                   73.66
av_rech_amt_data_8          73.66
max_rech_data_8             73.66
count_rech_2g_

`Observation`: Several features in this dataset have up to 74.85% missing values.
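
The null-percentage expression is repeated several times in this section. A small helper can wrap it; the function name `missing_pct` is hypothetical and the frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

def missing_pct(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first,
    omitting columns that have no missing values."""
    pct = (frame.isna().mean() * 100).round(2)
    return pct[pct > 0].sort_values(ascending=False)

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, 4],   # 50% missing
    "b": [1, 2, 3, 4],             # complete
    "c": [np.nan, 2, 3, 4],        # 25% missing
})
print(missing_pct(toy))
```

`frame.isna().mean()` gives the same fraction as `isnull().sum() / len(frame.index)`, since the mean of a boolean column is the share of True values.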

## 6.2 Imputation of missing values

In [37]:
#Observe missing values in recharge columns 
rech_col = [i for i in df.columns if 'rech' in i]
rech_6_col = [i for i in rech_col if '6' in i]
rech_7_col = [i for i in rech_col if '7' in i]
rech_8_col = [i for i in rech_col if '8' in i]

In [38]:
#Observe missing values in recharge columns in the month of june(6)
rech_6 = pd.DataFrame(df[rech_6_col])

#adding some other columns describing data usage of customer in june(6)
vol_col = df[["vol_2g_mb_6",'vol_3g_mb_6']]

rech_6 = pd.concat([rech_6,vol_col], axis = 1) 
rech_6.sample(2)

Unnamed: 0,total_rech_num_6,total_rech_amt_6,max_rech_amt_6,date_of_last_rech_6,date_of_last_rech_data_6,total_rech_data_6,max_rech_data_6,count_rech_2g_6,count_rech_3g_6,av_rech_amt_data_6,vol_2g_mb_6,vol_3g_mb_6
16732,10,984,252,2014-06-28,2014-06-28,6.0,252.0,3.0,3.0,731.0,0.0,2143.47
29362,4,330,110,2014-06-25,NaT,,,,,,0.0,0.0


`Observation`: In the table above, the values for max_rech_data, count_rech_2g, and count_rech_3g are missing whenever the date of the last data recharge is absent. The mobile internet usage (2G and 3G volume) for these rows is zero. Since these customers made no data recharge, I will impute the missing values with zero.
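
Two sampled rows are thin evidence for this pattern. One way to check it across all rows is sketched below on made-up data that reuses the June column names; the same checks could be run on `df` itself:

```python
import numpy as np
import pandas as pd

# Made-up rows using the June column names from the table above.
toy = pd.DataFrame({
    "date_of_last_rech_data_6": pd.to_datetime(["2014-06-28", None, None]),
    "max_rech_data_6":  [252.0, np.nan, np.nan],
    "count_rech_2g_6":  [3.0, np.nan, np.nan],
    "count_rech_3g_6":  [3.0, np.nan, np.nan],
    "vol_2g_mb_6":      [0.0, 0.0, 0.0],
    "vol_3g_mb_6":      [2143.47, 0.0, 0.0],
})

no_data_rech = toy["date_of_last_rech_data_6"].isna()
rech_fields = ["max_rech_data_6", "count_rech_2g_6", "count_rech_3g_6"]

# Claim 1: the recharge fields are missing exactly when the date is missing.
same_pattern = toy[rech_fields].isna().eq(no_data_rech, axis=0).all().all()

# Claim 2: rows without a data recharge show zero 2G and 3G volume.
zero_usage = (toy.loc[no_data_rech, ["vol_2g_mb_6", "vol_3g_mb_6"]] == 0).all().all()

print(bool(same_pattern), bool(zero_usage))
```

If either check came back False on the real data, imputing zero for every missing value would need a second look.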

In [39]:
#Similarly, observe missing values in recharge columns for the month of july(7)
rech_7_col

['total_rech_num_7',
 'total_rech_amt_7',
 'max_rech_amt_7',
 'date_of_last_rech_7',
 'date_of_last_rech_data_7',
 'total_rech_data_7',
 'max_rech_data_7',
 'count_rech_2g_7',
 'count_rech_3g_7',
 'av_rech_amt_data_7']

In [40]:
rech_7 = pd.DataFrame(df[rech_7_col])

#adding some other columns describing data usage of customer in july
vol_col = df[["vol_2g_mb_7",'vol_3g_mb_7']]

rech_7 = pd.concat([rech_7,vol_col], axis = 1) 
rech_7.sample(2)

Unnamed: 0,total_rech_num_7,total_rech_amt_7,max_rech_amt_7,date_of_last_rech_7,date_of_last_rech_data_7,total_rech_data_7,max_rech_data_7,count_rech_2g_7,count_rech_3g_7,av_rech_amt_data_7,vol_2g_mb_7,vol_3g_mb_7
77702,7,462,250,2014-07-31,2014-07-31,2.0,145.0,0.0,2.0,290.0,50.02,180.82
14632,9,140,50,2014-07-29,NaT,,,,,,0.0,0.0


`Observation`: As in June, the values for max_rech_data, count_rech_2g, and count_rech_3g are missing whenever the date of the last data recharge is absent, and the corresponding mobile internet usage (2G and 3G volume) is zero. Since the customer made no data recharge, I will replace these missing values with zero.

In [41]:
#Similarly,observe missing values in recharge columns in the month of August(8)
rech_8_col

['total_rech_num_8',
 'total_rech_amt_8',
 'max_rech_amt_8',
 'date_of_last_rech_8',
 'date_of_last_rech_data_8',
 'total_rech_data_8',
 'max_rech_data_8',
 'count_rech_2g_8',
 'count_rech_3g_8',
 'av_rech_amt_data_8']

In [42]:
rech_8 = pd.DataFrame(df[rech_8_col])

# adding some other columns describing data usage of customer in August
vol_col = df[["vol_2g_mb_8",'vol_3g_mb_8']]

rech_8 = pd.concat([rech_8,vol_col], axis = 1) 
rech_8.sample(2)

Unnamed: 0,total_rech_num_8,total_rech_amt_8,max_rech_amt_8,date_of_last_rech_8,date_of_last_rech_data_8,total_rech_data_8,max_rech_data_8,count_rech_2g_8,count_rech_3g_8,av_rech_amt_data_8,vol_2g_mb_8,vol_3g_mb_8
31352,5,55,30,2014-08-28,2014-08-28,2.0,25.0,2.0,0.0,50.0,4.96,116.02
87300,9,191,150,2014-08-30,2014-08-23,5.0,52.0,0.0,5.0,260.0,8.93,5148.32


`Observation`: As in June and July, the values for max_rech_data, count_rech_2g, and count_rech_3g are missing when the date of the last data recharge is absent, and the corresponding mobile internet usage (2G and 3G volume) is zero. Since the customer made no data recharge, let's impute these missing values with zero.

## 6.3 Filling missing values with zero: numerical features

In [43]:
# Let's use zero for the missing recharge column data.

impute_0 = [ 'date_of_last_rech_data_6','max_rech_data_6','count_rech_2g_6','count_rech_3g_6',
           'date_of_last_rech_data_7','max_rech_data_7','count_rech_2g_7','count_rech_3g_7',
           'date_of_last_rech_data_8','max_rech_data_8','count_rech_2g_8','count_rech_3g_8']

df[impute_0] = df[impute_0].apply(lambda x: x.fillna(0))
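
The list above also contains the `date_of_last_rech_data_*` columns, so filling them with 0 mixes integers into date columns. If the dates need to stay as datetimes, one option (an assumption on my part, not what this notebook does) is to restrict the zero fill to the numeric columns. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date_of_last_rech_data_6": pd.to_datetime(["2014-06-28", None]),
    "max_rech_data_6": [252.0, np.nan],
    "count_rech_2g_6": [3.0, np.nan],
})

impute_0 = ["date_of_last_rech_data_6", "max_rech_data_6", "count_rech_2g_6"]

# Keep only the numeric columns from the list; datetime columns are left as NaT.
numeric_cols = toy[impute_0].select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(0)

print(toy.dtypes)
```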

In [44]:
# Now Checking Null values
null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

fb_user_6              74.85
arpu_2g_6              74.85
total_rech_data_6      74.85
av_rech_amt_data_6     74.85
arpu_3g_6              74.85
night_pck_user_6       74.85
fb_user_7              74.43
total_rech_data_7      74.43
arpu_2g_7              74.43
night_pck_user_7       74.43
arpu_3g_7              74.43
av_rech_amt_data_7     74.43
arpu_3g_8              73.66
arpu_2g_8              73.66
night_pck_user_8       73.66
fb_user_8              73.66
total_rech_data_8      73.66
av_rech_amt_data_8     73.66
std_og_t2m_mou_8        5.38
std_og_mou_8            5.38
std_og_t2f_mou_8        5.38
std_ic_t2m_mou_8        5.38
std_ic_mou_8            5.38
std_ic_t2f_mou_8        5.38
loc_ic_t2f_mou_8        5.38
std_ic_t2t_mou_8        5.38
isd_og_mou_8            5.38
std_og_t2t_mou_8        5.38
loc_ic_mou_8            5.38
spl_og_mou_8            5.38
og_others_8             5.38
loc_ic_t2m_mou_8        5.38
spl_ic_mou_8            5.38
loc_og_t2c_mou_8        5.38
loc_og_t2m_mou

## 6.4 Imputing Categorical columns

In [45]:
# Impute missing values in the categorical columns with -1.
df[cat_cols] = df[cat_cols].apply(lambda x: x.fillna(-1)) 
df[cat_cols] = df[cat_cols].astype('str')
df[cat_cols].sample(2)

Unnamed: 0,night_pck_user_6,night_pck_user_7,night_pck_user_8,fb_user_6,fb_user_7,fb_user_8,age_group
61726,0.0,0.0,0.0,1.0,1.0,1.0,2
53569,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,3


In [46]:
# Now Checking Null values
null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

total_rech_data_6      74.85
arpu_2g_6              74.85
av_rech_amt_data_6     74.85
arpu_3g_6              74.85
total_rech_data_7      74.43
av_rech_amt_data_7     74.43
arpu_3g_7              74.43
arpu_2g_7              74.43
total_rech_data_8      73.66
arpu_2g_8              73.66
arpu_3g_8              73.66
av_rech_amt_data_8     73.66
loc_ic_mou_8            5.38
std_og_t2t_mou_8        5.38
std_ic_t2t_mou_8        5.38
loc_og_mou_8            5.38
std_og_t2m_mou_8        5.38
std_ic_t2m_mou_8        5.38
loc_ic_t2f_mou_8        5.38
std_og_t2f_mou_8        5.38
std_og_mou_8            5.38
loc_og_t2c_mou_8        5.38
loc_ic_t2m_mou_8        5.38
isd_og_mou_8            5.38
loc_ic_t2t_mou_8        5.38
std_ic_t2f_mou_8        5.38
loc_og_t2f_mou_8        5.38
std_ic_mou_8            5.38
loc_og_t2m_mou_8        5.38
onnet_mou_8             5.38
offnet_mou_8            5.38
roam_ic_mou_8           5.38
roam_og_mou_8           5.38
loc_og_t2t_mou_8        5.38
ic_others_8   

## 6.5 Dropping columns with more than 40% missing values

More than 40% of the values are missing for the 2G and 3G average revenue per user and for the data recharge totals (total_rech_data, av_rech_amt_data) in both the good and action phases. Since this analysis identifies customers likely to churn based on usage alone, I can safely drop these columns without compromising it.

In [47]:
miss = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False) 

miss = pd.DataFrame(miss[miss >= 40])
threshold_col = miss.index

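# Pass the columns by keyword; the positional axis argument is not accepted by recent pandas versions.
df = df.drop(columns=threshold_col)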

In [48]:
df.shape

(99999, 152)
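
The threshold-based drop can be written compactly with `isna().mean()`. The sketch below uses a made-up frame; only the 40% cut-off comes from the text:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "arpu_3g_6":   [np.nan, np.nan, np.nan, 10.0],  # 75% missing
    "onnet_mou_8": [1.0, np.nan, 3.0, 4.0],         # 25% missing
    "cust_id":     [1, 2, 3, 4],                    # complete
})

threshold = 40  # percent, same cut-off as in the text
missing_share = toy.isna().mean() * 100
to_drop = missing_share[missing_share >= threshold].index

toy = toy.drop(columns=to_drop)
print(list(toy.columns))  # ['onnet_mou_8', 'cust_id']
```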

## 6.6 Imputing missing values for outgoing calls for the months of June, July, and August

#### For Month of June

In [49]:
# Based on domain knowledge, total_og_mou is the sum of local, STD, ISD, special, and other outgoing minutes.
og_call_6 = ['loc_og_mou_6','std_og_mou_6','isd_og_mou_6','spl_og_mou_6','og_others_6','total_og_mou_6']
total_og_6 = df[og_call_6]

#limiting the filtering to clients who have made no outbound calls.
total_og_6.loc[total_og_6['total_og_mou_6']==0].sample(2)

Unnamed: 0,loc_og_mou_6,std_og_mou_6,isd_og_mou_6,spl_og_mou_6,og_others_6,total_og_mou_6
89766,0.0,0.0,0.0,0.0,0.0,0.0
74278,0.0,0.0,0.0,0.0,0.0,0.0


`Observation`: Based on my understanding of the business domain and the preceding table, when the total outgoing minutes for a given month are 0, no outgoing calls of any kind were made. I am going to replace the missing values for local, STD, ISD, special, and other outgoing calls with 0.

In [50]:
#Let's see whether our assumption is correct or not: the sum of all outgoing calls equals the 'total_og_mou_6'.

df['outgoing_total_6'] = df['loc_og_mou_6']+ df['std_og_mou_6']+df['isd_og_mou_6']+df['spl_og_mou_6']+df['og_others_6']
df[['outgoing_total_6','total_og_mou_6']].dropna().corr()

Unnamed: 0,outgoing_total_6,total_og_mou_6
outgoing_total_6,1.0,1.0
total_og_mou_6,1.0,1.0


`Observation`: The correlation of 1.0, computed on rows with no missing components, is consistent with 'total_og_mou_6' being the sum of its component columns. Thus, when the total outgoing usage for a given month is 0, no outgoing calls were made. As a result, I am going to replace the missing values for local, standard, ISD, special, and other outgoing calls with 0.
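A correlation of 1.0 shows a perfect linear relationship, which is slightly weaker than equality. As a hedged sketch on hypothetical toy values, a direct comparison with `np.isclose` can complement the correlation check; the tolerance of 0.05 is an assumption chosen to absorb two-decimal rounding in the usage columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the "total" is deliberately twice the sum of the parts.
toy = pd.DataFrame({
    "loc_og_mou_6":   [1.0, 2.0, 3.0],
    "std_og_mou_6":   [1.0, 2.0, 3.0],
    "total_og_mou_6": [4.0, 8.0, 12.0],
})

parts_sum = toy["loc_og_mou_6"] + toy["std_og_mou_6"]

# The correlation is still 1.0, because the total is a linear function of the sum.
corr = parts_sum.corr(toy["total_og_mou_6"])

# A direct element-wise comparison exposes the mismatch.
matches = np.isclose(parts_sum, toy["total_og_mou_6"], atol=0.05)
share_matching = matches.mean()
```

On the real data, a `share_matching` close to 1 would support the sum assumption more directly than the correlation alone.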

In [51]:
#Imputing by '0'

df[og_call_6] = df[og_call_6].fillna(0)

### local outgoing calls(T2T,T2M,T2F & T2C)

In [52]:
og_loc_6 = ['loc_og_t2t_mou_6','loc_og_t2m_mou_6','loc_og_t2f_mou_6','loc_og_t2c_mou_6','loc_og_mou_6']
loc_og_6 = df[og_loc_6]

# filtering only those clients who didn't make any 'local outgoing calls'
loc_og_6.loc[loc_og_6 ['loc_og_mou_6']==0].sample(2)

Unnamed: 0,loc_og_t2t_mou_6,loc_og_t2m_mou_6,loc_og_t2f_mou_6,loc_og_t2c_mou_6,loc_og_mou_6
16942,0.0,0.0,0.0,0.0,0.0
56668,,,,,0.0


Local outgoing usage in any given month equals the sum of all local outgoing call types, i.e. T2T, T2M, T2F, and T2C.

In the sampled rows, the monthly local outgoing total is zero while the T2T, T2M, T2F, and T2C values are either zero or missing, so I will replace the missing values with zero.

In [53]:
#Let's verify the assumption that the sum of all local outgoing calls
#equals the total local outgoing column, "loc_og_mou_6"

df['loc_outgoing_total_6'] = df['loc_og_t2t_mou_6']+ df['loc_og_t2m_mou_6']+df['loc_og_t2f_mou_6']+df['loc_og_t2c_mou_6']
df[['loc_outgoing_total_6','loc_og_mou_6']].dropna().corr().round()

Unnamed: 0,loc_outgoing_total_6,loc_og_mou_6
loc_outgoing_total_6,1.0,1.0
loc_og_mou_6,1.0,1.0


`Observation:` The correlation in the table above is consistent with 'loc_og_mou_6' being the sum of its components. Thus, when the total local outgoing usage for a given month is 0, no local outgoing calls were made. As a result, I am going to replace the missing values for T2T, T2M, T2F, and T2C with zero.

In [54]:
# Imputing by '0'
df[og_loc_6] = df[og_loc_6].fillna(0)

#### total standard outgoing calls 

In [55]:
og_std_6 = ['std_og_t2t_mou_6','std_og_t2m_mou_6','std_og_t2f_mou_6','std_og_mou_6']
std_og_6 = df[og_std_6]
std_og_6.sample(5)

Unnamed: 0,std_og_t2t_mou_6,std_og_t2m_mou_6,std_og_t2f_mou_6,std_og_mou_6
82107,0.0,8.91,0.0,8.91
71003,0.0,0.0,0.0,0.0
11283,0.0,44.96,0.0,44.96
71242,0.0,8.74,0.0,8.74
19744,0.0,2.5,0.0,2.5


In [56]:
# Imputing by '0'
df[og_std_6] = df[og_std_6].fillna(0)

#### For Month of July

The same process used for June is applied here.

#### Total Outgoing Call

In [57]:
# From the business domain knowledge, we know that total_og column is the addition of local , standard, special and other outgoing calls
og_call_7 = ['loc_og_mou_7','std_og_mou_7','isd_og_mou_7','spl_og_mou_7','og_others_7','total_og_mou_7']
total_og_7 = df[og_call_7]
# filtering only those clients who have made no outgoing calls
total_og_7.loc[total_og_7['total_og_mou_7']==0].sample(2)

Unnamed: 0,loc_og_mou_7,std_og_mou_7,isd_og_mou_7,spl_og_mou_7,og_others_7,total_og_mou_7
54864,0.0,0.0,0.0,0.0,0.0,0.0
50682,,,,,,0.0


In [58]:
#Let's verify the assumption that the sum of all outgoing calls equals total_og_mou_7
df['outgoing_total_7'] = df['loc_og_mou_7']+ df['std_og_mou_7']+df['isd_og_mou_7']+df['spl_og_mou_7']+df['og_others_7']
df[['outgoing_total_7','total_og_mou_7']].dropna().corr()

Unnamed: 0,outgoing_total_7,total_og_mou_7
outgoing_total_7,1.0,1.0
total_og_mou_7,1.0,1.0


In [59]:
# Imputing by '0'
df[og_call_7] = df[og_call_7].fillna(0)

In [60]:
og_loc_7 = ['loc_og_t2t_mou_7','loc_og_t2m_mou_7','loc_og_t2f_mou_7','loc_og_t2c_mou_7','loc_og_mou_7']
loc_og_7 = df[og_loc_7]

# filtering only those clients who didn't make any local outgoing calls
loc_og_7.loc[loc_og_7 ['loc_og_mou_7']==0].sample(3)

Unnamed: 0,loc_og_t2t_mou_7,loc_og_t2m_mou_7,loc_og_t2f_mou_7,loc_og_t2c_mou_7,loc_og_mou_7
19509,0.0,0.0,0.0,0.0,0.0
37097,,,,,0.0
40110,,,,,0.0


#### T2T,T2M,T2F & T2C

In [61]:
#Let's verify the assumption that the sum of all local outgoing calls
#equals the total local outgoing column, "loc_og_mou_7"

df['loc_outgoing_total_7'] = df['loc_og_t2t_mou_7']+ df['loc_og_t2m_mou_7']+df['loc_og_t2f_mou_7']+df['loc_og_t2c_mou_7']
df[['loc_outgoing_total_7','loc_og_mou_7']].dropna().corr().round()

Unnamed: 0,loc_outgoing_total_7,loc_og_mou_7
loc_outgoing_total_7,1.0,1.0
loc_og_mou_7,1.0,1.0


In [62]:
# Imputing by '0'
df[og_loc_7] = df[og_loc_7].fillna(0)

In [63]:
og_std_7 = ['std_og_t2t_mou_7','std_og_t2m_mou_7','std_og_t2f_mou_7','std_og_mou_7']
std_og_7 = df[og_std_7]
std_og_7.sample(3)

Unnamed: 0,std_og_t2t_mou_7,std_og_t2m_mou_7,std_og_t2f_mou_7,std_og_mou_7
22466,18.58,0.0,0.0,18.58
36947,0.0,16.93,0.0,16.93
52202,106.14,25.48,5.28,136.91


#### Total standard outgoing calls 

In [64]:
#Let's verify the assumption that the sum of all std outgoing calls
#equals the total std outgoing column, "std_og_mou_7"

df['std_outgoing_total_7'] = df['std_og_t2t_mou_7']+ df['std_og_t2m_mou_7']+df['std_og_t2f_mou_7']
df[['std_outgoing_total_7','std_og_mou_7']].dropna().corr()

Unnamed: 0,std_outgoing_total_7,std_og_mou_7
std_outgoing_total_7,1.0,1.0
std_og_mou_7,1.0,1.0


In [65]:
# Imputing by '0'
df[og_std_7] = df[og_std_7].fillna(0)

#### For Month of August

The same process used for June is applied here.

#### Total Outgoing Call

In [66]:
# From the business domain knowledge, we know that total_og column is the addition of local , standard, special and other outgoing calls
og_call_8 = ['loc_og_mou_8','std_og_mou_8','isd_og_mou_8','spl_og_mou_8','og_others_8','total_og_mou_8']
total_og_8 = df[og_call_8]
# filtering only those clients who have made no outgoing calls
total_og_8.loc[total_og_8['total_og_mou_8']==0].sample(3)

Unnamed: 0,loc_og_mou_8,std_og_mou_8,isd_og_mou_8,spl_og_mou_8,og_others_8,total_og_mou_8
6291,0.0,0.0,0.0,0.0,0.0,0.0
74175,,,,,,0.0
68723,0.0,0.0,0.0,0.0,0.0,0.0


In [67]:
#Let's verify the assumption that the sum of
#all outgoing calls equals total_og_mou_8

df['outgoing_total_8'] = df['loc_og_mou_8']+ df['std_og_mou_8']+df['isd_og_mou_8']+df['spl_og_mou_8']+df['og_others_8']
df[['outgoing_total_8','total_og_mou_8']].dropna().corr()

Unnamed: 0,outgoing_total_8,total_og_mou_8
outgoing_total_8,1.0,1.0
total_og_mou_8,1.0,1.0


In [68]:
# Imputing by '0'
df[og_call_8] = df[og_call_8].fillna(0)

#### T2T,T2M,T2F & T2C

In [69]:
og_loc_8 = ['loc_og_t2t_mou_8','loc_og_t2m_mou_8','loc_og_t2f_mou_8','loc_og_t2c_mou_8','loc_og_mou_8']
loc_og_8 = df[og_loc_8]

# filtering only those clients who have made no local outgoing calls
loc_og_8.loc[loc_og_8 ['loc_og_mou_8']==0].sample(2)

Unnamed: 0,loc_og_t2t_mou_8,loc_og_t2m_mou_8,loc_og_t2f_mou_8,loc_og_t2c_mou_8,loc_og_mou_8
6743,,,,,0.0
77326,,,,,0.0


In [70]:
#Let's verify the assumption that the sum of all local outgoing calls
#equals the total local outgoing column, "loc_og_mou_8"

df['loc_outgoing_total_8'] = df['loc_og_t2t_mou_8']+ df['loc_og_t2m_mou_8']+df['loc_og_t2f_mou_8']+df['loc_og_t2c_mou_8']
df[['loc_outgoing_total_8','loc_og_mou_8']].dropna().corr().round()

Unnamed: 0,loc_outgoing_total_8,loc_og_mou_8
loc_outgoing_total_8,1.0,1.0
loc_og_mou_8,1.0,1.0


In [71]:
# Imputing by '0'
df[og_loc_8] = df[og_loc_8].fillna(0)

#### std outgoing calls

In [72]:
og_std_8 = ['std_og_t2t_mou_8','std_og_t2m_mou_8','std_og_t2f_mou_8','std_og_mou_8']
std_og_8 = df[og_std_8]
std_og_8.sample(3)

Unnamed: 0,std_og_t2t_mou_8,std_og_t2m_mou_8,std_og_t2f_mou_8,std_og_mou_8
82923,,,,0.0
2613,0.0,0.0,0.0,0.0
68434,0.0,0.0,0.0,0.0


In [73]:
#Let's verify the assumption that the sum of all std outgoing calls
#equals the total std outgoing column, "std_og_mou_8"

df['std_outgoing_total_8'] = df['std_og_t2t_mou_8']+ df['std_og_t2m_mou_8']+df['std_og_t2f_mou_8']
df[['std_outgoing_total_8','std_og_mou_8']].dropna().corr()

Unnamed: 0,std_outgoing_total_8,std_og_mou_8
std_outgoing_total_8,1.0,1.0
std_og_mou_8,1.0,1.0


In [74]:
# Imputing by '0'
df[og_std_8] = df[og_std_8].fillna(0)
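The June, July, and August cells above repeat the same zero-fill for each month. A minimal sketch of how that repetition could be folded into one loop is shown here; the helper name `zero_fill_groups`, the `{m}` name templates, and the toy frame are all hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def zero_fill_groups(frame, months, templates):
    """Fill NaN with 0 in every column built from the name templates."""
    for month in months:
        for template in templates:
            cols = [name.format(m=month) for name in template]
            # Keep only the names that actually exist in the frame.
            cols = [c for c in cols if c in frame.columns]
            if not cols:
                continue
            frame[cols] = frame[cols].fillna(0)
    return frame

# Hypothetical templates; "{m}" is replaced by the month number.
templates = [
    ["loc_og_mou_{m}", "std_og_mou_{m}", "total_og_mou_{m}"],
    ["std_og_t2t_mou_{m}", "std_og_t2m_mou_{m}"],
]

# Hypothetical toy data with gaps in both months.
toy = pd.DataFrame({
    "loc_og_mou_6": [np.nan, 1.0], "std_og_mou_6": [np.nan, 2.0],
    "total_og_mou_6": [0.0, 3.0], "std_og_t2t_mou_6": [np.nan, 2.0],
    "std_og_t2m_mou_6": [np.nan, 0.0],
    "loc_og_mou_7": [np.nan, 4.0], "std_og_mou_7": [0.0, np.nan],
    "total_og_mou_7": [0.0, 4.0],
})
toy = zero_fill_groups(toy, months=[6, 7], templates=templates)
```

Keeping the column groups in one list makes it harder for a month or a group to be skipped by accident.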

## 6.7 Imputing missing values for incoming calls for the months of June, July, and August

Usage for local, standard, special, and other calls

#### For month of June

In [75]:
# From business domain knowledge, the total_ic column is the sum of local, standard, ISD, special, and other incoming calls
ic_call_6 = ['loc_ic_mou_6','std_ic_mou_6','isd_ic_mou_6','spl_ic_mou_6','ic_others_6','total_ic_mou_6']
total_ic_6 = df[ic_call_6]
# filtering only those clients who have received no incoming calls
total_ic_6.loc[total_ic_6['total_ic_mou_6']==0].sample(2)

Unnamed: 0,loc_ic_mou_6,std_ic_mou_6,isd_ic_mou_6,spl_ic_mou_6,ic_others_6,total_ic_mou_6
23294,,,,,,0.0
67668,,,,,,0.0


`Observation`: Based on my understanding of the business domain and the sample above, when the total incoming usage is zero, no incoming calls were received during that month. The missing values in the columns for local, standard, ISD, special, and other incoming calls can therefore be replaced with 0.

In [76]:
#Let's verify the assumption that the sum of all incoming calls equals total_ic_mou_6

df['incoming_total_6'] = df['loc_ic_mou_6']+ df['std_ic_mou_6']+df['isd_ic_mou_6']+df['spl_ic_mou_6']+df['ic_others_6']
df[['incoming_total_6','total_ic_mou_6']].dropna().corr()

Unnamed: 0,incoming_total_6,total_ic_mou_6
incoming_total_6,1.0,1.0
total_ic_mou_6,1.0,1.0


`Observation`: The correlation in the table above is consistent with 'total_ic_mou_6' being the sum of its component columns, so a total of 0 means that no incoming calls were received during that month. 0 can be substituted for missing values in the columns for local, standard, ISD, special, and other incoming calls.

In [77]:
# Imputing by '0'
df[ic_call_6] = df[ic_call_6].fillna(0)

#### T2T,T2M,T2F & T2C

In [78]:
ic_loc_6 = ['loc_ic_t2t_mou_6','loc_ic_t2m_mou_6','loc_ic_t2f_mou_6','loc_ic_mou_6']
loc_ic_6= df[ic_loc_6]

# filtering only those clients who have received no local incoming calls
loc_ic_6.loc[loc_ic_6['loc_ic_mou_6']==0].sample(3)

Unnamed: 0,loc_ic_t2t_mou_6,loc_ic_t2m_mou_6,loc_ic_t2f_mou_6,loc_ic_mou_6
82601,,,,0.0
33061,,,,0.0
10655,,,,0.0


`Observation`: If the total local incoming usage in a particular month is zero, no local calls were received in any of the categories used here (T2T, T2M, and T2F). I can replace the corresponding missing values with 0.

In [79]:
#Let's verify the assumption that the sum of all local incoming calls
#equals the total local incoming column, "loc_ic_mou_6"

df['loc_incoming_total_6'] = df['loc_ic_t2t_mou_6']+ df['loc_ic_t2m_mou_6']+df['loc_ic_t2f_mou_6']
df[['loc_incoming_total_6','loc_ic_mou_6']].dropna().corr()

Unnamed: 0,loc_incoming_total_6,loc_ic_mou_6
loc_incoming_total_6,1.0,1.0
loc_ic_mou_6,1.0,1.0


In [80]:
# Imputing by '0'
df[ic_loc_6] = df[ic_loc_6].fillna(0)

#### std incoming calls

The data indicate that standard incoming usage in any given month equals the sum of the standard incoming call categories used here, i.e. T2T, T2M, and T2F.

In [81]:
ic_std_6 = ['std_ic_t2t_mou_6','std_ic_t2m_mou_6','std_ic_t2f_mou_6','std_ic_mou_6']
std_ic_6 = df[ic_std_6]
std_ic_6.sample(3)

Unnamed: 0,std_ic_t2t_mou_6,std_ic_t2m_mou_6,std_ic_t2f_mou_6,std_ic_mou_6
73272,0.0,0.15,0.0,0.15
65806,0.0,0.0,0.0,0.0
46858,,,,0.0


`Observation`: If the total standard incoming usage in a particular month is zero, no T2T, T2M, or T2F calls were received. I can substitute 0 for the corresponding missing values.

In [82]:
#Let's verify the assumption that the sum of all std incoming calls
#equals the total std incoming column, "std_ic_mou_6"

df['std_incoming_total_6'] = df['std_ic_t2t_mou_6']+ df['std_ic_t2m_mou_6']+df['std_ic_t2f_mou_6']
df[['std_incoming_total_6','std_ic_mou_6']].dropna().corr()

Unnamed: 0,std_incoming_total_6,std_ic_mou_6
std_incoming_total_6,1.0,1.0
std_ic_mou_6,1.0,1.0


`Observation`: If the total standard incoming usage in a particular month is 0, then no standard calls (T2T, T2M, or T2F) were received.

In [83]:
# Imputing by '0'
df[ic_std_6] = df[ic_std_6].fillna(0)

#### For month of July

#### Total Calls

In [84]:
# From business domain knowledge, the total_ic column is the sum of local, standard, ISD, special, and other incoming calls
ic_call_7 = ['loc_ic_mou_7','std_ic_mou_7','isd_ic_mou_7','spl_ic_mou_7','ic_others_7','total_ic_mou_7']
total_ic_7 = df[ic_call_7]
# filtering only those clients who have received no incoming calls
total_ic_7.loc[total_ic_7['total_ic_mou_7']==0].sample(2)

Unnamed: 0,loc_ic_mou_7,std_ic_mou_7,isd_ic_mou_7,spl_ic_mou_7,ic_others_7,total_ic_mou_7
19983,0.0,0.0,0.0,0.0,0.0,0.0
91902,,,,,,0.0


In [85]:
#Let's verify the assumption that the sum of all incoming calls equals total_ic_mou_7
#(the helper column is named 'loc_incoming_total_7' here and is overwritten by the local incoming check in a later cell)

df['loc_incoming_total_7'] = df['loc_ic_mou_7']+ df['std_ic_mou_7']+df['isd_ic_mou_7']+df['spl_ic_mou_7']+df['ic_others_7']
df[['loc_incoming_total_7','total_ic_mou_7']].dropna().corr()

Unnamed: 0,loc_incoming_total_7,total_ic_mou_7
loc_incoming_total_7,1.0,1.0
total_ic_mou_7,1.0,1.0


In [86]:
# Imputing by '0'
df[ic_call_7] = df[ic_call_7].fillna(0)

#### T2T,T2M,T2F & T2C

In [87]:
ic_loc_7 = ['loc_ic_t2t_mou_7','loc_ic_t2m_mou_7','loc_ic_t2f_mou_7','loc_ic_mou_7']
loc_ic_7= df[ic_loc_7]

# filtering only those clients who have received no local incoming calls
loc_ic_7.loc[loc_ic_7['loc_ic_mou_7']==0].sample(3)

Unnamed: 0,loc_ic_t2t_mou_7,loc_ic_t2m_mou_7,loc_ic_t2f_mou_7,loc_ic_mou_7
3394,,,,0.0
80316,0.0,0.0,0.0,0.0
27619,0.0,0.0,0.0,0.0


In [88]:
#Let's verify the assumption that the sum of all local incoming calls
#equals the total local incoming column, "loc_ic_mou_7"

df['loc_incoming_total_7'] = df['loc_ic_t2t_mou_7']+ df['loc_ic_t2m_mou_7']+df['loc_ic_t2f_mou_7']
df[['loc_incoming_total_7','loc_ic_mou_7']].dropna().corr()

Unnamed: 0,loc_incoming_total_7,loc_ic_mou_7
loc_incoming_total_7,1.0,1.0
loc_ic_mou_7,1.0,1.0


In [89]:
# Imputing by '0'
df[ic_loc_7] = df[ic_loc_7].fillna(0)

#### Standard incoming calls

In [90]:
ic_std_7 = ['std_ic_t2t_mou_7','std_ic_t2m_mou_7','std_ic_t2f_mou_7','std_ic_mou_7']
std_ic_7 = df[ic_std_7]
std_ic_7.sample(2)

Unnamed: 0,std_ic_t2t_mou_7,std_ic_t2m_mou_7,std_ic_t2f_mou_7,std_ic_mou_7
84305,0.0,1.83,0.0,1.83
69718,6.61,66.99,0.0,73.61


In [91]:
#Let's verify the assumption that the sum of all std incoming calls
#equals the total std incoming column, "std_ic_mou_7"

df['std_incoming_total_7'] = df['std_ic_t2t_mou_7']+ df['std_ic_t2m_mou_7']+df['std_ic_t2f_mou_7']
df[['std_incoming_total_7','std_ic_mou_7']].dropna().corr()

Unnamed: 0,std_incoming_total_7,std_ic_mou_7
std_incoming_total_7,1.0,1.0
std_ic_mou_7,1.0,1.0


In [92]:
# Imputing by '0'
df[ic_std_7] = df[ic_std_7].fillna(0)

#### For August Month

In [93]:
# From business domain knowledge, the total_ic column is the sum of local, standard, ISD, special, and other incoming calls
ic_call_8 = ['loc_ic_mou_8','std_ic_mou_8','isd_ic_mou_8','spl_ic_mou_8','ic_others_8','total_ic_mou_8']
total_ic_8 = df[ic_call_8]
# filtering only those clients who have received no incoming calls
total_ic_8.loc[total_ic_8['total_ic_mou_8']==0].sample(3)

Unnamed: 0,loc_ic_mou_8,std_ic_mou_8,isd_ic_mou_8,spl_ic_mou_8,ic_others_8,total_ic_mou_8
32010,,,,,,0.0
70267,0.0,0.0,0.0,0.0,0.0,0.0
75919,,,,,,0.0


In [94]:
#Let's verify the assumption that the sum of all incoming calls equals total_ic_mou_8

df['incoming_total_8'] = df['loc_ic_mou_8']+ df['std_ic_mou_8']+df['isd_ic_mou_8']+df['spl_ic_mou_8']+df['ic_others_8']
df[['incoming_total_8','total_ic_mou_8']].dropna().corr()

Unnamed: 0,incoming_total_8,total_ic_mou_8
incoming_total_8,1.0,1.0
total_ic_mou_8,1.0,1.0


In [95]:
# Imputing by '0'
df[ic_call_8] = df[ic_call_8].fillna(0)

In [96]:
ic_loc_8 = ['loc_ic_t2t_mou_8','loc_ic_t2m_mou_8','loc_ic_t2f_mou_8','loc_ic_mou_8']
loc_ic_8= df[ic_loc_8]

# filtering only those clients who have received no local incoming calls
loc_ic_8.loc[loc_ic_8['loc_ic_mou_8']==0].sample(3)

Unnamed: 0,loc_ic_t2t_mou_8,loc_ic_t2m_mou_8,loc_ic_t2f_mou_8,loc_ic_mou_8
69324,0.0,0.0,0.0,0.0
68722,0.0,0.0,0.0,0.0
21345,,,,0.0


In [97]:
#Let's verify the assumption that the sum of all local incoming calls
#equals the total local incoming column, "loc_ic_mou_8"

df['loc_incoming_total_8'] = df['loc_ic_t2t_mou_8']+ df['loc_ic_t2m_mou_8']+df['loc_ic_t2f_mou_8']
df[['loc_incoming_total_8','loc_ic_mou_8']].dropna().corr()

Unnamed: 0,loc_incoming_total_8,loc_ic_mou_8
loc_incoming_total_8,1.0,1.0
loc_ic_mou_8,1.0,1.0


In [98]:
# Imputing by '0'
df[ic_loc_8] = df[ic_loc_8].fillna(0)

In [99]:
ic_std_8 = ['std_ic_t2t_mou_8','std_ic_t2m_mou_8','std_ic_t2f_mou_8','std_ic_mou_8']
std_ic_8 = df[ic_std_8]
std_ic_8.sample(3)

Unnamed: 0,std_ic_t2t_mou_8,std_ic_t2m_mou_8,std_ic_t2f_mou_8,std_ic_mou_8
6989,11.99,1446.88,0.99,1459.88
41725,21.84,48.46,0.0,70.31
48923,0.0,3.93,0.0,3.93


In [100]:
#Let's verify the assumption that the sum of all std incoming calls
#equals the total std incoming column, "std_ic_mou_8"

df['std_incoming_total_8'] = df['std_ic_t2t_mou_8']+ df['std_ic_t2m_mou_8']+df['std_ic_t2f_mou_8']
df[['std_incoming_total_8','std_ic_mou_8']].dropna().corr()

Unnamed: 0,std_incoming_total_8,std_ic_mou_8
std_incoming_total_8,1.0,1.0
std_ic_mou_8,1.0,1.0


In [101]:
# Imputing by '0'
df[ic_std_8] = df[ic_std_8].fillna(0)

#### Check the Null Values

In [102]:
## Let us now check the % of null values
null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

std_incoming_total_8    5.38
offnet_mou_8            5.38
outgoing_total_8        5.38
std_outgoing_total_8    5.38
roam_og_mou_8           5.38
roam_ic_mou_8           5.38
loc_outgoing_total_8    5.38
loc_incoming_total_8    5.38
onnet_mou_8             5.38
incoming_total_8        5.38
offnet_mou_6            3.94
roam_ic_mou_6           3.94
outgoing_total_6        3.94
loc_outgoing_total_6    3.94
roam_og_mou_6           3.94
onnet_mou_6             3.94
std_incoming_total_6    3.94
loc_incoming_total_6    3.94
incoming_total_6        3.94
outgoing_total_7        3.86
offnet_mou_7            3.86
std_outgoing_total_7    3.86
loc_outgoing_total_7    3.86
std_incoming_total_7    3.86
roam_og_mou_7           3.86
onnet_mou_7             3.86
roam_ic_mou_7           3.86
loc_incoming_total_7    3.86
date_of_last_rech_8     3.62
date_of_last_rech_7     1.77
date_of_last_rech_6     1.61
dtype: float64

## 3.9 Drop Null Values from Rows

In [103]:
df = df.dropna(how='all',axis=0) 

In [104]:
df.shape

(99999, 168)
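The row count is still 99999, which suggests that no row was entirely empty: `dropna(how='all', axis=0)` removes a row only when every one of its values is missing. A small sketch on a hypothetical two-column frame shows how this differs from `how='any'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete row, one partial row, one empty row.
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [2.0, 3.0, np.nan],
})

# how="all" drops only the fully empty last row.
rows_all = toy.dropna(how="all", axis=0)

# how="any" drops every row that has at least one missing value.
rows_any = toy.dropna(how="any", axis=0)
```

Here `rows_all` keeps two rows and `rows_any` keeps one, so the choice of `how` decides whether partially filled customers survive this step.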

In [105]:
# Now Checking Null values
null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

std_incoming_total_8    5.38
offnet_mou_8            5.38
outgoing_total_8        5.38
std_outgoing_total_8    5.38
roam_og_mou_8           5.38
roam_ic_mou_8           5.38
loc_outgoing_total_8    5.38
loc_incoming_total_8    5.38
onnet_mou_8             5.38
incoming_total_8        5.38
offnet_mou_6            3.94
roam_ic_mou_6           3.94
outgoing_total_6        3.94
loc_outgoing_total_6    3.94
roam_og_mou_6           3.94
onnet_mou_6             3.94
std_incoming_total_6    3.94
loc_incoming_total_6    3.94
incoming_total_6        3.94
outgoing_total_7        3.86
offnet_mou_7            3.86
std_outgoing_total_7    3.86
loc_outgoing_total_7    3.86
std_incoming_total_7    3.86
roam_og_mou_7           3.86
onnet_mou_7             3.86
roam_ic_mou_7           3.86
loc_incoming_total_7    3.86
date_of_last_rech_8     3.62
date_of_last_rech_7     1.77
date_of_last_rech_6     1.61
dtype: float64

## 3.10 Drop Null Values from features

In [106]:
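# Let us drop the date columns, as they do not add information for this analysis
del_date = [i for i in df.columns if 'date' in i]
# pass the labels via `columns=`; the positional axis argument is not accepted by recent pandas versions
df = df.drop(columns=del_date)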


In [107]:
df.shape

(99999, 162)

In [108]:
# Now Checking Null values
null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

std_incoming_total_8    5.38
roam_og_mou_8           5.38
loc_incoming_total_8    5.38
incoming_total_8        5.38
outgoing_total_8        5.38
onnet_mou_8             5.38
loc_outgoing_total_8    5.38
std_outgoing_total_8    5.38
offnet_mou_8            5.38
roam_ic_mou_8           5.38
loc_outgoing_total_6    3.94
incoming_total_6        3.94
loc_incoming_total_6    3.94
std_incoming_total_6    3.94
outgoing_total_6        3.94
roam_og_mou_6           3.94
onnet_mou_6             3.94
roam_ic_mou_6           3.94
offnet_mou_6            3.94
roam_og_mou_7           3.86
loc_outgoing_total_7    3.86
std_incoming_total_7    3.86
outgoing_total_7        3.86
std_outgoing_total_7    3.86
onnet_mou_7             3.86
offnet_mou_7            3.86
roam_ic_mou_7           3.86
loc_incoming_total_7    3.86
dtype: float64

### Replacing the remaining missing values

Each remaining column has less than 6% missing values (at most 5.38%), so I replace the NaN values with the column's mean. Because the great majority of the data is present, this replacement should have minimal impact on the analysis.

In [109]:
df[null.index].describe()

Unnamed: 0,std_incoming_total_8,roam_og_mou_8,loc_incoming_total_8,incoming_total_8,outgoing_total_8,onnet_mou_8,loc_outgoing_total_8,std_outgoing_total_8,offnet_mou_8,roam_ic_mou_8,loc_outgoing_total_6,incoming_total_6,loc_incoming_total_6,std_incoming_total_6,outgoing_total_6,roam_og_mou_6,onnet_mou_6,roam_ic_mou_6,offnet_mou_6,roam_og_mou_7,loc_outgoing_total_7,std_incoming_total_7,outgoing_total_7,std_outgoing_total_7,onnet_mou_7,offnet_mou_7,roam_ic_mou_7,loc_incoming_total_7
count,94621.0,94621.0,94621.0,94621.0,94621.0,94621.0,94621.0,94621.0,94621.0,94621.0,96062.0,96062.0,96062.0,96062.0,96062.0,96062.0,96062.0,96062.0,96062.0,96140.0,96140.0,96140.0,96140.0,96140.0,96140.0,96140.0,96140.0,96140.0
mean,33.152136,9.97189,167.423711,210.040473,321.398096,133.018098,142.754745,174.188855,196.574803,7.292981,145.31692,208.325058,167.48232,32.454631,317.631638,13.911337,132.395875,9.950013,197.935577,9.818732,143.031627,33.885242,322.676998,175.218745,133.670805,197.045133,7.149898,167.71071
std,110.125911,64.713221,250.023926,293.412497,485.864968,308.951589,246.156285,411.631534,327.170662,68.402466,252.007954,294.681705,254.122393,106.282273,468.599116,71.443196,297.207406,72.825411,316.851613,58.455762,248.984326,113.719065,485.447634,408.921371,308.794148,325.862803,73.447948,256.241124
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.01,0.0,32.74,49.11,51.58,6.46,18.62,0.0,31.63,0.0,18.3,46.68,30.3825,0.0,54.36,0.0,7.38,0.0,34.73,0.0,18.93,0.0,52.24,0.0,6.66,32.19,0.0,32.4475
50%,5.88,0.0,93.81,124.93,153.79,32.36,65.18,10.41,92.14,0.0,66.23,122.24,92.145,5.89,156.32,0.0,34.31,0.0,96.31,0.0,65.11,5.95,152.64,11.09,32.33,91.735,0.0,92.54
75%,27.69,0.0,207.27,260.66,391.96,115.86,167.65,147.94,228.26,0.0,169.4175,259.84,208.065,26.93,388.59,0.0,118.74,0.0,231.86,0.0,166.09,28.3,395.4125,150.615,115.595,226.815,0.0,205.82
max,5957.13,5337.04,10830.16,10830.37,14043.05,10752.56,11040.56,13980.05,14007.34,13095.36,10645.27,7716.13,7454.62,5712.1,10674.02,3775.11,7376.71,13724.38,8362.36,2812.04,7674.76,6745.75,11365.3,10936.72,8157.78,9667.13,15371.04,9669.89


In [110]:
miss_col = null.index
for i in miss_col:
    df[i] = df[i].fillna(df[i].mean())
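The `describe()` output above shows means well above the medians (for example, `roam_og_mou_8` has a mean of about 9.97 and a median of 0.0), so a mean fill places imputed customers above the typical one. The following is only a sketch on hypothetical values to compare a mean fill with a median fill; it is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed usage values with two gaps.
usage = pd.Series([0.0, 0.0, 1.0, 2.0, 97.0, np.nan, np.nan])

# Mean of the five observed values is 20.0, pulled up by the single large value.
mean_filled = usage.fillna(usage.mean())

# Median of the five observed values is 1.0, closer to a typical customer.
median_filled = usage.fillna(usage.median())
```

With this little missing data either choice changes the columns only slightly, but the median is less sensitive to the heavy right tail.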

In [111]:
## Let us now check the % of null values

null = round(100*(df.isnull().sum()/len(df.index)),2).sort_values(ascending = False)
null = null[null!=0]
null

Series([], dtype: float64)

`Observation`: There are no missing values left in this dataset.