# <font color='blue'>Telecom Churn Case Study</font>
* Institution: IIIT, Bangalore and UpGrad
* Course: PG Diploma in Machine Learning and AI March 2018
* Date: 13-Aug-2018
* Submitted by:
    1. Pandinath Siddineni (ID- APFE187000194)
    2. AKNR Chandra Sekhar (ID- APFE187000315)
    3. Brajesh Kumar       (ID- APFE187000149)
    4. Shweta Tiwari


### <font color='blue'>Business Goals:</font>
1. Retaining highly profitable customers is the number one business goal.
2. This project is based on the Indian and Southeast Asian market.
3. In the Indian and Southeast Asian market, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Reducing churn among high-value customers therefore prevents significant revenue leakage.
4. The business objective is to predict churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this well, it helps to understand typical customer behaviour during churn.

### <font color='blue'>Analysis Goals:</font>
1. Predict which customers are at high risk of churn.
2. Build predictive models to identify customers at high risk of churn and identify the main indicators of churn.
3. Prepaid is the most common model in India and Southeast Asia, so the analysis focuses on prepaid customers.
4. Churn definition used: "Usage-based churn: Customers who have not done any usage, either incoming or outgoing - in terms of calls, internet etc. over a period of time." This project uses the usage-based definition to define churn.
5. High-value customers are defined based on a certain metric (described later below), and churn is predicted only for high-value customers.
6. Customers, especially high-value ones, go through three phases of the customer lifecycle: a. The ‘good’ phase, b. The ‘action’ phase, c. The ‘churn’ phase
---------------------------

# <font color='blue'>PART 1: DATA UNDERSTANDING AND CLEANING</font>

1. Understand the properties of the loaded dataframe
2. Identify the uniqueness key
3. Identify bad columns that have no information (all entries are null or the same)
4. Convert dates to a meaningful number of days
5. Remove columns with data that does not make much sense for our analysis
6. Missing value treatment: replace with '0', mean or median; drop rows; drop columns
7. Outlier treatment
8. Write data into a clean data file. This will be used to create master-df for analysis
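
Step 2 above (the uniqueness key) is not checked explicitly in the cells that follow. A minimal sketch of such a check, assuming `mobile_number` is the intended key; the toy frame and the `is_unique_key` helper are illustrative and not part of the original notebook:

```python
import pandas as pd

def is_unique_key(frame, column):
    """Return True if `column` has no nulls and no repeated values."""
    return bool(frame[column].notna().all() and frame[column].is_unique)

# Toy data standing in for the real telecom dataframe
toy = pd.DataFrame({
    "mobile_number": [7000000001, 7000000002, 7000000003],
    "circle_id": [109, 109, 109],
})

print(is_unique_key(toy, "mobile_number"))  # True
print(is_unique_key(toy, "circle_id"))      # False
```

On the real data the same call would be `is_unique_key(telecom, "mobile_number")`.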

#### <font color='red'>TODO: Compute Loss of data at each cleaning step</font>
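
One possible way to address this TODO is to compare the dataframe shape before and after each cleaning step. The sketch below is an assumption about how that could be done; `report_loss` is a hypothetical helper, and the toy frame only stands in for the real data:

```python
import pandas as pd

def report_loss(before_shape, after_shape, step):
    """Print and return the rows/columns removed by one cleaning step."""
    rows_lost = before_shape[0] - after_shape[0]
    cols_lost = before_shape[1] - after_shape[1]
    pct_rows = 100.0 * rows_lost / before_shape[0]
    print(f"{step}: dropped {rows_lost} rows ({pct_rows:.2f}%) and {cols_lost} columns")
    return rows_lost, cols_lost

# Toy example: drop a constant column, then drop rows with missing values
toy = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": [1, 1, 1, 1]})
step1 = toy.drop(columns=["b"])
loss1 = report_loss(toy.shape, step1.shape, "drop constant column")
step2 = step1.dropna()
loss2 = report_loss(step1.shape, step2.shape, "drop rows with nulls")
```

Applied to the shapes recorded in this notebook, the row-wise missing-value step goes from (30011, 129) to (28861, 129), a loss of 1150 rows or about 3.83%.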

In [1]:
# Import required libraries
import numpy as np
import pandas as pd

# Utility function: line separator
def print_ln():
    print('-'*80, '\n')
    
pd.options.display.float_format = '{:.2f}'.format

# Load csv data file
telecom = pd.read_csv('telecom_churn_data.csv', low_memory=False)

In [2]:
# Understand the properties of loaded dataframe
print('Dataframe Shape: ', telecom.shape); print_ln();
print("Dataframe Info: \n"); telecom.info(); print_ln();
telecom.head(5)

Dataframe Shape:  (99999, 226)
-------------------------------------------------------------------------------- 

Dataframe Info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Columns: 226 entries, mobile_number to sep_vbc_3g
dtypes: float64(179), int64(35), object(12)
memory usage: 172.4+ MB
-------------------------------------------------------------------------------- 



Unnamed: 0,mobile_number,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,last_date_of_month_9,arpu_6,...,sachet_3g_9,fb_user_6,fb_user_7,fb_user_8,fb_user_9,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,sep_vbc_3g
0,7000842753,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,197.38,...,0,1.0,1.0,1.0,,968,30.4,0.0,101.2,3.58
1,7001865778,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,34.05,...,0,,1.0,1.0,,1006,0.0,0.0,0.0,0.0
2,7001625959,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,167.69,...,0,,,,1.0,1103,0.0,0.0,4.17,0.0
3,7001204172,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,221.34,...,0,,,,,2491,0.0,0.0,0.0,0.0
4,7000142493,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,9/30/2014,261.64,...,0,0.0,,,,1526,0.0,0.0,0.0,0.0


## Fix Dates & convert to meaningful numbers
1. Convert date_of_last_rech_6 --> rech_days_left_6 (number of days before month end at which the last voice recharge was made)
2. Convert date_of_last_rech_data_6 --> rech_days_left_data_6 (number of days before month end at which the last data recharge was made)
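
The conversion described above can be illustrated on a few made-up rows. This is a sketch under the assumption that dates arrive as `m/d/YYYY` strings, as in the raw file; the toy values are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "last_date_of_month_6": ["6/30/2014", "6/30/2014", "6/30/2014"],
    "date_of_last_rech_6": ["6/21/2014", "6/29/2014", None],
})
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y")

# Timedelta -> whole days; a missing recharge date stays missing (NaN)
toy["rech_days_left_6"] = (toy["last_date_of_month_6"] - toy["date_of_last_rech_6"]).dt.days
print(toy["rech_days_left_6"].tolist())
```

A small value means the customer recharged close to month end; a missing value means no recharge date was recorded for that month.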

In [3]:
# list the date columns to be converted to pandas datetime format
date_columns = ["last_date_of_month_6", "last_date_of_month_7", "last_date_of_month_8", 
 "date_of_last_rech_6",  "date_of_last_rech_7", "date_of_last_rech_8", 
 "date_of_last_rech_data_6", "date_of_last_rech_data_7",  "date_of_last_rech_data_8"]
telecom[date_columns].head()

Unnamed: 0,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8
0,6/30/2014,7/31/2014,8/31/2014,6/21/2014,7/16/2014,8/8/2014,6/21/2014,7/16/2014,8/8/2014
1,6/30/2014,7/31/2014,8/31/2014,6/29/2014,7/31/2014,8/28/2014,,7/25/2014,8/10/2014
2,6/30/2014,7/31/2014,8/31/2014,6/17/2014,7/24/2014,8/14/2014,,,
3,6/30/2014,7/31/2014,8/31/2014,6/28/2014,7/31/2014,8/31/2014,,,
4,6/30/2014,7/31/2014,8/31/2014,6/26/2014,7/28/2014,8/9/2014,6/4/2014,,


In [4]:
# convert to datetime
for col in date_columns:
    telecom[col] = pd.to_datetime(telecom[col])

print(telecom[date_columns].info())
telecom[date_columns].head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 9 columns):
last_date_of_month_6        99999 non-null datetime64[ns]
last_date_of_month_7        99398 non-null datetime64[ns]
last_date_of_month_8        98899 non-null datetime64[ns]
date_of_last_rech_6         98392 non-null datetime64[ns]
date_of_last_rech_7         98232 non-null datetime64[ns]
date_of_last_rech_8         96377 non-null datetime64[ns]
date_of_last_rech_data_6    25153 non-null datetime64[ns]
date_of_last_rech_data_7    25571 non-null datetime64[ns]
date_of_last_rech_data_8    26339 non-null datetime64[ns]
dtypes: datetime64[ns](9)
memory usage: 6.9 MB
None


Unnamed: 0,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8
0,2014-06-30,2014-07-31,2014-08-31,2014-06-21,2014-07-16,2014-08-08,2014-06-21,2014-07-16,2014-08-08
1,2014-06-30,2014-07-31,2014-08-31,2014-06-29,2014-07-31,2014-08-28,NaT,2014-07-25,2014-08-10
2,2014-06-30,2014-07-31,2014-08-31,2014-06-17,2014-07-24,2014-08-14,NaT,NaT,NaT
3,2014-06-30,2014-07-31,2014-08-31,2014-06-28,2014-07-31,2014-08-31,NaT,NaT,NaT
4,2014-06-30,2014-07-31,2014-08-31,2014-06-26,2014-07-28,2014-08-09,2014-06-04,NaT,NaT


In [5]:
# Create new days columns, instead of date
# .dt.days gives whole days (NaN where a date is missing); astype('timedelta64[D]') is not supported by recent pandas versions

telecom["rech_days_left_6"]      = (telecom.last_date_of_month_6 - telecom.date_of_last_rech_6).dt.days
telecom["rech_days_left_data_6"] = (telecom.last_date_of_month_6 - telecom.date_of_last_rech_data_6).dt.days
telecom["rech_days_left_7"]      = (telecom.last_date_of_month_7 - telecom.date_of_last_rech_7).dt.days
telecom["rech_days_left_data_7"] = (telecom.last_date_of_month_7 - telecom.date_of_last_rech_data_7).dt.days
telecom["rech_days_left_8"]      = (telecom.last_date_of_month_8 - telecom.date_of_last_rech_8).dt.days
telecom["rech_days_left_data_8"] = (telecom.last_date_of_month_8 - telecom.date_of_last_rech_data_8).dt.days

day_columns = ["rech_days_left_6", "rech_days_left_data_6", "rech_days_left_7", "rech_days_left_data_7", "rech_days_left_8", "rech_days_left_data_8"]
#print(telecom[day_columns].head(10))
print(telecom[day_columns].info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 6 columns):
rech_days_left_6         98392 non-null float64
rech_days_left_data_6    25153 non-null float64
rech_days_left_7         98232 non-null float64
rech_days_left_data_7    25571 non-null float64
rech_days_left_8         96377 non-null float64
rech_days_left_data_8    26339 non-null float64
dtypes: float64(6)
memory usage: 4.6 MB
None


In [6]:
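# Drop all old date columns: start the drop_columns list with the date columns
# (a copy, so that later `drop_columns += ...` does not also modify date_columns)
drop_columns = list(date_columns)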

## Filter high-value customers

In [7]:
# Filter high-value customers: average recharge amount of months 6 and 7 at or above the 70th percentile
total_rech_amt_6_7 = (telecom["total_rech_amt_6"] + telecom["total_rech_amt_7"]) / 2.0
amount_70_pc = np.percentile(total_rech_amt_6_7, 70.0)
print('70 percentile of first two months avg recharge amount: ', amount_70_pc); print_ln();

# .copy() avoids SettingWithCopyWarning when columns are added to the filtered frame later
telecom = telecom[total_rech_amt_6_7 >= amount_70_pc].copy()
print('Dataframe Shape: ', telecom.shape); print_ln();

70 percentile of first two months avg recharge amount:  368.5
-------------------------------------------------------------------------------- 

Dataframe Shape:  (30011, 232)
-------------------------------------------------------------------------------- 
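
The high-value filter keeps customers whose average recharge amount over months 6 and 7 is at or above the 70th percentile. The sketch below repeats that logic on invented toy values. It uses `Series.quantile`, which skips missing values, whereas `np.percentile` returns NaN if any input is NaN; whether that difference matters for the real recharge columns is an assumption to verify.

```python
import pandas as pd

# Ten made-up customers with identical recharge amounts in both months
toy = pd.DataFrame({
    "total_rech_amt_6": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "total_rech_amt_7": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
})
avg_rech = (toy["total_rech_amt_6"] + toy["total_rech_amt_7"]) / 2.0

threshold = avg_rech.quantile(0.70)              # 70th percentile, linear interpolation
high_value = toy[avg_rech >= threshold].copy()   # keep customers at or above the threshold
print(round(threshold, 2), len(high_value))
```

Here the threshold falls between the 7th and 8th customer, so the three customers with the largest average recharge are kept.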



## Tag churners and remove attributes of the churn phase

In [8]:
# Identify churn: a customer with zero total usage (calls and data) in month 9 is tagged as churned (1)
X = telecom["total_ic_mou_9"] + telecom["total_og_mou_9"] + telecom["vol_2g_mb_9"] + telecom["vol_3g_mb_9"]
telecom["churn"] = np.where(X == 0, 1, 0)
#telecom["churn"].head(30)

# Columns to be dropped: all columns ending with "_9"
drop_columns += [hdr for hdr in list(telecom) if hdr.endswith("_9")]
print('Total number of columns to drop  = ', len(set(drop_columns))); print_ln()

Total number of columns to drop  =  63
-------------------------------------------------------------------------------- 
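
The usage-based churn tag can be illustrated on a few invented rows. This sketch sums the four month-9 usage columns with `sum(axis=1)`, which treats missing values as zero; that differs from the `+` operator used in the cell above, where a missing value makes the whole sum NaN and the row is tagged as not churned. Which behaviour is preferable is a judgment call, not something the notebook states.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_ic_mou_9": [10.0, 0.0, 0.0],
    "total_og_mou_9": [5.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 12.5],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0],
})
usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# Zero calls and zero data in month 9 -> churned (1), otherwise not churned (0)
total_usage = toy[usage_cols].sum(axis=1)
toy["churn"] = np.where(total_usage == 0, 1, 0)
print(toy["churn"].tolist())  # [0, 1, 0]
```

The third customer made no calls but used 2G data, so the usage-based definition does not count them as churned.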



## Identify columns that have no variance & Drop

In [9]:
# Identify columns that have no variance
telecom_unique_count = telecom.nunique().sort_values(ascending=False)
#print("Dataframe Unique Values: \n", telecom_unique_count); print_ln()

# Identify bad columns that have no information (all entries are NA or the same)
# Find columns with all NULL entries and add to drop_columns list
telecom_unique_count_is_zero = telecom_unique_count[telecom_unique_count == 0]
print("Dataframe Unique Value Count is ZERO (all null values): \n", telecom_unique_count_is_zero); print_ln();
drop_columns += list(telecom_unique_count_is_zero.index)

# Find columns with all same entries and add to drop_columns list
telecom_unique_count_is_one = telecom_unique_count[telecom_unique_count == 1]
print("Dataframe Unique Value Count is ONE (all same values): \n", telecom_unique_count_is_one); print_ln();
drop_columns += list(telecom_unique_count_is_one.index)

# #Don't drop columns used for calculations ["last_date_of_month_6", "last_date_of_month_7","last_date_of_month_8"]
# drop_columns = [c for c in drop_columns if c not in ["last_date_of_month_6", "last_date_of_month_7","last_date_of_month_8"]]

print('Number of columns to drop  = ', len(set(drop_columns)))

Dataframe Unique Value Count is ZERO (all null values): 
 Series([], dtype: int64)
-------------------------------------------------------------------------------- 

Dataframe Unique Value Count is ONE (all same values): 
 last_date_of_month_8    1
circle_id               1
loc_og_t2o_mou          1
std_og_t2o_mou          1
loc_ic_t2o_mou          1
last_date_of_month_6    1
last_date_of_month_7    1
std_og_t2c_mou_6        1
last_date_of_month_9    1
std_og_t2c_mou_7        1
std_ic_t2o_mou_8        1
std_ic_t2o_mou_7        1
std_ic_t2o_mou_6        1
std_og_t2c_mou_9        1
std_og_t2c_mou_8        1
std_ic_t2o_mou_9        1
dtype: int64
-------------------------------------------------------------------------------- 

Number of columns to drop  =  73


In [10]:
# Additional columns to be dropped
# "sep_vbc_3g": this data belongs to the fourth month (September), thus dropping it
# "mobile_number": not dropping as we need member-identification later
#drop_columns += ["mobile_number"]
drop_columns += ["sep_vbc_3g"]

In [11]:
# drop all identified columns
print('Columns to be dropped  = ', set(drop_columns))
print('Number of columns to drop  = ', len(set(drop_columns)))

telecom.drop(set(drop_columns), axis=1, inplace=True)
print('Dataframe Shape: ', telecom.shape); print_ln();
print("Dataframe Info: \n"); telecom.info(); print_ln();
telecom.head(5) 

Columns to be dropped  =  {'std_og_t2c_mou_8', 'last_date_of_month_9', 'total_ic_mou_9', 'onnet_mou_9', 'last_day_rch_amt_9', 'count_rech_3g_9', 'arpu_2g_9', 'std_ic_mou_9', 'offnet_mou_9', 'count_rech_2g_9', 'monthly_2g_9', 'std_og_t2c_mou_7', 'loc_ic_t2t_mou_9', 'std_ic_t2m_mou_9', 'std_ic_t2o_mou_9', 'ic_others_9', 'og_others_9', 'sachet_3g_9', 'spl_ic_mou_9', 'fb_user_9', 'loc_ic_t2o_mou', 'isd_og_mou_9', 'vol_2g_mb_9', 'loc_ic_t2m_mou_9', 'night_pck_user_9', 'loc_og_t2m_mou_9', 'date_of_last_rech_data_7', 'total_rech_data_9', 'total_rech_num_9', 'av_rech_amt_data_9', 'std_ic_t2f_mou_9', 'last_date_of_month_6', 'isd_ic_mou_9', 'std_ic_t2o_mou_7', 'std_og_mou_9', 'loc_og_mou_9', 'date_of_last_rech_data_8', 'monthly_3g_9', 'total_rech_amt_9', 'std_ic_t2t_mou_9', 'last_date_of_month_7', 'std_og_t2f_mou_9', 'std_og_t2o_mou', 'last_date_of_month_8', 'date_of_last_rech_6', 'date_of_last_rech_9', 'std_og_t2t_mou_9', 'spl_og_mou_9', 'arpu_3g_9', 'std_ic_t2o_mou_8', 'loc_og_t2c_mou_9', 'roam_

Unnamed: 0,mobile_number,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,...,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,rech_days_left_6,rech_days_left_data_6,rech_days_left_7,rech_days_left_data_7,rech_days_left_8,rech_days_left_data_8,churn
7,7000701601,1069.18,1349.85,3171.48,57.84,54.68,52.29,453.43,567.16,325.91,...,57.74,19.38,18.74,3.0,,6.0,,5.0,,1
8,7001524846,378.72,492.22,137.36,413.69,351.03,35.08,94.66,80.63,136.48,...,21.03,910.65,122.16,5.0,,0.0,0.0,1.0,8.0,0
13,7002191713,492.85,205.67,593.26,501.76,108.39,534.24,413.31,119.28,482.46,...,0.0,0.0,0.0,10.0,,9.0,,1.0,1.0,0
16,7000875565,430.98,299.87,187.89,50.51,74.01,70.61,296.29,229.74,162.76,...,0.0,2.45,21.89,0.0,,0.0,,17.0,,0
17,7000187447,690.01,18.98,25.5,1185.91,9.28,7.79,61.64,0.0,5.54,...,0.0,0.0,0.0,0.0,,1.0,,6.0,,0


## Missing Value Treatment
1. Delete: Delete the missing values 
2. Impute: 
    - Imputing by a simple statistic: Replace the missing values by another value, commonly the mean, median, mode etc. 
    - Predictive techniques: Use statistical models such as k-NN, SVM etc. to predict and impute missing values
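
The delete and simple-statistic options listed above can be compared on a toy frame. This is a sketch with invented values, not the treatment chosen later in the notebook; predictive imputation (e.g. k-NN) is left out to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "onnet_mou_6": [10.0, np.nan, 30.0, 1000.0],
    "rech_days_left_6": [2.0, 4.0, np.nan, 0.0],
})

dropped = toy.dropna()                      # 1. delete rows with any missing value
mean_filled = toy.fillna(toy.mean())        # 2a. impute with each column's mean
median_filled = toy.fillna(toy.median())    # 2b. impute with each column's median

print(len(dropped))
print(mean_filled.loc[1, "onnet_mou_6"], median_filled.loc[1, "onnet_mou_6"])
```

With a skewed column such as `onnet_mou_6` here, the mean is pulled towards the single large value while the median is not, which is one reason to prefer the median when outliers are present.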

#### COLUMN-WISE: MISSING VALUES

In [18]:
df = telecom.copy()

# summing up the missing values (column-wise)
df.isnull().sum()

mobile_number                0
arpu_6                       0
arpu_7                       0
arpu_8                       0
onnet_mou_6                316
onnet_mou_7                303
onnet_mou_8                938
offnet_mou_6               316
offnet_mou_7               303
offnet_mou_8               938
roam_ic_mou_6              316
roam_ic_mou_7              303
roam_ic_mou_8              938
roam_og_mou_6              316
roam_og_mou_7              303
roam_og_mou_8              938
loc_og_t2t_mou_6           316
loc_og_t2t_mou_7           303
loc_og_t2t_mou_8           938
loc_og_t2m_mou_6           316
loc_og_t2m_mou_7           303
loc_og_t2m_mou_8           938
loc_og_t2f_mou_6           316
loc_og_t2f_mou_7           303
loc_og_t2f_mou_8           938
loc_og_t2c_mou_6           316
loc_og_t2c_mou_7           303
loc_og_t2c_mou_8           938
loc_og_mou_6               316
loc_og_mou_7               303
                         ...  
arpu_2g_8                18257
night_pc

In [19]:
# Percentage of missing values (column-wise)
round(100*(df.isnull().sum()/len(df.index)), 2)

mobile_number            0.00
arpu_6                   0.00
arpu_7                   0.00
arpu_8                   0.00
onnet_mou_6              1.05
onnet_mou_7              1.01
onnet_mou_8              3.13
offnet_mou_6             1.05
offnet_mou_7             1.01
offnet_mou_8             3.13
roam_ic_mou_6            1.05
roam_ic_mou_7            1.01
roam_ic_mou_8            3.13
roam_og_mou_6            1.05
roam_og_mou_7            1.01
roam_og_mou_8            3.13
loc_og_t2t_mou_6         1.05
loc_og_t2t_mou_7         1.01
loc_og_t2t_mou_8         3.13
loc_og_t2m_mou_6         1.05
loc_og_t2m_mou_7         1.01
loc_og_t2m_mou_8         3.13
loc_og_t2f_mou_6         1.05
loc_og_t2f_mou_7         1.01
loc_og_t2f_mou_8         3.13
loc_og_t2c_mou_6         1.05
loc_og_t2c_mou_7         1.01
loc_og_t2c_mou_8         3.13
loc_og_mou_6             1.05
loc_og_mou_7             1.01
                         ... 
arpu_2g_8               60.83
night_pck_user_6        62.02
night_pck_

In [22]:
# Columns with more than 60% missing values
colmns_missing_data = round(100*(df.isnull().sum()/len(df.index)), 2)
colmns_missing_data[colmns_missing_data >= 60]

total_rech_data_6       62.02
total_rech_data_7       61.14
total_rech_data_8       60.83
max_rech_data_6         62.02
max_rech_data_7         61.14
max_rech_data_8         60.83
count_rech_2g_6         62.02
count_rech_2g_7         61.14
count_rech_2g_8         60.83
count_rech_3g_6         62.02
count_rech_3g_7         61.14
count_rech_3g_8         60.83
av_rech_amt_data_6      62.02
av_rech_amt_data_7      61.14
av_rech_amt_data_8      60.83
arpu_3g_6               62.02
arpu_3g_7               61.14
arpu_3g_8               60.83
arpu_2g_6               62.02
arpu_2g_7               61.14
arpu_2g_8               60.83
night_pck_user_6        62.02
night_pck_user_7        61.14
night_pck_user_8        60.83
fb_user_6               62.02
fb_user_7               61.14
fb_user_8               60.83
rech_days_left_data_6   62.02
rech_days_left_data_7   61.14
rech_days_left_data_8   60.83
dtype: float64

In [23]:
drop_columns = colmns_missing_data[colmns_missing_data>60].index
df.drop(set(drop_columns), axis=1, inplace=True)
df.shape

(30011, 129)

#### ROW-WISE: MISSING VALUES

In [24]:
# sum it up to check how many rows have all missing values
print("Rows with all NULL values =",  df.isnull().all(axis=1).sum())

# sum of missing values in each row
rows_missing_data = df.isnull().sum(axis=1)
rows_missing_data[rows_missing_data > 0]

Rows with all NULL values = 0


77       27
111      27
143      27
188      28
191       1
358      27
364      27
375      27
423      27
490      56
527      27
539      54
578      28
588       1
603       2
679       1
690      28
723      28
788      27
845      28
895      27
933      27
934      27
1187     27
1255     27
1374      1
1397     27
1442      1
1489     54
1524     28
         ..
98420    27
98468     1
98612    27
98635    28
98714     1
98753     1
98789     1
98790    55
98827    27
98838     1
98872    28
98943    81
98962    27
98971     1
99000    27
99059    27
99070    27
99224     2
99246    27
99248    27
99323    27
99349     1
99436    28
99515     1
99611    27
99672     1
99700    27
99713     1
99790    55
99827    27
Length: 1524, dtype: int64

In [25]:
# drop rows with 27 or more missing values
df = df[df.isnull().sum(axis=1) < 27]
df.shape

(28861, 129)

In [27]:
rows_missing_data = df.isnull().sum(axis=1)
rows_missing_data[rows_missing_data > 0]

191      1
588      1
603      2
679      1
1374     1
1442     1
1576     1
1708     1
1913     1
2777     1
3170     1
3963     1
4284     1
4694     1
5096     1
5187     1
5449     1
5798     1
5926     1
6027     1
6185     1
6713     1
7399     1
7567     1
7662     1
7914     1
8159     1
8169     2
8454     1
8680     1
        ..
93871    1
94057    1
94089    1
94241    1
94719    1
94970    1
95531    1
95638    1
95709    1
96159    1
96387    1
96403    1
96480    1
96524    1
96715    2
97001    2
97077    1
97158    1
97926    1
98468    1
98714    1
98753    1
98789    1
98838    1
98971    1
99224    2
99349    1
99515    1
99672    1
99713    1
Length: 374, dtype: int64

In [34]:
# look at the summary again
X = round(100*(df.isnull().sum()/len(df.index)), 2)
X[X>0]

rech_days_left_6   0.11
rech_days_left_7   0.21
rech_days_left_8   1.08
dtype: float64

In [35]:
df['rech_days_left_6'].describe()

count   28830.00
mean        3.10
std         4.13
min         0.00
25%         0.00
50%         2.00
75%         4.00
max        29.00
Name: rech_days_left_6, dtype: float64

In [37]:
# impute missing rech_days_left_* values with the column mean
df.loc[np.isnan(df['rech_days_left_6']), ['rech_days_left_6']] = df['rech_days_left_6'].mean()
df.loc[np.isnan(df['rech_days_left_7']), ['rech_days_left_7']] = df['rech_days_left_7'].mean()
df.loc[np.isnan(df['rech_days_left_8']), ['rech_days_left_8']] = df['rech_days_left_8'].mean()

round(100*(df.isnull().sum()/len(df.index)), 2)

mobile_number        0.00
arpu_6               0.00
arpu_7               0.00
arpu_8               0.00
onnet_mou_6          0.00
onnet_mou_7          0.00
onnet_mou_8          0.00
offnet_mou_6         0.00
offnet_mou_7         0.00
offnet_mou_8         0.00
roam_ic_mou_6        0.00
roam_ic_mou_7        0.00
roam_ic_mou_8        0.00
roam_og_mou_6        0.00
roam_og_mou_7        0.00
roam_og_mou_8        0.00
loc_og_t2t_mou_6     0.00
loc_og_t2t_mou_7     0.00
loc_og_t2t_mou_8     0.00
loc_og_t2m_mou_6     0.00
loc_og_t2m_mou_7     0.00
loc_og_t2m_mou_8     0.00
loc_og_t2f_mou_6     0.00
loc_og_t2f_mou_7     0.00
loc_og_t2f_mou_8     0.00
loc_og_t2c_mou_6     0.00
loc_og_t2c_mou_7     0.00
loc_og_t2c_mou_8     0.00
loc_og_mou_6         0.00
loc_og_mou_7         0.00
                     ... 
max_rech_amt_8       0.00
last_day_rch_amt_6   0.00
last_day_rch_amt_7   0.00
last_day_rch_amt_8   0.00
vol_2g_mb_6          0.00
vol_2g_mb_7          0.00
vol_2g_mb_8          0.00
vol_3g_mb_6 

In [38]:
df.shape

(28861, 129)

# Outlier Treatment
- Use data distribution to find outliers
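
One common way to act on the distribution is to cap values at chosen percentiles rather than delete rows. The notebook leaves the treatment open, so the sketch below is only one option; `cap_outliers` is a hypothetical helper, and the quantiles in the toy call are chosen to make the effect visible on five points.

```python
import pandas as pd

def cap_outliers(series, lower_q=0.01, upper_q=0.99):
    """Clip values below/above the given quantiles to those quantile values."""
    lower, upper = series.quantile([lower_q, upper_q])
    return series.clip(lower=lower, upper=upper)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])
capped = cap_outliers(toy, lower_q=0.0, upper_q=0.75)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 4.0]
```

Capping keeps every row, so it does not add to the data loss from earlier cleaning steps; on the real data the default 1st and 99th percentiles would be a more typical starting point.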

In [39]:
# Checking outliers at 25%,50%,75%,90%,95% and 99%
df.describe(percentiles=[.25,.5,.75,.90,.95,.99])

Unnamed: 0,mobile_number,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,...,sachet_3g_7,sachet_3g_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,rech_days_left_6,rech_days_left_7,rech_days_left_8,churn
count,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,...,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0,28861.0
mean,7001228551.09,583.57,594.56,548.3,296.27,309.16,275.87,418.63,431.59,386.86,...,0.15,0.14,1282.08,131.47,135.59,119.94,3.1,3.31,3.98,0.06
std,681553.91,429.89,461.75,488.67,457.34,481.61,468.95,464.79,487.78,477.77,...,0.95,0.99,979.49,391.95,408.96,386.3,4.13,4.12,4.95,0.24
min,7000000074.0,-2258.71,-2014.05,-945.81,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,180.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7000650518.0,364.0,369.71,304.76,43.19,44.03,32.61,141.73,144.03,108.28,...,0.0,0.0,484.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,7001241209.0,492.82,496.63,461.02,127.93,130.09,106.69,284.98,289.03,251.98,...,0.0,0.0,936.0,0.0,0.0,0.0,2.0,2.0,2.0,0.0
75%,7001815446.0,697.07,702.95,679.0,354.23,368.36,310.49,522.53,540.19,493.58,...,0.0,0.0,1970.0,5.9,2.28,0.0,4.0,5.0,6.0,0.0
90%,7002165918.0,983.95,997.88,991.09,791.03,831.39,742.71,909.14,940.69,865.33,...,0.0,0.0,2866.0,449.05,456.99,390.22,9.0,9.0,10.0,0.0
95%,7002287299.0,1228.88,1262.47,1264.21,1147.39,1218.76,1126.64,1257.96,1293.79,1200.18,...,1.0,1.0,3194.0,826.84,845.79,749.99,13.0,12.0,14.0,1.0
99%,7002386132.0,1943.34,1994.87,1988.52,2153.39,2228.66,2196.88,2294.61,2417.01,2224.47,...,3.0,3.0,3651.0,1816.44,1945.34,1843.8,18.0,18.0,25.0,1.0


In [None]:
# TODO: Do we need to remove entries due to outliers?

### Checking the Churn Rate

In [42]:
churn = (sum(df['churn'])/len(df['churn'].index))*100
churn

6.261044315858771

The churn rate among the retained high-value customers is about 6.26%.

### Save the cleaned data in a new file

In [40]:
# write treated telecom file
df.to_csv("telecom_churn_data_clean.csv", sep=',', index=False)

# <font color='blue'>SUMMARY: DATA CLEANING</font>

