# Data Preparation

This notebook covers the first steps of preparing the `Telco Customer Churn` dataset for analysis: loading the data, inspecting it, and cleaning it so that it is in a suitable format for further work. Key tasks include handling missing values, encoding categorical variables, normalizing numerical features, and addressing data quality issues.

**By the end of this notebook, the dataset will be transformed into a refined version ready for exploratory data analysis and model development.**

### Loading Toolkit

In [204]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

### Loading Dataset

In [205]:
df = pd.read_csv('../data/telco_customer_churn.csv')

### Understanding The Data
- Dataframe `shape`
- `info`
- `head` and `tail`
- `describe`
- `unique` values

In [206]:
df.shape

(7043, 21)

In [207]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [208]:
df

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [209]:
df.describe()

,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.16,32.37,64.76
std,0.37,24.56,30.09
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75
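
The `describe` output above lists only three numeric columns, which suggests that `TotalCharges` was read as a non-numeric (object) column. A minimal sketch of one way to check and convert such a column is shown below. It runs on a toy frame, and the blank string is an assumed example of an unparsable entry, not a value confirmed from this dataset.

```python
import pandas as pd

# Toy stand-in for the real column; the ' ' entry is a hypothetical bad value.
toy = pd.DataFrame({'TotalCharges': ['29.85', '1889.5', ' ']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
# instead of raising, so the problem rows can be counted and inspected.
converted = pd.to_numeric(toy['TotalCharges'], errors='coerce')
n_unparsed = int(converted.isna().sum())

print(converted.dtype, n_unparsed)
```

Counting the coerced `NaN` values before overwriting the column makes it possible to decide how to treat those rows rather than dropping them silently.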


In [210]:
# Print each column's distinct values and dtype to spot inconsistent categories.
for column in df.columns:
    print('----------')
    print(f'{column} --- {df[column].unique()} --- {df[column].dtype}')

----------
customerID --- ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK'] --- object
----------
gender --- ['Female' 'Male'] --- object
----------
SeniorCitizen --- [0 1] --- int64
----------
Partner --- ['Yes' 'No'] --- object
----------
Dependents --- ['No' 'Yes'] --- object
----------
tenure --- [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26  0
 39] --- int64
----------
PhoneService --- ['No' 'Yes'] --- object
----------
MultipleLines --- ['No phone service' 'No' 'Yes'] --- object
----------
InternetService --- ['DSL' 'Fiber optic' 'No'] --- object
----------
OnlineSecurity --- ['No' 'Yes' 'No internet service'] --- object
----------
OnlineBackup --- ['Yes' 'No' 'No internet service'] --- object
----------
DeviceProtection --- ['No' 'Yes' 'No internet service'] --- object
--------

The output shows that some add-on services are recorded as `No internet service`. Before applying one-hot encoding, we check that this label is used consistently: every customer with `No` for `InternetService` should have `No internet service` for all related services, and no customer with internet service should carry that label. This keeps the data consistent for later analysis and for the churn model.

In [211]:
services = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
no_internet = df['InternetService'] == 'No'
for column in services:
    flagged = df[column] == 'No internet service'
    # Customers without internet must carry the label on every add-on service...
    assert flagged[no_internet].all()
    # ...and customers with internet must never carry it.
    assert not flagged[~no_internet].any()
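
The assertions above only pass or fail; they do not show which rows are inconsistent. As a hedged alternative, the sketch below uses `pd.crosstab` and a boolean comparison to surface any mismatching rows. It runs on a small toy frame whose values are invented for illustration.

```python
import pandas as pd

# Toy frame with the same labels as the real columns (values are illustrative).
toy = pd.DataFrame({
    'InternetService': ['DSL', 'Fiber optic', 'No', 'No'],
    'OnlineSecurity': ['Yes', 'No', 'No internet service', 'No internet service'],
})

# Cross-tabulate the two columns: a consistent dataset has non-zero counts
# in the 'No internet service' column only on the 'No' row.
table = pd.crosstab(toy['InternetService'], toy['OnlineSecurity'])

# Rows where "has no internet" and "is labelled No internet service" disagree.
no_internet = toy['InternetService'] == 'No'
flagged = toy['OnlineSecurity'] == 'No internet service'
mismatches = toy[no_internet != flagged]

print(table)
print(len(mismatches))
```

An empty `mismatches` frame corresponds to both assertions passing; a non-empty one lists the exact rows to investigate.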

Next, we replace every `No internet service` entry with `No`, since a customer without internet cannot have the add-on services. `InternetService` is tracked in its own column, so later analysis can still distinguish customers with and without internet service. The assertion below verifies that no `No internet service` entries remain.

In [212]:
df.replace('No internet service', 'No', inplace=True)
# Check every add-on service column, not just one of them.
assert not df[services].eq('No internet service').any().any()
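
The call above applies `replace` to the whole frame, which works here because the label appears only in the add-on service columns. If you prefer to make that scope explicit, a minimal sketch of a column-restricted replacement looks like this. The toy frame and its shortened `services` list are illustrative assumptions.

```python
import pandas as pd

# Shortened, illustrative list of add-on service columns.
services = ['OnlineSecurity', 'DeviceProtection']
toy = pd.DataFrame({
    'InternetService': ['DSL', 'No'],
    'OnlineSecurity': ['Yes', 'No internet service'],
    'DeviceProtection': ['No', 'No internet service'],
})

# Replace the label only inside the selected columns; other columns are untouched.
toy[services] = toy[services].replace('No internet service', 'No')

# True if the label still appears anywhere in the service columns.
remaining = bool(toy[services].eq('No internet service').any().any())
print(remaining)
```

Restricting the replacement guards against accidentally rewriting an identical string in an unrelated column.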

Similarly, we will replace all `No phone service` entries with `No` under `MultipleLines` because not having phone service means not having multiple lines. We already track `PhoneService` separately.

In [213]:
df.replace('No phone service', 'No', inplace=True)
assert df[df['MultipleLines'] == 'No phone service'].empty
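
Once these columns hold only `Yes` and `No`, one-hot encoding produces a single indicator per feature. The sketch below illustrates this with `pd.get_dummies` on a toy column; the choice of `drop_first=True` is an assumption for illustration, not necessarily the encoding used later in this project.

```python
import pandas as pd

# Toy column after the 'No phone service' -> 'No' replacement.
toy = pd.DataFrame({'MultipleLines': ['No', 'Yes', 'No']})

# With two categories, drop_first=True keeps one indicator column
# ('MultipleLines_Yes') and drops the redundant 'MultipleLines_No'.
encoded = pd.get_dummies(toy, columns=['MultipleLines'], drop_first=True)

print(list(encoded.columns))
```

Had the third label been kept, the same call would have produced an extra indicator that duplicates information already carried by `PhoneService`.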