<a href="https://colab.research.google.com/github/nd823/data-cleaning/blob/master/telco_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import libraries

In [0]:
import pandas as pd
import numpy as np

In [0]:
%load_ext rpy2.ipython


# Import data

In [0]:
df = pd.read_csv("https://github.com/treselle-systems/customer_churn_analysis/raw/master/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Initial check


## Preview data

In [0]:
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


## Check column data types

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)

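Need to:
- Drop the `customerID` column, which is an identifier rather than a feature
- Encode the `SeniorCitizen` column (currently `int64`, 0/1) as "Yes"/"No" strings, stored as `object` like the other Yes/No columns
- Convert the `TotalCharges` column from `object` to `float64`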

# Data cleaning

In [0]:
df = df.drop(['customerID'], axis = 1)

## Convert and encode `SeniorCitizen` column

In [0]:
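# Encode 1 as 'Yes' and 0 as 'No' so the column matches the other Yes/No columns
df['SeniorCitizen'] = np.where(df['SeniorCitizen'] == 1, 'Yes', 'No')

# Store the labels as plain strings (object dtype)
df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')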
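
As a minimal sketch of this encoding step on made-up values (the frame `toy` is hypothetical, not the Telco data), `np.where` maps 1 to "Yes" and anything else to "No". The last lines show an optional alternative, `astype('category')`, for anyone who prefers a categorical dtype over plain strings; that variant is an assumption, since the notebook itself keeps the column as `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the SeniorCitizen column (0/1 integers)
toy = pd.DataFrame({"SeniorCitizen": [0, 1, 0, 1]})

# Encode 1 as 'Yes' and every other value as 'No'
toy["SeniorCitizen"] = np.where(toy["SeniorCitizen"] == 1, "Yes", "No")
print(toy["SeniorCitizen"].tolist())

# Optional alternative (an assumption, not what the notebook does):
# store the labels as a pandas categorical instead of plain strings
as_category = toy["SeniorCitizen"].astype("category")
print(as_category.cat.categories.tolist())
```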

## Convert `TotalCharges` column to `float64` type

In [0]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
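
The sketch below illustrates what `errors='coerce'` does, using a hypothetical series `raw` with one blank entry. The notebook does not show which raw values fail to parse, so the blank string here is an assumption. Values that cannot be parsed become `NaN` instead of raising an error, and masking on the result shows which raw entries were affected.

```python
import pandas as pd

# Hypothetical stand-in for a numeric column that was read in as strings
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries become NaN rather than raising a ValueError
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to convert
failed = raw[converted.isna() & raw.notna()]

print(converted.dtype)
print(int(converted.isna().sum()))
print(failed.tolist())
```

Running the same mask on the original string column before overwriting it is one way to inspect the rows that end up missing.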

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
gender              7043 non-null object
SeniorCitizen       7043 non-null object
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7032 non-null float64
Churn               7043 non-null object
dtypes: float64(2), int64(1), object(17)
memory usage: 1.1+ MB


## Create new `TotalCharges` column

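After the conversion, 11 of the 7043 values in the `TotalCharges` column are missing (7032 non-null). We replace it with a new `Calculated_TotalCharges` column, obtained by multiplying the `MonthlyCharges` and `tenure` columns, neither of which has missing values.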


In [0]:
# Drop the TotalCharges column, which has missing values
df = df.drop(['TotalCharges'], axis=1)

# Create Calculated_TotalCharges from tenure and MonthlyCharges, which will not have missing values
df['Calculated_TotalCharges'] = df['MonthlyCharges'] * df['tenure']
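
A minimal sketch of this derivation on hypothetical values (the frame `toy` and its numbers are made up for illustration). Because the two input columns are complete, their product is complete as well. Note that a row with a `tenure` of 0 yields a total of 0, and that the product only approximates the billed total if a customer's monthly charge changed over time.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the NaN mimics a missing TotalCharges value
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": [29.85, 1889.50, np.nan],
})

# Replace the incomplete column with one derived from complete columns
toy = toy.drop(columns=["TotalCharges"])
toy["Calculated_TotalCharges"] = toy["MonthlyCharges"] * toy["tenure"]

# Count of missing values in the derived column
print(int(toy["Calculated_TotalCharges"].isna().sum()))
print(toy)
```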

A final check:

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
gender                     7043 non-null object
SeniorCitizen              7043 non-null object
Partner                    7043 non-null object
Dependents                 7043 non-null object
tenure                     7043 non-null int64
PhoneService               7043 non-null object
MultipleLines              7043 non-null object
InternetService            7043 non-null object
OnlineSecurity             7043 non-null object
OnlineBackup               7043 non-null object
DeviceProtection           7043 non-null object
TechSupport                7043 non-null object
StreamingTV                7043 non-null object
StreamingMovies            7043 non-null object
Contract                   7043 non-null object
PaperlessBilling           7043 non-null object
PaymentMethod              7043 non-null object
MonthlyCharges             7043 non-null float64
Churn                      70

In [0]:
df.to_feather('./data/telco_cleaned_May31')
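
`to_feather` expects the output directory to exist already and relies on `pyarrow` being installed. As a small sketch under those assumptions, a helper can create the directory first and read the file back to confirm the round trip; the helper name `save_feather` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def save_feather(frame: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write `frame` to `path` in Feather format and return it read back."""
    target = Path(path)
    # Create the parent directory (e.g. ./data) if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_feather(target)
    return pd.read_feather(target)
```

With the cleaned frame from above, this could be called as `restored = save_feather(df, './data/telco_cleaned_May31')`, after which `restored.info()` should match the final check.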