## Import the Raw Data

Having explored the initial source file, 'telco_churn.csv', provided by Jack Chang on Kaggle, it's time to start cleaning and pre-processing the dataset. This is probably one of the more time-consuming parts of building a pipeline, but it is an essential phase before performing EDA.

In [1]:
# Import all necessary libraries and modules
import numpy as np
import pandas as pd

Read the CSV source file into a dataframe and assign it to a variable called telco_churn.

In [2]:
telco_churn = pd.read_csv("C:/Users/lynst/Documents/Datasets/Kaggle/Jack Chang/telco_churn.csv")
telco_churn.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


## Data Pre-processing and Cleaning

### Dimensionality Reduction

First I need to remove the columns I don't want using the drop( ) method. The result is assigned back to telco_churn, so inplace=True must not be set: with inplace=True, drop( ) returns None and the assignment would overwrite the dataframe.

In [5]:
telco_churn = telco_churn.drop(['CustomerID','Count','Country','State','Latitude','Longitude','Churn Label'], axis=1)
telco_churn[0:5]

In [4]:
telco_churn.shape

So there are a total of 7043 row entries or instances and 26 columns or features.
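
As a minimal sketch of how drop( ) behaves, the toy frame below (hypothetical values, with only the column names borrowed from the dataset) compares the two calling styles. With inplace=True the method returns None, so assigning that result to a variable replaces the dataframe with None.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "CustomerID": ["a", "b"],
    "Count": [1, 1],
    "Gender": ["Male", "Female"],
})

# Without inplace=True, drop() returns a new DataFrame that can be assigned
trimmed = toy.drop(columns=["CustomerID", "Count"])

# With inplace=True, drop() modifies the frame it is called on and returns None
result = toy.copy().drop(columns=["CustomerID", "Count"], inplace=True)

print(trimmed.columns.tolist())
print(result)
```

Either style works on its own; the problem only appears when the two are combined in one statement.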

To list the names of the columns and their data type:

In [None]:
telco_churn.info()

To replace any white space in the column names with an underscore, use the str.replace( ) method:

In [None]:
telco_churn.columns = telco_churn.columns.str.replace(' ','_')
telco_churn.columns
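
The same renaming can be tried on an empty toy frame first. This is only a sketch: the three column names are taken from the preview above and `toy` is a hypothetical variable.

```python
import pandas as pd

# Empty toy frame that only carries column names
toy = pd.DataFrame(columns=["Zip Code", "Lat Long", "Gender"])

# Replace every literal space in each column name with an underscore
toy.columns = toy.columns.str.replace(" ", "_", regex=False)

print(toy.columns.tolist())
```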

A unique identifier such as the customer id is useful when manipulating data in SQL, where it establishes relationships between tables in a relational schema. It provides no insight into potential relationships when included in a machine learning model, however, so it was more prudent to drop the 'CustomerID' column.

### Remove White Space
This can be achieved using the 'replace( )' method on a DataFrame.

First, count the number of unique values in the 'City' column.

In [None]:
telco_churn['City'].nunique()

Later on a Decision Tree will be created using GraphViz. To draw the tree properly, the values in the 'City' column should not contain any whitespace, so spaces should be replaced with an underscore character. Start by looking at the first five unique values for 'City' using slicing.

In [None]:
telco_churn['City'].unique()[0:5]

Now check the first 5 rows of the 'City' column using the 'head( )' method having replaced whitespaces with underscore characters.

In [None]:
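# assign the result back to the column instead of relying on inplace=True on a column selection
telco_churn['City'] = telco_churn['City'].replace(' ', '_', regex=True)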
telco_churn['City'].head()
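
The regex=True argument matters here. As a small sketch on hypothetical city names, Series.replace( ) without it only swaps values that are exactly equal to a single space, whereas with regex=True it substitutes the space wherever it occurs inside a value.

```python
import pandas as pd

# Hypothetical sample of city names
cities = pd.Series(["Los Angeles", "San Diego", "Fresno"], name="City")

# Whole-value match: no entry equals " ", so nothing changes
whole_match = cities.replace(" ", "_")

# Substring match: every space inside a value is replaced
substring_match = cities.replace(" ", "_", regex=True)

print(whole_match.tolist())
print(substring_match.tolist())
```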

### Converting Data Types

Next I would like to convert any columns with string object datatypes into numeric values. Binary values will be converted by using the map( ) method for dictionary values and string values will be changed using label and one-hot encoding.

Gender needs converting to '1' for Male and '0' for Female for simplicity. 

In [None]:
gender_dict = {'Male': 1, 'Female': 0}

Now map the gender dictionary to the 'Gender' column:

In [None]:
telco_churn['Gender'] = telco_churn['Gender'].map(gender_dict)
telco_churn['Gender'].head()

'Senior Citizen' values can be converted to '1' for yes and '0' for no.

In [None]:
senior_dict = {'Yes': 1, 'No': 0}
telco_churn['Senior_Citizen'] = telco_churn['Senior_Citizen'].map(senior_dict)

Same for 'Partner':

In [None]:
partner_dict = {'Yes': 1, 'No': 0}
telco_churn['Partner'] = telco_churn['Partner'].map(partner_dict)

And 'Dependents':

In [None]:
dependents_dict = {'Yes': 1, 'No': 0}
telco_churn['Dependents'] = telco_churn['Dependents'].map(dependents_dict)

'Tenure Months' are already numeric integers so this is fine. 'Phone Service' can be converted:

In [None]:
phone_dict = {'Yes': 1, 'No': 0}
telco_churn['Phone_Service'] = telco_churn['Phone_Service'].map(phone_dict)

And 'Multiple Lines':

In [None]:
multi_dict = {'Yes': 1, 'No': 0}
telco_churn['Multiple_Lines'] = telco_churn['Multiple_Lines'].map(multi_dict)
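
One thing worth checking after each map( ) call is whether every label was covered by the dictionary, because any value missing from the dictionary becomes NaN. The sketch below assumes, purely for illustration, that a column holds a third label such as 'No phone service'; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample with a label that the dictionary does not cover
lines = pd.Series(["Yes", "No", "No phone service", "Yes"], name="Multiple_Lines")
multi_dict = {"Yes": 1, "No": 0}

mapped = lines.map(multi_dict)

# Labels that were turned into NaN because they are missing from the dictionary
unmapped = lines[mapped.isnull()].unique().tolist()

print(mapped.isnull().sum())
print(unmapped)
```

If the list of unmapped labels is not empty, the dictionary can be extended before the column is overwritten.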

### One-Hot Encoding

Looking at the data value counts for categorical features:

In [None]:
for column in telco_churn.columns:
    print(column)
    print(telco_churn[column].value_counts())
    print("")

In [None]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

data_city = telco_churn['City']
values_city = array(data_city)
print(values_city)

In [None]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_city = label_encoder.fit_transform(values_city)
print(integer_encoded_city)

In [None]:
# one-hot encode the integer-encoded city values
# (sparse_output replaces the older sparse argument from scikit-learn 1.2 onwards)
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded_city = integer_encoded_city.reshape(len(integer_encoded_city), 1)
onehot_encoded_city = onehot_encoder.fit_transform(integer_encoded_city)
print(onehot_encoded_city)

In [None]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_city[0, :])])
print(inverted)

'Internet Service' is a categorical feature, so label encoding followed by 'one-hot' encoding will be used here. LabelEncoder assigns integer codes in sorted order starting from 0, so 'DSL' becomes 0, 'Fibre optic' 1 and 'No' 2. OneHotEncoder then expands these codes into one binary column per option.

In [None]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

data_internet = telco_churn['Internet_Service']
values_internet = array(data_internet)
print(values_internet)

In [None]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_internet = label_encoder.fit_transform(values_internet)
print(integer_encoded_internet)

In [None]:
# encode the values from internet service to numbers
# sparse_output replaces the older sparse argument from scikit-learn 1.2 onwards
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded_internet = integer_encoded_internet.reshape(len(integer_encoded_internet), 1)
onehot_encoded_internet_service = onehot_encoder.fit_transform(integer_encoded_internet)
print(onehot_encoded_internet_service)

Test the first value.

In [None]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_internet_service[0, :])])
print(inverted)
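
The round trip from label to integer code to one-hot row and back can be followed on a handful of values. This is a sketch with hypothetical sample labels; to keep it short it builds the one-hot rows with np.eye rather than OneHotEncoder, which is a swapped-in shortcut and not the approach used in the cells above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of internet service labels
values = np.array(["DSL", "Fiber optic", "No", "DSL"])

# LabelEncoder sorts the distinct labels and numbers them from 0
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(values)

# Row i of the identity matrix is the one-hot vector for code i
onehot = np.eye(len(label_encoder.classes_))[codes]

# argmax recovers the code of the first row, inverse_transform its label
first_label = label_encoder.inverse_transform([np.argmax(onehot[0, :])])

print(label_encoder.classes_)
print(codes)
print(first_label)
```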

Repeating for the 'Contract' column which is also a categorical feature:

In [None]:
data_contract = telco_churn['Contract']
values_contract = array(data_contract)
print(values_contract)

In [None]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_contract = label_encoder.fit_transform(values_contract)
print(integer_encoded_contract)

In [None]:
# one-hot encode the integer-encoded contract values
# (sparse_output replaces the older sparse argument from scikit-learn 1.2 onwards)
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded_contract = integer_encoded_contract.reshape(len(integer_encoded_contract), 1)
onehot_encoded_contract = onehot_encoder.fit_transform(integer_encoded_contract)
print(onehot_encoded_contract)

Test the first value again.

In [None]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_contract[0, :])])
print(inverted)

In [None]:
data_payment = telco_churn['Payment_Method']
values_payment = array(data_payment)
print(values_payment)

In [None]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_payment = label_encoder.fit_transform(values_payment)
print(integer_encoded_payment)

In [None]:
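# one-hot encode the integer-encoded payment method values
# (sparse_output replaces the older sparse argument from scikit-learn 1.2 onwards)
onehot_encoder = OneHotEncoder(sparse_output=False)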
integer_encoded_payment = integer_encoded_payment.reshape(len(integer_encoded_payment), 1)
onehot_encoded_payment = onehot_encoder.fit_transform(integer_encoded_payment)
print(onehot_encoded_payment)
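
The cells above print the encoded arrays. If the encoded columns are wanted inside the dataframe itself, one possible alternative (not the approach used above) is pd.get_dummies( ), which swaps a categorical column for one indicator column per value. The sketch assumes a toy frame with hypothetical values.

```python
import pandas as pd

# Toy frame with hypothetical contract types and charges
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "Monthly_Charges": [53.85, 70.7, 99.65, 104.8],
})

# Replace 'Contract' with one 0/1 indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)

print(encoded.columns.tolist())
```

Because the new columns are named after the original values, no inverse transform is needed to read them.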

### Binary Encoding

In [None]:
security_dict = {'Yes': 1, 'No': 0}
telco_churn['Online_Security'] = telco_churn['Online_Security'].map(security_dict)

In [None]:
backup_dict = {'Yes':1, 'No':0}
telco_churn['Online_Backup'] = telco_churn['Online_Backup'].map(backup_dict)

In [None]:
device_protection_dict = {'Yes': 1, 'No': 0}
telco_churn['Device_Protection'] = telco_churn['Device_Protection'].map(device_protection_dict)

In [None]:
tech_support_dict = {'Yes': 1, 'No': 0}
telco_churn['Tech_Support'] = telco_churn['Tech_Support'].map(tech_support_dict)

In [None]:
stream_tv_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming_TV'] = telco_churn['Streaming_TV'].map(stream_tv_dict)

In [None]:
stream_movies_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming_Movies'] = telco_churn['Streaming_Movies'].map(stream_movies_dict)

In [None]:
data = telco_churn['Contract']
values = array(data)
print(values)

In [None]:
paperless_bill_dict = {'Yes': 1, 'No': 0}
telco_churn['Paperless_Billing'] = telco_churn['Paperless_Billing'].map(paperless_bill_dict)

'Payment_Method' is also a categorical feature, which is why it was converted using one-hot encoding above.

'Monthly_Charges' are already of float64 datatype, but 'Total_Charges' aren't so they need to be converted from string object to float64.

In [None]:
telco_churn['Total_Charges'] = pd.to_numeric(telco_churn['Total_Charges'], errors='coerce')
telco_churn['Total_Charges'].head()
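
To see what errors='coerce' does, the sketch below runs pd.to_numeric( ) on three hypothetical string entries, one of which is a blank space. Any entry that cannot be parsed as a number is turned into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical string entries, including a blank that cannot be parsed
charges = pd.Series(["108.15", " ", "820.5"])

# Unparseable entries become NaN; the result has a float dtype
numeric = pd.to_numeric(charges, errors="coerce")

print(numeric.dtype)
print(numeric.isnull().tolist())
```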

The 'Churn_Value' will be designated as the target vector for the purpose of my model and it currently has a simple integer datatype labeled as '1' for churn and '0' for no churn. I can already use these labels.

'Churn_Score' remains as an integer. It should be highly correlated with 'Churn_Value' so I will check for this next.

Finally, 'CLTV', which is the Customer Lifetime Value, is an indicator of the customer's importance during this period. This value should be compared to 'Churn_Value' or 'Churn_Score' for any association.
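
One way to make the comparisons described above is a correlation matrix. The sketch below calls DataFrame.corr( ) on a toy frame whose values are invented for illustration, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Churn_Value": [1, 1, 0, 0],
    "Churn_Score": [86, 67, 30, 45],
    "CLTV": [3239, 2701, 5372, 5003],
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()

print(corr.loc["Churn_Value", ["Churn_Score", "CLTV"]])
```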

Instead of converting each binary column in a separate cell as above, it may be more prudent to iterate through these columns with a for loop to cut down on repeated code.

In [None]:
# alternative to the individual cells above: convert every Yes/No column in one loop
# left commented out because the columns are already mapped; running it again would turn the values into NaN
# ('Gender' is excluded because it uses the Male/Female dictionary)
# binary_columns = ['Senior_Citizen','Partner','Dependents','Phone_Service','Multiple_Lines','Online_Security','Online_Backup','Device_Protection','Tech_Support','Streaming_TV','Streaming_Movies','Paperless_Billing']
# yes_no_dict = {'Yes': 1, 'No': 0}
# for column in binary_columns:
    # telco_churn[column] = telco_churn[column].map(yes_no_dict)
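
As a runnable sketch of the same idea, the toy frame below (hypothetical values) converts several Yes/No columns in a single statement by applying map( ) to each selected column. The names `binary_columns` and `yes_no_dict` are illustrative.

```python
import pandas as pd

# Toy frame with made-up Yes/No answers
toy = pd.DataFrame({
    "Partner": ["Yes", "No"],
    "Dependents": ["No", "No"],
    "Paperless_Billing": ["Yes", "Yes"],
})

binary_columns = ["Partner", "Dependents", "Paperless_Billing"]
yes_no_dict = {"Yes": 1, "No": 0}

# apply() passes each selected column to the lambda, which maps it to 0/1
toy[binary_columns] = toy[binary_columns].apply(lambda column: column.map(yes_no_dict))

print(toy)
```

This should only be run once on raw string columns, since mapping values that are already 0 or 1 would produce NaN.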

### Identify Missing Values
Looking for missing values and removing them is the next phase, although it may be better to replace them with 0, or impute average values. First I want to calculate the number of entries.

In [None]:
telco_churn.index

There are a total of 7043 rows in the dataset. Having removed the 'CustomerID' column I want to view all the columns in the dataframe once more.

In [None]:
telco_churn.columns

In [None]:
telco_missing = pd.isnull(telco_churn).sum()
print(telco_missing)

Within the dataframe I want to count the rows where the 'Total_Charges' column holds a blank space instead of a value.

In [None]:
len(telco_churn.loc[telco_churn['Total_Charges'] == ' '])

So this means that none of the entries contain a blank space any more: pd.to_numeric( ) with errors='coerce' has already converted any value it could not parse. Try printing these entries out.

In [None]:
# print these rows
telco_churn.loc[telco_churn['Total_Charges'] == ' ']

Perhaps they contain another value such as NaN, or 0 already. 

Using the isnull( ) method I may be able to find the sum of these entries and print them out.

In [None]:
telco_churn['Total_Charges'].isnull().sum()

In [None]:
telco_churn[telco_churn['Total_Charges'].isnull()]

I can see all eleven Null values and decide whether to remove these rows from the dataframe completely, or impute some kind of average value or a 0. In this case they've been assigned values of 'NaN', or 'Not a Number'.

I've decided to set these missing values to 0 for now. I can always remove the values later and try running the model again to see if there is any difference to the scoring metric.

In [None]:
# NaN never equals the string 'NaN', so select the missing rows with isnull()
telco_churn.loc[telco_churn['Total_Charges'].isnull(), 'Total_Charges'] = 0
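
Missing values cannot be found by comparing against the string 'NaN', because NaN is a float marker and not text. The sketch below, on three hypothetical values, contrasts that comparison with isnull( ) and shows fillna( ) as another way of setting the missing entries to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical charges with one missing entry
charges = pd.Series([108.15, np.nan, 820.5])

# Comparing with the string 'NaN' matches nothing
matches_string = int((charges == "NaN").sum())

# isnull() finds the missing entry
matches_isnull = int(charges.isnull().sum())

# fillna() replaces every missing entry with 0
filled = charges.fillna(0)

print(matches_string, matches_isnull)
print(filled.tolist())
```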

Let me see if this has worked and the NaN values in the 'Total_Charges' column have been converted to 0.

In [None]:
telco_churn[telco_churn['Total_Charges'].isnull()]

In [None]:
telco_churn['Total_Charges'].unique()

Print the cleaned dataset which has undergone the first part of the Transformation phase in the ETL pipeline.

In [None]:
print(telco_churn)

The next phase of transformation should include some form of Normalization or Scaling so that all values within the dataframe fall within a common range. This can really help with the modeling phase because features with very large values no longer dominate the others.

### Normalization or Re-Scaling

In [None]:
from sklearn import preprocessing

In [None]:
telco_churn = telco_churn.drop(columns=['Date'])

In [None]:
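# keep only the numeric columns and convert their entries to float type
# (columns that still hold strings cannot be cast to float)
telco_numeric = telco_churn.select_dtypes(include='number')
X = telco_numeric.values.astype(float)

# define min max scaler
min_max_scaler = preprocessing.MinMaxScaler()

# transform data
X_scaled = min_max_scaler.fit_transform(X)

telco_churn_scaled = pd.DataFrame(X_scaled, columns=telco_numeric.columns)
print(telco_churn_scaled)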
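
With its default feature range of 0 to 1, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below reproduces that formula directly in pandas on a toy frame with hypothetical values, which makes the effect on each column easy to inspect.

```python
import pandas as pd

# Toy frame with made-up values on very different scales
toy = pd.DataFrame({
    "Monthly_Charges": [50.0, 75.0, 100.0],
    "CLTV": [2000, 3000, 6000],
})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```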

Finally, convert the dataframe to a CSV file now it has been pre-processed.

In [None]:
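telco_churn_cleaned = telco_churn_scaled
# to_csv() only writes a file when given a path; the file name used here is an assumption
telco_churn_cleaned.to_csv('telco_churn_cleaned.csv', index=False)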
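
The difference between calling to_csv( ) with and without a target can be seen in the sketch below. It uses a toy frame and an in-memory buffer as a stand-in for a file, so nothing is written to disk.

```python
import io

import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({"Churn_Value": [1, 0]})

# Without a path or buffer, to_csv() returns the CSV text as a string
csv_text = toy.to_csv(index=False)

# With a target, to_csv() writes to it and returns None
buffer = io.StringIO()
returned = toy.to_csv(buffer, index=False)

print(csv_text.splitlines())
print(returned)
```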