## Import the Raw Data

Having explored the initial source file provided by Jack Chang in Kaggle entitled 'telco_churn.csv', it's time to start cleaning or pre-processing the dataset. This is probably one of the more time consuming elements creating a pipeline but is an essential phaze before performing EDA.

In [1]:
# Import all necessary libraries and modules
import numpy as np
import pandas as pd

Read the CSV source file into a dataframe and assign it to a variable called telco_churn.

In [2]:
telco_churn = pd.read_csv("C:/Users/lynst/Documents/Datasets/Kaggle/Jack Chang/telco_churn.csv")
telco_churn.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


## Data Pre-processing and Cleaning

### Dimensionality Reduction

First I need to remove the columns I don't want using the drop( ) method.

In [3]:
telco_churn.drop(['CustomerID','Count','Country','State','Zip Code','Latitude','Longitude','Churn Label'], axis=1, inplace=True)
telco_churn.head()

Unnamed: 0,City,Lat Long,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,...,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Value,Churn Score,CLTV,Churn Reason
0,Los Angeles,"33.964131, -118.272783",Male,No,No,No,2,Yes,No,DSL,...,No,Month-to-month,Yes,Mailed check,53.85,108.15,1,86,3239,Competitor made better offer
1,Los Angeles,"34.059281, -118.30742",Female,No,No,Yes,2,Yes,No,Fiber optic,...,No,Month-to-month,Yes,Electronic check,70.7,151.65,1,67,2701,Moved
2,Los Angeles,"34.048013, -118.293953",Female,No,No,Yes,8,Yes,Yes,Fiber optic,...,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,1,86,5372,Moved
3,Los Angeles,"34.062125, -118.315709",Female,No,Yes,Yes,28,Yes,Yes,Fiber optic,...,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,1,84,5003,Moved
4,Los Angeles,"34.039224, -118.266293",Male,No,No,Yes,49,Yes,Yes,Fiber optic,...,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,1,89,5340,Competitor had better devices


In [4]:
telco_churn.shape

(7043, 25)

So there are a total of 7043 row entries or instances and 26 columns or features.

To list the names of the columns and their data type:

In [5]:
telco_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   City               7043 non-null   object 
 1   Lat Long           7043 non-null   object 
 2   Gender             7043 non-null   object 
 3   Senior Citizen     7043 non-null   object 
 4   Partner            7043 non-null   object 
 5   Dependents         7043 non-null   object 
 6   Tenure Months      7043 non-null   int64  
 7   Phone Service      7043 non-null   object 
 8   Multiple Lines     7043 non-null   object 
 9   Internet Service   7043 non-null   object 
 10  Online Security    7043 non-null   object 
 11  Online Backup      7043 non-null   object 
 12  Device Protection  7043 non-null   object 
 13  Tech Support       7043 non-null   object 
 14  Streaming TV       7043 non-null   object 
 15  Streaming Movies   7043 non-null   object 
 16  Contract           7043 

To remove any white space in the column names using the replace method:

In [6]:
telco_churn.columns = telco_churn.columns.str.replace(' ','_')
telco_churn.columns

Index(['City', 'Lat_Long', 'Gender', 'Senior_Citizen', 'Partner', 'Dependents',
       'Tenure_Months', 'Phone_Service', 'Multiple_Lines', 'Internet_Service',
       'Online_Security', 'Online_Backup', 'Device_Protection', 'Tech_Support',
       'Streaming_TV', 'Streaming_Movies', 'Contract', 'Paperless_Billing',
       'Payment_Method', 'Monthly_Charges', 'Total_Charges', 'Churn_Value',
       'Churn_Score', 'CLTV', 'Churn_Reason'],
      dtype='object')

Although the purpose of having unique identifiers to label each entry or customer id becomes useful when manipulating data in SQL, it will not provide any insight or potential relationships if included in a machine learning model so it was more prudent to drop the 'CustomerID' column. (In structured relational databases a unique identifier column of values becomes useful for establishing relationships in a Schema table object).

Now check the first 5 rows of the 'City' column using the 'head( )' method having replaced whitespaces with underscore characters.

In [7]:
telco_churn['City'].replace(' ', '_', regex=True, inplace=True)
telco_churn['City'].head()

0    Los_Angeles
1    Los_Angeles
2    Los_Angeles
3    Los_Angeles
4    Los_Angeles
Name: City, dtype: object

### Converting Data Types

Next I would like to convert any columns with string object datatypes into numeric values. Binary values will be converted by using the map( ) method for dictionary values and string values will be changed using label and one-hot encoding.

Gender needs converting to '1' for Male and '0' for Female for simplicity. 

In [8]:
gender_dict = {'Male': 1, 'Female': 0}

Now map the gender dictionary to the 'Gender' column:

In [9]:
telco_churn['Gender'] = telco_churn['Gender'].map(gender_dict)
telco_churn['Gender'].head()

0    1
1    0
2    0
3    0
4    1
Name: Gender, dtype: int64

'Senior Citizen' values can be converted to '1' for yes and '0' for no.

In [10]:
senior_dict = {'Yes': 1, 'No': 0}
telco_churn['Senior_Citizen'] = telco_churn['Senior_Citizen'].map(senior_dict)

Same for 'Partner':

In [11]:
partner_dict = {'Yes': 1, 'No': 0}
telco_churn['Partner'] = telco_churn['Partner'].map(partner_dict)

And 'Dependents':

In [12]:
dependents_dict = {'Yes': 1, 'No': 0}
telco_churn['Dependents'] = telco_churn['Dependents'].map(dependents_dict)

'Tenure Months' are already numeric integers so this is fine. 'Phone Service' can be converted:

In [13]:
phone_dict = {'Yes': 1, 'No': 0}
telco_churn['Phone_Service'] = telco_churn['Phone_Service'].map(phone_dict)

And 'Multiple Lines':

In [14]:
multi_dict = {'Yes': 1, 'No': 0}
telco_churn['Multiple_Lines'] = telco_churn['Multiple_Lines'].map(multi_dict)

### One-Hot Encoding

Counting the number of unique values in the 'City' column as first.

In [15]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

In [16]:
telco_churn['City'].nunique()

1129

Later on a Decision Tree will be created using GraphViz but in order to draw a tree properly it is not preferable to have any whitespace in the column values for 'City', so these should be replaced with an underscore character. Looking at the first five values for 'City' using slicing.

In [17]:
telco_churn['City'].unique()[0:5]

array(['Los_Angeles', 'Beverly_Hills', 'Huntington_Park', 'Lynwood',
       'Marina_Del_Rey'], dtype=object)

Looking at the data value counts for categorical features:

In [18]:
for column in telco_churn.columns.values.tolist():
    print(column)
    print (telco_churn[column].value_counts())
    print("")

City
Los_Angeles       305
San_Diego         150
San_Jose          112
Sacramento        108
San_Francisco     104
                 ... 
Healdsburg          4
Jenner              4
Philo               4
Point_Arena         4
Olympic_Valley      4
Name: City, Length: 1129, dtype: int64

Lat_Long
33.964131, -118.272783    5
34.152875, -118.486056    5
32.912664, -116.635387    5
32.64164, -116.985026     5
32.607964, -117.059459    5
                         ..
37.4695, -120.672724      4
38.055562, -120.456298    4
38.244806, -120.417301    4
38.264262, -120.515133    4
39.191797, -120.212401    4
Name: Lat_Long, Length: 1652, dtype: int64

Gender
1    3555
0    3488
Name: Gender, dtype: int64

Senior_Citizen
0    5901
1    1142
Name: Senior_Citizen, dtype: int64

Partner
0    3641
1    3402
Name: Partner, dtype: int64

Dependents
0    5416
1    1627
Name: Dependents, dtype: int64

Tenure_Months
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36

#### City Encoding

'City' has categorical values so categorical or 'one-hot' encoding will be used here to assign different numeric values for each option (in this case there are 1129 different outcomes for city).

In [19]:
data_city = telco_churn['City']
values_city = array(data_city)
print(values_city)

['Los_Angeles' 'Los_Angeles' 'Los_Angeles' ... 'Amboy' 'Angelus_Oaks'
 'Apple_Valley']


In [20]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_city = label_encoder.fit_transform(values_city)
print(integer_encoded_city)

[562 562 562 ...  22  26  32]


In [21]:
# encode the values from internet service to numbers
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded_city = integer_encoded_city.reshape(len(integer_encoded_city), 1)
onehot_encoded_city = onehot_encoder.fit_transform(integer_encoded_city)
print(onehot_encoded_city)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]




In [22]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_city[0, :])])
print(inverted)

['Los_Angeles']


In [23]:
data_internet = telco_churn['Internet_Service']
values_internet = array(data_internet)
print(values_internet)

['DSL' 'Fiber optic' 'Fiber optic' ... 'Fiber optic' 'DSL' 'Fiber optic']


In [24]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_internet = label_encoder.fit_transform(values_internet)
print(integer_encoded_internet)

[0 1 1 ... 1 0 1]


In [25]:
# encode the values from internet service to numbers
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded_internet = integer_encoded_internet.reshape(len(integer_encoded_internet), 1)
onehot_encoded_internet_service = onehot_encoder.fit_transform(integer_encoded_internet)
print(onehot_encoded_internet_service)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]




Test the first value.

In [26]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_internet_service[0, :])])
print(inverted)

['DSL']


#### Contract Encoding

In [27]:
data_contract = telco_churn['Contract']
values_contract = array(data_contract)
print(values_contract)

['Month-to-month' 'Month-to-month' 'Month-to-month' ... 'One year'
 'Month-to-month' 'Two year']


In [28]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_contract = label_encoder.fit_transform(values_contract)
print(integer_encoded_contract)

[0 0 0 ... 1 0 2]


In [29]:
# encode the values from internet service to numbers
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded_contract = integer_encoded_contract.reshape(len(integer_encoded_contract), 1)
onehot_encoded_contract = onehot_encoder.fit_transform(integer_encoded_contract)
print(onehot_encoded_contract)

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]




Test the first value again.

In [30]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_internet_service[0, :])])
print(inverted)

['Month-to-month']


#### Payment Method Encoding

In [31]:
data_payment = telco_churn['Payment_Method']
values_payment = array(data_payment)
print(values_payment)

['Mailed check' 'Electronic check' 'Electronic check' ...
 'Credit card (automatic)' 'Electronic check' 'Bank transfer (automatic)']


In [32]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_payment = label_encoder.fit_transform(values_payment)
print(integer_encoded_payment)

[3 2 2 ... 1 2 0]


In [33]:
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded_payment = integer_encoded_payment.reshape(len(integer_encoded_payment), 1)
onehot_encoded_payment = onehot_encoder.fit_transform(integer_encoded_payment)
print(onehot_encoded_payment)

[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 ...
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]




In [34]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_payment[0, :])])
print(inverted)

['Mailed check']


#### Churn Reason Encoding

In [35]:
data_reason = telco_churn['Churn_Reason']
values_reason = array(data_reason)
print(values_reason)

['Competitor made better offer' 'Moved' 'Moved' ... nan nan nan]


In [36]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_reason = label_encoder.fit_transform(values_reason)
print(integer_encoded_reason)

[ 3 13 13 ... 20 20 20]


In [37]:
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded_reason = integer_encoded_reason.reshape(len(integer_encoded_reason), 1)
onehot_encoded_reason = onehot_encoder.fit_transform(integer_encoded_reason)
print(onehot_encoded_reason)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]




In [38]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_reason[0, :])])
print(inverted)

['Competitor made better offer']


### Binary Encoding

In [39]:
security_dict = {'Yes': 1, 'No': 0}
telco_churn['Online_Security'] = telco_churn['Online_Security'].map(security_dict)

In [40]:
backup_dict = {'Yes':1, 'No':0}
telco_churn['Online_Backup'] = telco_churn['Online_Backup'].map(backup_dict)

In [41]:
device_protection_dict = {'Yes': 1, 'No': 0}
telco_churn['Device_Protection'] = telco_churn['Device_Protection'].map(device_protection_dict)

In [42]:
tech_support_dict = {'Yes': 1, 'No': 0}
telco_churn['Tech_Support'] = telco_churn['Tech_Support'].map(tech_support_dict)

In [43]:
stream_tv_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming_TV'] = telco_churn['Streaming_TV'].map(stream_tv_dict)

In [44]:
stream_movies_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming_Movies'] = telco_churn['Streaming_Movies'].map(stream_movies_dict)

In [45]:
data = telco_churn['Contract']
values = array(data)
print(values)

['Month-to-month' 'Month-to-month' 'Month-to-month' ... 'One year'
 'Month-to-month' 'Two year']


In [46]:
paperless_bill_dict = {'Yes': 1, 'No': 0}
telco_churn['Paperless_Billing'] = telco_churn['Paperless_Billing'].map(paperless_bill_dict)

'Payment_Method' also has categorical features which can be converted using one-hot encoding.

'Monthly_Charges' are already of float64 datatype, but 'Total_Charges' aren't so they need to be converted from string object to float64.

In [47]:
telco_churn['Total_Charges'] = pd.to_numeric(telco_churn['Total_Charges'], errors='coerce')
telco_churn['Total_Charges'].head()

0     108.15
1     151.65
2     820.50
3    3046.05
4    5036.30
Name: Total_Charges, dtype: float64

The 'Churn_Value' will be designated as the target vector for the purpose of my model and it currently has a simple integer datatype labeled as '1' for churn and '0' for no churn. I can already use these labels.

'Churn_Score' remains as an integer. It should be highly correlated with 'Churn_Value' so I will check for this next.

Finally, 'CLTV' which is the Customer Lifetime Value is an indicator of the customers importance during this period. This value should be compared to 'Churn_Value' or 'Churn_Score' for any association.

In addition to converting all the binary decision outcomes for certain columns above, it may be more prudent to create a for loop to iterate through these columns in a subset to speed up the algorithm.

In [48]:
# only include all columns with binary outcomes to be converted to integer values
# telco_subset = telco_churn['Gender','Senior_Citizen','Partner','Dependents', 'Phone_Service','Multiple_Lines','Online_Security','Online_Backup','Device_Protection','Tech_Support','Streaming_TV','Streaming_Movies','Paperless_Billing']
# i=0
# define a new function to iterate through the data values and change them
# for i in telco_subset:
    # dict = {'Yes': 1, 'No': 0}
    # telco_subset = telco_subset.map(dict)
    # i+=1

### Identify Missing Values
Looking for missing values and removing them is the next phase, although it may be better to replace them with 0, or impute average values. First I want to calculate the number of entries.

In [49]:
telco_churn.index

RangeIndex(start=0, stop=7043, step=1)

There are a total of 7043 rows in the dataset. Having removed the 'CustomerID' column I want to view all the columns in the dataframe once more.

In [50]:
telco_churn.columns

Index(['City', 'Lat_Long', 'Gender', 'Senior_Citizen', 'Partner', 'Dependents',
       'Tenure_Months', 'Phone_Service', 'Multiple_Lines', 'Internet_Service',
       'Online_Security', 'Online_Backup', 'Device_Protection', 'Tech_Support',
       'Streaming_TV', 'Streaming_Movies', 'Contract', 'Paperless_Billing',
       'Payment_Method', 'Monthly_Charges', 'Total_Charges', 'Churn_Value',
       'Churn_Score', 'CLTV', 'Churn_Reason'],
      dtype='object')

In [51]:
telco_missing = pd.isnull(telco_churn).sum()
print(telco_missing)

City                    0
Lat_Long                0
Gender                  0
Senior_Citizen          0
Partner                 0
Dependents              0
Tenure_Months           0
Phone_Service           0
Multiple_Lines        682
Internet_Service        0
Online_Security      1526
Online_Backup        1526
Device_Protection    1526
Tech_Support         1526
Streaming_TV         1526
Streaming_Movies     1526
Contract                0
Paperless_Billing       0
Payment_Method          0
Monthly_Charges         0
Total_Charges          11
Churn_Value             0
Churn_Score             0
CLTV                    0
Churn_Reason         5174
dtype: int64


Within the dataframe I need to count the location of the rows where the 'Total_Charges' column which have no entry is identical to 'True'. This counts the number of blank spaces in the particular column.

In [52]:
len(telco_churn.loc[telco_churn['Total_Charges'] == ' '])

0

So this means that none of the entries have any blank spaces. Try printing these entries out.

In [53]:
# print these rows
telco_churn.loc[telco_churn['Total_Charges'] == ' ']

Unnamed: 0,City,Lat_Long,Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,...,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Value,Churn_Score,CLTV,Churn_Reason


Perhaps they contain another value such as NaN, or 0 already. 

Using the isnull( ) method I may be able to find the sum of these entries and print them out.

In [54]:
telco_churn['Total_Charges'].isnull().sum()

11

In [55]:
telco_churn[telco_churn['Total_Charges'].isnull()]

Unnamed: 0,City,Lat_Long,Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,...,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Value,Churn_Score,CLTV,Churn_Reason
2234,San_Bernardino,"34.084909, -117.258107",0,0,1,0,0,0,,DSL,...,0.0,Two year,1,Bank transfer (automatic),52.55,,0,36,2578,
2438,Independence,"36.869584, -118.189241",1,0,0,0,0,1,0.0,No,...,,Two year,0,Mailed check,20.25,,0,68,5504,
2568,San_Mateo,"37.590421, -122.306467",0,0,1,0,0,1,0.0,DSL,...,1.0,Two year,0,Mailed check,80.85,,0,45,2048,
2667,Cupertino,"37.306612, -122.080621",1,0,1,1,0,1,1.0,No,...,,Two year,0,Mailed check,25.75,,0,48,4950,
2856,Redcrest,"40.363446, -123.835041",0,0,1,0,0,0,,DSL,...,0.0,Two year,0,Credit card (automatic),56.05,,0,30,4740,
4331,Los_Angeles,"34.089953, -118.294824",1,0,1,1,0,1,0.0,No,...,,Two year,0,Mailed check,19.85,,0,53,2019,
4687,Sun_City,"33.739412, -117.173334",1,0,1,1,0,1,1.0,No,...,,Two year,0,Mailed check,25.35,,0,49,2299,
5104,Ben_Lomond,"37.078873, -122.090386",0,0,1,1,0,1,0.0,No,...,,Two year,0,Mailed check,20.0,,0,27,3763,
5719,La_Verne,"34.144703, -117.770299",1,0,1,1,0,1,0.0,No,...,,One year,1,Mailed check,19.7,,0,69,4890,
6772,Bell,"33.970343, -118.171368",0,0,1,1,0,1,1.0,DSL,...,0.0,Two year,0,Mailed check,73.35,,0,44,2342,


I can see all eleven Null values and decide whether to remove these rows from the dataframe completely, or impute some kind of average value or a 0. In this case they've been assigned values of 'NaN', or 'Not a Number'.

I've decided to set these missing values to 0 for now. I can always remove the values later and try running the model again to see if there is any difference to the scoring metric.

In [56]:
telco_churn.loc[(telco_churn['Total_Charges'] == 'NaN'), 'Total_Charges'] = 0

Let me see if this has worked and the 'Total_Charges' column with NaN values have been converted to 0.

In [57]:
telco_churn[telco_churn['Total_Charges'].isnull()]

Unnamed: 0,City,Lat_Long,Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,...,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Value,Churn_Score,CLTV,Churn_Reason
2234,San_Bernardino,"34.084909, -117.258107",0,0,1,0,0,0,,DSL,...,0.0,Two year,1,Bank transfer (automatic),52.55,,0,36,2578,
2438,Independence,"36.869584, -118.189241",1,0,0,0,0,1,0.0,No,...,,Two year,0,Mailed check,20.25,,0,68,5504,
2568,San_Mateo,"37.590421, -122.306467",0,0,1,0,0,1,0.0,DSL,...,1.0,Two year,0,Mailed check,80.85,,0,45,2048,
2667,Cupertino,"37.306612, -122.080621",1,0,1,1,0,1,1.0,No,...,,Two year,0,Mailed check,25.75,,0,48,4950,
2856,Redcrest,"40.363446, -123.835041",0,0,1,0,0,0,,DSL,...,0.0,Two year,0,Credit card (automatic),56.05,,0,30,4740,
4331,Los_Angeles,"34.089953, -118.294824",1,0,1,1,0,1,0.0,No,...,,Two year,0,Mailed check,19.85,,0,53,2019,
4687,Sun_City,"33.739412, -117.173334",1,0,1,1,0,1,1.0,No,...,,Two year,0,Mailed check,25.35,,0,49,2299,
5104,Ben_Lomond,"37.078873, -122.090386",0,0,1,1,0,1,0.0,No,...,,Two year,0,Mailed check,20.0,,0,27,3763,
5719,La_Verne,"34.144703, -117.770299",1,0,1,1,0,1,0.0,No,...,,One year,1,Mailed check,19.7,,0,69,4890,
6772,Bell,"33.970343, -118.171368",0,0,1,1,0,1,1.0,DSL,...,0.0,Two year,0,Mailed check,73.35,,0,44,2342,


In [58]:
telco_churn['Total_Charges'].unique()

array([ 108.15,  151.65,  820.5 , ..., 7362.9 ,  346.45, 6844.5 ])

In [None]:
telco_churn['Total_Charges'] = pd.to_numeric(telco_churn['Total_Charges'])

In [None]:
telco_churn['Multiple_Lines'].isnull().sum()

In [None]:
telco_churn[telco_churn['Multiple_Lines'].isnull()]

In [None]:
telco_churn['Online_Security'].isnull().sum()

In [None]:
telco_churn[telco_churn['Online_Security'].isnull()]

In [None]:
telco_churn['Online_Backup'].isnull().sum()

In [None]:
telco_churn[telco_churn['Online_Backup'].isnull()]

In [None]:
telco_churn['Device_Protection'].isnull().sum()

In [None]:
telco_churn[telco_churn['Device_Protection'].isnull()]

In [None]:
telco_churn['Tech_Support'].isnull().sum()

In [None]:
telco_churn[telco_churn['Tech_Support'].isnull()]

In [None]:
telco_churn['Streaming_TV'].isnull().sum()

In [None]:
telco_churn[telco_churn['Streaming_TV'].isnull()]

In [None]:
telco_churn['Streaming_Movies'].isnull().sum()

In [None]:
telco_churn[telco_churn['Streaming_Movies'].isnull()]

Based on the missing values listed above for the following columns: Multiple_Lines, Online_Security, Online_Backup, Device_Protection, Tech_Support, Streaming_TV, Streaming_Movies, Total_Charges, Churn_Reason, these entries will be dropped from the DataFrame.

In [None]:
telco_churn['Multiple_Lines'] = telco_churn['Multiple_Lines'].dropna()

In [None]:
telco_churn['Online_Security'] = telco_churn['Online_Security'].dropna()

In [None]:
telco_churn['Online_Backup'] = telco_churn['Online_Backup'].dropna()

In [None]:
telco_churn['Device_Protection'] = telco_churn['Device_Protection'].dropna()

In [None]:
telco_churn['Tech_Support'] = telco_churn['Tech_Support'].dropna()

In [None]:
telco_churn['Streaming_TV'] = telco_churn['Streaming_TV'].dropna()

In [None]:
telco_churn['Streaming_Movies'] = telco_churn['Streaming_Movies'].dropna()

In [None]:
telco_churn['Total_Charges'] = telco_churn['Total_Charges'].dropna()

In [None]:
telco_churn['Churn_Reason'] = telco_churn['Churn_Reason'].dropna()

### Export the Data

Print the cleaned dataset which has undergone the first part of the Transformation phase in the ETL pipeline.

In [59]:
print(telco_churn)

              City                Lat_Long  Gender  Senior_Citizen  Partner  \
0      Los_Angeles  33.964131, -118.272783       1               0        0   
1      Los_Angeles   34.059281, -118.30742       0               0        0   
2      Los_Angeles  34.048013, -118.293953       0               0        0   
3      Los_Angeles  34.062125, -118.315709       0               0        1   
4      Los_Angeles  34.039224, -118.266293       1               0        0   
...            ...                     ...     ...             ...      ...   
7038       Landers  34.341737, -116.539416       0               0        0   
7039      Adelanto  34.667815, -117.536183       1               0        1   
7040         Amboy  34.559882, -115.637164       0               0        1   
7041  Angelus_Oaks     34.1678, -116.86433       0               0        1   
7042  Apple_Valley  34.424926, -117.184503       1               0        0   

      Dependents  Tenure_Months  Phone_Service  Mul

The next phase of transformation should include some form of Normalization or Scaling so all values within the dataframe fall within a certain range of values. This can really help with the modeling phase by reducing any bias from really large values.

### Normalization or Re-Scaling

In [60]:
from sklearn import preprocessing

In [61]:
telco_churn.head()

Unnamed: 0,City,Lat_Long,Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,...,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Value,Churn_Score,CLTV,Churn_Reason
0,Los_Angeles,"33.964131, -118.272783",1,0,0,0,2,1,0.0,DSL,...,0.0,Month-to-month,1,Mailed check,53.85,108.15,1,86,3239,Competitor made better offer
1,Los_Angeles,"34.059281, -118.30742",0,0,0,1,2,1,0.0,Fiber optic,...,0.0,Month-to-month,1,Electronic check,70.7,151.65,1,67,2701,Moved
2,Los_Angeles,"34.048013, -118.293953",0,0,0,1,8,1,1.0,Fiber optic,...,1.0,Month-to-month,1,Electronic check,99.65,820.5,1,86,5372,Moved
3,Los_Angeles,"34.062125, -118.315709",0,0,1,1,28,1,1.0,Fiber optic,...,1.0,Month-to-month,1,Electronic check,104.8,3046.05,1,84,5003,Moved
4,Los_Angeles,"34.039224, -118.266293",1,0,0,1,49,1,1.0,Fiber optic,...,1.0,Month-to-month,1,Bank transfer (automatic),103.7,5036.3,1,89,5340,Competitor had better devices


In [62]:
# convert entries to float type
X = telco_churn.values.astype(float)

# define min max scaler
min_max_scaler = preprocessing.MinMaxScaler()

# transform data
X_scaled = min_max_scaler.fit_transform(X)

telco_churn_scaled = pd.DataFrame(X_scaled, columns=telco_churn.columns)
print(telco_churn_scaled)

ValueError: could not convert string to float: 'Los_Angeles'

Finally, convert the dataframe to a CSV file now it has been pre-processed.

In [None]:
telco_churn_cleaned = telco_churn_scaled
telco_churn_cleaned.to_csv()