# XGBoost Model

## Introduction

XGBoost (eXtreme Gradient Boosting) is an ensemble learning method built from a collection of decision trees. It aggregates the predictions of many trees, which can be either classifiers or regressors. In this example I have chosen to classify the 'Telco Customer Churn' dataset, producing a Boolean result of True or False with respect to Churn.

The model is built from Decision Trees that are trained one after another rather than independently. Each new tree is fitted to the errors (the gradients of the loss) left by the trees before it, optionally on a random subset of the rows and columns of the training data. The outputs of all the individual trees are then summed to give a final score, which is converted into the predicted class.

In terms of the sequence of events, the XGBoost model is normally fitted at the last stage, once the data has been cleaned and encoded. This is different from a 'hard voting predictor/classifier', where several separately trained prediction/classification models each cast a vote and the prediction/class with the most votes wins.
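
To make the boosting idea concrete, here is a minimal sketch of gradient boosting for a squared-error loss, using one-split trees ('stumps') and NumPy only. It is not how XGBoost itself is implemented (XGBoost adds regularization, second-order gradient information and many optimizations). The function names `fit_stump`, `predict_stump` and `boost`, and the toy data, are my own assumptions for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residual (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side of the split is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps one after another, each on the errors left by the previous ones."""
    prediction = np.full(len(y), y.mean())   # start from a constant prediction
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction            # negative gradient of the squared error
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, prediction

# Hypothetical toy data: the target switches from 0 to 1 at x = 5
x = np.arange(10, dtype=float)
y = (x >= 5).astype(float)

stumps, prediction = boost(x, y)
initial_error = np.mean((y - y.mean()) ** 2)
final_error = np.mean((y - prediction) ** 2)
print(initial_error, final_error)
```

Each round shrinks the remaining error a little, which is the sense in which the trees are 'boosted' rather than voted on.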

XGBoost relies on 'boosting'; 'bagging' and 'stacking' are related ensemble techniques that combine models in other ways. But first, I need to import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import xgboost as xgb
from sklearn.linear_model import LogisticRegression

# Split and train the data
from sklearn.model_selection import train_test_split

# I will construct a pipeline containing my chosen models
from sklearn.pipeline import Pipeline

# I need to score predicted versus actual values
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer, confusion_matrix, ConfusionMatrixDisplay

# I need to cross-validate and evaluate results
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV


## Exploratory Data Analysis

What is apparent about the data at first glance? How can I shape the data? Can I perform dimensionality reduction? Can I perform feature engineering to improve the data quality? What relationships and insights can I gain from the information and does it require cleaning, re-scaling or pre-processing of any sort?

Which are the columns to be used in the feature subset and target column? At this stage I know that the 'Churn' column will be my target vector, but the predictor feature subset may require some work so it's best to try and understand each and every one of these columns in their entirety.

### Summarize the Data

I extracted this dataset from the Kaggle website. The Datasets option in the navigation menu on the left-hand side provides a Search option. Searching for 'Telco-Customer-Churn' returns a list of options ranked according to 'Hotness', which is some measure of popularity. The data sources on this page are all in flat file formats such as XLSX or CSV, which makes it easier to read the tabular data into Pandas or a SQL DBMS.

I have decided to check a couple of variations of this IBM Telco Customer Churn dataset: the file provided by 'BlastChar' entitled 'Telco Customer Churn' and that of 'Jack Chang' entitled 'Telco customer churn (11.1.3+)'. Comparing the two, I have to decide which file is more suitable for this classification algorithm and where I can find the most comprehensive information summarizing the dataset. The reason I like using Kaggle is that it ranks its datasets according to popularity.

Understanding how churn works is key to this project. Churn is a measure of how many customers are leaving (the rate of loss, or attrition rate) relative to the entire set of customers. This particular dataset is based on a fictional telecom company, but discovering the rate of churn in general can be extremely useful if it's compared to that of other companies within the same industry. It can be used as a tool to monitor fluctuating consumer tastes and the effectiveness of competing companies. Ultimately the churn rate can be used to try and retain customers by predicting their behaviour.
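
As a small worked example of the churn rate described above, the sketch below divides the number of customers who left by the total number of customers. The column name `churned` and the eight toy rows are hypothetical, not taken from the Telco file.

```python
import pandas as pd

# Hypothetical toy data: 1 = customer left, 0 = customer stayed
customers = pd.DataFrame({"churned": [1, 0, 0, 1, 0, 0, 0, 0]})

churned = int(customers["churned"].sum())   # customers who left
total = len(customers)                      # all customers
churn_rate = churned / total

print(f"{churned} of {total} customers churned ({churn_rate:.1%})")
```

Because the column is binary, the same figure is simply the mean of the column.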

List of Columns in 'BlastChar' CSV file:
 - CustomerID
 - Gender
 - SeniorCitizen
 - Partner
 - Dependents
 - Tenure
 - PhoneService
 - MultipleLines
 - InternetService
 - OnlineSecurity
 - OnlineBackup
 - DeviceProtection
 - TechSupport
 - StreamingTV
 - StreamingMovies
 - Contract
 - PaperlessBilling
 - PaymentMethod
 - MonthlyCharges
 - TotalCharges
 - Churn
 
List of Columns in 'Jack Chang' XLSX file:
 - CustomerID
 - Count
 - Country
 - State
 - City
 - Zip Code
 - Lat Long
 - Latitude
 - Longitude
 - Gender
 - Senior Citizen
 - Partner
 - Dependents
 - Tenure Months
 - Phone Service
 - Multiple Lines
 - Internet Service
 - Online Security
 - Online Backup
 - Device Protection
 - Tech Support
 - Streaming TV
 - Streaming Movies
 - Contract
 - Paperless Billing
 - Payment Method
 - Monthly Charges
 - Total Charges
 - Churn Label
 - Churn Value
 - Churn Score
 - CLTV
 - Churn Reason
 
The 'BlastChar' list of columns is much shorter, indicating that some of the data has already been pre-processed. Which columns were removed from the 'BlastChar' dataset, and why? To find out, I read the dataset from 'Jack Chang' called 'telco_churn.csv'. This file was in XLSX format when I extracted it from Kaggle, but I loaded it into my Bronze Zone storage (my local drive) and saved it as a CSV file, ready to be pushed to the Silver Zone for the transformation process to begin.


### Import the Dataset

Having assessed both files, I have decided to use the source file provided by Jack Chang on Kaggle, saved as 'telco_churn.csv'. I read the data in using Pandas and perform some EDA before cleaning or pre-processing.

In [2]:
telco_churn = pd.read_csv("C:/Users/lynst/Documents/Datasets/Kaggle/Jack Chang/telco_churn.csv")
telco_churn.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


### Explore the Dataset

Exploring the dataset further, I can produce a summary of the count, mean, standard deviation and quartiles of each numeric column.

In [3]:
telco_churn.describe()

Unnamed: 0,Count,Zip Code,Latitude,Longitude,Tenure Months,Monthly Charges,Churn Value,Churn Score,CLTV
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,1.0,93521.964646,36.282441,-119.79888,32.371149,64.761692,0.26537,58.699418,4400.295755
std,0.0,1865.794555,2.455723,2.157889,24.559481,30.090047,0.441561,21.525131,1183.057152
min,1.0,90001.0,32.555828,-124.301372,0.0,18.25,0.0,5.0,2003.0
25%,1.0,92102.0,34.030915,-121.815412,9.0,35.5,0.0,40.0,3469.0
50%,1.0,93552.0,36.391777,-119.730885,29.0,70.35,0.0,61.0,4527.0
75%,1.0,95351.0,38.224869,-118.043237,55.0,89.85,1.0,75.0,5380.5
max,1.0,96161.0,41.962127,-114.192901,72.0,118.75,1.0,100.0,6500.0


So immediately I realize that the 'describe( )' method only summarizes numeric data by default. I need to figure out which columns to use and whether the remaining data can be converted into numeric values to improve the data quality and provide a more comprehensive prediction subset. Currently only nine columns are numeric, and several of those ('Count', 'Zip Code', 'Latitude', 'Longitude') are constants or location codes rather than measurements, which limits the number of mathematical operations I can perform.

I can achieve a more data friendly and comprehensive dataset by removing the columns I don't need and converting more of the remaining columns into numeric data types.
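
To see which columns 'describe( )' will skip, the data types can be split into numeric and non-numeric groups. This is a minimal sketch on a three-row sample whose values are copied from the first rows shown earlier; storing 'Total Charges' as text is an assumption that mirrors the string conversion performed later.

```python
import pandas as pd

# Small sample; 'Total Charges' is deliberately held as text
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Tenure Months": [2, 8, 28],
    "Monthly Charges": [53.85, 99.65, 104.80],
    "Total Charges": ["108.15", "820.5", "3046.05"],
})

# Columns that describe() summarizes by default
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
# Columns that still need converting or encoding
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_columns)
print("non-numeric:", text_columns)

# include="all" also reports count, unique, top and freq for the text columns
summary_all = sample.describe(include="all")
```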

In [4]:
telco_churn.columns

Index(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
       'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen',
       'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
       'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
       'Churn Score', 'CLTV', 'Churn Reason'],
      dtype='object')

'CustomerID' is the very first column which appears to contain unique identifiers for all Customers. This can be useful, especially if the dataset is loaded into a SQL Database. 

'Count' is the next column and contains the value '1' in every single instance. It merely counts the entry for that particular row and is of no significance to the table.
 
'Country' is the next column. These entries are all the same with the value: 'United States'. Again, this information carries no real value if all rows are identical.
 
'State' contains identical values: 'California'. Again, this doesn't add any value to our dataset.

'City' does actually contain a lot of different values, but in this case there are hundreds or thousands of different Cities in California. This data column can be kept as it will be useful for drawing a decision tree.

'Zip Code' isn't that dissimilar to the City attribute. This can go!

'Lat Long' is just the Latitude and Longitude values combined into a co-ordinate. Geographical location or GPS co-ordinates are unlikely to have any effect on a customer's likelihood of churning unless there is an issue with network capacity shortfalls and signal outages.

'Latitude' should also be removed as these co-ordinates have no relationship with the overall dependent target vector.

'Longitude' as well.

'Gender' may provide some insight but it's unlikely. I will convert these entries into binary values to see.

'Senior Citizen' is similar. I believe that seniors are more likely to keep a phone, internet service or TV contract going so long as they're not moving location all the time. The column records whether the customer is a senior citizen, perhaps someone of pensionable age or someone who is no longer part of the labour force in a full-time capacity. The value is binary.

'Partner' I don't believe will bring any additional information to the model, however it may be more likely for a couple to retain services if they combine their income so I'll keep it in. This value would also be binary in nature.

'Dependents' should be included. This data records whether a customer has any children or co-habitant dependents who rely on them, and their preferences with respect to different services may be a big factor.

'Tenure Months' must relate to the number of months the individual has already been a customer. This could have an effect on why someone decides to cancel or not. Perhaps prices have increased or the customer has been with the same telecoms company for too long. Sometimes the competition has better offers, so maybe the longer the tenure, the more likely they are to cancel.

'Phone Service' is quite simply a landline or cell phone contract. This is a simple Yes or No value so I'll convert these to binary.

'Multiple Lines' just means the customer may have had more than one phone line installed. This is a binary value.

'Internet Service' is a connection to the world wide web. This is categorical in nature, recording the type of connection such as 'DSL' or 'Fiber optic', and may be a good factor in deciding if a customer wants to retain their service.

'Online Security' indicates if a customer has antivirus and other safeguards. This is also a 'Yes' or 'No' answer.

'Device Protection' would include an insurance policy for any damage or faulty manufacture of the product. This is another binary value.

'Tech Support' means the customer may have purchased additional help for any technical issues.

'Streaming TV' is a basic binary choice. Do the customers have a TV streaming package?

'Streaming Movies' is the same.

'Contract' is the type of contract term, which the data shows as 'Month-to-month', 'One year' or 'Two year'. It usually involves a monthly fee covering the cost of the phone, any cell phone charges and possibly data usage for accessing the cell phone network when no internet is available.

'Paperless Billing' means electronic only but this is just a yes or no answer once again. It's not clear that this would necessarily have any effect on the overall churn rate for customers.

'Payment Method' includes information about the customers' payment preference such as checks, credit card or bank transfers.

'Monthly Charges' are the monthly bill. These are float datatype and should be included.

'Total Charges' include all charges since the beginning of the respective agreement or contract. These are float datatype and should also be included.

'Churn Label' is either yes or no and indicates if the customer left in the last month or not. This can be removed so I can just use the 'Churn Value' instead; otherwise the same information is duplicated and a copy of the target would leak into the features.

'Churn Value' is merely a binary representation of the churn label, using 1 for yes and 0 for no.

'Churn Score' is a numeric value representing the likelihood of churn. The higher the number, the more likely that customer is to cancel, so this should be included. According to the information provided in Kaggle, it's a value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model includes several factors known to cause churn.

'CLTV' is Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.

'Churn Reason' is the customer's explanation for leaving the service. This is directly related to the churn score, but I am dropping this column, which is a string object.
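
Several of the columns above were judged uninformative because every row holds the same value. A quick way to confirm this is to count the unique values per column. The sketch below uses a hypothetical four-row sample rather than the full file.

```python
import pandas as pd

# Hypothetical sample mirroring the constant columns described above
sample = pd.DataFrame({
    "Count": [1, 1, 1, 1],
    "Country": ["United States"] * 4,
    "State": ["California"] * 4,
    "City": ["Los Angeles", "Beverly Hills", "Lynwood", "Los Angeles"],
})

# A column with a single unique value cannot help separate churners from non-churners
unique_counts = sample.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()

print(constant_columns)
```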


### Dimensionality Reduction

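First I need to remove the columns I don't want using the drop( ) method. Be careful here: with 'inplace=True' the method modifies the DataFrame directly and returns None, so assigning the result back to the existing 'telco_churn' variable name would replace the DataFrame with None. Note that 'Zip Code', 'Lat Long' and 'Churn Reason' are not in the list below, so they remain in the DataFrame for now.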

In [5]:
telco_churn.drop(['CustomerID','Count','Country','Latitude','Longitude','State','Churn Label'], axis=1, inplace=True)

telco_churn.head()

Unnamed: 0,City,Zip Code,Lat Long,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,...,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Value,Churn Score,CLTV,Churn Reason
0,Los Angeles,90003,"33.964131, -118.272783",Male,No,No,No,2,Yes,No,...,No,Month-to-month,Yes,Mailed check,53.85,108.15,1,86,3239,Competitor made better offer
1,Los Angeles,90005,"34.059281, -118.30742",Female,No,No,Yes,2,Yes,No,...,No,Month-to-month,Yes,Electronic check,70.7,151.65,1,67,2701,Moved
2,Los Angeles,90006,"34.048013, -118.293953",Female,No,No,Yes,8,Yes,Yes,...,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,1,86,5372,Moved
3,Los Angeles,90010,"34.062125, -118.315709",Female,No,Yes,Yes,28,Yes,Yes,...,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,1,84,5003,Moved
4,Los Angeles,90015,"34.039224, -118.266293",Male,No,No,Yes,49,Yes,Yes,...,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,1,89,5340,Competitor had better devices
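
The difference between dropping columns in place and assigning the result can be seen on a small example. This is a sketch on a hypothetical two-row DataFrame; either style works, as long as the two are not mixed.

```python
import pandas as pd

sample = pd.DataFrame({
    "Count": [1, 1],
    "State": ["California", "California"],
    "Tenure Months": [2, 8],
})

# Option 1: inplace=True changes the DataFrame itself and returns None
in_place = sample.copy()
result = in_place.drop(["Count", "State"], axis=1, inplace=True)
print(result)

# Option 2: without inplace, drop() returns a new DataFrame to assign
reassigned = sample.drop(columns=["Count", "State"])
print(reassigned.columns.tolist())
```

Writing `df = df.drop(..., inplace=True)` combines the two options and leaves `df` set to None.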


In [6]:
telco_churn.shape

(7043, 26)

So there are a total of 7043 row entries or instances and 26 columns or features.

To list the names of the columns and their data type:

In [7]:
telco_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   City               7043 non-null   object 
 1   Zip Code           7043 non-null   int64  
 2   Lat Long           7043 non-null   object 
 3   Gender             7043 non-null   object 
 4   Senior Citizen     7043 non-null   object 
 5   Partner            7043 non-null   object 
 6   Dependents         7043 non-null   object 
 7   Tenure Months      7043 non-null   int64  
 8   Phone Service      7043 non-null   object 
 9   Multiple Lines     7043 non-null   object 
 10  Internet Service   7043 non-null   object 
 11  Online Security    7043 non-null   object 
 12  Online Backup      7043 non-null   object 
 13  Device Protection  7043 non-null   object 
 14  Tech Support       7043 non-null   object 
 15  Streaming TV       7043 non-null   object 
 16  Streaming Movies   7043 

To remove any white space in the column names, I use the 'str.replace( )' method:

In [8]:
telco_churn.columns = telco_churn.columns.str.replace(' ','_')
telco_churn.columns

Index(['City', 'Zip_Code', 'Lat_Long', 'Gender', 'Senior_Citizen', 'Partner',
       'Dependents', 'Tenure_Months', 'Phone_Service', 'Multiple_Lines',
       'Internet_Service', 'Online_Security', 'Online_Backup',
       'Device_Protection', 'Tech_Support', 'Streaming_TV', 'Streaming_Movies',
       'Contract', 'Paperless_Billing', 'Payment_Method', 'Monthly_Charges',
       'Total_Charges', 'Churn_Value', 'Churn_Score', 'CLTV', 'Churn_Reason'],
      dtype='object')

Although a unique identifier for each customer is useful when manipulating data in SQL, it provides no insight or potential relationships when included in a machine learning model, so it was more prudent to drop the 'CustomerID' column. (In structured relational databases, a unique identifier column is useful for establishing relationships between tables in a schema.)

### Remove White Space
White space in the values themselves can be removed with a string replace method on the column.

First, count the number of unique values in the 'City' column.

In [9]:
telco_churn['City'].nunique()

1129

Later on a Decision Tree will be drawn using GraphViz, and to draw the tree properly it is preferable not to have any whitespace in the column values for 'City', so these should be replaced with an underscore character. First, look at the first five unique values for 'City' using slicing.

In [10]:
telco_churn['City'].unique()[0:5]

array(['Los Angeles', 'Beverly Hills', 'Huntington Park', 'Lynwood',
       'Marina Del Rey'], dtype=object)

Now check the first 5 rows of the 'City' column using the 'head( )' method having replaced whitespaces with underscore characters.

In [11]:
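# Assign back instead of using inplace=True on a selected column, which newer pandas versions warn about
telco_churn['City'] = telco_churn['City'].str.replace(' ', '_')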
telco_churn['City'].head()

0    Los_Angeles
1    Los_Angeles
2    Los_Angeles
3    Los_Angeles
4    Los_Angeles
Name: City, dtype: object

### Converting Data Types

Next I would like to convert any columns with string object datatypes into numeric values, all except for the 'City' column.

Gender needs converting to '1' for Male and '0' for Female for simplicity. 

In [12]:
gender_dict = {'Male': 1, 'Female': 0}

Now map the gender dictionary to the 'Gender' column:

In [13]:
telco_churn['Gender'] = telco_churn['Gender'].map(gender_dict)
telco_churn['Gender'].head()

0    1
1    0
2    0
3    0
4    1
Name: Gender, dtype: int64

'Senior Citizen' values can be converted to '1' for yes and '0' for no.

In [14]:
senior_dict = {'Yes': 1, 'No': 0}
telco_churn['Senior_Citizen'] = telco_churn['Senior_Citizen'].map(senior_dict)

Same for 'Partner':

In [15]:
partner_dict = {'Yes': 1, 'No': 0}
telco_churn['Partner'] = telco_churn['Partner'].map(partner_dict)

And 'Dependents':

In [16]:
dependents_dict = {'Yes': 1, 'No': 0}
telco_churn['Dependents'] = telco_churn['Dependents'].map(dependents_dict)

'Tenure Months' are already numeric integers so this is fine. 'Phone Service' can be converted:

In [17]:
phone_dict = {'Yes': 1, 'No': 0}
telco_churn['Phone_Service'] = telco_churn['Phone_Service'].map(phone_dict)

And 'Multiple Lines':

In [18]:
multi_dict = {'Yes': 1, 'No': 0}
telco_churn['Multiple_Lines'] = telco_churn['Multiple_Lines'].map(multi_dict)

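'Internet Service' has categorical values, so categorical or 'one-hot' encoding will be used here. LabelEncoder first assigns an integer to each category in alphabetical order (the output below shows 'DSL' as 0 and 'Fiber optic' as 1), and OneHotEncoder then expands those integers into one binary column per category.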

In [19]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

data_internet = telco_churn['Internet_Service']
values_internet = array(data_internet)
print(values_internet)

['DSL' 'Fiber optic' 'Fiber optic' ... 'Fiber optic' 'DSL' 'Fiber optic']


In [20]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_internet = label_encoder.fit_transform(values_internet)
print(integer_encoded_internet)

[0 1 1 ... 1 0 1]


In [21]:
# encode the values from internet service to numbers
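# 'sparse_output' replaced the 'sparse' argument in scikit-learn 1.2; use sparse=False on older versions
onehot_encoder = OneHotEncoder(sparse_output=False)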
integer_encoded_internet = integer_encoded_internet.reshape(len(integer_encoded_internet), 1)
onehot_encoded_internet_service = onehot_encoder.fit_transform(integer_encoded_internet)
print(onehot_encoded_internet_service)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]




Test the first value.

In [22]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_internet_service[0, :])])
print(inverted)

['DSL']
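
The two-step route above (LabelEncoder followed by OneHotEncoder) can also be done in one step, because OneHotEncoder accepts string categories directly. This is a sketch on four hypothetical values, not the notebook's own method. It calls '.toarray( )' on the default sparse result so that it does not depend on the name of the dense-output argument, which differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Four hypothetical values, reshaped to one column as the encoder expects 2-D input
values = np.array(["DSL", "Fiber optic", "Fiber optic", "DSL"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(values).toarray()   # dense array of 0s and 1s

# The learned categories, in the same order as the encoded columns
categories = encoder.categories_[0].tolist()
print(categories)
print(encoded)

# inverse_transform recovers the original label of the first row
recovered = encoder.inverse_transform(encoded[:1])
print(recovered)
```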


Repeating for the 'Contract' column which is also a categorical feature:

In [23]:
data_contract = telco_churn['Contract']
values_contract = array(data_contract)
print(values_contract)

['Month-to-month' 'Month-to-month' 'Month-to-month' ... 'One year'
 'Month-to-month' 'Two year']


In [24]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_contract = label_encoder.fit_transform(values_contract)
print(integer_encoded_contract)

[0 0 0 ... 1 0 2]


In [25]:
# one-hot encode the integer values for contract
# sparse_output=False (sparse=False before scikit-learn 1.2) returns a dense array
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded_contract = integer_encoded_contract.reshape(len(integer_encoded_contract), 1)
onehot_encoded_contract = onehot_encoder.fit_transform(integer_encoded_contract)
print(onehot_encoded_contract)

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]




Test the first value again.

In [26]:
# invert first example from the contract encoding
inverted = label_encoder.inverse_transform([argmax(onehot_encoded_contract[0, :])])
print(inverted)

['Month-to-month']


In [27]:
security_dict = {'Yes': 1, 'No': 0}
telco_churn['Online_Security'] = telco_churn['Online_Security'].map(security_dict)

In [28]:
backup_dict = {'Yes':1, 'No':0}
telco_churn['Online_Backup'] = telco_churn['Online_Backup'].map(backup_dict)

In [29]:
device_protection_dict = {'Yes': 1, 'No': 0}
telco_churn['Device_Protection'] = telco_churn['Device_Protection'].map(device_protection_dict)

In [30]:
tech_support_dict = {'Yes': 1, 'No': 0}
telco_churn['Tech_Support'] = telco_churn['Tech_Support'].map(tech_support_dict)

In [31]:
stream_tv_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming_TV'] = telco_churn['Streaming_TV'].map(stream_tv_dict)

In [32]:
stream_movies_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming_Movies'] = telco_churn['Streaming_Movies'].map(stream_movies_dict)

In [33]:
data = telco_churn['Contract']
values = array(data)
print(values)

['Month-to-month' 'Month-to-month' 'Month-to-month' ... 'One year'
 'Month-to-month' 'Two year']


In [34]:
paperless_bill_dict = {'Yes': 1, 'No': 0}
telco_churn['Paperless_Billing'] = telco_churn['Paperless_Billing'].map(paperless_bill_dict)

'Payment_Method' also has categorical features which can be converted using one-hot encoding.

In [35]:
data = telco_churn['Payment_Method']
values = array(data)
print(values)

['Mailed check' 'Electronic check' 'Electronic check' ...
 'Credit card (automatic)' 'Electronic check' 'Bank transfer (automatic)']


In [36]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded_payment = label_encoder.fit_transform(values)
print(integer_encoded_payment)

[3 2 2 ... 1 2 0]


In [37]:
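# one-hot encode the integer values for payment method
# sparse_output=False (sparse=False before scikit-learn 1.2) returns a dense array
onehot_encoder = OneHotEncoder(sparse_output=False)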
integer_encoded_payment = integer_encoded_payment.reshape(len(integer_encoded_payment), 1)
onehot_encoded_payment = onehot_encoder.fit_transform(integer_encoded_payment)
print(onehot_encoded_payment)

[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 ...
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]
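
The one-hot arrays printed above are separate NumPy arrays and are not yet attached to the 'telco_churn' DataFrame. One possible way to encode categorical columns and keep the result in the DataFrame in a single step is 'pd.get_dummies( )'. The sketch below uses a hypothetical three-row sample and is an alternative I am suggesting, not a step from the original workflow.

```python
import pandas as pd

# Hypothetical sample with two categorical columns and one numeric column
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Payment_Method": ["Mailed check", "Electronic check", "Mailed check"],
    "Monthly_Charges": [53.85, 70.70, 99.65],
})

# Each listed column is replaced by one 0/1 column per category
encoded = pd.get_dummies(sample, columns=["Contract", "Payment_Method"], dtype=int)

print(encoded.columns.tolist())
```

The new columns are named after the original column and the category, which keeps them readable later on.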




'Monthly_Charges' are already of float64 datatype, but 'Total_Charges' aren't so they need to be converted from string object to float64.

In [38]:
telco_churn['Total_Charges'] = pd.to_numeric(telco_churn['Total_Charges'], errors='coerce')
telco_churn['Total_Charges'].head()

0     108.15
1     151.65
2     820.50
3    3046.05
4    5036.30
Name: Total_Charges, dtype: float64
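
It helps to see what 'errors="coerce"' does to entries that cannot be parsed as numbers. In this sketch the blank entry is a hypothetical example of such a value, and I am assuming the original file contained something similar.

```python
import pandas as pd

# Hypothetical raw column: two numbers stored as text and one blank entry
raw = pd.Series(["108.15", " ", "820.5"])

# Anything that cannot be parsed becomes NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
print(converted)

# After conversion the blank no longer exists as a string, only as NaN
blank_matches = int((converted == " ").sum())
missing = int(converted.isnull().sum())
print(blank_matches, missing)
```

This means any later search for blank strings in the converted column will find nothing, while 'isnull( )' will find the coerced rows.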

The 'Churn_Value' will be designated as the target vector for the purpose of my model and it currently has a simple integer datatype labeled as '1' for churn and '0' for no churn. I can already use these labels.

'Churn_Score' remains as an integer. It should be highly correlated with 'Churn_Value' so I will check for this next.

Finally, 'CLTV', which is the Customer Lifetime Value, is an indicator of the customer's importance during this period. This value should be compared to 'Churn_Value' or 'Churn_Score' for any association.
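
One simple way to check these associations is the Pearson correlation between each column and the target. The six rows below are made-up illustrative numbers, not results from the dataset, so only the method carries over.

```python
import pandas as pd

# Made-up illustrative values, not taken from the real file
sample = pd.DataFrame({
    "Churn_Value": [1, 1, 0, 0, 1, 0],
    "Churn_Score": [86, 67, 30, 45, 89, 52],
    "CLTV": [3239, 2701, 5100, 4200, 5340, 4800],
})

# Correlation of every other column with the target, dropping the target itself
correlations = sample.corr()["Churn_Value"].drop("Churn_Value")
print(correlations)
```

A value close to 1 or -1 for 'Churn_Score' on the real data would suggest it carries nearly the same information as the target.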

Rather than converting each binary column in a separate cell as above, it may be more prudent to iterate through these columns with a for loop to reduce the repetition.

In [39]:
# Alternative to the individual cells above: map every Yes/No column in one loop.
# Left commented out because these columns have already been converted;
# running it now would turn the existing 0/1 values into NaN.
# 'Gender' is excluded because its values are Male/Female rather than Yes/No.
# yes_no_columns = ['Senior_Citizen', 'Partner', 'Dependents', 'Phone_Service',
#                   'Multiple_Lines', 'Online_Security', 'Online_Backup',
#                   'Device_Protection', 'Tech_Support', 'Streaming_TV',
#                   'Streaming_Movies', 'Paperless_Billing']
# yes_no_dict = {'Yes': 1, 'No': 0}
# for column in yes_no_columns:
#     telco_churn[column] = telco_churn[column].map(yes_no_dict)

### Identify Missing Values
Looking for missing values and removing them is the next phase, although it may be better to replace them with 0, or impute average values. First I want to calculate the number of entries.

In [40]:
telco_churn.index

RangeIndex(start=0, stop=7043, step=1)

There are a total of 7043 rows in the dataset. Having removed the 'CustomerID' column I want to view all the columns in the dataframe once more.

In [41]:
telco_churn.columns

Index(['City', 'Zip_Code', 'Lat_Long', 'Gender', 'Senior_Citizen', 'Partner',
       'Dependents', 'Tenure_Months', 'Phone_Service', 'Multiple_Lines',
       'Internet_Service', 'Online_Security', 'Online_Backup',
       'Device_Protection', 'Tech_Support', 'Streaming_TV', 'Streaming_Movies',
       'Contract', 'Paperless_Billing', 'Payment_Method', 'Monthly_Charges',
       'Total_Charges', 'Churn_Value', 'Churn_Score', 'CLTV', 'Churn_Reason'],
      dtype='object')

In [42]:
telco_missing = pd.isnull(telco_churn).sum()
print(telco_missing)

City                    0
Zip_Code                0
Lat_Long                0
Gender                  0
Senior_Citizen          0
Partner                 0
Dependents              0
Tenure_Months           0
Phone_Service           0
Multiple_Lines        682
Internet_Service        0
Online_Security      1526
Online_Backup        1526
Device_Protection    1526
Tech_Support         1526
Streaming_TV         1526
Streaming_Movies     1526
Contract                0
Paperless_Billing       0
Payment_Method          0
Monthly_Charges         0
Total_Charges          11
Churn_Value             0
Churn_Score             0
CLTV                    0
Churn_Reason         5174
dtype: int64
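
The large counts for 'Multiple_Lines' and the six internet add-on columns are most likely a side effect of the Yes/No mapping: 'map( )' returns NaN for any value that is not a key in the dictionary. The sketch below assumes a third category called 'No internet service'; I have not confirmed that label against the file.

```python
import pandas as pd

# Hypothetical column with a third category besides Yes and No
online_security = pd.Series(["Yes", "No", "No internet service", "Yes"])

# A two-key dictionary leaves the third category unmapped, which becomes NaN
two_way = online_security.map({"Yes": 1, "No": 0})
print(two_way.isnull().sum())

# Adding the third category to the dictionary avoids creating missing values
three_way = online_security.map({"Yes": 1, "No": 0, "No internet service": 0})
print(three_way.tolist())
```

Treating 'No internet service' as 0 is one possible choice. Keeping it as its own category through one-hot encoding is another.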


Within the dataframe I want to count the rows where the 'Total_Charges' column contains a single blank space instead of a number. The comparison returns 'True' for each such row, and 'len( )' counts them.

In [43]:
len(telco_churn.loc[telco_churn['Total_Charges'] == ' '])

0

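So none of the entries contain a blank space any more, because 'pd.to_numeric( )' with 'errors="coerce"' has already converted any non-numeric entries to NaN. Printing these rows confirms that the selection is empty.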

In [44]:
# print these rows
telco_churn.loc[telco_churn['Total_Charges'] == ' ']

Unnamed: 0,City,Zip_Code,Lat_Long,Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,...,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Value,Churn_Score,CLTV,Churn_Reason


Perhaps they contain another value such as NaN, or 0 already. 

Using the isnull( ) method I may be able to find the sum of these entries and print them out.

In [None]:
telco_churn['Total_Charges'].isnull().sum()

In [None]:
telco_churn[telco_churn['Total_Charges'].isnull()]

I can see all eleven Null values and decide whether to remove these rows from the dataframe completely, or impute some kind of average value or a 0. In this case they've been assigned values of 'NaN', or 'Not a Number'.

I've decided to set these missing values to 0 for now. I can always remove the values later and try running the model again to see if there is any difference to the scoring metric.

In [None]:
# NaN never equals the string 'NaN', so select the missing rows with isnull() instead
telco_churn.loc[telco_churn['Total_Charges'].isnull(), 'Total_Charges'] = 0
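
Missing values cannot be located with an equality test, because NaN compares unequal to everything, including the string 'NaN' and NaN itself. This small sketch on a hypothetical three-value Series shows the behaviour, with 'fillna( )' as a shorter way to set the missing entries to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
total_charges = pd.Series([108.15, np.nan, 820.5])

# Neither comparison finds the missing entry
string_matches = int((total_charges == "NaN").sum())
nan_matches = int((total_charges == np.nan).sum())
print(string_matches, nan_matches)

# isnull() finds it, and fillna() replaces it
missing = int(total_charges.isnull().sum())
filled = total_charges.fillna(0)
print(missing, filled.tolist())
```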

Let me check that this has worked and that the NaN values in the 'Total_Charges' column have been converted to 0.

In [None]:
telco_churn[telco_churn['Total_Charges'].isnull()]

In [None]:
telco_churn['Total_Charges'].unique()

In [None]:
telco_churn.columns

In [None]:
for column in telco_churn.columns:
    print(column)
    print(telco_churn[column].value_counts())
    print("")