# XGBoost Model

## Introduction

XGBoost (eXtreme Gradient Boosting) is an ensemble learning method built on a collection of decision trees. It aggregates the predictions of many individual trees, which can be either classifiers or regressors. In this example I have chosen to classify the records in the 'Telco Customer Churn' dataset, producing a Boolean result of True or False with respect to Churn.

The trees are trained one after another rather than independently. Each new tree is fitted to the errors left by the trees before it, optionally on a random subset of the training data, and the outputs of all the trees are then added together to give the overall prediction. For a binary target such as Churn, that sum is converted into a probability and the class is decided by a threshold.

In terms of the sequence of events, the XGBoost classifier is fitted at the last stage of this notebook, once the data has been cleaned and encoded. This differs from a 'hard voting predictor/classifier', in which several separate models each predict a class and the class with the most votes wins.
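
To make the idea of boosting concrete, here is a minimal sketch of the principle: trees are added one at a time, and each new tree is fitted to the errors left by the trees before it. It is a simplified squared-error version that uses scikit-learn's `DecisionTreeRegressor` on made-up data (the names `X_toy`, `y_toy` and the settings are my own assumptions). It is not XGBoost's actual implementation, which adds regularisation and other refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a noisy sine curve (not the Telco dataset).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.3   # how much of each new tree's output is added
n_trees = 50

# Start from a constant prediction (the mean of the target).
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_trees):
    residuals = y_toy - prediction                 # errors left by the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)                     # the new tree learns those errors
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

The training error shrinks as trees are added, which is the behaviour that separates boosting from a vote between independently trained models.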

XGBoost belongs to the 'boosting' family of ensemble techniques, alongside 'bagging' and 'stacking', but first I need to import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import xgboost
from sklearn.linear_model import LogisticRegression

# Split and train the data
from sklearn.model_selection import train_test_split

# I will construct a pipeline containing my chosen models
from sklearn.pipeline import Pipeline

# I need to score predicted versus actual values
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, precision_recall_curve

# I need to cross-validate and evaluate results
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

## Exploratory Data Analysis

What is apparent about the data at first glance? How can I shape the data? Can I perform dimensionality reduction? Can I perform feature engineering to improve the data quality? What relationships and insights can I gain from the information and does it require cleaning, re-scaling or pre-processing of any sort?

Which are the columns to be used in the feature subset and target column? At this stage I know that the 'Churn' column will be my target vector, but the predictor feature subset may require some work so it's best to try and understand each and every one of these columns in their entirety.

### Summarize the Data

I extracted this dataset from the Kaggle web site. The Dataset option in the navigation menu on the left-hand side provides a Search option. Searching for 'Telco-Customer-Churn' returns a list of results ranked by 'Hotness', which is a measure of popularity. The data sources on this page are all in flat file formats such as XLSX or CSV, which makes it easier to read the tabular data into Pandas or a SQL DBMS.

I have decided to check a couple of variations of this IBM Telco Customer Churn dataset: the file provided by 'BlastChar' entitled 'Telco Customer Churn' and that of 'Jack Chang' entitled 'Telco customer churn (11.1.3+)'. Looking at the differences, I have to decide which file is more suitable for this classification algorithm and where I can find the most comprehensive information summarizing the dataset. One reason I like using Kaggle is that its datasets are ranked by popularity.

Understanding how churn works is key to this project. Churn measures the rate at which customers leave (the attrition or dropout rate) relative to the entire set of customers. This particular dataset is based on a fictional telecom company, but discovering the rate of churn in general can be extremely useful if it's compared to that of other companies within the same industry. It can be used as a tool to monitor fluctuating consumer tastes and the effectiveness of competing companies. Ultimately the churn rate can be used to try and retain customers by predicting their behaviour.
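
As a worked example of the definition, the churn rate over a period can be sketched as the number of customers lost divided by the number of customers at the start of that period. The function name and the figures below are hypothetical and only illustrate the arithmetic.

```python
def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Return the share of customers lost over a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical figures for illustration only.
rate = churn_rate(customers_lost=50, customers_at_start=1000)
print(f"Churn rate: {rate:.1%}")  # 50 / 1000 = 5.0%
```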

List of Columns in 'BlastChar' CSV file:
 - CustomerID
 - Gender
 - SeniorCitizen
 - Partner
 - Dependents
 - Tenure
 - PhoneService
 - MultipleLines
 - InternetService
 - OnlineSecurity
 - OnlineBackup
 - DeviceProtection
 - TechSupport
 - StreamingTV
 - StreamingMovies
 - Contract
 - PaperlessBilling
 - PaymentMethod
 - MonthlyCharges
 - TotalCharges
 - Churn
 
List of Columns in 'Jack Chang' XLSX file:
 - CustomerID
 - Count
 - Country
 - State
 - City
 - Zip Code
 - Lat Long
 - Latitude
 - Longitude
 - Gender
 - Senior Citizen
 - Partner
 - Dependents
 - Tenure Months
 - Phone Service
 - Multiple Lines
 - Internet Service
 - Online Security
 - Online Backup
 - Device Protection
 - Tech Support
 - Streaming TV
 - Streaming Movies
 - Contract
 - Paperless Billing
 - Payment Method
 - Monthly Charges
 - Total Charges
 - Churn Label
 - Churn Value
 - Churn Score
 - CLTV
 - Churn Reason
 
The 'BlastChar' list of columns is much shorter, indicating that some of the data has already been pre-processed. Going through the 'Jack Chang' columns one by one should show which of them were removed from the 'BlastChar' dataset and why. The 'Jack Chang' file was in XLSX format when I extracted it from Kaggle, so I loaded it into my Bronze Zone storage (my local drive) and saved it as a CSV file called 'telco_churn.csv', ready to be pushed to the Silver Zone for the transformation process to begin.
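
The comparison of the two column lists can be sketched with plain Python sets. The `normalise` helper is my own assumption: it strips spaces and lowercases the names so that 'SeniorCitizen' and 'Senior Citizen' are treated as the same column.

```python
blastchar_cols = [
    "CustomerID", "Gender", "SeniorCitizen", "Partner", "Dependents", "Tenure",
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "Churn",
]
jack_chang_cols = [
    "CustomerID", "Count", "Country", "State", "City", "Zip Code", "Lat Long",
    "Latitude", "Longitude", "Gender", "Senior Citizen", "Partner", "Dependents",
    "Tenure Months", "Phone Service", "Multiple Lines", "Internet Service",
    "Online Security", "Online Backup", "Device Protection", "Tech Support",
    "Streaming TV", "Streaming Movies", "Contract", "Paperless Billing",
    "Payment Method", "Monthly Charges", "Total Charges", "Churn Label",
    "Churn Value", "Churn Score", "CLTV", "Churn Reason",
]


def normalise(name: str) -> str:
    """Hypothetical helper: ignore spacing and case when comparing names."""
    return name.replace(" ", "").lower()


blast = {normalise(c) for c in blastchar_cols}
jack = {normalise(c) for c in jack_chang_cols}

only_in_jack = sorted(jack - blast)
only_in_blast = sorted(blast - jack)
print("Only in 'Jack Chang':", only_in_jack)
print("Only in 'BlastChar':", only_in_blast)
```

'tenure' and 'churn' show up as differences only because they are named 'Tenure Months' and 'Churn Label' in the other file. The remaining extra columns in the 'Jack Chang' file are the location fields, 'Count' and the additional churn fields.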

## Import the Dataset

Having assessed both files, I have decided to use the source file provided by Jack Chang on Kaggle, saved as 'telco_churn.csv'. I read the data in using Pandas and perform some EDA before any cleaning or pre-processing.

In [2]:
telco_churn = pd.read_csv("C:/Users/lynst/Documents/Datasets/Kaggle/Jack Chang/telco_churn.csv")
telco_churn.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [3]:
telco_churn.columns

Index(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
       'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen',
       'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
       'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
       'Churn Score', 'CLTV', 'Churn Reason'],
      dtype='object')

'CustomerID' is the very first column which appears to contain unique identifiers for all Customers. This can be useful, especially if the dataset is loaded into a SQL Database. 

'Count' is the next column and contains the value '1' in every single instance. It simply counts the row and carries no information for the model.
 
'Country' is the next column. These entries are all the same with the value: 'United States'. Again, this information carries no real value if all rows are identical.
 
'State' contains identical values: 'California'. Again, this doesn't add any value to our dataset.
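
Columns that hold the same value in every row can also be found programmatically rather than by eye. This is a small sketch using `nunique()` on a three-row sample copied from the `head()` output above, so it is not the full dataset; the names `sample` and `constant_cols` are my own.

```python
import pandas as pd

# Three-row sample taken from the head() output; not the full dataset.
sample = pd.DataFrame({
    "CustomerID": ["3668-QPYBK", "9237-HQITU", "9305-CDSKC"],
    "Count": [1, 1, 1],
    "Country": ["United States"] * 3,
    "State": ["California"] * 3,
    "Gender": ["Male", "Female", "Female"],
})

# A column with exactly one distinct value cannot help separate churners from non-churners.
unique_counts = sample.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)
```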

'City' does contain a lot of different values, but with hundreds or thousands of different cities in California each one covers only a small number of customers. This data column can be removed also.

'Zip Code' isn't that dissimilar to the City attribute. This can go!

'Lat Long' is just the Latitude and Longitude values combined into a co-ordinate. Geographical location or GPS co-ordinates are unlikely to have any effect on a customer's churn rate unless there is an issue with network capacity shortfalls and signal outages.

'Latitude' should also be removed, as these co-ordinates are unlikely to have any relationship with the dependent target vector.

'Longitude' as well.

'Gender' may provide some insight but it's unlikely. I will convert these entries into binary values to see.

'Senior Citizen' is similar. I believe that seniors are more likely to keep a phone, internet service or TV contract going so long as they're not moving location all the time. The column indicates whether the customer is a senior citizen, perhaps someone of pensionable age or someone who is no longer part of the labour force in a full-time capacity. The value is binary.

I don't believe 'Partner' will bring much additional information to the model; however, a couple may be more likely to retain services if they combine their income, so I'll keep it in. This value is also binary in nature.

'Dependents' should be included. This column indicates whether a customer has any children or co-habitant dependents who rely on them, and their preferences with respect to different services may be a big factor.

'Tenure Months' is the number of months the individual has already been a customer. This could affect whether someone decides to cancel. Perhaps prices have increased or the customer has been with the same telecoms company for too long. Sometimes the competition has better offers, so maybe the longer the tenure, the more likely they are to cancel.

'Phone Service' is quite simply a landline or cell phone contract. This is a simple Yes or No value so I'll convert these to binary.

'Multiple Lines' just means the customer may have had more than one phone line installed. This is a binary value.

'Internet Service' is a connection to the world wide web. Unlike the Yes/No columns, this one is categorical and records the type of connection (for example 'DSL' or 'Fiber optic'). It may be a good factor in deciding if a customer wants to retain their service.

'Online Security' indicates if a customer has antivirus and other safeguards. This is also a 'Yes' or 'No' answer.

'Device Protection' would include an insurance policy for any damage or faulty manufacture of the product. This is another binary value.

'Tech Support' means the customer may have purchased additional help for any technical issues.

'Streaming TV' is a basic binary choice. Do the customers have a TV streaming package?

'Streaming Movies' is the same.

'Contract' is the term of the agreement ('Month-to-month', 'One year' or 'Two year'). It usually involves a monthly fee covering the cost of the phone, any cell phone charges and possibly data usage for accessing the cell phone network when no internet is available.

'Paperless Billing' means electronic only but this is just a yes or no answer once again. It's not clear that this would necessarily have any effect on the overall churn rate for customers.

'Payment Method' includes information about the customers' payment preference such as checks, credit card or bank transfers.

'Monthly Charges' are the monthly bill. These are float datatype and should be included.

'Total Charges' include all charges since the beginning of the respective agreement or contract. These are float datatype and should also be included.

'Churn Label' is either yes or no and indicates if the customer left in the last month or not.

'Churn Value' is merely a binary representation of the churn label, using 1 for yes and 0 for no.

'Churn Score' is a numeric value representing the predicted likelihood of churn. The higher the number, the more likely that customer is to cancel, so this should be included. According to the information provided in Kaggle, it's a value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model includes several reasons known to cause churn.

'CLTV' is Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.

'Churn Reason' is the customer's explanation for leaving the service. This is directly related to the churn score.

### Dimensionality Reduction

First I need to remove the columns I don't want.

In [4]:
telco_churn = telco_churn[['CustomerID','Gender','Senior Citizen',
       'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
       'Monthly Charges', 'Total Charges', 'Churn Value',
       'Churn Score', 'CLTV']]

telco_churn.head()

Unnamed: 0,CustomerID,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,...,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Value,Churn Score,CLTV
0,3668-QPYBK,Male,No,No,No,2,Yes,No,DSL,Yes,...,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1,86,3239
1,9237-HQITU,Female,No,No,Yes,2,Yes,No,Fiber optic,No,...,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,1,67,2701
2,9305-CDSKC,Female,No,No,Yes,8,Yes,Yes,Fiber optic,No,...,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,1,86,5372
3,7892-POOKP,Female,No,Yes,Yes,28,Yes,Yes,Fiber optic,No,...,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,1,84,5003
4,0280-XJGEX,Male,No,No,Yes,49,Yes,Yes,Fiber optic,No,...,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,1,89,5340


In [5]:
telco_churn.shape

(7043, 23)

So there are a total of 7043 row entries or instances and 23 columns or features.

To list the names of the columns and their data type:

In [6]:
telco_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Gender             7043 non-null   object 
 2   Senior Citizen     7043 non-null   object 
 3   Partner            7043 non-null   object 
 4   Dependents         7043 non-null   object 
 5   Tenure Months      7043 non-null   int64  
 6   Phone Service      7043 non-null   object 
 7   Multiple Lines     7043 non-null   object 
 8   Internet Service   7043 non-null   object 
 9   Online Security    7043 non-null   object 
 10  Online Backup      7043 non-null   object 
 11  Device Protection  7043 non-null   object 
 12  Tech Support       7043 non-null   object 
 13  Streaming TV       7043 non-null   object 
 14  Streaming Movies   7043 non-null   object 
 15  Contract           7043 non-null   object 
 16  Paperless Billing  7043 

### Converting Datatypes

Next I would like to convert any columns with string object datatypes into numeric values. Starting from left to right I can see that although the 'CustomerID' values are unique, they appear to be alpha-numeric in nature and it might be easier to assign index values to each to simplify the dataset.

I can count the number of unique values in this column as follows.

In [7]:
telco_churn['CustomerID'].nunique()

7043

This is a good start: all 7043 rows are returned, which means every CustomerID is unique and there are no duplicates. Any number smaller than 7043 would indicate missing values or duplicate values.
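
Had the count come back lower than the number of rows, the two possible causes could be told apart with a couple of extra checks. This is a sketch on a made-up series of IDs (the values are hypothetical, chosen to contain one missing entry and one repeat).

```python
import pandas as pd

# Hypothetical IDs: one missing value and one repeated value.
ids = pd.Series(["A-1", "B-2", "B-2", None, "C-3"], name="CustomerID")

n_unique = ids.nunique()                        # distinct values, ignoring missing ones
n_missing = ids.isna().sum()                    # rows with no ID at all
n_duplicates = ids.dropna().duplicated().sum()  # repeats of an earlier ID

print(len(ids), n_unique, n_missing, n_duplicates)
```

Here five rows give only three unique values: one row is missing and one is a duplicate.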

Although a unique identifier for each customer is useful when manipulating data in SQL, it will not provide any insight or potential relationships if included in a machine learning model, so it is more prudent to drop this column. (In structured relational databases a unique identifier column is useful for establishing relationships between tables in a schema.)

In [8]:
telco_churn = telco_churn.drop(columns=['CustomerID'])

Gender needs converting to '1' for Male and '0' for Female for simplicity. 

In [9]:
gender_dict = {'Male': 1, 'Female': 0}

Now map the gender dictionary to the 'Gender' column:

In [10]:
telco_churn['Gender'] = telco_churn['Gender'].map(gender_dict)
telco_churn['Gender'].head()

0    1
1    0
2    0
3    0
4    1
Name: Gender, dtype: int64

'Senior Citizen' values can be converted to '1' for yes and '0' for no.

In [11]:
senior_dict = {'Yes': 1, 'No': 0}
telco_churn['Senior Citizen'] = telco_churn['Senior Citizen'].map(senior_dict)

Same for 'Partner':

In [12]:
partner_dict = {'Yes': 1, 'No': 0}
telco_churn['Partner'] = telco_churn['Partner'].map(partner_dict)

And 'Dependents':

In [13]:
dependents_dict = {'Yes': 1, 'No': 0}
telco_churn['Dependents'] = telco_churn['Dependents'].map(dependents_dict)

'Tenure Months' are already numeric integers so this is fine. 'Phone Service' can be converted:

In [14]:
phone_dict = {'Yes': 1, 'No': 0}
telco_churn['Phone Service'] = telco_churn['Phone Service'].map(phone_dict)

And 'Multiple Lines':

In [15]:
multi_dict = {'Yes': 1, 'No': 0}
telco_churn['Multiple Lines'] = telco_churn['Multiple Lines'].map(multi_dict)
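
Since the same Yes/No dictionary is repeated for every column, the conversion could also be sketched as a loop with a safety check. `map()` turns any value that is not in the dictionary into NaN, so it is worth confirming that a column really contains only 'Yes' and 'No' first. The mini-frame below is hypothetical, and the third label 'No phone service' is an assumption I have added purely to show the check firing.

```python
import pandas as pd

# Hypothetical mini-frame; 'No phone service' is an assumed extra label.
demo = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Phone Service": ["Yes", "Yes", "No"],
    "Multiple Lines": ["Yes", "No", "No phone service"],
})

yes_no = {"Yes": 1, "No": 0}

for col in ["Partner", "Phone Service", "Multiple Lines"]:
    # Any value outside the dictionary would silently become NaN under map().
    unexpected = set(demo[col].unique()) - set(yes_no)
    if unexpected:
        print(f"{col}: unmapped values {sorted(unexpected)}, left unchanged")
        continue
    demo[col] = demo[col].map(yes_no)

print(demo)
```

Columns that fail the check are left as text so they can be handled separately, for example with one-hot encoding.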

'Internet Service' has categorical values, so 'one-hot' encoding will be used here to create a separate binary column for each option. The values are first integer-encoded (the LabelEncoder output below assigns 'DSL' as 0 and 'Fiber optic' as 1) and then one-hot encoded.

In [16]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

data = telco_churn['Internet Service']
values = array(data)
print(values)

['DSL' 'Fiber optic' 'Fiber optic' ... 'Fiber optic' 'DSL' 'Fiber optic']


In [17]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

[0 1 1 ... 1 0 1]


In [18]:
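# binary encode (scikit-learn 1.2+ renamed `sparse` to `sparse_output`;
# on older versions use OneHotEncoder(sparse=False) instead)
onehot_encoder = OneHotEncoder(sparse_output=False)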
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
internet_service_onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(internet_service_onehot_encoded)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]




In [19]:
# invert first example
inverted = label_encoder.inverse_transform([argmax(internet_service_onehot_encoded[0, :])])
print(inverted)

['DSL']
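
The one-hot array above lives outside the DataFrame, so the 'Internet Service' column itself is still text. One way to attach the encoded columns directly is `pd.get_dummies`, sketched here on a three-row sample taken from the `head()` output. This is an alternative I am suggesting, not the approach used in the cells above.

```python
import pandas as pd

# Three-row sample taken from the head() output; not the full dataset.
demo = pd.DataFrame({
    "Internet Service": ["DSL", "Fiber optic", "Fiber optic"],
    "Monthly Charges": [53.85, 70.70, 99.65],
})

# Replace the text column with one 0/1 column per category.
encoded = pd.get_dummies(demo, columns=["Internet Service"], dtype=int)
print(encoded)
```

The original text column is dropped and the new columns are named after the category they represent, which keeps the result readable.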


In [21]:
security_dict = {'Yes': 1, 'No': 0}
telco_churn['Online Security'] = telco_churn['Online Security'].map(security_dict)

In [22]:
backup_dict = {'Yes':1, 'No':0}
telco_churn['Online Backup'] = telco_churn['Online Backup'].map(backup_dict)

In [23]:
device_protection_dict = {'Yes': 1, 'No': 0}
telco_churn['Device Protection'] = telco_churn['Device Protection'].map(device_protection_dict)

In [24]:
tech_support_dict = {'Yes': 1, 'No': 0}
telco_churn['Tech Support'] = telco_churn['Tech Support'].map(tech_support_dict)

In [25]:
stream_tv_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming TV'] = telco_churn['Streaming TV'].map(stream_tv_dict)

In [26]:
stream_movies_dict = {'Yes': 1, 'No': 0}
telco_churn['Streaming Movies'] = telco_churn['Streaming Movies'].map(stream_movies_dict)

'Contract' also has categorical values which can be converted.

In [27]:
data = telco_churn['Contract']
values = array(data)
print(values)

['Month-to-month' 'Month-to-month' 'Month-to-month' ... 'One year'
 'Month-to-month' 'Two year']


In [28]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

[0 0 0 ... 1 0 2]


In [29]:
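# binary encode (scikit-learn 1.2+ renamed `sparse` to `sparse_output`;
# on older versions use OneHotEncoder(sparse=False) instead)
onehot_encoder = OneHotEncoder(sparse_output=False)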
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
contract_onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(contract_onehot_encoded)

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
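
As a side note, `OneHotEncoder` can be fitted on the text column directly, which makes the separate `LabelEncoder` step optional. The sketch below uses a four-row hypothetical sample containing the three contract types printed above, and calls `.toarray()` on the default sparse output so that it does not depend on the `sparse`/`sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical four-row sample using the three contract types shown above.
contracts = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"]}
)

encoder = OneHotEncoder()                             # default output is a sparse matrix
onehot = encoder.fit_transform(contracts).toarray()   # convert to a dense NumPy array

print(encoder.categories_[0])   # the category behind each output column, in order
print(onehot)
```

`categories_` records which category each output column stands for, so the encoding can be traced back without using `argmax`.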




### Explore the Dataset

Exploring the dataset further, I can produce a summary of the mean values and their standard deviation.

In [None]:
telco_churn.describe()

So immediately I realize that the 'describe()' method only summarizes numeric columns by default. I need to figure out which columns to use and whether or not the data can be converted into numeric values to improve the data quality and provide a more comprehensive prediction subset. Currently there are only 3 columns with numeric data which means I am limited in terms of the number of mathematical operations I can perform.
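
To see which columns are being left out, the text columns can be listed with `select_dtypes`, and `describe(include="all")` adds counts and most frequent values for them. This is a sketch on a small hypothetical frame rather than the real data.

```python
import pandas as pd

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year"],
    "Tenure Months": [2, 8, 28],
})

numeric_summary = demo.describe()                # numeric columns only
full_summary = demo.describe(include="all")      # adds count/unique/top/freq for text
text_columns = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
print("Still to be converted:", text_columns)
```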

I want to ascertain if there's any correlation between any of the attributes from the very beginning.

In [None]:
# numeric_only=True skips any remaining text columns (needed on pandas 2.0+)
telco_churn.corr(numeric_only=True)
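
The full correlation matrix can be hard to read, so a sketch that keeps only the correlations with the target column may be more useful. The data below is entirely hypothetical and chosen so that longer tenure goes with not churning. It says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    "Tenure Months": [2, 8, 28, 49, 60, 72],
    "Monthly Charges": [53.85, 70.70, 99.65, 104.80, 20.05, 25.10],
    "Contract": ["Month-to-month"] * 3 + ["Two year"] * 3,
    "Churn Value": [1, 1, 1, 0, 0, 0],
})

# Keep numeric columns only, then read off the target's column of the matrix.
numeric = demo.select_dtypes(include="number")
target_corr = numeric.corr()["Churn Value"].drop("Churn Value").sort_values()
print(target_corr)
```

Sorting the values puts the strongest negative relationships at the top and the strongest positive ones at the bottom.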

## Visualizing the Data

A visual summary of the dispersion of data about its mean using matplotlib.

In [None]:
telco_churn.hist(bins=20, figsize=(20,10))

This provides a histogram splitting the data into separate buckets, or intervals, then counting the frequency distribution within them. It also provides a nice display of the dispersion of data, but now I want to see how it's distributed around the average values and the degree to which it's distributed around the average (fat, thin, or bell-shaped).
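
The shape of a distribution can also be put into numbers. A sketch using pandas' `skew()` and `kurt()` on a made-up series follows. Skewness above zero means a longer right tail, and pandas reports excess kurtosis, so values above zero suggest fatter tails than a bell-shaped (normal) curve.

```python
import pandas as pd

# Hypothetical values with one large outlier on the right.
charges = pd.Series([20, 22, 25, 27, 30, 32, 35, 110], name="Monthly Charges")

print("mean:    ", charges.mean())
print("median:  ", charges.median())
print("skewness:", charges.skew())   # > 0: long right tail
print("kurtosis:", charges.kurt())   # excess kurtosis; > 0: fatter tails than normal
```

A mean well above the median, as here, is another quick sign that the data is pulled to the right.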

In [None]:
# Separate the predictor features from the target column
X = telco_churn.drop(columns=['Churn Value'])
y = telco_churn['Churn Value']

In [None]:
from xgboost import XGBClassifier
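
Before fitting the classifier, the features and target need splitting into training and test sets. Below is a minimal sketch of a stratified split with `train_test_split`, which keeps the share of churners the same in both parts. The 20-row frame is hypothetical, and the `test_size` and `random_state` values are my own choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 customers, 5 of whom churned (25%).
demo = pd.DataFrame({
    "Tenure Months": list(range(1, 21)),
    "Churn Value": [1] * 5 + [0] * 15,
})

X = demo.drop(columns=["Churn Value"])   # predictor features
y = demo["Churn Value"]                  # target vector

# stratify=y preserves the churn proportion in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

The resulting `X_train` and `y_train` could then be passed to the `fit` method of an `XGBClassifier`, provided every feature column has been converted to numeric values first.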