# Exercise 1 - Customer Churn Prediction with XGBoost - Data Preparation

### Your task is to make this notebook run successfully and fill in all the cells marked with `INSERT YOUR CODE HERE`


---

## Background

Losing customers is costly for any business.  Identifying unhappy customers early on gives you a chance to offer them incentives to stay.  This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

We use a familiar example of churn: leaving a mobile phone operator.  Seems like one can always find fault with their provider du jour! If the provider knows that a customer is thinking of leaving, it can offer timely incentives, such as a phone upgrade or a newly activated feature, and the customer may stick around. Incentives are often much more cost-effective than losing and reacquiring a customer.



Next, we'll import the Python libraries we'll need for the remainder of the example.

In [None]:
import pandas as pd
import numpy as np
import boto3
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
%matplotlib inline

---
## Data

Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes. After all, predicting the future is tricky business! But we'll learn how to deal with prediction errors.

The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.  Let's download and read that dataset in now:

In [None]:
s3 = boto3.client("s3")
s3.download_file("sagemaker-sample-files", "datasets/tabular/synthetic/churn.txt", "churn.txt")

In [None]:
churn = pd.read_csv("./churn.txt")
pd.set_option("display.max_columns", 500)
churn

By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- `Account Length`: the number of days that this account has been active
- `Area Code`: the three-digit area code of the corresponding customer’s phone number
- `Phone`: the remaining seven-digit phone number
- `Int’l Plan`: whether the customer has an international calling plan: yes/no
- `VMail Plan`: whether the customer has a voice mail feature: yes/no
- `VMail Message`: the average number of voice mail messages per month
- `Day Mins`: the total number of calling minutes used during the day
- `Day Calls`: the total number of calls placed during the day
- `Day Charge`: the billed cost of daytime calls
- `Eve Mins`, `Eve Calls`, `Eve Charge`: the minutes used, calls placed, and billed cost for calls during the evening
- `Night Mins`, `Night Calls`, `Night Charge`: the minutes used, calls placed, and billed cost for calls during nighttime
- `Intl Mins`, `Intl Calls`, `Intl Charge`: the minutes used, calls placed, and billed cost for international calls
- `CustServ Calls`: the number of calls placed to Customer Service
- `Churn?`: whether the customer left the service: true/false

The last attribute, `Churn?`, is known as the target attribute: the attribute that we want the ML model to predict.  Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.
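
Before modeling a binary target, it often helps to check how the two classes are balanced and how each categorical feature is distributed. The following is a minimal sketch on a made-up toy frame, not the real dataset; the column names mirror the list above, while the values and the `toy` and `class_share` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the dataset, values are invented.
toy = pd.DataFrame(
    {
        "Int'l Plan": ["no", "no", "yes", "no"],
        "Day Mins": [120.5, 98.0, 210.3, 150.0],
        "Churn?": ["False.", "False.", "True.", "False."],
    }
)

# Share of each class in the target column.
class_share = toy["Churn?"].value_counts(normalize=True)
print(class_share)

# Frequency table for every non-numeric column.
for column in toy.select_dtypes(exclude=["number"]).columns:
    print(toy[column].value_counts())
```

Selecting with `exclude=["number"]` is one way to pick up the categorical columns; selecting on the `object` dtype, as done later in this notebook, is another.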

Let's begin exploring the data:

In [None]:
# Frequency tables for each categorical feature
# INSERT YOUR CODE HERE


In [None]:
# describe statistics for the dataframe
#INSERT YOUR CODE HERE

In [None]:
#plot histograms for each column
#INSERT YOUR CODE HERE

We can see immediately that:
- `State` appears to be quite evenly distributed.
- `Phone` takes on too many unique values to be of any practical use.  It's possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.
- Most of the numeric features are surprisingly nicely distributed, with many showing a bell-like, roughly Gaussian shape.  `VMail Message` is a notable exception, and `Area Code` shows up as a numeric feature even though it is a label that we should convert to a non-numeric type.
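
The two clean-up steps these observations suggest, removing an identifier-like column and treating a numeric-looking code as a label, can be sketched on a toy frame. The data below is invented and the `toy` and `cleaned` names are hypothetical; this illustrates the pandas calls rather than the exercise solution itself.

```python
import pandas as pd

# Hypothetical toy data with invented values.
toy = pd.DataFrame(
    {
        "Area Code": [415, 408, 510],
        "Phone": ["555-0101", "555-0102", "555-0103"],
        "Day Mins": [265.1, 161.6, 243.4],
    }
)

# Drop an identifier-like column; drop() returns a new frame by default.
cleaned = toy.drop(columns=["Phone"])

# Treat a numeric-looking code as a category label rather than a quantity.
cleaned["Area Code"] = cleaned["Area Code"].astype(object)

print(cleaned.dtypes)
```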

In [None]:
# Drop the "Phone" column 
churn = #INSERT YOUR CODE HERE

# Convert "Area Code" to "object" data type
churn["Area Code"] = #INSERT YOUR CODE HERE

# Display data types of each column in data frame
#INSERT YOUR CODE HERE

Next let's look at the relationship between each of the features and our target variable.
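
For categorical features, a normalized cross-tabulation is one way to compare how each target class is spread across a feature's values. The sketch below uses a made-up toy frame (the values and the `toy` and `table` names are assumptions) to show what `normalize="columns"` does.

```python
import pandas as pd

# Hypothetical toy data with invented values.
toy = pd.DataFrame(
    {
        "Int'l Plan": ["no", "no", "yes", "yes", "no", "yes"],
        "Churn?": ["False.", "False.", "True.", "False.", "True.", "True."],
    }
)

# normalize="columns" divides each cell by its column total, so every
# target class sums to 1 and the rows show how that class is distributed
# across the feature's values.
table = pd.crosstab(index=toy["Int'l Plan"], columns=toy["Churn?"], normalize="columns")
print(table)
```

In this toy table, two thirds of the churned customers have an international plan, compared with one third of those who stayed; a gap like that is the kind of signal to look for in the real data.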

In [None]:
for column in churn.select_dtypes(include=["object"]).columns:
    if column != "Churn?":
        display(pd.crosstab(index=churn[column], columns=churn["Churn?"], normalize="columns"))

for column in churn.select_dtypes(exclude=["object"]).columns:
    print(column)
    hist = churn[[column, "Churn?"]].hist(by="Churn?", bins=30)
    plt.show()

In [None]:
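# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns;
# pandas 2.x raises an error on the remaining object columns otherwise
display(churn.corr(numeric_only=True))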
pd.plotting.scatter_matrix(churn, figsize=(12, 12))
plt.show()

We see several features that essentially have 100% correlation with one another.  Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias.  Let's remove one feature from each of the highly correlated pairs: `Day Charge` from the pair with `Day Mins`, `Eve Charge` from the pair with `Eve Mins`, `Night Charge` from the pair with `Night Mins`, and `Intl Charge` from the pair with `Intl Mins`.
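
Spotting such pairs by eye in a large correlation matrix is error-prone. One possible way to list them programmatically is sketched below on synthetic toy data; the per-minute rate, the `0.95` threshold, and the `toy` and `high_pairs` names are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic toy data: the charge is an exact linear function of the minutes.
rng = np.random.default_rng(0)
mins = rng.uniform(0, 300, size=50)
toy = pd.DataFrame(
    {
        "Day Mins": mins,
        "Day Charge": mins * 0.17,  # hypothetical per-minute rate
        "Day Calls": rng.integers(50, 150, size=50),
    }
)

corr = toy.corr()
threshold = 0.95  # arbitrary cut-off for "highly correlated"

# Visit each unordered pair of columns once and keep the strongly correlated ones.
high_pairs = [
    (a, b)
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Dropping one column from each reported pair keeps the information while removing the redundancy.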

In [None]:
# Let's drop the following features: "Day Charge", "Eve Charge", "Night Charge", "Intl Charge"
churn = #INSERT YOUR CODE HERE

As a final step, let's convert all categorical features into numeric features.
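
One common way to do this is one-hot encoding with `pd.get_dummies`. The sketch below runs it on a made-up toy frame (values and the `toy` and `encoded` names are assumptions) and shows why a binary column produces one redundant indicator.

```python
import pandas as pd

# Hypothetical toy data with invented values.
toy = pd.DataFrame(
    {
        "VMail Plan": ["yes", "no", "yes"],
        "Day Mins": [265.1, 161.6, 243.4],
        "Churn?": ["False.", "True.", "False."],
    }
)

# One-hot encode every non-numeric column; numeric columns pass through unchanged.
# dtype=int is a choice made here to get 0/1 values rather than booleans.
encoded = pd.get_dummies(toy, dtype=int)

# A binary column yields two complementary indicators, so one of them is enough.
encoded = encoded.drop(columns=["Churn?_False."])
print(encoded)
```

Each new column is named `<original column>_<value>`, which is where a name like `Churn?_False.` comes from.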

In [None]:
# Convert categorical to numeric variables and save in a new dataframe named 'model_data'
model_data = #INSERT YOUR CODE HERE


# Drop the "Churn?_False." column, as it is redundant with the target column
model_data = #INSERT YOUR CODE HERE


assert len(model_data.columns) == 100 , "data has wrong number of columns"

And now let's split the data into training, validation, and test sets.  This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.
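
A three-way split can be built from two consecutive calls to `train_test_split`. The sketch below shows the idea on a toy frame of 100 rows; the data, the `random_state` value, and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a made-up binary label.
toy = pd.DataFrame({"feature": range(100), "label": [i % 2 for i in range(100)]})

# Stage 1: hold out 30% of the rows.
train_part, holdout = train_test_split(toy, test_size=0.3, random_state=42)

# Stage 2: split the held-out rows 2:1 into validation and test.
validation_part, test_part = train_test_split(holdout, test_size=1 / 3, random_state=42)

print(len(train_part), len(validation_part), len(test_part))
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing models trained at different times.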

In [None]:
from sklearn.model_selection import train_test_split

# Split the data 70%/30% into train and validation sets
train, validation = #INSERT YOUR CODE HERE

# Move 1/3 of the validation set into a separate test set
validation, test = #INSERT YOUR CODE HERE

print(f'Training size: {train.shape}, Validation size: {validation.shape}, Test size: {test.shape}')
assert len(train) == 3500 , "Training set size should be 3500"
assert len(validation) == 1000 , "Validation set size should be 1000"
assert len(test) == 500 , "Test set size should be 500"

Finally, let's write the data to disk.
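
The effect of `index=False` when writing a CSV can be seen with a quick round trip. This is a minimal sketch on a made-up toy frame written to a temporary directory; the file name `toy.csv` and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with invented values.
toy = pd.DataFrame({"Day Mins": [265.1, 161.6], "Churn?_True.": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    # index=False keeps the row labels out of the file, so reading it back
    # does not produce an extra "Unnamed: 0" column.
    toy.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(list(round_trip.columns))
```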

In [None]:
# Write train csv with index set to False
#INSERT YOUR CODE HERE
# Write validation csv with index set to False
#INSERT YOUR CODE HERE
# Write test csv with index set to False
#INSERT YOUR CODE HERE

from os.path import exists

# ensure files are written
assert exists("train.csv") , "train.csv does not exist"
assert exists("validation.csv") , "validation.csv does not exist"
assert exists("test.csv") , "test.csv does not exist"


Our data is now ready for training our machine learning model.