### Overview of Customer Churn Dataset

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset_path = "data/customer_churn_data.csv"

dataset = pd.read_csv(dataset_path)
dataset.head()

In [None]:
dataset.info()

`dataset.info()` reports no null values in the dataset.

The feature `TotalCharges` has dtype `object`, but it is actually a numerical feature.
Likewise, `SeniorCitizen` has dtype `int64`, but it is actually a categorical feature.

In [None]:
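# Treat SeniorCitizen as categorical rather than numeric
dataset["SeniorCitizen"] = dataset["SeniorCitizen"].astype(object)
# errors="coerce" turns entries that cannot be parsed as numbers into NaN
dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"], errors="coerce")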
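
The effect of `errors="coerce"` can be checked on a small example. The sketch below uses a made-up series (the values and the blank entry are assumptions, not rows from the churn file) to show that unparseable strings become `NaN`, so a column that reported no nulls as `object` may contain missing values after conversion.

```python
import pandas as pd

# Hypothetical stand-in for a string-typed charges column; the blank entry is assumed
toy_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(toy_charges, errors="coerce")

# Count how many values were lost in the conversion
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Running `dataset["TotalCharges"].isna().sum()` after the conversion gives the same count for the real column.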

In [None]:
# dropping customer ID since it is of little value
dataset = dataset.drop("customerID", axis=1)

In [None]:
# Normalized frequency tables for each categorical feature
for column in dataset.select_dtypes(include=["object"]).columns:
    display(pd.crosstab(index=dataset[column], columns="% observations", normalize="columns"))

The last table (for `Churn`) shows that the dataset is imbalanced: roughly 73% of customers did not churn and roughly 27% did.
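
One practical consequence of this imbalance is that plain accuracy is a weak yardstick. As a rough sketch, using a made-up label column with a similar class ratio (the 73/27 counts are an assumption for illustration), the share of the majority class is the accuracy that a model predicting "No" for everyone would achieve:

```python
import pandas as pd

# Hypothetical labels with roughly the class ratio seen in the table above
toy_churn = pd.Series(["No"] * 73 + ["Yes"] * 27, name="Churn")

# Share of each class among all observations
class_share = toy_churn.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(class_share)
print(majority_baseline)
```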

In [None]:
# Histogram of numerical features
%matplotlib inline
hist = dataset.hist(bins=30, sharey=True, figsize=(10, 10))

Next, examine the relationship between each feature and the target variable, `Churn`.

In [None]:
for column in dataset.select_dtypes(include=["object"]).columns:
    if column != "Churn":
        display(pd.crosstab(index=dataset[column], columns=dataset["Churn"], normalize="columns"))

for column in dataset.select_dtypes(exclude=["object"]).columns:
    print(column)
    hist = dataset[[column, "Churn"]].hist(by="Churn", bins=30)
    plt.show()
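
The choice of `normalize` changes how these cross-tabulations should be read. With `normalize="columns"`, each `Churn` column sums to 1, so a cell answers "what share of churners fall in this category?". With `normalize="index"`, each row sums to 1, so a cell answers "what share of this category churned?". The sketch below contrasts the two on a small made-up frame (the `Contract` values and counts are assumptions):

```python
import pandas as pd

# Hypothetical data: 5 monthly customers (3 churned), 5 yearly customers (1 churned)
toy = pd.DataFrame({
    "Contract": ["Monthly"] * 5 + ["Yearly"] * 5,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No"],
})

# Distribution of contract types within each churn class (columns sum to 1)
by_class = pd.crosstab(toy["Contract"], toy["Churn"], normalize="columns")

# Churn rate within each contract type (rows sum to 1)
by_category = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(by_class)
print(by_category)
```

Both views are valid; the row-normalized one is often easier to interpret as a per-category churn rate.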

In [None]:
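# Checking if numerical features are correlated with each other
# Restrict to numeric columns: corr() raises on object columns in pandas >= 2.0
numeric_data = dataset.select_dtypes(include="number")
display(numeric_data.corr())
pd.plotting.scatter_matrix(numeric_data, figsize=(6, 6))
plt.show()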

### Data Preparation for Modelling with Gradient Boosted Trees

XGBoost uses gradient boosted trees which naturally account for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.

Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint.

SageMaker's XGBoost container expects CSV input with the target variable in the first column and no header row.
Before making these changes to the churn dataset, convert the categorical features to numerical ones by one-hot encoding them with the pandas `get_dummies` function.

In [None]:
# dtype=int keeps the indicator columns as 0/1 (pandas 2.x defaults to True/False)
model_data = pd.get_dummies(dataset, dtype=int)
model_data

In [None]:
model_data.columns

In [None]:
model_data = pd.concat(
    [model_data["Churn_Yes"], model_data.drop(["Churn_Yes", "Churn_No"], axis=1)], axis=1)
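
The encoding and column reordering can be seen end to end on a tiny made-up frame. This is only a sketch: the column names `tenure` and `Contract` and their values are assumptions for illustration. It shows that `get_dummies` creates both `Churn_Yes` and `Churn_No`, and that keeping one of them as the first column and dropping the other leaves a single binary target followed by the features.

```python
import pandas as pd

# Hypothetical miniature of the churn data
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": ["Monthly", "Yearly", "Monthly"],
    "Churn": ["Yes", "No", "No"],
})

# One-hot encode every object column, including the target
encoded = pd.get_dummies(toy, dtype=int)

# Put the binary target first and drop both target indicators from the features
toy_model_data = pd.concat(
    [encoded["Churn_Yes"], encoded.drop(["Churn_Yes", "Churn_No"], axis=1)],
    axis=1,
)
print(toy_model_data)
```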


### Splitting the dataset

The data is shuffled and then split into training (70%), validation (20%), and test (10%) sets.

In [None]:
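# Shuffle once, then slice by position (avoids calling np.split on a DataFrame,
# which triggers deprecation warnings in recent pandas versions)
shuffled = model_data.sample(frac=1, random_state=42)
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))
train_data = shuffled.iloc[:train_end]
validation_data = shuffled.iloc[train_end:validation_end]
test_data = shuffled.iloc[validation_end:]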
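
A quick sanity check on a split like this is to confirm the partition sizes and that no row appears in more than one set. The sketch below applies the same 70/20/10 cut points to a made-up 100-row frame (the frame and its columns are assumptions) using positional slicing:

```python
import pandas as pd

# Hypothetical 100-row dataset with a binary target in the first column
toy = pd.DataFrame({"Churn_Yes": [1] * 30 + [0] * 70, "x": range(100)})

# Shuffle reproducibly, then cut at 70% and 90%
shuffled = toy.sample(frac=1, random_state=42)
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))

train = shuffled.iloc[:train_end]
validation = shuffled.iloc[train_end:validation_end]
test = shuffled.iloc[validation_end:]

split_sizes = (len(train), len(validation), len(test))

# Every original row index should land in exactly one partition
all_indices = set(train.index) | set(validation.index) | set(test.index)
print(split_sizes, len(all_indices))
```

Because the split is purely random, the churn rate can differ slightly between the three sets; comparing `train["Churn_Yes"].mean()` with the other two is a cheap way to check.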

In [None]:
# Saving the data to S3 (writing to s3:// paths with pandas requires the s3fs package)
import sagemaker

bucket = "sagemaker-data113"

sess = sagemaker.Session(default_bucket=bucket)

train_data.to_csv("s3://{}/train.csv".format(bucket), header=False, index=False)
validation_data.to_csv("s3://{}/validation.csv".format(bucket), header=False, index=False)
test_data.to_csv("s3://{}/test.csv".format(bucket), header=False, index=False)
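
Before uploading, it can help to confirm locally that `header=False, index=False` produces a headerless file with the target in the first field. The sketch below writes a made-up two-row frame (the column names and values are assumptions) to an in-memory buffer instead of S3 and inspects the resulting lines:

```python
import io

import pandas as pd

# Hypothetical frame with the target already in the first column
toy = pd.DataFrame({"Churn_Yes": [1, 0], "tenure": [1, 24]})

# Write to an in-memory buffer with the same options used for the S3 upload
buffer = io.StringIO()
toy.to_csv(buffer, header=False, index=False)

# Each line should start with the label and contain no column names or index
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```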
