## Predicting Customer Churn


### History
Churn is a measurement of the percentage of accounts that cancel or choose not to renew their subscriptions. A high churn rate can negatively impact Monthly Recurring Revenue (MRR) and can also indicate dissatisfaction with a product or service.

Churn can be measured from actual usage or, when the product is sold on a subscription model, from failure to renew. It is usually evaluated over a specific period, giving a monthly, quarterly, or annual churn rate.

<h3 style="text-align:left;">How is Churn Calculated? </h3>
<p style = "text-align:left;">
In its simplest form, the churn rate is the percentage of total customers that stop using or paying for the product over a period of time. So, if there were 10,000 total customers in March and 1,000 of them stopped being customers, the monthly churn rate would be 10%.
</p>
<img src="https://www.productplan.com/uploads/Churn-Rate-1024x536.png" style="width:500px;height:300px"/>
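
The calculation above can be sketched in a few lines of Python. The function name `churn_rate` and its arguments are illustrative assumptions and are not used elsewhere in this notebook.

```python
def churn_rate(customers_at_start, customers_lost):
    """Return churn as a percentage of the customers at the start of the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# The March example from the text: 1,000 of 10,000 customers left
march_churn = churn_rate(10_000, 1_000)
print(f"Monthly churn rate: {march_churn:.1f}%")
```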

### Assignment:

Build a machine learning pipeline that engineers the features in the data set and predicts which customers are likely to leave next.

In [None]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

<br>

<br>

## Prepare the data set

In [None]:
# load the data - it is available open source and online

data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
# cast target variable to int

data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

In [None]:
# cast TotalCharges to float; values that cannot be parsed become NaN

data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

In [None]:
# drop unnecessary variables

data.drop(labels=['customerID'], axis=1, inplace=True)

In [None]:
# display data
data.head()

<br>

<br>

## Data Exploration

### Find numerical and categorical variables

In [None]:
vars_num = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

vars_cat = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Contract', 
            'DeviceProtection', 'InternetService', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity', 
            'PaymentMethod', 'StreamingMovies', 'StreamingTV', 'TechSupport']

In [None]:
print('Number of numerical variables: {}'.format(len(vars_num)))
print('Number of categorical variables: {}'.format(len(vars_cat)))

### Find missing values in variables

In [None]:
# first in numerical variables

data[vars_num].isnull().mean()

In [None]:
# now in categorical variables

data[vars_cat].isnull().mean()

### Determine cardinality of categorical variables

In [None]:
data[vars_cat].nunique(dropna=False).sort_values(ascending=True)

### Determine the distribution of numerical variables

In [None]:
data[vars_num].hist(bins=30, figsize=(10,10))
plt.show()

<br>

<br>

## Separate data into train and test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('Churn', axis=1),  # predictors
    data['Churn'],  # target
    test_size=0.2,  # fraction of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

<br>

<br>

## Feature Engineering


### Fill in Missing data in numerical variables:

- Add a binary missing indicator
- Fill NA in original variable with 0

In [None]:
# add missing indicator (1 where TotalCharges is missing, 0 otherwise)
X_train['TotalCharges_NA'] = np.where(X_train['TotalCharges'].isnull(), 1, 0)
X_test['TotalCharges_NA'] = np.where(X_test['TotalCharges'].isnull(), 1, 0)

In [None]:
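# assign the result back; calling fillna with inplace=True on a selected column
# is deprecated in recent pandas versions and may not modify the dataframe
X_train['TotalCharges'] = X_train['TotalCharges'].fillna(0)
X_test['TotalCharges'] = X_test['TotalCharges'].fillna(0)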

### Perform one hot encoding of categorical variables into k-1 binary variables

- k-1 means that if the variable contains 9 different categories, we create 8 binary variables
- Remember to drop the original categorical variable (the one with the strings) after the encoding
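
One caveat when the train and test sets are encoded separately: a category that appears in only one of them produces mismatched columns. The sketch below shows one possible way to keep the columns consistent. It uses toy data rather than the Telco file, and the names `train`, `test`, `train_enc` and `test_enc` are assumptions made for illustration.

```python
import pandas as pd

# Toy data: the hypothetical test frame lacks the 'Two year' category
train = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year']})
test = pd.DataFrame({'Contract': ['One year', 'Month-to-month']})

# k-1 encoding on train: 3 categories give 2 binary columns
train_enc = pd.get_dummies(train['Contract'], prefix='Contract', drop_first=True)

# Encode all categories in test, then keep exactly the train columns.
# Columns missing from test are added and filled with 0.
test_enc = pd.get_dummies(test['Contract'], prefix='Contract')
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Reindexing against the train columns, instead of calling `drop_first=True` on the test set, avoids dropping a different baseline category when the test set happens to lack the first one.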

In [None]:
for var in vars_cat:    
    
    # to create the binary variables, we use get_dummies from pandas    
    
    X_train = pd.concat([X_train, 
                         pd.get_dummies(X_train[var], prefix=var, drop_first=True)], 
                         axis=1)    
    
    X_test = pd.concat([X_test, 
                        pd.get_dummies(X_test[var], prefix=var, drop_first=True)], 
                        axis=1)
    

In [None]:
X_train.drop(labels=vars_cat, axis=1, inplace=True)
X_test.drop(labels=vars_cat, axis=1, inplace=True)

<br>

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()

<br>

<br>

## Scale the variables

- Use the standard scaler from Scikit-learn
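
As a rough illustration of what the scaler does, the sketch below checks on toy numbers (not the Telco data) that `StandardScaler` subtracts the mean of the train set and divides by its standard deviation. The `_toy` variable names are assumptions used only here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# Learn the mean and standard deviation from the train set only
scaler_toy = StandardScaler().fit(X_train_toy)
scaled = scaler_toy.transform(X_test_toy)

# Manual standardization with the train mean and population standard deviation
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print(scaled)
print(manual)
```

Fitting on the train set and reusing those statistics for the test set keeps information about the test data out of the preprocessing step.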

In [None]:
variables = list(X_train.columns)

In [None]:
# create scaler
scaler = StandardScaler()

# fit the scaler to the train set
scaler.fit(X_train[variables])

# transform the train and test set (the results are NumPy arrays, not dataframes)
X_train = scaler.transform(X_train[variables])

X_test = scaler.transform(X_test[variables])

<br>

<br>

## How to Treat Imbalanced Datasets

There are many ways of dealing with imbalanced data. We will focus on the following approaches:

1. Oversampling - `SMOTE`
2. Upsampling & Downsampling - `sklearn.utils.resample`

In [None]:
# from sklearn.utils import resample
# from imblearn.over_sampling import SMOTE 

# # Upsample minority class
# X_train_u, y_train_u = resample(X_train[y_train == 1],
#                                 y_train[y_train == 1],
#                                 replace=True,
#                                 n_samples=X_train[y_train == 0].shape[0],
#                                 random_state=1)
# X_train_u = np.concatenate((X_train[y_train == 0], X_train_u))
# y_train_u = np.concatenate((y_train[y_train == 0], y_train_u))



# # Downsample majority class
# X_train_d, y_train_d = resample(X_train[y_train == 0],
#                                 y_train[y_train == 0],
#                                 replace=False,  # downsample without replacement
#                                 n_samples=X_train[y_train == 1].shape[0],
#                                 random_state=1)
# X_train_d = np.concatenate((X_train[y_train == 1], X_train_d))
# y_train_d = np.concatenate((y_train[y_train == 1], y_train_d))



# # Upsample using SMOTE
# sm = SMOTE(random_state=12, sampling_strategy = 1.0)
# X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)


In [None]:
# print("Downsampled shape:", X_train_d.shape, y_train_d.shape)
# print("Original shape:", X_train.shape, y_train.shape)
# print("Upsampled shape:", X_train_u.shape, y_train_u.shape)
# print ("SMOTE sample shape:", X_train_sm.shape, y_train_sm.shape)

In [None]:
# from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.model_selection import cross_val_score

# # Create the Original, Upsampled, and Downsampled training sets
# methods_data = {"Original": (X_train, y_train),
#                 "Upsampled": (X_train_u, y_train_u),
#                 "Downsampled": (X_train_d, y_train_d),
#                 "SMOTE":(X_train_sm, y_train_sm)}

In [None]:
# # Loop through each training set and apply 5-fold CV using Logistic Regression
# # For a classifier with an integer cv, cross_val_score uses StratifiedKFold
# for method in methods_data.keys():
#     lr_results = cross_val_score(LogisticRegression(), 
#                                  methods_data[method][0], 
#                                  methods_data[method][1], 
#                                  cv=5, 
#                                  scoring='f1')
#     print(f"Mean F1 score for {method} data:")
#     print(lr_results.mean())
 

<br>

<br>

## Train the Logistic Regression model

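- Set the regularization parameter `C` to 0.0005 (`C` is the inverse of the regularization strength, so smaller values regularize more strongly)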
- Set the seed to 0
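
To give a feel for the effect of `C`, the sketch below fits the same model with two values on synthetic data and compares the size of the learned coefficients. The data comes from `make_classification`, not from the Telco file, and the names `X_toy`, `y_toy` and `norms` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for this illustration
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)

norms = {}
for C in (0.0005, 1.0):
    clf = LogisticRegression(C=C, random_state=0).fit(X_toy, y_toy)
    # A smaller C penalizes large coefficients more heavily
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: coefficient norm = {norms[C]:.4f}")
```

With the smaller `C`, the coefficients are shrunk towards zero, which is the behaviour the value 0.0005 asks for in this assignment.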

In [None]:
# set up the model
# remember to set the random_state / seed

model = LogisticRegression(C=0.0005, random_state=0)

# train the model
model.fit(X_train, y_train)

<br>

<br>

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important: to determine the accuracy, you need the predicted class (0 or 1), that is, whether the customer churned or not. To determine the roc-auc, you need the predicted probability of churn.**
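
The difference between the two inputs can be seen on a tiny hand-made example. The arrays below are invented for illustration, and the 0.5 cut-off is an assumption chosen to resemble the default decision rule of a binary classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented ground truth and predicted churn probabilities
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8])

# Accuracy needs hard class labels, here obtained with a 0.5 threshold
labels = (proba >= 0.5).astype(int)
acc = accuracy_score(y_true, labels)

# ROC-AUC needs the scores themselves, because it measures how well they rank the classes
auc = roc_auc_score(y_true, proba)

print(f"accuracy: {acc}")
print(f"roc-auc: {auc}")
```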

In [None]:
# make predictions for train set
class_ = model.predict(X_train)
pred = model.predict_proba(X_train)[:,1]

# determine roc-auc and accuracy
print('train roc-auc: {}'.format(roc_auc_score(y_train, pred)))
print('train accuracy: {}'.format(accuracy_score(y_train, class_)))
print()

# make predictions for test set
class_ = model.predict(X_test)
pred = model.predict_proba(X_test)[:,1]

# determine roc-auc and accuracy
print('test roc-auc: {}'.format(roc_auc_score(y_test, pred)))
print('test accuracy: {}'.format(accuracy_score(y_test, class_)))
print()

```
train roc-auc: 0.8410141283341551
train accuracy: 0.7816826411075612

test roc-auc: 0.8180757423881719
test accuracy: 0.7693399574166075
```

That's it! Well done.

**Keep this code safe, as we will use this notebook later on to build production code in our next assignment.**