# Predicting customer churn

Churn prediction--predicting whether a customer will stay or leave a company--is one of the more popular applications of machine learning for business, especially among consulting companies trying to sell their services.

Typically the performance of a churn classifier (0 for customer stays, 1 for customer leaves, i.e. churns) is evaluated by a standard metric such as accuracy, precision, recall or ROC-AUC. In real life, these metrics can be misleading, because they do not reflect the costs and benefits of the different outcomes that a given metric summarizes.

In the case of churn, these costs and benefits can be made very explicit in terms of the classifier's [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), which in the binary case is

\begin{equation*}
C =
\begin{pmatrix}
\mathrm{true \, negatives} &  \mathrm{false \, positives}  \\
\mathrm{false \, negatives} &  \mathrm{true \, positives}
\end{pmatrix},
\end{equation*}

or, more generally, for a classifier with $n$ classes, the entries of the confusion matrix $C = (C_{ij})$ are the counts of observations known to be in class $i$ and predicted to be in class $j$.
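
As a minimal sketch of this convention, the following uses scikit-learn's `confusion_matrix` on a handful of made-up labels (the label arrays are illustrative assumptions, not data from this notebook). With classes ordered (0, 1), row $i$ is the known class and column $j$ is the predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = customer stays, 1 = customer churns
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0])

# Rows index the known class, columns index the predicted class
C = confusion_matrix(y_true, y_pred)

# For binary labels ordered (0, 1), ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = C.ravel()

print(C)
print("tn={}, fp={}, fn={}, tp={}".format(tn, fp, fn, tp))
```

Unpacking the four counts by name is usually less error-prone than indexing the matrix directly.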

To calculate a business-relevant metric, we need to know the cost for trying to retain a customer and the benefit of retaining a customer.

## Standard metrics

* Accuracy
* Sensitivity (aka ``recall``)
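
To illustrate how these two metrics can disagree, here is a small sketch on a made-up, imbalanced sample (the 95/5 split is an assumption chosen for illustration). A classifier that always predicts "stays" looks good on accuracy while identifying no churners at all.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced sample: 95 customers stay (0), 5 churn (1)
y_true = [0] * 95 + [1] * 5

# A trivial classifier that predicts "stays" for everyone
y_pred = [0] * 100

# Accuracy: fraction of all predictions that are correct
accuracy = accuracy_score(y_true, y_pred)

# Recall (sensitivity): fraction of actual churners that were predicted to churn
recall = recall_score(y_true, y_pred)

print("accuracy={:.2f}, recall={:.2f}".format(accuracy, recall))
```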

## Churn reward: simplest case

The first case we consider has a single action: sending customers an email. The reward is the revenue from the customer over the next year minus expenses per customer. Let's make the assumptions more explicit, and flag each one as a reasonable or unreasonable approximation of reality.

* the action $a \in \{0, 1\} \leftrightarrow \{\mathrm{no\,email\,sent}, \mathrm{email\,sent}\}$ has a fixed cost for all customers (reasonable),
* there are no costs except the marketing action above (unreasonable),
* revenue $\mathrm{rev}$ is the same for all customers (unreasonable):

\begin{equation*}
\mathrm{rev} = \begin{cases}
0,  & \text{if customer churns} \\
\mathrm{rev}_1, & \text{if customer stays}
\end{cases}
\end{equation*}
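
Under the assumptions above, the per-customer reward can be sketched as revenue minus the cost of the action. The function name `reward`, the parameter `email_cost`, and the numeric values below are hypothetical and chosen only for illustration.

```python
def reward(email_sent: int, churns: bool, rev_1: float, email_cost: float) -> float:
    """Reward for one customer: revenue over the next year minus marketing cost.

    email_sent is the action a (0 = no email sent, 1 = email sent).
    """
    # Revenue is 0 if the customer churns and rev_1 if the customer stays
    revenue = 0.0 if churns else rev_1
    # The only cost is the fixed cost of the email, paid only when it is sent
    return revenue - email_cost * email_sent


# Hypothetical values: 100 revenue per retained customer, 1 per email
for a in (0, 1):
    for churns in (False, True):
        print(a, churns, reward(a, churns, rev_1=100.0, email_cost=1.0))
```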

The first task is to cast this setup as a *Markov Decision Process*.

In [None]:
import os
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelBinarizer, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, log_loss

from sklearn_pandas import DataFrameMapper

from risk_learning.config import filenames
from risk_learning.risk_learning import get_classifier_family_name

%matplotlib inline

In [None]:
df = pd.read_csv(filenames.fake_churn_simple)
print(df.info())

In [None]:
df.head()

## Split off test set

In [None]:
# Look at records per year for time split
df.groupby('year').size()

In [None]:
def split_churn_data_target(df):
    """Split by year: records from split_year onward are held out as the test set."""
    split_year = 2015
    test = df.loc[df['year'] >= split_year, :]
    train_validate = df.loc[df['year'] < split_year, :]
    
    data_train_validate = train_validate[[c for c in df.columns if c != 'churn']]
    lb = LabelBinarizer()
    target_train_validate = lb.fit_transform(train_validate['churn']).ravel()

    data_test = test[[c for c in df.columns if c != 'churn']]
    target_test = lb.transform(test['churn']).ravel()
    
    return data_train_validate, target_train_validate, data_test, target_test


data_train_validate, target_train_validate, data_test, target_test = split_churn_data_target(df)
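
The idea behind the time split can be sketched on a tiny made-up frame. The column values below are invented for illustration and are not taken from the churn files. Every record before the split year goes to training and validation, and every record from the split year onward is held out.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    'year': [2013, 2014, 2014, 2015, 2016],
    'gender': ['f', 'm', 'f', 'm', 'f'],
    'churn': [0, 1, 0, 1, 0],
})

split_year = 2015
toy_train_validate = toy.loc[toy['year'] < split_year]
toy_test = toy.loc[toy['year'] >= split_year]

# The held-out set contains only records later than anything seen in training
print(toy_train_validate['year'].max(), toy_test['year'].min())
print(len(toy_train_validate), len(toy_test))
```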

## Put preprocessing and model selection in a pipeline

In [None]:
class ChurnPipeline:
    def __init__(self, data, target, mapper, test_size=0.25):
        self._set_train_validate(data, target, test_size)
        self.mapper = mapper
        
    def _set_train_validate(self, data, target, test_size):
        X_train, X_validate, y_train, y_validate = train_test_split(
            data, target, test_size=test_size, random_state=42, stratify=target
        )
        self.X_train = X_train
        self.y_train = y_train
        self.X_validate = X_validate
        self.y_validate = y_validate
        
    def hyperparameter_grid_select(self, clf_family_dict, param_grid):
        family_name = clf_family_dict.get('name')
        print('Hyperparameter fitting for {}'.format(family_name))
        clf_family = clf_family_dict.get('clf')
    
        pipe = Pipeline([
            ('featurize', self.mapper),
            (family_name, clf_family)
            ])

        # Hyperparameter search
        clf_select = GridSearchCV(pipe, param_grid, cv=5, refit=True)
        # Fit for cross validation folds across hyperparameter values
        clf_select.fit(self.X_train, self.y_train)
        print("Best parameter (CV score={:0.3f}):".format(clf_select.best_score_))
        print(clf_select.best_params_)

        return clf_select
    
    def clf_validation_score(self, clf):
        print('\nEvaluate score on validation set')
        res = clf.score(self.X_validate, self.y_validate)
        return res
    
    def clf_log_loss(self, clf):
        print('\nEvaluate log-loss on validation set')
        # log_loss expects (y_true, predicted probabilities), not hard class labels
        res = log_loss(self.y_validate, clf.predict_proba(self.X_validate))
        return res
    
    def clf_confusion_matrix(self, clf):
        print('\nEvaluate confusion matrix on validation set')
        res = confusion_matrix(self.y_validate, clf.predict(self.X_validate))
        return res
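
Log-loss is defined on predicted probabilities, with the true labels as the first argument of `sklearn.metrics.log_loss`. The sketch below uses made-up probabilities (an assumption for illustration) to show how thresholding to hard 0/1 labels before scoring changes the result: a single confident mistake is then penalized as if the classifier had been certain.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted churn probabilities P(y = 1)
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.8, 0.7])

# Thresholding discards the information about how confident each prediction was
hard = (proba >= 0.5).astype(float)

loss_proba = log_loss(y_true, proba)  # scored on probabilities
loss_hard = log_loss(y_true, hard)    # scored on hard labels, one of which is wrong

print("log-loss on probabilities: {:.3f}".format(loss_proba))
print("log-loss on hard labels:   {:.3f}".format(loss_hard))
```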
        

In [None]:
# Preprocessing
mapper = DataFrameMapper([
    ('gender', LabelBinarizer()),
    (['profession'], OneHotEncoder()), 
])

print("Transformed features:")
pd.DataFrame(
    mapper.fit_transform(data_train_validate),
    columns=mapper.transformed_names_
).head()

### Create churn pipeline

In [None]:
churn = ChurnPipeline(data_train_validate, target_train_validate, mapper)

In [None]:
# Logistic regression
clf_family_dict = {
    'name': 'lr',
    'clf': LogisticRegression(solver='lbfgs', fit_intercept=True)
}
param_grid = {clf_family_dict.get('name') + '__C': np.logspace(1, 3, 20)}

lr_clf = churn.hyperparameter_grid_select(clf_family_dict, param_grid)
print(churn.clf_validation_score(lr_clf))
print(churn.clf_log_loss(lr_clf))
print(churn.clf_confusion_matrix(lr_clf))
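
The `name + '__C'` key follows scikit-learn's convention for addressing a parameter of a named pipeline step as `<step name>__<parameter>`. A self-contained sketch on synthetic data follows. The data from `make_classification`, the `scale` step, and the grid values are assumptions for illustration and do not come from the churn files.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('lr', LogisticRegression(solver='lbfgs')),
])

# 'lr__C' targets the C parameter of the pipeline step named 'lr'
param_grid = {'lr__C': np.logspace(-2, 2, 5)}

search = GridSearchCV(pipe, param_grid, cv=5, refit=True)
search.fit(X, y)

print(search.best_params_)
print("CV score={:0.3f}".format(search.best_score_))
```

Because `refit=True`, the returned object can be used directly for `predict` and `score` with the best parameters found.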

In [None]:
# Decision tree
clf_family_dict = {
    'name': 'dt',
    'clf': tree.DecisionTreeClassifier()
}
param_grid = {
    clf_family_dict.get('name') + '__max_depth': range(1, 10, 1)
}
dt_clf = churn.hyperparameter_grid_select(clf_family_dict, param_grid)
print(churn.clf_validation_score(dt_clf))
print(churn.clf_log_loss(dt_clf))
print(churn.clf_confusion_matrix(dt_clf))

In [None]:
# Gradient Boosted Trees
clf_family_dict = {
    'name': 'gbc',
    'clf': GradientBoostingClassifier()
}
param_grid = {
    clf_family_dict.get('name') + '__n_estimators': range(5, 10, 1)
}

gbc_clf = churn.hyperparameter_grid_select(clf_family_dict, param_grid)
print(churn.clf_validation_score(gbc_clf))
print(churn.clf_log_loss(gbc_clf))
print(churn.clf_confusion_matrix(gbc_clf))


## More complicated churn data

In [None]:
df = pd.read_csv(filenames.fake_churn)
data_train_validate, target_train_validate, data_test, target_test = split_churn_data_target(df)

In [None]:
# Preprocessing
mapper = DataFrameMapper([
    ('gender', LabelBinarizer()),
    (['age'], StandardScaler()),
    (['profession'], OneHotEncoder()), 
])

churn = ChurnPipeline(data_train_validate, target_train_validate, mapper)

In [None]:
# Logistic regression
clf_family_dict = {
    'name': 'lr',
    'clf': LogisticRegression(solver='lbfgs', fit_intercept=False)
}
param_grid = {clf_family_dict.get('name') + '__C': np.logspace(-4, 2, 20)}

lr_clf = churn.hyperparameter_grid_select(clf_family_dict, param_grid)
print(churn.clf_validation_score(lr_clf))
print(churn.clf_log_loss(lr_clf))
print(churn.clf_confusion_matrix(lr_clf))

In [None]:
# Decision tree
clf_family_dict = {
    'name': 'dt',
    'clf': tree.DecisionTreeClassifier()
}
param_grid = {
    clf_family_dict.get('name') + '__max_depth': range(1, 10, 1)
}
dt_clf = churn.hyperparameter_grid_select(clf_family_dict, param_grid)
print(churn.clf_validation_score(dt_clf))
print(churn.clf_log_loss(dt_clf))
print(churn.clf_confusion_matrix(dt_clf))

In [None]:
# Gradient Boosted Trees
clf_family_dict = {
    'name': 'gbc',
    'clf': GradientBoostingClassifier()
}
param_grid = {
    # clf_family_dict.get('name') + '__min_samples_leaf': range(3, 10),
    clf_family_dict.get('name') + '__n_estimators': range(5, 10, 1)
}

gbc_clf = churn.hyperparameter_grid_select(clf_family_dict, param_grid)
print(churn.clf_validation_score(gbc_clf))
print(churn.clf_log_loss(gbc_clf))
print(churn.clf_confusion_matrix(gbc_clf))