<img src="http://mashey.io/wp-content/uploads/2016/02/Mashey-Logo.jpg">

## Profit Curves Applied to Churn Prediction
### Samuel Sherman

Now that we have explored churn prediction and its value to a business, it is worth introducing the idea of profit curves. Previously, we examined confusion matrices, which show how a model performs on each class and which values were mislabeled. No model will be perfect, and misclassification is a necessary evil. However, we can use this to our advantage.

If we assign a dollar value representing the cost or benefit of each type of classification, we can iterate over different thresholds to determine which threshold and model are most valuable to a company. Applied to churn, it would not make sense to invest in customer retention just because a customer has a churn confidence of 51%. The default 50% cutoff is not necessarily the most valuable one, and the right threshold depends on the costs and benefits involved. With profit curves we apply an algorithm to pinpoint the threshold and model that produce the maximum benefit/profit.
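To make the idea concrete, here is a minimal sketch with made-up numbers. The five customers, their churn probabilities, the cost-benefit values, and the helper name average_profit are all assumptions for illustration, not values from the dataset. The matrix is laid out as [[TP, FP], [FN, TN]].

```python
import numpy as np

def average_profit(y_true, probabilities, threshold, cost_benefit):
    """Average profit per customer when churn is predicted at `threshold`."""
    y_pred = probabilities >= threshold
    tp = np.sum(y_pred & (y_true == 1))    # churners we correctly flag
    fp = np.sum(y_pred & (y_true == 0))    # loyal customers we flag by mistake
    fn = np.sum(~y_pred & (y_true == 1))   # churners we miss
    tn = np.sum(~y_pred & (y_true == 0))   # loyal customers we leave alone
    counts = np.array([[tp, fp], [fn, tn]])
    return np.sum(counts * cost_benefit) / float(len(y_true))

# Hypothetical labels, probabilities, and dollar values
y_true = np.array([1, 1, 0, 0, 0])
probabilities = np.array([0.9, 0.75, 0.6, 0.55, 0.2])
cost_benefit = np.array([[80, -70], [-10, 0]])

profit_at_half = average_profit(y_true, probabilities, 0.5, cost_benefit)
profit_at_high = average_profit(y_true, probabilities, 0.7, cost_benefit)
print(profit_at_half, profit_at_high)
```

With these invented numbers, the 0.5 threshold flags two loyal customers and averages 4 per customer, while the 0.7 threshold flags only the two churners and averages 32. Different data or costs could just as easily favor a lower threshold.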

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, precision_score, average_precision_score, roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit, StratifiedKFold, train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
from scipy.io.arff import loadarff
import xgboost as xgb
%matplotlib inline


def prepare_data(filename):
    churn = loadarff(filename)
    churn_df = pd.DataFrame(churn[0])
    
    # Clean up categorical columns
    churn_df['LEAVE'] = (churn_df['LEAVE'] == 'LEAVE').astype(int)
    churn_df['COLLEGE'] = (churn_df['COLLEGE'] == "one").astype(int)
    churn_df = pd.concat([churn_df,pd.get_dummies(churn_df.REPORTED_SATISFACTION)], axis = 1)
    churn_df.drop('avg', axis = 1, inplace = True)
    churn_df = pd.concat([churn_df,pd.get_dummies(churn_df.REPORTED_USAGE_LEVEL)], axis = 1)
    churn_df.drop('avg', axis = 1, inplace = True)
    churn_df = pd.concat([churn_df,pd.get_dummies(churn_df.CONSIDERING_CHANGE_OF_PLAN)], axis = 1)
    churn_df.drop('never_thought', axis = 1, inplace = True)
    churn_df.drop('REPORTED_SATISFACTION', axis = 1, inplace = True)
    churn_df.drop('REPORTED_USAGE_LEVEL', axis = 1, inplace = True)
    churn_df.drop('CONSIDERING_CHANGE_OF_PLAN', axis = 1, inplace = True)
    
    # set label array
    y = churn_df.pop('LEAVE').values
    
    # set feature matrix
    X = churn_df.values
    
    # Train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)
    
    # Scale data
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    return X_train, X_test, y_train, y_test

I first define the same function as before. It reads the data, encodes the categorical columns as dummy variables, splits the data into training and testing sets, standardizes the features, and returns the features and labels for each set.

In [2]:
def standard_confusion_matrix(y_true, y_predict):
    [[tn, fp], [fn, tp]] = confusion_matrix(y_true, y_predict)
    return np.array([[tp, fp], [fn, tn]])


def profit_curve(cost_benefit_matrix, probabilities, y_true):
    thresholds = sorted(probabilities)
    thresholds.append(1.0)
    profits = []
    for threshold in thresholds:
        y_predict = probabilities >= threshold
        confusion_mat = standard_confusion_matrix(y_true, y_predict)
        profit = np.sum(confusion_mat * cost_benefit_matrix) / float(len(y_true))
        profits.append(profit)
    return thresholds, profits

The next two functions build the confusion matrix and the profit curve. The confusion matrix function takes the real and predicted values and counts how many fall into each of the four confusion matrix components, arranged as [[true positive, false positive], [false negative, true negative]]. The profit curve function takes the cost benefit matrix, a set of predicted probabilities, and their true labels. It uses each predicted probability in turn (plus 1.0) as the threshold for classifying a one or a zero and calculates the average profit per customer at each threshold.
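The loop above rebuilds a confusion matrix for every threshold. As an alternative sketch (not the approach used in this post), the same curve can be computed with a sort and a cumulative sum. The function name profit_curve_vectorized and the toy data are assumptions, and the shortcut only matches the loop when no two probabilities are tied and none equals exactly 1.0.

```python
import numpy as np

def profit_curve_vectorized(cost_benefit, probabilities, y_true):
    """Average profit per customer at each threshold.

    Assumes no tied probabilities and no probability equal to 1.0.
    cost_benefit is laid out as [[TP, FP], [FN, TN]].
    """
    probabilities = np.asarray(probabilities, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    n = len(y_true)

    order = np.argsort(probabilities)      # ascending thresholds
    y_sorted = y_true[order]

    # At the i-th smallest threshold, customers i..n-1 are predicted to churn.
    # The appended final entry is the threshold 1.0, where nobody is flagged.
    tp = np.append(np.cumsum(y_sorted[::-1])[::-1], 0)
    predicted_pos = np.append(np.arange(n, 0, -1), 0)
    fp = predicted_pos - tp
    fn = y_true.sum() - tp
    tn = (n - predicted_pos) - fn

    (b_tp, c_fp), (c_fn, b_tn) = cost_benefit
    profits = (tp * b_tp + fp * c_fp + fn * c_fn + tn * b_tn) / float(n)
    thresholds = np.append(probabilities[order], 1.0)
    return thresholds, profits

# Hypothetical toy data
y_true = np.array([1, 1, 0, 0, 0])
probabilities = np.array([0.9, 0.75, 0.6, 0.55, 0.2])
cost_benefit = np.array([[80, -70], [-10, 0]])

thresholds, profits = profit_curve_vectorized(cost_benefit, probabilities, y_true)
best_threshold = thresholds[np.argmax(profits)]
print(list(zip(thresholds, profits)), best_threshold)
```

On this toy data the curve peaks at a threshold of 0.75. The first entry is the "flag everyone" case and the last entry is the "flag no one" case.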

In [3]:
def run_profit_curve(model, costbenefit, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    probabilities = model.predict_proba(X_test)[:, 1]
    thresholds, profits = profit_curve(costbenefit, probabilities, y_test)
    return thresholds, profits


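def plot_profit_models(models, costbenefit, X_train, X_test, y_train, y_test):
    # Despite the name, this only collects the curves; plotting happens later.
    profit_dict = {}
    for model in models:
        thresholds, profits = run_profit_curve(model,
                                               costbenefit,
                                               X_train, X_test,
                                               y_train, y_test)
        # Each model has its own sorted probabilities, so this key is
        # overwritten and ends up holding the last model's thresholds.
        profit_dict['thresholds'] = thresholds
        # Key each curve by the first 12 characters of the model's repr
        profit_dict[str(model)[:12]] = profits
    return profit_dict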

run_profit_curve takes a given model, trains it, computes the predicted probabilities on the test set, and uses these values to build the profit curve with the previous function. plot_profit_models takes a set of different models, the cost benefit matrix, and the training and testing data, and builds the profit curve for each model, returning them in a dictionary.

In [4]:
def find_best_threshold(models, costbenefit, X_train, X_test, y_train, y_test):
    max_model = None
    max_threshold = None
    max_profit = None
    for model in models:
        thresholds, profits = run_profit_curve(model, costbenefit,
                                               X_train, X_test,
                                               y_train, y_test)
        max_index = np.argmax(profits)
        if not max_model or profits[max_index] > max_profit:
            max_model = model.__class__.__name__
            max_threshold = thresholds[max_index]
            max_profit = profits[max_index]
    return max_model, max_threshold, max_profit

Finally, the find_best_threshold function takes the same arguments as the previous function and finds the threshold and model associated with the highest possible profit.
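Once a threshold has been chosen, using it is a simple comparison against each customer's predicted churn probability. The sketch below is one way this could look. The helper name select_customers_to_target, the probabilities, and the 0.4 threshold are all hypothetical.

```python
import numpy as np

def select_customers_to_target(probabilities, threshold):
    """Indices of customers at or above the threshold, most likely churners first."""
    probabilities = np.asarray(probabilities, dtype=float)
    flagged = np.flatnonzero(probabilities >= threshold)
    # Order the flagged customers by descending churn probability
    return flagged[np.argsort(-probabilities[flagged])]

# Hypothetical churn probabilities for five new customers
new_probabilities = np.array([0.12, 0.41, 0.33, 0.87, 0.35])
targets = select_customers_to_target(new_probabilities, threshold=0.4)
print(targets)
```

Here only the customers at positions 3 and 1 would receive a retention offer, in that order.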

In [12]:
def main():
    X_train, X_test, y_train, y_test = prepare_data('data/churn.arff')
    costbenefit = np.array([[80, -70], [-10, 0]])
    models = [RF(random_state = 2, n_estimators = 100, n_jobs = -1), 
              LR(random_state = 2, n_jobs = -1), 
              xgb.XGBClassifier(learning_rate = 0.01, max_depth = 5, n_estimators = 500), 
              SVC(probability=True)]
    profit_dict = plot_profit_models(models, costbenefit,
                       X_train, X_test, y_train, y_test)
    max_model, max_threshold, max_profit = find_best_threshold(models, costbenefit,
                              X_train, X_test, y_train, y_test)
    return max_model, max_threshold, max_profit, profit_dict

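I use this function to pull together everything that I defined before. The cost benefit matrix follows the same [[TP, FP], [FN, TN]] layout as standard_confusion_matrix: a benefit of 80 for each true positive, a cost of 70 for each false positive, a cost of 10 for each false negative, and nothing for a true negative. In this case I am using a random forest model, a logistic regression model, a gradient boosted model from xgboost, and a support vector machine. It will return the model and threshold associated with the max profit, the max profit itself, and a dictionary of profits for all thresholds in each model.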

In [13]:
max_model, max_threshold, max_profit, profit_dict = main()
profit_df = pd.DataFrame.from_dict(profit_dict)
thresholds = profit_df.pop('thresholds')

In [14]:
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls
py.sign_in('username', 'api_key')  # supply your own Plotly credentials

data = []
for model in profit_df.columns.values:
    profit_vals = go.Scatter(x = thresholds, y = profit_df[model], mode = 'lines', name = model)
    data.append(profit_vals)
    
layout = go.Layout(
    title='Profit Curve per Model',
    xaxis=dict(
        title='Threshold',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='Profit',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    )
)
Profit_Curve = go.Figure(data=data, layout=layout)    
py.iplot(Profit_Curve, filename='Profit_Curve')

In [15]:
max_model, max_threshold, max_profit

('XGBClassifier', 0.34050757, 17.737500000000001)

From the graph and the output above, the best model was the gradient boosted model at a threshold of about 34%, with an associated profit of about 17.7 per customer in the test set. A do-nothing approach is equivalent to classifying everything as zero, that is, having a threshold of 100%. On the graph, the do-nothing approach at threshold 1 has a profit of -5. At the other end of the spectrum, classifying everything as one (threshold 0) already provides a substantial improvement over doing nothing. I would claim that this is justification in itself for using this type of algorithm to give your company a competitive edge.
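The two baselines can be checked by hand from the cost benefit matrix. With a false negative cost of 10, a do-nothing profit of about -5 suggests that roughly half of the test customers churn. The 0.5 churn rate below is that inference rather than a figure reported above, and the helper name baseline_profits is hypothetical.

```python
import numpy as np

def baseline_profits(churn_rate, cost_benefit):
    """Average profit per customer for the two trivial strategies.

    cost_benefit is laid out as [[TP, FP], [FN, TN]].
    """
    (b_tp, c_fp), (c_fn, b_tn) = cost_benefit
    # Flag no one: every churner is a false negative, everyone else a true negative
    do_nothing = churn_rate * c_fn + (1 - churn_rate) * b_tn
    # Flag everyone: every churner is a true positive, everyone else a false positive
    target_everyone = churn_rate * b_tp + (1 - churn_rate) * c_fp
    return do_nothing, target_everyone

cost_benefit = np.array([[80, -70], [-10, 0]])
do_nothing, target_everyone = baseline_profits(0.5, cost_benefit)  # assumed churn rate
print(do_nothing, target_everyone)
```

Under that assumed churn rate, doing nothing averages -5 per customer and targeting everyone averages 5, both well below the 17.7 found at the best threshold.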
