# Predicting Customer Churn - Overfitting

## Setup

### Coding constants

In [4]:
GRAPH_WIDTH = 10
GRAPH_HEIGHT = 5

### Common imports

In [5]:
# Data processing 
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data visualization 
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Data modeling 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

### Data loading

In [6]:
df = pd.read_csv('data/customer-churn.csv')
df.head()

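| | COLLEGE | INCOME | OVERAGE | LEFTOVER | HOUSE | HANDSET_PRICE | OVER_15MINS_CALLS_PER_MONTH | AVERAGE_CALL_DURATION | REPORTED_SATISFACTION | REPORTED_USAGE_LEVEL | CONSIDERING_CHANGE_OF_PLAN | LEAVE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 31953 | 0 | 6 | 313378 | 161 | 0 | 4 | 3 | 3 | 1 | 1 |
| 1 | 1 | 36147 | 0 | 13 | 800586 | 244 | 0 | 6 | 3 | 3 | 2 | 1 |
| 2 | 1 | 27273 | 230 | 0 | 305049 | 201 | 16 | 15 | 3 | 4 | 3 | 1 |
| 3 | 0 | 120070 | 38 | 33 | 788235 | 780 | 3 | 2 | 3 | 0 | 2 | 0 |
| 4 | 1 | 29215 | 208 | 85 | 224784 | 241 | 21 | 1 | 4 | 3 | 0 | 1 |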


In [7]:
df.shape

(20000, 12)

In [8]:
df.columns

Index(['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE',
       'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION',
       'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL',
       'CONSIDERING_CHANGE_OF_PLAN', 'LEAVE'],
      dtype='object')

## Data Modeling

More robust ways to measure a classification model's performance involve concepts such as recall and precision.

In this notebook, we will keep using our simple measure of performance, accuracy, which is the ratio of correct predictions to all predictions:

In [29]:
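def evaluate_performance(model, attributes, labels):
    """Return the accuracy of model on the given attributes and labels."""
    labels_pred = model.predict(attributes)
    # Count the predictions that match the true labels
    n_correct = sum(labels_pred == labels)
    # Accuracy: correct predictions divided by all predictions
    performance = n_correct / len(labels_pred)

    return performance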
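
The precision and recall mentioned above can be computed with functions from sklearn.metrics. The following is a minimal sketch on made-up labels and predictions (the arrays are hypothetical and not taken from the churn dataset), shown next to accuracy for comparison:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels and predictions, made up for illustration (1 = customer leaves)
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
labels_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

accuracy = accuracy_score(labels, labels_pred)    # correct predictions / all predictions
precision = precision_score(labels, labels_pred)  # true positives / predicted positives
recall = recall_score(labels, labels_pred)        # true positives / actual positives

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On this toy example the three measures differ, which is why they can tell different stories about the same model.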

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split(). Its random_state parameter sets the seed of the random number generator, so calling train_test_split with the same random_state on the same data always yields the same split.

We will split our data into two sets:
- A training set that we will use for training several model types
- A test set that we will use for estimating the final generalization performance

In [60]:
training_set, test_set = train_test_split(df, test_size=0.2, random_state=15)

print(training_set.shape)
print(test_set.shape)

(16000, 12)
(4000, 12)
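
The effect of random_state can be checked directly. The sketch below uses a small, hypothetical dataframe (toy_df is an illustrative stand-in, not the churn data) and splits it twice with the same seed:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe standing in for the churn data
toy_df = pd.DataFrame({"OVERAGE": range(10), "LEAVE": [0, 1] * 5})

# Same random_state: the two calls select the same rows
train_a, test_a = train_test_split(toy_df, test_size=0.2, random_state=15)
train_b, test_b = train_test_split(toy_df, test_size=0.2, random_state=15)
same_split = test_a.index.equals(test_b.index)

print("Identical test rows:", same_split)
print("Sizes:", len(train_a), len(test_a))
```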


Split the training set into two parts:
- One dataframe containing all the attributes
- One series containing all the labels, i.e., the values of the target variable (LEAVE)

In [61]:
training_attr = training_set.loc[:, training_set.columns != 'LEAVE']
training_labels = training_set['LEAVE']

Perform the same task for the test set:

In [62]:
test_attr = test_set.loc[:, test_set.columns != 'LEAVE']
test_labels = test_set['LEAVE']

### Measuring model performance using cross-validation

Cross-validation is a technique that we can use to better estimate the generalization performance by evaluating the model's accuracy on multiple holdout sets. We can use those evaluations to compute statistics on the estimated performance, such as the mean and variance.

Create a function that displays the results of K-fold cross-validation:

In [63]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In this Notebook, we will use cross-validation to estimate and compare the generalization performance of several models. In this way, we can decide which model is our best choice, and then estimate the generalization performance on a final holdout test set. 

> **Note that this is a schematic use of cross-validation.** We will use the same cross-validation technique to compare models with different parameters (i.e., trees with different depths) and models from different families (e.g., decision trees and linear classifiers). In practice, more sophisticated techniques are available to estimate the optimal parameters of a model.
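
One such technique, named here only as an example and not used in this notebook, is a cross-validated grid search with scikit-learn's GridSearchCV. The sketch below runs on synthetic data from make_classification rather than on the churn data, and the candidate depths are an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for training_attr / training_labels
X, y = make_classification(n_samples=500, n_features=11, random_state=0)

# Candidate depths are an arbitrary, illustrative choice
param_grid = {"max_depth": [3, 5, 7, 9]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)  # cross-validates every candidate, then refits the best one

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```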

### Decision tree with depth = 5

Create a decision tree with depth = 5.

In [64]:
tree_depth_5_clf = DecisionTreeClassifier(max_depth = 5)

Scikit-Learn provides a utility for performing K-fold cross-validation called cross_val_score.

The following code splits the training set into 10 distinct subsets (folds). For a classifier and an integer cv, the folds are stratified by class and, by default, not shuffled. It then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds.

The result is an array containing the 10 evaluation scores:

In [65]:
scores = cross_val_score(tree_depth_5_clf, training_attr, training_labels, scoring='accuracy', cv=10)
display_scores(scores)

Scores: [0.6925   0.676875 0.6975   0.6775   0.69125  0.7025   0.691875 0.701875
 0.684375 0.67625 ]
Mean: 0.68925
Standard deviation: 0.009531198770354134
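
To make the procedure concrete, the loop that cross_val_score runs can be sketched by hand. The example below is a sketch on synthetic data, not on the churn data. It relies on the fact that, for a classifier and an integer cv, scikit-learn uses stratified folds without shuffling, so StratifiedKFold(n_splits=10) should reproduce the same folds:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for training_attr / training_labels
X, y = make_classification(n_samples=500, n_features=11, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the other 9 folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, scoring="accuracy", cv=10)
print("Manual loop matches cross_val_score:", np.allclose(manual_scores, library_scores))
```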


### Decision tree with depth = 7

In [66]:
tree_depth_7_clf = DecisionTreeClassifier(max_depth = 7)

In [67]:
scores = cross_val_score(tree_depth_7_clf, training_attr, training_labels, scoring='accuracy', cv=10)
display_scores(scores)

Scores: [0.68     0.68125  0.689375 0.668125 0.676875 0.701875 0.684375 0.69375
 0.6775   0.668125]
Mean: 0.682125
Standard deviation: 0.010144271782636747
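
One way to see why a deeper tree may not help is to compare accuracy on the training folds with accuracy on the held-out folds. The sketch below uses cross_validate with return_train_score=True on synthetic, noisy data (not the churn dataset); the list of depths is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y); not the churn dataset
X, y = make_classification(n_samples=1000, n_features=11, flip_y=0.2, random_state=0)

results = {}
for depth in (2, 5, 10, None):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_out = cross_validate(clf, X, y, scoring="accuracy", cv=10, return_train_score=True)
    # Store (mean accuracy on training folds, mean accuracy on held-out folds)
    results[depth] = (cv_out["train_score"].mean(), cv_out["test_score"].mean())
    print(f"max_depth={depth}: train={results[depth][0]:.3f} held-out={results[depth][1]:.3f}")
```

A widening gap between the two numbers as depth grows is the usual sign of overfitting.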


### Logistic regression

In [68]:
logistic_regr_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
])

In [69]:
scores = cross_val_score(logistic_regr_clf, training_attr, training_labels, scoring='accuracy', cv=10)
display_scores(scores)

Scores: [0.6025   0.61625  0.640625 0.6125   0.638125 0.63875  0.615625 0.635
 0.61     0.5925  ]
Mean: 0.6201875
Standard deviation: 0.01606882171318109


### Linear support vector machine

In [70]:
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC()), # SVC stands for support vector classifier
])

In [71]:
scores = cross_val_score(svm_clf, training_attr, training_labels, scoring='accuracy', cv=10)
display_scores(scores)

Scores: [0.6      0.615625 0.638125 0.618125 0.63875  0.6375   0.61375  0.63375
 0.61     0.595625]
Mean: 0.620125
Standard deviation: 0.015267919799370202


## Evaluation

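By comparing the cross-validated performance of each model, we can select the model with the best average accuracy.

In this case, the best model is the decision tree with depth = 5, with a mean accuracy of about 0.689, compared with about 0.682 for the tree with depth = 7 and about 0.620 for both linear models.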
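
The selection step can be written out explicitly. The sketch below hard-codes the means and standard deviations printed in the cross-validation cells above (the dictionary and its keys are illustrative names) and picks the model with the highest mean:

```python
# Mean and standard deviation of the 10-fold accuracies reported above
cv_results = {
    "decision tree, depth 5": (0.68925, 0.009531198770354134),
    "decision tree, depth 7": (0.682125, 0.010144271782636747),
    "logistic regression": (0.6201875, 0.01606882171318109),
    "linear SVM": (0.620125, 0.015267919799370202),
}

# Select the model with the highest mean cross-validated accuracy
best_name = max(cv_results, key=lambda name: cv_results[name][0])

for name, (mean, std) in sorted(cv_results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:<24} mean={mean:.4f} std={std:.4f}")
print("Best by mean accuracy:", best_name)
```

The gap between the two trees (about 0.007) is smaller than the fold-to-fold standard deviation of either (about 0.01), so their ranking is less clear-cut than the gap between trees and linear models.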

To get an independent estimate of the selected model's generalization accuracy, we need a test set that is strictly independent of the data we used when selecting the best model.

> The test set should be separate from the sets used in cross-validation, so that it is never used to make any modeling decisions.

Once we've chosen the best model, why not learn a new tree with depth = 5 from the whole, original training set?

Then we might get the best of both worlds: we use the subtraining/validation splits to pick the best model without tainting the test set, and then build a model of this best complexity on the entire training set (subtraining plus validation):

In [72]:
tree_depth_5_clf.fit(training_attr, training_labels)

DecisionTreeClassifier(max_depth=5)

In [74]:
performance = evaluate_performance(tree_depth_5_clf, test_attr, test_labels)

"Generalization performance of our final model— " +\
    f"decision tree classifier depth = {tree_depth_5_clf.get_depth()}: {performance * 100:.2f}%"

'Generalization performance of our final model— decision tree classifier depth = 5: 69.10%'
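
Because this estimate comes from a finite test set, it carries some sampling uncertainty. As a rough sketch, assuming the 4,000 test predictions behave like independent trials and using the normal approximation for a proportion (an assumption added here, not part of the notebook's method), the standard error can be computed as:

```python
import math

n_test = 4000        # rows in test_set, from the shapes printed earlier
accuracy = 0.6910    # test accuracy reported above

# Normal approximation to the binomial: standard error of a proportion
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
low = accuracy - 1.96 * std_error
high = accuracy + 1.96 * std_error

print(f"Standard error: {std_error:.4f}")
print(f"Approximate 95% interval: [{low:.3f}, {high:.3f}]")
```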