# Decision Trees

## Introduction

*  In this practical, we will learn to train decision tree classifiers and regressors and evaluate their performance <br>
*  We will first build the trees without constraints and use their performance as the baseline for comparison <br>
*  We will then set regularization hyperparameters and use grid search with cross validation to find the best-performing model <br>
*  We will use the telecom customer churn dataset for these exercises

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## 1. Decision tree classifier with customer churn dataset
*  We will train a decision tree classifier on the telecom customer churn dataset <br>
*  The objective is to predict customer churn (true or false) based on the customers' service usage attributes in the dataset <br>
(Some parts of this section are adapted from Reference [1])

### 1.1 Data exploration and preparation

As the first step, let's explore the dataset and perform necessary preparation for the machine learning task at hand

* Import the customer churn data
* Remove any columns that are unlikely to be useful for prediction
* Encode data types as appropriate
* Split the dataset into the training and test sets

Load the data

In [152]:
data = pd.read_csv('Orange_Telecom_Churn_Data.csv')

What is the size of the dataset?

In [None]:
data.shape

What are the attributes (features) and their data types? Any missing data?

In [None]:
data.info()

Look at the first 20 data instances and take note of the four non-numeric features

In [None]:
data.head(20).T

Examine the data range of numeric features <br>
As decision trees can handle data of any scale, we seldom need to perform data scaling for training decision tree models

In [None]:
data.describe()
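The claim that decision trees are insensitive to feature scale can be checked with a small sketch. It uses synthetic data (not the churn dataset), and the names `X_demo`, `y_demo`, `X_new` and `scale` are made up for illustration. Each split only compares one feature against a threshold, so rescaling a column by a positive factor should move the thresholds but leave the predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_new = rng.normal(size=(100, 3))

# Rescale each column by a different positive factor
# (powers of two are used so that the floating-point arithmetic stays exact)
scale = np.array([1024.0, 16.0, 2.0])

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo * scale, y_demo)

# Compare the predictions of the two trees on unseen points
same_predictions = np.array_equal(tree_raw.predict(X_new),
                                  tree_scaled.predict(X_new * scale))
print(same_predictions)
```

Models that rely on distances or gradients (for example k-nearest neighbours or regularized linear models) do not share this property, which is why scaling matters more for them.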

Among the features, `phone_number` is unique to each customer and should not be used as a predictor <br>
Also, `area_code` and `state` are unlikely to be useful, unless there is some reason to assume that churn depends on a specific geographic factor <br>
We can drop these features

In [157]:
data.drop(['state', 'phone_number', 'area_code'], axis=1, inplace=True)

In [None]:
data.shape

All features are numeric except `intl_plan` and `voice_mail_plan`, which need to be boolean encoded (Why?)

In [None]:
data.dtypes

In [None]:
data.intl_plan.value_counts()

In [None]:
data.voice_mail_plan.value_counts()

In [None]:
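for col in ['intl_plan', 'voice_mail_plan']:
    # Map 'yes'/'no' to booleans; use the built-in bool, since the np.bool alias was removed from recent NumPy releases
    data[col] = data[col].map({'yes': True, 'no': False}).astype(bool)
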
data[['intl_plan', 'voice_mail_plan']].dtypes

In [None]:
data.head(20).T

Examine the distribution of the target variable `churned` <br>
Split the dataset into the training and test sets (decide if a stratified split should be used or not, based on the distribution) <br>
After the split, examine the distribution of the target variable in the training and test sets to be assured that the two classes are proportionally represented in both datasets

In [None]:
data.churned.value_counts()

In [None]:
data.churned.value_counts(normalize=True)

Data is skewed at ~85% towards non-churned customers <br>
This will be an important factor to consider when evaluating the trained model <br>
Given the skew in the target variable, let's split the data into training and test sets that are stratified by the `churned` values (target labels) <br>
We will use `StratifiedShuffleSplit()` rather than a plain `train_test_split()`, which splits purely at random unless its `stratify` argument is set

In [166]:
from sklearn.model_selection import StratifiedShuffleSplit

# Separate feature columns and the target column
feature_cols = [x for x in data.columns if x != 'churned']

# Split the data into two parts with 30% samples in the test set

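# This creates a splitter object; its split() method returns a generator of index arrays
strat_shuff_split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Get the (positional) train and test indices from the generator
train_idx, test_idx = next(strat_shuff_split.split(data[feature_cols], data['churned']))

# Create the data sets (positional indices match the row labels here because data has a default RangeIndex)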
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'churned']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'churned']
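As a side note, the same kind of stratified split can be obtained from `train_test_split()` by passing `stratify`. The sketch below uses a synthetic 85/15 label vector (the names `X_demo` and `y_demo` are illustrative, not part of the practical) and checks that both subsets keep the class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 85/15 imbalance, for illustration only
y_demo = np.array([0] * 850 + [1] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Without `stratify`, the two rates would generally differ from each other by chance, which matters most when the minority class is small.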

In [None]:
X_train.head(20).T

In [None]:
y_train.head(20)

Now check the number of each target label in the training and test sets

In [None]:
y_train.value_counts()

In [None]:
y_train.value_counts(normalize=True)

Take note of the churned and non-churned numbers in the test set (we need to take these into consideration when evaluating the performance metrics of the trained classifier)

In [None]:
y_test.value_counts()

In [None]:
y_test.value_counts(normalize=True)

Data preparation is done <br>
Now, build the decision tree model

### 1.2 Decision tree without constraints
* Fit a decision tree classifier without set limits to constrain its growth (i.e. no regularization) <br>
* This is almost certain to lead to overfitting <br>
* Determine how many nodes there are and what the depth of this (very large) tree is <br>
* Using this tree, measure the prediction error in the training and test sets <br>

In [173]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf = dt_clf.fit(X_train, y_train)

Get the number of nodes and the depth of the tree

In [None]:
dt_clf.tree_.node_count, dt_clf.tree_.max_depth

The decision tree model also tells us the relative importance of each feature through the `feature_importances_` attribute<br>
Which attributes are the biggest predictors of customer churn?

In [None]:
dt_clf.feature_importances_

In [None]:
feature_cols

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

feature_imp = pd.Series(dt_clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
fig = plt.figure(figsize=(12,5))
ax = feature_imp.plot(kind='bar')
ax.set(ylabel='Relative Importance');

We define a function to return various error metrics

In [178]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def measure_error(y_true, y_pred, label):
    return pd.Series({'accuracy':accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)},
                      name=label)
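To make the four metrics concrete, here is a small worked example on hand-made labels (the lists `y_true_demo` and `y_pred_demo` are invented for illustration). With 2 true positives, 2 false negatives, 1 false positive and 5 true negatives, accuracy is 7/10, precision is 2/3, recall is 2/4 and F1 is 4/7.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hand-made example: 4 actual positives, 6 actual negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
# Counts: TP = 2, FN = 2, FP = 1, TN = 5

acc = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / total = 7/10
prec = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 2/3
rec = recall_score(y_true_demo, y_pred_demo)      # TP / (TP + FN) = 2/4
f1 = f1_score(y_true_demo, y_pred_demo)           # 2*TP / (2*TP + FP + FN) = 4/7

print(acc, prec, rec, f1)
```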

Get the performance of the trained model on the training and test sets

In [None]:
# The error on the training and test data sets
y_train_pred = dt_clf.predict(X_train)
y_test_pred = dt_clf.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                                   measure_error(y_test, y_test_pred, 'test')],
                                   axis=1)

train_test_full_error

*  What do you think is going on here based on the difference in training and test accuracies? <br>
*  The decision tree predicts far better on the training data than the test data, which is a sign of overfitting <br>
*  Also notice the perfect scores on the training data

We should also take a look at the confusion matrix on the test set prediction results

In [180]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
cm = confusion_matrix(y_test, y_test_pred)
print(cm)

sns.set_context('talk')
ax = sns.heatmap(cm, annot=True, fmt='d')
labels = ['False', 'True']
ax.set_xticklabels(labels);
ax.set_yticklabels(labels);
ax.set_ylabel('Actual');
ax.set_xlabel('Predicted');

In [None]:
print(classification_report(y_test, y_test_pred))

How would we interpret this confusion matrix? <br>
Given the skewed distribution of the target classes, which metric do you think is the most important? <br>
What could be a better result given the objective of this machine learning task (i.e. to predict customer churn)

Although recall is not high for the customers who churned (positive class) since their number is quite small, we are still doing better than a naive baseline that always predicts the majority (non-churned) class (classification accuracy of 91% vs 86% for the baseline)
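A useful reference point on imbalanced data is a classifier that always predicts the majority class: its accuracy equals the majority class share, while its recall on the minority class is zero. The sketch below illustrates this with scikit-learn's `DummyClassifier` on a synthetic 86/14 label vector (the arrays `X_demo` and `y_demo` are assumptions for illustration, not the churn data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 86 negatives, 14 positives (illustration only)
y_demo = np.array([0] * 86 + [1] * 14)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)  # share of the majority class
baseline_recall = recall_score(y_demo, y_base)      # no positive is ever found

print(baseline_accuracy, baseline_recall)
```

This is why accuracy alone can look good on skewed data, and why recall and precision on the churned class deserve attention.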

### 1.3. Regularization using grid search with cross validation

* Using grid search with cross validation, we try to find a combination of hyperparameters that will produce a decision tree with optimized performance <br>
* In this experiment, we will try to find an optimal combination of `max_depth` and `max_features`, both of which are regularization hyperparameters <br>
* Find out the number of nodes and the depth of this tree <br>
* Measure the errors on the training and test sets and compare them to those obtained from the tree in Section 1.2 <br>

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': range(1, dt_clf.tree_.max_depth+1, 2),
              'max_features': range(1, len(X_train.columns)+1)}

gr_dt_clf = GridSearchCV(DecisionTreeClassifier(random_state=42),
                         param_grid=param_grid,
                         scoring='accuracy',
                         n_jobs=-1)

gr_dt_clf = gr_dt_clf.fit(X_train, y_train)

Get the number of nodes and the depth of the best-performing tree based on the combinations of `max_depth` and `max_features` specified by us <br>
As we will see, this is a much smaller tree than the unrestricted one obtained in Section 1.2

In [None]:
gr_dt_clf.best_estimator_.tree_.node_count, gr_dt_clf.best_estimator_.tree_.max_depth

We can obtain the best combination of the hyperparameters from the `best_params_` attribute

In [None]:
gr_dt_clf.best_params_
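Besides `best_params_`, the fitted search object keeps the cross-validated score of every combination in `cv_results_`. The following sketch runs a small grid on synthetic data (the dataset, the grid values and the name `search` are illustrative assumptions) and tabulates the results. When `cv` is not given, recent scikit-learn versions default to 5-fold cross validation; it is set explicitly here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [1, 3, 5]},
                      scoring='accuracy',
                      cv=5)
search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best rank first
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
print(search.best_params_, search.best_score_)
```

Looking at the whole table, rather than only the winner, shows how close the runner-up settings are and how much the score varies across folds.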

Get the performance metrics of the best-performing tree

In [None]:
y_train_pred = gr_dt_clf.predict(X_train)
y_test_pred = gr_dt_clf.predict(X_test)

train_test_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_error

The test accuracy is much improved from the previous one <br> 

Take a look at the confusion matrix on the test set prediction results

In [None]:
cm = confusion_matrix(y_test, y_test_pred)

sns.set_context('talk')
ax = sns.heatmap(cm, annot=True, fmt='d')
labels = ['False', 'True']
ax.set_xticklabels(labels);
ax.set_yticklabels(labels);
ax.set_ylabel('Actual');
ax.set_xlabel('Predicted');

In [None]:
print(classification_report(y_test, y_test_pred))

We did not get the better recall score for the churned customers that we had hoped for, although the precision is much better

To address the class imbalance, one technique that we can try is to set `class_weight` when training the decision tree model

For training of decision tree classifiers on imbalanced datasets, interested learners may want to refer to this article: https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/
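For intuition, `class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * count_per_class)`. The sketch below reproduces these weights by hand and with scikit-learn's `compute_class_weight` helper on a synthetic 85/15 label vector (`y_demo` is an illustrative assumption).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with an 85/15 imbalance, for illustration only
y_demo = np.array([0] * 85 + [1] * 15)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * number of samples in each class)
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# The same weights computed by scikit-learn
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print(dict(zip(classes, weights)))
```

The minority class receives the larger weight, so misclassifying a churned customer costs more during training, which tends to raise recall at the expense of precision.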

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': range(1, dt_clf.tree_.max_depth+1, 2),
              'max_features': range(1, len(X_train.columns)+1)}

gr_dt_clf = GridSearchCV(DecisionTreeClassifier(class_weight='balanced', random_state=42),
                         param_grid=param_grid,
                         scoring='accuracy',
                         n_jobs=-1)

gr_dt_clf = gr_dt_clf.fit(X_train, y_train)

In [None]:
gr_dt_clf.best_estimator_.tree_.node_count, gr_dt_clf.best_estimator_.tree_.max_depth

In [None]:
gr_dt_clf.best_params_

Get the performance metrics of the best-performing tree

In [None]:
y_train_pred = gr_dt_clf.predict(X_train)
y_train_pred_prob = gr_dt_clf.predict_proba(X_train)
y_test_pred = gr_dt_clf.predict(X_test)
y_test_pred_prob = gr_dt_clf.predict_proba(X_test)

train_test_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_error

The recall is much improved, but at the expense of the precision and the overall accuracy <br> 

Evidently, this is the precision/recall tradeoff at work<br>
Where we should set the threshold depends on the problem at hand and the objective we aim to achieve with the machine learning model that we are building

Take a look at the confusion matrix for the prediction results on the training set

In [None]:
cm = confusion_matrix(y_train, y_train_pred)

sns.set_context('talk')
ax = sns.heatmap(cm, annot=True, fmt='d')
labels = ['False', 'True']
ax.set_xticklabels(labels);
ax.set_yticklabels(labels);
ax.set_ylabel('Actual');
ax.set_xlabel('Predicted');

In [None]:
print(classification_report(y_train, y_train_pred))

For decision tree classifiers, as we have learned, the class label assigned to a leaf node (hence the predicted class for all data instances associated with the leaf node) is the class that has the higher probability (i.e. >= 50%) in the leaf node

However, we can change the default threshold (50%) to shift the decision point for classification one way or another, in favor of precision or recall

We want to look at the predicted probability of churn for each training instance (the decision tree's `predict_proba()` method can tell us this), and examine the distribution of these probabilities against the true labels

In [None]:
y_train_pred

In [None]:
y_train_pred_prob

In [197]:
y_train_scores = y_train_pred_prob[:, 1]  # we use the positive class (churned) probability as the decision score

In [None]:
# Convert the boolean target to 0/1 integers
y_train = y_train.astype(np.int64)

train_predict = pd.DataFrame({'actual': y_train.values,
                             'predict': y_train_scores})

# Use a single marker size: recent matplotlib versions reject an `s` list whose length differs from the number of points
train_predict.plot.scatter(x='predict', y='actual', s=60, alpha=0.05, figsize=(10,6));

Now, let's take a look at the precision/recall curve

In [199]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train, y_train_scores)

In [None]:
thresholds

In [None]:
def plot_precision_recall_curve(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.xlim([0, 1])
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_curve(precisions, recalls, thresholds)
plt.show()
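Instead of reading a threshold off the plot, it can also be picked programmatically. The sketch below, on a tiny hand-made example (the arrays and the `target_recall` value are assumptions for illustration), selects the highest threshold whose recall still meets a target. Note that `precision_recall_curve` treats a score as positive when it is greater than or equal to the threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hand-made labels and scores, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true_demo, scores_demo)

target_recall = 0.9  # hypothetical requirement

# precisions[:-1] and recalls[:-1] line up with thresholds (the last entry has no threshold)
meets_target = recalls[:-1] >= target_recall
idx = np.where(meets_target)[0].max()  # thresholds are sorted in increasing order

chosen_threshold = thresholds[idx]
chosen_precision = precisions[idx]
chosen_recall = recalls[idx]
print(chosen_threshold, chosen_precision, chosen_recall)
```

In practice the threshold should be chosen on training or validation data, not on the test set, so that the test results remain an honest estimate.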

To improve recall, try setting the decision threshold at 0.2

In [205]:
y_train_pred = (y_train_scores > 0.2)

In [None]:
train_error = pd.concat([measure_error(y_train, y_train_pred, 'train')],
                         axis=1)

train_error

In [None]:
print(confusion_matrix(y_train, y_train_pred))

As expected, there is a good improvement in recall, but at the great expense of the precision and overall accuracy (recall for non-churned customers is badly affected)

How does the shift of the decision threshold affect the test results?

In [208]:
y_test_scores = y_test_pred_prob[:, 1] 

In [209]:
y_test_pred = (y_test_scores > 0.2)

In [None]:
test_error = pd.concat([measure_error(y_test, y_test_pred, 'test')],
                        axis=1)

test_error

In [None]:
print(confusion_matrix(y_test, y_test_pred))

Conclusion? Setting the decision threshold is all about balancing the various performance metrics in line with the objectives of the use case that the machine learning model aims to solve

## 2. Decision tree regressor

* Using the same customer churn dataset, we will first build a decision tree regressor without regularization
* We will then use grid search with cross validation to build an optimized decision tree regressor
* We will measure the errors on the training and test sets using mean squared error <br>
* We will make a plot of actual versus predicted customer churn <br>

Build a decision tree regressor without regularization

In [59]:
from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg = dt_reg.fit(X_train, y_train)

In [None]:
dt_reg.tree_.node_count, dt_reg.tree_.max_depth

Seems to be the same tree as our first decision tree classifier without constraints

This is an overfitted regression tree

Measure the error on the training and test sets <br>
Since the regressor's predictions are continuous values, we will use mean squared error

In [None]:
from sklearn.metrics import mean_squared_error

y_train_pred_reg = dt_reg.predict(X_train)
y_test_pred_reg  = dt_reg.predict(X_test)

train_test_error = pd.Series({'train': mean_squared_error(y_train, y_train_pred_reg),
                              'test':  mean_squared_error(y_test, y_test_pred_reg)},
                              name='MSE').to_frame().T

train_test_error
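A note on interpreting MSE here: when the target is coded as 0/1 and the predictions are also exactly 0 or 1 (as they typically are for a fully grown tree with pure leaves), each squared error is either 0 or 1, so the MSE equals the misclassification rate. A minimal check on hand-made arrays (invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hand-made 0/1 targets and 0/1 predictions, for illustration only
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([0, 1, 1, 0, 0, 1, 0, 0])

mse = mean_squared_error(y_true_demo, y_pred_demo)
error_rate = 1 - accuracy_score(y_true_demo, y_pred_demo)

# Two of the eight predictions are wrong, so both values are 2/8
print(mse, error_rate)
```

Once the tree is regularized, its leaves are no longer pure and the predictions become fractions between 0 and 1, so this equality no longer holds.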

Like the decision tree classifier, the regressor model can tell us which features are more important

In [None]:
dt_reg.feature_importances_

Now, use grid search to find an optimized regressor model

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': range(1, dt_reg.tree_.max_depth+1, 2),
              'max_features': range(1, len(X_train.columns)+1)}

gr_dt_reg = GridSearchCV(DecisionTreeRegressor(random_state=42),
                         param_grid=param_grid,
                         scoring='neg_mean_squared_error',
                         n_jobs=-1)

gr_dt_reg = gr_dt_reg.fit(X_train, y_train)

The number of nodes and the maximum depth of the best-performing tree

In [None]:
gr_dt_reg.best_estimator_.tree_.node_count, gr_dt_reg.best_estimator_.tree_.max_depth

In [None]:
gr_dt_reg.best_params_

The error on training and test datasets by the best model <br>

In [None]:
from sklearn.metrics import mean_squared_error

y_train_pred_reg = gr_dt_reg.predict(X_train)
y_test_pred_reg  = gr_dt_reg.predict(X_test)

train_test_error = pd.Series({'train': mean_squared_error(y_train, y_train_pred_reg),
                              'test':  mean_squared_error(y_test, y_test_pred_reg)},
                              name='MSE').to_frame().T

train_test_error

We get much improved results

A plot of actual vs predicted target values

In [None]:
y_test.values

In [None]:
y_test_pred_reg

In [None]:
# Convert the boolean target to 0/1 integers
y_test = y_test.astype(np.int64)

test_predict = pd.DataFrame({'actual': y_test.values,
                             'predict': y_test_pred_reg})

# Use a single marker size: recent matplotlib versions reject an `s` list whose length differs from the number of points
test_predict.plot.scatter(x='predict', y='actual', s=60, alpha=0.05, figsize=(10,6));

In [None]:
test_predict
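Each prediction of a regression tree is the mean target value of the training instances in a leaf. With a 0/1 target, that mean is the fraction of positives in the leaf, so the outputs lie between 0 and 1 and can be thresholded to obtain class labels. The sketch below illustrates this on synthetic data (the dataset, `max_depth=3` and the 0.5 cutoff are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary target, for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Each score is the mean of the 0/1 targets in the leaf that the instance falls into
scores = reg.predict(X_demo)

# Turn the scores into class labels with a 0.5 cutoff
labels = (scores >= 0.5).astype(int)

print(np.unique(scores))
print(np.bincount(labels))
```

A tree of depth 3 has at most 8 leaves, which is why only a handful of distinct score values appear.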

#### References
[1] Intel AI Academy, Machine Learning 501. <br>
[2] A. Géron (2017), Hands-On Machine Learning with Scikit-Learn and TensorFlow (O’Reilly).