# **Lab 5A: Classification Trees**

**WHAT** This optional lab consists of several programming and insight exercises and questions.

**WHY** The exercises are meant to give you some experience fitting classification trees.

**HOW** Follow the exercises in this notebook either on your own or with a fellow student. Work your way through these exercises at your own pace and be sure to ask the TAs questions when you don't understand something.

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1}.}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}.}$

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz, export_text
from IPython.display import Image
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.metrics import roc_curve, auc, average_precision_score

## 1. Loading the Data

In [None]:
train = pd.read_excel('lendingclub_traindata.xlsx')
validation = pd.read_excel('lendingclub_valdata.xlsx')
test = pd.read_excel('lendingclub_testdata.xlsx')
# 1 = good, 0 = default
print(train.head())
print("----------------------")
print(validation.head())
print("----------------------")
print(test.head())

In [None]:
# remove target column to create feature only dataset
X_train = train.drop('loan_status',axis=1)
X_val=validation.drop('loan_status',axis=1)
X_test=test.drop('loan_status',axis=1)

# store target column
y_train = train['loan_status']
y_val=validation['loan_status']
y_test=test['loan_status']

colnames = list(X_train.columns)
print("Features:", *colnames, sep="  ")
print("Matrix dimensions: ", X_train.shape, y_train.shape, X_val.shape,y_val.shape,X_test.shape,y_test.shape)

## 2. Fitting Decision Trees, finding a good one

In [None]:
clf = DecisionTreeClassifier(criterion='entropy',max_depth=4,min_samples_split=1000,min_samples_leaf=200,random_state=0)
clf = clf.fit(X_train,y_train)
fig, ax = plt.subplots(figsize=(40, 30))
plot_tree(clf, filled=True, feature_names=colnames, proportion=True)
plt.show()

In [None]:
def neg_avg_loglik(y, probs):
    # Preconditions:
    # y: ndarray of 0/1
    # probs: two-column ndarray, rows contain [P(Y=0),P(Y=1)]
    return np.average(-np.log(y * probs[:,1] + (1-y) * probs[:,0]))
    

# prob_train, prob_val, and prob_test are the predicted probabilities for the training,
# validation, and test set, using the fitted decision tree

prob_train = clf.predict_proba(X_train)
prob_val = clf.predict_proba(X_val)
prob_test = clf.predict_proba(X_test)

print(prob_train)

# Calculate the negative of the average loglikelihood for training, validation, and test set

cost_func_train_minimum = neg_avg_loglik(y_train, prob_train)
cost_func_val = neg_avg_loglik(y_val, prob_val)
cost_func_test = neg_avg_loglik(y_test, prob_test)

print('\nCost function value overview:')
print(f'        value for training set   = {cost_func_train_minimum:10.8f}')
print("From this trained model:")
print(f'        value for validation set = {cost_func_val:10.8f}')
print(f'              value for test set = {cost_func_test:10.8f}')
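
As a small check of how this cost function behaves, the sketch below evaluates the same formula on a hand-made toy example. The arrays `y_toy`, `probs_toy`, and `probs_pure` and the helper `neg_avg_loglik_clipped` are illustrative assumptions, not part of the lab. The clipped variant is one possible way to handle a pure leaf, where a tree can predict a probability of exactly 0 for the observed class and the logarithm becomes infinite.

```python
import numpy as np

def neg_avg_loglik(y, probs):
    # Same formula as in the cell above: average of -log P(observed class)
    return np.average(-np.log(y * probs[:, 1] + (1 - y) * probs[:, 0]))

def neg_avg_loglik_clipped(y, probs, eps=1e-12):
    # Hypothetical variant: keep the probability of the observed class
    # away from 0 so that the logarithm stays finite
    p_obs = y * probs[:, 1] + (1 - y) * probs[:, 0]
    return np.average(-np.log(np.clip(p_obs, eps, 1.0)))

# Toy data (assumed for illustration): rows of probs are [P(Y=0), P(Y=1)]
y_toy = np.array([1, 0, 1, 1])
probs_toy = np.array([[0.2, 0.8],
                      [0.7, 0.3],
                      [0.5, 0.5],
                      [0.1, 0.9]])

# Average of -log(0.8), -log(0.7), -log(0.5), -log(0.9)
toy_cost = neg_avg_loglik(y_toy, probs_toy)
print(f"toy cost = {toy_cost:.4f}")

# A pure leaf that predicts P(Y=1) = 0 for an observation with y = 1
y_pure = np.array([1])
probs_pure = np.array([[1.0, 0.0]])
with np.errstate(divide="ignore"):
    unclipped = neg_avg_loglik(y_pure, probs_pure)
clipped = neg_avg_loglik_clipped(y_pure, probs_pure)
print("unclipped:", unclipped, " clipped:", clipped)
```

If the validation cost of a deep tree shows up as `inf`, a situation like the pure-leaf case above is a likely cause. Larger values of `min_samples_leaf` make it less likely.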

<div style="background-color:#c2eafa">
    
$\q{1}$  Fit at least five more trees by varying arguments in the `DecisionTreeClassifier`-call:
1. criterion (try 'gini' at least once),
2. max_depth,
3. min_samples_split,
4. min_samples_leaf.

Use the negative average log-likelihood criterion and the validation set to choose the best tree. Make a table with the four argument values in columns 1 to 4, supplemented with cost function scores for the training and validation sets.

You might use the test set to determine a score for the best model.
Assign the best model to `clf` below, so the rest of the code produces what you want.

In [None]:
# START ANSWER
# END ANSWER

## 3. Evaluating the best tree and finding the most profitable threshold

In [None]:
THRESHOLD = [.75, .80, .85]
results = pd.DataFrame(columns=["THRESHOLD", "accuracy", "true pos rate", "true neg rate", "false pos rate", "precision", "f-score"]) # df to store results
results['THRESHOLD'] = THRESHOLD                                                                           # threshold column
Q = clf.predict_proba(X_test)[:,1]

j = 0                                                                                                      
for i in THRESHOLD:                                                                                        # iterate over each threshold        
    preds = np.where(Q>i, 1, 0)                  # predict 1 (good) if P(Y=1) > threshold, else 0
    
    cm = (confusion_matrix(y_test, preds,labels=[1, 0], sample_weight=None)/X_test.shape[0])*100 
    # confusion matrix (in percentage)
    
    print('Confusion matrix for threshold =',i)
    print(cm)
    print(' ')      
    
    TP = cm[0][0]                                                                                          # True Positives
    FN = cm[0][1]                                                                                          # False Negatives
    FP = cm[1][0]                                                                                          # False Positives
    TN = cm[1][1]                                                                                          # True Negatives
        
    results.iloc[j,1] = accuracy_score(y_test, preds) 
    results.iloc[j,2] = recall_score(y_test, preds)
    results.iloc[j,3] = TN/(FP+TN)                                                                         # True negative rate
    results.iloc[j,4] = FP/(FP+TN)                                                                         # False positive rate
    results.iloc[j,5] = precision_score(y_test, preds)
    results.iloc[j,6] = f1_score(y_test, preds)
   
    j += 1

print('ALL METRICS')
# print(results.T.to_string(header=False))
results.T
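
To make the layout of the confusion matrix concrete, here is a minimal sketch on hand-made data. The arrays `y_toy` and `q_toy` and the helper `rates` are assumptions for illustration. The sketch counts the four cells directly with NumPy, using the same convention as the cell above: class 1 ("good") is the positive class, and a loan is predicted good when its probability exceeds the threshold.

```python
import numpy as np

# Toy labels (1 = good, 0 = default) and predicted P(Y=1), assumed for illustration
y_toy = np.array([1, 1, 1, 1, 0, 0, 1, 0])
q_toy = np.array([0.95, 0.90, 0.82, 0.70, 0.88, 0.60, 0.78, 0.79])

def rates(y, q, threshold):
    """Return (TPR, FPR) when predicting 1 for q > threshold."""
    preds = np.where(q > threshold, 1, 0)
    tp = np.sum((y == 1) & (preds == 1))  # good loans predicted good
    fn = np.sum((y == 1) & (preds == 0))  # good loans predicted default
    fp = np.sum((y == 0) & (preds == 1))  # defaults predicted good
    tn = np.sum((y == 0) & (preds == 0))  # defaults predicted default
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

for t in [0.75, 0.80, 0.85]:
    tpr, fpr = rates(y_toy, q_toy, t)
    print(f"threshold {t:.2f}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

Raising the threshold can only move loans from "predicted good" to "predicted default", so both rates are non-increasing in the threshold.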

In [None]:
# Compute the ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, Q)
roc_auc = auc(fpr,tpr)

plt.figure(figsize=(8,6))      # format the plot size
lw = 1.5
plt.plot(fpr, tpr, color='darkorange', marker='.',
         lw=lw, label='Decision Tree (AUC = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--',
         label='Random Prediction (AUC = 0.5)' )
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')
plt.legend(loc="lower right")
plt.show()

<div style="background-color:#c2eafa">
    
$\q{2}$ Suppose the population of borrowers actually consists of 90% "good" and 10% "bad" borrowers. Suppose the profit from a good loan is $V$ and the loss from a bad loan is $4V$. Look at $\S 3.11$ and explain why the criterion to optimize is:

$$V \times \mathrm{TPR}(Z) \times 0.9 - 4V \times \mathrm{FPR}(Z) \times 0.1.$$
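
Once the true and false positive rates at a given threshold are known, the criterion can be evaluated directly. The sketch below is one way to write it as a function. The name `expected_profit_per_loan`, its parameters, and the rates in `candidates` are hypothetical and chosen only for illustration; they are not results from the lab data.

```python
def expected_profit_per_loan(tpr, fpr, V=1.0, p_good=0.9, loss_factor=4.0):
    """Evaluate V * TPR * p_good - loss_factor * V * FPR * (1 - p_good)."""
    p_bad = 1.0 - p_good
    return V * tpr * p_good - loss_factor * V * fpr * p_bad

# Hypothetical (TPR, FPR) pairs per threshold, NOT taken from the lab data
candidates = {
    0.75: (0.80, 0.50),
    0.85: (0.40, 0.10),
}

for threshold, (tpr, fpr) in candidates.items():
    profit = expected_profit_per_loan(tpr, fpr)
    print(f"threshold {threshold:.2f}: expected profit per loan = {profit:.3f} V")
```

With `V=1.0` the result is expressed in units of $V$, which is enough to compare thresholds because $V$ scales every candidate equally.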

<div style="background-color:#ffa500">
    
Write your answer in this colored box:

[//]: # (START ANSWER)
[//]: # (END ANSWER)

<div style="background-color:#c2eafa">
    
$\q{3}$  Which of the three threshold values results in the highest expected profit per loan evaluated? And how much is that?

In [None]:
# START ANSWER
# END ANSWER