# 0 Imports and helper functions

In [0]:
import sklearn
import pandas as pd
from sklearn import tree
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import time
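import sklearn.metrics  # used below as sklearn.metrics.roc_auc_score and sklearn.metrics.roc_curve
import sklearn.utils  # used below as sklearn.utils.shuffle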
import json

%matplotlib inline

# 1 Loading the data

In this notebook we build a decision tree binary classifier to predict whether or not a flight will be delayed. It is based on the same dataset as notebook 4, where we built a binary classifier with logistic regression. All data preprocessing steps are the same, so we directly import the processed dataset containing the joined flight, delay and weather information.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
delays = pd.read_csv("/content/drive/My Drive/xylosai/flightDelay/FlightDelaysData_clean.csv",header=0,index_col = 0)
delays.head()

In [0]:
delays.info(memory_usage='deep',max_cols=0)

In [0]:
# downcast all int64 columns to the smallest integer type that fits their values (saves memory)
int_columns = delays.dtypes[delays.dtypes == 'int64'].index.tolist()
counter = 0
for columnname in int_columns:
  counter += 1
  if (counter%5 == 0):
    print("downcasting column {0} of {1}".format(counter,len(int_columns)))
  delays[columnname] = pd.to_numeric(delays[columnname],downcast="integer")
  

In [0]:
delays.info(memory_usage='deep',max_cols=0)
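
The effect of the downcasting loop can be checked on a small frame. The sketch below is only an illustration: the frame `toy` and its column names are made up and are not part of the flight dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are invented for this illustration.
toy = pd.DataFrame({
    "small_values": np.arange(100, dtype="int64"),         # values 0..99
    "large_values": np.arange(100, dtype="int64") * 1000,  # values 0..99000
})
bytes_before = toy.memory_usage(deep=True).sum()

# Same call as in the loop above: pick the smallest integer dtype that fits.
for column in toy.columns:
    toy[column] = pd.to_numeric(toy[column], downcast="integer")

bytes_after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print("bytes before: {0}, after: {1}".format(bytes_before, bytes_after))
```

Each column gets its own dtype, so a column with a small value range shrinks more than one with large values.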

# 2 Decision tree classifier

## 2.1 Building a first model

With a decision tree classifier, we can learn non-linear decision boundaries.
We keep the wind direction input. As a reminder, let's look at our final dataset.
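
To make the claim about non-linear decision boundaries concrete, here is a minimal sketch on synthetic, noise-free data (not the flight dataset). The label follows an XOR-like pattern that no single straight line can separate. All variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_toy = rng.uniform(-1, 1, size=(2000, 2))
# Label is 1 when both coordinates have the same sign (XOR-like pattern).
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# A linear model can only draw one straight boundary through the plane.
linear_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
# A tree combines several axis-aligned splits into a non-linear boundary.
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print("logistic regression accuracy: {0:.2f}".format(linear_acc))
print("decision tree accuracy: {0:.2f}".format(tree_acc))
```

On this toy problem the linear model should stay close to chance level, while the tree should recover the pattern almost perfectly.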

In [0]:
delays.columns

In [0]:
Y = delays["ArrDel15"]
del delays["ArrDel15"]
X = delays



In [0]:
X_traindev, X_test, Y_traindev, Y_test = train_test_split(X,Y,random_state=0,test_size = 0.10)

del X
del Y


In [0]:
# balancing the classes in the train/dev set, memory-optimized

l_1 = np.sum(Y_traindev == 1)


X_traindev = pd.concat([X_traindev[Y_traindev == 1], X_traindev[Y_traindev == 0][0:l_1]],axis=0)
Y_traindev = pd.concat([Y_traindev[Y_traindev == 1], Y_traindev[Y_traindev == 0][0:l_1]],axis=0)

X_traindev, Y_traindev = sklearn.utils.shuffle(X_traindev, Y_traindev, random_state=0)



In [0]:
# category count in the train/dev set after balancing. Should be equal.
print("not delayed: {0}".format(np.sum(Y_traindev == 0)))
print("delayed: {0}".format(np.sum(Y_traindev == 1)))
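
The balancing step is a form of undersampling: every delayed row is kept, and only as many non-delayed rows are retained. The sketch below replays the same three calls on a small made-up dataset (`X_toy` and `Y_toy` are hypothetical) so the result is easy to inspect. As in the notebook, this would be applied to the train/dev data only; the test set keeps its original class distribution.

```python
import numpy as np
import pandas as pd
import sklearn.utils

# Hypothetical imbalanced toy data: 20 delayed (1) and 80 not delayed (0).
X_toy = pd.DataFrame({"feature": np.arange(100)})
Y_toy = pd.Series([1] * 20 + [0] * 80)

n_pos = int(np.sum(Y_toy == 1))

# Keep every positive row and only the first n_pos negative rows.
X_bal = pd.concat([X_toy[Y_toy == 1], X_toy[Y_toy == 0][0:n_pos]], axis=0)
Y_bal = pd.concat([Y_toy[Y_toy == 1], Y_toy[Y_toy == 0][0:n_pos]], axis=0)

# Shuffle both objects with the same permutation so rows and labels stay aligned.
X_bal, Y_bal = sklearn.utils.shuffle(X_bal, Y_bal, random_state=0)

print(Y_bal.value_counts())
```

Note that slicing with `[0:n_pos]` keeps the first negative rows in their current order, which is a random subset only because `train_test_split` has already shuffled the data.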


In [0]:
X_train, X_dev, Y_train, Y_dev = train_test_split(X_traindev,Y_traindev,random_state=0,test_size = 0.10)

del X_traindev
del Y_traindev

In [0]:
clf = tree.DecisionTreeClassifier(random_state=0)

In [0]:
start = time.time()
clf.fit(X_train, Y_train)
end = time.time()

seconds = end-start
print("Fitting took {0} seconds".format(seconds))

## 2.2 Evaluating our model

Let's first look at the predicted probabilities for our test set... Something's odd...

In [0]:
Y_test_prob = clf.predict_proba(X_test)
Y_test_prob



In [0]:
clf.tree_.node_count #that's a lot of nodes!

In [0]:
auc = sklearn.metrics.roc_auc_score(Y_test,Y_test_prob[:,1])
print(auc)

In [0]:
fpr_test, tpr_test, thresholds = sklearn.metrics.roc_curve(Y_test, Y_test_prob[:,1])

In [0]:
def plotROC(fpr, tpr):
  fig = plt.figure(figsize = (10,10))
  plt.xlabel("false positive rate (FPR)",fontsize = 15)
  plt.ylabel("true positive rate (TPR)",fontsize = 15)
  plt.title("ROC curve",fontsize=20)
  plt.plot(fpr, tpr,"b",fpr, fpr, "k:")
  plt.legend(("ROC curve","baseline"),fontsize=15)
  plt.show()

In [0]:
plotROC(fpr_test, tpr_test)

In [0]:
Y_train_prob = clf.predict_proba(X_train)

auc = sklearn.metrics.roc_auc_score(Y_train,Y_train_prob[:,1])
print(auc)

fpr_train, tpr_train, thresholds = sklearn.metrics.roc_curve(Y_train, Y_train_prob[:,1])

plotROC(fpr_train, tpr_train)

**What's wrong here?**

## 2.3 Regularization


In the code above, we have not limited the growth of our tree in any way. As a result, the tree kept splitting until every leaf had zero impurity. Such a tree is overfitted to the training data: it fits the training set perfectly but generalizes poorly to unseen data.
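
The same symptoms can be reproduced on synthetic data. The sketch below is an illustration under assumed settings, not the notebook's experiment: it uses `make_classification` with deliberate label noise (`flip_y`) and fits an unrestricted tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data; flip_y randomly flips 20% of the labels.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# No max_depth and no minimum leaf size: the tree grows until all leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_prob = full_tree.predict_proba(X_tr)[:, 1]
test_prob = full_tree.predict_proba(X_te)[:, 1]
train_auc = roc_auc_score(y_tr, train_prob)
test_auc = roc_auc_score(y_te, test_prob)

print("distinct predicted probabilities:", np.unique(test_prob))
print("train AUC: {0:.3f}, test AUC: {1:.3f}".format(train_auc, test_auc))
```

Because every leaf is pure, the predicted probabilities are all 0 or 1. The training AUC is perfect even though a fifth of the labels are noise, and the test AUC is clearly lower.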

For decision trees, the growth of the tree must be limited to avoid overfitting. This can be done in multiple ways. Below, we will limit the tree growth by:

1. Limiting the maximum depth of the tree
2. Setting a minimum number of samples that every leaf must contain, so that a node is not split when this would create a smaller leaf

This introduces two hyperparameters:

1. The maximum depth (`max_depth`)
2. The minimum number of samples per leaf (`min_samples_leaf`)

We will use both techniques at the same time with fixed values for the hyperparameters. 
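
scikit-learn offers two related size limits: `min_samples_split` (samples a node needs before it may be split) and `min_samples_leaf` (samples every resulting leaf must contain). The cells in this notebook pass `min_samples_leaf`. A minimal sketch on synthetic data, with arbitrarily chosen values, shows that the fitted tree respects both limits:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the limits 5 and 50 are arbitrary values for illustration.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10, random_state=0)

limited_tree = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_leaf=50)
limited_tree.fit(X_toy, y_toy)

fitted = limited_tree.tree_
is_leaf = fitted.children_left == -1  # scikit-learn marks leaves with child id -1
tree_depth = limited_tree.get_depth()
smallest_leaf = fitted.n_node_samples[is_leaf].min()

print("depth: {0}".format(tree_depth))
print("smallest leaf holds {0} samples".format(smallest_leaf))
```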

In a more advanced scenario, one could try different combinations of both hyperparameters to find the best-performing set. In section 2.4 we plot the results for different values of each hyperparameter. Since this code requires some time to complete, the resulting figures are included in the notebook so you don't have to run the code again.
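
A search over combinations could look like the sketch below. It is an assumption-laden illustration: the data is synthetic, the grids `depth_grid` and `leaf_grid` are hypothetical, and the result is only the best point on this grid, not a guaranteed global optimum. Each combination is scored on a held-out dev split, as in the rest of this notebook.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=3000, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_dv, y_tr, y_dv = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Hypothetical grids; None means no depth limit.
depth_grid = [2, 5, 10, None]
leaf_grid = [1, 20, 100]

grid_results = {}
for depth, min_leaf in product(depth_grid, leaf_grid):
    model = DecisionTreeClassifier(random_state=0, max_depth=depth, min_samples_leaf=min_leaf)
    model.fit(X_tr, y_tr)
    dev_prob = model.predict_proba(X_dv)[:, 1]
    grid_results[(depth, min_leaf)] = roc_auc_score(y_dv, dev_prob)

# Pick the combination with the highest dev AUC.
best_combo = max(grid_results, key=grid_results.get)
print("best (max_depth, min_samples_leaf): {0}".format(best_combo))
print("dev AUC: {0:.3f}".format(grid_results[best_combo]))
```

The number of fits is the product of the grid sizes, which is why a joint search gets expensive quickly on a dataset as large as the flight data.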



In [0]:
depth = 10
split_amount = 300

clf = tree.DecisionTreeClassifier(random_state=0, max_depth = depth, min_samples_leaf = split_amount)
clf.fit(X_train, Y_train)
Y_test_prob = clf.predict_proba(X_test)
auc = sklearn.metrics.roc_auc_score(Y_test,Y_test_prob[:,1])
print("AUC of this regularized tree: {0}".format(auc))

## 2.4 Hyperparameter search

**NOTE:** Hyperparameter tuning involves the DEV (validation) set. 

### Hyperparameter search: limiting the maximum tree depth



In [0]:
depths = [1,3,5,8,10,20,30,40]

auc_results = []

for depth in depths:
  print("fitting with depth = {}...".format(depth))
  clf = tree.DecisionTreeClassifier(random_state=0, max_depth = depth)
  clf.fit(X_train, Y_train)
  Y_dev_prob = clf.predict_proba(X_dev)
  auc = sklearn.metrics.roc_auc_score(Y_dev,Y_dev_prob[:,1])
  auc_results.append(auc)

**View the result** [here](https://drive.google.com/open?id=19VYv95yLx9beSdoxyn5NenC7p2Uqmox1)



In [0]:
fig = plt.figure(figsize = (10,10))
plt.xlabel("maximum tree depth",fontsize = 15)
plt.ylabel("Area Under the Curve (AUC)",fontsize = 15)
plt.title("Limiting tree growth (regularization)",fontsize=20)
plt.plot(depths, auc_results, "b")
plt.show()

It appears that the best validation AUC is reached at a depth of around 10. Let's evaluate the TEST AUC for this depth.

**Question:** How will the AUC versus tree depth look when evaluating on the **training set**?

In [0]:
clf = tree.DecisionTreeClassifier(random_state=0, max_depth = 10)
clf.fit(X_train, Y_train)

Y_test_prob = clf.predict_proba(X_test)

auc = sklearn.metrics.roc_auc_score(Y_test,Y_test_prob[:,1])

print("test AUC: {0}".format(auc))

### Hyperparameter search: minimum number of samples per leaf

In [0]:
split_amounts = [2,300,500,750,1500]

auc_results = []

for split_amount in split_amounts:
  print("fitting with minimum {} samples per leaf...".format(split_amount))
  clf = tree.DecisionTreeClassifier(random_state=0, min_samples_leaf = split_amount)
  clf.fit(X_train, Y_train)
  Y_dev_prob = clf.predict_proba(X_dev)
  auc = sklearn.metrics.roc_auc_score(Y_dev,Y_dev_prob[:,1])
  auc_results.append(auc)

**View the result** [here](https://drive.google.com/open?id=1nZV9R8kGMq3_c_VEIWtObSh9yi4kUzDd)

**Exercise:** What is the best minimum number of samples per leaf? Find the TEST AUC for this model.



In [0]:
fig = plt.figure(figsize = (10,10))
plt.xlabel("minimum samples per leaf",fontsize = 15)
plt.ylabel("Area Under the Curve (AUC)",fontsize = 15)
plt.title("Limiting tree growth (regularization)",fontsize=20)
plt.plot(split_amounts, auc_results,"b")
plt.show()

# 3 Boosted tree

We will train a model of boosted decision stumps. The base classifier is a tree with depth = 1.
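
A decision stump makes a single split, so on its own it is a very weak classifier. Boosting combines many of them into a stronger one. The sketch below compares the two on synthetic data; the dataset and all variable names are assumptions for illustration and say nothing about the results on the flight data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# A single stump: one root node and two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
print("nodes in a stump: {0}".format(stump.tree_.node_count))

# Up to 50 stumps, each fitted on a reweighted version of the training set.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
boosted.fit(X_tr, y_tr)

stump_auc = roc_auc_score(y_te, stump.predict_proba(X_te)[:, 1])
boosted_auc = roc_auc_score(y_te, boosted.predict_proba(X_te)[:, 1])
print("single stump AUC: {0:.3f}".format(stump_auc))
print("boosted stumps AUC: {0:.3f}".format(boosted_auc))
```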

In [0]:
from sklearn.ensemble import AdaBoostClassifier

In [0]:
base_classifier = tree.DecisionTreeClassifier(random_state=0, max_depth = 1)
# n_estimators is keyword-only in recent scikit-learn releases, so pass it by name
adaboost_classifier = AdaBoostClassifier(base_classifier, n_estimators=50)

In [0]:
adaboost_classifier.fit(X_train, Y_train)

In [0]:
Y_test_prob = adaboost_classifier.predict_proba(X_test)

auc = sklearn.metrics.roc_auc_score(Y_test,Y_test_prob[:,1])

print("test AUC: {0}".format(auc))

We will leave the hyperparameter tuning of T, the number of boosting rounds (`n_estimators`, set to 50 above), as an exercise.