**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook, using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from class_utils import show_tree

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("titanic.zip"), directory="data/titanic")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Decision Trees for Classification

We will now show how to apply a decision tree classifier to the [Titanic](https://www.kaggle.com/c/titanic) dataset. Given that we have already explored this dataset in a previous example and we already know how to preprocess it, we will not repeat the exercise. The code necessary to load and preprocess the data is in the next cell and it is hidden for conciseness.



In [None]:
#@title -- Preprocessing the Data -- { display-mode: "form" }
df = pd.read_csv("data/titanic/train.csv")
df_train, df_test = train_test_split(df, test_size=0.25,
                     stratify=df["Survived"], random_state=4)

categorical_inputs = ["Pclass", "Sex", "Embarked"]
numeric_inputs = ["Age", "SibSp", 'Parch', 'Fare']
class_names = ["died", "survived"]

output = "Survived"

input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = df_train[output].values.reshape(-1)

X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = df_test[output].values.reshape(-1)

### Training

The only thing we need to change with respect to the previous example is to use a `DecisionTreeClassifier` instead of the `KNeighborsClassifier`. The rest of the code can stay the same.



In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)

### Testing

The code to test the model can be copied verbatim as well.



In [None]:
y_test = model.predict(X_test)

cm = pd.crosstab(Y_test, y_test,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm, "\n")

acc = accuracy_score(Y_test, y_test)
print("Accuracy = {}".format(acc))
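
As a side note, the accuracy can be read directly off the confusion matrix: correct predictions lie on the diagonal, so the accuracy is the sum of the diagonal divided by the total count. The sketch below illustrates this on a handful of made-up labels (hypothetical values, not Titanic results); the names `actual`, `predicted` and `cm_demo` are introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, made up purely for illustration
# (0 = died, 1 = survived); these are NOT results from the Titanic data.
actual = np.array([0, 0, 0, 1, 1, 1, 0, 1])
predicted = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Rows correspond to actual classes, columns to predicted classes.
cm_demo = confusion_matrix(actual, predicted)
print(cm_demo)

# Correct predictions sit on the diagonal: 6 of the 8 samples here.
acc_from_cm = np.trace(cm_demo) / cm_demo.sum()
acc_direct = accuracy_score(actual, predicted)
print(acc_from_cm, acc_direct)
```

Both numbers agree, which is a quick sanity check that the table and the accuracy describe the same predictions.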

The accuracy achieved by our decision tree classifier is not great. In fact, chances are that it is lower than the accuracy achieved in our KNN example. This is suspicious and may mean that our model has overfitted. To check whether this is the case, we should also evaluate the model on the training set. If the training accuracy is much higher than the test accuracy, that indicates overfitting, and we will need to modify the hyperparameters of the decision tree to decrease its capacity and obtain a model that generalizes better.



In [None]:
y_train = model.predict(X_train)

acc_train = accuracy_score(Y_train, y_train)
print("Accuracy = {}".format(acc_train))

As it turns out, the accuracy on the training set is indeed much higher, which indicates overfitting. We can also visualize the resulting tree to examine how complex it is. We will use an auxiliary function `show_tree`. The tree is likely to be quite complex and difficult to read.



In [None]:
show_tree(model,
  feature_names=categorical_inputs+numeric_inputs,
  class_names=class_names)
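
Besides plotting, the complexity of a fitted tree can also be summarized numerically. The following is a minimal sketch that uses a synthetic dataset from `make_classification` as a stand-in (an assumption made so that the snippet is self-contained; it is not the Titanic data), so the numbers it prints will differ from those of the model above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: NOT the Titanic set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4,
    flip_y=0.1, random_state=0)

# An unrestricted tree keeps splitting until every leaf is pure.
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

print("depth:", tree_demo.get_depth())
print("leaves:", tree_demo.get_n_leaves())
print("nodes:", tree_demo.tree_.node_count)
print("training accuracy:", tree_demo.score(X_demo, y_demo))
```

The same three calls (`get_depth`, `get_n_leaves`, `tree_.node_count`) can be applied to the `model` fitted in this notebook. A large number of leaves relative to the number of training samples is another hint that the tree has memorized the training set.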

### Tuning the Hyperparameters to Do More Pruning

As the next step we will show how to tune hyperparameters of the decision tree to make it simpler and to prevent it from overfitting. This is achieved using pruning, which can come in two different flavors:

* **pre-pruning**: as the tree is grown, new branches are prevented from being formed unless some pre-specified criteria are met;
* **post-pruning**: the tree is grown fully and branches are removed from it afterwards.

In this example we will only be using pre-pruning and its parameters will be specified in the constructor of the `DecisionTreeClassifier`.
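
To make the two flavors concrete, here is a minimal sketch on synthetic stand-in data (an assumption; not the Titanic set). Pre-pruning is expressed through constructor constraints such as `max_depth` and `min_samples_leaf`. For post-pruning, `scikit-learn` offers minimal cost-complexity pruning through the `ccp_alpha` parameter. The specific values used here are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: NOT the Titanic set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4,
    flip_y=0.1, random_state=0)

# No pruning: the tree grows until all leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Pre-pruning: growth is stopped early by constraints (arbitrary values).
pre = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, random_state=0).fit(X_demo, y_demo)

# Post-pruning: the tree is grown and then pruned back using
# minimal cost-complexity pruning (arbitrary alpha).
post = DecisionTreeClassifier(
    ccp_alpha=0.01, random_state=0).fit(X_demo, y_demo)

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", tree.get_n_leaves(), "depth:", tree.get_depth())
```

Both pruned trees should come out no larger than the unrestricted one; how much smaller depends on the data and on the chosen values.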

#### Using Cross-Validation

When tuning the hyperparameters, we will need a way to determine which parameters work. We cannot test each different setting on the testing set: recall that we are only allowed to use it once – to test the final model.

We basically have 2 options:

* To split the dataset into 3 parts: the training set, the validation set and the testing set (the validation set would be used to tune the hyperparameters and the testing set would be used at the end to verify that the final model generalizes).
* To use cross-validation: The training set would be split into $k$ folds and then the model would be trained on $k-1$ folds and tested on the remaining fold. This would be repeated $k$ times, so that each fold serves as the held-out fold exactly once, and the results would be averaged.
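
The first option can be sketched by calling `train_test_split` twice. The toy arrays and the 60/20/20 proportions below are assumptions chosen only for illustration; they are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples, 2 features, balanced binary labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Step 1: hold out 20 % of the data as the testing set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Step 2: split the remaining 80 samples into training and validation sets;
# 25 % of the remainder equals 20 % of the original data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```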

Since decision trees are cheap to train and our dataset is not too large, we will use cross-validation in this example. Let us have a look at how it is applied in `scikit-learn`. We will use the `sklearn.model_selection.cross_validate` function and specify `cv=5`, which means that there will be $k = 5$ folds. The function returns a dictionary whose `test_score` entry holds the score on each fold; for a classifier with the default scoring, this is the accuracy. We will compute the mean of these scores and use it as an indicator of how well our model is doing.



In [None]:
cross_validate(model, X_train, Y_train, cv=5)['test_score'].mean()
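
To see what happens behind that single call, the loop over folds can be written out by hand. The sketch below uses synthetic stand-in data (an assumption; not the Titanic set) and relies on the fact that, for a classifier and an integer `cv`, `cross_validate` uses stratified folds without shuffling. The tree is given a fixed `random_state` so that both variants are deterministic and comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Titanic set).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# Manual 5-fold cross-validation: train on 4 folds, score on the held-out fold.
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = DecisionTreeClassifier(random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

# The library call does the same thing in one line.
library_scores = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)["test_score"]

print(np.mean(manual_scores), library_scores.mean())
```

Note that a fresh model is fitted for every fold, so no information from the held-out fold leaks into training.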

#### Changing the Hyperparameters

Now let's do some actual hyperparameter tuning. To see what we can change when constructing the decision tree, we will have a look at its documentation.



In [None]:
print(DecisionTreeClassifier.__doc__)

The minimum number of samples for a leaf (`min_samples_leaf`) seems like a good candidate: if we make a prediction based on a very small number of samples, that prediction is unlikely to be representative. You can, of course, also experiment with other parameters, such as the maximum depth of the tree (`max_depth`).

---
#### Task 1: Tune `min_samples_leaf`

**In the cell below, experiment with different values of `min_samples_leaf` and try to maximize the cross-validation accuracy. Also observe the effect of the hyperparameter on the structure of the tree.**

---


In [None]:
model = DecisionTreeClassifier(
    
    
    min_samples_leaf=    # ------
    
    
)

acc = cross_validate(model, X_train, Y_train, cv=5)['test_score'].mean()
print("Cross-validation accuracy = {}".format(acc))

# we need to fit the model before we plot it
model.fit(X_train, Y_train)
show_tree(model,
  feature_names=categorical_inputs+numeric_inputs,
  class_names=class_names)
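
Once you have tried a few values by hand, the same search can be automated. The sketch below uses `GridSearchCV` on synthetic stand-in data (an assumption; not the Titanic set), so the best value it reports says nothing about the best value for this task. The candidate values in `param_grid` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: NOT the Titanic set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4,
    flip_y=0.1, random_state=0)

# Arbitrary candidate values, chosen only for illustration.
param_grid = {"min_samples_leaf": [1, 2, 5, 10, 20, 50]}

# Each candidate is scored by 5-fold cross-validation accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

for value, score in zip(search.cv_results_["param_min_samples_leaf"],
                        search.cv_results_["mean_test_score"]):
    print("min_samples_leaf =", value, "-> mean CV accuracy =", round(score, 3))
print("best:", search.best_params_)
```

By default, `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.best_estimator_` is ready for a final evaluation.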

### Testing the Tuned Tree

Having tuned the hyperparameters, we can now verify how well our final model actually generalizes. We re-train the model with the best hyperparameters on the entire training set and evaluate the accuracy on the testing set. If the tuning reduced overfitting, the test accuracy should now be noticeably higher than that of the unpruned tree.



In [None]:
model.fit(X_train, Y_train)

In [None]:
y_test = model.predict(X_test)
accuracy_score(Y_test, y_test)