<a href="https://colab.research.google.com/github/nicolez9911/colab/blob/main/AdvML_L6S1_N1_Random_Forests_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees Exercises

This notebook focuses on decision tree exercises.
Two main exercises are in focus:

1. Exploring the relationship between:
   * Tree depth
   * Training & Test Performance

2. Applying Discretization (Binning) To The Samples

In [None]:
import sys
import os
# add library module to PYTHONPATH
sys.path.append(f"{os.getcwd()}/..")

import sklearn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from dtreeviz.trees import *

import graphviz
import pandas as pd

random_state = 1234

In [None]:
%load_ext autoreload
# Reload all modules (except those excluded by %aimport) every time before executing code.
%autoreload 2


%matplotlib inline

# Load data

We will use the Titanic dataset as a basis for testing Decision Trees.

The `Titanic` dataset consists of two elements:
* Original passenger data
* Survival as the target variable

The passenger data contains the following columns:

* pclass: A proxy for socio-economic status (SES)
   * 1st = Upper
   * 2nd = Middle
   * 3rd = Lower

* age: Age is fractional if less than 1. If the age is estimated, it is given in the form xx.5

* sibsp: The dataset defines family relations in this way...
    * Sibling = brother, sister, stepbrother, stepsister
    * Spouse = husband, wife (mistresses and fiancés were ignored)

* parch: The dataset defines family relations in this way...
    * Parent = mother, father
    * Child = daughter, son, stepdaughter, stepson
    * Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
dataframe = pd.read_csv("./titanic_dataset/titanic.csv")

## Feature Engineering

Before we can start working with the dataframe, we have to fix the missing values.
* `fillna()` with `inplace=True` replaces those values in place

In [None]:
# Fill missing values for Age
dataframe.fillna({"Age":dataframe.Age.mean()}, inplace=True)
# Encode categorical variables
# The `astype("category").cat.codes` call encodes a categorical label in numerical form
dataframe["Sex_label"] = dataframe.Sex.astype("category").cat.codes
dataframe["Cabin_label"] = dataframe.Cabin.astype("category").cat.codes
dataframe["Embarked_label"] = dataframe.Embarked.astype("category").cat.codes

features = ["Pclass", "Age", "Fare", "Sex_label", "Cabin_label", "Embarked_label"]
target = "Survived"

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataframe[features], dataframe[target], test_size=0.2, random_state=42)
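
One detail of `cat.codes` is worth keeping in mind: missing values receive the code `-1`, so a column with many missing entries (such as `Cabin`) carries `-1` into the model as an ordinary feature value. A minimal sketch on a small hand-made frame; the toy values below are illustrative and not taken from the Titanic file:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values (assumption, not from titanic.csv)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Cabin": ["C85", np.nan, "E46", np.nan],
})

# Categories are sorted, then numbered starting from 0
toy["Sex_label"] = toy.Sex.astype("category").cat.codes
# Missing values belong to no category and receive the code -1
toy["Cabin_label"] = toy.Cabin.astype("category").cat.codes

print(toy)
```

A decision tree can still split on `-1`, which in effect treats "value missing" as its own group.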


## Model training
We train on the training split only; the goal here is just to interpret the tree structure.

In [None]:
dtc = DecisionTreeClassifier(max_depth=5, random_state=random_state)
dtc.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1234, splitter='best')

### Exercise 1: Impact of Depth on Test and Train Performance

Evaluate the performance for the decision tree based on the train & test sets defined in the cells above:

* X_train, y_train
* X_test, y_test

Focus your evaluation on the hyper-parameter `max_depth` in the range of `2` to `20` and observe the impact on the accuracy on the train and test level.
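
As a starting point, the sweep can be sketched as a simple loop. The example below runs on synthetic data from `make_classification` so that it is self-contained; this stand-in data and the variable names `X_tr`, `X_te`, `train_acc` and `test_acc` are assumptions for illustration. In the notebook you would use the Titanic splits defined above instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); in the notebook, reuse
# X_train, X_test, y_train, y_test from the cells above instead.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

depths = list(range(2, 21))
train_acc, test_acc = [], []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1234)
    model.fit(X_tr, y_tr)
    # For classifiers, score() returns the mean accuracy
    train_acc.append(model.score(X_tr, y_tr))
    test_acc.append(model.score(X_te, y_te))

for depth, tr, te in zip(depths, train_acc, test_acc):
    print(f"max_depth={depth:2d}  train={tr:.3f}  test={te:.3f}")
```

Keeping the accuracies in two lists makes it easy to compare them side by side or to plot them afterwards.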


### Exercise 2: Plotting Test and Train over Depth

Plot train and test accuracy over the hyper-parameter `max_depth` in the range of `2` to `20` in a single plot.
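
One possible shape for such a plot is sketched below. The helper name `plot_accuracy_over_depth` is hypothetical, and it assumes you already have three equal-length sequences: the depths and the train and test accuracy measured at each depth.

```python
import matplotlib.pyplot as plt


def plot_accuracy_over_depth(depths, train_acc, test_acc):
    """Draw train and test accuracy against max_depth on one set of axes."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(depths, train_acc, marker="o", label="train")
    ax.plot(depths, test_acc, marker="o", label="test")
    ax.set_xlabel("max_depth")
    ax.set_ylabel("accuracy")
    ax.set_xticks(list(depths))  # one tick per evaluated depth
    ax.legend()
    return ax
```

Drawing both curves on the same axes makes the gap between train and test accuracy visible at each depth, which is the quantity of interest when looking for overfitting.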


In [None]:
import matplotlib.pyplot as plt



### Exercise 3: Model interpretation

Use the visualisation and interpretation capability below to analyse trees at the extreme edges of the hyperparameter depth for values:

* 2
* 20

In [None]:
class_names = list(dtc.classes_)

In [None]:
dtreeviz(dtc, X_train, y_train, features, target, class_names)

In [None]:
# fancy=False
dtreeviz(dtc, X_train, y_train, features, target, class_names, fancy=False )

### Leaf samples
Each node contains some important details. One of these is 'samples', which shows the number of training samples that pass through that node.<br>
It is very helpful to see the number of samples in each leaf, because it indicates how much confidence we can place in the leaf's prediction.<br>
For example, a leaf with a pure prediction (e.g. gini=0.0) but very few samples (e.g. samples=1) can be a sign of overfitting. If the leaf contained more samples, we could be more confident about its prediction.<br>

This is how we can easily get leaf samples from a big tree structure (using plots or plain text)
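
The same numbers can also be read directly from a fitted scikit-learn tree, independent of any plotting library. A minimal sketch, using synthetic stand-in data (an assumption; in the notebook you would inspect the fitted `dtc` instead):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); use X_train, y_train in the notebook
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=1234).fit(X, y)

tree = clf.tree_
# In scikit-learn's tree arrays, a leaf has no children (child id -1)
is_leaf = tree.children_left == -1
leaf_ids = np.flatnonzero(is_leaf)
leaf_samples = tree.n_node_samples[leaf_ids]

for node_id, n in zip(leaf_ids, leaf_samples):
    print(f"leaf {node_id}: samples={n}, gini={tree.impurity[node_id]:.3f}")
```

Leaves that combine a low impurity with a very small sample count are the ones to look at first when checking for overfitting.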


In [None]:
viz_leaf_samples(dtc, figsize=(20,10))

In [None]:
ctreeviz_leaf_samples(dtc, display_type="text")

In [None]:
# Useful when you want to see the overall distribution of leaf samples.
viz_leaf_samples(dtc, display_type="hist", bins=30, figsize=(20,7))

In [None]:
viz_leaf_samples(dtc, display_type="hist", bins=30, figsize=(20,7), min_samples=3, max_samples=100)

### Leaf samples by class
Here we can see the number of samples in each leaf, broken down by class. <br>
The leaf with id 78 contains a lot of samples from the training set, the majority of them from class 0. In leaf 17 all samples are from class 1. It would be very helpful to see what the samples in these leaves look like and what they have in common. This is a way to gain domain knowledge about our dataset using an ML-driven approach. <br>
How to retrieve the training samples from a leaf will be covered in the near future.
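
In the meantime, one way to get per-leaf class counts and the rows behind a leaf with plain scikit-learn and pandas is sketched below. It relies on `DecisionTreeClassifier.apply`, which returns the leaf id for every row; the synthetic data and the names `leaf_of_row`, `counts` and `some_leaf` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); use X_train, y_train in the notebook
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(X, y)

# apply() returns, for every row, the id of the leaf it ends up in
leaf_of_row = clf.apply(X)

# Rows: leaf id, columns: class label, cells: number of training samples
counts = pd.crosstab(pd.Series(leaf_of_row, name="leaf"),
                     pd.Series(y, name="class"))
print(counts)

# Pull out the rows that landed in one particular leaf
some_leaf = counts.index[0]
rows_in_leaf = X[leaf_of_row == some_leaf]
print(rows_in_leaf.shape)
```

Once the rows of a leaf are isolated, ordinary summary statistics on them show what the samples in that leaf have in common.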

In [None]:
ctreeviz_leaf_samples(dtc, figsize=(20,7))

### Discretization & Binning

Discretization is the process of turning a continuous variable or representation into an integer representation.
Binning is the process of reducing the range of possible values by assigning each value to one of a fixed set of bins (intervals).

Discretize (here by rounding to the nearest integer):
[0.1, 0.4, 1.2, 7.8] -> [0, 0, 1, 8]

Binning: with 4 bins [0, 2.5) :: 1, [2.5, 5) :: 2, [5, 7.5) :: 3, [7.5, 10] :: 4

[0.1, 0.4, 1.2, 7.8] -> [1, 1, 1, 4]
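
Both mappings can be reproduced in a few lines of NumPy. This is a minimal sketch that assumes discretization means rounding to the nearest integer and that the bins are the four equal-width intervals over 0 to 10 listed above:

```python
import numpy as np

values = np.array([0.1, 0.4, 1.2, 7.8])

# Discretize: round each value to the nearest integer
discretized = np.rint(values).astype(int)

# Binning: three inner edges split the range 0..10 into four bins;
# np.digitize returns 0..3, so add 1 to get the bin labels 1..4
inner_edges = [2.5, 5.0, 7.5]
binned = np.digitize(values, inner_edges) + 1

print(discretized.tolist())  # [0, 0, 1, 8]
print(binned.tolist())       # [1, 1, 1, 4]
```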

Discretization and binning can have a positive effect on ML performance by reducing the feature space.

* Fewer possible feature values means more observations per value
* More observations per value means more opportunities to learn a helpful split (or weight) for that feature value



### Exercise: Binning

Take a look at the input data for the titanic dataset.

* Identify a feature that is suited for binning
* Apply binning to the feature and re-train the decision tree
* Try different bins and observe the effect on train and test accuracy and the interpretability of the model.

The `loc` indexer should be helpful when it comes to transforming the original values into binned values.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

It allows in-place replacement as shown below:

```
dataset.loc[(dataset['FeatureName'] > 24.91) & (dataset['FeatureName'] <= 40), 'FeatureName'] = 3
```
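
As an alternative to writing one `loc` assignment per bin, `pandas.cut` assigns all bins in a single call. The sketch below uses a toy frame with a hypothetical `Age` column and assumed bin edges; neither is prescribed by the exercise. Writing the result to a new column also avoids the risk that an already-binned value is matched again by a later `loc` condition on the same column.

```python
import pandas as pd

# Toy data with a hypothetical 'Age' column (values are illustrative)
dataset = pd.DataFrame({"Age": [4.0, 22.0, 30.0, 38.0, 61.0]})

# Assumed bin edges; intervals are right-closed: (0, 16], (16, 24.91], ...
edges = [0, 16, 24.91, 40, 120]

# labels=False returns the 0-based integer bin index for each row
dataset["Age_bin"] = pd.cut(dataset["Age"], bins=edges, labels=False)

print(dataset)
```

The binned column can then replace the original one in the `features` list before re-training the tree.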
