<a href="https://colab.research.google.com/github/nicolez9911/colab/blob/main/AdvML_L62S1_N1_Random_Forests_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees

This notebook introduces decision trees.

<p>To study decision trees we will use the Titanic dataset.</p>
<p>It consists of the passenger manifest of the Titanic and an associated survival variable.</p>

# Imports & Installation

In [None]:
!conda info --envs

In [None]:
import sys
import os
# add library module to PYTHONPATH
sys.path.append(f"{os.getcwd()}/..")

### Installing pip Packages into the Correct Anaconda Environment

To make sure that pip packages are installed into the correct environment we can use the following approach:

* check which environment is active: `conda info --envs` [optional step]
* find the path of our environment: `which pip` [if pip is not registered we can use `which python`]
* run the pip installation with the pip executable from that environment
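
The same idea can be sketched from Python itself. `sys.executable` is the interpreter behind the running kernel, so invoking pip as a module of that interpreter targets the kernel's environment. The variable names here are our own, and the command is only built and printed; running it is left commented out.

```python
import sys

# Path of the Python interpreter that runs this kernel.
kernel_python = sys.executable
print(kernel_python)

# Calling pip as a module of that interpreter installs into its environment.
pip_command = [kernel_python, "-m", "pip", "install", "dtreeviz"]
print(" ".join(pip_command))

# To actually install, uncomment the next two lines:
# import subprocess
# subprocess.check_call(pip_command)
```

This avoids hard-coding the environment path, at the cost of assuming the notebook kernel already runs in the intended environment.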



In [None]:
!which pip

In [None]:
!/home/dev/BIN/anaconda3/envs/deng_ml/bin/pip install dtreeviz

### Installing graphviz

`graphviz` is a library for drawing graph diagrams. It is a `C` executable that has to be installed locally.
To install on Linux you can use the cell below.
On Windows or Mac please have a look at: https://github.com/parrt/dtreeviz#windows-10 .


In [None]:
!conda install -y graphviz

In [None]:
import sklearn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from dtreeviz.trees import *

import graphviz
import numpy as np  # imported explicitly because np.array is used below
import pandas as pd

In [None]:
%load_ext autoreload
# Reload all modules (except those excluded by %aimport) every time before executing code.
%autoreload 2


%matplotlib inline

In [None]:
random_state = 1234

# Load data

We will use the Titanic dataset as a basis for testing Decision Trees.

The `Titanic` dataset consists of two elements:
* Original passenger data
* Survival as target variable

The passenger data contains the following columns:

* pclass: A proxy for socio-economic status (SES)
   * 1st = Upper
   * 2nd = Middle
   * 3rd = Lower

* age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

* sibsp: The dataset defines family relations in this way...
    * Sibling = brother, sister, stepbrother, stepsister
    * Spouse = husband, wife (mistresses and fiancés were ignored)

* parch: The dataset defines family relations in this way...
    * Parent = mother, father
    * Child = daughter, son, stepdaughter, stepson
    * Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
dataframe = pd.read_csv("./titanic_dataset/titanic.csv")

In [None]:
dataframe.shape

## Exploratory Data Analysis

When starting work on a new dataset, we can use some utility functions from pandas to analyse the dataframe.

* describe() provides counts and descriptive statistics for the numeric columns
* isna() lets us identify missing values
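
As a small illustration of these two helpers, here is a sketch on a toy frame with made-up values (not the Titanic data). It shows that `describe()` counts only non-missing entries and that `isna()` can be reduced per column with `any()` or `sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values, not the Titanic data) with one missing age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Fare": [7.25, 71.28, 8.05, 16.70],
})

summary = toy.describe()        # count, mean, std, min, quartiles, max per numeric column
has_missing = toy.isna().any()  # True for every column with at least one NaN
n_missing = toy.isna().sum()    # number of NaN values per column

print(summary.loc["count"])
print(has_missing)
print(n_missing)
```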


In [None]:
dataframe.describe()

In [None]:
dataframe.isna().any()

In [None]:
dataframe.isna().sum()

## Feature Engineering

Before we can start working with the dataframe we have to fix the missing values.
* fillna() allows us to replace those values in place

In [None]:
# Fill missing values for Age
dataframe.fillna({"Age":dataframe.Age.mean()}, inplace=True)

In [None]:
# Encode categorical variables
# The `astype("category").cat.codes` call encodes a categorical label in numerical form;
# missing values (NaN) are encoded as -1
dataframe["Sex_label"] = dataframe.Sex.astype("category").cat.codes
dataframe["Cabin_label"] = dataframe.Cabin.astype("category").cat.codes
dataframe["Embarked_label"] = dataframe.Embarked.astype("category").cat.codes
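
To see what the two preprocessing steps do, here is a minimal sketch on a toy frame with made-up values. It assumes nothing about the Titanic file beyond the column names `Age` and `Embarked`; the point is that mean imputation uses only the non-missing values and that category codes map a missing label to -1.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with a missing age and a missing port.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", np.nan, "C"],
})

# Mean imputation: the mean is computed over the non-missing ages only.
toy = toy.fillna({"Age": toy["Age"].mean()})

# Category codes follow the sorted category order; NaN is encoded as -1.
toy["Embarked_label"] = toy["Embarked"].astype("category").cat.codes

print(toy)
```

Because the codes are arbitrary integers, a tree may split on them as if they were ordered; this is acceptable for interpretation exercises but worth keeping in mind when reading the splits.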

In [None]:
dataframe["Embarked_label"]

# Classification

## Feature and target variables

In [None]:
features = ["Pclass", "Age", "Fare", "Sex_label", "Cabin_label", "Embarked_label"]
target = "Survived"

## Model training
We will train on the full dataset, since the goal here is only to interpret the tree structure.

In [None]:
dtc = DecisionTreeClassifier(max_depth=5, random_state=random_state)
dtc.fit(dataframe[features], dataframe[target])

In [None]:
dtc.score(dataframe[features],dataframe[target])

In [None]:
min_samples = 0
max_samples = 99999
node_type = ShadowDecTree.get_node_type(dtc)
n_node_samples = dtc.tree_.n_node_samples

leaf_samples = [(i, n_node_samples[i]) for i in range(0, dtc.tree_.node_count) if node_type[i]
                and min_samples <= n_node_samples[i] <= max_samples]
x, y = zip(*leaf_samples)

In [None]:
np.array(x)

In [None]:
np.array(y)

In [None]:
ShadowDecTree.get_leaf_sample_counts(dtc, max_samples=20)
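
The cells above obtain leaf ids and sample counts through dtreeviz's `ShadowDecTree`. As a sketch that assumes only scikit-learn is available, the same information can be read from the fitted tree's `tree_` arrays, where leaves are the nodes whose `children_left` entry is -1. The iris data is a stand-in for the Titanic frame.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(X, y)

tree = clf.tree_
is_leaf = tree.children_left == -1          # scikit-learn marks leaves with child id -1
leaf_ids = np.flatnonzero(is_leaf)
leaf_counts = tree.n_node_samples[is_leaf]  # training samples that ended up in each leaf

for node_id, count in zip(leaf_ids, leaf_counts):
    print(f"leaf {node_id}: {count} samples")
```

Since every training row ends in exactly one leaf, the leaf counts add up to the number of training rows.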

## Model interpretation
Here we have a tree with max_depth=5. Take your time to look through its structure and try to find its leaves.

In [None]:
class_names = list(dtc.classes_)

In [None]:
dtreeviz(dtc, dataframe[features], dataframe[target], features, target, class_names)

In [None]:
# fancy=False
dtreeviz(dtc, dataframe[features], dataframe[target], features, target, class_names, fancy=False )

### Leaf samples
Each node contains some important details. One of these is 'samples', which shows the number of training samples that pass through that node.<br>
It is very helpful to see the number of samples in each leaf. Why? Because it indicates how much confidence we can place in the leaf's prediction.<br>
For example, if a leaf has a pure prediction (e.g. gini=0.0) but very few samples in it (e.g. samples=1), this could be a sign of overfitting. If the leaf contained more samples, we could be more confident about its prediction.<br>
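
To make the gini example concrete, here is a small sketch of the Gini impurity computed from per-class sample counts. The helper name `gini_impurity` is our own and not part of scikit-learn or dtreeviz. It shows that a leaf with a single sample is trivially pure, so gini=0.0 alone says little without the sample count.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_i^2) for a node with the given per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("node has no samples")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# A pure leaf backed by one sample and a pure leaf backed by 120 samples
# have the same impurity, but very different amounts of supporting evidence.
print(gini_impurity([0, 1]))
print(gini_impurity([0, 120]))

# An evenly mixed two-class node has the maximum two-class impurity.
print(gini_impurity([50, 50]))
```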

This is how we can easily get leaf samples from a big tree structure (using plots or plain text)


In [None]:
viz_leaf_samples(dtc, figsize=(20,10))

In [None]:
ctreeviz_leaf_samples(dtc, display_type="text")

In [None]:
#Useful when you want to easily see the general distribution of leaf samples.
viz_leaf_samples(dtc, display_type="hist", bins=30, figsize=(20,7))

In [None]:
viz_leaf_samples(dtc, display_type="hist", bins=30, figsize=(20,7), min_samples=3, max_samples=100)

### Leaf samples by class
Here we can see the number of samples in each leaf, broken down by class. <br>
The leaf with id 78 contains many training samples, the majority of them from class 0. In leaf 17 all samples are from class 1. It would be very helpful to see what the samples in these leaves look like and what they have in common. This is a way to gain domain knowledge about our dataset using an ML-driven approach. <br>
How to retrieve the training samples from a leaf will be covered in the near future.
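
One possible way to inspect the training samples that land in a given leaf is scikit-learn's `apply` method, which returns the leaf id for each row. This is a sketch, not the dtreeviz approach: the iris data is a stand-in for the Titanic frame, and the name `leaf_id` is our own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(X, y)

# apply() returns, for every row, the id of the leaf it falls into.
leaf_of_row = pd.Series(clf.apply(X), index=X.index, name="leaf_id")

# Per-leaf class counts: rows are leaf ids, columns are class labels.
counts_by_class = pd.crosstab(leaf_of_row, y)
print(counts_by_class)

# Summary of the training rows that ended up in the most populated leaf.
biggest_leaf = leaf_of_row.value_counts().idxmax()
print(X[leaf_of_row == biggest_leaf].describe())
```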

In [None]:
ctreeviz_leaf_samples(dtc, figsize=(20,7))

In [None]:
ctreeviz_leaf_samples(dtc, display_type="text")

# Regression

## Feature and target variables
To keep the same dataset for regression, our task is now to predict the age.

In [None]:
features_reg = ["Pclass", "Fare", "Sex_label", "Cabin_label", "Embarked_label", "Survived"]
target_reg = "Age"

## Model training

In [None]:
dtr = DecisionTreeRegressor(max_depth=4, random_state=random_state)
dtr.fit(dataframe[features_reg], dataframe[target_reg])

## Model interpretation

In [None]:
dtreeviz(dtr, dataframe[features_reg], dataframe[target_reg], features_reg, target_reg)

### Leaf samples

In [None]:
viz_leaf_samples(dtr, figsize=(40,10))

In [None]:
viz_leaf_samples(dtr, display_type="text")

In [None]:
viz_leaf_samples(dtr, display_type="hist", bins=30)

## Leaf target values distribution

In [None]:
#%config InlineBackend.figure_format = 'svg'
dtr = DecisionTreeRegressor(max_depth=3, random_state=random_state)
dtr.fit(dataframe[features_reg], dataframe[target_reg])
viz_leaf_target(dtr, dataframe[features_reg], dataframe[target_reg], features_reg, target_reg, show_leaf_labels=True, grid=False)


In [None]:
dtr = DecisionTreeRegressor(max_depth=7, random_state=random_state)
dtr.fit(dataframe[features_reg], dataframe[target_reg])
viz_leaf_target(dtr, dataframe[features_reg], dataframe[target_reg], features_reg, target_reg, show_leaf_labels=True, grid=False, figsize=(4,20))
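
The plots above show how the target values are distributed inside each leaf. As a sketch with synthetic data (not the Titanic frame), a similar per-leaf summary can be computed directly. With scikit-learn's default squared-error criterion, a regression tree predicts the mean of the training targets in the leaf, and the spread around that mean hints at how reliable the prediction is.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy step function of one feature (illustrative only).
rng = np.random.default_rng(1234)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 20.0, 40.0) + rng.normal(0, 3, size=200)

reg = DecisionTreeRegressor(max_depth=2, random_state=1234).fit(X, y)

# Group the training targets by the leaf each row falls into.
leaves = pd.DataFrame({"leaf_id": reg.apply(X), "target": y})
leaf_stats = leaves.groupby("leaf_id")["target"].agg(["count", "mean", "std"])
print(leaf_stats)

# The value predicted for a row equals the mean target of its leaf.
predicted = reg.predict(X)
leaf_mean_per_row = leaves["leaf_id"].map(leaf_stats["mean"]).to_numpy()
print(np.allclose(predicted, leaf_mean_per_row))
```

A leaf with a large standard deviation or a small count deserves the same caution as a small leaf in the classification case.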
