In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')
from activation_functions import sigmoid
from metrics import accuracy
from BaseRegression import BaseRegression

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Decision Tree

A decision tree is a powerful algorithm that can fit complex data and perform classification, regression, and multioutput tasks.

Advantages of Decision Trees:
* They make very few assumptions about the data.
* They are fairly intuitive and their decisions are easy to interpret (white box model).
* Feature scaling and centering are not necessary to obtain good results.

Other noteworthy info:
* Decision trees form the fundamental components of a RandomForest.
* The CART algorithm (used by scikit-learn) produces only binary trees, whereas other algorithms such as ID3 allow nodes to have more than two children.

<img src="https://miro.medium.com/v2/resize:fit:1060/1*H6thrs5CR_wdxQyMCwWawQ.png" alt="Image of Decision tree" style="background-color:white;">

### <span style="color:#217AB8"> Making predictions</span> 

Start at the root node and follow the conditions that apply to your instance until you reach a leaf. Each node asks a question such as: is attribute x of your instance larger than 1? If yes, follow the right branch; if no, follow the left. Once you reach a leaf node (a node without any child nodes), use that node's class as the prediction for your instance.\
Predicting an instance's class requires traversing roughly $O(\log_2(m))$ nodes, where $m$ is the number of training instances, because decision trees are generally approximately balanced. Each node checks the value of only one feature, so this is independent of the number of features and predictions are very fast even with large training sets.
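The traversal described above can be sketched against a fitted scikit-learn tree. This is a minimal illustration, not how scikit-learn implements prediction internally: `predict_one` is a hypothetical helper, and the iris dataset and `max_depth=2` are arbitrary choices made for the example. It relies on the `tree_` attributes `children_left`, `children_right`, `feature`, `threshold`, and `value`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# scikit-learn compares float32 feature values against its thresholds,
# so the data is cast up front to mirror that behaviour.
X = iris.data.astype(np.float32)
y = iris.target

# A shallow tree keeps the walk short (max_depth=2 is an arbitrary choice).
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)


def predict_one(tree, x):
    """Hypothetical helper: walk from the root to a leaf for one instance."""
    node = 0  # the root is stored at index 0
    # A leaf is marked by children_left == -1.
    while tree.children_left[node] != -1:
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]   # condition holds: go left
        else:
            node = tree.children_right[node]  # condition fails: go right
    # The leaf's majority class is the prediction.
    return int(np.argmax(tree.value[node][0]))


manual = np.array([clf.classes_[predict_one(clf.tree_, x)] for x in X])
print("manual walk matches clf.predict:", np.array_equal(manual, clf.predict(X)))
```

Each iteration of the loop inspects a single feature, which is why the cost of a prediction depends on the depth of the tree and not on the number of features.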

Each node of the tree reports the following attributes:

* *samples*: how many training instances a node's condition applies to.
* *value*: how many training instances of each class a node applies to (e.g. in node x, class a is represented once and class b 20 times).
* *gini*: the impurity of a node (a node is pure, with gini = 0, when it applies only to instances of one class).

Gini impurity: $G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$, where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$-th node and $n$ is the number of classes.
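To make the formula concrete, here is a small sketch that computes the Gini impurity from a node's per-class counts (its *value*). The function name `gini` and the example counts are illustrative assumptions; the counts `[1, 20]` mirror the node x example above.

```python
def gini(class_counts):
    """Gini impurity of a node given the number of instances per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


pure = gini([0, 21])       # only one class present
mixed = gini([1, 20])      # class a once, class b 20 times
balanced = gini([10, 10])  # two classes in equal proportion

print(f"pure={pure:.4f} mixed={mixed:.4f} balanced={balanced:.4f}")
```

A pure node gives 0, and for two classes the impurity peaks at 0.5 when both classes are equally represented.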

### <span style="color:#217AB8">CART classification and regression tree algorithm</span> 
CART is a greedy algorithm: at every step, starting from the root, it searches for the best split at the current level without checking whether that split leads to the lowest possible impurity several levels further down.\
Therefore it does not guarantee an optimal solution.\
Finding the optimal tree is known to be an NP-complete problem that requires $O(\exp(m))$ time, which makes it intractable even for small training sets, so we settle for a reasonably good solution.

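1. Split the training set into two subsets using a single feature $k$ and a threshold $t_k$ (e.g. petal length >= 1.3). Search for the pair ($k$, $t_k$) that produces the purest subsets, weighted by their size.

&emsp;&emsp;&emsp;Minimize the cost function: $J(k, t_k) = \frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right}$, where $G_{left}$ and $G_{right}$ are the impurities of the left and right subsets, $m_{left}$ and $m_{right}$ are their numbers of instances, and $m$ is the number of instances in the node being split.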

2. Continue this on the subsets recursively.
3. Stop when max_depth is reached or when no split further reduces the Gini impurity.

Training requires comparing all features (unless max_features is set) on all samples at each node.\
This brings the training complexity to $O(n \times m \log_2(m))$, where $n$ is the number of features and $m$ is the number of training instances.
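Step 1 of the procedure above can be sketched as a brute-force search over every feature and candidate threshold. This is a simplified illustration under stated assumptions, not scikit-learn's optimized implementation: `gini_from_labels` and `best_split` are hypothetical helpers, candidate thresholds are taken as midpoints between consecutive distinct feature values, and the toy data is made up for the example.

```python
import numpy as np


def gini_from_labels(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Return (k, t_k, cost) minimizing J(k, t_k) for one node."""
    m, n = X.shape
    best_k, best_t, best_cost = None, None, np.inf
    for k in range(n):
        values = np.unique(X[:, k])  # sorted distinct values of feature k
        # Assumption: try the midpoints between consecutive values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, k] <= t
            m_left = left.sum()
            m_right = m - m_left
            # Size-weighted impurity of the two subsets: J(k, t_k).
            cost = (m_left / m) * gini_from_labels(y[left]) \
                 + (m_right / m) * gini_from_labels(y[~left])
            if cost < best_cost:
                best_k, best_t, best_cost = k, t, cost
    return best_k, best_t, best_cost


# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 5.5], [4.0, 4.5]])
y_toy = np.array([0, 0, 1, 1])

k, t, cost = best_split(X_toy, y_toy)
print(f"best feature={k}, threshold={t}, cost={cost}")
```

Applying `best_split` recursively to the left and right subsets, until a stopping criterion is met, gives the greedy tree-growing loop from steps 2 and 3.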

If the tree is left unconstrained, it will fit itself very closely to the training data and will most likely overfit.\
Such a model is often called non-parametric: not because it has no parameters, but because their number is not determined before training. In contrast, parametric models such as linear models have a predetermined number of parameters, so their degrees of freedom are limited\
(which in turn can lead to underfitting, especially when the data contains more complex patterns than the model is able to capture).



### <span style="color:#217AB8">Gini impurity or Entropy</span> 
Entropy comes from Shannon's information theory, where it measures the average information content of a message.\
Entropy is 0 when all messages are identical.\
For a decision tree, a node's entropy is 0 when the node contains instances of only one class.

Entropy in the $i$-th node: $H_i = -\sum_{k=1}^{n} p_{i,k} \log_2(p_{i,k})$, where the sum runs only over the classes with $p_{i,k} \neq 0$.

Example:
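As a small worked example, the sketch below evaluates the entropy of a node from its per-class counts, using the usual definition with a leading minus sign and a base-2 logarithm. The function name `entropy` and the counts are illustrative assumptions; classes with zero instances are skipped, matching the $p_{i,k} \neq 0$ condition.

```python
import math


def entropy(class_counts):
    """Entropy (in bits) of a node given the number of instances per class."""
    total = sum(class_counts)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0  # skip empty classes, since log2(0) is undefined
    )


pure = entropy([0, 21])       # only one class present
mixed = entropy([1, 20])      # class a once, class b 20 times
balanced = entropy([10, 10])  # two classes in equal proportion

print(f"pure={pure:.4f} mixed={mixed:.4f} balanced={balanced:.4f}")
```

For two classes the entropy ranges from 0 (pure node) to 1 bit (equal proportions); the mixed node above comes out at roughly 0.276.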

Should you use Gini or entropy?\
According to O'Reilly's "Hands-On Machine Learning", they lead to similar trees most of the time.\
"The gini index seems to be slightly faster to compute so it is a good default.\
However when they differ the Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees." (Sebastian Raschka's analysis)

### <span style="color:#217AB8">Regularization</span> 
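Regularization restricts the tree's freedom during training. For example (scikit-learn hyperparameter names in parentheses):
* restrict the maximum depth of the tree (max_depth)
* set the minimum number of samples a node must have before it can be split (min_samples_split)
* set the minimum number of samples a leaf must have (min_samples_leaf)
* set the minimum fraction of all (weighted) training instances that a leaf must represent (min_weight_fraction_leaf)
* restrict the maximum number of leaf nodes that can be created (max_leaf_nodes)
* restrict the maximum number of features considered for splitting at a node (max_features)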
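The effect of these restrictions can be sketched by comparing an unconstrained tree with a regularized one. The dataset (`make_moons` with noise) and every hyperparameter value below are arbitrary assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy toy dataset (an assumption for this example).
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No restrictions: the tree grows until every leaf is pure.
unconstrained = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Illustrative restrictions; the values are not tuned.
regularized = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    max_leaf_nodes=20,
    random_state=42,
).fit(X_train, y_train)

for name, model in [("unconstrained", unconstrained), ("regularized", regularized)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train acc={model.score(X_train, y_train):.3f}, "
        f"test acc={model.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the unconstrained tree is the typical sign of overfitting; in practice the hyperparameter values would be chosen with cross-validation.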

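You could also train the tree without restrictions and then prune it after training.\
For example, prune a node if all of its children are leaves and the purity improvement it provides is not statistically significant.\
A chi-square test can estimate the probability that the improvement is purely the result of chance (the null hypothesis). If this probability, the p-value, is above a chosen threshold (typically 0.05), the null hypothesis cannot be rejected, so the node is considered unnecessary and its children are deleted. Keeping a node at the 95% confidence level therefore requires a p-value below 0.05.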
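One possible way to sketch this test is to treat the class counts of a node's two leaf children as a contingency table and apply `scipy.stats.chi2_contingency`. The helper `should_prune`, the `alpha` parameter, and the example counts are assumptions made for illustration; this is not a complete pruning procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency


def should_prune(left_counts, right_counts, alpha=0.05):
    """Hypothetical helper: True if the split is not statistically significant.

    left_counts / right_counts hold the per-class instance counts of the
    two leaf children of the node under consideration.
    """
    table = np.array([left_counts, right_counts])
    # Drop classes absent from both children; they carry no information
    # and would produce zero expected frequencies.
    table = table[:, table.sum(axis=0) > 0]
    if table.shape[1] < 2:
        return True  # a single class: the split cannot improve purity
    _, p_value, _, _ = chi2_contingency(table)
    # High p-value: the improvement may well be due to chance, so prune.
    return p_value > alpha


# Both children have the same class mix: the split adds nothing.
print(should_prune([10, 10], [10, 10]))
# The children separate the classes perfectly: keep the split.
print(should_prune([20, 0], [0, 20]))
```

The threshold `alpha` plays the role of the hyperparameter mentioned above: raising it prunes fewer nodes, lowering it prunes more.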
 
