# Decision Tree

Decision trees are built from very simple components. They are particularly popular because it is easy to see exactly why a specific prediction was made.

Decision trees work by recursively splitting an input dataset into two subgroups. Each split is chosen using a metric that plays a role comparable to a loss function in other settings. Metrics used to decide on the split include Gini impurity, information gain, variance reduction and the measure of goodness.

Decision trees are often referred to as Classification And Regression Trees (CART), a term introduced by Breiman et al. in 1984.

CART trees form the underlying components of ensemble methods such as Random Forests, AdaBoost and Rotation Forests.

Notable decision tree algorithms include:
* CART - Nonparametric decision tree learning technique that produces a classification or a regression tree, depending on whether the dependent variable is categorical or numeric.
* CHAID - Chi-squared automatic interaction detection is often used in direct marketing to select groups of consumers and predict how their responses to some variables affect other variables.
* MARS - Multivariate Adaptive Regression Splines are a form of regression analysis that works by fitting piecewise linear regressions. They are capable of automatically modeling nonlinearities. The algorithm is computationally very expensive.


For the purposes of these notes we are going to focus on CART.

## Mathematics

CART works by recursively partitioning the training set $T = \{(x_n, y_n)\}_{n=1}^k$, where $x_n$ is a vector of independent variables and $y_n$ is the dependent variable. The aim is to find a partition of the training set that minimises a loss function $L$. Each node in the tree is associated with a particular subset $T_i \subset T$.

The partition is then defined by a function that splits the data according to the value of one particular variable of the inputs $x$ in $T$.

Considering a feature $j$ of the inputs in $T$ and taking $A$ as an arbitrary value, a split can be defined by two subsets, where $x_j$ is the value of feature $j$ for the sample $t$:

$$T_l = \{t \in T : x_j \leq A\}$$

and 

$$T_r = \{t\in T : x_j > A\}$$

A categorical feature can be split on a single category $A$ by using

$$T_l = \{t\in T : x_j = A\}$$

and 

$$T_r = \{t\in T : x_j \neq A\}$$
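
The two kinds of split can be sketched in a few lines of plain Python. This is only an illustration and is separate from the implementation later in these notes: the helper names `split_numeric` and `split_categorical` and the toy rows are assumptions made for the example. The right-hand numeric subset uses a strict inequality so that the two subsets do not overlap.

```python
# Hypothetical helpers illustrating the split definitions; the names are not from the notes.

def split_numeric(rows, j, threshold):
    """Split rows on feature j: left gets x_j <= threshold, right gets x_j > threshold."""
    left = [row for row in rows if row[0][j] <= threshold]
    right = [row for row in rows if row[0][j] > threshold]
    return left, right


def split_categorical(rows, j, category):
    """Split rows on feature j: left gets x_j == category, right gets everything else."""
    left = [row for row in rows if row[0][j] == category]
    right = [row for row in rows if row[0][j] != category]
    return left, right


# Toy training set T: each element is (x, y) with x = (age, colour).
T = [
    ((25, "red"), 1.0),
    ((32, "blue"), 2.0),
    ((47, "red"), 3.0),
    ((51, "green"), 4.0),
]

T_l, T_r = split_numeric(T, j=0, threshold=32)
print(len(T_l), len(T_r))  # 2 2

C_l, C_r = split_categorical(T, j=1, category="red")
print(len(C_l), len(C_r))  # 2 2
```

Every sample lands in exactly one of the two subsets, which is what makes the pair a partition of $T$.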

When partitioning a dataset the decision tree takes into account all possible partitions. It tests each partition and selects the one that minimises the defined loss function $L$.

The loss function $L$ used to compare the value of different splits tends to be different for continuous and categorical variables, and there exist many loss functions that can be used in both cases. In the continuous setting a squared (L2) loss can be calculated for each subset $T$, where $f(x_t)$ is the value predicted for sample $t$ (typically the mean of the targets in the subset):

$$
L = \sum_{t \in T}(y_t - f(x_t))^2
$$

There are many other continuous loss functions that could be used in place of an L2 loss.
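
To make the exhaustive search over partitions concrete, the sketch below scores every candidate threshold on a single numeric feature with the squared loss. It assumes that the prediction $f$ in each subset is the mean of its targets, and the function names `squared_loss` and `best_threshold` and the toy data are hypothetical, chosen only for this example.

```python
def squared_loss(ys):
    """Sum of squared deviations from the mean; 0.0 for an empty subset."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_threshold(xs, ys):
    """Try each observed value as a threshold and keep the lowest total loss."""
    best = (None, float("inf"))
    for a in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= a]
        right = [y for x, y in zip(xs, ys) if x > a]
        if not left or not right:
            continue  # skip splits that leave one side empty
        loss = squared_loss(left) + squared_loss(right)
        if loss < best[1]:
            best = (a, loss)
    return best


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, loss = best_threshold(xs, ys)
print(threshold, loss)  # 3.0 0.0
```

The threshold 3.0 separates the two groups of targets perfectly, so both subsets have zero squared loss.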

For classification, CART uses the Gini impurity, which estimates how pure (homogeneous) a subgroup is. The Gini impurity is calculated from the probability $p(c_i)$ of selecting a sample of a given class $c_i \in C$ from the samples in the split:

$$L = \sum_{c_i \in C}p(c_i)(1-p(c_i))$$

This leads to an impurity of 0 when all samples belong to the same class, and the impurity increases as the homogeneity of the subgroup decreases.
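
As a small worked example, the Gini impurity can be computed directly from class counts. The function name `gini_impurity` and the label lists are assumptions for illustration, with $p(c_i)$ estimated as the fraction of samples in the subgroup that carry the label $c_i$.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: sum over classes of p(c) * (1 - p(c))."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum((k / n) * (1 - k / n) for k in counts.values())


print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
```

A pure subgroup scores 0, while an even mix of two classes scores 0.5.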

The process of partitioning continues for each generated node until a stopping criterion has been met. The final tree is then returned and can be used for prediction.
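
As a rough illustration of how stopping criteria end the recursion, the sketch below grows a tiny regression tree on one numeric feature. It stops when a depth limit is reached or when a node is too small to split. The names `build`, `predict`, `min_leaf` and `max_depth`, the dictionary representation and the toy data are all assumptions for this example, which is separate from the implementation in the next section.

```python
def sse(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def build(points, min_leaf=2, max_depth=3, depth=0):
    """Recursively split (x, y) points; return a leaf when a stopping criterion is met."""
    ys = [y for _, y in points]
    leaf = {"value": sum(ys) / len(ys)}
    # Stopping criteria: depth limit reached, or too few points to form two leaves.
    if depth >= max_depth or len(points) < 2 * min_leaf:
        return leaf
    best = None
    for a in sorted({x for x, _ in points}):
        left = [p for p in points if p[0] <= a]
        right = [p for p in points if p[0] > a]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        loss = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or loss < best[0]:
            best = (loss, a, left, right)
    if best is None:  # no admissible split exists, so this node is a leaf
        return leaf
    _, a, left, right = best
    return {
        "threshold": a,
        "left": build(left, min_leaf, max_depth, depth + 1),
        "right": build(right, min_leaf, max_depth, depth + 1),
    }


def predict(tree, x):
    """Follow the thresholds down to a leaf and return its mean value."""
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


points = [(1, 5.0), (2, 5.0), (3, 5.0), (10, 20.0), (11, 20.0), (12, 20.0)]
tree = build(points)
print(tree["threshold"], predict(tree, 2), predict(tree, 11))  # 3 5.0 20.0
```

With the defaults the root splits once at 3 and both children become leaves, because three points cannot be divided into two leaves of at least two points each.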

## Python

In this section we break the decision tree down into its components. We treat the decision tree as a function that builds a tree by recursively splitting nodes until the model has been trained. The first step is the initialisation of a node, which is given a subset of the data. Once the node has been initialised, we find the optimal variable to split it on using `find_varsplit`. Within `find_varsplit` we call `find_better_split`, which assesses each candidate split against the current best split. The assessment uses a score (the standard deviation of each side weighted by its sample count) that acts as a proxy for the RMSE and can be seen as a loss function. Once the best split has been found, the data is divided between `lhs` and `rhs`, which each initialise their own node. The recursion continues until the stopping criterion is met, which here means that no split leaves at least `min_leaf` samples on each side.

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from math import sqrt

# Regressor wrapper exposing fit/predict around the root Node
class DecisionTreeRegressor:
    # Fit the model given training data and parameters
    def fit(self, X, y, min_leaf=5):
        self.dtree = Node(X, y, np.arange(len(y)), min_leaf)
        return self

    # Make a prediction
    def predict(self, X):
        return self.dtree.predict(X.values)

# Node constructor
class Node:
    # Initialise the necessary aspects of a node
    def __init__(self, x, y, idxs, min_leaf=5):
        self.x = x 
        self.y = y
        self.idxs = idxs 
        self.min_leaf = min_leaf
        self.row_count = len(idxs)
        self.col_count = x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()
        
    # Find the correct splitpoint
    def find_varsplit(self):
        # Go through each of the features looking for best split
        for c in range(self.col_count): 
            self.find_better_split(c)
        # Return without action if leaf is found
        if self.is_leaf: return
        # Split the values into lhs and rhs and recurse
        x = self.split_col
        lhs = np.nonzero(x <= self.split)[0]
        rhs = np.nonzero(x > self.split)[0]
        self.lhs = Node(self.x, self.y, self.idxs[lhs], self.min_leaf)
        self.rhs = Node(self.x, self.y, self.idxs[rhs], self.min_leaf)
    
    # Using the variable index to assess split on feature
    def find_better_split(self, var_idx):
          
        # Get all the values from the index in question    
        x = self.x.values[self.idxs, var_idx]
        
        # Iterate over the rows of the input matrix and try each value as a split point
        for r in range(self.row_count):
            lhs = x <= x[r]
            rhs = x > x[r]
            
            # Early check to see if there are too few samples in the leaf
            if rhs.sum() < self.min_leaf or lhs.sum() < self.min_leaf: continue
            
            # Get the score for the current split
            curr_score = self.find_score(lhs, rhs)
            # Replace best split if better
            if curr_score < self.score: 
                self.var_idx = var_idx
                self.score = curr_score
                self.split = x[r]
                
    # Score the split as the size-weighted sum of each side's standard deviation (a proxy for the RMSE)
    def find_score(self, lhs, rhs):
        y = self.y[self.idxs]
        lhs_std = y[lhs].std()
        rhs_std = y[rhs].std()
        return lhs_std * lhs.sum() + rhs_std * rhs.sum()
                
    @property
    def split_col(self): 
        return self.x.values[self.idxs,self.var_idx]
                
    @property
    def is_leaf(self): 
        return self.score == float('inf')                

    
    def predict(self, x):
        return np.array([self.predict_row(xi) for xi in x])

    def predict_row(self, xi):
        if self.is_leaf: return self.val
        node = self.lhs if xi[self.var_idx] <= self.split else self.rhs
        return node.predict_row(xi)
 

# Import training dataset
import numpy as np
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt


data = load_diabetes()

X_train = pd.DataFrame(data=data.data)
y_train = pd.DataFrame(data=data.target).iloc[:,0]


regressor = DecisionTreeRegressor().fit(X_train, y_train)

