# Tree-Based Machine Learning Models

- Decision trees - simple model and model with class weight tuning
- Bagging (bootstrap aggregation)
- Random Forest - basic random forest and application of grid search for hyperparameter tuning
- Boosting (AdaBoost, gradient boost, extreme gradient boost - XGBoost)
- Ensemble of ensembles (with heterogeneous and homogeneous models)

# Decision tree classifiers

- Decision trees can be applied to either classification or regression problems. Based on features in data, decision tree models learn a series of questions to infer the class labels of samples.

![Decision Tree](DT.png) 

# Terminology used in decision trees

Decision trees involve less mathematical machinery than logistic regression, so there are only a few metrics to study. The focus here is on impurity measures: a decision tree splits on variables recursively according to a chosen impurity criterion until a stopping criterion is met (minimum observations per terminal node, minimum observations required to split a node, and so on):

#### <b>Entropy</b> : 
- Entropy controls how a decision tree decides to split the data; it directly affects where the tree draws its boundaries.
- Entropy comes from information theory and measures the impurity in data.
- If the sample is completely homogeneous, the entropy is zero; if a two-class sample is equally divided, the entropy is one.
- In decision trees, the predictor whose split reduces heterogeneity the most is placed <b>nearest to the root node</b>, and the data is then classified in a greedy mode.


![Entropy](entropy.png) 

- Where n = number of classes.
- For two classes, entropy is maximum in the middle (classes equally mixed), with a value of 1, and minimum at the extremes (only one class present), with a value of 0.
- A low entropy value is desirable, because it means the classes are segregated better.
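
The behaviour described above can be checked numerically. The following is a minimal sketch, assuming entropy is measured in bits (log base 2); the function name `entropy` is chosen here for illustration only.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

balanced = entropy([0.5, 0.5])  # two classes, equally divided
pure = entropy([1.0, 0.0])      # completely homogeneous sample
skewed = entropy([0.9, 0.1])    # mostly one class, so between the two extremes

print(balanced, pure, skewed)
```

The balanced case gives the maximum for two classes, the pure case gives zero, and the skewed case falls in between.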

#### <b> Information Gain </b> : 

- Information gain is the key quantity that decision tree algorithms use to construct a decision tree.
- A decision tree algorithm always tries to <b>maximize</b> information gain.
- The attribute with the highest information gain is tested/split first.
- Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute. 
- The idea is to start with mixed classes and to continue partitioning until each node is as pure as possible, ideally holding observations of a single class. 
- At every stage, the variable with maximum information gain is chosen in a greedy fashion.
- <b> Information Gain = Entropy of Parent - sum (weighted % * Entropy of Child)</b>
- <b> Weighted % = Number of observations in particular child/sum (observations in all child nodes)</b>
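
The two formulas above translate almost directly into code. This is a small sketch rather than a reference implementation: the helper names and the toy class counts are made up for illustration, and each node is represented simply as a list of per-class observation counts.

```python
import math

def entropy_from_counts(counts):
    """Entropy (bits) of a node given its per-class observation counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of parent minus the weighted sum of the child entropies."""
    n = sum(sum(child) for child in children_counts)
    weighted_child_entropy = sum(
        (sum(child) / n) * entropy_from_counts(child)  # weighted % * entropy of child
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted_child_entropy

# Toy parent node with 5 observations of each class (hypothetical numbers).
perfect_split = information_gain([5, 5], [[5, 0], [0, 5]])  # children are pure
useless_split = information_gain([5, 5], [[3, 3], [2, 2]])  # children as mixed as parent

print(perfect_split, useless_split)
```

A split that produces pure children recovers all of the parent's entropy as gain, while a split that leaves the class mix unchanged gains nothing.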

#### <b> Gini </b> : 
- Gini impurity measures how often an observation would be misclassified if it were labeled at random according to the class distribution of the node; it also applies in a multi-class classifier context. 
- Gini works similarly to entropy, except <b>Gini is quicker to calculate</b> because it involves no logarithm.
- Where i = number of classes. The similarity between Gini and entropy is shown in the following figure:

![Gini](gini.jpg)

![compare](compare.png)
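
To see the similarity numerically, the sketch below evaluates both measures for a two-class node as the proportion `p` of one class varies. It assumes the usual definitions (Gini = 1 - Σp², entropy in bits); the function names are illustrative.

```python
import math

def gini(probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Both measures are zero for a pure node and peak when the classes are evenly mixed.
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    dist = [p, 1 - p]
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

For two classes Gini peaks at 0.5 and entropy at 1, but both curves rise and fall together, which is why they usually rank candidate splits in a similar order.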

### Classification tree:

In [29]:
import pandas as pd
df = pd.read_csv("DT_data.csv")
df

| Day | Outlook | Temperature | Humidity | Wind | Play tennis |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |


- We take the Humidity variable as an example for classifying the Play tennis field.

- CHAID: Humidity has two categories, and the expected values are taken to be evenly distributed across them in order to calculate how well the variable distinguishes the classes:

![table](tab1.jpg)

- Expected is the average of the Play tennis counts across the Humidity categories
- Difference is the actual Play tennis count - Expected 
- Calculating the χ² (Chi-square) value:
- $ χ^2 = Σ (Actual - Expected)^2 / Expected $

![chi](chi.jpg)

- Calculating degrees of freedom = (r-1) * (c-1)
- Where r = number of row components (variable categories) and c = number of column components (response categories).
- Here, there are two row categories (High and Normal) and two column categories (No and Yes).
- Hence d.f. = (2-1) * (2-1) = 1
- p-value for Chi-square 2.8 with 1 d.f. = 0.0942
- The p-value can be obtained with the following Excel formula: = CHIDIST (2.8, 1) = 0.0942
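
The steps above can be sketched in plain Python. The observed counts below are an assumption: they are not stated in the text and are taken from the classic 14-day version of this data set, chosen because they reproduce the Chi-square value of 2.8 quoted above (the 10 rows printed earlier would give different numbers). The expected values follow the approach described here, the column average across categories, which matches the usual row total × column total / N expectation only because both assumed categories have the same number of observations.

```python
import math

# ASSUMED counts (not given in the text): Humidity category -> [No, Yes]
observed = {"High": [4, 3], "Normal": [1, 6]}

n_rows = len(observed)  # r: Humidity categories
n_cols = 2              # c: response categories (No, Yes)

# Expected value for each column = average of that column across the categories.
expected = [sum(row[j] for row in observed.values()) / n_rows for j in range(n_cols)]

# Chi-square = sum over all cells of (Actual - Expected)^2 / Expected
chi_square = sum(
    (row[j] - expected[j]) ** 2 / expected[j]
    for row in observed.values()
    for j in range(n_cols)
)
dof = (n_rows - 1) * (n_cols - 1)

# For exactly 1 degree of freedom, the right-tail chi-square probability
# equals erfc(sqrt(x / 2)), so no statistics library is needed here.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 4), dof, round(p_value, 4))
```

The `erfc` shortcut only holds for one degree of freedom; for larger tables a statistics library would be the more practical choice.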

#### In a similar way, we will calculate the p-value for all variables and select the best variable, the one with the lowest p-value.

## $ Entropy = - Σ p * log_2  p $

![cal](cal3.jpg)


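### CHIDIST
- CHIDIST returns the right-tailed probability of the chi-square distribution, which is used here as a goodness-of-fit measure. The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.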

### In a similar way, we will calculate information gain for all variables and select the best variable with the highest information gain
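
As a worked instance for Humidity, the derivation below assumes the class counts of the classic 14-day version of this data set (9 Yes and 5 No overall; High: 3 Yes, 4 No; Normal: 6 Yes, 1 No). These counts are an assumption, not stated in the text, and the 10 rows printed earlier would give different numbers:

$$
\begin{aligned}
Entropy(Parent) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.9403 \\
Entropy(High) &= -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} \approx 0.9852 \\
Entropy(Normal) &= -\tfrac{6}{7}\log_2\tfrac{6}{7} - \tfrac{1}{7}\log_2\tfrac{1}{7} \approx 0.5917 \\
Information\ Gain &= 0.9403 - \left(\tfrac{7}{14}\times 0.9852 + \tfrac{7}{14}\times 0.5917\right) \approx 0.1518
\end{aligned}
$$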

### GINI:

$ Gini = 1- Σp^2 $

![cal](cal4.jpg)

- In a similar way, we will calculate Expected Gini for all variables and select the best with the lowest expected value.
- For the purpose of a better understanding, we will also do similar calculations for the Wind variable:
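
The expected Gini of a split is the weighted average of the child Gini values, with the same weights used for information gain. The sketch below assumes the Humidity counts of the classic 14-day version of this data set (High: 4 No, 3 Yes; Normal: 1 No, 6 Yes); these counts are not given in the text and the helper names are illustrative.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given its per-class observation counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def expected_gini(children_counts):
    """Weighted average of child Gini values; weights are child sizes / total."""
    n = sum(sum(child) for child in children_counts)
    return sum((sum(child) / n) * gini_from_counts(child) for child in children_counts)

# ASSUMED [No, Yes] counts for the High and Normal children of a Humidity split.
humidity_children = [[4, 3], [1, 6]]
humidity_gini = expected_gini(humidity_children)

print(round(humidity_gini, 4))
```

Under these assumed counts the result is about 0.367; small differences in the last digits can come from rounding intermediate values.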

### CHAID: 
Wind has two categories, and the expected values are taken to be evenly distributed across them in order to calculate how well the variable distinguishes the classes:

![wind](tab2.jpg)
![wind](wind.png)
![wind](windcal1.png)

### Now we will compare both variables for all three metrics so that we can understand them better.

In [30]:
df = pd.read_csv("compare.csv")
df

| Variables | CHAID (p-value) | Entropy (information gain) | Gini (expected value) |
|---|---|---|---|
| Humidity | 0.0942 | 0.1518 | 0.3669 |
| Wind | 0.2733 | 0.0482 | 0.4285 |
| Temperature | 4.066 | 0.9402 | 0.65 |
| Better | Low value | High value | Low value |


## Conclusion : The root node is selected by these rules
1. CHAID : Select the variable with the lowest p-value
2. Entropy : Select the variable with the highest information gain
3. GINI : Select the variable with the lowest expected Gini value

##### For all three calculations, Humidity is proven to be a better classifier than Wind. Hence, we can confirm that all methods convey a similar story.
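
The selection rules can be applied mechanically to the Humidity and Wind values from the comparison table. This is only an illustrative sketch: the dictionary layout and key names are made up here, and only the two variables discussed in the conclusion are included.

```python
# Values copied from the comparison table for the two variables compared above.
metrics = {
    "Humidity": {"chaid_p_value": 0.0942, "information_gain": 0.1518, "expected_gini": 0.3669},
    "Wind": {"chaid_p_value": 0.2733, "information_gain": 0.0482, "expected_gini": 0.4285},
}

# CHAID: lowest p-value; Entropy: highest information gain; Gini: lowest expected value.
best_by_chaid = min(metrics, key=lambda name: metrics[name]["chaid_p_value"])
best_by_entropy = max(metrics, key=lambda name: metrics[name]["information_gain"])
best_by_gini = min(metrics, key=lambda name: metrics[name]["expected_gini"])

print(best_by_chaid, best_by_entropy, best_by_gini)
```

All three rules pick the same variable here, which is the agreement the conclusion describes.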

### Study the compare2.xlsx and the formulas to understand the calculations

![Diagram](DT_3.png)

### Nodes :

- Root node, internal nodes (also called split nodes or non-leaf nodes), and leaf nodes
![Diagram](DT_4.png)
- Decision tree video explanation
[![Video](http://img.youtube.com/vi/YOUTUBE_VIDEO_ID_HERE/0.jpg)](https://www.youtube.com/watch?v=7VeUPuFGJHk&t=1s "Decision tree")
