# Decision Tree Skeleton

**Decision Trees are a non-parametric Supervised Learning method for both Classification and Regression** <br>

*Non-parametric Machine Learning algorithms are algorithms that do not make strong assumptions about the form of the mapping function.<br>
For example, Logistic Regression is a parametric Machine Learning algorithm that assumes $y = \frac{1}{1 + e^{-\theta^{T} X}}$ and tries to optimize $\theta$. <br>
Non-parametric Machine Learning algorithms such as Decision Trees do not presume any known/simplified function.*

The basic intuition behind a Decision Tree is to map out all possible decision paths in the form of a tree<br>

In general, Decision Trees are referred to as **CART** or **Classification and Regression Trees**<br>

Decision Trees are **built using recursive partitioning** to classify the data segments by **minimizing the Impurity** at each step and **maximizing the Information Gain**

## Decision Tree Terminologies

### Root Node:
It represents the entire population or sample, which is further divided into 2 or more homogeneous sets

### Leaf Node:
A node that cannot be split into further nodes. Such a node generally holds a Pure subset

### Splitting: 
It refers to dividing the root node/sub-node into different parts on the basis of some condition

### Branch / SubTree:
A sub-section of the tree formed by splitting a node

### Purity/Impurity: 
The concept of Im/purity is based on the fraction of data (or sample under consideration) that belongs to just 1 class<br>
**Pure subset**   : subset that contains data belonging to only 1 class (8-A, 0-B) or (0-A, 8-B) <br>
**Impure subset** : subset that contains data belonging to multiple classes (6-A, 2-B), (4-A, 4-B), etc <br>
A node having equally distributed split of classes (50-50 or 33-33-33) has the worst Purity <br><br>

###  Entropy:
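It is the measure of randomness or uncertainty in data i.e. it is a measure of Impurity in data (or sub-split): $Entropy = -\sum_{k} p_k \log_2 p_k$, where $p_k$ is the fraction of samples belonging to class $k$ <br>
*Lower the Entropy, less uniform the Distribution, purer the node* <br>
[1-A / 7-B]--> Low Entropy &emsp; [3-A / 5-B]-->High Entropy <br>
If the sample is completely homogeneous (contains data of only 1 class), the Entropy is 0, and if a two-class sample is equally divided (50-50), the Entropy is 1. With $k$ equally likely classes (e.g. 33-33-33) the base-2 Entropy is $\log_2 k$, which is larger than 1 for $k > 2$ <br>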
[8-A / 0-B] or [0-A / 8-B]--> Entropy = 0 &emsp; [4-A / 4-B]-->Entropy = 1
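
The values quoted above can be checked numerically. The sketch below assumes the usual base-2 Shannon definition, $Entropy = -\sum_k p_k \log_2 p_k$; the helper name `entropy` is illustrative and not part of any library.

```python
from math import log2

def entropy(counts):
    """Base-2 Shannon entropy of a node, given its per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing (p * log2(p) -> 0 as p -> 0)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

for counts in ([8, 0], [1, 7], [3, 5], [4, 4]):
    print(counts, round(entropy(counts), 3))
# [8, 0] 0.0
# [1, 7] 0.544
# [3, 5] 0.954
# [4, 4] 1.0
```

The pure node scores 0, the evenly split two-class node scores 1, and the mixed nodes fall in between.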

### Information Gain
It is the reduction in uncertainty achieved by a split. Information Gain is always computed with respect to the attribute under consideration <br>
Information Gain quantifies how much a question/attribute (at a node) reduces the Entropy (uncertainty) i.e. it measures how much information a feature (attribute) gives us about the class. <br>
The feature with the highest Information Gain is taken as the split and the process is repeated until all child nodes are pure or the Information Gain is 0<br>
*Information Gain is the entropy of a tree before the split minus the weighted entropy after the split* <br>

$ \text{Information Gain} = (\text{Entropy before split}) - (\text{weighted Entropy after the split}) $

Our objective is to minimize the uncertainty/randomness i.e. Entropy and maximize the Information Gain at every level in the Decision Tree <br>
**So our goal is to find the attribute that has the highest Information Gain and use that attribute to split the tree in our Decision Tree**

![infoGain.PNG](images/infoGain.PNG)

In the above example, the Information Gain using attribute 'sex' is higher, so we select 'sex' as the node and continue to compute Entropy and Information Gain for all other attributes to grow the Decision Tree until every child node is pure or the Information Gain is 0
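
As a rough illustration of the formula, the sketch below computes Information Gain for three candidate splits of the same parent node. The class counts are hypothetical (they are not taken from the figure), and the helper names `entropy` and `information_gain` are illustrative.

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a node, given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it.

    parent:   class counts of the node being split
    children: one list of class counts per child node
    """
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node: 4 samples of class A, 4 of class B
parent = [4, 4]
gain_perfect = information_gain(parent, [[4, 0], [0, 4]])  # both children pure
gain_partial = information_gain(parent, [[3, 1], [1, 3]])  # children still mixed
gain_useless = information_gain(parent, [[2, 2], [2, 2]])  # children look like the parent

print(round(gain_perfect, 3), round(gain_partial, 3), round(gain_useless, 3))
# 1.0 0.189 0.0
```

The attribute that produces the split with the largest gain would be chosen for the node.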

### Gini Index
Gini Index or Gini Score (also called Gini Impurity) is a metric that measures how often a randomly chosen element would be incorrectly identified if it were labelled at random according to the class distribution of the node <br>
Like Entropy, it is an impurity criterion that can be used to evaluate candidate splits <br>
Gini Index quantifies the amount of uncertainty of a single node <br>
*Attribute with lower Gini Index should be preferred for splitting the tree* <br>
$ \text{Gini Index} = \sum_{k}{p_k (1 - p_k)} = 1 - \sum_{k}{p_k^2} $ <br>
where $p_k$ is the proportion of samples in a particular group (node) that belong to class $k$
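
A minimal sketch of the Gini computation, using the equivalent form $1 - \sum_k p_k^2$ and the class counts from the Purity examples above. The helper name `gini` is illustrative.

```python
def gini(counts):
    """Gini index of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([8, 0]))            # pure node: 0.0
print(gini([6, 2]))            # impure node: 0.375
print(gini([4, 4]))            # evenly split two-class node: 0.5
print(round(gini([3, 3, 3]), 3))  # evenly split three-class node: 0.667
```

For two classes the Gini index peaks at 0.5, whereas base-2 Entropy peaks at 1.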

## How to build a Decision Tree
A Decision Tree is constructed by considering all the attributes 1-by-1 <br>
**STEPS**
1. Choose an attribute from dataset
2. Calculate the significance of attribute in splitting of data (Information Gain / Gini Index)
3. Once every candidate attribute has been scored, split the data based on the most significant attribute
4. Go to step 1 i.e. go to each branch and repeat the steps for the rest of the attributes
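
The steps above can be sketched as a small recursive function. This is an ID3-style illustration for categorical attributes only, not how scikit-learn builds its trees; the toy rows, the attribute names and the helpers `entropy`, `information_gain` and `build_tree` are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Group the labels by the value each row takes for this attribute
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes are left: return a leaf
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Steps 1-2: score every remaining attribute
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    if information_gain(rows, labels, best) <= 1e-12:
        return majority  # no attribute reduces the entropy any further
    # Step 3: split on the most significant attribute
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        # Step 4: repeat on each branch with the rest of the attributes
        branches[value] = build_tree([r for r, _ in pairs],
                                     [l for _, l in pairs], remaining)
    return {best: branches}

rows = [{"sex": "F", "age": "young"}, {"sex": "F", "age": "old"},
        {"sex": "M", "age": "young"}, {"sex": "M", "age": "old"}]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["sex", "age"])
print(tree)  # {'sex': {'F': 'yes', 'M': 'no'}}
```

In this toy data 'sex' separates the classes perfectly while 'age' carries no information, so the tree needs a single split.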


### How to pick features for splitting
The significant attribute is selected by comparing the following measures of each attribute:

1. Gini Index (low value desired)
2. Information Gain (high value desired)
3. Reduction in Variance (only in Regression problems - high value desired)
4. Chi Square (high value desired)
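
For regression targets, Reduction in Variance plays the role that Information Gain plays for classification: the variance of the parent minus the size-weighted variance of the children, so a larger reduction indicates a better split. A small sketch with made-up target values (the helper name `variance_reduction` is illustrative):

```python
from statistics import pvariance

def variance_reduction(parent, left, right):
    """Variance of the parent minus the size-weighted variance of the children."""
    n = len(parent)
    weighted = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    return pvariance(parent) - weighted

y = [1, 2, 3, 10, 11, 12]  # hypothetical target values at a node

good = variance_reduction(y, [1, 2, 3], [10, 11, 12])  # groups similar targets together
poor = variance_reduction(y, [1, 10, 3], [2, 11, 12])  # mixes small and large targets

print(round(good, 2), round(poor, 2))
```

The split that groups similar targets together removes far more variance and would be preferred.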


### Advantages of CART

- Simple to understand, interpret and visualize
- Can handle both categorical and numerical data. Can also handle multi-output problems
- Resistant to outliers, hence require little data preprocessing
- Decision trees implicitly perform variable screening or feature selection
- Nonlinear relationships between parameters do not affect tree performance

### Disadvantages of CART

- Decision-tree learners can create over-complex trees that do not generalize the data well. Thus they are prone to overfitting
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called *variance*, which needs to be lowered by methods like *bagging* and *boosting*
- Decision tree learners create *biased trees* if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree
- Since Decision Tree learners fall under the Greedy algorithms paradigm, they cannot guarantee to return the globally optimal model (decision tree)


## Avoid Overfitting

### Pruning
To prevent a decision tree from overfitting, we **remove the branches that make use of features having low importance** after the tree has been grown. This method is called **Pruning** or **post-pruning**<br>
This way **the complexity of tree is reduced, which improves the predictive accuracy by the reduction of overfitting**. <br>
Pruning can start at either root or the leaves. <br>
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set
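
One way to post-prune in scikit-learn is minimal cost-complexity pruning through the `ccp_alpha` parameter. The sketch below is only one possible recipe: it uses a synthetic dataset from `make_classification` (an assumption, not the notebook's data) and a single held-out validation split to pick the pruning strength.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fully grown (unpruned) tree as the baseline
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths, in increasing order
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, full.score(X_val, y_val)
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, keep the larger alpha (the smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("nodes before:", full.tree_.node_count, "after:", pruned.tree_.node_count)
```

The pruned tree is never larger than the full tree, and its validation accuracy is at least as good on this split.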


### Early Stop
An alternative method to prevent overfitting is to **try and stop the tree-building process early, before it produces leaves with very small samples**. This heuristic is known as **early stopping** but is also sometimes known as **pre-pruning** decision trees. <br>
At each stage of splitting the tree, we check the cross-validation error. If the error does not decrease significantly enough then we stop. <br>
Early stopping **may underfit** by stopping too early: the current split may be of little benefit, but having made it, subsequent splits could reduce the error more significantly <br><br>
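
In scikit-learn, early stopping is usually expressed through growth limits such as `max_depth`, `min_samples_leaf` and `min_impurity_decrease`, rather than by checking the cross-validation error at every split. The sketch below uses those parameters on a synthetic dataset; the specific values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# No limits: the tree grows until its leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is hit
stopped = DecisionTreeClassifier(
    max_depth=3,                 # never grow deeper than 3 levels
    min_samples_leaf=10,         # every leaf must keep at least 10 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=0,
).fit(X, y)

print("full   :", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("stopped:", stopped.get_depth(), "levels,", stopped.get_n_leaves(), "leaves")
```

Suitable limits are typically chosen by cross-validation, for example with `GridSearchCV`.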

**NOTE:** Early Stopping and Pruning can be used together, separately, or not at all

**NOTE:** In Machine Learning, **Ensemble methods** use (or combine) multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone

## Bagging and Boosting

Decision Tree algorithms are highly prone to *Variance* i.e. a small variation in the data might result in a completely different tree being generated. <br>
**Bagging** and **Boosting** are ensemble techniques: bagging mainly decreases the variance in prediction, while boosting mainly reduces bias <br><br>

### Bagging
**Bootstrap Aggregating**, also known as **Bagging**, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. <br>
It decreases the variance and helps to avoid overfitting.<br> 
It is a special case of the model averaging approach. <br>
**STEPS**:
- create several subsets of the training data set, each chosen randomly with replacement
- train base Decision Trees in parallel, one per training data subset, independently of each other
- the ensemble prediction is obtained by taking the average output of all models ((30+23+44+21+50)/5 = 33.6)(regression) or the most frequent class (Red, Blue, Red, Red, Blue = Red)(classification)

![Bagging.png](images/Bagging.png)
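
The three steps can be written out by hand as a rough sketch. It assumes a synthetic binary dataset and five trees; in practice `sklearn.ensemble.BaggingClassifier` or `RandomForestClassifier` would normally be used instead of this manual loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(5):
    # Step 1: bootstrap sample, i.e. row indices drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: each tree is trained independently on its own sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3 (classification): majority vote across the trees.
# Labels are 0/1 and the number of trees is odd, so there are no ties.
votes = np.array([tree.predict(X) for tree in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

# Step 3 (regression): average the individual outputs
avg = np.mean([30, 23, 44, 21, 50])
print(avg)  # 33.6
```

Because each tree sees a different bootstrap sample, their individual errors tend to differ, and aggregating them smooths out part of that variance.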

### Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. The basic idea behind boosting is to combine many weak learners into a single strong learner. <br>
Boosting needs us to specify a weak model (e.g. regression, shallow decision trees, etc) and then improves it.
In this model, learners learn sequentially and adaptively to improve model predictions of a learning algorithm. <br>
**STEPS**:
- Firstly, a model is built from the training data (weak model)
- Then the second model is built which tries to correct the errors present in the first model
- This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models is added
<br><br>

Popular Boosting Algorithms : 
- AdaBoost or Adaptive Boosting for Classification problems which is implemented using iteratively refined sample weights
- Gradient Boosting uses an internal regression model trained iteratively on the residuals
- Extreme Gradient Boosting or XGBoost

AdaBoost:
![adaboost.png](images/adaboost.png)
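
A short sketch of two of the algorithms listed above, using scikit-learn's `AdaBoostClassifier` and `GradientBoostingClassifier` on a synthetic dataset. The dataset and the hyperparameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# AdaBoost: by default each weak learner is a depth-1 tree (a "stump");
# samples misclassified so far receive larger weights in the next round
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Gradient Boosting: each new shallow regression tree is fit to the
# residual errors (gradients of the loss) of the ensemble built so far
gbm = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0).fit(X_train, y_train)

ada_acc = ada.score(X_test, y_test)
gbm_acc = gbm.score(X_test, y_test)
print("AdaBoost accuracy          :", ada_acc)
print("Gradient Boosting accuracy :", gbm_acc)
```

Unlike bagging, the learners here are built one after another, so training cannot be parallelised across rounds.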

### Stacking
Stacking is another popular ensemble modeling technique in machine learning. Various weak learners are trained in parallel, and their predictions are fed to a meta-learner that learns how to combine them into a better final prediction.

![stacking.png](images/stacking.png)
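
A minimal stacking sketch with scikit-learn's `StackingClassifier`. The choice of base learners (a shallow tree and k-nearest neighbours), the logistic-regression meta-learner and the synthetic dataset are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    # Base learners, trained independently of each other
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    # Meta-learner, trained on the base learners' cross-validated predictions
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_acc = stack.score(X_test, y_test)
print("Stacking accuracy:", stack_acc)
```

Training the meta-learner on cross-validated predictions keeps it from simply trusting whichever base learner memorised the training set.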

## Decision Tree using sklearn

## Decision Tree Classifier using sklearn

In [15]:
#import libraries

import numpy as np
import pandas as pd

In [3]:
# import data

df = pd.read_csv("SUV.csv")

In [4]:
# EDA

print(df.sample(4))
print(df.shape)

      User ID  Gender  Age  EstimatedSalary  Purchased
188  15674206    Male   35            72000          0
343  15629739  Female   47            51000          1
166  15762228  Female   22            55000          0
257  15794493    Male   40            57000          0
(400, 5)


In [5]:
X = pd.DataFrame(df.iloc[:, [1,2,3]].values, columns = ['sex', 'age', 'income'])
y = pd.DataFrame(df.iloc[:, 4].values, columns = ['purchased'])

In [None]:
# Feature Engineering

In [7]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
X['sex'] = label.fit_transform(X['sex'])

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

In [10]:
# train-test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

In [12]:
# Model Building / Decision Tree Building

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion="entropy")

model.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy')

In [13]:
y_pred = model.predict(X_test)

In [14]:
# Model Evaluation

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print('Accuracy :', accuracy, '\nConfusion Matrix :\n', cm)

Accuracy : 0.85 
Confusion Matrix :
 [[55  7]
 [ 8 30]]
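
The reported accuracy can be recovered from the confusion matrix by hand. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. Precision and recall are added here as extra illustrations; they are not computed in the cell above.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[55, 7],
      [8, 30]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp      # 100 test samples (25% of 400)

accuracy = (tn + tp) / total   # (55 + 30) / 100
precision = tp / (tp + fp)     # 30 / 37
recall = tp / (tp + fn)        # 30 / 38

print(accuracy, round(precision, 3), round(recall, 3))
# 0.85 0.811 0.789
```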


## Decision Tree Regressor using sklearn

In [16]:
# import libraries

import pandas as pd
import numpy as np

In [17]:
# import data

data = pd.read_csv('winequality.csv')

In [25]:
# EDA

print(data.shape)
print(list(data.columns))
data.sample(4)

(1599, 12)
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']


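|     | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|-----|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|---------|------|-----------|---------|---------|
| 62  | 7.5 | 0.52  | 0.16 | 1.9 | 0.085 | 12.0 | 35.0 | 0.9968  | 3.38 | 0.62 | 9.5 | 7 |
| 253 | 7.7 | 0.775 | 0.42 | 1.9 | 0.092 | 8.0  | 86.0 | 0.9959  | 3.23 | 0.59 | 9.5 | 5 |
| 303 | 7.4 | 0.67  | 0.12 | 1.6 | 0.186 | 5.0  | 21.0 | 0.996   | 3.39 | 0.54 | 9.5 | 5 |
| 231 | 8.0 | 0.38  | 0.06 | 1.8 | 0.078 | 12.0 | 49.0 | 0.99625 | 3.37 | 0.52 | 9.9 | 6 |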


In [26]:
X, y = pd.DataFrame(data.iloc[:, :-1].values), pd.DataFrame(data.iloc[:,-1:].values)

In [27]:
# Feature Engineering

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
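# fit_transform returns a new array and does not modify X in place, so assign the result
# (tree splits do not depend on feature scale, so this step is optional for Decision Trees)
X = scaler.fit_transform(X)
X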

array([[-0.52835961,  0.96187667, -1.39147228, ...,  1.28864292,
        -0.57920652, -0.96024611],
       [-0.29854743,  1.96744245, -1.39147228, ..., -0.7199333 ,
         0.1289504 , -0.58477711],
       [-0.29854743,  1.29706527, -1.18607043, ..., -0.33117661,
        -0.04808883, -0.58477711],
       ...,
       [-1.1603431 , -0.09955388, -0.72391627, ...,  0.70550789,
         0.54204194,  0.54162988],
       [-1.39015528,  0.65462046, -0.77526673, ...,  1.6773996 ,
         0.30598963, -0.20930812],
       [-1.33270223, -1.21684919,  1.02199944, ...,  0.51112954,
         0.01092425,  0.54162988]])

In [28]:
# train test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.2,)

In [29]:
# Model building / Decision Tree Regressor building

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()

model.fit(X_train, y_train)

DecisionTreeRegressor()

In [30]:
y_pred = model.predict(X_test)

In [31]:
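# Model Evaluation

from sklearn.metrics import r2_score, accuracy_score

# R2 is a regression metric and is the appropriate score for a regressor
r2 = r2_score(y_test, y_pred)
# accuracy_score is a classification metric; it runs here only because 'quality'
# is an integer score and the fully grown tree predicts those integer values
accuracy = accuracy_score(y_test, y_pred)

print('Accuracy :', accuracy, '\nR2 Score :', r2)

Accuracy : 0.6375 
R2 Score : 0.018193031144002503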


### References

- https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
- https://towardsdatascience.com/decision-tree-in-machine-learning-e380942a4c96
- https://medium.com/pursuitnotes/decision-tree-classification-in-9-steps-with-python-600c85ef56de
- https://medium.com/pursuitnotes/decision-tree-regression-in-6-steps-with-python-1a1c5aa2ee16
- https://towardsdatascience.com/boosting-the-accuracy-of-your-machine-learning-models-f878d6a2d185
- https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/
- https://towardsdatascience.com/boosting-algorithms-explained-d38f56ef3f30