# Decision Trees
<li>Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.</li>
<li>It is a non-parametric learning algorithm because it does not make any assumptions about the underlying data distribution or its parameters.</li>
<li>The goal of decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.</li>
<li>A decision tree has a hierarchical, tree-like structure that consists of a root node, branches, internal nodes, and leaf nodes.</li>

![](images/decision_trees.png)

## How Do Decision Trees Work?
<li>The decision tree algorithm builds the tree in a recursive way, by selecting the best attribute to split the data at each node based on some criterion.</li>
<li>Common criteria for splitting a decision node are information gain and Gini impurity.</li>
<li>Information gain measures the reduction in entropy (i.e., uncertainty) of the class labels after a split.</li>
<li>Entropy is defined as a measure of randomness or disorder of a system.</li>
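<li>Information gain is the entropy of the parent node minus the weighted entropy of the child nodes, so the two move in opposite directions.</li>
<li>For a given parent node, the lower the weighted entropy of the children after a split, the higher the information gain, and the higher that entropy, the lower the information gain.</li>
<li>Gini impurity measures the probability of misclassifying a random sample from the node if it were labeled at random according to the node's class distribution.</li>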
<li>The process continues until all the instances in a node belong to the same class or until a stopping criterion is met.</li>
<li>Typical stopping criteria are a maximum tree depth or a minimum number of instances per leaf.</li>
<li>The resulting tree can be used to classify new instances by traversing from the root to a leaf node, following the path that satisfies the tests at each node.</li>
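
The traversal described in the last point can be sketched in a few lines of Python. The nested-dictionary layout, the `toy_tree` contents, and the `classify` helper below are illustrative assumptions, not the output of any particular algorithm or library.

```python
# Hypothetical hand-built tree: internal nodes test one attribute,
# leaves are plain class labels.
toy_tree = {
    "attribute": "Outlook",
    "branches": {
        "Overcast": "Yes",
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Rain": {
            "attribute": "Wind",
            "branches": {"Weak": "Yes", "Strong": "No"},
        },
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches the instance."""
    node = tree
    while isinstance(node, dict):          # internal node: apply its test
        value = instance[node["attribute"]]
        node = node["branches"][value]     # move down the matching branch
    return node                            # leaf: the predicted class label


sample = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}
print(classify(toy_tree, sample))
```

Each loop iteration corresponds to one test on the path from the root to a leaf, so prediction cost grows with the depth of the tree rather than with the size of the training data.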

![](images/working_of_dtrees.png)

## Decision Tree Inducers (Types Of Decision Tree Algorithm)
<li>A decision tree inducer is an algorithm that is used to build a decision tree from a given dataset. Here are some commonly used decision tree inducers:</li>
<ol>
    <li><b>ID3</b></li>
    <li><b>C4.5</b></li>
    <li><b>CART</b></li>
</ol>

<b>1. ID3:</b>
<li>ID3 stands for Iterative Dichotomiser 3.</li>
<li>This is one of the earliest decision tree algorithms, developed by Ross Quinlan.</li>
<li>It uses the concept of entropy and information gain to select the best attribute for splitting the data at each node.</li>
<li>It cannot handle numeric features, and it can be used for classification tasks only.</li>

<b>2. C4.5:</b>
<li>The name C4.5 is sometimes expanded as "Classifier Version 4.5".</li>
<li>It is a decision tree algorithm that was developed by Ross Quinlan, and it is an extension of the earlier ID3 algorithm.</li>
<li>The C4.5 algorithm can handle both discrete and continuous data.</li>
<li>It uses <b>information gain ratio</b> as the splitting criterion.</li>
<li>It also includes a post-pruning step to reduce overfitting.</li>
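
To illustrate why C4.5 prefers the gain ratio over plain information gain, here is a minimal sketch using the usual definition: information gain divided by the split information, which is the entropy of the partition sizes. The helper names and the toy columns `row_id` and `useful` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Information gain of the split, normalised by the split information."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    total = len(labels)
    info_gain = entropy(labels) - sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    split_info = entropy(attribute_values)  # entropy of the partition sizes
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["Yes", "Yes", "No", "No"]
row_id = ["a", "b", "c", "d"]   # unique value per row (hypothetical)
useful = ["x", "x", "y", "y"]   # two values, perfectly predictive (hypothetical)

print(gain_ratio(row_id, labels))
print(gain_ratio(useful, labels))
```

Both attributes reach the same information gain on this toy data, but the many-valued `row_id` is penalised by its larger split information, so `useful` gets the higher gain ratio.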

<b>3. CART:</b>
<li>The full form of CART is Classification And Regression Trees.</li>
<li>This is a decision tree algorithm developed by Breiman, Friedman, Olshen, and Stone.</li>
<li>It can be used for both classification and regression tasks.</li>
<li>For classification, it uses the Gini impurity measure to select the best attribute for splitting the data.</li>

![](images/decision_tree_inducers.png)

## Entropy

<li>We use the concepts of entropy and information gain when splitting a node in the ID3 algorithm.</li>
<li>Entropy is defined as a measure of randomness or disorder in the system.</li>
<li>The formula to calculate entropy is given by:</li>

![](images/Entropy_formula.png)

<li>Here, c is the number of classes. For a binary classification problem, the entropy formula becomes:</li>

![](images/expanded_eqn_entropy.png)

<li>Here, p is the probability that a sample belongs to the positive class and q is the probability that it belongs to the negative class.</li>
<li>Let's say you are predicting whether an employee will get a promotion or not.</li>
<li>If only 30% of the employees in your dataset have received a promotion, then p = 0.3 for the positive class and q = 1 - p = 0.7 for the negative class.</li>
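
As a quick check of the promotion example, the sketch below evaluates the entropy for p = 0.3 and q = 0.7. It assumes the standard base-2 Shannon entropy (the formula itself is only shown as an image above), and `entropy` is a helper name chosen for this illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


p = 0.3      # probability of the positive class (promoted)
q = 1 - p    # probability of the negative class (not promoted)

print(round(entropy([p, q]), 4))   # promotion example
print(entropy([0.5, 0.5]))         # evenly mixed node: maximum for two classes
print(entropy([1.0, 0.0]))         # pure node: no uncertainty
```

The promotion example gives roughly 0.88 bits, which is close to the two-class maximum of 1 bit because a 30/70 split is still fairly mixed.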

## Information Gain & Splitting Of Node In ID3 Algorithm
<li>One of the key steps in the ID3 algorithm is to split a node into child nodes based on the attribute that maximizes the information gain.</li>

<li>Information gain is a measure of the reduction in entropy (impurity) of the dataset after splitting the data based on an attribute.</li>
<li>Entropy is a measure of the randomness or uncertainty in the dataset.</li>

**The formula for information gain is:**
<code>
Information Gain = Entropy(parent) - ∑ [weight(child) * Entropy(child)]
</code>
**where**

<li>Entropy(parent) is the entropy of the parent node</li>
<li>Entropy(child) is the entropy of one child node, and the sum runs over all child nodes</li>
<li>weight(child) is the proportion of the parent's samples that fall into that child node.</li>


![](images/information_gain_id3.png)
<li>First, we calculate the entropy of the parent node.</li>
<li>After calculating the entropy, we calculate the information gain for each of the attributes.</li>
<li>The attribute that results in the highest information gain is selected as the splitting attribute for the node.</li>
<li>The node is then split into child nodes based on the values of the selected attribute.</li>
<li>This process is repeated recursively until all leaf nodes are pure (contain only one class) or until some stopping criterion is met.</li>
<li>In this way, the ID3 algorithm uses information gain to select the attribute to split a node and to construct a decision tree from the dataset.</li>



```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

```python
gameplay_df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain",
                "Rain", "Overcast", "Sunny", "Sunny", "Rain",
                "Sunny", "Overcast", "Overcast", "Rain",
                "Sunny", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool",
                    "Cool", "Cool", "Mild", "Cool", "Mild",
                    "Mild", "Mild", "Hot", "Mild",
                    "Hot", "Mild", "Cool"],
    "Humidity": ["High", "High", "High", "High", "Normal",
                 "Normal", "Normal", "High", "Normal", "Normal",
                 "Normal", "High", "Normal", "High",
                 "Normal", "High", "Normal"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak",
             "Strong", "Strong", "Weak", "Weak", "Weak",
             "Strong", "Strong", "Weak", "Strong",
             "Strong", "Weak", "Strong"],
    "Play": ["No", "No", "Yes", "Yes", "Yes",
             "No", "Yes", "No", "Yes", "Yes",
             "Yes", "Yes", "Yes", "No",
             "Yes", "Yes", "No"]
})
```

```python
gameplay_df
```

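|   | Outlook | Temperature | Humidity | Wind | Play |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | Weak | No |
| 1 | Sunny | Hot | High | Strong | No |
| 2 | Overcast | Hot | High | Weak | Yes |
| 3 | Rain | Mild | High | Weak | Yes |
| 4 | Rain | Cool | Normal | Weak | Yes |
| 5 | Rain | Cool | Normal | Strong | No |
| 6 | Overcast | Cool | Normal | Strong | Yes |
| 7 | Sunny | Mild | High | Weak | No |
| 8 | Sunny | Cool | Normal | Weak | Yes |
| 9 | Rain | Mild | Normal | Weak | Yes |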
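
To make the ID3 split selection concrete, the sketch below applies the information gain formula to the ten rows displayed above, with `Play` as the target. It uses plain Python rather than pandas so that it runs on its own; the `entropy` and `information_gain` helpers are names chosen for this illustration.

```python
import math
from collections import Counter, defaultdict

# The ten rows displayed above: (Outlook, Temperature, Humidity, Wind, Play)
columns = ["Outlook", "Temperature", "Humidity", "Wind"]
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column_index):
    """Entropy(parent) minus the weighted entropy of the children of the split."""
    parent_labels = [row[-1] for row in rows]
    children = defaultdict(list)
    for row in rows:
        children[row[column_index]].append(row[-1])  # group labels by attribute value
    weighted_child_entropy = sum(
        len(labels) / len(rows) * entropy(labels) for labels in children.values()
    )
    return entropy(parent_labels) - weighted_child_entropy


gains = {name: information_gain(rows, i) for i, name in enumerate(columns)}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Split on:", best_attribute)
```

On these ten rows `Outlook` has the highest information gain, so ID3 would choose it for the root split; the result may differ when all rows of `gameplay_df` are used.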


### Gini Index 
<li>Gini index is a measure of impurity or diversity used to select the best split in decision trees.</li>
<li>In the context of decision trees, it is used to measure the quality of a split when determining the feature that should be used to create child nodes.</li>
<li>The main goal of measuring impurity is to create child nodes that are as pure as possible in terms of the target variable.</li>
<li>The Gini index measures the probability of misclassifying a randomly chosen element from a dataset if it were labeled at random according to the class distribution of that dataset.</li>
<li>It ranges from 0 to 1 - 1/c for c classes (0.5 for binary classification), where 0 indicates a pure node and the upper bound indicates maximum impurity.</li>
<li>A pure node is a node where all elements belong to the same class.</li>
<li>A maximally impure node is a node where elements are equally distributed across all classes.</li>

**The formula for calculating the Gini index for a leaf node is:**
<code>
Gini Index(Leaf) = 1 - ∑(p_i^2)
</code>

**where p_i is the proportion of samples that belong to class i in the node.**

<li>After calculating the Gini index for each leaf node of a candidate split, the weighted Gini index for the split is calculated using the following formula.</li>

![](images/weighted_gini_index.png)

<li>When selecting a split in a decision tree, the feature that results in the lowest weighted Gini index (highest purity) is chosen.</li>
<li>The resulting split divides the dataset into two or more child nodes, which are then processed recursively to create the decision tree.</li>
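
The Gini calculation can be sketched as follows. The weighted form used here, each child's Gini index multiplied by its share of the samples, is the usual definition and is assumed to match the formula in the image above; the helper names and toy labels are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini index of one node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(children):
    """Gini index of a split: child Gini values weighted by child size."""
    total = sum(len(labels) for labels in children)
    return sum(len(labels) / total * gini(labels) for labels in children)


pure = ["Yes"] * 4                    # all samples in one class
mixed = ["Yes", "Yes", "No", "No"]    # evenly split between two classes

print(gini(pure))
print(gini(mixed))
print(weighted_gini([pure, mixed]))
```

Among several candidate splits, the one with the smallest `weighted_gini` value would be chosen.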



### Decision Tree For Regression
<li>For classification, a decision tree splits a node by maximizing information gain in the case of ID3 or by minimizing the Gini index in the case of CART.</li>
<li>For regression, the goal is instead to reduce the variance of the target variable (i.e., the dependent variable).</li>
<li>Decision tree regressors work on the principle of variance reduction because the target variable is continuous.</li>
<li>This is typically done by minimizing the sum of squared differences between the target variable and the mean value of the samples in each resulting group.</li>

### How Splitting Of Node is Done in Decision Tree Regressor

<li>The decision tree regressor considers all possible splits for each predictor variable and selects the one that maximizes the variance reduction.</li>
<li>The process is repeated recursively for each resulting group until a stopping criterion is met.</li>
<li>Common stopping criteria include a minimum number of samples required to split a node and a maximum tree depth.</li>
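
A minimal sketch of the split search for a single numeric predictor is shown below. It assumes binary splits at midpoints between consecutive sorted feature values and measures variance reduction as the parent variance minus the size-weighted variance of the two groups; the `best_split` helper and the toy data are hypothetical.

```python
def variance(values):
    """Population variance: mean squared difference from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (threshold, variance_reduction) of the best binary split on one feature."""
    pairs = sorted(zip(x, y))
    parent_variance = variance(y)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(pairs)
        reduction = parent_variance - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Toy data: the target jumps when the feature goes from 3 to 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]

threshold, reduction = best_split(x, y)
print(threshold, round(reduction, 3))
```

A full regressor would run this search over every predictor, keep the best (feature, threshold) pair, and recurse on the two resulting groups until a stopping criterion is met.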