# Video 16. Decision Tree Algorithm
***

- A decision tree starts with a problem to solve
- A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence or reaction


### Problems Decision Trees can Solve
- **Classification**
    - A classification tree will determine a set of logical if-then conditions to classify problems
    - For example, discriminating between three types of flowers based on certain features
- **Regression**
    - A regression tree is used when the target variable is numerical or continuous in nature
    - We fit a regression model to the target variable using each of the independent variables
    - Each split is chosen to minimize the sum of squared errors
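To make the sum-of-squared-errors criterion concrete, here is a minimal sketch that tries every threshold on a single feature and keeps the one with the lowest total error. The helper names `sse` and `best_split` and the toy data are hypothetical and chosen only for illustration; real libraries use more efficient search.

```python
def sse(values):
    """Sum of squared errors of `values` around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best split of the form x <= threshold."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data: the target jumps once x goes above 3
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.9, 5.0]

threshold, total = best_split(xs, ys)
print(threshold, round(total, 3))
```

On this toy data the split lands at `x <= 3`, because that is where the two groups are each closest to their own mean.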
    
    
### Advantages of Decision Tree
- Simple to understand, interpret and visualize
- Little effort is required for data preparation
- Can handle both numerical and categorical data
- Non-linear parameters don't affect its performance
    - even if data doesn't fit nicely on a graph it can still be used to make predictions
    
    
### Disadvantages of Decision Tree
- Overfitting
    - occurs when the algorithm captures noise in the data
- High variance
    - the model can get unstable due to small variation in data
- Low-bias tree
    - a highly complicated decision tree tends to have a low bias: it fits the training data very closely, which makes it difficult for the model to work with new data
    
    
### Important Terms
**ENTROPY**
- Measure of randomness or unpredictability in the dataset

**INFORMATION GAIN**
- It is the measure of decrease in entropy after the dataset is split

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/1.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
**LEAF NODE**
- Carries the classification or the decision

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/2.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
**ROOT NODE**
- The topmost decision node is known as the root node

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/3.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
    
### How Does a Decision Tree Work?
- Here are some animals:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/4.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- *Problem Statement*: classify the different types of animals based on their features using a decision tree
- The dataset looks quite messy, so the entropy is high in this case
- Let's look at the training dataset:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/5.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- We have to frame the conditions that split the data in such a way that the information gain is the highest
    - Gain is the measure of decrease in entropy after splitting
- And here's a formula for entropy:

$$ \large -\sum_{i=1}^{k} P(value_i) \log_2 (P(value_i)) $$

- Let's try to calculate the entropy for the current dataset
- We have:
    - 3 giraffes
    - 2 tigers
    - 1 monkey
    - 2 elephants
        - total of 8 animals
- If we plug that into the formula:

$$ ENTROPY = -\left[ (\frac{3}{8}) \log_2 (\frac{3}{8}) + (\frac{2}{8}) \log_2 (\frac{2}{8}) + (\frac{1}{8}) \log_2 (\frac{1}{8}) + (\frac{2}{8}) \log_2 (\frac{2}{8}) \right] $$<br>
$$ ENTROPY \approx 1.906 $$
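The same calculation can be checked in a few lines of Python. This is a minimal sketch using only the animal counts listed above; the helper name `entropy` is chosen here for illustration.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (base 2) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    # Classes with a zero count contribute nothing, so they are skipped
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


animal_counts = {"giraffe": 3, "tiger": 2, "monkey": 1, "elephant": 2}
dataset_entropy = entropy(animal_counts.values())
print(round(dataset_entropy, 3))
```

With base-2 logarithms the result is about 1.906 bits. A dataset containing only one class would give an entropy of 0.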

- We will calculate the entropy of the dataset similarly after every split to calculate the gain
- Gain is the entropy before the split minus the weighted average entropy of the subsets after the split
- Now we will try to choose a condition that gives us the highest gain
- We will do that by splitting the data using each condition and checking the gain that we get out of them
- Here are possible conditions:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/6.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- We will split the data by the color yellow:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/7.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- The entropy after splitting has decreased considerably
- However, we still need to split further on both branches to attain an entropy value equal to zero
- So, we decide to split both the nodes using **height** as the condition:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/210319/8.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- Since every branch now contains a single label type, we can say that the entropy in this case has reached its minimum value of zero
- This tree can now predict all the classes of animals present in the training dataset with 100% accuracy
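The walkthrough above can be sketched in Python as follows. The feature values in `rows` are an assumption made for illustration (the actual training table is only shown in the image): giraffes and tigers are taken to be yellow, and giraffes and elephants are taken to be tall. Only the class counts come from the notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, condition):
    """Parent entropy minus the weighted entropy of the two subsets."""
    labels = [r["label"] for r in rows]
    yes = [r["label"] for r in rows if condition(r)]
    no = [r["label"] for r in rows if not condition(r)]
    weighted = sum(len(part) / len(rows) * entropy(part) for part in (yes, no) if part)
    return entropy(labels) - weighted


# Assumed feature values; only the class counts are taken from the notes
rows = (
    [{"color": "yellow", "height": "tall", "label": "giraffe"}] * 3
    + [{"color": "yellow", "height": "short", "label": "tiger"}] * 2
    + [{"color": "brown", "height": "short", "label": "monkey"}] * 1
    + [{"color": "grey", "height": "tall", "label": "elephant"}] * 2
)

gain_color = information_gain(rows, lambda r: r["color"] == "yellow")
print(round(gain_color, 3))

# After splitting on color and then height, every group holds a single label
groups = {}
for r in rows:
    groups.setdefault((r["color"] == "yellow", r["height"]), []).append(r["label"])
all_pure = all(len(set(labels)) == 1 for labels in groups.values())
print(all_pure)
```

Under these assumed features, the color split removes roughly half of the starting entropy, and the follow-up height split leaves every leaf pure, which matches the tree described above.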


### Use Case - Loan Repayment Prediction
- A lender says: *I need to find out if my customers are going to return the loan they took from my bank or not*
- Predict whether a customer will repay the loan amount or not using the decision tree algorithm in Python

We'll import the libraries

        import numpy as np
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score
        from sklearn import tree
        
Loading the dataset

        data = pd.read_csv("data")
        
Separating the features and the target variable

        # Columns 1-4 hold the features; column 0 holds the target
        X = data.values[:, 1:5]
        y = data.values[:, 0]
        
Splitting dataset into training and test

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        
Training the classifier with entropy as the split criterion

        clf_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=5)
        clf_entropy.fit(X_train, y_train)
        
Making predictions on the test set

        y_pred = clf_entropy.predict(X_test)
        
Checking accuracy

        print("Accuracy is: {}".format(accuracy_score(y_test, y_pred) * 100))
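Since the loan dataset itself is not included in these notes, the following is a self-contained sketch of the same pipeline run on synthetic data. The data from `make_classification`, the `random_state` values, and the `feature_*` names are assumptions added so the example can run on its own; they are not part of the original use case.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the loan data: 4 numeric features, binary target
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Hold out 30% of the rows for testing, as in the notes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Same hyperparameters as above; random_state is added for reproducibility
clf_entropy = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, min_samples_leaf=5, random_state=100
)
clf_entropy.fit(X_train, y_train)

y_pred = clf_entropy.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy is: {:.1f}".format(accuracy * 100))

# Print the learned if-then rules as text (feature names are placeholders)
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]
print(export_text(clf_entropy, feature_names=feature_names))
```

Printing the rules with `export_text` is one way to see the "simple to interpret" advantage in practice: each line of the output is a threshold test on one feature, ending in a predicted class.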