### Geometric Intuition

* Alongside supervised learning algorithms such as
    - K-NN
    - Naive Bayes
    - Logistic Regression
    - Linear Regression
    - SVM
    we have another algorithm popularly known as **Decision Trees**.

* A decision tree is essentially a classifier made of nested `if-else` conditions.

* Decision tree models are highly interpretable.

* Conversely, a piece of nested `if-else` code can be represented as a decision tree.

![dt-1](https://user-images.githubusercontent.com/63333753/123220619-c2785080-d4eb-11eb-9093-f23d5ffc706f.png)

* A decision tree can be visualized geometrically.

![dt-2](https://user-images.githubusercontent.com/63333753/123220683-cc9a4f00-d4eb-11eb-8fed-b4261c4ba0bd.png)

> In a DT, every decision boundary is an axis-parallel hyperplane. Intuitively, a decision tree is a set of axis-parallel hyperplanes that partitions the feature space.
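As a minimal sketch of both ideas, the function below is a hand-written nested `if-else` classifier over two hypothetical features `x1` and `x2`. The thresholds (`2.0`, `1.5`, `3.0`) and the class names are invented for illustration. Each comparison corresponds to one axis-parallel line in the 2-D feature plane.

```python
def classify(x1: float, x2: float) -> str:
    """Toy decision tree written as nested if-else (illustrative thresholds)."""
    if x1 < 2.0:              # first decision: vertical line x1 = 2.0
        if x2 < 1.5:          # second decision: horizontal line x2 = 1.5
            return "class_a"
        return "class_b"
    if x2 < 3.0:              # second decision: horizontal line x2 = 3.0
        return "class_b"
    return "class_c"


print(classify(1.0, 1.0))  # class_a
print(classify(1.0, 2.0))  # class_b
print(classify(5.0, 4.0))  # class_c
```

Because every condition tests a single feature against a constant, the regions assigned to each class are axis-aligned rectangles.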

**Terminology**

* Root node → The very first node in a tree.
* Leaf nodes → The terminating nodes in a tree.
* Internal nodes → The nodes that are neither leaf nodes nor the root node.

> Every non-leaf node in a tree holds a decision/condition.

### Building a Decision Tree - Entropy

* The hardest part is building the decision tree from a training data set.

* Decision trees can be built using entropy, a concept from information theory.

* Suppose we are given a random variable $Y$ which can take $k$ values.

$$Y = \{y_1, y_2, y_3, \dots, y_k \}$$

* The entropy of $Y$, written $H(Y)$, is defined as

$$H(Y) = - \sum_{i=1}^k P(y_i) \log_b[P(y_i)]; \quad \text{where} \ b = 2 \ \text{or} \ b = e \approx 2.718$$

```python
import pandas as pd
import numpy as np

data_source = 'http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv'
df = pd.read_csv(data_source)
df.columns = ['outlook', 'temp', 'humidity', 'wind', 'class']

def compute_entropy(dframe, f):
    # Empirical probability of each distinct value in column f
    probs = dframe[f].value_counts(normalize=True).to_numpy()
    # H = -sum(p * log2(p)); every observed value has p > 0
    entropy = np.sum(probs * np.log2(probs))
    return round(float(-entropy), 3)
```

```python
for i in df.columns:
    ent_ = compute_entropy(dframe=df, f=i)
    print("{} → {}".format(i, ent_))
```

```
outlook → 1.577
temp → 1.557
humidity → 1.0
wind → 0.985
class → 0.94
```


### Properties of Entropy

Let $Y$ be a random variable which can take $2$ values $\rightarrow y_+, y_-$

* **Case 1**
    - $y_+ \rightarrow 99\%$
    - $y_- \rightarrow 1\%$
    - $H(Y) \rightarrow 0.0808$

* **Case 2**
    - $y_+ \rightarrow 50\%$
    - $y_- \rightarrow 50\%$
    - $H(Y) \rightarrow 1$

* **Case 3**
    - $y_+ \rightarrow 0\%$
    - $y_- \rightarrow 100\%$
    - $H(Y) \rightarrow 0$
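The three cases can be checked numerically. The following is a small sketch using only the standard library; the helper name `binary_entropy` is an assumption made for this example, and it uses base-2 logarithms with the usual convention that a zero-probability term contributes nothing.

```python
import math


def binary_entropy(p_pos: float) -> float:
    """Entropy (base 2) of a two-valued variable with P(y+) = p_pos."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0.0:  # convention: 0 * log(0) is treated as 0
            total -= p * math.log2(p)
    return total


# Evaluate the three cases: 99/1, 50/50 and 0/100
for p_pos in (0.99, 0.50, 0.0):
    print(p_pos, round(binary_entropy(p_pos), 4))
```

Entropy is highest when both values are equally likely and drops to zero when one value is certain.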

**Entropy Plot for Binary Class**

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c9/Binary_entropy_plot.png">

> The more peaked a distribution is, the smaller its entropy, and vice versa.

**Credits** - Image from Wikimedia Commons

### KL-Divergence

* Let $P$ and $Q$ be two probability distributions of the same random variable $X$.

* KL-Divergence measures how different $P$ is from $Q$. It is often described as a distance between the two distributions, although it is not symmetric: in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$.

* An alternative is the KS-Statistic, which computes the distance between $P$ and $Q$ from the cdf of $P$ and the cdf of $Q$.

* It takes the maximum gap by which the cdf of $P$ and the cdf of $Q$ are separated.

* The problem with the KS-Statistic is that it is not differentiable, and differentiation plays a major role in most machine learning methods.

* KL-Divergence gives a measure of dissimilarity that can be differentiated.

$$D_{KL}(P||Q) = \sum_x P(x) \log_2 \bigg[\frac{P(x)}{Q(x)}\bigg] \ \text{or} \ \int P(x) \log_2 \bigg[\frac{P(x)}{Q(x)}\bigg] \, dx$$
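The discrete form of the formula can be sketched in a few lines. The function name `kl_divergence` and the two example distributions are assumptions made for illustration; both lists are expected to describe probabilities over the same outcomes, in the same order.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) in bits for two discrete distributions on the same support."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # terms with P(x) = 0 contribute nothing
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log2(p_x / q_x)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))                       # identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```

The divergence of a distribution from itself is zero, and swapping the arguments generally changes the value.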

### Building a Decision Tree - Gini Impurity ($I_G$)

* Gini impurity is very similar to entropy.
* Given $Y = \{y_1, y_2, y_3, \dots, y_k\}$, Gini impurity is defined as

$$I_G(Y) = 1 - \sum_{i=1}^k \big[P(y_i)\big]^2$$

* Let $Y$ be a random variable which can take $2$ values $\rightarrow y_+, y_-$

    * **Case 1**
        - $y_+ \rightarrow 50\%$
        - $y_- \rightarrow 50\%$
        - $I_G(Y) \rightarrow 0.5$

    * **Case 2**
        - $y_+ \rightarrow 0\%$
        - $y_- \rightarrow 100\%$
        - $I_G(Y) \rightarrow 0$

* Gini impurity is computationally cheaper than entropy because it needs only squares, not logarithms.
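As a worked check of the definition, the balanced two-class case and a skewed 90/10 split (the second chosen here purely for illustration) evaluate to:

$$I_G = 1 - (0.5^2 + 0.5^2) = 0.5, \qquad I_G = 1 - (0.9^2 + 0.1^2) = 1 - 0.82 = 0.18$$

For two classes the impurity peaks at $0.5$ when both classes are equally likely and shrinks as one class dominates.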

```python
import pandas as pd
import numpy as np

data_source = 'http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv'
df = pd.read_csv(data_source)
df.columns = ['outlook', 'temp', 'humidity', 'wind', 'class']

def compute_gini_impurity(dframe, f):
    # Empirical probability of each distinct value in column f
    probs = dframe[f].value_counts(normalize=True).to_numpy()
    # I_G = 1 - sum(p^2)
    gm = 1 - np.sum(probs ** 2)
    return round(float(gm), 3)
```

```python
for i in df.columns:
    gm = compute_gini_impurity(dframe=df, f=i)
    print(gm)
```

```
0.663
0.653
0.5
0.49
0.459
```


### Building a Decision Tree - Information Gain (IG)

$$\text{Information Gain = [Entropy(parent)] – [Weighted Average Entropy of child nodes]}$$

$$\text{or}$$

$$\text{IG(Y, var)} = H_D(Y) - \sum_{i=1}^k \frac{|D_i|}{|D|} H_{D_i}(Y)$$

```python
import pandas as pd
import numpy as np

data_source = 'http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv'
df = pd.read_csv(data_source)
df.columns = ['outlook', 'temp', 'humidity', 'wind', 'class']

def compute_weighted_entropy(dframe, f, c):
    # Entropy of target c inside each child node (one child per value of f),
    # weighted by the fraction of rows that fall into that child
    c_tot = len(dframe)
    went_ = 0.0
    for i in dframe[f].value_counts().index:
        fidf = dframe[dframe[f] == i]
        fent_ = compute_entropy(dframe=fidf, f=c)
        went_ += (len(fidf) / c_tot) * fent_
    return round(went_, 3)

def gain_information(dframe, f, c):
    went_ = compute_weighted_entropy(dframe=dframe, f=f, c=c)
    # Parent entropy is computed on the frame passed in, not on the global df
    pent_ = compute_entropy(dframe=dframe, f=c)
    ginfo = pent_ - went_
    return round(ginfo, 3)
```

```python
compute_weighted_entropy(dframe=df, f='outlook', c='class')
```

```
0.694
```

```python
# Information gain of every feature with respect to the target column
for i in df.columns.drop('class'):
    ig = gain_information(dframe=df, f=i, c='class')
    print("{} → {}".format(i, ig))
```

```
outlook → 0.246
temp → 0.029
humidity → 0.152
wind → 0.048
```


### Constructing a Decision Tree

Helpful link → https://bit.ly/3d9noYg

**Pure node** → The node which has only one class label is called a pure node.

1. Choose the root node by computing the information gain of every feature w.r.t. the target and picking the feature with the maximum gain.

2. Split the data on the values of the chosen feature.
    - If a resulting node is a pure node, do not extend it further.
    - Otherwise, continue.

3. For each remaining node, choose the next internal node by repeating step 1 on the subset of data that reached it.

> This entire process is applied recursively. IG plays the central role in constructing the decision tree.

**Rules**

1. If a node is pure, stop growing the tree at that node.

2. If a node has too few points, stop growing the tree at that node.

3. If the depth of the tree becomes too large, stop growing the tree.
    - As the depth increases, the chance of overfitting to noise increases.
    - If the depth is very small, the model underfits.
    - The hyperparameter is `depth`, and the right `depth` is chosen by cross-validation.
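The construction steps and stopping rules above can be sketched as a short recursive function. This is an ID3-style illustration for categorical features only, not a reference implementation: the function names, the `max_depth` and `min_points` parameters, the nested-dict tree layout and the toy rows are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Parent entropy minus the weighted entropy of the children of `feature`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


def build_tree(rows, features, target, depth=0, max_depth=3, min_points=2):
    """Return a nested dict for an internal node or a class label for a leaf."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping rules: pure node, too few points, depth limit, no features left
    if (len(set(labels)) == 1 or len(rows) < min_points
            or depth >= max_depth or not features):
        return majority
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    children = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, remaining, target,
                                     depth + 1, max_depth, min_points)
    return {"feature": best, "children": children}


# Hypothetical toy data, invented for this example
rows = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
]
tree = build_tree(rows, ["outlook", "wind"], "play")
print(tree)
```

On this toy data `outlook` has the larger information gain, so it becomes the root. Its `sunny` and `overcast` branches are pure and become leaves, while the `rain` branch is split again on `wind`.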

### Building a Decision Tree - Splitting Numerical Features

* A numerical feature f is split as follows -

    - Sort the values of the feature in ascending order.
    - With `n` values, the split condition can be chosen in `n` possible ways, one per candidate threshold -
        * f < thresh_1
        * f < thresh_2
        * f < thresh_3
        * ...
        * f < thresh_n
        * We need to evaluate all `n` conditions.
    - Pick the one that gives maximum IG.

* This process is very time consuming because it is repeated for every numerical feature at every node.
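The threshold search can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `best_threshold` is hypothetical, the candidate thresholds are taken as midpoints between consecutive distinct sorted values (a common convention, not something the notes prescribe), and the readings are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best `value < threshold` split."""
    pairs = sorted(zip(values, labels))   # sort by feature value
    parent = entropy(labels)
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                       # no boundary between equal values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2  # midpoint candidate
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = (parent
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical humidity readings and labels, invented for this example
humidity = [65, 70, 75, 80, 90, 95]
play = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(humidity, play))  # (77.5, 1.0)
```

Sorting once lets every candidate split be read off as a prefix and a suffix of the sorted list. This sketch recomputes the child entropies from scratch for each candidate, which keeps it short but is slower than maintaining running class counts.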

### Building a Decision Tree - Categorical Features with `n` Possible Values

* When a categorical feature takes `n` possible values and `n` is large, it is better to convert the feature to a numeric one by computing the conditional probability of the target variable given each category.

* Replace each categorical value with that probability and construct the decision tree on the numeric feature.

When a categorical feature has many categories and is nominal in nature, there are three approaches to converting it.

* CASE1 (when the response variable is categorical or continuous):
    - Bin the feature into fewer subcategories.

* CASE2 (when the response variable is continuous):
    - Replace each category with the mean/median of the response variable for that category.

* CASE3 (when the response variable is discrete/categorical):
    - Replace each value x_ij with P(y_i = C | F_j = "ABC"), the probability of class C given that feature F_j takes the category "ABC".

CASE1 is known as binning.

CASE2 and CASE3 are known as response variable encoding.
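CASE3 can be sketched with plain dictionaries. The function name `response_encode`, its `positive` argument and the small `city`/`bought` data set are hypothetical and exist only for this example. In practice the probabilities would normally be estimated on the training split only, so that the encoding does not leak test labels.

```python
from collections import defaultdict


def response_encode(categories, labels, positive):
    """Map each category to the empirical P(label == positive | category)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cat, lab in zip(categories, labels):
        totals[cat] += 1
        if lab == positive:
            hits[cat] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Hypothetical data, invented for this example
city = ["A", "A", "B", "B", "B", "C"]
bought = [1, 0, 1, 1, 0, 0]
encoding = response_encode(city, bought, positive=1)
print(encoding)                     # A -> 1/2, B -> 2/3, C -> 0/1
print([encoding[c] for c in city])  # the numeric feature that replaces `city`
```

The encoded column can then be split with ordinary numerical thresholds.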

### Overfitting and Underfitting

> A tree of depth 1 is called a decision stump.

* If the depth of the tree increases,
    - the possibility of having very few points at a leaf node increases.
    - the model overfits to noisy data.
    - the interpretability of the model decreases, which defeats one of the main advantages of decision trees.

* If the tree is shallow,
    - the model underfits the data.

> The right depth needs to be discovered by cross-validation.

### Train and Runtime Complexity

* **Training time complexity** → $O(n \log_2(n) \cdot d)$
    - `n` → number of points in the data
    - `d` → dimensionality of the data
    - $n\log_2(n)$ → corresponds to sorting
* **Space complexity** → after training, the tree is stored as nested `if-else` conditions. The space complexity is
    - $(\text{total number of internal nodes}) + (\text{total number of leaf nodes})$
    - or
    - $(\text{total number of nodes})$
* **Runtime complexity** → proportional to the depth of the tree; $O(\text{depth})$.
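The space and runtime claims can be illustrated with a tree stored as nested dictionaries. The layout (`feature`, `threshold`, `left`, `right`), the function names and the depth-2 tree below are assumptions made for this sketch.

```python
def predict(tree, point):
    """Walk from the root to a leaf; return (label, number_of_comparisons)."""
    steps = 0
    node = tree
    while isinstance(node, dict):  # a dict is an internal node, anything else a leaf
        steps += 1
        go_left = point[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node, steps


def count_nodes(tree):
    """Total number of nodes (internal + leaf) that must be stored."""
    if not isinstance(tree, dict):
        return 1
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])


# Hypothetical depth-2 tree, invented for this example
tree = {
    "feature": "humidity", "threshold": 77.5,
    "left": "yes",
    "right": {"feature": "wind_speed", "threshold": 10.0,
              "left": "yes", "right": "no"},
}
print(predict(tree, {"humidity": 90, "wind_speed": 4}))  # ('yes', 2)
print(count_nodes(tree))                                 # 5
```

A prediction performs at most one comparison per level, so its cost is bounded by the depth, while storage grows with the total number of nodes.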

### Decision Tree Regression

Helpful link → http://www.saedsayad.com/decision_tree_reg.htm

### Cases

* If the data is imbalanced, balance it before training.
* When the data has categorical features, avoid one-hot encoding. Otherwise, the dimensionality grows and the model takes much longer to train.
* If a similarity matrix is given instead of the data, decision trees cannot be used. Decision trees need explicit features to compute IG.
* Feature interactions are used to decide which class label a query point belongs to.
    - Logical feature interactions are built into decision trees.
* As depth increases, the impact of outliers grows.
* The model is easy to interpret because everything can be converted to `if-else` conditions.