# Decision Tree

---

## References


[Scikit Learn - Decision Trees](https://scikit-learn.org/stable/modules/tree.html)

[Geeks for Geeks - Decision Tree](https://www.geeksforgeeks.org/decision-tree/)

[Geeks for Geeks - Decision Tree in Machine Learning](https://www.geeksforgeeks.org/decision-tree-introduction-example/)

[Geeks for Geeks - Python | Decision tree implementation](https://www.geeksforgeeks.org/decision-tree-implementation-python/)

[Geeks for Geeks - Python | Decision Tree Regression using sklearn](https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/)

---

## Notes

#### Characteristics
- Supervised Learning
- Classification and Regression
- Uses a tree-like structure where
    - internal nodes are questions
    - branches are paths for each answer
    - leaf nodes are predictions

#### Input & Output
- **Input**: feature matrix $X$ with shape (n_samples, n_features)
- **Output**: label or target value vector $y$ with shape (n_samples,)

#### Parameters
- splitting feature of each internal node
- splitting threshold of each internal node
- tree structure
- leaf predictions

#### Hyperparameters

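- maximum depth
- minimum number of samples per node (to split an internal node, and to form a leaf)
- splitting metric
    - Gini index (classification)
    - entropy (classification)
    - mean squared error (regression)
- minimum gain required to split a node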

#### Runtime Complexity
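- **Training**: $O(n\log n \cdot m\cdot d)$
    - at each node, for each feature, we sort the samples ($O(n\log n)$) and evaluate the gain at each of up to $n-1$ candidate thresholds, which can be done in one pass by updating running statistics
    - the nodes on one level together hold at most $n$ samples, so each level takes $O(n\log n\cdot m)$
    - with $d$ levels, the total is $O(n\log n\cdot m\cdot d)$
- **Inference**: $O(d)$, one comparison per level along a single root-to-leaf path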

where 
- $n$: number of samples
- $m$: number of features
- $d$: depth of the decision tree
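
The $O(d)$ inference bound can be illustrated with a small hand-built binary tree. This is only a sketch: the `Node` class, its field names, and `predict_one` are hypothetical and not taken from any library.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    # Fields used by internal nodes (hypothetical names).
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Field used by leaf nodes; None marks an internal node.
    prediction: Optional[Union[int, float]] = None


def predict_one(root, x):
    """Walk from the root to a leaf; return (prediction, comparisons made)."""
    node, steps = root, 0
    while node.prediction is None:
        # One comparison per level: go left if x[f] <= t, otherwise right.
        node = node.left if x[node.feature] <= node.threshold else node.right
        steps += 1
    return node.prediction, steps


# A depth-2 tree built by hand, for illustration only.
tree = Node(
    feature=0, threshold=2.5,
    left=Node(prediction=0),
    right=Node(feature=1, threshold=1.0,
               left=Node(prediction=1),
               right=Node(prediction=2)),
)

print(predict_one(tree, [1.0, 5.0]))
print(predict_one(tree, [3.0, 2.0]))
```

The number of comparisons never exceeds the depth of the tree, regardless of how many samples were used for training.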

#### Pros & Cons
- **Pros**: 
    - easy to understand and visualize the model
    - can model non-linear, complex decision boundaries
    - little data preprocessing is required
    - fast inference

- **Cons**:
    - high risk of overfitting
    - not guaranteed to be globally optimal, since each split is chosen greedily
    - biased towards features with many distinct values, since they offer more candidate splits

---

## Mathematics

#### Creating Internal / Root Nodes

When creating an internal node, we receive a set of samples $S$.

For each feature $f_i$:
- Sort $S$ based on the values of feature $f_i$
- Evaluate potential thresholds or intervals that split $S$ and compute the corresponding gain

Select the splitting feature $f$ and partition $V_f$ that together yield the maximum gain.

Create new child nodes using subsets $S_v\subseteq S$, where each $S_v=\{x \in S \mid x[f] \in v\}$ for $v \in V_f$.
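
For the binary case, the search above might look like the following sketch. It assumes Gini impurity, candidate thresholds at midpoints between consecutive distinct feature values, and running class counts so that each feature needs one sort plus one linear scan. The names `gini_from_counts` and `best_split` are hypothetical.

```python
from collections import Counter


def gini_from_counts(counts, total):
    """Gini impurity computed from a class -> count mapping."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(X, y):
    """Return (gain, feature, threshold) of the best binary split, or None."""
    n = len(y)
    parent_counts = Counter(y)
    parent_impurity = gini_from_counts(parent_counts, n)
    best = None
    for f in range(len(X[0])):
        # Sort sample indices by the value of feature f: O(n log n).
        order = sorted(range(n), key=lambda i: X[i][f])
        left, right = Counter(), Counter(parent_counts)
        for k in range(1, n):
            # Move one sample from the right side to the left side.
            label = y[order[k - 1]]
            left[label] += 1
            right[label] -= 1
            lo, hi = X[order[k - 1]][f], X[order[k]][f]
            if lo == hi:
                continue  # equal values cannot be separated by a threshold
            gain = (parent_impurity
                    - k / n * gini_from_counts(left, k)
                    - (n - k) / n * gini_from_counts(right, n - k))
            if best is None or gain > best[0]:
                # Midpoint threshold is an assumption of this sketch.
                best = (gain, f, (lo + hi) / 2)
    return best


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 5.0], [4.0, 3.0]]
y = [0, 0, 1, 1]
print(best_split(X, y))
```

In this toy data set, feature 0 separates the two classes perfectly while feature 1 carries no information, so the search selects feature 0.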

#### Creating Leaf Nodes

When creating a leaf node, we receive a set of samples $S$.

- Classification: the prediction is the most frequent class in $S$.
- Regression: the prediction is the mean of the target values in $S$.
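
Both leaf rules fit in a few lines of standard-library Python. The function names are hypothetical, and the tie-breaking behaviour (the class seen first wins) is a property of `Counter.most_common`, not something these notes specify.

```python
from collections import Counter
from statistics import mean


def leaf_prediction_classification(labels):
    """Most frequent class among the samples that reached the leaf."""
    return Counter(labels).most_common(1)[0][0]


def leaf_prediction_regression(targets):
    """Mean target value among the samples that reached the leaf."""
    return mean(targets)


print(leaf_prediction_classification(["a", "b", "b"]))
print(leaf_prediction_regression([1.0, 2.0, 6.0]))
```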

#### Gain

$$\text{Gain}(S,V_f,f) = i(S) - \sum_{v\in V_f}\frac{|S_v|}{|S|}i(S_v)$$

where 
- $S$: set of samples in the node
- $f$: index of splitting feature
- $V_f$: set of intervals representing partition on feature $f$
- $S_v$: subset of $S$ such that $x[f]\in v$ 
- $i(S)$: impurity of set $S$ (e.g., Gini, Entropy, MSE)

#### Gain (Binary Trees)

$$\text{Gain}(S,t,f) = i(S) - \frac{|L|}{|S|}i(L) - \frac{|R|}{|S|}i(R)$$
$$L=\{x\in S\mid x[f]\leq t\},\quad R=\{x\in S\mid x[f]> t\}$$

where 
- $S$: set of samples in the node
- $t$: splitting threshold
- $f$: index of the splitting feature
- $i(S)$: impurity of set $S$ (e.g., Gini, Entropy, MSE)
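
As a rough illustration of the binary gain, the sketch below takes the impurity as a function argument and uses MSE on a tiny regression example. The names `mse` and `binary_gain` are hypothetical, and returning a gain of 0 for a split that leaves one side empty is a choice made for this sketch.

```python
def mse(targets):
    """Mean squared deviation of the targets from their mean."""
    y_bar = sum(targets) / len(targets)
    return sum((t - y_bar) ** 2 for t in targets) / len(targets)


def binary_gain(values, targets, threshold, impurity):
    """Gain of splitting on `values <= threshold` for the given impurity."""
    left = [t for v, t in zip(values, targets) if v <= threshold]
    right = [t for v, t in zip(values, targets) if v > threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    n = len(targets)
    return (impurity(targets)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))


values = [1.0, 2.0, 3.0, 4.0]    # feature values x[f]
targets = [1.0, 1.0, 3.0, 3.0]   # regression targets

print(binary_gain(values, targets, 2.5, mse))
print(binary_gain(values, targets, 1.5, mse))
```

The threshold 2.5 separates the two target levels completely, so it removes all of the parent's impurity, while 1.5 removes only about a third of it.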

#### Gini Index
$$\text{Gini Index} = 1 - \sum_{i\in C}(p_i)^2$$

where 
- $C$: set of classes
- $p_i$: proportion of samples with class $i$ in the node
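
A minimal sketch of the Gini index for a list of class labels (the function name `gini` is hypothetical, and a non-empty node is assumed):

```python
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("gini is undefined for an empty node")
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["a", "a", "a"]))   # pure node
print(gini(["a", "b"]))        # two equally likely classes
print(gini(["a", "b", "c"]))   # three equally likely classes
```

A pure node has an index of 0, and the value grows as the classes become more evenly mixed.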

#### Entropy
$$H = -\sum_{i\in C}p_i\log_2(p_i)$$

where 
- $C$: set of classes
- $p_i$: proportion of samples with class $i$ in the node
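
The entropy can be sketched the same way. Only classes that actually occur in the node are counted, which matches the usual convention $0\log_2 0 = 0$. The function name `entropy` is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("entropy is undefined for an empty node")
    # Counter only holds classes that occur, so every p is > 0.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


print(entropy(["a", "a"]))             # pure node
print(entropy(["a", "b"]))             # two equally likely classes
print(entropy(["a", "b", "c", "d"]))   # four equally likely classes
```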

#### Mean Squared Error

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \overline{y})^2$$

where 
- $n$: number of samples in the node
- $y_i$: target value of $i$-th sample in the node
- $\overline{y}$: mean target value of the samples in the node
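
Because the deviations are measured from the node's own mean, this impurity is the population variance of the targets. The sketch below (the function name `mse` is hypothetical) checks that against `statistics.pvariance`.

```python
from statistics import pvariance


def mse(targets):
    """Mean squared deviation of the targets from their own mean."""
    n = len(targets)
    y_bar = sum(targets) / n
    return sum((y - y_bar) ** 2 for y in targets) / n


targets = [1.0, 2.0, 3.0]
print(mse(targets))
print(pvariance(targets))
```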

---


## Comments

Only binary decision trees will be implemented.