# Decision Tree

## Definition

A decision tree is a flowchart-like tree structure in which each internal node represents a test on a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting it recursively in a process called recursive partitioning.

### Structure

- **Root Node**: Represents the entire dataset, which gets divided into two or more homogeneous subsets.
- **Decision Node**: A sub-node that splits into further sub-nodes.
- **Leaf Node**: Nodes with no children (no further split).
- **Branch**: Represents a decision rule, leading to a decision or leaf node.

### Entropy

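Entropy measures the randomness or uncertainty in a set of class labels: it is 0 when every element belongs to the same class and, for two classes, reaches its maximum of 1 when the classes are evenly split. It is defined as: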

$$ H(S) = -\sum_{x} P(x) \log_2 P(x) $$

Where:
- $ H(S) $ is the entropy of set $ S $.
- $ P(x) $ is the proportion of the number of elements in class $ x $ to the number of elements in set $ S $.
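
As a rough illustration, the definition above can be sketched in plain Python. The function name `entropy` and the list-of-labels input format are assumptions made for this example; they are not part of the formula itself.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    # P(x): share of elements that belong to class x
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in proportions)


print(entropy(["Yes", "Yes", "No", "No"]))  # evenly split two classes: 1.0
print(entropy(["Yes", "Yes", "Yes"]))       # pure set: 0.0
```

Classes that do not occur in the set never appear in the `Counter`, so the undefined term $ 0 \log_2 0 $ is never evaluated.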

### Information Gain

Information Gain measures the reduction in entropy or surprise by splitting a dataset on an attribute. It is calculated as:

$$ IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v) $$

Where:
- $ IG(S, A) $ is the information gain by splitting set $ S $ on attribute $ A $.
- $ H(S) $ is the entropy of set $ S $.
- $ S_v $ is the subset of $ S $ for each value $ v $ of attribute $ A $.
- $ |S_v|/|S| $ is the proportion of the number of elements in $ S_v $ to the number of elements in set $ S $.
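
A minimal sketch of this calculation, assuming the attribute values and the class labels arrive as two parallel lists (the names `information_gain`, `values`, and `labels` are illustrative choices):

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG(S, A), where values[i] is attribute A's value for element i of S."""
    # Group the labels into the subsets S_v, one per attribute value v
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    n = len(labels)
    # Weighted average entropy of the subsets after the split
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# A split that separates the classes perfectly recovers all of H(S)
print(information_gain(["a", "a", "b", "b"], ["Yes", "Yes", "No", "No"]))  # 1.0
# A split that leaves every subset as mixed as before gains nothing
print(information_gain(["a", "b", "a", "b"], ["Yes", "Yes", "No", "No"]))  # 0.0
```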

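### Gini Index

The Gini index measures impurity as the probability that a randomly chosen element of a set would be misclassified if it were labeled at random according to the class distribution of that set. For a dataset $ D $ with $ m $ classes, and for a split of $ D $ on feature $ F $:

$$ \text{Gini}(D) = 1 - \sum_{i=1}^{m} P_i^2 $$
$$ \text{Gini}(D, F) = \sum_{j \in \text{values}(F)} \frac{|D_j|}{|D|} \text{Gini}(D_j) $$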

Where:
- $ P_i $ is the proportion of the number of elements in class $ i $ to the number of elements in set $ D $.
- $ D_j $ represents the subset of $ D $ for each value $ j $ of feature $ F $.
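
The two formulas can be sketched the same way. The names `gini` and `gini_split` and the parallel-list input format are assumptions for this example.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini(D): one minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values, labels):
    """Gini(D, F): size-weighted Gini index of the subsets D_j induced by F."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    n = len(labels)
    return sum(len(s) / n * gini(s) for s in subsets.values())


print(gini(["Yes", "Yes", "No", "No"]))                              # 0.5
print(gini_split(["a", "a", "b", "b"], ["Yes", "Yes", "No", "No"]))  # 0.0
```

A lower weighted value means purer subsets, so a split with `gini_split` equal to 0 separates the classes completely.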

## Assumption

None

## Algorithm

1. **Initialization**: Start with the entire dataset, $ D $, at the root node.

2. **Feature Selection**:
   - For each feature $ F $ in the dataset, calculate the split criterion (such as Entropy or Gini index) for each possible split value.
   - For a classification problem with $ m $ classes (binary classification is the case $ m = 2 $), the split criterion can be calculated as follows:

     **Entropy**:
     $$ H(D) = -\sum_{i=1}^{m} P_i \log_2 P_i $$
     $$ H(D, F) = \sum_{j \in \text{values}(F)} \frac{|D_j|}{|D|} H(D_j) $$
     $$ \text{Information Gain} = H(D) - H(D, F) $$

     **Gini Index**:
     $$ \text{Gini}(D) = 1 - \sum_{i=1}^{m} P_i^2 $$
     $$ \text{Gini}(D, F) = \sum_{j \in \text{values}(F)} \frac{|D_j|}{|D|} \text{Gini}(D_j) $$

     Where:
     - $ P_i $ is the proportion of the number of elements in class $ i $ to the number of elements in set $ D $.
     - $ D_j $ represents the subset of $ D $ for each value $ j $ of feature $ F $.

3. **Best Feature Selection**:
   - Choose the feature with the highest information gain (or the lowest weighted Gini index, $ \text{Gini}(D, F) $) as the splitting feature.
   - Let $ F_{\text{best}} $ be this feature.

4. **Node Creation**:
   - Create a decision node that splits on $ F_{\text{best}} $.

5. **Dataset Splitting**:
   - Split the dataset $ D $ into subsets $ D_1, D_2, \ldots, D_k $ based on the unique values of $ F_{\text{best}} $.

6. **Recursion**:
   - For each subset $ D_i $:
     - If $ D_i $ is pure (contains only one class), attach a leaf node to the decision node with that class label.
     - If $ D_i $ still contains mixed classes, repeat steps 2 to 6 for $ D_i $.

7. **Termination**:
   - The process continues until all data is classified or another stopping criterion is met (such as maximum tree depth).
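
The steps above can be sketched as a short recursive function. This is a simplified, ID3-style illustration for categorical features only, not a production implementation. The nested-dict tree format, the `max_depth` parameter, and all function names are assumptions made for this example. It uses the small weather dataset from the Example section.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


def build_tree(rows, labels, features, max_depth=None, depth=0):
    # Steps 6 and 7: stop when the node is pure, no features remain,
    # or the optional depth limit is reached; predict the majority class.
    out_of_depth = max_depth is not None and depth >= max_depth
    if len(set(labels)) == 1 or not features or out_of_depth:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2 and 3: pick the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    # Steps 4 and 5: create a decision node and split on each unique value
    node = {"feature": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["children"][value] = build_tree(
            sub_rows, sub_labels, remaining, max_depth, depth + 1
        )
    return node


def predict(tree, row):
    # Follow branches until a leaf (a plain class label) is reached.
    # A feature value never seen during training raises KeyError.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["feature"]]]
    return tree


rows = [
    {"Weather": "Sunny", "Wind": "Weak"},
    {"Weather": "Sunny", "Wind": "Strong"},
    {"Weather": "Rainy", "Wind": "Weak"},
    {"Weather": "Rainy", "Wind": "Strong"},
    {"Weather": "Sunny", "Wind": "Weak"},
]
labels = ["Yes", "No", "Yes", "No", "Yes"]

tree = build_tree(rows, labels, ["Weather", "Wind"])
print(tree)  # {'feature': 'Wind', 'children': {'Strong': 'No', 'Weak': 'Yes'}}
print(predict(tree, {"Weather": "Rainy", "Wind": "Weak"}))  # Yes
```

Practical libraries typically differ from this sketch: scikit-learn, for instance, builds binary trees with threshold splits on numeric features rather than one branch per category.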

### Example

Here's the dataset:

| Weather | Wind   | Play |
|---------|--------|------|
| Sunny   | Weak   | Yes  |
| Sunny   | Strong | No   |
| Rainy   | Weak   | Yes  |
| Rainy   | Strong | No   |
| Sunny   | Weak   | Yes  |

First, we calculate the entropy and Gini index for the entire dataset (root node), and then for one of the features.

#### Entropy Calculation

1. **Entropy for the entire dataset:**

   There are 3 'Yes' and 2 'No' outcomes.

   $$ P(Yes) = \frac{3}{5}, \quad P(No) = \frac{2}{5} $$

   $$ H(S) = -P(Yes) \log_2 P(Yes) - P(No) \log_2 P(No) $$
   $$ H(S) = -\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5} $$

2. **Entropy for a feature (e.g., Weather):**

   For 'Sunny' (3 instances): 2 'Yes', 1 'No'
   $$ H(Sunny) = -\frac{2}{3} \log_2 \frac{2}{3} - \frac{1}{3} \log_2 \frac{1}{3} $$

   For 'Rainy' (2 instances): 1 'Yes', 1 'No'
   $$ H(Rainy) = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} $$

   Weighted average:
   $$ H(Weather) = \frac{3}{5} H(Sunny) + \frac{2}{5} H(Rainy) $$

#### Gini Index Calculation

1. **Gini index for the entire dataset:**
   $$ Gini(S) = 1 - [P(Yes)^2 + P(No)^2] $$
   $$ Gini(S) = 1 - \left[ \left(\frac{3}{5}\right)^2 + \left(\frac{2}{5}\right)^2 \right] $$

2. **Gini index for a feature (e.g., Weather):**

   For 'Sunny':
   $$ Gini(Sunny) = 1 - \left[ \left(\frac{2}{3}\right)^2 + \left(\frac{1}{3}\right)^2 \right] $$

   For 'Rainy':
   $$ Gini(Rainy) = 1 - \left[ \left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2 \right] $$

   Weighted average:
   $$ Gini(Weather) = \frac{3}{5} Gini(Sunny) + \frac{2}{5} Gini(Rainy) $$

Evaluating these expressions gives the following values:

1. **Entropy for the entire dataset (root node):** $ H(S) = 0.971 $
2. **Weighted average entropy for the 'Weather' feature:** $ H(Weather) = 0.951 $

3. **Gini index for the entire dataset (root node):** $ Gini(S) = 0.48 $
4. **Weighted average Gini index for the 'Weather' feature:** $ Gini(Weather) = 0.467 $
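
To complete the comparison, the same calculation can be repeated for the other feature, Wind. The figures below are worked from the table above and are not stated in the original example. All three 'Weak' rows are 'Yes' and both 'Strong' rows are 'No', so each subset is pure:

$$ H(Weak) = 0, \quad H(Strong) = 0 $$
$$ H(Wind) = \frac{3}{5} \cdot 0 + \frac{2}{5} \cdot 0 = 0 $$
$$ IG(S, Wind) = 0.971 - 0 = 0.971 $$
$$ IG(S, Weather) = 0.971 - 0.951 = 0.020 $$
$$ Gini(Wind) = \frac{3}{5} \cdot 0 + \frac{2}{5} \cdot 0 = 0 $$

Wind has the higher information gain and the lower weighted Gini index, so both criteria select it for the root split. Because both resulting subsets are pure, the tree is complete after this single split.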


## Pros and Cons

### Pros of Decision Trees:

1. **Easy to Understand and Interpret**: Trees can be visualised, which makes them easy to interpret and explain, even to non-technical stakeholders.

2. **Handles Both Numerical and Categorical Data**: They can handle datasets with a variety of variable types.

3. **No Need for Data Normalization**: They don't require data to be normalized or scaled.

4. **Performs Well with Large Datasets**: They can handle large datasets efficiently.

5. **Automatic Feature Selection**: The tree can select the most informative features from a dataset.

6. **Non-Parametric Method**: They make no assumptions about the distribution of the data or the structure of the classifier.

7. **Flexible and Versatile**: Can be used for both classification and regression tasks.


### Cons of Decision Trees:

1. **Overfitting**: Without proper tuning, decision trees can easily overfit, especially with a lot of features or complex structures.

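2. **Not Suitable for Linear Problems**: They are not the best choice for tasks where the relationship between the features and the target is linear, since axis-aligned splits can only approximate a linear boundary with a staircase of many splits.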

3. **Instability**: Small changes in the data can lead to a completely different tree.

4. **Greedy Algorithms**: Decision tree algorithms are greedy, meaning they might not produce the globally optimal tree.

5. **Bias Towards Dominant Classes**: On imbalanced datasets, trees can be biased toward the dominant classes.

6. **Poor Performance on XOR, Parity, or Multiplexer Problems**: Decision trees struggle with these types of problems because they cannot easily express the XOR, parity, or multiplexer functions.

7. **Complex Trees Can Be Hard to Interpret**: While simple trees are interpretable, very large and complex trees can become difficult to understand and interpret.
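
The greedy-search and XOR points can be illustrated together. In the sketch below (function names and the tuple-based data layout are assumptions for this example), each feature of a two-input XOR dataset has zero information gain on its own, so a greedy splitter sees no reason to prefer either one, even though the two features together determine the label exactly.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


# Two binary inputs; the label is their exclusive OR
xor_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [a ^ b for a, b in xor_rows]

for i in range(2):
    # Each single-feature split leaves both subsets evenly mixed
    print(f"feature {i}: gain = {information_gain(xor_rows, xor_labels, i)}")  # 0.0
```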

In summary, decision trees are a powerful tool with a natural ability to explain their predictions, but they can suffer from overfitting and instability. They work best when the dataset is not too complex and the relationship between the features and the target is not linear. Balancing the depth of the tree and the number of features considered at each split is key to building a well-performing decision tree model.

# Random Forest

## Definition

Random Forest is an ensemble learning method used for both classification and regression tasks. It builds upon the concept of decision trees, combining multiple trees to improve the overall performance and accuracy.

- **Ensemble of Decision Trees**: Random Forest creates a 'forest' of decision trees, each trained on a random subset of the data.
- **Bagging (Bootstrap Aggregating)**: Each tree is grown on a bootstrap sample, which is a random sample of the training data drawn with replacement.
- **Feature Randomness**: When splitting a node during the construction of the tree, the best split is found from a random subset of the features, not from all features.
- **Majority Voting (Classification)** or **Average Prediction (Regression)**: For classification, the prediction of the Random Forest is the mode of the classes predicted by individual trees. For regression, it is the average of the predictions.

Given a training set $ X = x_1, x_2, ..., x_n $ with responses $ Y = y_1, y_2, ..., y_n $, the Random Forest algorithm involves the following steps:

1. For $ B $ bootstrap samples (where $ B $ is the number of trees in the forest):
   - Draw a bootstrap sample $ X_b, Y_b $ from the training data.
   - Grow a decision tree $ f_b $ on $ X_b, Y_b $, evaluating only a randomly selected subset of the features at each node when choosing the split.

2. Output the ensemble of trees $ \{ f_1, f_2, ..., f_B \} $.

3. To make a prediction for a new point $ x $:
   - **Classification**: Let each tree $ f_b $ vote and take the majority vote as the final prediction.
   - **Regression**: Average the predictions of each tree.

   $$ \hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x) $$ (for regression)
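
The sampling and aggregation steps can be sketched in plain Python. This shows only the building blocks around the trees. The tree-growing step itself is omitted, and all function names here are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement, keeping each row paired with its response."""
    n = len(X)
    indices = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in indices], [y[i] for i in indices]


def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features to evaluate at one node."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Classification: the most common class among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: the mean of the trees' predictions."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
X_b, y_b = bootstrap_sample([[1], [2], [3]], ["a", "b", "c"], rng)
print(len(X_b), len(y_b))                    # 3 3
print(majority_vote(["Yes", "No", "Yes"]))   # Yes
print(average([1.0, 2.0, 3.0]))              # 2.0
```

Because sampling is done with replacement, a bootstrap sample usually repeats some rows and leaves others out, which is what makes the individual trees differ from one another.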

## Pros and Cons
### Pros

- **High Accuracy**: Combining multiple trees reduces the variance, often leading to better predictions.
- **Handles Overfitting**: The randomization elements help to reduce overfitting.
- **Works Well on Large Datasets**: Effective for datasets with a large number of features and data points.
- **Handles Missing Values**: Some implementations can handle missing values in the data.

### Cons

- **Model Interpretability**: More complex and harder to interpret than individual decision trees.
- **Computationally Intensive**: Requires more computational resources and time to train, especially with a large number of trees.
- **Not Ideal for Real-time Predictions**: Due to the size of the model, it might not be the best choice for applications requiring real-time predictions.

## Code

```python
import warnings

import numpy as np
from datasets import load_dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the output readable
warnings.filterwarnings('ignore')
```

```python
# Load the training split and convert it to a pandas DataFrame
dataset = load_dataset("imodels/diabetes-readmission", split='train')
df = dataset.to_pandas()
df.head(5)
```

| Unnamed: 0 | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | change | diabetesMed | ... | glyburide-metformin:Up | A1Cresult:>7 | A1Cresult:>8 | A1Cresult:None | A1Cresult:Norm | max_glu_serum:>200 | max_glu_serum:>300 | max_glu_serum:None | max_glu_serum:Norm | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 38.0 | 3.0 | 27.0 | 0.0 | 1.0 | 2.0 | 7.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 1 | 4.0 | 48.0 | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | 2.0 | 28.0 | 0.0 | 15.0 | 0.0 | 3.0 | 4.0 | 9.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 3 | 4.0 | 44.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 4 | 3.0 | 54.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 8.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |

```python
# Features are all columns except the last; the label is the last column (readmitted)
X = np.array(df.iloc[:, :-1])
y = np.array(df.iloc[:, -1])
print(X.shape)
print(y.shape)
```

```
(81410, 150)
(81410,)
```

```python
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
dt_predictions = dt_classifier.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)
print("Decision Tree Accuracy:", dt_accuracy)

# Random Forest
rf_classifier = RandomForestClassifier(n_estimators=1000, random_state=42)
rf_classifier.fit(X_train, y_train)
rf_predictions = rf_classifier.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Accuracy:", rf_accuracy)
```

```
Decision Tree Accuracy: 0.558653728043238
Random Forest Accuracy: 0.634934283257585
```
