## Decision Trees: Mathematical Background and Variants

### Introduction to Decision Trees

Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

### How Decision Trees Work

1. **Node Splitting**: Each internal node tests a single feature and splits the data reaching it into two or more branches according to a condition on that feature (for example, a threshold on a numeric value). This process starts at the tree's root and is repeated recursively on each resulting subset.

2. **Splitting Criteria**:
   - **Classification**: The Gini impurity or entropy is commonly used to measure the best split.
   - **Regression**: Variance reduction is often used for determining splits.

3. **Tree Pruning**: Reducing the size of a decision tree by removing sections that contribute little predictive power. This reduces the complexity and variance of the model.

4. **Stopping Criteria**: The tree stops growing when it reaches a predetermined condition, like a maximum depth or a minimum number of samples required to split a node.
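
The splitting step described above can be sketched for a single numeric feature. This is a minimal illustration rather than how any particular library implements it: the helper names `gini` and `best_threshold` and the toy data are hypothetical, and candidate thresholds are assumed to be the midpoints between consecutive sorted values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    If no split improves on the parent impurity, the threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: one numeric feature and a binary label.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(values, labels)
print("Best threshold:", threshold, "weighted Gini:", impurity)
```

A full tree builder would run this search over every feature, split on the best one, and recurse on the two subsets until a stopping criterion is met.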

### Mathematical Formulas

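- **Gini Impurity**: The probability that a randomly chosen element from the set would be incorrectly labeled if it were labeled at random according to the class distribution of the set. Here $p_i$ is the proportion of elements belonging to class $i$, and $n$ is the number of classes.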
  
  $$
  G = 1 - \sum_{i=1}^{n} p_i^2
  $$

- **Entropy**: A measure of the amount of information disorder or uncertainty.
  
  $$
  H = -\sum_{i=1}^{n} p_i \log_2(p_i)
  $$

- **Information Gain**: The change in information entropy from a prior state to a state that takes some information as given.
  
  $$
  IG(T, a) = H(T) - H(T|a)
  $$
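
The three formulas translate almost directly into code. The sketch below works on per-class counts; the function names are illustrative choices, not part of any library.

```python
import math


def gini_from_counts(counts):
    """G = 1 - sum(p_i^2), with p_i = count_i / total."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy_from_counts(counts):
    """H = -sum(p_i * log2(p_i)); empty classes contribute 0 by convention."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """IG = H(parent) - weighted average of the children's entropies."""
    total = sum(parent_counts)
    conditional = sum(
        sum(child) / total * entropy_from_counts(child)
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - conditional


# A perfectly balanced two-class node, split into two pure children.
parent = [5, 5]
children = [[5, 0], [0, 5]]
print("Gini:", gini_from_counts(parent))
print("Entropy:", entropy_from_counts(parent))
print("Information gain:", information_gain(parent, children))
```

For two balanced classes, Gini impurity reaches its maximum of 0.5 and entropy its maximum of 1 bit; a split into pure children recovers all of that entropy as information gain.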

### Decision Tree Variant: Random Forest

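Random Forest is an ensemble learning method for classification and regression that constructs a multitude of decision trees at training time, each trained on a bootstrap sample of the data and typically considering only a random subset of features at each split. For classification tasks, the output of the random forest is the class selected by most trees; for regression tasks, it is the average of the individual trees' predictions.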

#### Benefits of Random Forest:
- **Accuracy**: Generally more accurate than a single decision tree.
- **Overfitting**: Controls overfitting better than individual decision trees.
- **Variable Importance**: Provides a good indicator of the importance of features.
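
As a rough illustration of these points, the following sketch fits scikit-learn's `RandomForestClassifier` on the Iris dataset and prints its impurity-based feature importances. The dataset, the 80/20 split, `n_estimators=100`, and the random seeds are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# 100 trees, each grown on a bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print("Test accuracy:", test_accuracy)

# Impurity-based importances, normalized so that they sum to 1.
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```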






## Problems of Using Gini, Information Gain, and Entropy in Decision Trees

### Introduction

Decision Trees use criteria like Gini impurity, Information Gain, and Entropy to decide the best splits at each node. Each method has its own advantages and disadvantages, which can impact the performance and characteristics of the decision tree.

### Gini Impurity

#### Formula
Gini impurity measures how often a randomly chosen element would be misclassified if it were labeled at random according to the class distribution in the node:

$$
G = 1 - \sum_{i=1}^{n} p_i^2
$$

Where $p_i$ is the proportion of elements in the node that belong to class $i$, and $n$ is the number of classes.

#### Advantages
1. **Computation**: Gini impurity is computationally efficient as it does not require logarithmic calculations.
2. **Performance**: In practice it usually produces trees very similar to those built with entropy, while being slightly cheaper to evaluate at each candidate split.

#### Disadvantages
1. **Bias towards Multisplits**: Gini impurity can be biased towards attributes with more levels, leading to a preference for features with more categories.
2. **Sensitivity to Class Distribution**: It may not perform well if the class distribution is highly imbalanced.

### Information Gain

#### Formula
Information Gain is based on the decrease in entropy after a dataset is split on an attribute:

$$
IG(T, a) = H(T) - H(T|a)
$$

Where $H(T)$ is the entropy of the dataset $T$ and $H(T|a)$ is the conditional entropy of $T$ given attribute $a$.

#### Advantages
1. **Intuitive**: Provides a clear measure of the reduction in uncertainty due to the split.
2. **Theoretical Foundation**: Based on information theory, providing a solid theoretical background.

#### Disadvantages
1. **Bias towards Attributes with Many Values**: Information Gain tends to favor attributes with a large number of distinct values, potentially leading to overfitting.
2. **Computational Complexity**: Calculating entropy involves logarithms, which makes it slightly more expensive to evaluate than Gini impurity, although the difference is rarely significant in practice.
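
The bias towards many-valued attributes can be seen on a small made-up example. In the sketch below, `row_id` is a hypothetical identifier column that is unique per row and has no predictive value, yet it receives the maximum possible information gain because every resulting subset is trivially pure. All names and data are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by attribute."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - conditional


labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
row_id = list(range(len(labels)))                 # unique value per row
coin = ["h", "h", "t", "t", "h", "h", "t", "t"]   # unrelated to the label

ig_row_id = information_gain(row_id, labels)
ig_coin = information_gain(coin, labels)
print("IG(row_id):", ig_row_id)
print("IG(coin):", ig_coin)
```

A tree that split on `row_id` would fit the training data perfectly and generalize to nothing, which is the overfitting risk described above.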

### Entropy

#### Formula
Entropy is a measure of the amount of uncertainty or impurity in the dataset:

$$
H = -\sum_{i=1}^{n} p_i \log_2(p_i)
$$

Where $p_i$ is the proportion of elements in the dataset that belong to class $i$, and $n$ is the number of classes.

#### Advantages
1. **Robustness**: Entropy varies smoothly with the class proportions, so small changes in the dataset produce only small changes in the measured impurity.
2. **General Use**: Suitable for both binary and multi-class classification problems.

#### Disadvantages
1. **Computational Cost**: As with Information Gain, calculating entropy involves logarithms, which makes it slightly more expensive than Gini impurity.
2. **Bias towards High Cardinality**: Because Information Gain is computed from entropy, entropy-based splitting shares its bias towards attributes with many distinct values.

### Other Methods

#### Chi-Square
- **Advantages**: Useful for categorical data and can handle multi-class classification problems.
- **Disadvantages**: May not perform well with small sample sizes and can be sensitive to rare categories.
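
For reference, a chi-square criterion scores a candidate split by comparing the observed class counts in each branch with the counts expected if the attribute and the class were independent. The sketch below computes the Pearson chi-square statistic by hand; the function name is illustrative, and the counts are the Outlook versus PlayTennis counts from the numerical example in this document.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table.

    Rows are attribute values, columns are classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: Sunny, Overcast, Rainy. Columns: Yes, No.
outlook_table = [[2, 3], [4, 0], [3, 2]]
chi2_outlook = chi_square_statistic(outlook_table)
print(f"Chi-square for Outlook: {chi2_outlook:.3f}")
```

Several expected counts in this table are below 5, which is the kind of small-sample situation where the chi-square approximation becomes unreliable.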

#### Gain Ratio
- **Advantages**: Adjusts Information Gain by taking the number and size of branches into account, reducing bias towards attributes with many values.
- **Disadvantages**: Can still be complex to compute and may not always outperform simpler criteria like Gini.
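
A minimal sketch of the gain ratio follows, assuming the usual C4.5-style definition: information gain divided by the "split information", which is the entropy of the branch sizes. The function names are illustrative, and the counts come from the PlayTennis example in this document.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy (in bits) from a list of counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(parent_counts, children_counts):
    """Information gain divided by the entropy of the branch sizes."""
    total = sum(parent_counts)
    child_sizes = [sum(child) for child in children_counts]
    conditional = sum(
        size / total * entropy_from_counts(child)
        for size, child in zip(child_sizes, children_counts)
    )
    gain = entropy_from_counts(parent_counts) - conditional
    split_info = entropy_from_counts(child_sizes)
    return gain / split_info if split_info > 0 else 0.0


parent = [9, 5]  # Yes, No
gr_outlook = gain_ratio(parent, [[2, 3], [4, 0], [3, 2]])  # Sunny, Overcast, Rainy
gr_humidity = gain_ratio(parent, [[3, 4], [6, 1]])         # High, Normal
print(f"Gain ratio (Outlook):  {gr_outlook:.3f}")
print(f"Gain ratio (Humidity): {gr_humidity:.3f}")
```

The three-way Outlook split is penalized by a split information above 1 bit, while the balanced two-way Humidity split has a split information of exactly 1 bit, so the gap between the two attributes narrows compared with plain information gain.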

### Conclusion

Each criterion for splitting in Decision Trees has its own set of advantages and disadvantages. The choice of criterion can significantly affect the performance and interpretability of the model. It is essential to consider the nature of the dataset, the computational resources available, and the specific requirements of the problem at hand when selecting a splitting criterion.


## Simple Numerical Example of Decision Tree

Consider a small dataset about whether or not to play tennis based on the weather conditions. The dataset has three features: Outlook, Temperature, and Humidity. Here’s a simplified dataset:

| Outlook | Temperature | Humidity | PlayTennis |
|---------|-------------|----------|------------|
| Sunny   | Hot         | High     | No         |
| Sunny   | Hot         | High     | No         |
| Overcast| Hot         | High     | Yes        |
| Rainy   | Mild        | High     | Yes        |
| Rainy   | Cool        | Normal   | Yes        |
| Rainy   | Cool        | Normal   | No         |
| Overcast| Cool        | Normal   | Yes        |
| Sunny   | Mild        | High     | No         |
| Sunny   | Cool        | Normal   | Yes        |
| Rainy   | Mild        | Normal   | Yes        |
| Sunny   | Mild        | Normal   | Yes        |
| Overcast| Mild        | High     | Yes        |
| Overcast| Hot         | Normal   | Yes        |
| Rainy   | Mild        | High     | No         |

### Step-by-Step Calculation

#### 1. Calculate Entropy for the Target

Entropy helps us measure the impurity in a dataset. For the target variable PlayTennis:

$$
H(PlayTennis) = -p(Yes) \log_2(p(Yes)) - p(No) \log_2(p(No))
$$

From the dataset:
- Total samples = 14
- Yes = 9
- No = 5

Thus:

$$
H(PlayTennis) = -\left(\frac{9}{14}\right) \log_2\left(\frac{9}{14}\right) - \left(\frac{5}{14}\right) \log_2\left(\frac{5}{14}\right) \approx 0.940
$$
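
Spelling out the arithmetic, with intermediate values rounded to three decimals ($\log_2(9/14) \approx -0.637$ and $\log_2(5/14) \approx -1.485$):

$$
H(PlayTennis) \approx -(0.643)(-0.637) - (0.357)(-1.485) \approx 0.410 + 0.530 = 0.940
$$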

#### 2. Calculate Entropy for Each Attribute

**Outlook:**

1. **Sunny:**
   - Total = 5
   - Yes = 2, No = 3

   $$
   H(Sunny) = -\left(\frac{2}{5}\right) \log_2\left(\frac{2}{5}\right) - \left(\frac{3}{5}\right) \log_2\left(\frac{3}{5}\right) \approx 0.971
   $$

2. **Overcast:**
   - Total = 4
   - Yes = 4, No = 0

   $$
   H(Overcast) = -\left(\frac{4}{4}\right) \log_2\left(\frac{4}{4}\right) - \left(\frac{0}{4}\right) \log_2\left(\frac{0}{4}\right) = 0
   $$

   The second term is taken to be 0, following the convention $0 \log_2 0 = 0$.

3. **Rainy:**
   - Total = 5
   - Yes = 3, No = 2

   $$
   H(Rainy) = -\left(\frac{3}{5}\right) \log_2\left(\frac{3}{5}\right) - \left(\frac{2}{5}\right) \log_2\left(\frac{2}{5}\right) \approx 0.971
   $$

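The weighted average of these entropies, which is the conditional entropy $H(PlayTennis \mid Outlook)$ and is written $H(Outlook)$ here for brevity: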

$$
H(Outlook) = \frac{5}{14}H(Sunny) + \frac{4}{14}H(Overcast) + \frac{5}{14}H(Rainy)
$$

$$
H(Outlook) = \frac{5}{14} \cdot 0.971 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.971 \approx 0.693
$$

**Temperature:**

1. **Hot:**
   - Total = 4
   - Yes = 2, No = 2

   $$
   H(Hot) = -\left(\frac{2}{4}\right) \log_2\left(\frac{2}{4}\right) - \left(\frac{2}{4}\right) \log_2\left(\frac{2}{4}\right) = 1
   $$

2. **Mild:**
   - Total = 6
   - Yes = 4, No = 2

   $$
   H(Mild) = -\left(\frac{4}{6}\right) \log_2\left(\frac{4}{6}\right) - \left(\frac{2}{6}\right) \log_2\left(\frac{2}{6}\right) \approx 0.918
   $$

3. **Cool:**
   - Total = 4
   - Yes = 3, No = 1

   $$
   H(Cool) = -\left(\frac{3}{4}\right) \log_2\left(\frac{3}{4}\right) - \left(\frac{1}{4}\right) \log_2\left(\frac{1}{4}\right) \approx 0.811
   $$

The weighted average entropy after splitting on Temperature:

$$
H(Temperature) = \frac{4}{14}H(Hot) + \frac{6}{14}H(Mild) + \frac{4}{14}H(Cool)
$$

$$
H(Temperature) = \frac{4}{14} \cdot 1 + \frac{6}{14} \cdot 0.918 + \frac{4}{14} \cdot 0.811 \approx 0.911
$$

**Humidity:**

1. **High:**
   - Total = 7
   - Yes = 3, No = 4

   $$
   H(High) = -\left(\frac{3}{7}\right) \log_2\left(\frac{3}{7}\right) - \left(\frac{4}{7}\right) \log_2\left(\frac{4}{7}\right) \approx 0.985
   $$

2. **Normal:**
   - Total = 7
   - Yes = 6, No = 1

   $$
   H(Normal) = -\left(\frac{6}{7}\right) \log_2\left(\frac{6}{7}\right) - \left(\frac{1}{7}\right) \log_2\left(\frac{1}{7}\right) \approx 0.592
   $$

The weighted average entropy after splitting on Humidity:

$$
H(Humidity) = \frac{7}{14}H(High) + \frac{7}{14}H(Normal)
$$

$$
H(Humidity) = \frac{7}{14} \cdot 0.985 + \frac{7}{14} \cdot 0.592 \approx 0.789
$$

#### 3. Calculate Information Gain for Each Attribute

**Information Gain for Outlook:**

$$
IG(Outlook) = H(PlayTennis) - H(Outlook)
$$

$$
IG(Outlook) = 0.940 - 0.693 = 0.247
$$

**Information Gain for Temperature:**

$$
IG(Temperature) = H(PlayTennis) - H(Temperature)
$$

$$
IG(Temperature) = 0.940 - 0.911 = 0.029
$$
$$

**Information Gain for Humidity:**

$$
IG(Humidity) = H(PlayTennis) - H(Humidity)
$$

$$
IG(Humidity) = 0.940 - 0.789 = 0.151
$$
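
One way to check hand calculations like these is to recompute them directly from the table. The sketch below does so with plain Python; the helper names are illustrative. Values computed at full precision can differ in the last digit from results obtained with rounded intermediate values.

```python
import math
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, PlayTennis) rows copied from the table above.
ROWS = [
    ("Sunny", "Hot", "High", "No"),
    ("Sunny", "Hot", "High", "No"),
    ("Overcast", "Hot", "High", "Yes"),
    ("Rainy", "Mild", "High", "Yes"),
    ("Rainy", "Cool", "Normal", "Yes"),
    ("Rainy", "Cool", "Normal", "No"),
    ("Overcast", "Cool", "Normal", "Yes"),
    ("Sunny", "Mild", "High", "No"),
    ("Sunny", "Cool", "Normal", "Yes"),
    ("Rainy", "Mild", "Normal", "Yes"),
    ("Sunny", "Mild", "Normal", "Yes"),
    ("Overcast", "Mild", "High", "Yes"),
    ("Overcast", "Hot", "Normal", "Yes"),
    ("Rainy", "Mild", "High", "No"),
]
FEATURES = {"Outlook": 0, "Temperature": 1, "Humidity": 2}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of PlayTennis minus the weighted entropy after splitting on a column."""
    labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


gains = {name: information_gain(ROWS, col) for name, col in FEATURES.items()}
for name, gain in gains.items():
    print(f"IG({name}) = {gain:.3f}")

best_feature = max(gains, key=gains.get)
print("Feature with the highest information gain:", best_feature)
```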

### Conclusion

The attribute with the highest Information Gain is chosen as the root node for the decision tree. In this example, the attribute with the highest Information Gain is **Outlook**. This process is then repeated for each branch until the stopping criteria are met.

This simplified example illustrates the calculation of entropy and information gain, which are fundamental to constructing a decision tree.


## Decision Trees: Assumptions, Bias-Variance Trade-off, Advantages, and Disadvantages

### Assumptions Behind Decision Trees

1. **Assumption of Sufficient Data**:
   - Decision Trees assume there is enough data to make multiple splits and ensure that each leaf node has a significant number of data points to reduce overfitting.

2. **No Independence Assumption Between Features**:
   - Decision Trees do not assume independence between features. However, they can overfit to particular combinations of features if the dataset is small or the features are highly correlated.

3. **Homogeneity of Splits**:
   - Decision Trees assume that data can be partitioned into homogeneous subsets that maximize purity or information gain at each split.

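4. **Feature Relevance**:
   - Decision Trees do not require every feature to be relevant, since uninformative features tend not to be selected for splits. Irrelevant or noisy features can still be chosen by chance, especially deep in the tree where few samples remain, and this can affect the tree's structure and performance.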

### Bias-Variance Trade-off

#### Bias
- **High Bias**:
  - Simplistic models like shallow trees that do not capture the underlying data distribution well, leading to underfitting.
  
#### Variance
- **High Variance**:
  - Complex models like deep trees that capture noise in the training data, leading to overfitting.
  
#### Trade-off
- **Balance**:
  - The goal is to find the optimal depth of the tree that balances bias and variance to ensure good performance on both training and unseen test data.
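
The trade-off can be observed by varying `max_depth` and comparing training and test accuracy. The sketch below uses scikit-learn on the Iris dataset; the dataset, the 70/30 split, the depths tried, and the seeds are arbitrary choices for illustration, and on such a small, clean dataset the effect may be modest.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = []
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results.append((depth, train_acc, test_acc))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A very shallow tree typically scores poorly on both sets (high bias), while training accuracy keeps rising with depth even after test accuracy has stopped improving (high variance).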
  
### Advantages of Decision Trees

1. **Easy to Understand and Interpret**:
   - Decision Trees can be easily visualized and understood by humans. The rules are clear and can be easily explained.

2. **Requires Little Data Preparation**:
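   - No need for normalization or scaling of data. The algorithm can handle both numerical and categorical features, although some implementations (scikit-learn's, for example) require categorical features to be encoded numerically first.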

3. **Handles Nonlinear Relationships**:
   - Decision Trees can model complex and nonlinear relationships between features and the target variable.

4. **Versatile**:
   - Applicable for both classification and regression tasks.

5. **Feature Selection**:
   - Decision Trees perform implicit feature selection as they split the data based on the most important features first.

### Disadvantages and Drawbacks of Decision Trees

1. **Overfitting**:
   - Decision Trees can easily overfit, especially if they are deep. They tend to capture noise in the data and perform poorly on unseen data.

2. **Unstable**:
   - Small changes in the data can lead to a completely different tree structure, making them highly sensitive to data variations.

3. **Biased to Dominant Classes**:
   - If some classes dominate, the decision tree can become biased towards these classes.

4. **Complexity**:
   - Training cost grows with the number of features and samples, because candidate splits on every feature are evaluated at each node, and large trees become harder to interpret.

5. **Lack of Smoothness**:
   - Decision Trees partition the feature space with axis-aligned splits, so their decision boundaries are fragmented and step-like. In regression tasks, the prediction is piecewise constant rather than smooth.
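
The overfitting drawback listed above is commonly mitigated by pruning. One option, sketched here under the assumption that scikit-learn's minimal cost-complexity pruning is available (the `ccp_alpha` parameter and `cost_complexity_pruning_path` method), is to fit one tree per candidate `ccp_alpha` and compare tree size with test accuracy. The dataset, split, and seeds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate alphas at which successive subtrees are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

pruned = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    model = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    model.fit(X_train, y_train)
    pruned.append((alpha, model.get_n_leaves(), model.score(X_test, y_test)))

for alpha, n_leaves, test_acc in pruned:
    print(f"ccp_alpha={alpha:.4f}: leaves={n_leaves}, test={test_acc:.3f}")
```

Larger values of `ccp_alpha` remove more of the tree. In practice the value would be chosen by cross-validation rather than by looking at the test set as this sketch does.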

### Conclusion

Decision Trees are powerful tools for both classification and regression tasks. They are easy to understand and interpret, require little data preparation, and can handle nonlinear relationships. However, they are prone to overfitting, can be unstable, and may become complex with increasing features. Balancing the depth of the tree is crucial to managing the bias-variance trade-off effectively.


```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree Classifier limited to depth 3
tree_classifier = DecisionTreeClassifier(max_depth=3)
tree_classifier.fit(X_train, y_train)

# Predict the labels of test data
y_pred = tree_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Decision Tree model:", accuracy)

# Print the learned decision rules as text
tree_rules = export_text(tree_classifier, feature_names=iris['feature_names'])
print(tree_rules)
```

```
Accuracy of the Decision Tree model: 1.0
|--- petal length (cm) <= 2.45
|   |--- class: 0
|--- petal length (cm) >  2.45
|   |--- petal length (cm) <= 4.75
|   |   |--- petal width (cm) <= 1.65
|   |   |   |--- class: 1
|   |   |--- petal width (cm) >  1.65
|   |   |   |--- class: 2
|   |--- petal length (cm) >  4.75
|   |   |--- petal width (cm) <= 1.75
|   |   |   |--- class: 1
|   |   |--- petal width (cm) >  1.75
|   |   |   |--- class: 2
```

