###
## Decision Tree Parameters
train and test accuracy
###

## 1. criterion

Function to measure the quality of a split.

Options:

- "gini" (Gini Impurity)
- "entropy" (Information Gain)
- "log_loss" (Log Loss)

## 1.1 Gini Impurity

### What is it?
Gini Impurity is a measure of how "impure" or "mixed" a group of data is. It is the probability that a randomly picked item from the group would be labeled incorrectly if its label were drawn at random according to the class proportions in the group.

- A **Gini Impurity of 0** means the group is completely pure (all items belong to the same class).  
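- A higher Gini value means the group is more mixed. The maximum is $1 - 1/k$ for $k$ classes (0.5 for two classes), so the value approaches 1 only when there are many classes.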

---

### How is it calculated?
The formula for Gini Impurity is:

$$\text{Gini} = 1 - \sum_{i} p_i^2$$

Where:
- $p_i$ = proportion of items in the group belonging to class $i$.

---

### Example
Imagine we have a basket with 10 fruits:
- 7 Apples
- 3 Oranges

#### Step 1: Calculate the proportions
- Proportion of apples : 7/10 = 0.7
- Proportion of oranges : 3/10 = 0.3

#### Step 2: Plug into the formula

$$\text{Gini} = 1 - (p_{\text{apples}}^2 + p_{\text{oranges}}^2)$$

$$\text{Gini} = 1 - (0.7^2 + 0.3^2)$$

$$\text{Gini} = 1 - (0.49 + 0.09) = 1 - 0.58 = 0.42$$
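The arithmetic above can be checked with a short Python sketch. The helper name `gini_impurity` and its count-based input are assumptions made for illustration; they are not part of any library.

```python
def gini_impurity(counts):
    """Return the Gini impurity for a list of per-class item counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty group counts as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


basket = [7, 3]  # 7 apples, 3 oranges
print(round(gini_impurity(basket), 2))  # 0.42
```

Passing `[10, 0]` (only apples) returns 0, the pure case.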


---

### Interpretation
- A **Gini Impurity of 0.42** means the group is somewhat mixed but not completely random.
- If the basket had only apples (10 apples, 0 oranges), the Gini would be:


$$\text{Gini} = 1 - (1^2 + 0^2) = 1 - 1 = 0$$

This means the group is pure.

---

### Key Points
- Gini Impurity helps decide the best split when building a decision tree.
- **Lower Gini values** indicate purer groups, which are preferred.
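
To illustrate how Gini Impurity can guide the choice of a split, the sketch below scores two made-up splits of the 10-fruit basket by the size-weighted average impurity of their child groups. The splits and the function names are hypothetical and exist only for this example.

```python
def gini_impurity(counts):
    """Gini impurity for a list of per-class item counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Size-weighted average Gini impurity over the child groups of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini_impurity(child) for child in children)


parent = [7, 3]             # [apples, oranges]
split_a = [[6, 0], [1, 3]]  # hypothetical split: one side is all apples
split_b = [[4, 2], [3, 1]]  # hypothetical split: both sides stay mixed

print(round(gini_impurity(parent), 3))   # 0.42
print(round(weighted_gini(split_a), 3))  # 0.15
print(round(weighted_gini(split_b), 3))  # 0.417
```

Under this scoring, `split_a` would be preferred because it leaves the purest child groups.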

##

## 1.2 Entropy

### What is it?
Entropy measures the level of **disorder** or **uncertainty** in a group of data. It’s used in decision trees to evaluate the quality of splits.  
- A **low entropy** value (close to 0) means the group is **pure**, where all items belong to the same class (e.g., all "Yes" or all "No").
- A **high entropy** value means the group is **mixed**, containing a mix of different classes. For two classes the maximum is 1 (a 50/50 split); for $k$ classes it is $\log_2(k)$.

The goal in decision trees is to create splits that lower the entropy, making groups purer.

---

### Formula
The formula for entropy is:

$$\text{Entropy} = -\sum_{i} p_i \log_2(p_i)$$

Where:
- $p_i$ is the proportion of items in the group belonging to class $i$.

If $p_i = 0$, we define $p_i \log_2(p_i) = 0$ to avoid undefined values.

---

### Example
Imagine a basket of 10 fruits:
- 7 Apples
- 3 Oranges

#### Step 1: Calculate the proportions
- Proportion of apples ($p_{\text{apples}}$) = 7/10 = 0.7
- Proportion of oranges ($p_{\text{oranges}}$) = 3/10 = 0.3

#### Step 2: Plug into the formula

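$$\text{Entropy} = -\left(p_{\text{apples}} \log_2(p_{\text{apples}}) + p_{\text{oranges}} \log_2(p_{\text{oranges}})\right)$$

Substitute the values:

$$\text{Entropy} = -\left(0.7 \log_2(0.7) + 0.3 \log_2(0.3)\right)$$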

#### Step 3: Calculate the log2 values
Using approximate values:
- $\log_2(0.7) \approx -0.515$
- $\log_2(0.3) \approx -1.737$


Now calculate:

$$\text{Entropy} = -\left(0.7 \cdot (-0.515) + 0.3 \cdot (-1.737)\right)$$

$$\text{Entropy} \approx -(-0.36 - 0.52) = 0.88$$
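
The same calculation can be reproduced in Python. This is a minimal sketch; the function name `entropy` and the count-based input are illustrative assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) for a list of per-class item counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c == 0:
            continue  # treat 0 * log2(0) as 0
        p = c / total
        result -= p * math.log2(p)
    return result


print(round(entropy([7, 3]), 2))  # 0.88
```

A 50/50 group such as `[5, 5]` gives 1.0, the maximum for two classes.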

---

### Interpretation
- The **entropy of 0.88** means the group is **somewhat mixed** but not completely random.
- If the basket had **only apples** (10 apples, 0 oranges), the entropy would be:
  - $-\left(1 \cdot \log_2(1) + 0 \cdot \log_2(0)\right)$
  - $\text{Entropy} = -(1 \cdot 0 + 0) = 0$

This means the group is **completely pure**.

---

### Key Points
- **Entropy** measures **disorder** or **uncertainty** in the group.
- **Lower entropy** values indicate **purer** groups.
- Decision trees aim to split data into groups with **lower entropy** to make the classifications more certain.
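
The reduction in entropy achieved by a split is commonly called information gain. The sketch below computes it for a made-up split of the fruit basket; the split and the function names are hypothetical and serve only as an illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) for a list of per-class item counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [7, 3]              # [apples, oranges]
children = [[6, 0], [1, 3]]  # hypothetical split of the basket

print(round(information_gain(parent, children), 3))  # 0.557
```

A split that leaves the group unchanged has a gain of 0, so larger values indicate more useful splits.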

##

## 1.3 Log Loss (Logarithmic Loss)
###
**What is it?**
Log Loss, also known as Logarithmic Loss, is a measure used to evaluate the performance of classification models, particularly in binary classification tasks (e.g., predicting "Yes" or "No"). It penalizes predicted probabilities according to how far they are from the actual labels, with confident wrong predictions penalized most heavily. As a split criterion in scikit-learn's decision trees, "log_loss" is equivalent to "entropy": both measure Shannon information gain.

- Low Log Loss values indicate that the predicted probabilities are close to the true labels.
- High Log Loss values indicate that the model is making poor predictions, with a greater difference between predicted probabilities and actual labels.

**Formula**

The formula for Log Loss in binary classification is:

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Where:
- $N$ = the number of data points (samples).
- $y_i$ = the actual label of the $i$-th sample (either 0 or 1).
- $p_i$ = the predicted probability of the positive class for the $i$-th sample (the probability that $y_i = 1$).
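
As a rough sketch of this formula in Python, assuming the natural logarithm and a small clipping constant `eps` to avoid `log(0)` (both are assumptions, as are the function name and the sample predictions):

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log loss; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.9]  # hypothetical predictions close to the labels
unsure = [0.6, 0.4, 0.5, 0.6]     # hypothetical predictions near 0.5

print(binary_log_loss(y_true, confident))
print(binary_log_loss(y_true, unsure))
```

The confident, correct predictions produce the lower loss of the two.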

##
## 2. max_depth
###
Max Depth refers to the maximum number of levels or layers a decision tree can have. It determines how deep the tree can grow, where each level represents a decision based on the data features.

- Shallow trees (low max depth) tend to underfit the data, as they don't capture enough complexity.
- Deep trees (high max depth) can overfit the data, meaning they might perform well on training data but poorly on unseen data because they capture noise and outliers.

**Why is it important?**

Max Depth is a crucial hyperparameter in decision tree models. By controlling the tree's depth, you balance underfitting and overfitting:

- Underfitting happens when the tree is too shallow and doesn't capture enough patterns in the data.
- Overfitting happens when the tree is too deep and fits noise or small fluctuations in the training data.
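
The effect of depth on overfitting can be illustrated with a small pure-Python sketch. It grows a tree on a single numeric feature using Gini-based splits and a `max_depth` limit. The dataset is hypothetical (labels follow "pass if more than 5 study hours", with two deliberately flipped labels as noise), and the tree builder is a simplified illustration, not how any particular library implements it.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(xs, ys, max_depth, depth=0):
    """Return a leaf (majority label) or a (threshold, left, right) tuple."""
    majority = Counter(ys).most_common(1)[0][0]
    if depth == max_depth or len(set(ys)) == 1:
        return majority
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate rule: x <= t
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:  # all feature values identical: cannot split
        return majority
    t = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x <= t]
    hi = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (
        t,
        build_tree([x for x, _ in lo], [y for _, y in lo], max_depth, depth + 1),
        build_tree([x for x, _ in hi], [y for _, y in hi], max_depth, depth + 1),
    )


def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree


def accuracy(tree, xs, ys):
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(ys)


# Hypothetical training data: labels at 3 and 8 hours are flipped (noise).
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
passed = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
# Hypothetical held-out data following the clean rule.
test_hours = [2.5, 4.5, 5.5, 7.5, 9.5]
test_passed = [0, 0, 1, 1, 1]

shallow = build_tree(hours, passed, max_depth=1)
deep = build_tree(hours, passed, max_depth=3)
print(accuracy(shallow, hours, passed), accuracy(shallow, test_hours, test_passed))  # 0.8 1.0
print(accuracy(deep, hours, passed), accuracy(deep, test_hours, test_passed))        # 1.0 0.6
```

On this toy data the deeper tree memorizes the two noisy labels: training accuracy rises to 1.0 while held-out accuracy falls. Real datasets will behave differently, so the depth is usually tuned by comparing train and test accuracy.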

**Scenario 1: Max Depth = 1 (Shallow Tree)**

Suppose we are predicting whether students pass or fail an exam from their study hours. If we set max_depth = 1, the tree can make only one decision. It may split on the study hours feature at some threshold (say, 4 hours of study) to classify the students.

The decision tree might look like this:

Study Hours <= 4?  
    |  
   Yes -> Fail (0)  
   No -> Pass (1)

- This tree doesn't capture the full complexity of the dataset. It classifies students with 4 or fewer study hours as failing and those with more than 4 study hours as passing.

- Underfitting: The model is too simple, and it doesn't distinguish between the students who studied 2 hours and 4 hours, for example.
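
In code, the depth-1 tree above amounts to a single comparison. The function name and default threshold below are illustrative, taken from the 4-hour example.

```python
def predict_stump(study_hours, threshold=4):
    """Depth-1 tree: one comparison decides the class (0 = Fail, 1 = Pass)."""
    return 0 if study_hours <= threshold else 1


for h in (2, 4, 5, 9):
    print(h, predict_stump(h))
```

Students with 2 and 4 study hours receive the same prediction, which is the lack of distinction described above.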

**Scenario 3: Max Depth = 3 (Very Deep Tree)**

If we set max_depth = 3, the tree can grow even deeper and may end up overfitting the data:

Study Hours <= 4?  
    |  
   Yes -> Fail (0)  
   No -> Study Hours <= 6?  
          |  
        Yes -> Pass (1)  
        No -> Study Hours <= 8?  
               |  
             Yes -> Pass (1)  
             No -> Pass (1)

- The tree can fit the training data perfectly, effectively classifying each student by their exact study hours.
- Overfitting: The model is too complex and will likely perform poorly on new, unseen data because it's too specific to the training data.

**When to use a low value:**
- To prevent overfitting on small or noisy datasets.
- When interpretability is important (e.g., simple models for decision-making).
- For simple datasets with few features or little complexity.

**When to use a high value:**
- For complex datasets with high feature interaction or large amounts of data.
- When you prioritize high accuracy and are less concerned about overfitting.
- In ensemble methods like Random Forests, where overfitting is controlled by averaging across trees.

