# Decision Tree Classifier: Mathematical Intuition and Working

## Introduction

A **Decision Tree** is a supervised machine learning algorithm that can be used for both **classification** and **regression** problems.  
In this document, we'll focus on **classification**.

Decision Trees follow a structure similar to nested `if-elif-else` conditions in programming.  
They split the data on feature conditions until each resulting subset is **pure** (contains only one class).

---

## ⚙️ Example: Simple Logical Structure

```python
age = 14

# Branches are checked in order, so the second test only needs an upper bound.
if age <= 14:
    print("The person is in school")
elif age <= 21:
    print("The person is in college")
else:
    print("The person has completed college")
```

## 🌿 Decision Tree Structure

The **root node** represents the first feature used for splitting (e.g., `age`).  
Each **internal node** represents a condition (e.g., `age <= 15`).  
Each **leaf node** represents a final decision/output (e.g., `"school"`).

---

### Example



         Age <= 15
          /      \
      Yes          No
   "School"    Age <= 21
                 /     \
             Yes         No
          "College"   "Graduated"




A Decision Tree continues splitting until it reaches **pure leaf nodes** (where all samples belong to the same class).
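
The tree drawn above can be sketched in code as a nested dictionary plus a small traversal function. The dictionary layout and the `predict` helper are illustrative assumptions made for this example, not a format used by any library.

```python
# Hypothetical nested-dict encoding of the example tree (not a library format).
tree = {
    "feature": "age",
    "threshold": 15,
    "yes": "School",              # leaf: age <= 15
    "no": {
        "feature": "age",
        "threshold": 21,
        "yes": "College",         # leaf: 15 < age <= 21
        "no": "Graduated",        # leaf: age > 21
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, dict):  # internal node: evaluate its condition
        branch = "yes" if sample[node["feature"]] <= node["threshold"] else "no"
        node = node[branch]
    return node                    # a plain string is a leaf


print(predict(tree, {"age": 14}))  # School
print(predict(tree, {"age": 19}))  # College
print(predict(tree, {"age": 30}))  # Graduated
```

Each prediction follows exactly one root-to-leaf path, which is why a trained tree behaves like a chain of nested conditions.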

---

## 🔍 Types of Decision Tree Algorithms

| Type | Full Form | Splitting Type | Library |
|------|------------|----------------|----------|
| **ID3** | Iterative Dichotomiser 3 | Multi-way split (one branch per category) | Not implemented in `sklearn` |
| **CART** | Classification and Regression Trees | **Binary splits only** | ✅ Used in `sklearn` |

---

## ⚙️ Example Dataset: Play Tennis

| Outlook  | Temperature | Humidity | Wind  | PlayTennis |
|-----------|-------------|-----------|--------|-------------|
| Sunny     | Hot         | High      | Weak   | No          |
| Sunny     | Hot         | High      | Strong | No          |
| Overcast  | Hot         | High      | Weak   | Yes         |
| Rain      | Mild        | High      | Weak   | Yes         |
| Rain      | Cool        | Normal    | Weak   | Yes         |
| Rain      | Cool        | Normal    | Strong | No          |
| Overcast  | Cool        | Normal    | Strong | Yes         |
| Sunny     | Mild        | High      | Weak   | No          |
| Sunny     | Cool        | Normal    | Weak   | Yes         |
| Rain      | Mild        | Normal    | Weak   | Yes         |
| Sunny     | Mild        | Normal    | Strong | Yes         |
| Overcast  | Mild        | High      | Strong | Yes         |
| Overcast  | Hot         | Normal    | Weak   | Yes         |
| Rain      | Mild        | High      | Strong | No          |

Here, the **target variable** is `PlayTennis (Yes/No)` and the independent features are **Outlook**, **Temperature**, **Humidity**, and **Wind**.

---

## 🧩 Step 1: Initial Split Example

Let's assume we choose **Outlook** as the root node.

Unique values in `Outlook`:
- Sunny  
- Overcast  
- Rain

So, we split the dataset into **3 branches**.

| Outlook | Yes | No |
|----------|-----|----|
| Sunny    | 2   | 3  |
| Overcast | 4   | 0  |
| Rain     | 3   | 2  |
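
The counts in this table can be reproduced with a few lines of Python. This is a minimal sketch: the `rows` list simply copies the Outlook and PlayTennis columns from the dataset above, and the variable names are chosen for illustration.

```python
from collections import Counter, defaultdict

# (Outlook, PlayTennis) pairs copied from the Play Tennis table, in row order.
rows = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]

# One Counter of class labels per branch of the split.
branches = defaultdict(Counter)
for outlook, label in rows:
    branches[outlook][label] += 1

for outlook, counts in branches.items():
    # A Counter returns 0 for labels it has never seen (e.g. "No" for Overcast).
    print(outlook, counts["Yes"], counts["No"])
```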

---

## 🌗 Pure vs Impure Splits

- **Pure Split:** Only one class (e.g., all "Yes" or all "No")  
- **Impure Split:** Contains a mix of classes (e.g., 2 "Yes" and 3 "No")

Example:
- `Overcast → Pure (All Yes)`
- `Sunny → Impure`
- `Rain → Impure`

The **Decision Tree** keeps splitting impure nodes until all become pure (leaf nodes).

---

## 🧮 Step 2: Measuring Purity with Entropy & Gini Impurity

To mathematically quantify **purity**, we use two metrics:

1. **Entropy**
2. **Gini Impurity**

---

### 🔹 Entropy

Entropy measures the **impurity or randomness** in a dataset.

$$
Entropy(S) = -p_{yes} \log_2(p_{yes}) - p_{no} \log_2(p_{no})
$$

Where:
- \( p_{yes} \): proportion of "Yes" outcomes  
- \( p_{no} \): proportion of "No" outcomes

**For two classes, entropy ranges from 0 to 1:**
- 0 → Pure (all samples belong to one class)
- 1 → Maximally impure (a 50/50 mix of the two classes)
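
As a quick check of the formula, consider a node with 2 "Yes" and 3 "No" samples, such as the Sunny branch of the Outlook split. The arithmetic below is a worked illustration, rounded to three decimals:

$$
Entropy(S_{Sunny}) = -\tfrac{2}{5}\log_2\left(\tfrac{2}{5}\right) - \tfrac{3}{5}\log_2\left(\tfrac{3}{5}\right) \approx 0.529 + 0.442 = 0.971
$$

The result is close to 1, which fits the intuition that a 2-vs-3 mix is almost as impure as an even split.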

---

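### 🔹 Gini Impurity

Gini measures the **probability of incorrectly classifying** a randomly chosen element when its label is drawn at random from the class distribution of the node.

$$
Gini(S) = 1 - (p_{yes}^2 + p_{no}^2)
$$

Like Entropy, a lower value means a purer node:
- 0 → Pure
- 0.5 → Maximally impure for two classes (a 50/50 mix)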
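
For a concrete number, take a node with 2 "Yes" and 3 "No" samples, such as the Sunny branch of the Outlook split. This is a worked illustration of the formula:

$$
Gini(S_{Sunny}) = 1 - \left( \left(\tfrac{2}{5}\right)^2 + \left(\tfrac{3}{5}\right)^2 \right) = 1 - (0.16 + 0.36) = 0.48
$$

The value sits just below 0.5, the largest Gini impurity a two-class node can have.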

---

## 💡 Step 3: Selecting the Best Feature with Information Gain

To decide **which feature to split on**, we use **Information Gain (IG)**.

$$
IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)
$$

Where:
- \( S \): entire dataset  
- \( A \): attribute being split  
- \( S_v \): subset of \( S \) where attribute \( A \) takes the value \( v \)  
- \( \frac{|S_v|}{|S|} \): proportion (weight) of samples that fall into \( S_v \)

➡️ A **higher Information Gain** means the feature provides a **better split** (more pure subsets).
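
The formula can be applied directly to the Outlook split from Step 1. The sketch below assumes the class counts from that table (9 Yes / 5 No overall); the helper name `entropy` is chosen for this example only.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits for a list of per-class sample counts."""
    total = sum(counts)
    # Empty classes are skipped: 0 * log2(0) is treated as 0.
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


# [Yes, No] counts taken from the Outlook split table.
parent = [9, 5]
children = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}

n = sum(parent)
# Each child's entropy is weighted by the share of samples it receives.
weighted = sum(sum(c) / n * entropy(c) for c in children.values())
gain = entropy(parent) - weighted

print(round(entropy(parent), 3))  # 0.94
print(round(weighted, 3))         # 0.694
print(round(gain, 3))             # 0.247
```

The pure Overcast branch contributes nothing to the weighted entropy, which is a large part of why splitting on Outlook pays off.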

---

## 🧠 Summary

| Concept | Purpose | Formula |
|----------|----------|----------|
| **Entropy** | Measure impurity | \( -p_1 \log_2(p_1) - p_2 \log_2(p_2) \) |
| **Gini Impurity** | Alternative impurity measure | \( 1 - (p_1^2 + p_2^2) \) |
| **Information Gain** | Choose best feature | \( Entropy(S) - \sum \frac{\lvert S_v \rvert}{\lvert S \rvert} Entropy(S_v) \) |

---

✅ In the **next section**, we'll calculate these values step-by-step using the "Play Tennis" dataset and Python code!


# Decision Tree Classifier: Entropy and Gini Impurity

In this session, we'll discuss two important concepts used in Decision Trees:

* **Purity:** how to check whether a split is pure or not
* **Measures of impurity:** *Entropy* and *Gini Impurity*

---

## 1. Entropy

Entropy is a measure of impurity or randomness in a dataset.

For a binary classification problem, the formula for entropy is:

$$
H(S) = -p_{+} \log_2(p_{+}) - p_{-} \log_2(p_{-})
$$

Where:

* \( p_{+} \): Probability of being in the **positive** class (e.g., class = 1)
* \( p_{-} \): Probability of being in the **negative** class (e.g., class = 0)

For a **multi-class** classification problem, the formula generalizes to:

$$
H(S) = - \sum_{i=1}^{n} p_i \log_2(p_i)
$$

where \( n \) is the number of classes.
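
A minimal Python sketch of the multi-class formula, assuming the labels arrive as a plain list; the function name `entropy` is chosen for illustration. Because `Counter` only holds classes that actually occur, the code never evaluates `log2(0)`.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a sequence of class labels (any number of classes)."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


print(entropy(["a", "a", "b", "b"]))  # 1.0  (two balanced classes)
print(entropy(["a", "b", "c", "d"]))  # 2.0  (four balanced classes)
print(entropy(["a", "a", "a"]))       # 0.0  (pure)
```

With more than two classes the entropy can exceed 1: for \( n \) equally likely classes it equals \( \log_2 n \).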

---

### Example: Binary Split

Suppose we have a dataset that splits as follows:

| Node | Yes | No | Total |
| ---- | --- | -- | ----- |
| C1   | 3   | 3  | 6     |
| C2   | 3   | 0  | 3     |

#### Entropy for Node C1

$$
H(C1) = -p_{+} \log_2(p_{+}) - p_{-} \log_2(p_{-})
$$

Substitute values:

$$
p_{+} = \frac{3}{6} = 0.5, \quad p_{-} = \frac{3}{6} = 0.5
$$

Hence,

$$
H(C1) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1
$$

So this is a **completely impure split** (maximum entropy).

---

#### Entropy for Node C2

$$
p_{+} = \frac{3}{3} = 1, \quad p_{-} = 0
$$

$$
H(C2) = -1 \log_2(1) - 0 \log_2(0)
$$

Since \( \log_2(1) = 0 \) and, by convention, \( 0 \times \log_2(0) = 0 \) (the limit of \( p \log_2 p \) as \( p \to 0 \)):

$$
H(C2) = 0
$$

This is a **pure split**.

---

### Entropy Curve

When we plot entropy against the probability of the positive class \( p_{+} \):

* X-axis: \( p_{+} \) (0 to 1)
* Y-axis: \( H(S) \) (0 to 1)

The graph looks like this:

```
H(S)
 ↑
 |        *
 |      *   *
 |    *       *
 |  *           *
 |*               *
 +-----------------------→ p(+)
  0      0.5      1
```

* Entropy is **maximum (1)** when \( p_{+} = 0.5 \)
* Entropy is **minimum (0)** when \( p_{+} = 0 \) or \( p_{+} = 1 \)

Hence, for binary classification, **entropy ranges from 0 to 1.**
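
The shape of the curve can be checked numerically. The sketch below tabulates binary entropy at a few probabilities; `binary_entropy` is a helper name introduced for this example.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-class node with positive-class probability p."""
    if p in (0.0, 1.0):  # pure node: 0 * log2(0) is taken as 0
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"{p:4.2f}  {binary_entropy(p):.3f}")
```

The printed values rise from 0 to a peak of 1 at `p = 0.5` and fall back symmetrically, so `p` and `1 - p` always give the same entropy.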

---

## 2. Gini Impurity

Gini impurity measures how often a randomly chosen element would be incorrectly labeled if its label were drawn at random from the class distribution of the node.

The formula is:

$$
G(S) = 1 - \sum_{i=1}^{n} p_i^2
$$

For binary classification:

$$
G(S) = 1 - (p_{+}^2 + p_{-}^2)
$$

---

### Example

#### Case 1: Node C1 (3 Yes, 3 No)

$$
p_{+} = 0.5, \quad p_{-} = 0.5
$$

$$
G(C1) = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5
$$

So for two classes the **maximum Gini impurity** is **0.5**, reached at a completely impure (50/50) split.

---

#### Case 2: Node C2 (3 Yes, 0 No)

$$
p_{+} = 1, \quad p_{-} = 0
$$

$$
G(C2) = 1 - (1^2 + 0^2) = 0
$$

Hence, this is a **pure split**.
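
Both cases can be confirmed with a short function. This is a sketch that takes per-class counts as input; the name `gini` and the four-class example are additions for illustration.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(gini([3, 3]))        # 0.5   (node C1)
print(gini([3, 0]))        # 0.0   (node C2)
print(gini([1, 1, 1, 1]))  # 0.75  (four balanced classes)
```

The last line shows that the 0.5 ceiling is specific to two classes: with \( n \) balanced classes the impurity is \( 1 - 1/n \).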

---

### Gini Impurity Curve

When plotted:

```
G(S)
 ↑
 |       *
 |     *   *
 |   *       *
 | *           *
 +-----------------------→ p(+)
 0      0.5      1
```

* Gini impurity ranges from **0 to 0.5**
* Maximum impurity (0.5) occurs when both classes are equally mixed
* Minimum impurity (0) occurs when the node is completely pure

---

## 3. Entropy vs Gini Impurity: Comparison

| Property       | Entropy                                          | Gini Impurity                          |
| -------------- | ------------------------------------------------ | -------------------------------------- |
| Formula        | \( -\sum p_i \log_2(p_i) \)                      | \( 1 - \sum p_i^2 \)                   |
| Range (binary) | 0 to 1                                           | 0 to 0.5                               |
| Interpretation | Measures information (bits) required to classify | Measures misclassification probability |
| Computation    | Slightly more expensive (needs a logarithm)      | Faster to compute                      |
| Used In        | ID3, C4.5, C5.0                                  | CART                                   |

---

## 4. Information Gain (Preview)

Once we compute entropy or Gini impurity for each split, the next step is to decide **which feature to split on**.

This is done using **Information Gain (IG)**.

Information Gain helps us determine which feature gives the **maximum reduction in impurity** after splitting.

In the next section, we'll derive the **Information Gain** formula and see how it helps in feature selection during Decision Tree construction.

---

**Summary:**

* Entropy → Range: 0 to 1 (two classes)
* Gini Impurity → Range: 0 to 0.5 (two classes)
* Both measure impurity; a lower value means a purer split.
* Used to decide when to stop splitting and how to choose the best features.

---




# Decision Tree Classifier: Information Gain

In the previous discussion, we covered **Entropy** and **Gini Impurity** (also called the **Gini Index**), both of which help us **measure the purity of a split** in a Decision Tree.

Now, let's understand **Information Gain**, which helps us **decide which feature to split on**.

---

## 1. What Is Information Gain?

When we have multiple features (say \( F_1, F_2, F_3 \)), we need to decide **which feature to use first** for splitting the tree.

The **Information Gain (IG)** helps us determine this by measuring **the reduction in entropy** after the dataset is split on a feature.

---

### Formula for Information Gain

$$
Gain(S, \text{feature}) = H(S) - \sum_{v \in \text{Values(feature)}} \frac{|S_v|}{|S|} \, H(S_v)
$$

Where:

* \( H(S) \): Entropy of the node being split (the **root node** for the first split)
* \( S_v \): Subset of data for which feature = \( v \)
* \( H(S_v) \): Entropy of subset \( S_v \)
* \( \frac{|S_v|}{|S|} \): Weight of subset \( S_v \), i.e. the fraction of samples that fall into it
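
The formula translates almost line for line into code. The sketch below assumes each node is described by a list of per-class counts; `entropy` and `information_gain` are names chosen for this example, and the two test splits are made up to show the extremes.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a node given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """H(parent) minus the size-weighted average of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


# A split that separates the classes perfectly recovers all of H(S).
print(information_gain([5, 5], [[5, 0], [0, 5]]))  # 1.0
# A split that leaves the class mix unchanged gains nothing.
print(information_gain([4, 4], [[2, 2], [2, 2]]))  # 0.0
```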

---

## 2. Example: Calculating Information Gain

Suppose we have a dataset with **feature F1** and the following class distribution:

| Node                | Yes | No | Total |
| ------------------- | --- | -- | ----- |
| Root                | 9   | 5  | 14    |
| C1 (after F1 split) | 6   | 2  | 8     |
| C2 (after F1 split) | 3   | 3  | 6     |

---

### Step 1: Calculate Root Entropy

$$
H(S) = -p_{+} \log_2(p_{+}) - p_{-} \log_2(p_{-})
$$

Substitute:

$$
p_{+} = \frac{9}{14}, \quad p_{-} = \frac{5}{14}
$$

So:

$$
H(S) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right)
$$

$$
H(S) \approx 0.94
$$

---

### Step 2: Calculate Entropy for Each Child Node

#### For Category C1 (6 Yes, 2 No)

$$
H(C1) = -\frac{6}{8}\log_2\left(\frac{6}{8}\right) - \frac{2}{8}\log_2\left(\frac{2}{8}\right)
$$

$$
H(C1) \approx 0.81
$$

#### For Category C2 (3 Yes, 3 No)

Since this is a perfectly impure split (\( p_{+} = p_{-} = 0.5 \)):

$$
H(C2) = 1
$$

---

### Step 3: Calculate Weighted Average Entropy

$$
\sum_{v \in \text{Values(F1)}} \frac{|S_v|}{|S|} \, H(S_v)
$$

Substitute values:

$$
= \frac{8}{14} \times 0.81 + \frac{6}{14} \times 1
$$

$$
= 0.463 + 0.429 = 0.892
$$

---

### Step 4: Compute Information Gain

$$
Gain(S, F1) = H(S) - \sum_{v} \frac{|S_v|}{|S|}H(S_v)
$$

$$
Gain(S, F1) = 0.94 - 0.892 = 0.048
$$

So, the **Information Gain for Feature F1 ≈ 0.048**.

---

## 3. Comparing Features

Let's assume another feature \( F2 \) gives the following result after splitting:

$$
Gain(S, F2) > Gain(S, F1)
$$

Then we **choose \( F2 \)** as the **root split feature** for our Decision Tree, because it **reduces entropy the most**.
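
To make the comparison concrete, the sketch below computes the gain of every feature in the Play Tennis table from the first part of this document and picks the largest. The `DATA` rows are copied from that table; the function and variable names are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2

FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]
# Rows of the Play Tennis table; the last field is the PlayTennis label.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, col):
    """Entropy of all labels minus the weighted entropy after splitting on col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])  # group labels by feature value
    labels = [row[-1] for row in rows]
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


gains = {name: information_gain(DATA, i) for i, name in enumerate(FEATURES)}
for name, g in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {g:.3f}")

best = max(gains, key=gains.get)
print("Root split:", best)  # Root split: Outlook
```

Outlook comes out on top with a gain of about 0.247, followed by Humidity (about 0.152), Wind (about 0.048) and Temperature (about 0.029). Wind produces the same 6/2 and 3/3 child counts as the F1 example above.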

---

## 4. Conceptual Summary

| Concept          | Purpose                                              | Formula                  | Range |
| ---------------- | ---------------------------------------------------- | ------------------------ | ----- |
| Entropy          | Measures impurity (information required to classify) | \( -\sum p_i \log_2 p_i \) | 0–1 (two classes) |
| Gini Impurity    | Measures misclassification probability               | \( 1 - \sum p_i^2 \)       | 0–0.5 (two classes) |
| Information Gain | Measures reduction in impurity after split           | \( H(S) - \sum \frac{\lvert S_v \rvert}{\lvert S \rvert} H(S_v) \) | ≥ 0 |

---

## 5. Key Intuition

* Higher **Information Gain** → better split → higher purity in child nodes
* Decision Trees use Information Gain (or Gini Index) to **select features and build hierarchy**
* Entropy focuses on **information content**, Gini focuses on **misclassification**
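
The same "impurity before minus weighted impurity after" idea works with Gini in place of entropy. The sketch below applies it to the F1 split from the worked example (9/5 at the root, 6/2 and 3/3 in the children); `gini_gain` is a name introduced here for illustration, not a standard term from these notes.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent, children):
    """Drop in Gini impurity from a parent node to its size-weighted children."""
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)


# [Yes, No] counts for the F1 split used in the worked example.
print(round(gini_gain([9, 5], [[6, 2], [3, 3]]), 4))  # 0.0306
```

The absolute numbers differ from the entropy-based gain because the two measures use different scales, so gains should only be compared within one criterion.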

---

## 6. Next Topic

In the next discussion, we'll answer a common interview question:

> **When should we use Entropy, and when should we use Gini Impurity?**

---

**Summary:**

* Information Gain tells **which feature to split first**.
* It uses **Entropy** (or Gini Impurity) as a base.
* Decision Tree internally calculates IG for all features and **chooses the one with maximum gain**.

---

**Next Video:** *Entropy vs Gini: When to Use Which*

