# Choosing a split: Information Gain

In decision tree learning, the reduction in entropy achieved by a split is called information gain. It helps us decide which feature to split on: the one whose split offers maximum purity in the resulting branches.

### What is Uncertainty?
- Uncertainty in data refers to the unpredictability or disorder in a system.
- When data is perfectly pure (e.g., all examples belong to one class, such as $p = 1$), there is no uncertainty because the outcome is completely predictable.
- As the probabilities of the classes become more evenly distributed (closer to $p = 0.5$), the uncertainty increases because predicting the outcome becomes more difficult.

### What is Entropy?
- Entropy quantifies this uncertainty. It’s derived from information theory, where it measures the average amount of information needed to describe an outcome.
- A higher entropy value indicates higher uncertainty, while lower entropy suggests greater predictability.


## Entropy and Purity

### Entropy Formula
Entropy is calculated as:

$$
h(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)
$$
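
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `entropy` is chosen here, and it assumes the usual convention that $0 \log_2(0)$ is treated as $0$, so a perfectly pure node has zero entropy.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a node whose positive-class fraction is p."""
    if p == 0 or p == 1:
        return 0.0  # assumed convention: 0 * log2(0) counts as 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy is lowest at the pure ends and highest at p = 0.5.
for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"p = {p:.1f}  h(p) = {entropy(p):.2f}")
```

The explicit check for `p == 0` and `p == 1` avoids calling `math.log2(0)`, which would raise an error.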

### Comparing Two Cases

#### Case 1: $p = \frac{4}{5}$
For $p = \frac{4}{5}$, we have $1 - p = \frac{1}{5}$. Substituting into the entropy formula:

$$
h\left(\frac{4}{5}\right) = -\frac{4}{5} \log_2\left(\frac{4}{5}\right) - \frac{1}{5} \log_2\left(\frac{1}{5}\right)
$$

#### Case 2: $p = \frac{1}{5}$
For $p = \frac{1}{5}$, we have $1 - p = \frac{4}{5}$. Substituting into the entropy formula:

$$
h\left(\frac{1}{5}\right) = -\frac{1}{5} \log_2\left(\frac{1}{5}\right) - \frac{4}{5} \log_2\left(\frac{4}{5}\right)
$$

Notice that the terms in the two cases are identical, just swapped. Thus:

$$
h\left(\frac{4}{5}\right) = h\left(\frac{1}{5}\right)
$$

Numerically, this evaluates to approximately:

$$
h\left(\frac{4}{5}\right) = h\left(\frac{1}{5}\right) \approx 0.72
$$
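
As a quick numerical check of the two cases, the snippet below evaluates the entropy formula at both values. It is only a sketch; the helper name `h` mirrors the notation used above and is otherwise an arbitrary choice.

```python
import math

def h(p):
    """Binary entropy in bits, valid for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h_case_1 = h(4 / 5)
h_case_2 = h(1 / 5)

print(round(h_case_1, 2), round(h_case_2, 2))  # 0.72 0.72
print(math.isclose(h_case_1, h_case_2))        # True
```

`math.isclose` is used instead of `==` because the two results are computed in a different order and may differ in the last floating-point digit.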

### Entropy and Purity
1. **Purity**: Refers to how concentrated the data is in a single class. Higher purity occurs when $p$ is close to $0$ or $1$.
   - For $p = \frac{4}{5}$, the data is relatively pure (80% of one class, 20% of the other).
   - For $p = \frac{1}{5}$, the data is also relatively pure (20% of one class, 80% of the other).

2. **Entropy**: Measures the level of uncertainty or "mixing" in the data.
   - For $p = \frac{4}{5}$, there is some uncertainty because 20% belongs to the minority class.
   - For $p = \frac{1}{5}$, there is the same level of uncertainty because 20% belongs to the minority class (just flipped).

Since entropy is symmetric ($h(p) = h(1 - p)$), it assigns the same value to both cases.

### Key Takeaway
Entropy reflects **uncertainty** in the data, not purity directly. For distributions $p = \frac{4}{5}$ and $p = \frac{1}{5}$, the uncertainty is the same, even though the dominant class differs. This symmetry ensures that entropy behaves consistently regardless of which class is more frequent.



# Choosing a split

Suppose we split on the ear shape feature:

<img src="attachment:8e7cc41d-fe51-44b1-bab6-ef0d5832e71d.png" width="500">

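The fraction of positive examples in each branch is:

$p^{1}_{left} = \frac{4}{5} = 0.8$

$p^{1}_{right} = \frac{1}{5} = 0.2$

The entropy of each branch is then:

$H(p^{1}_{left}) = H(0.8) \approx 0.72$

$H(p^{1}_{right}) = H(0.2) \approx 0.72$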


Now let us calculate the weighted average of the two branch entropies. Each branch receives 5 of the 10 examples, so both weights are $\frac{5}{10}$:

$\frac{5}{10}H(0.8) + \frac{5}{10}H(0.2) \approx 0.72$
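
The weighted average for the ear shape split can be reproduced as follows. This is a sketch under the counts stated above (5 examples per branch, with 4 and 1 positive examples respectively); the variable names are illustrative.

```python
import math

def entropy(p):
    """Binary entropy in bits; 0 * log2(0) is treated as 0 (assumed convention)."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Counts for the ear shape split, taken from the text above.
left_pos, left_total = 4, 5
right_pos, right_total = 1, 5
n = left_total + right_total

# Each branch entropy is weighted by the share of examples it receives.
weighted = (left_total / n) * entropy(left_pos / left_total) \
         + (right_total / n) * entropy(right_pos / right_total)

print(round(weighted, 2))  # 0.72
```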

### Let us look at another feature: face shape

<img src="attachment:d356bf8f-ba86-410c-a4d7-664ad822cbfd.png" width="500">



$p^{2}_{left} = \frac{4}{7} \approx 0.57$

$p^{2}_{right} = \frac{1}{3} \approx 0.33$

The entropy of each branch is then:

$H(p^{2}_{left}) = H(0.57) \approx 0.99$

$H(p^{2}_{right}) = H(0.33) \approx 0.92$

Now let us calculate the weighted average. The left branch receives 7 of the 10 examples (4 of them positive) and the right branch receives 3 of the 10 (1 of them positive):

$\frac{7}{10}H(0.57) + \frac{3}{10}H(0.33) \approx 0.97$

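**Instead of reporting the weighted average entropy itself, we calculate the reduction in entropy relative to the root node.**

At the root node, we started with 5 cats and 5 dogs, so the fraction of positive examples is $\frac{5}{10} = 0.5$ and:

$H(\frac{1}{2}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$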

For the split on ear shape:

$H(0.5) - (\frac{5}{10}H(0.8) + \frac{5}{10}H(0.2)) \approx 0.28$


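For the split on face shape, the left branch holds 7 of the 10 examples with $p = \frac{4}{7} \approx 0.57$ and the right branch holds 3 of the 10 with $p = \frac{1}{3} \approx 0.33$:

$H(0.5) - (\frac{7}{10}H(0.57) + \frac{3}{10}H(0.33)) \approx 0.03$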

> These final values are called information gain: the reduction in entropy achieved by the split.
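
The two information gain values can be reproduced directly from the class counts. The snippet below is a sketch: it assumes the counts given in the text (a 5/5 root, 4 of 5 and 1 of 5 positives for ear shape, 4 of 7 and 1 of 3 positives for face shape), and the variable names are illustrative.

```python
import math

def entropy(p):
    """Binary entropy in bits; 0 * log2(0) is treated as 0 (assumed convention)."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy at the root node: 5 positive examples out of 10.
root_entropy = entropy(5 / 10)

# Ear shape: 5 examples per branch, 4 and 1 of them positive.
ear_gain = root_entropy - (5 / 10 * entropy(4 / 5) + 5 / 10 * entropy(1 / 5))

# Face shape: 7 and 3 examples per branch, 4 and 1 of them positive.
face_gain = root_entropy - (7 / 10 * entropy(4 / 7) + 3 / 10 * entropy(1 / 3))

print(f"ear shape:  {ear_gain:.2f}")   # ear shape:  0.28
print(f"face shape: {face_gain:.2f}")  # face shape: 0.03
```

Ear shape gives the larger reduction in entropy, so of these two features it is the better choice for the split.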

## General Formula for Information Gain

Let:

$p_1^{left}$ = the fraction of examples in the left branch that are positive

$p_1^{right}$ = the fraction of examples in the right branch that are positive

$w^{left}$ = the fraction of examples at the root node that went into the left branch

$w^{right}$ = the fraction of examples at the root node that went into the right branch

$p_1^{root}$ = the fraction of examples at the root node that are positive


$$ \text{Information Gain} =  H(p_1^{root}) - (w^{left} H(p_1^{left}) + w^{right} H(p_1^{right})) $$
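
One way to turn the general formula into code is sketched below. The function `information_gain` and its arguments are hypothetical names chosen for this example. The per-example arrangement of labels and feature values is made up; only the branch counts (4 of 5 and 1 of 5 for ear shape, 4 of 7 and 1 of 3 for face shape) come from the text.

```python
import math

def entropy(p):
    """Binary entropy in bits; 0 * log2(0) is treated as 0 (assumed convention)."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(labels, goes_left):
    """Information gain of a binary split.

    labels:    list of 0/1 class labels at the node (1 = positive).
    goes_left: list of booleans, True if the example goes to the left branch.
    """
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    if not left or not right:
        return 0.0  # a split that sends everything one way gains nothing

    p_root = sum(labels) / len(labels)
    w_left = len(left) / len(labels)
    w_right = len(right) / len(labels)
    p_left = sum(left) / len(left)
    p_right = sum(right) / len(right)
    return entropy(p_root) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# Illustrative data: 10 examples, 5 positive and 5 negative.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
features = {
    "ear shape": [True] * 5 + [False] * 5,
    "face shape": [True, True, True, True, True, False, True, True, False, False],
}

gains = {name: information_gain(labels, mask) for name, mask in features.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}")
print("best split:", best)
```

Selecting the feature with the highest gain, as `max` does here, is the decision rule this section builds toward.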