<a href="https://colab.research.google.com/github/gustavovazquez/ML/blob/main/ML_Demo_de_c%C3%A1lculo_de_Entropia_y_Ganancia_de_informaci%C3%B3n_en_Arboles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees with **Categorical** Variables: Entropy and Information Gain

This notebook shows how to compute **entropy** and **information gain** for **categorical** (multi-branch) attributes.
It illustrates how the best attribute is selected for the split at the root and explores a **second level** within one branch.

**Key points**
- Entropy: $H(S) = -\sum_k p_k \log_2 p_k$, where $p_k$ is the fraction of examples in $S$ that belong to class $k$.
- For a categorical attribute $A$ with categories $v$, the information gain is:
  $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$, where $S_v$ is the subset of $S$ with $A = v$.
- We choose the attribute with the **highest information gain (IG)**.
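
As a quick sanity check on the entropy formula, the sketch below evaluates it for a few hand-picked distributions. The helper name `entropy_from_probs` is illustrative and is not part of the notebook's own code.

```python
from math import log2

def entropy_from_probs(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)

# A fair binary split is maximally uncertain: 1 bit.
print(entropy_from_probs([0.5, 0.5]))   # 1.0
# A pure node carries no uncertainty: 0 bits.
print(entropy_from_probs([1.0, 0.0]))   # 0.0
# Three equally likely classes give log2(3) bits, more than any binary split.
print(entropy_from_probs([1/3, 1/3, 1/3]))
```

Entropy is highest when the classes are evenly mixed and drops to zero when a node contains a single class, which is why a split that produces purer subsets yields a higher information gain.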


In [None]:
#@title Computing information gain
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import log2

def entropy(y):
    """Shannon entropy (in bits) of the class labels in y."""
    values, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    # np.unique only returns labels that occur, so the p > 0 guard is purely defensive.
    return -np.sum([p * np.log2(p) for p in probs if p > 0.0])

def info_gain_categorical(df, target_col, feature_col):
    """Return (information gain, parent entropy, per-category details) for one attribute."""
    parent_entropy = entropy(df[target_col].values)
    weighted_child_entropy = 0.0
    parts = {}
    # One branch per category of the attribute.
    for v, df_v in df.groupby(feature_col):
        Hv = entropy(df_v[target_col].values)
        w = len(df_v) / len(df)          # |S_v| / |S|
        weighted_child_entropy += w * Hv
        parts[v] = {'n': len(df_v), 'H': float(np.round(Hv, 4))}
        # Class distribution inside this branch.
        vals, cnts = np.unique(df_v[target_col].values, return_counts=True)
        parts[v]['dist'] = dict(zip(vals, cnts))
    ig = parent_entropy - weighted_child_entropy
    return ig, parent_entropy, parts

data = pd.DataFrame({
    'Outlook':  ['Sunny','Sunny','Overcast','Rain','Rain','Rain','Overcast','Sunny','Sunny','Rain','Sunny','Overcast','Overcast','Rain'],
    'Temp':     ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
    'Humidity': ['High','High','High','High','Normal','Normal','Normal','High','Normal','Normal','Normal','High','Normal','High'],
    'Wind':     ['Weak','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
    'Play':     ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
})
data.head(20)

Unnamed: 0,Outlook,Temp,Humidity,Wind,Play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
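9,Rain,Mild,Normal,Weak,Yes
10,Sunny,Mild,Normal,Strong,Yes
11,Overcast,Mild,High,Strong,Yes
12,Overcast,Hot,Normal,Weak,Yes
13,Rain,Mild,High,Strong,No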


# Step-by-Step Calculation of Information Gain in ID3  
### Attributes: *Outlook* and *Humidity* (“Play Tennis” dataset)

---

## 1️⃣ Entropy of the root set $S$

Let $S$ be the complete set (14 examples):

$$
\#\text{Yes}=9,\quad \#\text{No}=5
$$

Therefore:

$$
p(\text{Yes}) = \frac{9}{14}, \quad p(\text{No}) = \frac{5}{14}
$$

The entropy of the root set is computed as:

$$
\begin{aligned}
H(S)
&= -\sum_{c \in \{\text{Yes}, \text{No}\}} p(c)\log_2 p(c) \\
&= -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) \\
&\approx -\left(0.6429 \cdot (-0.6374) + 0.3571 \cdot (-1.4854)\right) \\
&\approx 0.9403\ \text{bits.}
\end{aligned}
$$

**Result:**  
$$
\boxed{H(S) = 0.9403}
$$

---

## 2️⃣ How much information gain does a split on **Outlook** yield?
Possible values: {Sunny, Overcast, Rain}

We split $S$ into three subsets and compute the entropy of each one.

### Outlook = Sunny
5 examples → (Yes = 2, No = 3)

$$
\begin{aligned}
H(S_{\text{Sunny}})
&= -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) \\
&= -\left(0.4 \cdot (-1.3219) + 0.6 \cdot (-0.73697)\right) \\
&= 0.97095
\end{aligned}
$$

### Outlook = Overcast
4 examples → (Yes = 4, No = 0)

$$
H(S_{\text{Overcast}}) = -\left(1\cdot\log_2 1 + 0\cdot\log_2 0\right) = 0
$$

Here $0\cdot\log_2 0$ is taken to be $0$ by convention, so a pure subset has zero entropy.

### Outlook = Rain
5 examples → (Yes = 3, No = 2)

$$
\begin{aligned}
H(S_{\text{Rain}})
&= -\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right) \\
&= 0.97095
\end{aligned}
$$

---

### Weighted entropy after the split

$$
\begin{aligned}
H_{\text{post}}(\text{Outlook})
&= \frac{5}{14}H(S_{\text{Sunny}}) + \frac{4}{14}H(S_{\text{Overcast}}) + \frac{5}{14}H(S_{\text{Rain}}) \\
&= \frac{5}{14}\cdot 0.97095 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.97095 \\
&= \frac{10}{14}\cdot 0.97095 \\
&= 0.69354
\end{aligned}
$$

---

### Information gain of *Outlook*

$$
\begin{aligned}
IG(S,\text{Outlook})
&= H(S) - H_{\text{post}}(\text{Outlook}) \\
&\approx 0.94029 - 0.69354 \\
&= 0.24675
\end{aligned}
$$

**Result:**  
$$
\boxed{IG(S, \text{Outlook}) = 0.24675}
$$
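
The hand calculation above can be double-checked from the class counts alone. This is a minimal sketch; the helper names `entropy_from_counts` and `info_gain_from_counts` are illustrative and do not appear in the notebook.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy in bits from raw class counts; empty classes are skipped."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def info_gain_from_counts(parent_counts, branch_counts):
    """IG = H(parent) - sum over branches of |S_v|/|S| * H(S_v)."""
    n = sum(parent_counts)
    weighted = sum(sum(b) / n * entropy_from_counts(b) for b in branch_counts)
    return entropy_from_counts(parent_counts) - weighted

# [Yes, No] counts for Sunny, Overcast and Rain, as listed above.
outlook_branches = [[2, 3], [4, 0], [3, 2]]
ig_outlook = info_gain_from_counts([9, 5], outlook_branches)

print(round(entropy_from_counts([9, 5]), 4))  # 0.9403
print(round(ig_outlook, 4))                   # 0.2467
```

Working from unrounded intermediate values avoids the small last-digit differences that appear when rounded entropies are subtracted by hand.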

---

## 3️⃣ Attribute **Humidity**  
Possible values: {High, Normal}

### Humidity = High
7 examples → (Yes = 3, No = 4)

$$
\begin{aligned}
H(S_{\text{High}})
&= -\left(\frac{3}{7}\log_2\frac{3}{7} + \frac{4}{7}\log_2\frac{4}{7}\right) \\
&= -\left(0.4286\cdot(-1.2224) + 0.5714\cdot(-0.8074)\right) \\
&= 0.98523
\end{aligned}
$$

### Humidity = Normal
7 examples → (Yes = 6, No = 1)

$$
\begin{aligned}
H(S_{\text{Normal}})
&= -\left(\frac{6}{7}\log_2\frac{6}{7} + \frac{1}{7}\log_2\frac{1}{7}\right) \\
&= -\left(0.8571\cdot(-0.2224) + 0.1429\cdot(-2.8074)\right) \\
&= 0.59167
\end{aligned}
$$

---

### Weighted entropy after the split

$$
\begin{aligned}
H_{\text{post}}(\text{Humidity})
&= \frac{7}{14}H(S_{\text{High}}) + \frac{7}{14}H(S_{\text{Normal}}) \\
&= 0.5\cdot 0.98523 + 0.5\cdot 0.59167 \\
&= 0.78845
\end{aligned}
$$

---

### Information gain of *Humidity*

$$
\begin{aligned}
IG(S,\text{Humidity})
&= H(S) - H_{\text{post}}(\text{Humidity}) \\
&\approx 0.94029 - 0.78845 \\
&= 0.15184
\end{aligned}
$$

**Result:**  
$$
\boxed{IG(S, \text{Humidity}) = 0.15184}
$$
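
The per-category counts used above (3/4 for High, 6/1 for Normal) can also be read off a contingency table. The sketch below uses `pd.crosstab` as one possible way to do this; the helper `row_entropy` is an illustrative name, and the two columns are copied from the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Humidity and Play columns copied from the notebook's dataset.
df = pd.DataFrame({
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Play':     ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

def row_entropy(counts):
    """Entropy in bits of one array of class counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Rows are attribute categories, columns are classes.
table = pd.crosstab(df['Humidity'], df['Play'])
branch_h = table.apply(lambda r: row_entropy(r.to_numpy()), axis=1)
weights = table.sum(axis=1) / len(df)

h_parent = row_entropy(table.sum(axis=0).to_numpy())
ig_humidity = h_parent - float((weights * branch_h).sum())

print(table)
print(round(ig_humidity, 4))  # 0.1518
```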

---



In [None]:
target = 'Play'
features = ['Outlook','Temp','Humidity','Wind']

rows = []
per_attr_details = {}
for feat in features:
    ig, H_parent, parts = info_gain_categorical(data, target, feat)
    rows.append({'Atributo': feat, 'GananciaInfo': float(np.round(ig,4))})
    per_attr_details[feat] = {'H_parent': float(np.round(H_parent,4)), 'parts': parts}

summary_ig = pd.DataFrame(rows).sort_values('GananciaInfo', ascending=False).reset_index(drop=True)
summary_ig

Unnamed: 0,Atributo,GananciaInfo
0,Outlook,0.2467
1,Humidity,0.1518
2,Wind,0.0481
3,Temp,0.0292


The same computation is repeated for every available attribute; the table above lists the resulting gains.

## ✅ Conclusion (first ID3 iteration)

| Attribute | Information Gain |
|------------|------------------------|
| Outlook    | **0.24675** |
| Humidity   | 0.15184 |

The **ID3** algorithm selects **Outlook** as the root attribute,  
since it has the **highest information gain** in this iteration. Wind (0.0481) and Temp (0.0292), computed by the code cell above, rank below both.


In [None]:
best_attr = summary_ig.iloc[0]['Atributo']
best_attr

'Outlook'

In [None]:
details = per_attr_details[best_attr]
rows_detail = []
for cat, info in details['parts'].items():
    row = {'Categoría': cat, 'n': info['n'], 'Entropía': info['H']}
    for k, v in info['dist'].items():
        row[f'count_{k}'] = v
    rows_detail.append(row)
pd.DataFrame(rows_detail)

Unnamed: 0,Categoría,n,Entropía,count_Yes,count_No
0,Overcast,4,-0.0,4,
1,Rain,5,0.971,3,2.0
2,Sunny,5,0.971,2,3.0


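Computing the entropy for each value of Outlook shows that the Overcast branch has entropy 0: all four of its examples are Yes, so this branch is already perfectly classified. The `-0.0` in the table is a floating-point negative zero, produced by negating a sum equal to 0.0, and is numerically equal to 0. The empty `count_No` cell for Overcast is a missing value, because that branch contains no No examples.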
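
If a clean `0.0` is preferred for pure branches, one option is to add `0.0` to the result, which turns an IEEE 754 negative zero into a positive one. This is only a sketch of a possible variant; `entropy_nonneg` is a hypothetical name, not a function defined in the notebook.

```python
import numpy as np

def entropy_nonneg(y):
    """Entropy in bits; adding 0.0 maps a negative zero to +0.0."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()      # every entry is > 0, so log2 is safe
    return float(-np.sum(probs * np.log2(probs)) + 0.0)

print(entropy_nonneg(['Yes', 'Yes', 'Yes', 'Yes']))  # 0.0
print(entropy_nonneg(['Yes', 'No']))                 # 1.0
```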


---

## Second Iteration of the ID3 Algorithm  
### Subset: **Outlook = Sunny**

---

### 1️⃣ Data in the *Sunny* branch

| Temp | Humidity | Wind | Play |
|------|-----------|------|------|
| Hot  | High      | Weak | No   |
| Hot  | High      | Strong | No |
| Mild | High      | Weak | No  |
| Cool | Normal    | Weak | Yes |
| Mild | Normal    | Strong | Yes |

- Total: 5 examples  
- $p(\text{Yes}) = 2/5 = 0.4$, $p(\text{No}) = 3/5 = 0.6$

$$
H(S_{\text{Sunny}}) = -0.4\log_2(0.4) - 0.6\log_2(0.6) = 0.97095
$$

---

### 2️⃣ Candidates: {Temp, Humidity, Wind}

---

#### 🔹 Attribute **Humidity**

Possible values: {High, Normal}

- **High:** (Yes = 0, No = 3) → $H = 0$  
- **Normal:** (Yes = 2, No = 0) → $H = 0$

$$
H_{\text{post}}(\text{Humidity}) = \frac{3}{5}\cdot 0 + \frac{2}{5}\cdot 0 = 0
$$

$$
IG(S_{\text{Sunny}}, \text{Humidity}) = 0.97095 - 0 = 0.97095
$$

---

#### 🔹 Attribute **Wind**

Possible values: {Weak, Strong}

- **Weak:** (Yes = 1, No = 2) → $H = 0.9183$  
- **Strong:** (Yes = 1, No = 1) → $H = 1$

$$
H_{\text{post}}(\text{Wind}) = \frac{3}{5}\cdot 0.9183 + \frac{2}{5}\cdot 1 = 0.95098
$$

$$
IG(S_{\text{Sunny}}, \text{Wind}) = 0.97095 - 0.95098 = 0.01997
$$

---

#### 🔹 Attribute **Temp**

Possible values: {Hot, Mild, Cool}

| Temp | Yes | No |
|------|-----|----|
| Hot  | 0 | 2 |
| Mild | 1 | 1 |
| Cool | 1 | 0 |

Entropies:
- Hot → 0  
- Mild → 1.0  
- Cool → 0

$$
H_{\text{post}}(\text{Temp}) = \frac{2}{5}\cdot 0 + \frac{2}{5}\cdot 1 + \frac{1}{5}\cdot 0 = 0.4
$$

$$
IG(S_{\text{Sunny}}, \text{Temp}) = 0.97095 - 0.4 = 0.57095
$$

---

### 3️⃣ Comparison of gains within *Sunny*

| Attribute | Information Gain |
|-----------|------------------------|
| **Humidity** | **0.97095** |
| Temp | 0.57095 |
| Wind | 0.01997 |
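
The gains inside the *Sunny* branch can be recomputed directly from its five rows. The sketch below is self-contained and uses illustrative helper names (`entropy_bits`, `gain`) rather than the notebook's functions.

```python
import numpy as np
import pandas as pd

# The five Sunny rows from the table above.
sunny = pd.DataFrame({
    'Temp':     ['Hot', 'Hot', 'Mild', 'Cool', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'Normal', 'Normal'],
    'Wind':     ['Weak', 'Strong', 'Weak', 'Weak', 'Strong'],
    'Play':     ['No', 'No', 'No', 'Yes', 'Yes'],
})

def entropy_bits(labels):
    """Entropy in bits of a Series of class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def gain(df, feature, target='Play'):
    """Information gain of splitting df on one categorical feature."""
    weighted = sum(len(g) / len(df) * entropy_bits(g[target])
                   for _, g in df.groupby(feature))
    return entropy_bits(df[target]) - weighted

gains = {f: round(gain(sunny, f), 4) for f in ['Temp', 'Humidity', 'Wind']}
print(gains)  # {'Temp': 0.571, 'Humidity': 0.971, 'Wind': 0.02}
```

Within this branch, Wind separates the classes poorly: its Weak subset holds one Yes and two No, and its Strong subset holds one of each.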

---

### ✅ Result of the second iteration

The **ID3** algorithm selects **Humidity** as the next node within the *Sunny* branch,  
since it achieves the **highest information gain (0.97)**. Both resulting subsets are pure, so this branch needs no further splits.

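The updated tree is:

- **Outlook**
  - *Sunny* → **Humidity**
    - *High* → No
    - *Normal* → Yes
  - *Overcast* → Yes
  - *Rain* → not yet expanded (entropy 0.97095)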
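
Repeating the same selection step in every impure branch yields the full tree. The recursive function below is a minimal sketch of that idea under simple assumptions (majority vote when attributes run out, no pruning, ties broken by attribute order); `id3`, `gain` and `entropy_bits` are illustrative names, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Same "Play Tennis" dataset as in the notebook.
data = pd.DataFrame({
    'Outlook':  ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temp':     ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind':     ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Play':     ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

def entropy_bits(labels):
    """Entropy in bits of a Series of class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def gain(df, feature, target):
    """Information gain of splitting df on one categorical feature."""
    weighted = sum(len(g) / len(df) * entropy_bits(g[target])
                   for _, g in df.groupby(feature))
    return entropy_bits(df[target]) - weighted

def id3(df, features, target):
    """Return a class label (leaf) or a nested dict {feature: {value: subtree}}."""
    labels = df[target]
    if labels.nunique() == 1 or not features:
        return labels.mode()[0]   # pure node, or majority vote when no features remain
    best = max(features, key=lambda f: gain(df, f, target))
    rest = [f for f in features if f != best]
    return {best: {v: id3(g, rest, target) for v, g in df.groupby(best)}}

tree = id3(data, ['Outlook', 'Temp', 'Humidity', 'Wind'], 'Play')
print(tree)
```

On this dataset the sketch places Outlook at the root, splits *Sunny* on Humidity and *Rain* on Wind, and leaves *Overcast* as a Yes leaf.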



In [None]:
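# Pick the branch to expand: the most frequent category of the root attribute.
# Outlook has a tie (Sunny and Rain both have 5 rows); mode() returns the tied
# values in sorted order, so [0] selects 'Rain', not the 'Sunny' branch worked by hand above.
mode_cat = data[best_attr].mode()[0]
sub_df = data[data[best_attr] == mode_cat]
rest_features = [f for f in features if f != best_attr]

# Information gain of each remaining attribute inside that branch.
rows_lvl2 = []
for feat in rest_features:
    ig2, H_parent2, parts2 = info_gain_categorical(sub_df, target, feat)
    rows_lvl2.append({'Atributo': feat, 'GananciaInfo': float(np.round(ig2,4))})

pd.DataFrame(rows_lvl2).sort_values('GananciaInfo', ascending=False).reset_index(drop=True)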

Unnamed: 0,Atributo,GananciaInfo
0,Wind,0.971
1,Temp,0.02
2,Humidity,0.02
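
The output above ranks Wind first because the cell expands the *Rain* branch: Sunny and Rain are tied at five rows each, and `Series.mode()` returns tied values in sorted order. The sketch below illustrates the tie and one way to select a branch by name instead; the variable `branch` is an illustrative choice.

```python
import pandas as pd

# Outlook column copied from the notebook's dataset.
outlook = pd.Series(['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                     'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'])

print(outlook.value_counts().to_dict())
print(outlook.mode().tolist())   # ['Rain', 'Sunny']

# Selecting the branch explicitly avoids depending on how the tie is broken.
branch = 'Sunny'
mask = outlook == branch
print(int(mask.sum()))           # 5
```

Filtering the full DataFrame with such a mask and passing the result to the gain computation would reproduce the hand-worked *Sunny* ranking.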
