# Information Gain
<img src="images/infrmationGain.png" width="550" alt="Information Gain Image">

Information Gain = Change in Entropy <br>
Information Gain = Entropy(parent) - [Entropy(child1) + Entropy(child2)] / 2 (for two children of equal size)

In these three examples:<br>
<table>
    <tr>
        <td width="300">
            <img src="images/infrmationGain2.png" width="300" height="350" alt="Information Gain Image">
        </td>
        <td width="300">
            <img src="images/infrmationGain3.png" width="300" height="350" alt="Information Gain Image">           
        </td>
        <td width="300">
            <img src="images/infrmationGain4.png" width="300" height="350" alt="Information Gain Image">           
        </td>
    </tr>
</table>

$$
\text{Information Gain} = 1 - 0.72 = 0.28
$$
$$
\text{Information Gain} = 1 - 1 = 0
$$
**We get no information from this split.**

$$
\text{Information Gain} = 1 - 0 = 1
$$
**This split separates the data perfectly.**
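
The three results above can be reproduced with a few lines of Python. This is a minimal sketch, not the notebook's own code: the helper names `entropy` and `simple_information_gain` and the child distributions are assumptions chosen so that the child entropies come out to roughly 0.72, 1, and 0, and both children are assumed to be the same size.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def simple_information_gain(parent, child1, child2):
    """Parent entropy minus the plain average of the two child entropies."""
    return entropy(parent) - (entropy(child1) + entropy(child2)) / 2


parent = [0.5, 0.5]  # two equally likely classes, entropy 1

# Hypothetical child distributions, one pair per case
partial = simple_information_gain(parent, [0.8, 0.2], [0.2, 0.8])
useless = simple_information_gain(parent, [0.5, 0.5], [0.5, 0.5])
perfect = simple_information_gain(parent, [1.0, 0.0], [0.0, 1.0])

print(round(partial, 2), round(useless, 2), round(perfect, 2))  # 0.28 0.0 1.0
```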


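In general, with $m$ samples in the first child and $n$ samples in the second, each child's entropy is weighted by its share of the samples:

$$\text{Information Gain} =  \text{Entropy(Parent)} - \left[ \frac{m}{m+n}\cdot \text{Entropy(Child}_1\text{)} + \frac{n}{m+n} \cdot \text{Entropy(Child}_2\text{)} \right] $$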
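
As a worked illustration with made-up numbers (not taken from the images): suppose a parent of 10 samples, split 5/5 between two classes (entropy 1), is divided into a pure child of 4 samples (entropy 0) and a child of 6 samples containing 1 and 5 samples of each class (entropy about 0.65). Plugging these into the weighted formula gives:

$$
\text{Information Gain} = 1 - \left[ \frac{4}{10}\cdot 0 + \frac{6}{10}\cdot 0.65 \right] = 1 - 0.39 = 0.61
$$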


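In [1]:
import numpy as np


def calculate_entropy(probabilities):
    """Return the Shannon entropy (in bits) of an array of class probabilities."""
    # Drop zero probabilities, since log2(0) is undefined
    probabilities = probabilities[np.nonzero(probabilities)]
    return -np.sum(probabilities * np.log2(probabilities))


def information_gain(parent_probs, left_probs, right_probs):
    """Return the information gain of splitting a parent node into two children."""
    # Calculate entropy for parent and child nodes
    entropy_parent = calculate_entropy(parent_probs)
    entropy_left = calculate_entropy(left_probs)
    entropy_right = calculate_entropy(right_probs)

    # Weights for the child nodes. Each probability array sums to about 1,
    # so both children get (nearly) equal weight. This matches the weighted
    # formula only when the children contain the same number of samples.
    weight_left = sum(left_probs)
    weight_right = sum(right_probs)
    weight_total = weight_left + weight_right

    # Weighted entropy of children
    weighted_entropy = (weight_left / weight_total) * entropy_left + (weight_right / weight_total) * entropy_right

    # Information Gain
    return entropy_parent - weighted_entropy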
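
When the children differ in size, the weights $\frac{m}{m+n}$ and $\frac{n}{m+n}$ in the formula have to come from sample counts rather than from probabilities. The following is a hedged sketch of one way to do that. The function names `entropy_from_counts` and `information_gain_from_counts` and the example counts are hypothetical and not part of the original notebook.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy in bits, computed from per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain_from_counts(parent_counts, left_counts, right_counts):
    """Information gain with each child weighted by its number of samples."""
    m, n = sum(left_counts), sum(right_counts)
    weighted_entropy = (
        (m / (m + n)) * entropy_from_counts(left_counts)
        + (n / (m + n)) * entropy_from_counts(right_counts)
    )
    return entropy_from_counts(parent_counts) - weighted_entropy


# Hypothetical split: 10 samples (5 per class) into children of 4 and 6 samples
ig = information_gain_from_counts([5, 5], [4, 0], [1, 5])
print(round(ig, 2))  # 0.61
```

With equal weights the same split would score about 0.67, so the choice of weights matters as soon as the children are unbalanced.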

In this example:<br>
<img src="images/infrmationGain4.png" width="300" height="350" alt="Information Gain Image">           
The entropy before any split is 1.4591479170272446.
If we split the data by gender, we get this:<br>
<img src="images/split_by_gender.png" width="200" height="300" alt="Information Gain Image">           
so the information gain is:

In [2]:
parent_probs = np.array([0.5, 0.3334, 0.16667])
left_probs = np.array([0.6667, 0.3334])
right_probs = np.array([0.3334, 0.6667])
ig = information_gain(parent_probs, left_probs, right_probs)
print(ig)  # about 0.5409

0.5408844827697632


But if we split by occupation instead:

In [3]:
left_probs = np.array([1])
right_probs = np.array([0.3334, 0.6667])
ig = information_gain(parent_probs, left_probs, right_probs)
print(ig)  # about 1

0.9999998853234293


So we split on occupation first, because it has the highest **IG**, and then continue in the same way for the remaining nodes:<br>
<img src="images/spliting_tree.png" width="600" height="350" alt="Information Gain Image">     
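
Choosing the feature with the highest information gain can be sketched as a small comparison loop. This is an illustration only: the helper names and the `candidate_splits` dictionary are hypothetical, the probabilities are the ones used in the cells above written as exact fractions, and each split is assumed to produce two equally sized children.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(parent, children):
    """children is a list of (weight, class_probabilities); weights sum to 1."""
    return entropy(parent) - sum(w * entropy(p) for w, p in children)


parent = [1 / 2, 1 / 3, 1 / 6]

# Assumption: every split yields two children of equal size (weight 0.5 each)
candidate_splits = {
    "gender": [(0.5, [2 / 3, 1 / 3]), (0.5, [1 / 3, 2 / 3])],
    "occupation": [(0.5, [1.0]), (0.5, [1 / 3, 2 / 3])],
}

gains = {name: information_gain(parent, children)
         for name, children in candidate_splits.items()}
best = max(gains, key=gains.get)  # feature with the highest information gain
print(best)  # occupation
```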

In [4]:
import pandas as pd

url = "https://s3.amazonaws.com/video.udacity-data.com/topher/2018/April/5ad940f6_ml-bugs/ml-bugs.csv"
data = pd.read_csv(url)

# Class probabilities of the parent node (the full dataset)
mobug_count = (data['Species'] == 'Mobug').sum()
lobug_count = (data['Species'] == 'Lobug').sum()
parent_probs = np.array([mobug_count / len(data), lobug_count / len(data)])

# Candidate subsets; the Color values in the CSV are capitalized
datablue = data[data['Color'] == 'Blue']
databrown = data[data['Color'] == 'Brown']
datagreen = data[data['Color'] == 'Green']
datasmaller17 = data[data['Length (mm)'] < 17]
datasmaller20 = data[data['Length (mm)'] < 20]
datasets = [datablue, databrown, datagreen, datasmaller17, datasmaller20]

# Species counts inside each candidate subset
datasets_mobug = [int((d['Species'] == 'Mobug').sum()) for d in datasets]
datasets_lobug = [int((d['Species'] == 'Lobug').sum()) for d in datasets]
print(datasets_mobug)
print(datasets_lobug)

# Class probabilities inside each subset
datasets_prop = []
for mobug_count, lobug_count in zip(datasets_mobug, datasets_lobug):
    total = mobug_count + lobug_count
    if total > 0:
        datasets_prop.append([mobug_count / total, lobug_count / total])
    else:
        # Empty subset: no probabilities to compute
        datasets_prop.append([0, 0])
print(datasets_prop)

print(datasmaller17)

[4, 4, 2, 6, 9]
[6, 2, 6, 3, 8]
[[0.4, 0.6], [0.6666666666666666, 0.3333333333333333], [0.25, 0.75], [0.6666666666666666, 0.3333333333333333], [0.5294117647058824, 0.47058823529411764]]
   Species  Color  Length (mm)
0    Mobug  Brown         11.6
1    Mobug   Blue         16.3
2    Lobug   Blue         15.1
6    Mobug  Brown         15.7
12   Mobug  Brown         13.8
13   Lobug   Blue         14.5
19   Mobug   Blue         14.6
21   Lobug  Brown         14.1
23   Mobug   Blue         13.1
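
To turn subsets like these into an information gain, each candidate condition has to be compared against its complement, weighted by the number of rows on each side. The sketch below does this with the standard library only, using the nine rows printed above as a small stand-in dataset so that it runs without downloading the CSV. The function name `split_information_gain` and the choice of `Color == 'Brown'` as the condition are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def split_information_gain(rows, condition):
    """Information gain of splitting rows into condition-true and condition-false."""
    labels = [species for species, _, _ in rows]
    inside = [s for s, color, length in rows if condition(color, length)]
    outside = [s for s, color, length in rows if not condition(color, length)]
    if not inside or not outside:
        return 0.0  # the condition does not actually split the data
    weighted = (len(inside) / len(rows)) * entropy(inside) \
        + (len(outside) / len(rows)) * entropy(outside)
    return entropy(labels) - weighted


# The nine rows with Length (mm) < 17 shown in the output above
rows = [
    ("Mobug", "Brown", 11.6), ("Mobug", "Blue", 16.3), ("Lobug", "Blue", 15.1),
    ("Mobug", "Brown", 15.7), ("Mobug", "Brown", 13.8), ("Lobug", "Blue", 14.5),
    ("Mobug", "Blue", 14.6), ("Lobug", "Brown", 14.1), ("Mobug", "Blue", 13.1),
]

ig_brown = split_information_gain(rows, lambda color, length: color == "Brown")
print(round(ig_brown, 3))  # 0.018
```

The same function can be applied to the full dataset by passing all of its rows and trying each candidate condition in turn.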
