
# ABA Week 3, Day 1 Agenda

1. Current Event Discussion
2. Entropy and IG Notes
3. Entropy and IG Exercises

---


# ABA Tech Lesson 04

## Practice Exercises Part 1: Calculating Entropy and Information Gain

In this exercise notebook, we will:

1. Develop a strong understanding of Entropy and Information gain
2. Use Python to create functions to calculate entropy and information gain
---

# Entropy

Entropy is a measure of disorder. For a target attribute with *k* categories it ranges from 0 to $log_{2}(k)$, so from 0 to 1 for a two-class target. A value of 0 means every case in a group belongs to the same category, and larger values mean the group has a more even split of categories. As we add informative attributes, we hope they add information and therefore reduce entropy.

The formula for entropy is:

  >- $Entropy = -1 * (p_{1} * log_{2}(p_{1}) +  p_{2} * log_{2}(p_{2}) + ...)$

Where:
>- Each $p_{i}$ represents the probability (or relative percentage) of the *i*th category within the dataset.
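
As a quick check on the formula, the sketch below evaluates it directly from a list of probabilities. The helper name `entropy_from_probs` is an assumption made for illustration and is not one of the functions this lesson asks for.

```python
import numpy as np

def entropy_from_probs(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # by convention 0 * log2(0) counts as 0, so zero entries are dropped
    # -p * log2(p) is written as p * log2(1 / p), which is the same quantity
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_from_probs([1.0]))        # one category only -> 0.0
print(entropy_from_probs([0.5, 0.5]))   # even two-way split -> 1.0
print(entropy_from_probs([0.25] * 4))   # even four-way split -> 2.0
```

The last call shows that entropy can exceed 1 when there are more than two categories: the maximum for *k* equally likely categories is $log_{2}(k)$.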

---

## Calculating Entropy

To build intuition, we will first calculate entropy step by step on a small dataset. The same steps carry over when we create our functions for entropy and information gain.




### Example Data
Run the following code cell to generate sample data for the first few problems in this exercise. In this dataset "Credit Rating" will be the attribute and "Liability" will be the target variable.
>- Note: because this data will be randomly generated we will set a seed value

In [1]:
import random
import pandas as pd
random.seed(50)

liability_data = ["Normal", "High"]

random_data = []

for i in range(100):

  liability = random.choice(liability_data)

  if liability == "Normal":

    credit_rating = random.randint(600, 850)

  else:

    credit_rating = random.randint(300,700)


  random_data.append((credit_rating, liability))

random_data

prob1_df = pd.DataFrame(random_data, columns = ['Credit Rating', 'Liability'])

prob1_df.head()

| | Credit Rating | Liability |
|---|---|---|
| 0 | 436 | High |
| 1 | 626 | High |
| 2 | 777 | Normal |
| 3 | 692 | High |
| 4 | 343 | High |


## Problem 1: Calculating Parent Entropy

In the next code cells we will learn more about entropy by writing the Python code to calculate it. The steps outlined in the subparts of Problem 1 should help when creating an entropy function.

We will use the `prob1_df` sample data to perform the calculations for entropy.

Hint: use `numpy` and the `log2` function to calculate log base 2.

### Problem 1.1: Calculating the Probabilities for the Target Classes

In the next code cell:

1. Find the total number of cases
2. Get the number of cases for each target class
3. Calculate the probabilities (relative frequencies) for each target class

In [None]:
import numpy as np

print('total cases: ' + str(len(prob1_df)))
total = prob1_df.shape[0]

target_counts = prob1_df['Liability'].value_counts().astype(float).values
probs = target_counts/total
probs

total cases: 100


array([0.5, 0.5])
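
As an alternative to dividing the counts by the total, pandas can return the relative frequencies in one step through the `normalize=True` argument of `value_counts()`. The sketch below uses a small hypothetical column (`toy_df`) rather than `prob1_df`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy target column, used only to illustrate the call
toy_df = pd.DataFrame({"Liability": ["High", "Normal", "High", "High"]})

# normalize=True returns relative frequencies instead of raw counts
toy_probs = toy_df["Liability"].value_counts(normalize=True)
print(toy_probs)
```

The result is a Series indexed by class label, so `toy_probs["High"]` gives the share of "High" cases.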

### Problem 1.2: Calculate Entropy Components

We will define the entropy components as the expression, $p_{i} * log_{2}(p_{i})$ , where *i* represents each class.

Hint: We can use the `numpy` `log2()` function to help with this calculation.




In [None]:
entropy_components = probs * np.log2(probs)
entropy_components

array([-0.5, -0.5])
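
The components above are well defined because both probabilities are positive. If a probability array ever contains a zero, `np.log2(0)` evaluates to `-inf` and `0 * -inf` becomes `nan`, which would make the entropy `nan` as well. One possible guard, sketched here on a hypothetical array, is to compute the components only where the probability is positive and leave the rest at 0.

```python
import numpy as np

# Hypothetical probabilities where one listed class never occurs
probs_with_zero = np.array([0.5, 0.5, 0.0])

# Compute p * log2(p) only where p > 0; the other entries stay at 0
components = np.zeros_like(probs_with_zero)
positive = probs_with_zero > 0
components[positive] = probs_with_zero[positive] * np.log2(probs_with_zero[positive])

H_safe = -components.sum()
print(H_safe)  # 1.0, the same as for [0.5, 0.5]
```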

### Problem 1.3: Calculate Parent Entropy

Now that we have the entropy components, we can calculate entropy using the formula for entropy.

In the next cell, calculate parent entropy for the `prob1_df` data.
>- Parent entropy is commonly denoted with *H* so store your parent entropy value as *H*.

In [None]:
H = -1 * sum(entropy_components)
H

1.0

## Programming Problem 1: Entropy Function

Now that we have a basic understanding of how entropy is calculated, let's create a function that can be used on any target variable.

In the next code cell, create a function that will calculate entropy.

>- Use the function signature, `def entropy(df, target_column):` to get your function started.



In [None]:
def entropy(df, target_column):
    total = df.shape[0]
    target_counts = df[target_column].value_counts().astype(float).values
    probs = target_counts/total
    entropy_components = probs * np.log2(probs)
    H = -1 * sum(entropy_components)
    return H

In the next cell, call your `entropy()` function, passing in `prob1_df` as the data frame and "Liability" as the `target_column`. Verify this number matches what you calculated in Problem 1.3.

In [None]:
entropy(prob1_df, 'Liability')

1.0
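
A quick way to sanity-check an entropy function is to try it on nodes whose answer is known in advance. The sketch below uses a hypothetical variant, `entropy_from_series`, that takes the target column directly as a pandas Series. It is an illustration under that assumption, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

def entropy_from_series(target):
    """Entropy in bits of the class labels in a pandas Series."""
    probs = target.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]  # ignore classes with zero frequency
    # sum of p * log2(1 / p) equals -sum of p * log2(p)
    return float(np.sum(probs * np.log2(1.0 / probs)))

pure_node = pd.Series(["High"] * 4)                        # one class only
mixed_node = pd.Series(["High", "High", "High", "Normal"]) # 3-to-1 split

print(entropy_from_series(pure_node))            # 0.0
print(round(entropy_from_series(mixed_node), 4)) # 0.8113
```

A pure node has no disorder, and a 3-to-1 split falls between the pure case (0) and the even split (1).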

# Information Gain

Information Gain (IG) measures the change in entropy due to any amount of new information being added. As we add attributes we hope they add information.

The formula for IG is:

>- $IG(parent, children) = entropy(parent) - [p(c_{1}) * entropy(c_{1}) + p(c_{2}) * entropy(c_{2}) + ... ]$

Where each $p(c_{i})$ represents the proportion of the parent's cases that fall within child node $c_{i}$.
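
To make the formula concrete, the sketch below works through it with made-up class counts. The counts and the helper `entropy_from_counts` are assumptions for illustration only and do not come from `prob1_df`.

```python
import numpy as np

def entropy_from_counts(counts):
    """Entropy in bits from raw class counts; zero counts are ignored."""
    counts = np.asarray(counts, dtype=float)
    probs = counts[counts > 0] / counts.sum()
    return float(np.sum(probs * np.log2(1.0 / probs)))

# Hypothetical counts, ordered as [High, Normal]
parent = [5, 5]
child_1 = [4, 0]   # pure child: entropy 0
child_2 = [1, 5]   # mixed child

n = sum(parent)
weighted_child_entropy = (
    sum(child_1) / n * entropy_from_counts(child_1)
    + sum(child_2) / n * entropy_from_counts(child_2)
)
ig = entropy_from_counts(parent) - weighted_child_entropy
print(round(ig, 4))  # 0.61
```

Here the parent entropy is 1, the children's weighted entropy is about 0.39, and the difference of about 0.61 is the information gained by the split.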

---

## Problem 2: Calculating Information Gain

In the next code cells calculate information gain for our example data in the `prob1_df` we created earlier in the notebook. The markdown cells should help guide us through the process and then we will put it all together.

>- Initially, we will pick a best guess for a number on credit rating to split the data into two child nodes based on the credit rating.
>- After we split the data based on credit rating, we calculate entropy for each of the child nodes
>- Finally, we sum the weighted entropy of the child nodes and subtract this value from the parent entropy to get information gain
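
One detail worth watching when splitting on a threshold is what happens to rows that sit exactly on it. A possible approach, sketched on hypothetical toy data, is to build one boolean mask and use its complement for the other child, so that every row lands in exactly one node.

```python
import pandas as pd

# Hypothetical toy data; the rating 550 sits exactly on the threshold
toy_df = pd.DataFrame({
    "Credit Rating": [420, 550, 610, 700],
    "Liability": ["High", "High", "Normal", "Normal"],
})
threshold = 550

at_or_below = toy_df["Credit Rating"] <= threshold
left = toy_df[at_or_below]     # rows with rating <= threshold
right = toy_df[~at_or_below]   # every remaining row

# The two children together account for every row of the parent
print(len(left), len(right), len(left) + len(right) == len(toy_df))  # 2 2 True
```

If the two children were instead defined with `<` and `>`, the row with a rating of exactly 550 would be dropped from both.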

### Problem 2.1: Splitting the Data

In the next code cell, split the data based on credit rating using a `threshold` variable. Initially, we will set this value to 550 as our best guess at how to split this data to help us predict "Liability".

In [None]:
threshold = 550

# use <= and > so a rating exactly equal to the threshold still lands in one child node
df1 = prob1_df[prob1_df['Credit Rating'] <= threshold]
df2 = prob1_df[prob1_df['Credit Rating'] > threshold]
df1

| | Credit Rating | Liability |
|---|---|---|
| 0 | 436 | High |
| 4 | 343 | High |
| 5 | 414 | High |
| 9 | 463 | High |
| 17 | 302 | High |
| 19 | 522 | High |
| 21 | 438 | High |
| 22 | 364 | High |
| 27 | 518 | High |
| 29 | 347 | High |


### Problem 2.2: Calculate Entropy for Child Nodes

In the next code cell, calculate entropy for each child node.

In [None]:
H1 = entropy(df1, 'Liability')

In [None]:
H2 = entropy(df2, 'Liability')

### Problem 2.3: Calculate Entropy Given the Credit Rating

To get the entropy given the values of an attribute we calculate the weighted sum of child entropy.

In the next cell:
>- Calculate the proportion of data that exists in each child node relative to the total amount of data
>- Then use these proportions (weights) to get a weighted average entropy from the child nodes
>- The result is the entropy given the information we have about credit rating
>- Round this value to 4 decimal places

In [None]:
total1 = df1.shape[0]
p1 = total1/total
total2 = df2.shape[0]
p2 = total2/total

# keep the weighted child entropy under its own name so the parent entropy H is not overwritten
H_given_credit = H1*p1 + H2*p2
round(H_given_credit, 4)

0.5475

### Problem 2.4: Calculate Information Gain

In the next code cell, calculate the information gain we achieved by splitting "Credit Rating" at a value of 550.
>- Round this result to 4 decimals

In [None]:
# information gain = parent entropy - weighted child entropy
IG = round(H - H_given_credit, 4)
IG

0.4525

## Programming Problem 2: Information Gain Function

Now we should have a better understanding of the components and steps involved in calculating information gain. We need to take what we learned in Problem 2 and create a function for information gain that can be used on any data frame, attribute, target, and threshold.

In the next code cell, create a function for information gain with the signature: `def info_gain(df, info_column, target_column, threshold):` where:
>- `df` is the dataset of interest
>- `info_column` is the attribute we want to test to see how much information is gained
>- `target_column` is the labeled column we want to predict
>- `threshold` is a value to split the `info_column` on

In [None]:
def info_gain(df, info_column, target_column, threshold):

    # split the data based on the threshold
    # this allows us to test various thresholds of the attribute to see how much info is gained

    data_below_thresh = df[df[info_column] <= threshold]  # at or below the threshold
    data_above_thresh = df[df[info_column] > threshold]   # strictly above the threshold

    # get/calculate entropy from entropy function

    H = entropy(df, target_column) # Parent entropy

    entropy_below = entropy(data_below_thresh, target_column)
    entropy_above = entropy(data_above_thresh, target_column)

    # Get the weighted average
    # first we count the number of values below and above the threshold, and the total

    values_below = data_below_thresh.shape[0]
    values_above = data_above_thresh.shape[0]

    values_total = float(df.shape[0])

    # info gain = parent entropy - weighted average of the child entropies
    # (stored as gain so the local variable does not shadow the function name)
    gain = H - (((values_below / values_total) * entropy_below) + ((values_above / values_total) * entropy_above))

    return round(gain, 4)



### Problem 2.5: Calculate Information Gain with Function

Test your function by passing in the data from the `prob1_df` example data. Use the values from Problem 2 to set your parameters in the function.
>- Double check you get the same IG value as we found when doing Problem 2 in multiple code cells

In [None]:
info_gain(prob1_df, "Credit Rating", "Liability", 550)

0.4525
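
Because the threshold is a parameter, the same calculation can be repeated over many candidate thresholds to look for the most informative split. The sketch below is one possible way to do that, not part of the exercise. The names `entropy_bits`, `gain_at`, and `best_threshold` are hypothetical stand-ins, redefined here so the block runs on its own, and the toy data is made up. It assumes the attribute has at least two distinct values.

```python
import numpy as np
import pandas as pd

def entropy_bits(df, target_column):
    """Entropy in bits of target_column; an empty frame gives 0."""
    probs = df[target_column].value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1.0 / probs)))

def gain_at(df, info_column, target_column, threshold):
    """Information gain from splitting info_column at threshold (<= vs >)."""
    below = df[df[info_column] <= threshold]
    above = df[df[info_column] > threshold]
    n = len(df)
    weighted = (len(below) / n) * entropy_bits(below, target_column) \
        + (len(above) / n) * entropy_bits(above, target_column)
    return entropy_bits(df, target_column) - weighted

def best_threshold(df, info_column, target_column):
    """Try the midpoints between sorted distinct values; return (threshold, gain)."""
    values = np.sort(df[info_column].unique())
    candidates = (values[:-1] + values[1:]) / 2
    gains = [gain_at(df, info_column, target_column, t) for t in candidates]
    best = int(np.argmax(gains))
    return float(candidates[best]), float(gains[best])

# Hypothetical toy data that separates perfectly between 560 and 640
toy_df = pd.DataFrame({
    "Credit Rating": [400, 500, 560, 640, 720, 800],
    "Liability": ["High", "High", "High", "Normal", "Normal", "Normal"],
})

print(best_threshold(toy_df, "Credit Rating", "Liability"))  # (600.0, 1.0)
```

Using midpoints between neighboring values means neither child node is ever empty, and ties between equally good thresholds resolve to the lowest one because `np.argmax` returns the first maximum.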