# Entropy

https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8

Machine Learning
Entropy, as it relates to machine learning, is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides information that is random. For a coin that has no affinity for heads or tails, the outcome of any number of tosses is difficult to predict. Why? Because there is no relationship between flipping and the outcome. This is the essence of entropy.

Entropy is a measure of chaos in a system. Because it is much more dynamic than more rigid metrics like accuracy or even mean squared error, using flavors of entropy to optimize algorithms, from decision trees to deep neural networks, has been shown to increase speed and performance.
It appears throughout machine learning, from the construction of decision trees to the training of deep neural networks.
Entropy has roots in physics, where it is a measure of disorder, or unpredictability, in a system. For instance, consider two gases in a box: initially, the system has low entropy, in that the two gases are cleanly separable; after some time, however, the gases intermingle and the system's entropy increases. It is said that in an isolated system the entropy never decreases: the chaos never dims down without external force.

What is the entropy for a bucket with a ratio of four red balls to ten blue balls? Input your answer to at least three decimal places.
-(4/(4+10))*log2(4/(4+10)) - (10/(4+10))*log2(10/(4+10))
Entropy formula for m items of one class and n items of the other:
-(m/(m+n))*log2(m/(m+n)) - (n/(m+n))*log2(n/(m+n))

In [23]:
import numpy as np

def entropy(m, n):
    # Entropy in bits of a set with m items of one class and n of the other
    result = -(m/(m+n))*np.log2(m/(m+n))-(n/(m+n))*np.log2(n/(m+n))
    return result

print(entropy(m=1, n=2))

0.9182958340544896
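For the bucket question above, the same formula can be applied with m = 4 and n = 10. The following is a minimal sketch rather than the notebook's own code: `binary_entropy` is a hypothetical helper name, and the guard for zero counts is an added assumption (the usual convention that 0*log2(0) contributes 0), since the formula as written evaluates to nan when either count is 0.

```python
import math

def binary_entropy(m, n):
    """Entropy in bits of a set with m items of one class and n of the other."""
    total = m + n
    result = 0.0
    for count in (m, n):
        if count > 0:  # assumed convention: an empty class contributes 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(binary_entropy(4, 10), 3))  # four red balls, ten blue balls
print(binary_entropy(5, 0))             # a bucket with only one color
```

The first call prints roughly 0.863, and the single-color bucket gives 0.0, the minimum possible entropy.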


# Multi-class Entropy

For two classes:
p1 = m/(m+n)
p2 = n/(m+n)
entropy = -p1*log2(p1) - p2*log2(p2)
For multiple classes:
entropy = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn) = -∑_{i=1}^{n} pi*log2(pi)
 
The minimum value is still 0, when all elements are of the same value. The maximum value is still achieved when the outcome probabilities are the same, but the upper limit increases with the number of different outcomes. (For example, you can verify the maximum entropy is 2 if there are four different possibilities, each with probability 0.25.)
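The bounds described above can be checked numerically. This is a small sketch under the assumption that the distribution is given directly as a list of probabilities; `entropy_from_probs` is a hypothetical name, not something defined elsewhere in these notes.

```python
import math

def entropy_from_probs(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    # Terms with p == 0 are skipped (convention: 0 * log2(0) = 0).
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Equal probabilities give the maximum, which grows as log2(k).
for k in (2, 4, 8):
    uniform = [1 / k] * k
    print(k, entropy_from_probs(uniform), math.log2(k))

# All of the mass on one outcome gives the minimum.
print(entropy_from_probs([1.0, 0.0, 0.0, 0.0]))
```

With four equally likely outcomes the entropy is 2, matching the example in the text, and with a single certain outcome it is 0.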
 

If we have a bucket with eight red balls, three blue balls, and two yellow balls, what is the entropy of the set of balls? Input your answer to at least three decimal places.

In [22]:
import numpy as np

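def entropy(m, n, multi):
    # Total number of items across all classes
    r = m + n
    for i in multi:
        r += i
    # Contributions of the first two classes
    result = -(m/r)*np.log2(m/r)-(n/r)*np.log2(n/r)
    # Contributions of any additional classes
    for j in multi:
        result += -(j/r)*np.log2(j/r)
    return result


#multi = [2]
#print(entropy(m=8, n=3, multi=multi))
multi = [1]  # a list, not a set, so that repeated counts are kept
print(entropy(m=3, n=2, multi=multi))
# Alternatively, a single array could hold all of the class counts.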

1.4591479170272448
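The idea of passing all class counts in one sequence could look like the sketch below. It is an illustration only: `entropy_from_counts` is a hypothetical name, and it uses the standard library's `math` module instead of NumPy so that it runs on its own. A list is used rather than a set so that two classes with the same count are both kept.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a set described by its per-class item counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:  # skip empty classes (0 * log2(0) is taken as 0)
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy_from_counts([8, 3, 2]))  # eight red, three blue, two yellow
print(entropy_from_counts([3, 2, 1]))  # same counts as the cell above
```

For the bucket with eight red, three blue, and two yellow balls this gives roughly 1.335, and the second call reproduces the 1.459 printed above.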


# Information Gain

Information Gain Formula
Note that the child groups are weighted equally in this case since they're both the same size, for all splits. In general, the average entropy for the child groups will need to be a weighted average, based on the number of cases in each child group. That is, for m items in the first child group and n items in the second child group, the information gain is:

Information Gain = Entropy(Parent) - [ (m/(m+n)) * Entropy(Child 1) + (n/(m+n)) * Entropy(Child 2) ]
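The weighted average can be sketched in a few lines. The function names and the example counts here are hypothetical, chosen only to show a split whose children differ in size; each group is described by its per-class counts.

```python
import math

def entropy(counts):
    """Entropy in bits of a group given its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average of the child entropies."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical parent with 4 items of each class, split into a pure child
# of 2 items and a mixed child of 6 items (weights 2/8 and 6/8).
print(information_gain([4, 4], [[2, 0], [2, 4]]))

# A split that separates the classes perfectly recovers all of the parent entropy.
print(information_gain([4, 4], [[4, 0], [0, 4]]))
```

The unequal split gains about 0.311 bits, while the perfect split gains the full 1.0 bit of the parent.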

example:

Recommending Apps

| Gender | Occupation | App |
|--------|------------|-----|
| F      | Study      | 1   |
| F      | Work       | 2   |
| M      | Work       | 3   |
| F      | Work       | 2   |
| M      | Study      | 1   |
| M      | Study      | 1   |

1 = game
2 = whatsapp
3 = snapchat
Entropy of the parent (3 game, 2 whatsapp, 1 snapchat):
Entropy = -(3/6)log2(3/6) - (2/6)log2(2/6) - (1/6)log2(1/6)
= 1.46

Gender
F = entropy(1, 2) = 0.92 ** quantities (1 game, 2 whatsapp)
M = entropy(1, 2) = 0.92 ** quantities (1 snapchat, 2 game)
average = 0.92
information gain = 1.46 - 0.92 = 0.54

Occupation
Study = 0 ** quantities (3 game)
Work = 0.92 ** quantities (2 whatsapp, 1 snapchat)
average = 0.46
information gain = 1.46 - 0.46 = 1.00
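The hand calculation above can be reproduced programmatically. This is a sketch under a few assumptions: the table is re-entered as a list of tuples, the app numbers are replaced by their names for readability, and the helper names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter, defaultdict

# (gender, occupation, app), copied from the Recommending Apps table
rows = [
    ("F", "Study", "game"),
    ("F", "Work", "whatsapp"),
    ("M", "Work", "snapchat"),
    ("F", "Work", "whatsapp"),
    ("M", "Study", "game"),
    ("M", "Study", "game"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, column):
    """Gain from splitting on the given column index; the label is the last field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted

print("Gender:    ", round(information_gain(rows, 0), 2))
print("Occupation:", round(information_gain(rows, 1), 2))
```

Splitting on Occupation yields the larger gain (1.0 against 0.54), so a decision tree built on this criterion would split on Occupation first.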

# Quiz for Maximizing Information Gain

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

In [25]:
import pandas as pd
import numpy as np

df = pd.read_csv('information_gain/ml-bugs.csv', delimiter = ';')

print(df)
print("#### Mobug ###")
df_mobug = df.loc[df['Species'] == 'Mobug']
print(len(df_mobug))
print(df_mobug)
print("#### Lobug ###")
df_lobug = df.loc[df['Species'] == 'Lobug']
print(len(df_lobug))
print(df_lobug)
print("#### Mobug Color == Brown ###")
df_mobug_brown = df_mobug.loc[df_mobug['Color'] == 'Brown']
print(len(df_mobug_brown))
print(df_mobug_brown)
print("#### Lobug Color == Brown ###")
df_lobug_brown = df_lobug.loc[df_lobug['Color'] == 'Brown']
print(len(df_lobug_brown))
print(df_lobug_brown)
print("#### Mobug Color != Brown ###")
df_mobug_not_brown = df_mobug.loc[df_mobug['Color'] != 'Brown']
print(len(df_mobug_not_brown))
print(df_mobug_not_brown)
print("#### Lobug Color != Brown ###")
df_lobug_not_brown = df_lobug.loc[df_lobug['Color'] != 'Brown']
print(len(df_lobug_not_brown))
print(df_lobug_not_brown)
print("#### Mobug Color == Blue ###")
df_mobug_blue = df_mobug.loc[df_mobug['Color'] == 'Blue']
print(len(df_mobug_blue))
print(df_mobug_blue)
print("#### Lobug Color == Blue ###")
df_lobug_blue = df_lobug.loc[df_lobug['Color'] == 'Blue']
print(len(df_lobug_blue))
print(df_lobug_blue)
print("#### Mobug Color == Green ###")
df_mobug_green = df_mobug.loc[df_mobug['Color'] == 'Green']
print(len(df_mobug_green))
print(df_mobug_green)
print("#### Lobug Color == Green ###")
df_lobug_green = df_lobug.loc[df_lobug['Color'] == 'Green']
print(len(df_lobug_green))
print(df_lobug_green)
print("#### Mobug Length < 17 ###")
df_mobug_length_less_than_17 = df_mobug.loc[df_mobug['Length (mm)'] < 17]
print(len(df_mobug_length_less_than_17))
print(df_mobug_length_less_than_17)
print("#### Lobug Length < 17 ###")
df_lobug_length_less_than_17 = df_lobug.loc[df_lobug['Length (mm)'] < 17]
print(len(df_lobug_length_less_than_17))
print(df_lobug_length_less_than_17)
print("#### Mobug Length < 20 ###")
df_mobug_length_less_than_20 = df_mobug.loc[df_mobug['Length (mm)'] < 20]
print(len(df_mobug_length_less_than_20))
print(df_mobug_length_less_than_20)
print("#### Lobug Length < 20 ###")
df_lobug_length_less_than_20 = df_lobug.loc[df_lobug['Length (mm)'] < 20]
print(len(df_lobug_length_less_than_20))
print(df_lobug_length_less_than_20)


def two_group_ent(first, tot):
    # Entropy in bits of a two-class group: `first` items of one class out of `tot`
    return -(first/tot*np.log2(first/tot) +
             (tot-first)/tot*np.log2((tot-first)/tot))

tot_ent = two_group_ent(10, 24)  # 10 Mobugs out of 24 bugs in total
# Each split's child entropies are weighted by child size (weights sum to 1).
# Counts below are read off the data set printed by this cell.
# Length >= 17: 15 bugs (11 Lobug); Length < 17: 9 bugs (6 Mobug)
g17_ent = 15/24 * two_group_ent(11,15) + 9/24 * two_group_ent(6,9)
# Length < 20: 17 bugs (9 Mobug); Length >= 20: 7 bugs (1 Mobug)
g20_ent = 17/24 * two_group_ent(9,17) + 7/24 * two_group_ent(1,7)
# Not green: 16 bugs (8 Mobug); green: 8 bugs (2 Mobug)
color_green = 16/24 * two_group_ent(8,16) + 8/24 * two_group_ent(2,8)
# Not blue: 14 bugs (6 Mobug); blue: 10 bugs (4 Mobug)
color_blue = 14/24 * two_group_ent(6,14) + 10/24 * two_group_ent(4,10)
# Not brown: 18 bugs (6 Mobug); brown: 6 bugs (4 Mobug)
color_brown = 18/24 * two_group_ent(6,18) + 6/24 * two_group_ent(4,6)
answer = tot_ent - g17_ent
answer2 = tot_ent - g20_ent
answer3 = tot_ent - color_green
answer4 = tot_ent - color_blue
answer5 = tot_ent - color_brown

print(f"g17 {answer} ") #correct answer
print(f"g20 {answer2} ")
print(f"green {answer3} ")
print(f"blue {answer4} ")
print(f"brown {answer5} ")

   Species  Color  Length (mm)
0    Mobug  Brown         11.6
1    Mobug   Blue         16.3
2    Lobug   Blue         15.1
3    Lobug  Green         23.7
4    Lobug   Blue         18.4
5    Lobug  Brown         17.1
6    Mobug  Brown         15.7
7    Lobug  Green         18.6
8    Lobug   Blue         22.9
9    Lobug   Blue         21.0
10   Lobug   Blue         20.5
11   Mobug  Green         21.2
12   Mobug  Brown         13.8
13   Lobug   Blue         14.5
14   Lobug  Green         24.8
15   Mobug  Brown         18.2
16   Lobug  Green         17.9
17   Lobug  Green         22.7
18   Mobug  Green         19.9
19   Mobug   Blue         14.6
20   Mobug   Blue         19.2
21   Lobug  Brown         14.1
22   Lobug  Green         18.8
23   Mobug   Blue         13.1
#### Mobug ###
10
   Species  Color  Length (mm)
0    Mobug  Brown         11.6
1    Mobug   Blue         16.3
6    Mobug  Brown         15.7
11   Mobug  Green         21.2
12   Mobug  Brown         13.8
15   Mobug  Brown    

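Counting each child group by hand is error-prone, so the split comparison can also be computed directly from the rows. The sketch below is one possible approach, not the quiz's reference solution: the data is transcribed from the full table printed above so that the block runs without the CSV file, and the helper names are hypothetical.

```python
import math

# (species, color, length_mm), transcribed from the printed data set
bugs = [
    ("Mobug", "Brown", 11.6), ("Mobug", "Blue", 16.3), ("Lobug", "Blue", 15.1),
    ("Lobug", "Green", 23.7), ("Lobug", "Blue", 18.4), ("Lobug", "Brown", 17.1),
    ("Mobug", "Brown", 15.7), ("Lobug", "Green", 18.6), ("Lobug", "Blue", 22.9),
    ("Lobug", "Blue", 21.0), ("Lobug", "Blue", 20.5), ("Mobug", "Green", 21.2),
    ("Mobug", "Brown", 13.8), ("Lobug", "Blue", 14.5), ("Lobug", "Green", 24.8),
    ("Mobug", "Brown", 18.2), ("Lobug", "Green", 17.9), ("Lobug", "Green", 22.7),
    ("Mobug", "Green", 19.9), ("Mobug", "Blue", 14.6), ("Mobug", "Blue", 19.2),
    ("Lobug", "Brown", 14.1), ("Lobug", "Green", 18.8), ("Mobug", "Blue", 13.1),
]

def entropy(rows):
    """Entropy in bits of the Mobug/Lobug label over a list of rows."""
    if not rows:
        return 0.0
    p = sum(1 for species, _, _ in rows if species == "Mobug") / len(rows)
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(rows, test):
    """Gain from splitting rows into those that pass `test` and those that do not."""
    inside = [r for r in rows if test(r)]
    outside = [r for r in rows if not test(r)]
    weighted = (len(inside) * entropy(inside) + len(outside) * entropy(outside)) / len(rows)
    return entropy(rows) - weighted

splits = {
    "Color == Brown": lambda r: r[1] == "Brown",
    "Color == Blue": lambda r: r[1] == "Blue",
    "Color == Green": lambda r: r[1] == "Green",
    "Length < 17": lambda r: r[2] < 17,
    "Length < 20": lambda r: r[2] < 20,
}

gains = {name: information_gain(bugs, test) for name, test in splits.items()}
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {gain:.4f}")

best = max(gains, key=gains.get)
print("Best split:", best)
```

On this data the `Length < 17` criterion comes out on top with a gain of about 0.113, followed by `Length < 20`, while the color splits provide much less information.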
