### Entropy

Entropy measures how `disordered` (impure) a collection of values is.  
The more evenly a feature's values are spread across its classes, the higher the entropy; a feature with only one value has an entropy of 0.  

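The probability distribution is the relative `frequency` of each unique value in the collection.  
The `logarithm` of the number of states turns out to be a natural measure of disorder: for $n$ equally likely states the entropy is $\log_2 n$ bits, and the formula below generalizes this to unequal probabilities.  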

$ H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) $
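
As a minimal sketch of the formula, the helper below computes the entropy of a list of probabilities using only the standard library. The function name `entropy` and the variables `uniform`, `p_mixed`, and `p_skewed` are illustrative and not part of the notebook. The sketch assumes the usual convention that a zero-probability outcome contributes nothing to the sum, since $\log_2 0$ is undefined.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities.

    Outcomes with probability 0 are skipped (convention: 0 * log2(0) = 0).
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Four equally likely states: entropy equals log2(4) = 2 bits
uniform = [1/4] * 4
print(entropy(uniform))

# A fairly even three-way split versus a skewed one with an empty class
p_mixed = [1/5, 2/5, 2/5]
p_skewed = [5/7, 2/7, 0/7]
print(entropy(p_mixed))   # roughly 1.52 bits
print(entropy(p_skewed))  # roughly 0.86 bits
```

The more even split has the higher entropy, and the uniform case reproduces the logarithm of the number of states.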

In [6]:
import pandas as pd
import numpy as np

# Set the initial training data
A = ['apple']*1 + ['orange']*2 + ['banana']*2
B = ['apple']*5 + ['orange']*2 + ['banana']*0

# Probability distribution (by hand)
P1 = [1/5, 2/5, 2/5] 
P2 = [5/7, 2/7, 0/7]

# Probability distribution (normalize)
A = pd.Series(A)
B = pd.Series(B)
P3 = A.value_counts(normalize=True)
P4 = B.value_counts(normalize=True)

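# Entropy (Shannon model), in bits.
# value_counts() only lists values that actually occur, so P4 has no
# zero entry for 'banana' and np.log2 never receives 0.
E3 = -1 * np.sum(P3 * np.log2(P3))
E4 = -1 * np.sum(P4 * np.log2(P4))
# A is more evenly mixed than B, so its entropy must be higher
assert E3 > E4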

# Output results
print("Datasets:")
print("A =", A.values)
print("B =", B.values, "\n")

print("Probability distributions (by hand):")
print(P1)
print(P2, "\n")

print("Probability distributions (pandas):")
print(P3, '\n')
print(P4, '\n')

print("Entropies:")
print(E3)
print(E4)

Datasets:
A = ['apple' 'orange' 'orange' 'banana' 'banana']
B = ['apple' 'apple' 'apple' 'apple' 'apple' 'orange' 'orange'] 

Probability distributions (by hand):


AttributeError: 'list' object has no attribute 'values'