### Probability

Probability is the `chance` that something happens.  
When you flip a fair `coin`, the probability that it lands on heads is 0.5 (a 50% chance).  

It's `like` asking, "What are the chances that this happens?"  
Probability is a number `between` 0 and 1, where 0 means "no way" and 1 means "definitely happening".  

$ P(\text{Heads}) = \frac{1}{2} = 0.5 $

In [1]:
# Coin Flip events
events = ['head', 'tail']

# Probability distribution (coin flip): each event is equally likely
print('Head =', 1/len(events))
print('Tail =', 1/len(events))

Head = 0.5
Tail = 0.5
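
The same idea can be written as "favorable outcomes divided by all outcomes", as long as every outcome is equally likely. The sketch below is only an illustration: the helper name `probability` and the made-up `'edge'` event are assumptions, not part of the notebook.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Probability of `event`, assuming every outcome in `sample_space` is equally likely."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))

coin = ['head', 'tail']

p_head = probability({'head'}, coin)            # one favorable outcome out of two
p_impossible = probability({'edge'}, coin)      # hypothetical event that is not in the sample space
p_certain = probability({'head', 'tail'}, coin) # every outcome is favorable

print(p_head, p_impossible, p_certain)  # 1/2 0 1
```

The three results cover the whole range: 0 for "no way", 1 for "definitely happening", and 1/2 in between.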


### Probability Distribution

Now, imagine you're not just flipping a coin but `rolling` a die.  
There are more `outcomes` (1 through 6), each with its own probability.  

A probability distribution is a `list` of all these probabilities.  
It's like a `map` of all the possible outcomes and how likely each one is.

The following set of probabilities is the probability distribution for a fair six-sided die. 

$ P(1) = \frac{1}{6} $ 
$ P(2) = \frac{1}{6} $
$ P(3) = \frac{1}{6} $
$ P(4) = \frac{1}{6} $
$ P(5) = \frac{1}{6} $
$ P(6) = \frac{1}{6} $
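
A minimal sketch of that distribution in code, assuming a fair die so that every face gets the same share. The names `faces`, `die`, and `p_even` are illustrative choices only.

```python
from fractions import Fraction

# Fair six-sided die: map every face to its probability
faces = [1, 2, 3, 4, 5, 6]
die = {face: Fraction(1, len(faces)) for face in faces}

# The probabilities of a distribution always add up to 1
total = sum(die.values())

# Probability of an event made of several outcomes, e.g. "roll an even number"
p_even = sum(p for face, p in die.items() if face % 2 == 0)

print(die[3], total, p_even)  # 1/6 1 1/2
```

Using `Fraction` keeps the values exact, so the check that everything sums to 1 does not depend on floating-point rounding.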


In [12]:
import pandas as pd
from icecream import ic

# Two bags of fruit, used as example datasets
A = ['apple']*1 + ['orange']*2 + ['banana']*2
B = ['apple']*5 + ['orange']*2 + ['banana']*0
ic(A, B)

# Probability distribution (by hand)
P1 = [{'apple': 1/5}, {'orange': 2/5}, {'banana': 2/5}] 
P2 = [{'apple': 5/7}, {'orange': 2/7}]
ic(P1, P2)

# With pandas
P1 = pd.Series(A).value_counts(normalize=True)
P2 = pd.Series(B).value_counts(normalize=True)
ic(P1, P2);

ic| A: ['apple', 'orange', 'orange', 'banana', 'banana']
    B: ['apple', 'apple', 'apple', 'apple', 'apple', 'orange', 'orange']
ic| P1: [{'apple': 0.2}, {'orange': 0.4}, {'banana': 0.4}]
    P2: [{'apple': 0.7142857142857143}, {'orange': 0.2857142857142857}]
ic| P1: orange    0.4
        banana    0.4
        apple     0.2
        dtype: float64
    P2: apple     0.714286
        orange    0.285714
        dtype: float64


### Entropy

Entropy is a measure of how `disordered` a collection is.  
The more `impure` (mixed) the feature's values are, the higher the entropy.  

The probability distribution is the relative `frequency` of each unique value.  
Entropy is built from the `logarithm` of those probabilities; for $n$ equally likely states it reduces to $\log_2 n$.  

$ H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) $  
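
As a worked example, plugging in the two fruit bags from the previous section (bag A: apple 0.2, orange 0.4, banana 0.4; bag B: apple 5/7, orange 2/7) gives approximately:

$ H(A) = -\left(0.2 \log_2 0.2 + 0.4 \log_2 0.4 + 0.4 \log_2 0.4\right) \approx 0.464 + 0.529 + 0.529 \approx 1.522 $  
$ H(B) = -\left(\tfrac{5}{7} \log_2 \tfrac{5}{7} + \tfrac{2}{7} \log_2 \tfrac{2}{7}\right) \approx 0.347 + 0.516 \approx 0.863 $  

Bag A is more evenly mixed, so its entropy is higher.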

In [14]:
import pandas as pd
import numpy as np
from icecream import ic

# Set the initial training data
A = ['apple']*1 + ['orange']*2 + ['banana']*2
B = ['apple']*5 + ['orange']*2 + ['banana']*0
ic(A, B)

# Probability
P1 = pd.Series(A).value_counts(normalize=True)
P2 = pd.Series(B).value_counts(normalize=True)
ic(P1, P2)

# Shannon entropy, in bits (log base 2)
P1 = P1.values
P2 = P2.values
H1 = -1 * np.sum(P1 * np.log2(P1))
H2 = -1 * np.sum(P2 * np.log2(P2))
ic(H1, H2);

assert H1 > H2

ic("A entropy > B entropy | There is more disorder in A than B")
ic("Assertion passed");

ic| A: ['apple', 'orange', 'orange', 'banana', 'banana']
    B: ['apple', 'apple', 'apple', 'apple', 'apple', 'orange', 'orange']
ic| P1: orange    0.4
        banana    0.4
        apple     0.2
        dtype: float64
    P2: apple     0.714286
        orange    0.285714
        dtype: float64
ic| H1: 1.5219280948873621, H2: 0.863120568566631
ic| 'A entropy > B entropy | There is more disorder in A than B'
ic| 'Assertion passed'
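
For comparison, here is a sketch of the same calculation with only the standard library. The function name `shannon_entropy` is an assumption made for this example. It also checks two boundary cases: a collection with a single value has no disorder, and a fair coin carries exactly one bit.

```python
from collections import Counter
from math import log2, isclose

def shannon_entropy(items):
    """Shannon entropy (in bits) of the relative frequencies of the values in `items`."""
    counts = Counter(items)
    total = len(items)
    # Counter only holds values that actually occur, so log2(0) never comes up
    return sum(-(c / total) * log2(c / total) for c in counts.values())

A = ['apple']*1 + ['orange']*2 + ['banana']*2
B = ['apple']*5 + ['orange']*2

H1 = shannon_entropy(A)
H2 = shannon_entropy(B)

assert H1 > H2                               # A is more mixed than B
assert isclose(H1, 1.5219280948873621)       # same value as the numpy version above
assert shannon_entropy(['apple']*4) == 0     # one unique value: no disorder
print(shannon_entropy(['head', 'tail']))     # 1.0
```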
