### Introduction to Information Theory
Information theory was introduced by Claude E. Shannon. Original paper: https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

#### The Basic Idea of Shannon Entropy
Entropy quantifies the average amount of information gained by observing the outcome of a random variable. A random variable is a function that assigns a numerical value to each possible outcome of a random experiment.

$
H(X) = -\sum_x (p(x) \cdot \log_2{p(x)})
$

Here $p(x)$ is the probability of outcome $x$. $\log_2$ is the usual choice because it expresses information in binary notation. The formula gives the average amount of information needed to represent the outcome of a random experiment. Because the logarithm is base 2, that amount is measured in bits, so the result is the average number of bits needed to represent the outcome.
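
The formula translates almost directly into code. The following is a minimal sketch, not a reference implementation: the function name `entropy_bits` and the die example are assumptions made for illustration, and outcomes with zero probability are skipped under the usual convention that $0 \cdot \log_2 0 = 0$.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing to the sum
    return float(-np.sum(p * np.log2(p)))

# Hypothetical example: a fair six-sided die
uniform_die = [1 / 6] * 6
print(entropy_bits(uniform_die))  # log2(6), roughly 2.585 bits
```

A uniform distribution over $n$ outcomes gives $\log_2 n$ bits, which is the largest value the entropy can take for $n$ outcomes.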

#### An Example
Given a fair coin with the probability of heads being $p = 0.5$ and the probability of tails being $p = 0.5$, you can calculate how many bits are needed to represent the outcome of the random experiment of flipping the coin once.

$
H(X) = -(0.5 \cdot \log_2{0.5} + 0.5 \cdot \log_2{0.5}) = -(0.5 \cdot (-1) + 0.5 \cdot (-1)) = 1 \text{ bit}
$
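
The same calculation can be repeated for coins that are not fair. This is a small sketch using only the standard library; the helper name `coin_entropy` and the biased probabilities are assumptions chosen for illustration.

```python
import math

def coin_entropy(p_heads):
    """Entropy in bits of a single coin flip with P(heads) = p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # an impossible outcome adds nothing to the sum
            total -= p * math.log2(p)
    return total

print(coin_entropy(0.5))  # fair coin: 1.0 bit
print(coin_entropy(0.9))  # biased coin: less than 1 bit
print(coin_entropy(1.0))  # coin that always lands heads: 0.0 bits
```

The fair coin is the most uncertain case. The more biased the coin, the more predictable the flip and the fewer bits its outcome carries on average.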

In [14]:
import math
import numpy as np

In [15]:
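# I(X) = -log_2(p): the bits of information we receive for an event X,
#    p being the probability of the event X happening.
# Here X is a bit string whose bits are assumed to be independent and
# uniformly random, so p = 1 / 2**len(X).
I = lambda X: -np.log2(1 / 2**len(X))
I("0010")  # bits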

np.float64(4.0)

In [18]:
# only captures the information of a single discrete event
def self_information(p):
    return -np.log2(p)
self_information(1 / 64)

np.float64(6.0)

#### Shannon Entropy
Any sensible measure of information should satisfy three properties:
1. The information we gain by observing a random variable does not depend on what we call the elements, or on the presence of additional elements that have a probability of zero.
2. The information we gain by observing two random variables is no more than the sum of the information we gain by observing them separately. If they are independent, then it is exactly the sum.
3. The information gained when observing (nearly) certain events is (nearly) zero.
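
These three properties can be checked numerically. The sketch below is only an illustration: the helper `entropy_bits`, the marginals `p_x` and `p_y`, and both joint tables are made-up assumptions, not values from the text.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits; accepts a 1-D distribution or a joint table."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Property 1: an extra outcome with probability zero changes nothing.
print(entropy_bits([0.5, 0.5]) == entropy_bits([0.5, 0.5, 0.0]))  # True

# Property 2: hypothetical marginals for two variables X and Y.
p_x = np.array([0.5, 0.5])
p_y = np.array([0.25, 0.75])
sum_of_parts = entropy_bits(p_x) + entropy_bits(p_y)

# Independent case: the joint is the outer product, and the entropies add exactly.
joint_independent = np.outer(p_x, p_y)
print(np.isclose(entropy_bits(joint_independent), sum_of_parts))  # True

# Dependent case with the same marginals: the joint entropy is smaller than the sum.
joint_dependent = np.array([[0.25, 0.25],
                            [0.00, 0.50]])
print(entropy_bits(joint_dependent) < sum_of_parts)  # True

# Property 3: a nearly certain outcome carries almost no information.
print(entropy_bits([0.999999, 0.000001]))  # very close to 0
```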

- $H(X) = -E_{x \sim p}[\log_2 p(x)]$
- if $x$ is discrete we sum; if it is continuous we integrate
- a log naturally converts a product of probabilities into a sum of the individual terms
- the log is negated because the log of any probability in $(0, 1]$ is at most zero, and we want a non-negative quantity
- the likelihood of an event and its information content should be monotonically and inversely related: rarer events should carry more information than common ones, since we learn more from observing a rare event (we already expect the common ones)
- the "surprise" grows without bound as the likelihood approaches 0
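
The points about the negative log can be seen with a few numbers. This is a minimal sketch; the function name `surprisal_bits` and the sample probabilities are assumptions picked for illustration.

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of a single event with probability p (0 < p <= 1)."""
    return -math.log2(p)

# Rarer events carry more information; the value grows without bound as p -> 0.
for p in (1.0, 0.5, 0.1, 0.01, 1e-6):
    print(f"p = {p:g}: {surprisal_bits(p):.2f} bits")

# For independent events the probabilities multiply, so the information adds.
p_a, p_b = 0.5, 0.25
print(surprisal_bits(p_a * p_b))                  # 3.0
print(surprisal_bits(p_a) + surprisal_bits(p_b))  # 3.0
```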

#### Entropy of a Slot Machine
- with symbols $s_1, \dots, s_k$ and probabilities $p_1, \dots, p_k$ for those symbols
$
H(S) = \sum_i p_i \cdot I(s_i) = -\sum_i p_i \cdot \log_2 p_i
$
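
As a worked instance of this formula, the sketch below computes the entropy of a single reel. The symbols and their probabilities in `reel` are hypothetical, chosen as powers of two so the arithmetic is easy to follow by hand, and base-2 logs are assumed so the result is in bits.

```python
import math

# Hypothetical reel: symbol -> probability (must sum to 1)
reel = {"cherry": 0.5, "bell": 0.25, "bar": 0.125, "seven": 0.125}

def self_information(p):
    """Bits of information gained when an outcome of probability p is observed."""
    return -math.log2(p)

# H(S) = sum_i p_i * I(s_i)
H = sum(p * self_information(p) for p in reel.values())

for symbol, p in reel.items():
    print(f"{symbol}: p = {p}, I = {self_information(p)} bits")
print(H)  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits
```

The rare symbols are individually more informative (3 bits each), but they contribute less to the average because they are weighted by their small probabilities.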