In [2]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import metrics
import utils

# Introduction

We would like to quantify the information conveyed by the realisation of an event, subject to the following properties:
- likely events should have low information content, and events that are guaranteed to happen should have no information content at all.
- less likely events should have higher information content.
- independent events should have additive information.  

These 3 properties are satisfied by the self-information of an event $x$:  
$$I(x) = - \log P(x)$$

$I(x)$ is expressed in units of nats when $\log$ is the natural logarithm. One nat is the amount of information gained by observing an event of probability $1/e$.
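
As a quick check of the three properties above, here is a minimal sketch using NumPy. The helper name `self_information` and the probabilities `p_a` and `p_b` are illustrative choices, not part of the `metrics` or `utils` modules imported earlier.

```python
import numpy as np

def self_information(p):
    """Self-information -log(p) in nats, for a probability p in (0, 1]."""
    p = np.asarray(p, dtype=float)
    return -np.log(p)

# Hypothetical probabilities of two independent events A and B.
p_a, p_b = 0.5, 0.25

# A certain event carries no information.
print("I(certain event):", self_information(1.0))
# An event of probability 1/e carries one nat.
print("I(1/e):", self_information(np.exp(-1)))
# A less likely event carries more information.
print("I(A), I(B):", self_information(p_a), self_information(p_b))
# For independent events P(A and B) = P(A) P(B), so information adds up.
print("I(A and B):", self_information(p_a * p_b))
```

The last value should match the sum of the two individual values, because the logarithm turns the product of probabilities into a sum.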

## Shannon Entropy

We can quantify the amount of uncertainty in an entire probability distribution $P(x)$ using the Shannon entropy $H(P)$:
$$H(P) = E_{x \sim P}[I(x)] = - E_{x \sim P}[\log P(x)]$$

Distributions that are nearly deterministic have low entropy, and distributions that are closer to uniform have high entropy.  

If $x$ is a discrete random variable, we get:
$$H(P) = - \sum_{x_i} P(x_i) \log P(x_i)$$
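
The discrete formula can be sketched in a few lines of NumPy. The function name `entropy` and the three example distributions are assumptions made for illustration; the sketch uses the usual convention $0 \log 0 = 0$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a discrete distribution given as probabilities."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # skip zero-probability outcomes (convention: 0 * log 0 = 0)
    return -np.sum(p[nz] * np.log(p[nz]))

uniform = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])
deterministic = np.array([1.0, 0.0, 0.0, 0.0])

for name, dist in [("uniform", uniform), ("peaked", peaked), ("deterministic", deterministic)]:
    print(name, entropy(dist))
```

The uniform distribution over 4 outcomes reaches the maximum $\log 4$, the nearly deterministic one is much lower, and the deterministic one has zero entropy.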

If $x$ is a continuous random variable with density $f(x)$, we call $H(P)$ the differential entropy:
$$H(P) = - \int f(x) \log f(x) dx$$
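
As an illustration, the integral can be approximated on a grid for a Gaussian density and compared with the known closed form $\frac{1}{2} \log(2 \pi e \sigma^2)$. This is only a sketch: the value of `sigma`, the grid bounds and the simple Riemann sum are arbitrary choices.

```python
import numpy as np

sigma = 2.0  # standard deviation of a zero-mean Gaussian (illustrative value)

# Grid wide enough that the density is negligible at the edges.
x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Riemann-sum approximation of -∫ f(x) log f(x) dx.
h_numeric = -np.sum(f * np.log(f)) * dx
# Closed form for a Gaussian.
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print(h_numeric, h_closed)
```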

## Kullback-Leibler divergence

The KL divergence measures how much two probability distributions $P(x)$ and $Q(x)$ over the same probability space differ:

$$D_{KL}(P||Q) = E_{x \sim P} \left[ \log \frac{P(x)}{Q(x)} \right] = E_{x \sim P} [\log P(x) - \log Q(x)]$$

The KL divergence has several useful properties:
- it is always $\geq 0$.
- for discrete variables, it is $0$ if and only if $P$ and $Q$ are the same distribution.
- for continuous variables, it is $0$ if and only if $P$ and $Q$ are equal almost everywhere.  

It can be considered as some sort of distance between two distributions, but it is not a true distance metric because it is not symmetric: in general, $D_{KL}(P||Q) \neq D_{KL}(Q||P)$.

For a discrete probability space, the KL divergence is:
$$D_{KL}(P||Q) = \sum_{x_i} P(x_i) \log \frac{P(x_i)}{Q(x_i)}$$
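
A minimal sketch of the discrete formula follows, with an example of the non-negativity and asymmetry properties. The function name `kl_divergence` and the distributions `p` and `q` are illustrative; the sketch assumes $Q(x_i) > 0$ wherever $P(x_i) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in nats for discrete distributions on the same support.

    Assumes q > 0 wherever p > 0; terms with p == 0 contribute nothing.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.8, 0.1, 0.1])

print("D_KL(P||Q):", kl_divergence(p, q))
print("D_KL(Q||P):", kl_divergence(q, p))
print("D_KL(P||P):", kl_divergence(p, p))
```

The two directions give different positive values, and the divergence of a distribution from itself is zero.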

For a continuous probability space with densities $p(x)$ and $q(x)$, the KL divergence is:
$$D_{KL}(P||Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$
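
For two univariate Gaussians the integral has a well-known closed form, $\log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$, which makes a convenient sanity check. The sketch below compares it with a grid approximation; the helper `gaussian_log_pdf`, the parameter values and the grid are all illustrative assumptions.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a univariate Gaussian (hypothetical helper for this example)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

mu1, sigma1 = 0.0, 1.0  # parameters of p
mu2, sigma2 = 1.0, 2.0  # parameters of q

x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]
log_p = gaussian_log_pdf(x, mu1, sigma1)
log_q = gaussian_log_pdf(x, mu2, sigma2)

# Riemann-sum approximation of ∫ p(x) log(p(x) / q(x)) dx, computed from log-densities.
kl_numeric = np.sum(np.exp(log_p) * (log_p - log_q)) * dx
# Closed form for two univariate Gaussians.
kl_closed = np.log(sigma2 / sigma1) + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5

print(kl_numeric, kl_closed)
```

Working with log-densities avoids dividing two very small numbers in the tails of the grid.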

## Cross Entropy

The cross entropy for two probability distributions $P(X)$ and $Q(X)$ over the same probability space is defined as:

$$H(P,Q) = - E_{X \sim P} \log Q(X)$$

The cross entropy is closely related to the KL divergence:
$$H(P,Q) = H(P) + D_{KL}(P||Q)$$
Since $H(P)$ does not depend on $Q$, minimizing the cross entropy with respect to $Q$ is equivalent to minimizing the KL divergence.  

For a discrete probability space, the cross-entropy is:
$$H(P,Q) = - \sum_{x_i} P(x_i) \log Q(x_i)$$
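
The identity $H(P,Q) = H(P) + D_{KL}(P||Q)$ can be checked numerically in the discrete case. The sketch below is self-contained; the function names and the distributions `p` and `q` are illustrative, and it assumes $Q(x_i) > 0$ wherever $P(x_i) > 0$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats (convention: 0 * log 0 = 0)."""
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """D_KL(P||Q) in nats; assumes q > 0 wherever p > 0."""
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q > 0 wherever p > 0."""
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.8, 0.1, 0.1])

print("H(P,Q):           ", cross_entropy(p, q))
print("H(P) + D_KL(P||Q):", entropy(p) + kl_divergence(p, q))
print("H(P,P) vs H(P):   ", cross_entropy(p, p), entropy(p))
```

Taking $Q = P$ gives the smallest possible cross entropy, $H(P)$, which is the case where the KL term vanishes.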

For a continuous probability space with densities $p(x)$ and $q(x)$, the cross-entropy is:
$$H(P,Q) = - \int p(x) \log q(x) dx$$

## Mutual Information

The mutual information between two random variables is a measure of the mutual dependence between them. It measures how much knowing one of these variables reduces uncertainty about the other.  
If $X$ and $Y$ are independent, their mutual information is $0$. On the other hand, if $X$ is a deterministic function of $Y$ and vice versa, the mutual information is at its maximum.  

It is defined as:
$$I(X;Y) = E_{x,y \sim P(x,y)} \left[ \log \frac{P(x,y)}{P(x)P(y)} \right]$$

Mutual information can be expressed using entropy:
$$I(X;Y) = H(Y) - H(Y|X)$$

Mutual information is also related to the KL divergence:
$$I(X;Y) = D_{KL} (P(x,y)||P(x)P(y))$$
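
For discrete variables, these expressions can be compared on a small joint probability table. The sketch below computes $I(X;Y)$ as the KL divergence between the joint and the product of marginals, then checks it against $H(Y) - H(Y|X)$. All names and the example tables (rows index $x$, columns index $y$) are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability array (convention: 0 * log 0 = 0)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def mutual_information(joint):
    """I(X;Y) in nats from a 2D joint table, as D_KL(P(x,y) || P(x)P(y))."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of X, shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, n_y)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz]))

independent = np.outer([0.3, 0.7], [0.6, 0.4])  # P(x,y) = P(x)P(y)
copy = np.diag([0.3, 0.7])                      # Y = X
dependent = np.array([[0.3, 0.1],
                      [0.2, 0.4]])

# H(Y|X) = sum_x P(x) * H(Y | X = x), computed row by row.
px = dependent.sum(axis=1)
h_y = entropy(dependent.sum(axis=0))
h_y_given_x = sum(px[i] * entropy(dependent[i] / px[i]) for i in range(len(px)))

print("independent:", mutual_information(independent))
print("Y = X:      ", mutual_information(copy), "H(X) =", entropy([0.3, 0.7]))
print("dependent:  ", mutual_information(dependent), "H(Y) - H(Y|X) =", h_y - h_y_given_x)
```

When $Y = X$, the mutual information equals $H(X)$: knowing one variable removes all uncertainty about the other.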

Mutual information can be extended to the multivariate case:
$$I(X_1;\text{...};X_{n+1}) = I(X_1;\text{...};X_n) - I(X_1;\text{...};X_n|X_{n+1})$$
$$\text{where } I(X_1;\text{...};X_n|X_{n+1}) = E_{X_{n+1}}[I(X_1;\text{...};X_n)|X_{n+1}]$$
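
For $n = 2$ the recursion reads $I(X;Y;Z) = I(X;Y) - I(X;Y|Z)$. The sketch below evaluates it on a classic example, where $X$ and $Y$ are independent fair bits and $Z = X \oplus Y$, to show that the multivariate quantity can be negative. The function names and the `[x, y, z]` table layout are assumptions made for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats from a 2D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz]))

def conditional_mutual_information(joint_xyz):
    """I(X;Y|Z) = sum_z P(z) * I(X;Y | Z = z) for a joint table indexed [x, y, z]."""
    joint_xyz = np.asarray(joint_xyz, dtype=float)
    pz = joint_xyz.sum(axis=(0, 1))
    total = 0.0
    for z, p in enumerate(pz):
        if p > 0:
            # joint_xyz[:, :, z] / p is the conditional table P(x, y | Z = z).
            total += p * mutual_information(joint_xyz[:, :, z] / p)
    return total

# X and Y are independent fair bits, Z = X XOR Y.
joint = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        joint[x, y, x ^ y] = 0.25

i_xy = mutual_information(joint.sum(axis=2))        # marginalise Z out
i_xy_given_z = conditional_mutual_information(joint)
i_xyz = i_xy - i_xy_given_z

print("I(X;Y)   =", i_xy)
print("I(X;Y|Z) =", i_xy_given_z)
print("I(X;Y;Z) =", i_xyz)
```

Here $X$ and $Y$ share no information on their own, but once $Z$ is known each determines the other, so $I(X;Y|Z) = \log 2$ and $I(X;Y;Z) = -\log 2$.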