In [None]:
%%HTML
<!-- Improve display on a projector -->
<style>
.rendered_html {font-size: 1.2em; line-height: 150%;}
div.prompt {min-width: 0ex; padding: 0px;}
.container {width:95% !important;}
</style>

In [None]:
%matplotlib notebook
%autosave 0
import numpy as np
import matplotlib.pyplot as plt
import torch
#import pyro

import ipywidgets as widgets
from functools import partial
slider_layout = widgets.Layout(width='600px', height='20px')
slider_style = {'description_width': 'initial'}
IntSlider_nice = partial(widgets.IntSlider, style=slider_style, layout=slider_layout, continuous_update=False)
FloatSlider_nice = partial(widgets.FloatSlider, style=slider_style, layout=slider_layout, continuous_update=False)
SelSlider_nice = partial(widgets.SelectionSlider, style=slider_style, layout=slider_layout, continuous_update=False)

# 1. Probabilities

### Random/Stochastic Variable (RV)

A variable that maps the outcome of a random process to a value: *tossing a coin, rolling a die, predicting the weather*
- Denoted by a capital letter: $X$

We don't know its value until we draw/sample from it: We observe the RV
- Observations are denoted with lowercase letters: $x \sim X$

We describe an RV through its domain and its probability density/mass function

##### Example: Fair six-sided die

- Domain (possible outputs): $[1, 2, 3, 4, 5, 6]$
- Probability mass function (discrete uniform): $[\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}]$

##### Calisthenics:
- The probability of drawing a $1$ is $P(X=1) = P(1) = \frac{1}{6}$
- The probability of drawing a number greater than or equal to $5$ is $P(X\geq 5) = \frac{1}{3}$
- The probability of drawing an odd number is $P(\text{odd}) = \frac{1}{2}$
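
The three probabilities above can be reproduced by summing the pmf over the outcomes that belong to each event. This is a minimal sketch; the helper `prob` is a name introduced here for illustration and is not part of the original notebook.

```python
from fractions import Fraction

# Fair six-sided die: domain and discrete uniform pmf
domain = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in domain}

def prob(event):
    """P(event): sum of the pmf over the outcomes x for which event(x) is True."""
    return sum(pmf[x] for x in domain if event(x))

p_one = prob(lambda x: x == 1)        # P(X = 1)
p_geq_five = prob(lambda x: x >= 5)   # P(X >= 5)
p_odd = prob(lambda x: x % 2 == 1)    # P(odd)
print(p_one, p_geq_five, p_odd)       # 1/6 1/3 1/2
```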

### Joint, Marginal and Conditional probabilities

If we have two or more random variables we can define their joint pdf/pmf: $P(X,Y)$

Summing the joint over one variable gives the marginal of the other. This is the:

**Law of total probability (sum rule)**:

$$
\begin{align}
P(Y=y) &= \sum_{x \in \mathcal{X}} P(X=x, Y=y) \nonumber \\
&= \sum_{x \in \mathcal{X}} P(Y=y|X=x) P(X=x),
\end{align}
$$

where $P(Y=y|X=x)$ is the conditional probability of $y$ given $x$

$$
P(Y=y|X=x) = \frac{P(X=x, Y=y)}{P(X=x)}
$$

(defined only if $P(X=x) \neq 0$)

Rearranged as $P(X=x, Y=y) = P(Y=y|X=x) P(X=x)$, this is a special case of the

**Chain rule of probabilities (product rule)**:

For example with four variables:
$$
\begin{align}
P(x_1, x_2, x_3, x_4) &= P(x_4|x_3, x_2, x_1) P(x_3, x_2, x_1) \nonumber \\
&= P(x_4|x_3, x_2, x_1) P(x_3| x_2, x_1) P(x_2, x_1) \nonumber \\
&= P(x_4|x_3, x_2, x_1) P(x_3| x_2, x_1) P(x_2 |x_1) P(x_1) \nonumber \\
\end{align}
$$
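
The sum rule, the conditional probability and the product rule can be checked numerically on a small joint table. The sketch below assumes a hypothetical 2x2 joint pmf (the numbers are made up for illustration) with rows indexing $x$ and columns indexing $y$.

```python
import numpy as np

# Hypothetical joint pmf P(X, Y): rows index x, columns index y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Sum rule: marginals are obtained by summing out the other variable
p_x = joint.sum(axis=1)   # P(X)
p_y = joint.sum(axis=0)   # P(Y)

# Conditional probability P(Y|X) = P(X, Y) / P(X), one row per value of x
p_y_given_x = joint / p_x[:, None]

# Product rule: P(X, Y) = P(Y|X) P(X) recovers the joint
reconstructed = p_y_given_x * p_x[:, None]

# Law of total probability: P(Y) = sum_x P(Y|X=x) P(X=x)
p_y_total = (p_y_given_x * p_x[:, None]).sum(axis=0)
```

Each row of `p_y_given_x` sums to one, because it is a distribution over $y$ for a fixed $x$.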

### Bayes Theorem

Combining the product and sum rule for two random variables we can write


$$
P(y | x) = \frac{P(x|y) P(y)}{P(x)} = \frac{P(x|y) P(y)}{\sum_{y'\in\mathcal{Y}} P(x|y') P(y')}
$$

We call $P(y|x)$ the **posterior** distribution of $y$: 
> What we know of $y$ after we observe $x$ 

We call $P(y)$ the **prior** distribution of $y$
> What we know of $y$ before observing $x$
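
As a small worked example of Bayes' theorem, assume that $y$ selects one of two coins (a fair one and a biased one) and that $x=1$ means heads was observed. All numbers below are hypothetical and only serve to illustrate the formula.

```python
import numpy as np

# Hypothetical setup: y is one of two coins, x = 1 means "heads was observed"
prior = np.array([0.5, 0.5])        # P(y): fair coin, biased coin
likelihood = np.array([0.5, 0.9])   # P(x=1 | y) for each coin

# Denominator of Bayes' theorem, obtained with the sum rule
evidence = np.sum(likelihood * prior)       # P(x=1)

# Posterior P(y | x=1): what we know about the coin after seeing heads
posterior = likelihood * prior / evidence
print(posterior)
```

After observing heads, the biased coin becomes more probable than it was under the prior.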


In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8.5, 3.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.view_init(elev=45., azim=-45)
ax2 = fig.add_subplot(1, 2, 2)
x = np.arange(-4, 5, 1); y = np.arange(-4, 5, 1)
X, Y = np.meshgrid(x, y); XY = np.zeros_like(X)
XY[-3, 2:-2] = 1; XY[2, 2:-2] = 1; XY[2:-2, 4] = 1
XY = XY/np.sum(XY)

def update_plot(x_cond):
    ax.cla()
    ax.bar(x, np.sum(XY, axis=1), zdir='x', color='b', zs=-4)
    ax.bar(y, np.sum(XY, axis=0), zdir='y', color='r', zs=5)
    colors = np.array([['m']*len(x)]*len(y))
    colors[:, x_cond-5] = 'b'
    ax.bar3d(X.ravel(), Y.ravel(), np.zeros_like(XY.ravel()), 1, 1, XY.ravel(), color=colors.ravel())
    ax.set_xlim([-4, 5]); ax.set_xlabel('X')
    ax.set_ylim([-4, 5]); ax.set_ylabel('Y')
    ax2.cla()
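    joint_col = XY[X==x_cond]  # P(X=x_cond, Y=y) for every y
    p_x = np.sum(joint_col)    # marginal P(X=x_cond)
    # The conditional is undefined when P(X=x_cond) = 0, so nothing is drawn in that case
    if p_x > 0:
        ax2.bar(y, joint_col/p_x, color='b')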
    ax2.set_title("P(Y|X={0})".format(x_cond))
    ax2.set_ylim([0, 0.55])
    ax2.set_xlim([-4, 4])
    
widgets.interact(update_plot, x_cond=IntSlider_nice(min=-4, max=4, value=4));

**Independence**

Two RVs are independent if $P(y|x) = P(y)$, in which case the joint factorizes:
$$
\begin{align}
P(x, y)  &= P(x)P(y|x)\nonumber \\
&= P(x)P(y) \nonumber
\end{align}
$$

> Knowing that $x$ happened does not help me to know if $y$ happened
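
Independence can be tested on a joint table by comparing it with the product of its marginals. The sketch below uses two hypothetical joint pmfs that share the same marginals; `is_independent` is a helper name introduced here for illustration.

```python
import numpy as np

p_x = np.array([0.2, 0.8])
p_y = np.array([0.5, 0.3, 0.2])

# Joint built as a product of marginals: independent by construction
independent_joint = np.outer(p_x, p_y)

# Hypothetical joint with the same marginals, but where x constrains y
dependent_joint = np.array([[0.20, 0.00, 0.00],
                            [0.30, 0.30, 0.20]])

def is_independent(joint):
    """True if P(x, y) = P(x) P(y) for every cell of the joint table."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(px, py)))

print(is_independent(independent_joint), is_independent(dependent_joint))
```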

**Conditional independence**

Two RVs are conditionally independent given a third one if
$$
P(x, y|z)  = P(x|z)P(y|z)
$$


# The meaning of probability

**Meaning 1:** We observe the outcome of a random experiment (event) several times and we count

We flip a coin 5 times and get [x, x, o, x, o]

- The probability of x is 3/5
- The probability of o is 2/5

We have estimated the probability from the **frequency** of x and o

> This is called the **Frequentist** interpretation of probability

**Meaning 2:** Probability is the **degree of belief** of an event

Probabilities describe **assumptions** and also describe **inference given those assumptions**

> This is called the **Bayesian** interpretation of probability

#### What is the difference for us?

The main difference is in how inference is done
- Frequentist: Write the likelihood, get its maximum: **point estimates**
- Bayesian: Parameters have distributions too: **set priors, get posteriors**

# 2. Inference

> Drawing conclusions from facts/evidence through reasoning and scientific premises

In our case

> Find the least uncertain answer to a problem based on data and a model 

- Our scientific premises and assumptions go into the model
- The facts are the data

### Tasks in statistical inference

- Level 1: Fit a model to the data
- Level 2: Compare and validate between models
- Level 3: Answer questions with our model: **Hypothesis testing**

### Level 1: Maximum likelihood

We have a model $\mathcal{M}_i$ with parameters $\theta$

> We want to estimate the $\theta~$ that best fits the data $\mathcal{D}$

We start by writing Bayes Theorem
$$
p(\theta|\mathcal{D}, \mathcal{M}_i) = \frac{p(\mathcal{D}| \theta, \mathcal{M}_i) p(\theta|\mathcal{M}_i)}{p(\mathcal{D}|\mathcal{M}_i)}
$$

In the **Bayesian approach** we want to find the posterior of $\theta~$

But let's start with the following

- we only care for a point estimate of $\theta~$ 
- we assume that the prior distribution $p(\theta|\mathcal{M}_i)$ is uniform (uninformative) 

Then we can write

$$
\begin{align}
\hat \theta &= \text{arg} \max_\theta p(\theta|\mathcal{D}, \mathcal{M}_i) \nonumber \\
&= \text{arg} \max_\theta p(\mathcal{D}| \theta, \mathcal{M}_i) \nonumber
\end{align}
$$

> This is known as the **Maximum likelihood estimator (MLE)** of $\theta~$

This forms the basis of the **frequentist approach** to parameter estimation
- Propose a likelihood
- Get its arg maximum

and we can see it as a particular case of the Bayesian approach

#### Appendix: Bernoulli distribution

A distribution for binary outcomes $x\in \{0, 1\}$

The pmf is
$$
p(x|p) = \begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0  \end{cases}
$$

which can be written as 
$$
p(x|p) = p^x (1-p)^{1-x}
$$

In [None]:
import scipy.stats
fig, ax = plt.subplots(figsize=(6, 3))

@widgets.interact(p=FloatSlider_nice(min=0, max=1, value=0.5, step=0.1))
def update(p):
    x = scipy.stats.bernoulli.rvs(p, size=1000)
    ax.cla()
    ax.hist(x, density=True, range=(0, 1), bins=10)

### MLE for a coin

Observations from a coin 

$$
\mathcal{D} = [x_1, x_2, \ldots, x_N]
$$

where $x_i \in \{0, 1\}$

**Assumption 1:** Observations are **independent and identically distributed (iid)**

$$
p(\mathcal{D}|\theta, \mathcal{M}_i) = \prod_{i=1}^N p(x_i|\theta, \mathcal{M}_i)
$$

**Assumption 2:** Bernoulli model with parameter $\theta \in [0, 1]$ for the observations

$$
p(x_i|\theta, \mathcal{M}_i) = \theta^{x_i} (1- \theta)^{1-x_i}
$$


> What is the MLE of $\theta~$?

**Trick of the trade:** The arg maximum of $p(x)$ is the same as that of $\log p(x)$, because the logarithm is monotonically increasing

$$
\begin{align}
\hat \theta &= \text{arg} \max_\theta p(\theta|\mathcal{D}, \mathcal{M}_i) \nonumber \\
&= \text{arg} \max_\theta \log p(\theta|\mathcal{D}, \mathcal{M}_i) \nonumber \\
&= \text{arg} \max_\theta \log p(\mathcal{D}| \theta, \mathcal{M}_i) \nonumber \\
&= \text{arg} \max_\theta \sum_{i=1}^N \log p(x_i| \theta, \mathcal{M}_i) \nonumber \\
&= \text{arg} \max_\theta \sum_{i=1}^N \left[ x_i \log (\theta) + (1 -x_i) \log(1-\theta) \right] \nonumber 
\end{align}
$$

We can take the derivative with respect to $\theta$, set it to zero, and get the MLE 

$$
\hat \theta = \frac{1}{N} \sum_{i=1}^N x_i
$$
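
The closed-form result can be checked numerically by evaluating the Bernoulli log-likelihood on a grid of $\theta$ values and picking the largest one. This is a minimal sketch on simulated coin flips; the true bias of 0.7, the sample size and the grid resolution are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=200)   # simulated coin flips in {0, 1}

def log_likelihood(theta, x):
    """Bernoulli log-likelihood of iid observations x for a scalar theta."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Evaluate the log-likelihood on a grid that avoids log(0) at the endpoints
theta_grid = np.linspace(0.001, 0.999, 999)
ll = np.array([log_likelihood(t, x) for t in theta_grid])

theta_mle_grid = theta_grid[np.argmax(ll)]   # numerical arg max
theta_mle = np.mean(x)                       # closed form: sample mean
print(theta_mle_grid, theta_mle)
```

The two values agree up to the grid spacing, which is 0.001 here.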


## Priors and Maximum a Posteriori

Let's lift the assumption that the prior is uniform 

- We are still looking for a point estimate of $\theta~$ 
- We keep the *iid* assumption and we consider the "log trick"

We can write

$$
\begin{align}
\hat \theta &= \text{arg} \max_\theta \log \left[ p(\mathcal{D}|\theta, \mathcal{M}_i) \, p(\theta|\mathcal{M}_i) \right] \nonumber \\
&= \text{arg} \max_\theta \sum_{i=1}^N \log p(x_i| \theta, \mathcal{M}_i) + \log p(\theta|\mathcal{M}_i) \nonumber 
\end{align}
$$

> This is called the **Maximum a posteriori (MAP)** estimate of $\theta~ $

The MAP estimate corresponds to the mode of $p(\theta|\mathcal{D}, \mathcal{M}_i)$

#### In addition to the model (likelihood) we have to set the prior $p(\theta)$

The prior should be a [sensible choice](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) that encodes what we know about $\theta~$ before observing the data

#### Appendix: Beta distribution

A distribution for $x \in [0, 1]$, *e.g.* probabilities

The pdf is 

$$
\text{Beta}(x|\alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)}
$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and $\Gamma(x)$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function)

For $\alpha=\beta=1$ we get the Uniform distribution in $[0, 1]$

In [None]:
fig, ax = plt.subplots(figsize=(6, 3))

@widgets.interact(alpha=FloatSlider_nice(min=0.001, max=10, value=0.5, step=0.1), 
                  beta=FloatSlider_nice(min=0.001, max=10, value=0.5, step=0.1))
def update(alpha, beta):
    x = scipy.stats.beta.rvs(alpha, beta, size=1000)
    ax.cla()
    ax.hist(x, density=True, range=(0, 1), bins=10)

#### MAP for the coin


We will use a Beta prior for $\theta ~$

$$
p(\theta|\mathcal{M}_i) = \text{Beta}(\theta| \alpha, \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)}
$$

Omitting the terms that do not depend on $\theta~$ we get the MAP 
$$
\hat \theta= \text{arg} \max_\theta \sum_{i=1}^N \left[ x_i \log (\theta) + (1 -x_i) \log(1-\theta) \right] +(\alpha -1) \log(\theta) + ( \beta -1) \log(1-\theta) 
$$

By setting the derivative with respect to $\theta~$ to zero we obtain

$$
\hat \theta = \frac{\alpha -1 + \sum_{i=1}^N x_i}{N+\alpha + \beta - 2}
$$

Note that it reduces to the MLE for $\alpha=\beta=1$ (uniform)


> If we know something about the coin before observing the data we add it through $\alpha$ and $\beta$
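
A quick numerical check of the MAP estimate: setting the derivative of the log-likelihood plus the log of the Beta prior to zero gives $\hat\theta = (\alpha - 1 + \sum_i x_i)/(N + \alpha + \beta - 2)$, which should match a grid search over $\theta$. The sketch below assumes simulated flips and a hypothetical $\text{Beta}(5, 5)$ prior; `map_estimate` is a helper name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)   # few simulated flips, so the prior matters
alpha, beta = 5.0, 5.0              # hypothetical prior: the coin is probably fair
N, heads = len(x), np.sum(x)

def map_estimate(heads, N, alpha, beta):
    """Closed-form MAP of a Bernoulli parameter under a Beta(alpha, beta) prior."""
    return (alpha - 1 + heads) / (N + alpha + beta - 2)

def log_posterior_unnormalized(theta):
    """Log-likelihood plus log-prior, dropping terms that do not depend on theta."""
    log_lik = heads * np.log(theta) + (N - heads) * np.log(1 - theta)
    log_prior = (alpha - 1) * np.log(theta) + (beta - 1) * np.log(1 - theta)
    return log_lik + log_prior

theta_grid = np.linspace(0.001, 0.999, 999)
theta_map_grid = theta_grid[np.argmax(log_posterior_unnormalized(theta_grid))]
theta_map = map_estimate(heads, N, alpha, beta)
theta_mle = np.mean(x)
print(theta_mle, theta_map, theta_map_grid)
```

With this prior the MAP estimate lies between the prior mode (0.5) and the MLE, and `map_estimate(heads, N, 1, 1)` returns the MLE.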

## Bayesian inference

With MAP and MLE we get point estimates

> How good are these estimates? Can we trust them? What is their uncertainty?

For point estimates we usually answer these questions through confidence intervals, the bootstrap, or cross-validation

In a "full" Bayesian approach we select likelihood/prior and aim for the posterior of $\theta~$,

$$
p(\theta|\mathcal{D}, \mathcal{M}_i) = \frac{p(\mathcal{D}| \theta, \mathcal{M}_i) p(\theta|\mathcal{M}_i)}{p(\mathcal{D}|\mathcal{M}_i)}
$$

> If we have the posterior we know everything about $\theta~$

But, how do we get the posterior?

### Analytical posterior

In some "very special cases" the posterior is analytically tractable

Enter the [**conjugate priors**](https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions)

#### Posterior for the coin


The likelihood of the coin (Bernoulli) is

$$
\begin{align}
p(\mathcal{D}|\theta, \mathcal{M}_i) &= \prod_{i=1}^N p(x_i|\theta, \mathcal{M}_i) \nonumber \\
&= \prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i} \nonumber \\
&= \theta^{\sum_i x_i}(1-\theta)^{N-\sum_i x_i} \nonumber 
\end{align}
$$

The prior is Beta

$$
p(\theta|\mathcal{M}_i) = \text{Beta}(\theta| \alpha , \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)}
$$

The posterior is

$$
p(\theta|\mathcal{D}, \mathcal{M}_i) = \frac{1}{Z} \theta^{\alpha +\sum_i x_i - 1}(1-\theta)^{\beta +N-\sum_i x_i-1},
$$
where $Z$ is a normalizing constant

We recognize that the posterior is also Beta:

$$
p(\theta|\mathcal{D}, \mathcal{M}_i) = \text{Beta}(\theta| \hat \alpha , \hat \beta),
$$

with $\hat \alpha= \alpha +\sum_i x_i$ and $\hat \beta= \beta +N-\sum_i x_i$

> We say that Beta is conjugate to the Bernoulli distribution: It produces a Beta posterior
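
The conjugate update amounts to adding counts to the prior hyperparameters. The sketch below applies it to simulated flips and compares the mean of $\text{Beta}(\hat\alpha, \hat\beta)$, which is $\hat\alpha/(\hat\alpha+\hat\beta)$, with the mean of a posterior normalized numerically on a grid. The true bias, the prior hyperparameters and the grid are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=50)   # simulated flips
alpha, beta = 2.0, 2.0              # hypothetical prior hyperparameters
N, heads = len(x), int(np.sum(x))

# Conjugate update: add the observed counts to the prior hyperparameters
alpha_hat = alpha + heads
beta_hat = beta + N - heads
mean_closed_form = alpha_hat / (alpha_hat + beta_hat)   # mean of Beta(alpha_hat, beta_hat)

# Numerical check: normalize likelihood * prior on a grid of theta values
theta = np.linspace(0.001, 0.999, 999)
unnormalized = (theta**heads * (1 - theta)**(N - heads)
                * theta**(alpha - 1) * (1 - theta)**(beta - 1))
grid_posterior = unnormalized / np.sum(unnormalized)
mean_grid = np.sum(theta * grid_posterior)
print(mean_closed_form, mean_grid)
```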

#### Example: Influence of $\alpha$, $\beta$ and the number of observations from the coin

In [None]:
coins = scipy.stats.bernoulli.rvs(p=0.7, size=1000)

In [None]:
p_plot = np.linspace(0, 1, num=1000)
fig, ax = plt.subplots(figsize=(6, 3))

def update_plot(N, a, b):
    ax.cla()
    # Beta(a, b)
    prior = scipy.stats.beta(a, b)
    p = np.sum(coins[:N])
    # Bernoulli
    likelihood = (p_plot**p)*(1-p_plot)**(N-p)
    likelihood = likelihood*1000/np.sum(likelihood)
    # Beta(hat a, hat b)
    posterior = scipy.stats.beta(a + np.sum(coins[:N]), b + N - np.sum(coins[:N]))
    ax.plot(p_plot, prior.pdf(p_plot), label='prior')
    ax.plot(p_plot, likelihood, label='likelihood')
    ax.plot(p_plot, posterior.pdf(p_plot), label='posterior')
    ax.legend()  # attach the legend to this figure's axes, not to whichever figure is current
    
    
widgets.interact(update_plot, N=SelSlider_nice(options=[1, 2, 5, 10, 20, 50, 100, 200, 500]),
                 # Beta parameters must be strictly positive, so the sliders start at 0.1
                 a=FloatSlider_nice(min=0.1, max=10, value=1),
                 b=FloatSlider_nice(min=0.1, max=10, value=1));

In the Bayesian approach, online learning is just a sequence of posterior updates: the posterior after $k$ observations acts as the prior for observation $k+1$
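
A minimal sketch of this idea for the Beta-Bernoulli coin: updating the hyperparameters one observation at a time gives the same posterior as a single batch update. The simulated flips and the uniform starting prior are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=100)   # simulated stream of coin flips

# Online: start from a uniform prior Beta(1, 1) and update after every flip
a, b = 1.0, 1.0
for x_i in x:
    a, b = a + x_i, b + (1 - x_i)    # yesterday's posterior is today's prior

# Batch: use all the observations at once
a_batch = 1.0 + np.sum(x)
b_batch = 1.0 + len(x) - np.sum(x)
print((a, b), (a_batch, b_batch))
```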

In [None]:
from matplotlib import animation

fig, ax = plt.subplots(figsize=(6, 3))
a = b = 1
def update_plot(k):
    ax.cla()
    ax.plot(p_plot, scipy.stats.beta(a + np.sum(coins[:k]), 
                                     b + k - np.sum(coins[:k])).pdf(p_plot), label=str(k))
    ax.set_title(k)

anim = animation.FuncAnimation(fig, update_plot, frames=1000, interval=200, 
                               repeat=True, blit=False)

## Level 2 Inference

Note that we are missing the

**Evidence/Marginal likelihood:** The normalizing constant $p(\mathcal{D}|\mathcal{M}_i)$


# 3. Information Theory

> What is information? Can we measure it?

Information Theory is the mathematical study of the quantification and transmission of information, proposed by **Claude Shannon** in his seminal work: *A Mathematical Theory of Communication*, 1948

Shannon considered the output of a noisy source as a random variable $X$ taking $M$ possible values $\mathcal{A} = \{x_1, x_2, x_3, \ldots, x_M\}$

Each value $x_i$ has an associated probability $P(X=x_i) = p_i$

> What is the amount of information carried by $x_i$?

Shannon defined the amount of information as

$$
I(x_i) = \log_2 \frac{1}{p_i},
$$

which is measured in **bits**

> One bit is the amount of information needed to choose between two **equiprobable** states



#### Example: A meteorological station that sends tomorrow's weather prediction

The dictionary of messages: (1) Rainy, (2) Cloudy, (3) Partially cloudy, (4) Sunny

Their probabilities are: $p_1=1/2$, $p_2=1/4$, $p_3=1/8$, $p_4=1/8$

The minimum number of yes/no questions (equiprobable) needed to guess tomorrow's weather:

- Is it going to rain? 
- No: Is it going to be cloudy?
- No: Is it going to be sunny?

Amount of information:
- Rainy: $\log_2 \frac{1}{p_1} = \log_2 2 = 1$ bits
- Cloudy: $2$ bits 
- Partially cloudy and Sunny: $3$ bits

> The larger the probability of a message, the less information it carries

> Amount of information is also called surprise
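
The amounts of information of the four weather messages follow directly from $I(x_i) = \log_2 \frac{1}{p_i}$. A minimal sketch, where `information_bits` is a helper name introduced here:

```python
import math

# Message probabilities from the weather station example
probabilities = {"Rainy": 1/2, "Cloudy": 1/4, "Partially cloudy": 1/8, "Sunny": 1/8}

def information_bits(p):
    """Amount of information (surprise) in bits of an event with probability p."""
    return math.log2(1 / p)

info = {message: information_bits(p) for message, p in probabilities.items()}
for message, bits in info.items():
    print(f"{message}: {bits} bits")
```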

## Shannon's entropy

After defining the amount of information for a state, Shannon defined the average information of the source $X$ as

$$
H(X) = \mathbb{E}_{x\sim X}\left [\log_2 \frac{1}{P(x)} \right] = - \sum_{i=1}^M p_i \log_2 p_i  ~ \text{[bits]}
$$

and called it the **entropy** of the source

> Entropy is the "average information of the source"

#### Properties:
- Entropy is nonnegative: $H(X) \geq 0$
- Entropy is equal to zero when $p_j = 1 \wedge p_i = 0, i \neq j$
- Entropy is maximum when $X$ is uniformly distributed $p_i = \frac{1}{M}$, $H(X) = \log_2(M)$

> The more random the source is the larger its entropy
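
The properties above can be illustrated with a few discrete sources of $M=4$ symbols. This is a minimal sketch; `entropy_bits` is a helper name introduced here, and it uses the usual convention $0 \log 0 = 0$.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p (convention: 0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability symbols
    return float(-np.sum(p * np.log2(p)))

h_weather = entropy_bits([1/2, 1/4, 1/8, 1/8])   # the weather station: 1.75 bits
h_uniform = entropy_bits([1/4, 1/4, 1/4, 1/4])   # maximum: log2(4) = 2 bits
h_certain = entropy_bits([1, 0, 0, 0])           # deterministic source: 0 bits
```

The weather source sits between the two extremes: it is less random than the uniform source and more random than a source that always sends the same message.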

For continuous variables the differential entropy is defined as 

$$
H(p) = - \int p(x) \log p(x) \,dx ~ \text{[nats]}
$$

where $p(x)$ is the probability density function (pdf) of $X$

## Relative Entropy: Kullback Leibler divergence

Consider a continuous random variable $X$ and two distributions $q(x)$ and $p(x)$ defined on its probability space

The relative entropy between these distributions is 
$$
\begin{align}
D_{\text{KL}} \left [ p(x) || q(x) \right] &= \mathbb{E}_{x \sim p(x)} \left [ \log \frac{p(x)}{q(x)} \right ] \nonumber \\
&= \mathbb{E}_{x \sim p(x)} \left [ \log p(x) \right ]  - \mathbb{E}_{x \sim p(x)} \left [ \log q(x) \right ],  \nonumber \\
&= \int p(x) \log p(x) \,dx  - \int p(x) \log q(x) \,dx  \nonumber 
\end{align}
$$
which is also known as the Kullback-Leibler divergence

- The first term is the negative entropy of p(x)
- The second term, $-\int p(x) \log q(x) \,dx$, is called the **cross-entropy of q(x) relative to p(x)** 
    - Cross-entropy is the average information when samples from p(x) are described using q(x)

#### Interpretations of KL
- Coding: Expected number of "extra bits" needed to code samples from p(x) using a code optimal for q(x)
- Bayesian modeling: Amount of information lost when q(x) is used as a model for p(x)

#### Properties

- Non-negative
- Equal to zero only if $p(x) \equiv q(x)$
- Additive for independent distributions
- Related to Mutual Information: $\text{MI}(X, Y) = D_{\text{KL}} \left [ p(x, y) || p(x)p(y) \right]$


**Important:** 

KL divergence is asymmetric
$$
D_{\text{KL}} \left [ p(x) || q(x) \right] \neq D_{\text{KL}} \left [ q(x) || p(x) \right]
$$
- Not a proper distance (no triangle inequality either)
- Forward and Reverse KL have different meanings (we will explore them soon)
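
The asymmetry is easy to see numerically. The sketch below uses two hypothetical discrete distributions, so the integrals become sums, and it also checks that the KL divergence equals the cross-entropy minus the entropy. `kl_divergence` is a helper name introduced here, and the result is in nats because it uses the natural logarithm.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p || q] in nats for discrete distributions (terms with p = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical distributions over four symbols
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.7, 0.1, 0.1, 0.1])

forward = kl_divergence(p, q)   # D_KL[p || q]
reverse = kl_divergence(q, p)   # D_KL[q || p]

# Decomposition: KL = cross-entropy of q relative to p, minus entropy of p
entropy_p = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
print(forward, reverse, cross_entropy - entropy_p)
```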

# Generative models

Assume that we have $N$ continuous observations 
$$
(x_1, x_2, \ldots, x_N)
$$ 

These observations come from a certain distribution which we don't know 

$$
x_i \sim p^*(x)
$$

The goal of **generative modeling** is to learn a probabilistic model 

$$
p_\theta(x)
$$ 

with parameters $\theta$ that "mimics" $p^*(x)$, *i.e.*

> match  $p_\theta (x)$ to $p^*(x)$

We can express this mathematically by 
1. Select a parametric form for $p_\theta (x)$
1. Write the difference between $p_\theta (x)$  and $p^*(x)$
1. Minimize this difference as a function of $\theta$

> How do we compute the difference between probability distributions?


## KL divergence

One way to do this is through the Kullback-Leibler (KL) divergence

$$
\begin{align}
D_{\text{KL}} \left [ p^*(x) || p_\theta(x) \right] &= \mathbb{E}_{x \sim p^*(x)} \left [ \log \frac{p^*(x)}{p_\theta(x)} \right ] \nonumber \\
&= \mathbb{E}_{x \sim p^*(x)} \left [ \log p^*(x) \right ]  - \mathbb{E}_{x \sim p^*(x)} \left [ \log p_\theta(x) \right ],  \nonumber 
\end{align}
$$
where
$$
\mathbb{E}_{x\sim p(x)} [q(x) ] = \int p(x) q(x) \,dx
$$
is the expected value of $q(x)$ given that $x$ is sampled from $p(x)$

Note that the KL divergence is non-negative but **asymmetric** (not a proper distance)


**Problem:** We don't know $p^*(x)$, so we cannot evaluate $\mathbb{E}_{x \sim p^*(x)} \left [ \log p^*(x) \right ]$


## Relation with Maximum Likelihood

We want to minimize the KL divergence as a function of $\theta$

The term $\mathbb{E}_{x \sim p^*(x)} \left [ \log p^*(x) \right ]$ does not depend on $\theta$

So

$$
\text{arg} \min_\theta D_{\text{KL}} \left [ p^*(x) || p_\theta(x) \right] = \text{arg} \max_\theta\mathbb{E}_{x \sim p^*(x)} \left [ \log p_\theta(x) \right ]
$$

> Minimizing the KL divergence between the real distribution and the model $\equiv$ maximizing the log likelihood of the model given the data
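
In practice the expectation over $p^*(x)$ is replaced by an average over the observations, $\frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i)$, which is the log-likelihood divided by $N$. The sketch below assumes, for illustration only, that the data come from a Gaussian and that the model is $p_\theta(x) = \mathcal{N}(x|\mu, 1)$ with $\theta = \mu$; the grid search stands in for any optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=5000)   # stand-in for x_i ~ p*(x)

def avg_log_likelihood(mu, x):
    """Monte Carlo estimate of E_{x ~ p*}[log p_theta(x)] for a N(mu, 1) model."""
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# Maximize the average log-likelihood over a grid of candidate means
mu_grid = np.linspace(0.0, 4.0, 401)
avg_ll = np.array([avg_log_likelihood(mu, samples) for mu in mu_grid])
mu_best = mu_grid[np.argmax(avg_ll)]

print(mu_best, samples.mean())   # the maximizer matches the sample mean (the MLE)
```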

## Additional material

- Daniel Commenges, ["Information Theory and Statistics: an overview"](https://arxiv.org/pdf/1511.00860.pdf)