In [None]:
%%HTML
<!-- Make fonts readable at 1024x768 -->
<style>
.rendered_html { font-size:0.7em; }
</style>

In [None]:
# Imports and matplotlib configuration
import numpy as np
import scipy.signal
%matplotlib notebook
import matplotlib.pylab as plt
from matplotlib import animation
from ipywidgets import interact, Button, Output, Box
from IPython.display import display
from style import *

# Others
from scipy.special import erf
gaussian_pdf = lambda x, mu=0, s=1: np.exp(-0.5*(x-mu)**2/s**2)/(s*np.sqrt(2*np.pi))
gaussian_cdf = lambda x, mu=0, s=1: 0.5 + 0.5*erf((x-mu)/(s*np.sqrt(2)))

## Universidad Austral de Chile

# INFO337 - Herramientas estadísticas para la investigación

# Statistical inference 

### A course of the master's in informatics program

### https://github.com/magister-informatica-uach/INFO337

***
<a id="index"></a>

# Contents
***

1. Statistical inference
1. [Maximum Likelihood Estimation](#MLE)
1. [Non-parametric modeling](#nonparametric)
1. [Bayesian approach on parametric modeling](#bayesian)
1. [Maximum a posteriori](#MAP)
1. [Appendix](#appendix)


### References

1. Hastie, Tibshirani and Friedman, "The elements of statistical learning" 2nd Ed., *Springer*, **Chapter 8**
1. Murphy, "Machine Learning: A Probabilistic Perspective", *MIT Press*, 2012, **Chapter 5**
1. Ivezic, Connolly, VanderPlas and Gray, "Statistics, Data Mining, and Machine Learning in Astronomy", *Princeton University Press*, 2014, **Chapters 4 and 5**

***

***
# Statistical inference
***

**Inference:** 

<center>*Draw conclusions from facts through a scientific premise*</center>

**Statistical inference**:
- Facts: Observed data
- Premise: Probabilistic model
- Conclusion: An unobserved quantity of interest
- Objective: Quantify the uncertainty of the conclusion given the data and the premise


Examples of statistical inference tasks:
- **Parameter estimation:** What is the best estimate of a model parameter based on the observed data?
- **Confidence estimation:** How trustworthy is our point estimate?
- **Hypothesis testing:** Is the data consistent with a given hypothesis or model?

***

***
## Parametric and nonparametric models

To conduct inference we start by defining a statistical model. Models can be broadly classified as:

- **Parametric models:** 
    - It corresponds to an analytical function  (distribution) with free parameters
    - Has an *a-priori* fixed number of parameters
    - In general: Stronger assumptions, easier to interpret, faster to use
    
    
- **Non-parametric models:** 
    - Distribution-free models, but they still have parameters and assumptions (e.g. dependence)
    - The number of parameters depends on the amount of training data
    - In general: More flexible, harder to train
    
**Statistical modeling how to's**
- How to collect the data?
- How to construct a probabilistic model?
- How to incorporate expert (*a priori*) knowledge?
- How to interpret results? How to make predictions from data?
***

***
## Frequentist and Bayesian inference

There are two paradigms or perspectives for statistical inference: Frequentist (F) or classical and Bayesian (B). 

There are conceptual differences between these paradigms, for example

**Definition of probability:**
- F: Relative frequency of an event. An objective property of the real world
- B: Degree of subjective belief. Probability statements can be made not only on data but also on parameters and models themselves

**Interpretation of parameters:**
- F: They are unknown and fixed constants
- B: They have distributions that quantify the uncertainty of our knowledge about them. We can compute expected values of the parameters

***

***
## Frequentist approach on parametric modeling 

**Parametric inference:** We assume that observations follow a distribution, *i.e.* observations are a realization of a random process (sampling) 

The conceptual (iterative) steps of parametric inference are:
1. **Model fitting:** Find parameters by fitting data to the current model
1. **Model proposition:** Propose a new model that accommodates important features of the data better than the previous one

In the frequentist approach Step 1 is typically solved using **Maximum Likelihood Estimation (MLE)**, Method of Moments (MoM) or the M-estimator. 
***

***
# The likelihood function
***
- The likelihood is a quantitative description of our experiment (measuring process)
- The likelihood is the starting point for **parametric modeling** in both F and B paradigms
- The likelihood tells us how good our model is with respect to the **observed data**


### Formally speaking

- We have an experiment that we model as a set of R.Vs $X_1, X_2, \ldots, X_N$
- We have observations/realizations from our R.Vs $\{x_i\} = x_1, x_2, \ldots, x_N$
- We assume that the R.Vs follow a particular probability distribution $x_i \sim f(x_i, \theta)$
- The distribution has (unknown) parameters $\theta$
- The likelihood is a function of the parameters which is defined from the joint distribution

$$
\begin{align}
L(\theta) &= P(X_1=x_1, X_2=x_2, \ldots, X_N=x_N) \nonumber \\
&= f(x_1, x_2, \ldots, x_N | \theta) \nonumber
\end{align}
$$
- Assuming that our observations are **independent and identically distributed** (iid)
$$
\begin{align}
L(\theta) &= f(x_1| \theta) \cdot f(x_2| \theta) \cdot \ldots \cdot f(x_N| \theta) \nonumber \\
&= \prod_{i=1}^N f(x_i| \theta) \nonumber
\end{align}
$$
<br>
<center>"Given $\{x_i\}$, How likely is it that it was generated by $L(\theta)=\prod_{i=1}^N f(x_i| \theta)$?"</center>
<center>"Given $\{x_i\}$, How likely is it that the unknown parameter was $\theta$?"</center>
 

***

### Note: Likelihood is not probability

- The likelihood of a single value is given by the true pdf 
- The likelihood of a set is not normalized to 1, *i.e.* in general the likelihood is not a valid pdf
- The likelihood by itself cannot be interpreted as a probability of $\theta$
- Given a fixed data set the likelihood is defined as a function of $\theta$

***

***

**Example: Likelihood of a single Gaussian dist. sample** 

If our observation (data point) $x$ was drawn from $\mathcal{N}(\mu, \sigma^2)$, *i.e.* $f(x|\theta) =  \mathcal{N}(\mu, \sigma^2)$ then the likelihood of $x$ is

$$
L(\theta=\{\mu, \sigma^2\}) = f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left ( -\frac{(x-\mu)^2}{2\sigma^2}\right)
$$


In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 3))
x = np.linspace(-10, 10, num=10000)
def update(xi, mu, sigma):
    p = gaussian_pdf(x, mu, sigma)
    ax.cla();
    ax.plot(x, p); ax.fill_between(x, 0, p, alpha=0.5)
    likelihood = gaussian_pdf(xi, mu, sigma)
    ax.plot([xi, xi], [0, likelihood], 'k--')
    ax.scatter(xi, likelihood, color='k', s=100, zorder=100)
    ax.set_xlim([-5, 5]); ax.set_ylim([0, np.amax(p)*1.1])
    ax.set_title(r"Likelihood of $\mu$ and $\sigma$ given $x$=%0.2f: %0.2e" %(xi, likelihood))

interact(update, 
         xi=FloatSlider_nice(description=r"$x$", min=-5, max=5, value=0.), 
         mu=FloatSlider_nice(description=r"$\mu$", min=-3, max=3, value=0.), 
         sigma=FloatSlider_nice(description=r"$\sigma$", min=0.1, max=2., value=1.));

***

**Example: Likelihood of N Gaussian dist. samples**

- Let's say we have N random numbers and assume they are Gaussian *iid*
- We can compute their likelihood using the formula above for a given set of parameters:

$$
L(\theta=\{\mu, \sigma^2\}) = f(\{x\} | \mu, \sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left ( -\frac{(x_i-\mu)^2}{2\sigma^2}\right)
$$

In [None]:
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable
mu_hat = np.linspace(-2.2, 2.2, num=200); s_hat = np.linspace(0.2, 2.2, num=200)
X, Y = np.meshgrid(mu_hat, s_hat)
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4));
cax = make_axes_locatable(ax).append_axes("right", size="5%", pad="2%")
def update(N, mu, sigma, seed):
    ax.cla(); logL = np.zeros(shape=X.shape);
    ax.set_xlabel(r"$\mu$"); ax.set_ylabel(r"$\sigma$")
    np.random.seed(seed); xhat = mu + sigma*np.random.randn(N)
    for i, mu_hat_ in enumerate(mu_hat):
        for j, s_hat_ in enumerate(s_hat):
            logL[j, i] = -0.5*len(xhat)*np.log(2.*np.pi*s_hat_**2) - 0.5*np.sum((xhat-mu_hat_)**2)/s_hat_**2
    levels = [k*np.amax(logL) for k in np.logspace(0, 0.5, num=20)]
    ax.scatter(mu, sigma, s=100, c='k', zorder=100)
    CS = ax.contour(X, Y, (logL), levels=levels[::-1], cmap=plt.cm.Blues, linewidths=3); 
    fig.colorbar(CS, cax=cax)
    
interact(update, 
         N=SelectionSlider_nice(options=[10, 100, 1000]),
         mu=FloatSlider_nice(description=r"$\mu$", min=-2, max=2, value=0.), 
         sigma=FloatSlider_nice(description=r"$\sigma$", min=0.5, max=2., value=1.),
         seed=IntSlider_nice(min=0, max=100));

- The value of the likelihood itself does not hold much meaning
- But it can be used to make comparisons between different parameter vectors/models
- **The larger the likelihood the better the model**

***

[&larr; Go back to the index](#index)

***

<a id="MLE"></a>

# Maximum Likelihood Estimation (MLE)

***
In parametric modeling we are interested in finding the $\theta$ that best fits our observations. 

One method to do this is **MLE**:

1. Select a distribution/model for the observations and formulate the likelihood $L(\theta)$
1. Search for the $\theta$ that maximizes $L(\theta)$ given the data
$$
\hat \theta = \text{arg} \max_\theta L(\theta),
$$
where the point estimate $\hat \theta$  is called the **maximum likelihood estimator** of $\theta$
1. Determine the confidence region of $\hat \theta$ either analytically or numerically (bootstrap, cross-validation)
1. Make conclusions about your model (hypothesis test)


**Important**: A wrong assumption in step 1 can ruin your inference. How to select a model?
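
Steps 1 and 2 can also be carried out numerically. The following is a minimal sketch, assuming a Gaussian model and simulated data; the names `neg_log_likelihood`, `mu_hat` and `sigma_hat` are illustrative, and BFGS is just one possible optimizer choice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=80.0, scale=2.0, size=500)  # simulated measurements

def neg_log_likelihood(params, x):
    # Step 1: Gaussian model. We optimize log(sigma) so that sigma stays positive
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5*len(x)*np.log(2*np.pi*sigma**2) + 0.5*np.sum((x - mu)**2)/sigma**2

# Step 2: maximizing L is the same as minimizing the negative log likelihood
x0 = np.array([np.median(x), 0.0])  # rough starting point
res = minimize(neg_log_likelihood, x0=x0, args=(x,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print("mu_hat=%0.4f, sigma_hat=%0.4f" % (mu_hat, sigma_hat))
```

For this model the numerical solution should agree with the sample mean and the (biased) sample standard deviation, which makes it a convenient sanity check before moving to models without closed-form solutions.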

***

***

### Example: MLE  for the mean of a Gaussian distribution

Let us:
- consider a set of N measurements $\{x_i\}_{i=1,\ldots, N}$ which corresponds to my weight :)
- assume that the instrument used to measure weight has an error that follows a Gaussian distribution with variance $\sigma^2$

**System interpretation:** The measurements can be viewed as noisy realizations of the true weight $\mu$

$$
x_i = \mu + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2),
$$

hence 

$$
f(x_i) = \mathcal{N}(x_i |\mu,\sigma^2) \quad \forall i
$$

The likelihood of the true weight $\mu$ given the measurements and the variance $\sigma^2$ is 
$$
L(\mu) = f(\{x_i\}| \mu, \sigma^2) = \prod_{i=1}^N f(x_i| \mu, \sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp  \left( -\frac{(x_i-\mu)^2}{2\sigma^2} \right)
$$

**Objective:** Find the value of $\mu$ that maximizes the likelihood given $\{x_i\}$

***

**Trick of the trade: The log likelihood** 
- In many cases (exponential family) it is more practical to find the maximum of the logarithm of the likelihood
- The logarithm is a monotonically increasing function, so $\log L(\theta)$ attains its maximum at the same $\theta$ as $L(\theta)$
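
There is also a numerical reason to prefer the log likelihood. A small sketch on simulated data (the helper `gaussian_logpdf` is illustrative): the product of many densities underflows in floating point, while the sum of log densities stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)  # simulated iid data

def gaussian_logpdf(x, mu, sigma):
    return -0.5*np.log(2*np.pi*sigma**2) - 0.5*(x - mu)**2/sigma**2

# Direct product of 2000 densities: underflows in float64
L_direct = np.prod(np.exp(gaussian_logpdf(x, 0.0, 1.0)))
# Sum of log densities: finite and comparable across parameter values
logL_true = np.sum(gaussian_logpdf(x, 0.0, 1.0))
logL_off = np.sum(gaussian_logpdf(x, 0.5, 1.0))
print(L_direct, logL_true, logL_off)
```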
***

In this case the log likelihood is
$$
\begin{align}
\log L (\mu) &= \log \prod_{i=1}^N f(x_i|\mu, \sigma^2) \nonumber \\
&= \sum_{i=1}^N \log f(x_i|\mu, \sigma^2) \nonumber \\
&= - \frac{1}{2} \sum_{i=1}^N \log 2\pi\sigma^2 - \frac{1}{2} \sum_{i=1}^N  \frac{(x_i-\mu)^2}{\sigma^2}  \nonumber  \\
&=  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 \nonumber 
\end{align}
$$
We maximize by making the derivative of the log likelihood equal to zero

$$
\frac{d  \log L (\mu)}{d\mu} =  \frac{1}{\sigma^{2}}  \sum_{i=1}^N (x_i-\mu) =0
$$

Finally the MLE of $\mu$ is 
$$
\hat \mu = \frac{1}{N} \sum_{i=1}^N x_i, \quad \sigma >0
$$

***

### Example: MLE for the variance of a Gaussian dist.

The MLE estimator of the variance can be obtained using the same procedure:

$$
\log L (\mu, \sigma^2) =  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 
$$

$$
\frac{d  \log L (\mu, \sigma^2)}{d\sigma^2} =  - \frac{N}{2} \frac{1}{\sigma^2} + \frac{1}{2\sigma^{4}}\sum_{i=1}^N (x_i-\mu)^2 =0
$$

$$
\hat \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i- \hat\mu)^2
$$

- If the true mean is not known then this is a biased estimator of the true variance
- MLE can produce biased estimators

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(8, 4)); ax2 = ax.twinx()
np.random.seed(0)
x = 80 + np.random.randn(10000)
#x = 80 + 2*np.random.rand(1000)  # What happens if the data is not normal
k_list = [int(k) for k in np.logspace(0, 4, num=50)]
hat_mu = np.array([np.sum(x[:k])/k for k in k_list])
hat_var = np.array([np.sum((x[:k]-hat_mu[i])**2)/(k) for i, k in enumerate(k_list)])
ax.plot(k_list, hat_mu); ax2.plot(k_list, hat_var, linestyle='--'); ax.set_xscale('log')
ax.set_xlabel('Number of samples'); 
ax.set_ylabel(r'$\hat \mu$ (solid line)'); ax2.set_ylabel(r'$\hat \sigma^2$ (dashed line)');

***

### Extra: Biased and unbiased estimator

For a parameter $\theta$ and an estimator $\hat \theta$, if
$$
\mathbb{E}[\hat \theta] = \theta,
$$
then $\hat \theta$ is an unbiased estimator of $\theta$

***

### Example: MLE of the mean of a Gaussian dist.

Is the MLE of the mean of $x\sim N(\mu, \sigma^2)$ unbiased?
$$
\begin{align}
\mathbb{E}[\hat \mu] &= \mathbb{E} \left[ \frac{1}{N} \sum_{i=1}^N x_i \right]  \nonumber \\
&= \frac{1}{N} \sum_{i=1}^N \mathbb{E}[x_i] = \frac{1}{N} \sum_{i=1}^N \mu = \mu  \nonumber
\end{align}
$$
YES

***
### Example: MLE of the variance of a Gaussian dist.

First let's expand the expression of the MLE of the variance
$$
\begin{align}
\hat \sigma^2 &= \frac{1}{N} \sum_{i=1}^N \left(x_i- \frac{1}{N}\sum_{j=1}^N x_j \right)^2 \nonumber \\
&= \frac{1}{N} \sum_{i=1}^N x_i^2 - \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N x_i  x_j \nonumber \\
&= \frac{1}{N} \sum_{i=1}^N x_i^2 - \frac{1}{N^2} \sum_{i=1}^N \sum_{j\neq i} x_i x_j - \frac{1}{N^2} \sum_{i=1}^N x_i^2 \nonumber  \\
&= \frac{N-1}{N^2} \sum_{i=1}^N x_i^2 - \frac{1}{N^2} \sum_{i=1}^N \sum_{j \neq i} x_i x_j  \nonumber
\end{align}
$$
Then applying the expected value operator we get
$$
\begin{align}
\mathbb{E}[\hat \sigma^2] &= \frac{N-1}{N^2} \sum_{i=1}^N \mathbb{E} [x_i^2] - \frac{1}{N^2} \sum_{i=1}^N \sum_{j \neq i} \mathbb{E} [x_i] \mathbb{E} [x_j] \nonumber  \\
&= \frac{N-1}{N} (\sigma^2 + \mu^2) - \frac{N-1}{N} \mu^2 \nonumber \\
&= \frac{N-1}{N} \sigma^2 \nonumber 
\end{align}
$$
The MLE estimator is biased! 


**Unbiased estimator:** If we multiply it by a constant we obtain the well known unbiased estimator of the variance
$$
\hat \sigma_{u}^2 = \frac{N}{N-1} \hat \sigma^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i- \hat\mu)^2
$$
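
The factor $(N-1)/N$ can be checked with a small Monte Carlo sketch. The sample size, number of trials and true variance below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, sigma2 = 5, 200_000, 4.0
# Each row is an independent data set of N Gaussian samples
x = rng.normal(loc=10.0, scale=np.sqrt(sigma2), size=(trials, N))

var_mle = np.var(x, axis=1, ddof=0)       # divides by N (MLE)
var_unbiased = np.var(x, axis=1, ddof=1)  # divides by N - 1

print("Average MLE: %0.3f, expected (N-1)/N sigma^2: %0.3f" % (var_mle.mean(), (N - 1)/N*sigma2))
print("Average unbiased: %0.3f, true sigma^2: %0.3f" % (var_unbiased.mean(), sigma2))
```

In NumPy the choice between both estimators is made with the `ddof` argument of `np.var`.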

***
### Extra: Connection with MSE cost and least squares

The log likelihood when we assume a Gaussian distribution is
$$
\log L (\mu, \sigma^2) =  - \frac{N}{2} \log 2\pi \sigma^2 - \frac{1}{2\sigma^2}   \sum_{i=1}^N (x_i-\mu)^2,
$$
if we assume that the variance is known and fixed then the MLE solution
$$
\begin{align}
\hat \mu &= \text{arg}\max_{\mu} \log L(\mu, \sigma^2) \nonumber \\
&= \text{arg}\max_{\mu} \text{Const} -   \text{Const} \sum_{i=1}^N (x_i-\mu)^2 \nonumber
\end{align}
$$
is equivalent to minimizing 
$$
\text{MSE} = \frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2 = \frac{1}{N} \| x - \mu \|^2,
$$
the well-known Mean Square Error criterion (MSE), since positive constant factors do not change the location of the optimum. In summary:


MLE for a Gaussian dist. with known variance 
$\equiv$
**Least Squares:** Minimizing the MSE

***
### Exercise: MLE for a Bernoulli distribution

A magician friend of yours has bought a "magic coin". He asks you to obtain for him the probability of obtaining a head with such coin.

The coin has two outputs (head/tail) so we can assume that it follows a Bernoulli distribution

$$
f(x|p) = p^x (1-p)^{1-x}, ~~ x \in \{0, 1\}
$$

Your friend tosses the coin N times and records the outputs $\{x_i\}$

- **Objective 1:** Find an analytic expression for $\hat p$
- **Objective 2:** Use your expression and find $\hat p$ for the following "coin toss vector"    
        coin_toss = np.random.binomial(n=1, p=0.75, size=N)
    How large does $N$ need to be so that $|\hat p - p|/p < 0.05$? (try 10 different seeds)


In this case the log likelihood is 
$$
\begin{align}
\log L(p) &= \log \prod_{i=1}^N p^{x_i} (1-p)^{1- x_i} \nonumber \\
&= \sum_i x_i \log (p)  + (1-x_i) \log(1-p) \nonumber \\
&= \log (p) \sum_i x_i +  \log(1-p) \left(N - \sum_i x_i\right) \nonumber
\end{align}
$$

***
Setting the derivative to zero, the MLE of $p$ is
$$
\begin{align}
\frac{d\log L(p)}{dp} &=  \frac{ \sum_i x_i}{p} -  \frac{N - \sum_i x_i}{1-p} =0 \nonumber \\
&\implies \hat p = \frac{1}{N} \sum_{i=1}^N x_i \nonumber \\
\end{align}
$$
***
which has the same form as the MLE of the mean of a Gaussian dist. (the sample mean)
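
One possible way to explore Objective 2 is sketched below. It uses `np.random.default_rng` instead of the legacy `np.random.binomial` call from the exercise, and the list of sample sizes is an arbitrary choice.

```python
import numpy as np

p_true = 0.75
worst_error = {}
for N in [10, 100, 1000, 10000]:
    rel_errors = []
    for seed in range(10):
        rng = np.random.default_rng(seed)
        coin_toss = rng.binomial(n=1, p=p_true, size=N)
        p_hat = np.mean(coin_toss)  # MLE of p: fraction of heads
        rel_errors.append(abs(p_hat - p_true)/p_true)
    worst_error[N] = max(rel_errors)
    print("N=%d, worst relative error over 10 seeds: %0.4f" % (N, worst_error[N]))
```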

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
x = np.random.binomial(n=1, p=0.75, size=100)
p = np.linspace(1e-2, 1-1e-2, num=100)
logL = np.log(p)*np.sum(x) + np.log(1-p)*(len(x)-np.sum(x))
ax.plot(p, logL); ax.scatter(p[np.argmax(logL)], np.amax(logL), s=100, c='k', zorder=100)
ax.plot([p[np.argmax(logL)], p[np.argmax(logL)]], [np.amin(logL), np.amax(logL)], linestyle='--')
ax.set_xlabel('p'); ax.set_ylabel('log L(p)');
display("Best p: %0.4f" %(p[np.argmax(logL)]))

***
### Extra: Connection with cross entropy cost

The log likelihood assuming a Bernoulli distribution

$$
\log L(p) = \sum_i x_i \log (p)  + (1-x_i) \log(1-p) 
$$

is the negative of the binary cross-entropy (summed over the observations), *i.e.* maximizing the log likelihood is equivalent to minimizing the cross-entropy cost.
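
A short numerical check of this equivalence on simulated coin tosses. Here `binary_cross_entropy` is written by hand and averaged over the observations, so it differs from the negative log likelihood only by the factor $1/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.75, size=100)

def bernoulli_log_likelihood(p, x):
    return np.sum(x*np.log(p) + (1 - x)*np.log(1 - p))

def binary_cross_entropy(p, x):
    # Average cross-entropy between the outcomes x and the predicted probability p
    return -np.mean(x*np.log(p) + (1 - x)*np.log(1 - p))

p_grid = np.linspace(0.01, 0.99, num=99)
logL = np.array([bernoulli_log_likelihood(p, x) for p in p_grid])
bce = np.array([binary_cross_entropy(p, x) for p in p_grid])
print("argmax logL: %0.2f, argmin BCE: %0.2f, sample mean: %0.2f"
      % (p_grid[np.argmax(logL)], p_grid[np.argmin(bce)], np.mean(x)))
```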
***

***

### Example: MLE of a Gaussian mixture

Let's imagine that our *iid* data come from a mixture of Gaussians with K components

$$
f(x_i|\pi,\mu,\sigma^2) = \sum_{k=1}^K \pi_k \mathcal{N}(x_i|\mu_k, \sigma_k^2),
$$
where $\sum_{k=1}^K \pi_k = 1$ and $\pi_k \in [0, 1] ~~ \forall k$

We can write the log likelihood as

$$
\log L(\pi,\mu,\sigma^2) = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}(x_i|\mu_k, \sigma_k^2)
$$

- Oh my! We cannot obtain analytical expressions for the parameters as before
- We have to resort to iterative methods/optimizers. Can you name any?
- **Expectation Maximization** (we will see this in a future class)
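
Even without a closed-form solution, the mixture log likelihood is easy to evaluate for given parameters, which is all an iterative optimizer needs. A minimal sketch for the unidimensional case; the function name `gmm_log_likelihood` and the simulated data are illustrative, and `scipy.special.logsumexp` is used to compute the inner sum stably.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(x, pis, mus, sigmas):
    # log N(x_i | mu_k, sigma_k^2) for every pair (i, k): shape (N, K)
    log_pdf = norm.logpdf(x[:, None], loc=mus[None, :], scale=sigmas[None, :])
    # log sum_k pi_k N(x_i | mu_k, sigma_k^2), then sum over the observations
    return np.sum(logsumexp(log_pdf + np.log(pis)[None, :], axis=1))

rng = np.random.default_rng(0)
x = np.concatenate((rng.normal(-4, 2, size=700), rng.normal(3, 2, size=300)))
pis, mus, sigmas = np.array([0.7, 0.3]), np.array([-4.0, 3.0]), np.array([2.0, 2.0])
logL = gmm_log_likelihood(x, pis, mus, sigmas)
print("log L at the generating parameters: %0.2f" % logL)
```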

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
x = np.random.randn(100, 2); y = 5 + 2*np.random.randn(100, 2)
ax.scatter(x[:, 0], x[:, 1]); ax.scatter(y[:, 0], y[:, 1]);

***

## Optimality properties and uncertainty of MLEs 

Assuming that the data truly comes from the specified model the MLE is
- **Consistent:** The estimate converges to the true parameter as the number of data points increases
$$
\lim_{N\to \infty} \hat \theta = \theta
$$
- **Asymptotically normal:** The distribution of the estimate approaches a normal centered at the true parameter. 
$$
\lim_{N\to \infty} p(\hat \theta) = \mathcal{N}(\hat \theta | \theta, \sigma_\theta^2)
$$
- **Minimum variance:** The estimate asymptotically achieves the theoretical minimal variance given by the Cramer-Rao bound

***

### Cramer-Rao lower bound:
Inverse of the expected Fisher information, *i.e.* the expected second derivative of $- \log L$ with respect to $\theta$. In practice it is approximated by evaluating the second derivative at the MLE
$$
\sigma_{nm}^2 =  \left (- \frac{\partial^2 \log L (\theta)}{\partial\theta_n \partial\theta_m} \bigg\rvert_{\theta = \hat\theta}\right)^{-1}
$$
Note that
- $\sigma_{nm}^2$ is the minimum variance achieved by an unbiased estimator.
- $\sigma_{nn}^2$ gives the marginal error bars 
- If $\sigma_{nm} \neq 0 ~~ n\neq m$, then errors are correlated, *i.e.* some combinations of parameters might be better determined than others
***

### Example

Considering a Gaussian likelihood
$$
\log L (\mu, \sigma^2) =  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 
$$

What is the uncertainty of the MLE 
$$
\hat \mu = \frac{1}{N} \sum_{i=1}^N x_i
$$

In this case the Cramer-Rao bound is
$$
\begin{align}
\sigma_{\hat\mu}^2  &= \left (- \frac{d^2 \log L(\mu, \sigma^2)}{d\mu^2} \bigg\rvert_{\mu=\hat\mu}\right)^{-1}  \nonumber \\
&=  \left (- \frac{1}{\sigma^2} \frac{d}{d\mu}  \sum_{i=1}^N (x_i-\mu) \bigg\rvert_{\mu=\hat\mu}\right)^{-1}  \nonumber \\
&=  \left ( \frac{N}{\sigma^2}  \bigg\rvert_{\mu=\hat\mu}\right)^{-1} = \frac{\sigma^2}{N}  \nonumber
\end{align}
$$
- Its square root, $\sigma/\sqrt{N}$, is known as the standard error of the mean
- Also, $p(\hat \mu) \to \mathcal{N}(\hat \mu| \mu, \sigma^2/N)$

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 3.5))
ax.set_title('Mean of Gaussian distributed data')
N, mu_real, s_real = int(1e+4), 2.23142, 1.124123
np.random.seed(0)
x = mu_real + s_real*np.random.randn(N)
mu_estimator = np.array([np.mean(x[:i]) for i in range(1, N)])

ax.plot([1, N], [mu_real, mu_real], 'k--', label='Real')
ax.plot(range(1,N), mu_estimator, label='MLE');
ax.fill_between(np.arange(1, N), mu_estimator-s_real/np.sqrt(np.arange(1, N)), 
                mu_estimator+s_real/np.sqrt(np.arange(1, N)), alpha=0.5);
ax.set_xscale('log'); ax.set_xlabel('Number of samples'); plt.legend();

***

## Bootstrap

The uncertainty of a point-estimate can be non-parametrically calculated using **bootstrap resampling**

In bootstrap you generate new datasets that follow the properties of the original one 

<img src="img/bootstrap_diagram.png">

The conceptual steps are:

1. Create a new set by randomly selecting $N$ observations with replacement
1. Compute the value of your estimator on the new dataset
1. Go back to step 1 until you have $T$ values
1. Now you have an empirical distribution for the estimator. Use it to get a confidence interval

**Note:** There are many types of bootstrap tests with different properties and assumptions (more on this in a future class)
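
The four steps can be sketched as a small reusable function. This is only the simple percentile bootstrap, one of several variants; the name `bootstrap_ci` and its defaults are illustrative. Because no distribution is assumed, it works for estimators such as the median as well.

```python
import numpy as np

def bootstrap_ci(x, estimator, T=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-3: resample with replacement and re-evaluate the estimator T times
    values = np.array([estimator(rng.choice(x, size=len(x), replace=True))
                       for _ in range(T)])
    # Step 4: percentile interval from the empirical distribution
    return np.percentile(values, [100*alpha/2, 100*(1 - alpha/2)])

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=200)
ci_mean = bootstrap_ci(x, np.mean)
ci_median = bootstrap_ci(x, np.median)
print("95%% CI for the mean:   [%0.4f, %0.4f]" % tuple(ci_mean))
print("95%% CI for the median: [%0.4f, %0.4f]" % tuple(ci_median))
```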

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
def update(N, T):
    ax.cla(); ax.set_xlabel('x')
    np.random.seed(0)
    x = np.random.randn(N) # zero mean, unit variance
    mle_mu = np.mean(x)    
    mle_mu_bs = [np.mean(np.random.choice(x, size=len(x), replace=True)) for k in range(T)]
    hist_val, hist_lim, _ = ax.hist(mle_mu_bs, density=True, alpha=0.6)
    t = np.linspace(hist_lim[0], hist_lim[-1], num=200)
    ax.plot(t, gaussian_pdf(t, mu=mle_mu, s=1/np.sqrt(len(x))), 'k-', linewidth=4)  
    ax.scatter(np.mean(x), 0, c='k', s=100, zorder=100)
    ci = np.percentile(mle_mu_bs, [2.5, 97.5])  # central 95% of the bootstrap distribution
    display("Empirical confidence interval at 0.95 = [%0.4f, %0.4f]" %(ci[0], ci[1]))
interact(update, N=SelectionSlider_nice(options=[10, 100, 1000, 10000], value=100),
         T=SelectionSlider_nice(options=[10, 100, 1000, 10000], value=100));

***

## Goodness of fit

- We can compare estimated parameters through the likelihood
- The MLE estimators give the maximum value of the likelihood
- But how good is it? "Best" might still be poor...

In the Gaussian likelihood with fixed variance 
$$
\begin{align}
\log L (\mu) &=  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2}   \sum_{i=1}^N \frac{(x_i-\mu)^2}{\sigma^2}  \nonumber \\
&=  \text{Const} - \frac{1}{2}   \sum_{i=1}^N z_i^2,  \nonumber \\
\end{align}
$$
where $z_i = (x_i-\mu)/\sigma$.

Meet the $\chi^2$ distribution with $k$ degrees of freedom (dof), the distribution of a **sum of the squares of k independent and standard normal distributed RVs**

$\chi^2$ cheat sheet:
- It has support on $\mathbb{R}^+$
- Degrees of freedom $k \in \mathbb{N}$
- $f(x) = \frac{1}{2^{k/2} \Gamma(k/2)}  x^{k/2-1} e^{-x/2}$
- Mean: $k$
- Variance: $2k$

Note that the distribution does not depend on $\mu$ and $\sigma^2$

In [None]:
from scipy.stats import chi2

fig, ax = plt.subplots(figsize=(7, 4))
line = ax.plot([], [], linewidth=4)
ax.set_title('Mean: dof, Variance: 2dof')
def update(dof):
    x = np.linspace(chi2.ppf(0.01, dof), chi2.ppf(0.99, dof), 100)
    px = chi2.pdf(x, dof)
    line[0].set_xdata(x); ax.set_xlim([np.amin(x)*0.9, np.amax(x)*1.1])
    line[0].set_ydata(px); ax.set_ylim([np.amin(px)*0.9, np.amax(px)*1.1])    
    
interact(update, dof=SelectionSlider_nice(options=[1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]));

In this  case

$$
\log L (\mu) \propto  - \frac{1}{2} \chi_N^2
$$

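So if:

$$
\chi_N^2  \lesssim N  + \sqrt{2N},
$$

*i.e.* the statistic is not much larger than its mean $N$ plus one standard deviation $\sqrt{2N}$, then the data is consistent with having been generated by the model with $\mu=\hat \mu$  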
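
A minimal sketch of this check on simulated data, assuming that $\sigma$ is known and that the candidate means are fixed in advance (not fitted to the same data). The tail probability from `scipy.stats.chi2.sf` is added here as a complementary summary.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N, mu_true, sigma = 200, 80.0, 1.0
x = mu_true + sigma*rng.standard_normal(N)

def chi2_statistic(x, mu, sigma):
    z = (x - mu)/sigma
    return np.sum(z**2)

stats, p_values = {}, {}
for mu_model in [80.0, 81.0]:
    stats[mu_model] = chi2_statistic(x, mu_model, sigma)
    # Probability of a statistic at least this large under chi2 with N dof
    p_values[mu_model] = chi2.sf(stats[mu_model], df=N)
    print("mu=%0.1f: chi2=%0.1f, N+sqrt(2N)=%0.1f, tail prob=%0.3g"
          % (mu_model, stats[mu_model], N + np.sqrt(2*N), p_values[mu_model]))
```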

***

***

## Model comparison

- What if the data was not Gaussian distributed in the first place? 
- How to compare models with different number of parameters? 
- In general, the more parameters a model has, the better the fit (risk of overfitting)
- How to score models taking into account their complexity?

Two options:
1. Cross-validation and bias/variance trade-off (based on finite data)
1. Akaike information criterion (AIC) (based on asymptotic approximation)

For a model with $k$ parameters and $N$ data points the AIC (with a small-sample correction, also known as AICc) is 

$$
\text{AIC} = -2 \log L(\hat \theta) + 2k + \frac{2k(k+1)}{N-k-1},
$$

which you seek to minimize
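
A small sketch of how the criterion above penalizes extra parameters. Two nested Gaussian models are compared on simulated zero-mean data; the function name `aic_small_sample` and both models are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def aic_small_sample(log_likelihood, k, N):
    # Same expression as the formula above, including its last (correction) term
    return -2*log_likelihood + 2*k + 2*k*(k + 1)/(N - k - 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)
N = len(x)

# Model A (k=2): Gaussian with free mean and variance, evaluated at its MLE
logL_A = np.sum(norm.logpdf(x, loc=np.mean(x), scale=np.std(x)))
# Model B (k=1): Gaussian with the mean fixed at zero, only the variance is fitted
logL_B = np.sum(norm.logpdf(x, loc=0.0, scale=np.sqrt(np.mean(x**2))))

print("A: logL=%0.3f, AIC=%0.3f" % (logL_A, aic_small_sample(logL_A, 2, N)))
print("B: logL=%0.3f, AIC=%0.3f" % (logL_B, aic_small_sample(logL_B, 1, N)))
```

Model A can never have a lower maximum likelihood than model B because B is a special case of A, so the likelihood alone cannot be used to choose between them.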

***

**Parsimony principle** (aka Occam's Razor): Choose the simplest scientific explanation that fits the evidence. 

Also related: KISS principle

***

[&larr; Go back to the index](#index)

***

<a id="nonparametric"></a>

# Nonparametric statistical modeling

***

**Recap:** Statistical models that do not assume an underlying distribution

Most famous example: **The histogram**

- The histogram is a numerical representation of a distribution 
- The histogram allows us to visualize our data and explore its statistical features
- The histogram is built by dividing the data range in **bins** and counting the observations that fall in each bin
- The parameters of the histogram are the size and location of the bins

***

The importance of setting the number of bins right:

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
x = np.linspace(-11, 10, num=1000)
px = 0.7*gaussian_pdf(x, mu=-4, s=2) + 0.3*gaussian_pdf(x, mu=3, s=2)
N = 1000; np.random.seed(0)
hatx = np.concatenate((-4 + 2*np.random.randn(int(0.7*N)), 
                       (3 + 2*np.random.randn(int(0.3*N)))))

def update(nbins): 
    ax.cla()
    ax.plot(x, px, 'k-', linewidth=4, alpha=0.8)
    hist, bin_edges = np.histogram(hatx, bins=nbins, density=True)
    ax.bar(bin_edges[:-1], hist, width=bin_edges[1:] - bin_edges[:-1], 
           edgecolor='k', align='edge', alpha=0.8)
    
interact(update, nbins=SelectionSlider_nice(options=[1, 2, 5, 10, 20, 50, 100], value=5));

- A small number of bins omits the features of the distribution
- A large number of bins introduces noise

***

***
## Histogram in practice

How to select the width/number of bins?

- Cross validation
    - AMISE: (Asymptotic) Mean integrated square error
- Rules of thumb, *e.g.* Scott's rule and Silverman's rule
    - Proportional to the scale of data
    - Inversely proportional to the number of samples 
    - Obtained through assumptions

***

### Silverman's rule

The width of the bins is 

$$
h = 0.9 \frac{\min[\sigma, 0.7412 (q_{75} - q_{25})]}{N^{1/5}},
$$

where $N$ is the number of observations, $\sigma$ is the standard deviation and $q_{75}-q_{25}$ is the interquartile range. 


**Silverman's assumption**: The unknown density is Gaussian


Assuming uniformly spaced bins then the number of bins is

$$
N_{bins} = \frac{\max(x)-\min(x)}{h}
$$
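
The rule and the resulting number of bins can be computed in a few lines. A sketch on simulated Gaussian data; the helper `silverman_width` is illustrative, and the number of bins is rounded up to an integer.

```python
import numpy as np

def silverman_width(x):
    sigma = np.std(x)
    q25, q75 = np.percentile(x, [25, 75])
    return 0.9*min(sigma, 0.7412*(q75 - q25))/len(x)**(1/5)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=1000)
h = silverman_width(x)
n_bins = int(np.ceil((np.max(x) - np.min(x))/h))
hist, bin_edges = np.histogram(x, bins=n_bins, density=True)
print("h=%0.3f, number of bins=%d" % (h, n_bins))
```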

***
### Other considerations

- Bins could have different boundaries (offsets)
- Bins could have different widths
- Multiresolution approach (wavelet style)

***

## Kernel density estimation (KDE)

- Another option for non-parametric density estimation is KDE
- In KDE each point has its "own bin", and bins can overlap
- KDE does not require choosing bin boundaries, only bin width

The unidimensional KDE for a set $\{x_i\}_{i=1,\ldots, N}$ is

$$
\hat f_h(x) = \frac{1}{Nh} \sum_{i=1}^N \kappa \left ( \frac{x - x_i}{h} \right)
$$

where $h$ is called the **kernel bandwidth** or kernel size and $\kappa(u)$ is the **kernel function**, which needs to be non-negative, have zero mean and integrate to unity.

For example, one broadly used kernel is 

$$
\kappa(u) = \frac{1}{\sqrt{2\pi}} \exp \left ( - \frac{u^2}{2} \right),
$$

the Gaussian kernel. 

**Other widely used kernels:** Exponential, Top-hat, Epanechnikov

***

<center><b>KDE in a nutshell</b>: Place a kernel on top of each point and get the average</center>

***

**Avoid confusion:** 
- Assuming that the data is **Gaussian distributed** and doing KDE with the **Gaussian kernel** are very **different things**! 
- Using the Gaussian kernel for non-Gaussian data is perfectly fine.
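
The KDE formula with the Gaussian kernel can be written directly in NumPy. A minimal sketch on simulated bimodal (non-Gaussian) data; `gaussian_kde_1d` is an illustrative name and the bandwidth follows a simplified Silverman-style rule.

```python
import numpy as np

def gaussian_kde_1d(x_eval, samples, h):
    # u[i, j] = (x_eval[i] - samples[j]) / h
    u = (x_eval[:, None] - samples[None, :])/h
    kappa = np.exp(-0.5*u**2)/np.sqrt(2*np.pi)
    # Average of the kernels placed on each sample, scaled by 1/h
    return np.sum(kappa, axis=1)/(len(samples)*h)

rng = np.random.default_rng(0)
samples = np.concatenate((rng.normal(-4, 2, size=700), rng.normal(3, 2, size=300)))
h = 0.9*np.std(samples)*len(samples)**(-1/5)
x_eval = np.linspace(-15, 15, num=1001)
density = gaussian_kde_1d(x_eval, samples, h)
area = np.sum(density)*(x_eval[1] - x_eval[0])
print("h=%0.3f, area under the estimate=%0.4f" % (h, area))
```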



In [None]:
from sklearn.neighbors import KernelDensity  # the sklearn.neighbors.kde module path is no longer available in recent scikit-learn
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(x, px, 'k-', linewidth=4, alpha=0.8)
line_kde = ax.plot(x, np.zeros_like(x))
hs = 0.9*np.std(hatx)*N**(-1/5)
def update(k, kernel): 
    kde = KernelDensity(kernel=kernel, bandwidth=hs*k).fit(hatx.reshape(-1, 1))
    line_kde[0].set_ydata(np.exp(kde.score_samples(x.reshape(-1, 1))))
    
interact(update, k=SelectionSlider_nice(description="$k =h/h_s$", options=[1/8, 1/4, 1/2, 1, 2, 4], value=1),
        kernel=SelectionSlider_nice(options=["gaussian", "exponential", "epanechnikov", "tophat"]));

## Other non-parametric methods

- Splines, kernel regression
- Support Vector Machine and Gaussian Processes
- Nearest neighbors
- Neural nets?

[&larr; Go back to the index](#index)

***
<a id="bayesian"></a>

# Bayesian approach to parametric modeling

***

**Recap:** The Bayesian premise
- Inference is made by producing probability density functions (pdf): **posterior**
- Model the uncertainty of the data, experiment, parameters, etc. as a **joint pdf**
- $\theta$ is a R.V., *i.e.* it follows a distribution: **prior**

Bayes' theorem and the law of total probability tell us

$$
p(\theta| \{x\}) = \frac{p(\{x\}, \theta)}{p(\{x\})}= \frac{p(\{x\}|\theta) p(\theta)}{\int p(\{x\}|\theta) p(\theta) d\theta} \propto p(\{x\}|\theta) p(\theta),
$$



- In Bayesian model fitting we seek the **posterior** (parameters given the data) 
- The posterior is built from the **likelihood**, **prior** and **evidence** (marginal data likelihood)
- The posterior can be small if either the likelihood or the prior is small



***
### Why/When should I use the Bayesian formalism?

- In many cases the Bayesian inference will not differ much from MLE
- In general the Bayesian inference is harder to compute and requires more sophisticated methods

Then? 
- We can integrate out (marginalize) unknown/missing/uninteresting (nuisance) parameters
- Principled way of injecting prior knowledge (regularization)
- Built-in error bars


***

### The Bayesian inference procedure

1. Formulate data likelihood
1. Choose a prior
1. Build a joint distribution (relation of all parameters)
1. Determine the posterior using Bayes Theorem
1. Find MAP and credible regions
1. Do hypothesis test
1. **Criticize:** Evaluate how appropriate the model is and suggest improvements

***

***

## Priors

Priors summarize what we know about the parameters before-hand, for example
- a parameter is bounded/unbounded (Normal/Cauchy)
- a parameter is positive (Half-normal, Half-Cauchy, Lognormal, Inverse Gamma)
- a parameter is positive-semidefinite (Inverse Wishart, LKJ)
- a parameter lies on a simplex (Dirichlet)

Priors can be 
- Informative, *e.g.* my parameter is $\mathcal{N}(\theta|\mu=5.4, \sigma^2=0.1)$
- Weakly informative, *e.g.* my parameter is $\mathcal{N}(\theta|\mu=0, \sigma^2=100.)$
- Uninformative (objective), *e.g.* my parameter is positive

Priors should 
- assign positive probability to possible values
- assign no weight to impossible values
- help regularize the solution

Other guidelines to select priors:
- **Conjugate priors:** Given a likelihood, the posterior belongs to the same family of distributions as the prior
- Maximum entropy principle

Stan prior choice recommendations:

https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations


[&larr; Go back to the index](#index)

***
<a id="MAP"></a>

# Maximum *a posteriori* (MAP) estimation

***

In the Bayesian setting the best "point estimate" of the parameters of the model is given by the MAP

$$
\hat \theta = \text{arg} \max_\theta p(\theta|\{x\}) =  \text{arg} \max_\theta p(\{x\}| \theta) p(\theta),
$$

where we "omit" the evidence because it does not depend on $\theta$

Applying the logarithm (monotonic) we can decouple the likelihood from the prior

$$
\hat \theta = \text{arg} \max_\theta \log p(\{x\}| \theta) + \log p(\theta),
$$


- MAP estimation is also referred to as penalized MLE
- We saw that the likelihood can be interpreted as the error between model and data (*e.g.* MSE, cross-entropy)
- The prior can be interpreted as a regularizer on the parameters (more on this later)

***

### Example: MAP of the mean of a Gaussian dist.

We want to find the MAP for the weight of your professor. 

Assuming that the likelihood is Gaussian with known variance we have

$$
\log p(\{x\}|\theta) = \log L (\mu)  = - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2, 
$$

and further assuming that the true weight has a Gaussian prior $\mathcal{N}(\mu|\mu_0, \sigma^2_0)$

$$
\log p(\theta) = -\frac{1}{2} \log 2 \pi \sigma^2_0 - \frac{1}{2 \sigma^2_0}  (\mu - \mu_0)^2,
$$

then we set the derivative to zero

$$
\frac{d}{d\mu} \left[ \log p(\{x\}|\theta) + \log p(\theta) \right] =   \frac{1}{\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)  - \frac{1}{ \sigma^2_0}  (\mu - \mu_0) = 0,
$$
and we get the MAP estimate

$$
\hat \mu =  \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} \left(\frac{N\bar x}{\sigma^2}  + \frac{\mu_0}{\sigma^2_0} \right),
$$
where $\bar x = \frac{1}{N} \sum_{i=1}^N x_i$.

**IMPORTANT:** Do not confuse $\sigma^2$ (the noise variance) and $\sigma^2_0$ (prior variance)
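The closed-form MAP can be checked against a direct numerical maximization of the log posterior. The following is a minimal sketch with made-up values for the data, the noise variance `sigma2` and the prior parameters `mu0`, `sigma2_0`; `scipy.optimize.minimize_scalar` is used here only as a generic one-dimensional optimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma2, N = 4.0, 10            # known noise variance and number of samples (assumed)
mu0, sigma2_0 = 75.0, 9.0      # prior mean and prior variance (assumed)
x_obs = 80.0 + np.sqrt(sigma2) * rng.standard_normal(N)

def neg_log_posterior(mu):
    # Negative of (log-likelihood + log-prior), dropping terms that do not depend on mu
    return 0.5 * np.sum((x_obs - mu)**2) / sigma2 + 0.5 * (mu - mu0)**2 / sigma2_0

mu_numeric = minimize_scalar(neg_log_posterior).x
mu_closed = (N * np.mean(x_obs) / sigma2 + mu0 / sigma2_0) / (N / sigma2 + 1.0 / sigma2_0)
print(mu_numeric, mu_closed)
```

Both values should agree up to the tolerance of the optimizer, and the estimate lies between the prior mean and the sample mean.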


***

### Particular cases 

- The MAP estimator for a  standard normal prior $\mathcal{N}(\mu| 0, 1)$ is
$$
\hat \mu =  \left(\frac{N}{\sigma^2} + 1 \right)^{-1} \left(\frac{N\bar x}{\sigma^2} \right) = \frac{1}{1 + \sigma^2/N} \bar x,
$$
note that
$$
\lim_{N \to \infty} \hat \mu = \bar x,
$$
which is the MLE solution. Can you explain why this happens?


- Similarly, the MAP estimator for a normal prior $\mathcal{N}(\mu| 0, \sigma^2_0)$ with $\sigma^2_0 \to \infty$ is
$$
\hat \mu =  \left(\frac{N}{\sigma^2} \right)^{-1} \left(\frac{N\bar x}{\sigma^2} \right) =  \bar x,
$$
which is again the MLE solution. The prior is uninformative about $\mu$, so it has no influence on the result

***

***

### General case

Note that
$$
\begin{align}
\hat \mu &=  \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} \left(\frac{N\bar x}{\sigma^2}  + \frac{\mu_0}{\sigma^2_0} \right)  \nonumber \\
&=  \frac{N \bar x \sigma^2_0 + \mu_0 \sigma^2}{N\sigma^2_0+ \sigma^2} = \frac{\bar x + \mu_0 \frac{\sigma^2}{\sigma^2_0 N}}{1 + \frac{\sigma^2}{\sigma^2_0 N}}  \nonumber \\
&= w \bar x + (1-w) \mu_0, \qquad w = \frac{1}{1 + \frac{\sigma^2}{\sigma^2_0 N}}  \nonumber
\end{align}
$$
The MAP estimate of the mean is a weighted average of the MLE solution $\bar x$ and the prior mean $\mu_0$.

If either $\sigma^2_0$ or $N$ is large with respect to $\sigma^2$, then $w \to 1$ and the MLE is recovered


**Reflect on the following:** The prior has more influence when
- You have few samples
- Your samples are noisy
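To make the reflection concrete, the weight $w$ can be tabulated for a few values of $N$ and $\sigma^2$. The helper `map_weight` is a hypothetical name introduced for this sketch, and the prior variance is fixed to one for simplicity.

```python
import numpy as np

def map_weight(N, sigma2, sigma2_0):
    """Weight w given to the sample mean in the MAP estimate of the mean."""
    return 1.0 / (1.0 + sigma2 / (sigma2_0 * N))

sigma2_0 = 1.0  # prior variance (assumed)
for sigma2 in [0.1, 1.0, 10.0]:
    for N in [1, 10, 100]:
        print(f"sigma2={sigma2:5.1f}  N={N:3d}  w={map_weight(N, sigma2, sigma2_0):.3f}")
```

Small $N$ or large $\sigma^2$ push $w$ toward zero, so the estimate leans on the prior mean; many clean samples push $w$ toward one.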

***

*** 
### Extra: MAP interpretation as a penalized MLE/regularized LS

The MAP estimate of the mean of a Gaussian dist with known variance using a zero-mean normal prior is
$$
\begin{align}
\hat \mu &= \text{arg} \max_\mu  \log p(\{x\}| \mu, \sigma^2) + \log p(\mu) \nonumber \\
&= \text{arg} \max_\mu   - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 -  \frac{1}{2\sigma_0^2} \mu^2 \nonumber \\
&= \text{arg} \min_\mu \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 +  \frac{1}{2\sigma_0^2} \mu^2 \nonumber \\
&= \text{arg} \min_\mu \|x-\mu\|^2  + \lambda \|\mu \|^2, \nonumber
\end{align}
$$
where $\lambda = \frac{\sigma^2}{\sigma_0^2}$. 

We recognize the last equation as a regularized least squares problem
- A Gaussian prior yields an L2 regularizer (ridge regression)
- A Laplacian prior yields an L1 regularizer (LASSO)
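The equivalence between the zero-mean Gaussian prior and the L2 penalty can be verified numerically. The sketch below, with assumed values for $\sigma^2$ and $\sigma_0^2$, compares the MAP formula, the stationary point of the penalized cost and a generic least squares solve on an augmented system.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, sigma2_0, N = 2.0, 0.5, 8      # noise variance, prior variance, samples (assumed)
x_obs = 1.5 + np.sqrt(sigma2) * rng.standard_normal(N)

lam = sigma2 / sigma2_0
# Minimizer of sum_i (x_i - mu)^2 + lam * mu^2, obtained by setting the derivative to zero
mu_ridge = np.sum(x_obs) / (N + lam)
# MAP estimate with a zero-mean Gaussian prior
mu_map = (N * np.mean(x_obs) / sigma2) / (N / sigma2 + 1.0 / sigma2_0)
# Same problem written as ordinary least squares with one extra "pseudo-observation"
A = np.concatenate([np.ones(N), [np.sqrt(lam)]])[:, None]
b = np.concatenate([x_obs, [0.0]])
mu_lstsq = np.linalg.lstsq(A, b, rcond=None)[0][0]
print(mu_ridge, mu_map, mu_lstsq)
```

The augmented row acts like an additional observation at zero whose strength is controlled by $\lambda$.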
***

***

## Full posterior with conjugate prior 

The MAP is only a point estimate. For the mean of a Gaussian dist we can get the full posterior analytically

$$
\begin{align}
p(\theta |\{x\}) &\propto p(\{x\} |\theta ) p(\theta ) \nonumber \\
&\propto \exp \left ( -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 \right) \exp \left ( -\frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 \right) \nonumber \\
&\propto \exp \left ( -\frac{1}{2 \hat \sigma^2} (\mu - \hat \mu)^2 \right),  \nonumber 
\end{align}
$$
where 
$$
\hat \sigma^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} 
$$

***
**This shows that the Gaussian is conjugate to itself**

***
Another way to show this is to use the [property of Gaussian pdf multiplication](http://www.tina-vision.net/docs/memos/2003-003.pdf)

$$
\mathcal{N}(x|\mu_1, \sigma_1^2) \mathcal{N}(x|\mu_2, \sigma_2^2) = C \mathcal{N}\left(x\bigg\rvert \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right), \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\right)
$$

where $C$ is a scaling constant
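The multiplication property can be checked numerically: the ratio between the product of the two pdfs and the Gaussian on the right-hand side should not depend on $x$. The means and variances below are arbitrary illustration values.

```python
import numpy as np
from scipy.stats import norm

mu1, s1_sq = 1.0, 0.5      # first Gaussian (assumed values)
mu2, s2_sq = -0.5, 2.0     # second Gaussian (assumed values)

# Variance and mean of the resulting Gaussian, as in the formula above
s_sq = s1_sq * s2_sq / (s1_sq + s2_sq)
mu = s_sq * (mu1 / s1_sq + mu2 / s2_sq)

x = np.linspace(-3, 3, 7)
product = norm.pdf(x, mu1, np.sqrt(s1_sq)) * norm.pdf(x, mu2, np.sqrt(s2_sq))
ratio = product / norm.pdf(x, mu, np.sqrt(s_sq))
print(ratio)  # the same value C at every x
```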


In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(10, 4))
x = np.linspace(-10, 10, num=10000)

def update(mu, sigma2, alpha, beta, N):
    # mu, sigma2: true mean and (known) noise variance of the data
    # alpha, beta: mean and variance of the Gaussian prior (mu_0 and sigma_0^2 in the text)
    np.random.seed(0); xi = mu + np.sqrt(sigma2)*np.random.randn(N)
    ax.cla(); ax.set_xlim([-5, 5])
    ax.scatter(mu, 0, c='k', s=100, zorder=100)  # true mean
    # Normalized likelihood as a function of the mean, centered at the sample mean (the MLE)
    likelihood = gaussian_pdf(x, np.mean(xi), np.sqrt(sigma2/N))
    ax.plot(x, likelihood, label='likelihood'); ax.fill_between(x, 0, likelihood, alpha=0.5)
    prior = gaussian_pdf(x, alpha, np.sqrt(beta))
    ax.plot(x, prior, label='prior'); ax.fill_between(x, 0, prior, alpha=0.5)
    # Closed-form posterior variance and mean (the posterior mean is also the MAP)
    s2_pos = (N/sigma2 + 1./beta)**-1
    mu_pos = (np.sum(xi)/sigma2 + alpha/beta)*s2_pos
    posterior = gaussian_pdf(x, mu_pos, np.sqrt(s2_pos))
    ax.plot(x, posterior, label='posterior'); ax.fill_between(x, 0, posterior, alpha=0.5)
    ax.legend()

interact(update, 
         mu=FloatSlider_nice(description=r"$\mu$", min=-3, max=3, value=2.), 
         sigma2=FloatSlider_nice(description=r"$\sigma^2$", min=0.1, max=10., value=1.),
         alpha=FloatSlider_nice(description=r"$\alpha$", min=-3, max=3, value=0.), 
         beta=FloatSlider_nice(description=r"$\beta$", min=0.1, max=10., value=1.),
         N=SelectionSlider_nice(options=[1, 2, 5, 10, 20, 50, 100], value=10));

***

## Conjugate priors when $\sigma^2$ is unknown

Up to now we assumed that the variance  $\sigma^2$ was known 
- Assuming that the mean $\mu$ is known the conjugate prior for the variance is an inverse-Gamma distribution
$$
\sigma^2 \sim \text{IG}(\alpha, \beta), \qquad p (x|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha-1} e^{-\frac{\beta}{x}}
$$
- If both the mean and variance are unknown 
    - Multiplying the normal prior and the IG prior does not yield a conjugate prior (assumes independence of $\mu$ and $\sigma$)
    - In this case the conjugate prior is hierarchical
    $$
    \begin{align}
    p(x_i|\mu, \sigma^2) &= \mathcal{N}(\mu, \sigma^2)  \nonumber \\
    p(\mu|\sigma^2) &= \mathcal{N}(\mu_0, \sigma^2/\kappa_0)  \nonumber \\
    p(\sigma^2) &= \text{IG}(\alpha, \beta)  \nonumber
    \end{align}
    $$
    - And it is called normal-inverse-gamma (NIG), a four parameter distribution 
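For the known-mean case, the conjugate update of the inverse-Gamma prior is the standard result $\text{IG}(\alpha + N/2, \beta + \frac{1}{2}\sum_i (x_i-\mu)^2)$. The sketch below compares that closed form with a brute-force grid posterior; the data and the prior parameters are assumed for illustration, and `scipy.stats.invgamma` with `scale=beta` is used as the IG density.

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(3)
mu, N = 0.0, 30                        # known mean and number of samples (assumed)
x_obs = mu + 1.5 * rng.standard_normal(N)
alpha, beta = 2.0, 1.0                 # IG prior parameters (assumed)

# Closed-form conjugate update
alpha_post = alpha + N / 2
beta_post = beta + 0.5 * np.sum((x_obs - mu)**2)

# Brute-force posterior on a grid of variances: likelihood times prior, then normalize
s2_grid = np.linspace(0.2, 10.0, 5000)
log_unnorm = (norm.logpdf(x_obs[:, None], mu, np.sqrt(s2_grid)[None, :]).sum(axis=0)
              + invgamma.logpdf(s2_grid, a=alpha, scale=beta))
unnorm = np.exp(log_unnorm - log_unnorm.max())
grid_post = unnorm / (unnorm.sum() * (s2_grid[1] - s2_grid[0]))

closed_post = invgamma.pdf(s2_grid, a=alpha_post, scale=beta_post)
print(np.max(np.abs(grid_post - closed_post)))
```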
    
    
***
An extremely good document by Kevin Murphy on conjugate priors for the Gaussian dist: https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf

***

***

### Extra: Mean of the posterior

Another point estimate that can be used to characterize the posterior is

$$
\hat \theta = \mathbb{E}[\theta|\{x\}] = \int \theta p(\theta| \{x\}) d\theta,
$$

*i.e.* the mean or expected value of the posterior.
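For a Gaussian posterior the mean and the MAP coincide, but for a skewed posterior they differ. As a small illustration, assume an inverse-Gamma posterior with hypothetical parameters and compare its mean, computed by numerical integration, with its mode $\beta/(\alpha+1)$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import invgamma

alpha_post, beta_post = 5.0, 8.0   # hypothetical posterior parameters
posterior = invgamma(a=alpha_post, scale=beta_post)

# Posterior mean as the integral of theta * p(theta | x)
post_mean, _ = quad(lambda t: t * posterior.pdf(t), 0.0, np.inf)
# Mode (MAP) of an inverse-Gamma distribution
post_mode = beta_post / (alpha_post + 1.0)
print(post_mean, posterior.mean(), post_mode)
```

The mean is pulled toward the heavy right tail, so it is larger than the mode.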

***

***

## Future topics

- Empirical Bayes: Model in which the hyperparameters are estimated from data instead of being fixed beforehand
- Hierarchical Bayes: Priors have hyperpriors with hyperhyperparameters
- Markov Chain Monte Carlo (MCMC): Algorithm to sample from a distribution. We will use it to learn complex Bayesian models

***

***

## Sneak peek: Probabilistic programming with [PyMC3](https://docs.pymc.io/)

***

PyMC3 is a library to learn Bayesian models using MCMC and VI (variational inference)

In [None]:
import pymc3 as pm
print('Running on PyMC3 v{}'.format(pm.__version__))
np.random.seed(0)
N, mu_t, sigma_t = 100, 10, 2
x = mu_t + sigma_t*np.random.randn(N)
print("MLE: %f %f" %(np.mean(x), np.std(x)/np.sqrt(N)))
with pm.Model() as demo:
    #mu_0 = pm.Normal('mu0', mu=0, sd=10, shape=1)
    sigma = pm.HalfNormal('s', sd=100, shape=1)
    mu = pm.Normal('mu', mu=0, sd=100, shape=1)
    x_observed = pm.Normal('x_obs', mu=mu, sd=sigma, observed=x)
    trace = pm.sample(draws=10000, tune=2000, init='advi', n_init=20000, cores=4, chains=2, 
                      live_plot=True)
pm.summary(trace)

In [None]:
pm.traceplot(trace, figsize=(8, 4));

In [None]:
help(pm.sample)

[&larr; Go back to the index](#index)

*** 
<a id="appendix"></a>

## Appendix: Gaussian distribution
***

- The Gaussian dist. has domain in $\mathbb{R}$, it has two parameters $\mu$ and $\sigma^2$ and is rather easy to interpret

- The Gaussian/Normal probability density function (PDF) is defined as 
$$
f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left ( -\frac{(x-\mu)^2}{2\sigma^2}\right),
$$
where $\mu$ is the mean and $\sigma^2$ is the variance

- The Gaussian dist. is symmetric around $\mu$ and it has only one mode (unimodal) centered at $\mu$

- The cumulative distribution function (CDF) of a Gaussian is
$$
F(x) = \int_{-\infty}^x  f(z) dz = \frac{1}{2} \left ( 1 + \text{erf} \left(\frac{x-\mu}{\sigma \sqrt{2}} \right) \right)
$$
where the error function (erf) is 
$$
\text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^x \exp(-t^2) dt
$$

- Using $\int \exp(-a(x+b)^2) dx = \sqrt{\frac{\pi}{a}}$ we can easily see that
$$
\int f(x|\mu, \sigma^2) dx = \frac{1}{\sqrt{2\pi\sigma^2}} \int \exp \left ( -\frac{(x-\mu)^2}{2\sigma^2}\right) dx = 1
$$

- The first order moment of a Gaussian R.V. is
$$
\begin{align}
\mathbb{E}[X] &= \int x f(x|\mu, \sigma^2) dx  \nonumber \\ 
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int x \exp \left ( -\frac{(x-\mu)^2}{2\sigma^2}\right) dx  \nonumber \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int  (z+\mu) \exp \left ( -\frac{z^2}{2\sigma^2}\right) dz  \nonumber \\
&= \mu  \nonumber
\end{align}
$$
and its second order moment is 
$$
\mathbb{E}[X^2] = \int x^2 f(x|\mu, \sigma^2) dx = \mu^2 + \sigma^2
$$

- The variance of a Gaussian can be defined from its first and second moment as

$$
\sigma^2 = \text{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \mathbb{E}[(X - \mathbb{E}[X])^2] 
$$

- In the Gaussian dist. the mode (maximum of the pdf) coincides with the mean $\mu$

- The pdf tends to the Dirac delta $\delta(x-\mu)$ as $\sigma^2 \rightarrow 0$

- **Central Limit Theorem:** The normalized sum of a large number of independent, identically distributed RVs with finite variance is approximately Gaussian distributed

- It is the maximum entropy distribution when the mean and variance are specified
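The normalization and the first two moments listed above can be verified by numerical integration. This is a small sketch using `scipy.integrate.quad` with arbitrary values of $\mu$ and $\sigma^2$.

```python
import numpy as np
from scipy.integrate import quad

mu, sigma2 = 1.5, 0.8   # arbitrary mean and variance for the check

def pdf(x):
    # Gaussian pdf as defined above
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

total, _ = quad(pdf, -np.inf, np.inf)                    # should be 1
m1, _ = quad(lambda x: x * pdf(x), -np.inf, np.inf)      # first moment: mu
m2, _ = quad(lambda x: x**2 * pdf(x), -np.inf, np.inf)   # second moment: mu^2 + sigma^2
print(total, m1, m2 - m1**2)
```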

In [None]:
plt.close('all'); fig, ax = plt.subplots(1, 2, figsize=(8, 4))
dt=1e-4; x = np.arange(-5, 5, step=dt)

def update(xi, xf):
    for axis in ax:
        axis.cla(); 
        axis.set_xlim([-5, 5]);
    ax[0].plot(x, gaussian_pdf(x)); ax[1].plot(x, gaussian_cdf(x));
    xrange = np.arange(xi, xf, step=dt)
    ax[0].fill_between(xrange, 0, gaussian_pdf(xrange), alpha=0.5)
    ax[1].scatter([xi, xf], [gaussian_cdf(xi), gaussian_cdf(xf)], s=100, c='k', zorder=100)
    ax[1].text(xi+0.5, gaussian_cdf(xi), "Init"); ax[1].text(xf+0.5, gaussian_cdf(xf), "End")
    ax[0].set_title(r"$\int_{x_i}^{x_f} f(x) dx$ = %0.4f" %(np.sum(gaussian_pdf(xrange))*dt))
    area = gaussian_cdf(xf) - gaussian_cdf(xi)
    ax[1].set_title("$F(x_f) - F(x_i)$ = %0.4f" %(area if area >= 0 else 0))

interact(update, 
         xi=FloatSlider_nice(description="$x_i$", min=-5, max=5), 
         xf=FloatSlider_nice(description="$x_f$", min=-5, max=5));

## Appendix: Multivariate Gaussian distribution


- It has domain in $\mathbb{R}^D$ and has parameters $\mu \in \mathbb{R}^D$ and $\Sigma \in \mathbb{R}^{D\times D}$. The covariance $\Sigma$ is defined as 
$$
\Sigma = \begin{pmatrix} 
\Sigma_{11} & \Sigma_{12} & \ldots & \Sigma_{1D} \\
\Sigma_{21} & \Sigma_{22} & \ldots & \Sigma_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
\Sigma_{D1} & \Sigma_{D2} & \ldots & \Sigma_{DD} \\
\end{pmatrix},
$$
it is a square symmetric matrix and 
$$
\Sigma_{ij} = \mathbb{E} \left [ (X_i-\mathbb{E}[X_i]) (X_j-\mathbb{E}[X_j])\right] 
$$

- Its probability density function is

$$
f(x| \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{D} |\Sigma|}} \exp \left( - \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x - \mu) \right)
$$

- Diagonal covariance. If $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ with $\sigma_d^2 > 0$

$$
f(x| \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{D} \prod_{d=1}^D \sigma_d^2}} \exp \left( - \frac{1}{2} \sum_{d=1}^D \frac{(x_d-\mu_d)^2}{\sigma_d^2} \right)
$$

- Isotropic or spherical covariance. If $\Sigma = I \sigma^2$ with $\sigma^2 \in \mathbb{R}$

$$
f(x| \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{D} \sigma^{2D}}} \exp \left( - \frac{1}{2\sigma^2} \sum_{d=1}^D (x_d-\mu_d)^2 \right)
$$
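The density above can be written in a few lines of NumPy and compared with `scipy.stats.multivariate_normal` for the full, diagonal and isotropic cases. The function name `mvn_pdf` and the test values are chosen only for this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian pdf evaluated at a single point x."""
    D = len(mu)
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), solved without forming the inverse
    quad_form = diff @ np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi)**D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

mu = np.array([0.0, 1.0])
x = np.array([0.5, 0.2])
full = np.array([[2.0, 0.6], [0.6, 1.0]])   # general covariance
diag = np.diag([2.0, 1.0])                  # diagonal covariance
iso = 0.5 * np.eye(2)                       # isotropic covariance
for Sigma in (full, diag, iso):
    print(mvn_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```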


## Appendix: Exponential family

Distributions with a pdf having the following form

$$
p(x|\theta) = h(x) \exp \left( \eta(\theta) \cdot T(x) - A(\theta) \right),
$$

where $\eta(\theta)$ and $T(x)$ are vectors with length $|\theta|$. $\eta$ is called the natural parameter vector.

In the Gaussian case we can recognize

$$
T(x) = \left[ x , x^2 \right] , \qquad \eta(\mu, \sigma^2) = \left[ \frac{\mu}{\sigma^2} , -\frac{1}{2\sigma^2} \right]
$$

and

$$
h(x) = \frac{1}{\sqrt{2\pi}}, \qquad A(\mu, \sigma^2) =  \frac{\mu^2}{2\sigma^2} + \log(\sigma)
$$

Then
$$
\begin{align}
p(x|\theta) &= \frac{1}{\sqrt{2\pi}} \exp \left( \frac{x\mu}{\sigma^2}  -\frac{x^2}{2\sigma^2}  -\frac{\mu^2}{2\sigma^2} - \log(\sigma) \right)  \nonumber \\
&= \frac{1}{\sqrt{2\pi}\sigma} \exp \left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)  \nonumber
\end{align}
$$

An important property is that a pdf in the exponential family has a conjugate prior

$$
p(\theta|\alpha, \beta)  \propto \exp \left ( \alpha \cdot \eta(\theta) - \beta A(\theta) \right),
$$
where $\alpha \in \mathbb{R}^{|\theta|}$ and $\beta>0$ are the parameters of the prior.
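As a sanity check of the Gaussian decomposition given above, the exponential family form can be evaluated directly and compared with the usual pdf. The function name `gaussian_expfam_pdf` is introduced only for this sketch.

```python
import numpy as np
from scipy.stats import norm

def gaussian_expfam_pdf(x, mu, sigma2):
    """Gaussian pdf written as h(x) exp(eta . T(x) - A)."""
    T = np.array([x, x**2])                              # sufficient statistics
    eta = np.array([mu / sigma2, -0.5 / sigma2])         # natural parameters
    A = 0.5 * mu**2 / sigma2 + 0.5 * np.log(sigma2)      # mu^2/(2 sigma^2) + log(sigma)
    h = 1.0 / np.sqrt(2 * np.pi)
    return h * np.exp(eta @ T - A)

x, mu, sigma2 = 0.3, 1.0, 2.0
print(gaussian_expfam_pdf(x, mu, sigma2), norm.pdf(x, mu, np.sqrt(sigma2)))
```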