In [None]:
# Imports and matplotlib configuration
import numpy as np
import scipy.signal
%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib import animation
from ipywidgets import interact, Button, Output, Box
from IPython.display import display
from style import *

# Others
from scipy.special import erf
gaussian_pdf = lambda x, mu=0, s=1: np.exp(-0.5*(x-mu)**2/s**2)/(s*np.sqrt(2*np.pi))
gaussian_cdf = lambda x, mu=0, s=1: 0.5 + 0.5*erf((x-mu)/(s*np.sqrt(2)))

## Universidad Austral de Chile

# INFO337 - Herramientas estadísticas para la investigación

# Statistical inference: Modeling

### A course of the masters in informatics program

### https://github.com/magister-informatica-uach/INFO337

## References

1. Hastie, Tibshirani and Friedman, "The elements of statistical learning" 2nd Ed., *Springer*, **Chapter 8**
1. Murphy, "Machine Learning: A Probabilistic Perspective", *MIT Press*, 2012, **Chapter 5**
1. Ivezic, Connolly, VanderPlas and Gray, "Statistics, Data Mining, and Machine Learning in Astronomy", *Princeton University Press*, 2014, **Chapters 4 and 5**



## What is statistical inference?


**Inference:** 

<center>*Draw conclusions from facts through a scientific premise*</center>

**Statistical inference**:
- Facts: Observed data
- Premise: Probabilistic model
- Conclusion: An unobserved quantity of interest
- Objective: Quantify the uncertainty of the conclusion given the data and the premise


Examples of statistical inference tasks:
- **Parameter estimation:** What is the best estimate of a model parameter based on the observed data?
- **Confidence estimation:** How trustworthy is our point estimate?
- **Hypothesis testing:** Is the data consistent with a given hypothesis or model?

### Parametric and nonparametric models

To conduct inference we start by defining a statistical model. Models can be broadly classified as:

- **Parametric models:** 
    - Correspond to an analytical function (distribution) with free parameters
    - Have an *a-priori* fixed number of parameters
    - In general: Stronger assumptions, easier to interpret, faster to use
    
    
- **Non-parametric models:** 
    - Distribution-free models, although they still have parameters and assumptions (e.g. dependence)
    - The number of parameters depends on the amount of training data
    - In general: More flexible, harder to train
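
The contrast between the two families can be sketched on synthetic data. The example below is only an illustration: the sample, the grid and the choice of a kernel density estimate (KDE, via `scipy.stats.gaussian_kde`) as the non-parametric model are assumptions, not part of the course material. The Gaussian fit always has two parameters, while the KDE keeps one kernel per data point.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic sample (assumption)

# Parametric model: a Gaussian with two parameters, whatever the sample size
mu_hat, sigma_hat = data.mean(), data.std()

# Non-parametric model: a KDE that places one kernel on every data point
kde = gaussian_kde(data)

grid = np.linspace(-6.0, 6.0, 1201)
dx = grid[1] - grid[0]
p_parametric = norm.pdf(grid, loc=mu_hat, scale=sigma_hat)
p_kde = kde(grid)

# Both are density estimates, so both integrate to approximately one
area_parametric = np.sum(p_parametric) * dx
area_kde = np.sum(p_kde) * dx
print(f"Gaussian fit: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}, area={area_parametric:.3f}")
print(f"KDE: {kde.n} kernels, area={area_kde:.3f}")
```

Doubling the sample size would leave the Gaussian model with two parameters but double the number of kernels in the KDE.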
    
**Statistical modeling how to's**
- How to collect the data?
- How to construct a probabilistic model?
- How to incorporate expert (*a priori*) knowledge?
- How to interpret results? How to make predictions from data?
***


### Frequentist and Bayesian inference

There are two paradigms or perspectives for statistical inference: Frequentist (F) or classical and Bayesian (B). 

There are conceptual differences between these paradigms, for example

**Definition of probability:**
- F: Relative frequency of an event. An objective property of the real world
- B: Degree of subjective belief. Probability statements can be made not only on data but also on parameters and models themselves

**Interpretation of parameters:**
- F: They are unknown and fixed constants
- B: They have distributions that quantify the uncertainty of our knowledge about them. We can compute expected values of the parameters


## Frequentist approach on parametric modeling 

**Parametric inference:** We assume that observations follow a distribution, *i.e.* observations are a realization of a random process (sampling) 

The conceptual (iterative) steps of parametric inference are:
1. **Model fitting:** Find parameters by fitting data to the current model
1. **Model proposition:** Propose a new model that accommodates important features of the data better than the previous one

In the frequentist approach Step 1 is typically solved using **Maximum Likelihood Estimation (MLE)**, the Method of Moments (MoM) or M-estimators. 


### The likelihood function is

- a quantitative description of our experiment (measuring process)
- the starting point for **parametric modeling** in both F and B paradigms
- a metric that tells us how good our model is with respect to the **observed data**


### Formally speaking

- We have an experiment that we model as a set of R.Vs $X_1, X_2, \ldots, X_N$
- We have observations/realizations from our R.Vs $\{x_i\} = x_1, x_2, \ldots, x_N$
- We assume that the R.Vs follow a particular probability distribution $X_i \sim f(x | \theta)$
- The distribution has (unknown) parameters $\theta$
- The likelihood is a function of the parameters which is defined from the joint distribution

$$
\begin{align}
L(\theta) &= P(X_1=x_1, X_2=x_2, \ldots, X_N=x_N) \nonumber \\
&= f(x_1, x_2, \ldots, x_N | \theta) \nonumber
\end{align}
$$

Assuming that our observations are **independent and identically distributed** (iid)

$$
\begin{align}
L(\theta) &= f(x_1| \theta) \cdot f(x_2| \theta) \cdot \ldots \cdot f(x_N| \theta) \nonumber \\
&= \prod_{i=1}^N f(x_i| \theta) \nonumber
\end{align}
$$

<br>
<center>"Given $\{x_i\}$, how likely is it that it was generated by the model $f(x|\theta)$ with parameter $\theta$?"</center>
<center>"Given $\{x_i\}$, how likely is it that the unknown parameter was $\theta$?"</center>
 

***

### Note: Likelihood is not probability

- For a single observation, the likelihood is the pdf $f(x_i|\theta)$ evaluated at that observation
- The likelihood of a set is not normalized to 1, *i.e.* in general the likelihood is not a valid pdf
- The likelihood by itself cannot be interpreted as a probability of $\theta$
- Given a fixed data set the likelihood is defined as a function of $\theta$
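
The points above can be checked numerically. The sketch below assumes a small synthetic Gaussian sample with known $\sigma$ (both are illustrative choices) and integrates $L(\mu)$ over $\mu$ with a simple Riemann sum. The result is far from 1, so $L(\mu)$ is not a pdf of $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(loc=0.0, scale=sigma, size=5)  # fixed data set (synthetic, assumption)

def likelihood(mu):
    """Gaussian likelihood L(mu) of the fixed sample x with known sigma."""
    pdf_values = np.exp(-0.5 * (x - mu) ** 2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
    return np.prod(pdf_values)

mu_grid = np.linspace(-10.0, 10.0, 4001)
dmu = mu_grid[1] - mu_grid[0]
L_values = np.array([likelihood(mu) for mu in mu_grid])

# Riemann-sum approximation of the integral of L(mu) over mu
area = np.sum(L_values) * dmu
print(f"integral of L(mu) over mu = {area:.5f}")
```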

***

### Example: Likelihood of N Gaussian dist. samples 

- Let's say we have N random numbers and assume they are Gaussian *iid*
- We can compute their likelihood using the formula above for a given set of parameters:

$$
L(\theta=\{\mu, \sigma^2\}) = f(\{x\} | \mu, \sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left ( -\frac{(x_i-\mu)^2}{2\sigma^2}\right)
$$


In [None]:
from mpl_toolkits.axes_grid1.axes_divider import make_axes_locatable
mu_hat = np.linspace(-2.2, 2.2, num=200); 
s_hat = np.linspace(0.2, 2.2, num=200)
X, Y = np.meshgrid(mu_hat, s_hat)
fig, ax = plt.subplots(figsize=(7, 4), tight_layout=True);
#cax = make_axes_locatable(ax).append_axes("right", size="5%", pad="2%")

def update(N, mu, sigma, seed):
    ax.cla(); logL = np.zeros(shape=X.shape);
    ax.set_xlabel(r"$\mu$"); ax.set_ylabel(r"$\sigma$")
    np.random.seed(seed); xhat = mu + sigma*np.random.randn(N)
    for i, mu_hat_ in enumerate(mu_hat):
        for j, s_hat_ in enumerate(s_hat):
            logL[j, i] = -0.5*len(xhat)*np.log(2.*np.pi*s_hat_**2) - 0.5*np.sum((xhat-mu_hat_)**2)/s_hat_**2
    levels = [k*np.amax(logL) for k in np.logspace(0, 0.5, num=20)]
    ax.scatter(mu, sigma, s=100, marker='d', c='white', zorder=100)
    CS = ax.contourf(X, Y, (logL), levels=levels[::-1], cmap=plt.cm.Blues); 
    #fig.colorbar(CS, cax=cax)
    
interact(update, 
         N=SelectionSlider_nice(options=[10, 100, 1000], value=100),
         mu=FloatSlider_nice(description=r"$\mu$", min=-2, max=2, value=0.), 
         sigma=FloatSlider_nice(description=r"$\sigma$", min=0.5, max=2., value=1.),
         seed=IntSlider_nice(min=0, max=100));


- The value of the likelihood itself does not hold much meaning
- But it can be used to make comparisons between different parameter vectors/models
- **The larger the likelihood the better the model**
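
As a minimal sketch of such a comparison, the block below evaluates the Gaussian log likelihood of one synthetic sample under two candidate parameter vectors. The sample, the helper name `log_likelihood` and the two candidates are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100)  # synthetic data (assumption)

def log_likelihood(x, mu, sigma):
    """Log likelihood of iid Gaussian samples: sum of the log pdf values."""
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

logL_a = log_likelihood(x, mu=0.0, sigma=1.0)  # parameters used to generate the data
logL_b = log_likelihood(x, mu=1.5, sigma=1.0)  # a shifted candidate

print(f"log L(mu=0.0, sigma=1) = {logL_a:.2f}")
print(f"log L(mu=1.5, sigma=1) = {logL_b:.2f}")
print("preferred candidate:", "a" if logL_a > logL_b else "b")
```

Only the difference between the two values is informative; each number on its own depends on the sample size and the units of the data.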


## Maximum Likelihood Estimation (MLE)


In parametric modeling we are interested in finding the $\theta$ that best fits our observations. 

One method to do this is **MLE**:

1. Select a distribution/model for the observations and formulate the likelihood $L(\theta)$
1. Search for the $\theta$ that maximizes $L(\theta)$ given the data
$$
\hat \theta = \text{arg} \max_\theta L(\theta),
$$
where the point estimate $\hat \theta$  is called the **maximum likelihood estimator** of $\theta$
1. Determine the confidence region of $\hat \theta$ either analytically or numerically (bootstrap, cross-validation)
1. Make conclusions about your model (hypothesis test)


**Important**: A wrong assumption in step 1 can ruin your inference. How to select a model?


###  Example: MLE  for the mean of a Gaussian distribution

Let us:
- consider a set of N measurements $\{x_i\}_{i=1,\ldots, N}$ which corresponds to my weight :)
- assume that the instrument used to measure weight has an error that follows a Gaussian distribution with variance $\sigma^2$

**System interpretation:** The measurements can be viewed as noisy realizations of the true weight $\mu$

$$
x_i = \mu + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2),
$$

hence 

$$
f(x_i) = \mathcal{N}(x_i |\mu,\sigma^2) \quad \forall i
$$

The likelihood of the true weight $\mu$ given the measurements and the variance $\sigma^2$ is 
$$
L(\mu) = f(\{x_i\}| \mu, \sigma^2) = \prod_{i=1}^N f(x_i| \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \prod_{i=1}^N  \exp  \left( -\frac{(x_i-\mu)^2}{2\sigma^2} \right)
$$

**Objective:** Find the value of $\mu$ that maximizes the likelihood given $\{x_i\}$

***

**Trick of the trade: The log likelihood** 
- In many cases (exponential family) it is more practical to find the maximum of the logarithm of the likelihood
- The logarithm is a monotonically increasing function, so $\log L(\theta)$ attains its maximum at the same $\theta$ as $L(\theta)$
***

In this case the log likelihood is

$$
\begin{align}
\log L (\mu) &= \log \prod_{i=1}^N f(x_i|\mu, \sigma^2) \nonumber \\
&= \sum_{i=1}^N \log f(x_i|\mu, \sigma^2) \nonumber \\
&= - \frac{1}{2} \sum_{i=1}^N \log 2\pi\sigma^2 - \frac{1}{2} \sum_{i=1}^N  \frac{(x_i-\mu)^2}{\sigma^2}  \nonumber  \\
&=  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 \nonumber 
\end{align}
$$

We maximize by making the derivative of the log likelihood equal to zero

$$
\frac{d  \log L (\mu)}{d\mu} =  \frac{1}{\sigma^{2}}  \sum_{i=1}^N (x_i-\mu) =0
$$

Finally the MLE of $\mu$ is 
$$
\hat \mu = \frac{1}{N} \sum_{i=1}^N x_i, \quad \sigma >0
$$
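
The closed-form result can be cross-checked with a numerical optimizer. The sketch below assumes synthetic measurements with a known $\sigma$ and uses `scipy.optimize.minimize_scalar` to minimize the negative log likelihood. The data and the bracket are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0
x = 80.0 + sigma * rng.standard_normal(50)  # synthetic weight measurements (assumption)

def neg_log_likelihood(mu):
    """Negative Gaussian log likelihood as a function of mu, with sigma known."""
    return 0.5 * len(x) * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum((x - mu) ** 2) / sigma**2

# Maximizing log L is the same as minimizing -log L
result = minimize_scalar(neg_log_likelihood, bracket=(70.0, 90.0))
mu_numeric = result.x
mu_analytic = np.mean(x)

print(f"numerical MLE:  {mu_numeric:.6f}")
print(f"sample average: {mu_analytic:.6f}")
```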

### Example: MLE for the variance of a Gaussian distribution

The MLE estimator of the variance can be obtained using the same procedure:

$$
\log L (\mu, \sigma^2) =  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 
$$

$$
\frac{d  \log L (\mu, \sigma^2)}{d\sigma^2} =  - \frac{N}{2} \frac{1}{\sigma^2} + \frac{1}{2\sigma^{4}}\sum_{i=1}^N (x_i-\mu)^2 =0
$$

$$
\hat \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i- \hat\mu)^2
$$

- If the true mean is not known then this is a biased estimator of the true variance
- MLE can produce biased estimators


In [None]:
fig, ax = plt.subplots(figsize=(8, 4), tight_layout=True); 
ax2 = ax.twinx()
np.random.seed(0)
x = 80 + np.random.randn(10000)
#x = 80 + 2*np.random.rand(1000)  # What happens if the data is not normal
k_list = [int(k) for k in np.logspace(0, 4, num=50)]  # sample sizes, log-spaced from 1 to 10000
hat_mu = np.array([np.sum(x[:k])/k for k in k_list])
hat_var = np.array([np.sum((x[:k]-hat_mu[i])**2)/(k) for i, k in enumerate(k_list)])
ax.plot(k_list, hat_mu); ax2.plot(k_list, hat_var, linestyle='--'); ax.set_xscale('log')
ax.set_xlabel('Number of samples'); 
ax.set_ylabel(r'$\hat \mu$ (solid line)'); ax2.set_ylabel(r'$\hat \sigma^2$ (dashed line)');

### Biased and unbiased estimator

For a parameter $\theta$ and an estimator $\hat \theta$, if

$$
\mathbb{E}[\hat \theta] = \theta,
$$

then $\hat \theta$ is an unbiased estimator of $\theta$


#### Example: Is the MLE of $\mu$ unbiased?

$$
\begin{align}
\mathbb{E}[\hat \mu] &= \mathbb{E} \left[ \frac{1}{N} \sum_{i=1}^N x_i \right]  \nonumber \\
&= \frac{1}{N} \sum_{i=1}^N \mathbb{E}[x_i] = \frac{1}{N} \sum_{i=1}^N \mu = \mu  \nonumber
\end{align}
$$

YES


#### Example: Is the MLE of $\sigma^2$ unbiased?

First let's expand the expression of the MLE of the variance

$$
\begin{align}
\hat \sigma^2 &= \frac{1}{N} \sum_{i=1}^N \left(x_i- \frac{1}{N}\sum_{j=1}^N x_j \right)^2 \nonumber \\
&= \frac{1}{N} \sum_{i=1}^N x_i^2 - \frac{1}{N^2} \sum_{i=1}^N \sum_{j=1}^N x_i  x_j \nonumber \\
&= \frac{1}{N} \sum_{i=1}^N x_i^2 - \frac{1}{N^2} \sum_{i=1}^N \sum_{j\neq i} x_i x_j - \frac{1}{N^2} \sum_{i=1}^N x_i^2 \nonumber  \\
&= \frac{N-1}{N^2} \sum_{i=1}^N x_i^2 - \frac{1}{N^2} \sum_{i=1}^N \sum_{j \neq i} x_i x_j  \nonumber
\end{align}
$$

Then applying the expected value operator we get

$$
\begin{align}
\mathbb{E}[\hat \sigma^2] &= \frac{N-1}{N^2} \sum_{i=1}^N \mathbb{E} [x_i^2] - \frac{1}{N^2} \sum_{i=1}^N \sum_{j \neq i} \mathbb{E} [x_i] \mathbb{E} [x_j] \nonumber  \\
&= \frac{N-1}{N} (\sigma^2 + \mu^2) - \frac{N-1}{N} \mu^2 \nonumber \\
&= \frac{N-1}{N} \sigma^2 \neq \sigma^2  \nonumber 
\end{align}
$$

NO


If we multiply it by a constant we obtain the well known unbiased estimator of the variance

$$
\hat \sigma_{u}^2 = \frac{N}{N-1} \hat \sigma^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i- \hat\mu)^2
$$
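
The factor $\frac{N-1}{N}$ can be observed with a quick Monte Carlo experiment. The sketch below repeats the estimation over many synthetic samples (sample size, variance and number of trials are arbitrary assumptions) and averages both estimators. `ddof=0` and `ddof=1` in NumPy select the divisors $N$ and $N-1$ respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, trials = 5, 4.0, 200_000  # illustrative values (assumption)
samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(trials, N))

var_mle = samples.var(axis=1, ddof=0)       # divides by N (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)  # divides by N - 1

mean_mle = var_mle.mean()
mean_unbiased = var_unbiased.mean()

print(f"average MLE variance:      {mean_mle:.3f}  (theory: {(N - 1) / N * sigma2:.3f})")
print(f"average unbiased variance: {mean_unbiased:.3f}  (theory: {sigma2:.3f})")
```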

### MLE of a Gaussian mixture

Let's imagine that our *iid* data come from a mixture of Gaussians with K components

$$
f(x_i|\pi,\mu,\sigma^2) = \sum_{k=1}^K \pi_k \mathcal{N}(x_i|\mu_k, \sigma_k^2),
$$
where $\sum_{k=1}^K \pi_k = 1$ and $\pi_k \in [0, 1] ~~ \forall k$

We can write the log likelihood as

$$
\log L(\pi,\mu,\sigma^2) = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}(x_i|\mu_k, \sigma_k^2)
$$

- Oh my! We cannot obtain analytical expressions for the parameters as before
- We have to resort to iterative methods/optimizers, Can you name any?
- **Expectation Maximization** (we will see this in a future class)
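
Until then, one possible route is a general-purpose optimizer. The sketch below is not Expectation Maximization: it minimizes the mean negative log likelihood of a $K=2$ mixture directly with `scipy.optimize.minimize`. The synthetic data, the starting point and the unconstrained parametrization (a logit for the weight, logarithms for the standard deviations) are assumptions made for this illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data from two Gaussian components (assumption)
x = np.concatenate([rng.normal(-2.0, 0.5, size=300), rng.normal(3.0, 1.0, size=700)])

def unpack(theta):
    """Map unconstrained parameters to (pi, mu, sigma) for K=2 components."""
    w = 1.0 / (1.0 + np.exp(-theta[0]))  # weight of the first component, in (0, 1)
    pi = np.array([w, 1.0 - w])
    mu = theta[1:3]
    sigma = np.exp(theta[3:5])           # positive standard deviations
    return pi, mu, sigma

def mean_neg_log_likelihood(theta):
    pi, mu, sigma = unpack(theta)
    # log(pi_k) + log N(x_i | mu_k, sigma_k^2), array of shape (N, K)
    log_terms = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    # logsumexp over k evaluates the log of the mixture density stably
    return -np.mean(logsumexp(log_terms, axis=1))

theta0 = np.array([0.0, -1.0, 1.0, 0.0, 0.0])  # rough initial guess (assumption)
result = minimize(mean_neg_log_likelihood, theta0, method="L-BFGS-B")
pi_hat, mu_hat, sigma_hat = unpack(result.x)

print("weights:", np.round(pi_hat, 3))
print("means:  ", np.round(mu_hat, 3))
print("sigmas: ", np.round(sigma_hat, 3))
```

The mixture likelihood has several local maxima, so the outcome of this kind of search depends on the starting point.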


## Optimality properties and uncertainty of MLEs 

Assuming that the data truly comes from the specified model the MLE is

- **Consistent:** The estimate converges to the true parameter as the number of data points increases

$$
\lim_{N\to \infty} \hat \theta = \theta
$$

- **Asymptotically normal:** The distribution of the estimate approaches a normal centered at the true parameter. 

$$
\lim_{N\to \infty} p(\hat \theta) = \mathcal{N}(\hat \theta | \theta, \sigma_\theta^2)
$$

> This is a consequence of the central limit theorem!

For *i.i.d.* $\{X_i\}, i=1,\ldots,N$ with finite mean $\mathbb{E}[X]$ and finite variance $\text{Var}[X] = \sigma^2$, the sample mean $\bar X$ satisfies

$$
\sqrt{N} (\bar X - \mathbb{E}[X]) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{as } N\to\infty
$$

> **Consequence:** Because MLEs are asymptotically normal, the log-likelihood ratio asymptotically follows a *chi-square* distribution (more on this later)

- **Minimum variance:** Asymptotically, the estimate achieves the theoretical minimum variance given by the Cramer-Rao bound
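
Asymptotic normality can be illustrated with simulated data. The sketch below assumes exponentially distributed samples with unit scale, for which the MLE of the mean is the sample average and $\mathbb{E}[X] = \text{Var}[X] = 1$. The sample size and number of repetitions are arbitrary choices. Although the data are clearly non-Gaussian, the rescaled error $\sqrt{N}(\bar X - \mathbb{E}[X])$ ends up with mean close to 0 and standard deviation close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 200, 5000  # illustrative values (assumption)

# Exponential data with unit scale: E[X] = 1 and Var[X] = 1
samples = rng.exponential(scale=1.0, size=(trials, N))

# One MLE (sample average) per repetition, centered and rescaled by sqrt(N)
z = np.sqrt(N) * (samples.mean(axis=1) - 1.0)

z_mean, z_std = z.mean(), z.std()
print(f"mean of sqrt(N)(mean - 1): {z_mean:.3f}  (expected near 0)")
print(f"std  of sqrt(N)(mean - 1): {z_std:.3f}  (expected near 1)")
```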


### Cramer-Rao lower bound

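It is the inverse of the Fisher information, *i.e.* of the expected value of the second derivative of $- \log L$ with respect to $\theta$. In practice the expectation is approximated by the curvature observed at the MLE, and for several parameters the inverse is a matrix inverse

$$
\sigma_{nm}^2 =  \left[ \left (- \frac{\partial^2 \log L (\theta)}{\partial\theta_n \partial\theta_m} \bigg\rvert_{\theta = \hat\theta}\right)^{-1} \right]_{nm}
$$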

Note that

- $\sigma_{nm}^2$ is the minimum variance achieved by an unbiased estimator.
- $\sigma_{nn}^2$ gives the marginal error bars 
- If $\sigma_{nm} \neq 0 ~~ n\neq m$, then errors are correlated, *i.e.* some combinations of parameters might be better determined than others


#### Example: Cramer-Rao bound for the MLE of $\mu$

Considering a Gaussian likelihood

$$
\log L (\mu, \sigma^2) =  - \frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^{2}}   \sum_{i=1}^N (x_i-\mu)^2 
$$

What is the uncertainty of the MLE 

$$
\hat \mu = \frac{1}{N} \sum_{i=1}^N x_i
$$

In this case the Cramer-Rao bound is

$$
\begin{align}
\sigma_{\hat\mu}^2  &= \left (- \frac{d^2 \log L(\mu, \sigma^2)}{d\mu^2} \bigg\rvert_{\mu=\hat\mu}\right)^{-1}  \nonumber \\
&=  \left (- \frac{1}{\sigma^2} \frac{d}{d\mu}  \sum_{i=1}^N (x_i-\mu) \bigg\rvert_{\mu=\hat\mu}\right)^{-1}  \nonumber \\
&=  \left ( \frac{N}{\sigma^2}  \bigg\rvert_{\mu=\hat\mu}\right)^{-1} = \frac{\sigma^2}{N}  \nonumber
\end{align}
$$

- This is known as the **standard error of the mean**
- Also, $p(\hat \mu) \to \mathcal{N}(\hat \mu| \mu, \sigma^2/N)$
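
The same quantity can be estimated numerically when no closed form is available. The sketch below assumes a synthetic Gaussian sample with known $\sigma$ and approximates the second derivative of $\log L$ at $\hat\mu$ with a central finite difference. The step size `h` and the data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, mu_true, sigma = 1000, 2.0, 1.5  # illustrative values (assumption)
x = mu_true + sigma * rng.standard_normal(N)

def log_likelihood(mu):
    """Gaussian log likelihood as a function of mu, with sigma known."""
    return -0.5 * N * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum((x - mu) ** 2) / sigma**2

mu_hat = x.mean()
h = 1e-2
# Central finite difference for the second derivative of log L at mu_hat
second_derivative = (
    log_likelihood(mu_hat + h) - 2.0 * log_likelihood(mu_hat) + log_likelihood(mu_hat - h)
) / h**2

var_numeric = -1.0 / second_derivative  # inverse of the negative curvature
var_analytic = sigma**2 / N

print(f"numerical variance of mu_hat: {var_numeric:.6e}")
print(f"sigma^2 / N:                  {var_analytic:.6e}")
```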

In [None]:
fig, ax = plt.subplots(figsize=(7, 3.5), tight_layout=True)
ax.set_title('Mean of Gaussian distributed data')
N, mu_real, s_real = int(1e+4), 2.23142, 1.124123
np.random.seed(0)
x = mu_real + s_real*np.random.randn(N)
mu_estimator = np.array([np.mean(x[:i]) for i in range(1, N)])

ax.plot([1, N], [mu_real, mu_real], 'k--', label='Real')
ax.plot(range(1,N), mu_estimator, label='MLE');
ax.fill_between(np.arange(1, N), 
                mu_estimator-s_real/np.sqrt(np.arange(1, N)), 
                mu_estimator+s_real/np.sqrt(np.arange(1, N)), alpha=0.5);
ax.set_xscale('log'); 
ax.set_xlabel('Number of samples'); 
plt.legend();

## Comparing models using the likelihood

- We can compare estimated parameters using the likelihood
- We can formulate a hypothesis test for the MLE's using the asymptotic distribution

Some of these tests are based on the $\chi^2$ distribution with $k$ degrees of freedom

The mean of the $\chi^2$ is $k$ and the variance is $2k$

In [None]:
from scipy.stats import chi2

fig, ax = plt.subplots(figsize=(7, 4))

def update(dof):
    ax.cla()
    ax.set_title('Mean: dof, Variance: 2dof')
    x = np.linspace(chi2.ppf(0.01, dof), chi2.ppf(0.99, dof), 100)
    px = chi2.pdf(x, dof)
    ax.plot(x, px, linewidth=2)
    ax.set_xlim([np.amin(x)*0.9, np.amax(x)*1.1])
    ax.set_ylim([np.amin(px)*0.9, np.amax(px)*1.1])  
    display("95% quantile at:", chi2.ppf(0.95, dof))
    
    
interact(update, dof=SelectionSlider_nice(options=[1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]));

###  Hypothesis testing: Wald-test

Suppose we wish to test

$$
\mathcal{H}_0: \theta = \theta_0
$$
$$
\mathcal{H}_A: \theta \neq \theta_0
$$

Under the null we can write 

$$
\text{Wald} = \frac{(\hat \theta - \theta_0)^2}{\left (- \frac{d^2 \log L (\theta)}{d\theta^2} \bigg\rvert_{\theta = \hat\theta}\right)^{-1}} = \frac{(\hat \theta - \theta_0)^2}{\sigma_{\hat \theta}^2} \to \chi^2_1
$$

If $\text{Wald}$ is greater than the $(1-\alpha)100\%$ quantile of $\chi^2_1$ we reject the null
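
A minimal sketch of this test for the mean of Gaussian data with known $\sigma$ is given below. The synthetic sample (generated with true mean 1, so the null $\theta_0 = 0$ is false), the significance level and the variable names are assumptions. The variance of the estimate is $\sigma^2/N$, the inverse curvature of $-\log L$ for this model.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N, sigma, theta_0, alpha = 100, 1.0, 0.0, 0.05  # illustrative values (assumption)
x = 1.0 + sigma * rng.standard_normal(N)        # true mean is 1, so the null is false

theta_hat = x.mean()           # MLE of the mean
var_theta_hat = sigma**2 / N   # inverse curvature of -log L for this model

wald = (theta_hat - theta_0) ** 2 / var_theta_hat
threshold = chi2.ppf(1 - alpha, df=1)  # (1 - alpha) quantile of chi-square with 1 dof
p_value = chi2.sf(wald, df=1)
reject = wald > threshold

print(f"Wald statistic: {wald:.2f}, threshold: {threshold:.3f}, p-value: {p_value:.2e}")
print("reject the null" if reject else "fail to reject the null")
```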

### Hypothesis testing: Log-likelihood ratio test or Wilks test

Suppose we wish to test

$$
\mathcal{H}_0: \theta = \theta_0
$$
$$
\mathcal{H}_A: \theta =\theta_1
$$

We can write a ratio between likelihoods

$$
\lambda(\mathcal{D}) = \frac{L(\theta_0|\mathcal{D})}{L(\theta_1|\mathcal{D})} 
$$

Asymptotically, under the null and with $\theta_1$ taken as the MLE $\hat \theta$ of a single free parameter (Wilks' theorem), we have  
$$
-2 \log \lambda(\mathcal{D}) \to \chi^2_1
$$

If $-2 \log \lambda(\mathcal{D})$ is greater than the $(1-\alpha)100\%$ quantile of $\chi^2_1$ we reject the null
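
The ratio test can be sketched for the same kind of problem: Gaussian data with known $\sigma$, testing $\theta_0 = 0$. In this sketch $\theta_1$ is taken to be the MLE, which is an assumption of the example, as are the synthetic data and the significance level. For this particular model the statistic reduces to $N(\hat\theta - \theta_0)^2/\sigma^2$, which the last line prints for comparison.

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(0)
N, sigma, theta_0, alpha = 100, 1.0, 0.0, 0.05  # illustrative values (assumption)
x = 1.0 + sigma * rng.standard_normal(N)        # true mean is 1, so the null is false

def log_likelihood(theta):
    """Gaussian log likelihood of the sample for a given mean, sigma known."""
    return np.sum(norm.logpdf(x, loc=theta, scale=sigma))

theta_1 = x.mean()  # alternative evaluated at the MLE (assumption of this sketch)

# -2 log lambda, with lambda = L(theta_0) / L(theta_1)
lrt = -2.0 * (log_likelihood(theta_0) - log_likelihood(theta_1))
p_value = chi2.sf(lrt, df=1)
reject = lrt > chi2.ppf(1 - alpha, df=1)

print(f"-2 log lambda: {lrt:.2f}, p-value: {p_value:.2e}, reject: {reject}")
print(f"N (theta_hat - theta_0)^2 / sigma^2: {N * (theta_1 - theta_0) ** 2 / sigma**2:.2f}")
```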

## Model comparison using other criteria

- How to compare models with different number of parameters? 
- In general, the more parameters a model has, the better it fits the training data (overfitting)
- How to score models taking into account their complexity?

Two options:
1. Cross-validation and bias/variance trade-off (based on finite data)
1. Akaike information criterion (AIC) (based on asymptotic approximation)

For a model with $k$ parameters and $N$ data points the AIC, including the small-sample correction term (often called AICc), is 

$$
\text{AIC} = -2 \log L(\hat \theta) + 2k + \frac{2k(k+1)}{N-k-1},
$$

which one seeks to minimize
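
As a hedged illustration, the sketch below scores two Gaussian models on the same synthetic sample using the formula above. The data (generated with $\sigma = 2$), the helper name `aic_corrected` and the two candidate models are assumptions of this example. Model A fixes $\sigma = 1$ and fits only the mean, while Model B fits both the mean and $\sigma$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 200
x = rng.normal(loc=0.0, scale=2.0, size=N)  # synthetic data with sigma = 2 (assumption)

def aic_corrected(log_likelihood, k, n):
    """AIC with the small-sample correction term, following the formula above."""
    return -2.0 * log_likelihood + 2 * k + 2 * k * (k + 1) / (n - k - 1)

mu_hat = x.mean()

# Model A: Gaussian with free mean and sigma fixed to 1 (k = 1)
logL_a = np.sum(norm.logpdf(x, loc=mu_hat, scale=1.0))
# Model B: Gaussian with free mean and free sigma (k = 2), both set to their MLEs
logL_b = np.sum(norm.logpdf(x, loc=mu_hat, scale=x.std()))

aic_a = aic_corrected(logL_a, k=1, n=N)
aic_b = aic_corrected(logL_b, k=2, n=N)

print(f"Model A (k=1): AIC = {aic_a:.1f}")
print(f"Model B (k=2): AIC = {aic_b:.1f}")
print("preferred model:", "A" if aic_a < aic_b else "B")
```

Here the extra parameter of Model B is worth its penalty because the fixed $\sigma = 1$ of Model A is far from the value used to generate the data.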

***

**Parsimony principle** (aka Occam's Razor): Choose the simplest scientific explanation (few parameters) that fits the evidence (high likelihood)

Also related: KISS principle

***

## Homework 1: MLE for a Bernoulli distribution

A magician friend of yours has bought a "magic coin". 

He asks you to obtain for him the probability of obtaining a head with such coin.

The coin has two outputs (head/tail) so we can assume that it follows a Bernoulli distribution

$$
f(x|p) = p^x (1-p)^{1-x}, ~~ x \in \{0, 1\}
$$

Your friend tosses the coin N times and records the outputs $\{x_i\}$

- 1) Find an analytic expression for $\log L(p)$ 
- 2) Find an analytic expression for $\hat p$ from the first derivative of $\log L(p)$. Find $\mathbb{E}[\hat p]$. Is it a biased or unbiased estimator?
- 3) Find the Fisher information for $\hat p$ and its variance $\sigma_{pp}^2$
- 4) Now consider 
    
        coins = scipy.stats.binom.rvs(n=1, p=0.75, random_state=1234, size=100)
    
    as the $100$ coin tosses and plot $\log L(p)$ as a function of $p$, highlight the value of $\hat p$

- 5) Consider that the tosses were obtained serially. Plot $\hat p$ versus $N$. Use shading to show the uncertainty of the estimator as a function of $N$. How large does $N$ need to be so that $|\hat p - p|/p < 0.05$?
