In [None]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 1920,
        'height': 1080,
        'scroll': True,
})
%config InlineBackend.figure_format = 'retina'

# Week 07, ASTR 596: Fundamentals of Data Science


## Posterior Predictive Checks, and connecting the Bayesian and Frequentist Worlds
#### (not on your midterm)

### Gautham Narayan 
##### <gsn@illinois.edu>

# <center> How is the Midterm going? </center>


## <center> What questions do you have? </center>

## Recap: 

<table>
    <tr>
        <td><img src="review.png" width=100%></td>
    </tr>
</table>

## <center> What do you feel least comfortable with? (Remember the stats dept. offers semester long courses on any of these topics!) </center>

## Recap:

We've talked about:
* Rejection sampling
* **Metropolis-Hastings**
    * Random walks are robust but inefficient - suppress random walk behavior to improve efficiency at the cost of complexity (interpretability) and applicability 
* The behavior of MCMC (burn-in, autocorrelation, impact of starting position, whether the chain is stationary...)
* How to tell if samples were useful (mixing chains, G-R statistic, number of effective samples, **looking at your data**) 
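
As a reminder of what the G-R statistic measures, here is a minimal sketch of the basic (non-split) Gelman-Rubin $\hat{R}$ for a single parameter. The function name `gelman_rubin` and the fake "chains" are illustrative assumptions, not output from any sampler we ran in class.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor for chains of shape (m, n)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
# Four fake chains that all sample the same distribution (well mixed)
mixed = rng.normal(0.0, 1.0, size=(4, 2000))
# The same chains, but two of them are offset (e.g. stuck in another mode)
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 suggest the chains agree with each other; values well above 1 mean the between-chain scatter is large compared to the within-chain scatter, so keep sampling (and keep looking at your chains).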

## Recap:

More useful MCMC tools:
* **Affine-invariant MC** (emcee) - works great as long as posterior is "nice" after affine transformation
    * counter-examples: Rosenbrock function, eggbox 
* **Parallel-tempering** (now, ptemcee) - adds chains at multiple temperatures (we care about T=1) 
    * connection to simulated annealing
    * computationally more intensive, even with a low number of dimensions
* **Gibbs Sampling** (special choice of prior, but if applicable to your problem, acceptance fraction = 1.)
    * connection to probability integral transform 
    * not very general, but most common way to deal with high dimensional spaces 
* Useful, if often overlooked - **you don't have to update all of the parameters of your model the same way**
    * i.e. you can look up something like **Metropolis within Gibbs**

## Recap:    
    
* All of these are specializations of MH (different starting positions, different numbers of chains, different proposal distributions) 
    * **None require evaluation of derivatives** - which is the general situation you'll be in with astrophysics
    * There are packages that will do automatic differentiation for you (e.g. `pymc3` which uses `theano`, `tensorflow`), but these can be "fragile" with real data
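
To make the "no derivatives" point concrete, here is a minimal sketch of a random-walk Metropolis sampler. It only ever evaluates the log-probability, never a gradient. The target (`log_target`, a standard normal), the step size, and the burn-in length are all illustrative assumptions.

```python
import numpy as np

def log_target(theta):
    # Unnormalized log-density of a standard normal (illustrative target only)
    return -0.5 * theta**2

def metropolis(log_prob, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    chain = np.empty(n_steps)
    theta = theta0
    logp = log_prob(theta)
    n_accept = 0
    for i in range(n_steps):
        proposal = theta + step_size * rng.normal()
        logp_prop = log_prob(proposal)
        # Symmetric proposal: the acceptance ratio is just the ratio of target densities
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
            n_accept += 1
        chain[i] = theta
    return chain, n_accept / n_steps

rng = np.random.default_rng(42)
# Deliberately start far from the mode to show why burn-in is discarded
chain, acc = metropolis(log_target, theta0=5.0, n_steps=20000, step_size=2.5, rng=rng)
burned = chain[2000:]
print(acc, burned.mean(), burned.std())
```

Swapping the proposal, the number of chains, or the starting positions gives you the other samplers in this recap; the accept/reject step stays the same.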

## Big Picture: Frequentist Statistics

* **Frequentists** make statements about the data (or statistics or estimators, i.e. functions of the data), conditional on the parameter: 

# $$p(D|\theta) \mathrm{\;or\;} p(f(D) |\theta)$$

* The goal is to get a “point estimate” or confidence intervals with good properties/coverage under “long-run” repeated experiments in the magic wonderland of Asymptopia.
    * Confidence intervals - arguments are based on datasets that could have happened but didn't.

## Big Picture: Bayesian Statistics

* Bayesians make statements about the probability of parameters conditional on the dataset $D$ that you actually observed

# $$p(\theta|D)$$

* This requires an interpretation of probability as quantifying the degree of belief in a hypothesis. This exists even without any data - i.e. **the prior**
    * Credible regions - arguments are based on variables you wish you observed but didn't (nuisance parameters/latent variables) 
    * This is not an equivalent shortcoming to the frequentist approach - it actually matches how reality works

## Big Picture: MCMC

* The Bayesian answer is the full posterior density, quantifying the "state of knowledge" after seeing the data
    * The likelihood is not a probability density in the parameters.
        * But multiply by a prior (even flat) and the posterior is a probability that obeys clear rules:
            * Conditional/Marginal probability
* **Numerical estimates (such as samples using Monte Carlo methods) are attempts to (imperfectly) summarize the posterior**
    * These techniques give us ways to deal with high-dimensional spaces e.g. many latent variables
        * Convert messy integrals to simple sums over samples
            * **All those frequentist statistics/estimators are still useful given MCMC samples!**
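
A minimal sketch of "integrals become sums", assuming you already have posterior samples in an array of shape `(n_samples, n_params)`. Here the samples are faked with a correlated Gaussian as a stand-in for real MCMC output, and the parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for MCMC output: columns are [slope, nuisance parameter]
mean = np.array([2.0, -1.0])
cov = np.array([[0.04, 0.03],
                [0.03, 0.09]])
samples = rng.multivariate_normal(mean, cov, size=50000)

# Marginalizing over the nuisance parameter = ignoring its column
slope = samples[:, 0]

# Posterior expectation: an integral replaced by an average over samples
slope_mean = slope.mean()

# Central 68% credible interval from sample percentiles
lo, hi = np.percentile(slope, [16, 84])

# Posterior probability of a statement: the fraction of samples where it holds
prob_steep = np.mean(slope > 2.2)

print(slope_mean, lo, hi, prob_steep)
```

Means, medians, percentiles and variances are the same estimators frequentists use; here they are simply applied to samples of the parameters rather than to the data.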

    
## Advice: MCMC

* MCMC is a terrible optimizer. If you just want the "best-fit", some local/global optimizer is often quicker.
    * These are often useful for reasonable starting guesses
    * Bayesians will often need to introduce latent variables/nuisance parameters/things you don't observe but wish you did 
        * YOU SHOULD THINK ABOUT THIS FOR Q3 ON THE MIDTERM!!! 
            * Coming up with the likelihood is not the same as writing down the model
    * These parameters make the problem very high dimensional, even if you don't care about them
    * The point of it is to sample the full posterior distribution so you have reasonable **credible regions**
        * You are scientists and this is what you actually want
            * **You write down the model and likelihood, and the Bayesian framework tells you what distribution of parameters is feasible given your data and your prior belief**

# For Fun: MCMC sampler visualizations for different functions, without an annoying soundtrack:

# <center>[http://chi-feng.github.io/mcmc-demo/](http://chi-feng.github.io/mcmc-demo/)</center>

(There are only so many times you can listen to the Harlem shake - Hungarian dances for sorting algorithms are more fun)

## Posterior-predictive checks

* Nothing about the Bayesian framework we've discussed tells us if our model is right
    * MCMC can give us very precise, but very wrong inferences, if the model itself is inadequate
     
     
Let's look at the posterior again
## $$ P(\theta|D) \mathrm{\; is \; really \;} P(\theta|D, H) $$ 
i.e. assuming the hypothesis $H$ is itself correct.

## Posterior-predictive checks

Frequentists have a way to express the question we're asking

## $$P(D|H)$$
i.e. "how likely is the data given the hypothesis", which is similar to but not exactly the same as  "how likely is the data given the model parameters of this hypothesis"

The two are related though!

## $$P(D|H) = \int_{\theta} P(D|\theta) \cdot P(\theta|H) \, d\theta$$

This is the **predictive distribution** - the distribution of imaginary datasets if the hypothesis/model is true.

* i.e. if you have some observations $y_D$, you can infer a model and then ask what you would expect to see in hypothetical replications of the same experiment.

* if the model is right, you expect to see something similar to what you did the first time
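
One note on notation. Strictly, the integral above averages the likelihood over the *prior* $P(\theta|H)$, so it is the prior predictive distribution. For a posterior predictive check, the same average is taken over the posterior instead. Writing $D^{\mathrm{rep}}$ for a replicated dataset (a symbol introduced here, not used elsewhere in these notes):

$$P(D^{\mathrm{rep}}|D, H) = \int_{\theta} P(D^{\mathrm{rep}}|\theta) \cdot P(\theta|D, H) \, d\theta$$

With MCMC samples $\theta_i \sim P(\theta|D, H)$, you never do this integral explicitly: draw one $D^{\mathrm{rep}}_i \sim P(D|\theta_i)$ per posterior sample, and the collection of $D^{\mathrm{rep}}_i$ is a set of draws from the posterior predictive distribution.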

## Posterior-predictive checks

* THIS IS A FREQUENTIST IDEA!
* The idea is to generate data from the model using parameters from draws from the posterior.
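
Here is a minimal numpy-only sketch of that recipe, under some loudly stated assumptions: the data are synthetic (NOT the HW2 data), the model is a straight line with known Gaussian errors and flat priors (so the posterior is exactly Gaussian and can be drawn from directly, no MCMC needed), and $\chi^2$ is used as the discrepancy statistic. All of these are choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: a line plus Gaussian noise, with one injected outlier
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 1.0)
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)
y[5] += 15.0

# Flat priors + Gaussian likelihood => Gaussian posterior on (intercept, slope)
A = np.column_stack([np.ones_like(x), x])
cov = np.linalg.inv(A.T @ (A / sigma[:, None]**2))
best = cov @ (A.T @ (y / sigma**2))

def chi2(data, theta):
    """Discrepancy between a dataset and the line with parameters theta."""
    return np.sum(((data - A @ theta) / sigma)**2)

n_draws = 2000
thetas = rng.multivariate_normal(best, cov, size=n_draws)  # posterior draws
t_obs = np.empty(n_draws)
t_rep = np.empty(n_draws)
for i, theta in enumerate(thetas):
    # One replicated dataset per posterior draw
    y_rep = A @ theta + rng.normal(0.0, sigma)
    t_obs[i] = chi2(y, theta)
    t_rep[i] = chi2(y_rep, theta)

# Posterior predictive p-value: how often is replicated data
# at least as discrepant as the data we actually observed?
p_value = np.mean(t_rep >= t_obs)
print(p_value)
```

A p-value very close to 0 or 1 says the observed data look unlike anything the fitted model generates. In this sketch the single outlier is enough to make the Gaussian model fail the check, even though the fit itself returns tight parameter constraints.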

## Big picture

* Bayes theory is needed to *estimate* parameters, conditional on observations and a model we are considering
* Frequentist theory is needed to *critique* a model conditioned on the data we observe, by exploring if the model actually is *likely* to generate data like our observations in the first place

## In Class Exercise:

We'll use our data from HW2 to try posterior predictive checks on a simple linear model.

In [None]:
# RUN THIS

%matplotlib inline
import numpy as np
import scipy.stats as st
from matplotlib import pyplot as plt
import scipy.optimize
from astroML.datasets import fetch_hogg2010test
import pymc3 as pm

In [None]:
# AND THIS

# Get data: this includes outliers
data = fetch_hogg2010test()
x = data['x']
y = data['y']
dy = data['sigma_y']

# convert the data to numpy arrays
x  = np.array(x)
y  = np.array(y)
dy = np.array(dy)

# Next, we're going to use `pymc3` to fit this model, but we're going to be naive and use a Gaussian likelihood, despite knowing there are outliers.

In [None]:
with pm.Model() as model:
    # write down expressions for the priors on the slope and intercept of the line 
    
    # https://docs.pymc.io/api/distributions/continuous.html
    # they are all in the form of 
    # Distribution('variable name', parameters of distribution)
    # e.g. Normal('blah', mu=0, sigma=1)

    # write down your model

    
    # and write down your likelihood function
    # pymc3 distributions accept the keyword `observed` for the data

    
    # and then to sample!
    samples_ols = pm.sample(5000, cores=2) # draw 5000 posterior samples 

# Next make a traceplot of your samples

# If your traceplot looks ok, get some summary statistics from your trace

# Now let's see if our data look anything like the data generated from the model

In [None]:
# RUN THIS
ppc = pm.sample_posterior_predictive(samples_ols, samples=500, model=model)

# Plot the posterior predictive samples and the data, and 200 draws from the posterior

## Finally, change the likelihood to a Student-T likelihood, and add a prior on $\nu$, the number of degrees of freedom of the distribution. Then compare the posterior predictive samples from this model with your earlier OLS samples.
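
To see why the Student-T likelihood helps with outliers, here is a small sketch comparing how strongly each likelihood penalizes a large residual. The helper functions are hypothetical and written out by hand for illustration (they are not part of `pymc3`), and $\nu = 3$ is an arbitrary choice.

```python
import math
import numpy as np

def gaussian_logpdf(r, sigma):
    """Log-density of a Gaussian with scale sigma, evaluated at residual r."""
    return -0.5 * (r / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def student_t_logpdf(r, sigma, nu):
    """Log-density of a Student-T with scale sigma and nu degrees of freedom."""
    z = r / sigma
    norm = (math.lgamma(0.5 * (nu + 1)) - math.lgamma(0.5 * nu)
            - 0.5 * np.log(nu * np.pi) - np.log(sigma))
    return norm - 0.5 * (nu + 1) * np.log1p(z**2 / nu)

residuals = np.array([0.0, 1.0, 3.0, 10.0])  # in units of sigma
gauss = gaussian_logpdf(residuals, 1.0)
student = student_t_logpdf(residuals, 1.0, nu=3.0)

# Drop in log-likelihood going from a perfect fit to a 10-sigma residual
penalty_gauss = gauss[0] - gauss[-1]
penalty_t = student[0] - student[-1]
print(penalty_gauss, penalty_t)
```

The Gaussian penalty grows quadratically with the residual, so one outlier can drag the whole fit. The Student-T penalty grows only logarithmically, so the heavy tails absorb the outlier instead.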

## [PyMC3 implementation of the Outlier model we used with `emcee`](https://docs.pymc.io/notebooks/GLM-robust-with-outlier-detection.html)

(Feel free to use on your midterm if you prefer)