# Lecture #8: Metropolis-Hastings and Gibbs
## AM 207: Advanced Scientific Computing
### Stochastic Methods for Data Analysis, Inference and Optimization
### Fall, 2021

<img src="fig/logos.jpg" style="height:150px;">

### Lecture #8 Summary

#### Grades!

1. **For those concerned about your final course grade.** Pathways to an `A`:
  - low `A` on the final project, two 7's, two 8's, two 9's and three 10's
  - low `A` on the final project, two 8's and seven 9's
  - mid `A` on the final project, four 8's and five 9's
  - high `A` on the final project, two 7's, two 8's and five 9's
  - *etc...*<br><br>
  
2. **How are we doing?**
  - the average on HW#0 is 9.44 (matches historical performance)
  - the average on HW#1 is 9.64 (matches historical performance)<br><br>
  
3. **For those concerned about the grades on individual assignments.** 
  - TFs grade using a single rubric to reduce variance
  - Rubric is formative: points are not assigned to individual parts (this can be punitive), points indicate ladders (from "mostly correct with minor errors" to "major misunderstandings -- student should look again at the materials")
  - Rubric prevents too many points from being deducted (e.g. a -0.25 is not TF trying to nitpick "there is a slight error", it's actually indicating a significant error "there is a major error but student clearly knows what's going on")
  - TFs instructed to focus on providing lots of comments
  - We try to write the questions and comments in a way that is very clear, but obviously sometimes you will find them confusing. But **there are no trick questions** (we are not trying to trick you into losing points) <br><br>
  
4. **How do I write the broader impact statement????!!!!**
  - We are looking for **engagement** and **concrete connections between theory and practice**.
    - Non-example: "I wouldn't recommend using this model because it's not accurate. When the model isn't accurate it might harm patients". 
      - How can the model harm patients? Who is using this model, what decisions are they making based on the model, and what is the relationship between the user and the patients (what if the patients are empowered to reject the model decisions?)?
      - For whom is the model not accurate? For whom is the model accurate? Should we disregard a model completely because it doesn't work for some people? Is it fair to deploy a model that benefits only some people?
      - A model can be useful for things OTHER THAN prediction. Did you consider other possible advantages of this model? Is this model interpretable? Easy to implement and learn? Cheap to deploy?
  - If you're not sure what the broader impact is (oftentimes it won't be super clear), come chat with us at any time. Also discuss with each other on Piazza.<br><br>
  

---

#### How Do I Know That My Sampler is an MCMC Sampler for the target $\pi^{\text{target}}$?
1. Check that my chain is ***aperiodic***.
  - **Finite State Space:** check that the period of each state is 1, where the period of a state is the `gcd` of the lengths of all paths that start and end at that state.
  - **Continuous State Space:** The period is the largest $n$ for which there is a family of $n$ pairwise disjoint measurable sets $A_1, \ldots, A_n$ with $T(x, A_{j+1}) = 1$ for all $x\in A_j$ (for $j < n$) and $T(x, A_1) = 1$ for all $x\in A_n$, where $T(x,\cdot)$ is the measure induced by the transition kernel $T$. The chain is aperiodic if this period is 1.<br><br>
    It's **sufficient** that the proposal distribution is positive everywhere in the sample space.<br><br>
  
2. Check that my chain is ***irreducible***.
  - **Finite State Space:** check that every state can reach every other state, i.e. for each pair of states $i, j$ there is some $n$ with $(T^n)_{ij} > 0$. It's sufficient that for some $n$ the transition matrix $T^n$ has all positive entries (this also implies aperiodicity).
  - **Continuous State Space:** a kernel is ***Harris positive recurrent*** if it is $\varphi$-irreducible (i.e. for all $x$ and all $A$ with $\varphi(A)>0$ there is some $n$ such that $T^n(x, A) > 0$) with stationary distribution $\pi$, and for all $A$ with $\pi(A)>0$ and all $x$ we have $Pr(\tau_A <\infty|X_0 =x)=1$, where $\tau_A = \inf\{n\geq 1: X_n \in A\}.$
  
    It's **sufficient** that the proposal distribution is positive everywhere in the sample space.<br><br>
    
3. By the Fundamental Theorem, we get that there exists a unique stationary distribution $\pi^{\text{sta}}$ that is the limiting distribution $\pi^{\infty}$:
$$
\pi^{\text{sta}} = \pi^{\infty}
$$

4. We check that the sampler satisfies ***detailed balance*** with respect to $\pi^{\text{target}}$.
5. Detailed balance implies that $\pi^{\text{target}}$ is a stationary distribution. Since the stationary distribution of our sampler is unique, $\pi^{\text{target}}$ is also the limiting distribution:
$$
\pi^{\text{target}} = \pi^{\text{sta}} = \pi^{\infty}
$$
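
For a finite state space, the checks above can be carried out numerically. The following is a minimal sketch, assuming a toy four-state target and a Metropolis chain with a uniform proposal over all states; the target values and the helper name `metropolis_transition_matrix` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def metropolis_transition_matrix(pi_target):
    """Transition matrix of a Metropolis chain with a uniform proposal
    over all states (hypothetical toy construction)."""
    k = len(pi_target)
    T = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                # propose j with probability 1/k, accept with min(1, pi_j / pi_i)
                T[i, j] = (1.0 / k) * min(1.0, pi_target[j] / pi_target[i])
        # rejected proposals (and proposing i itself) leave the chain at i
        T[i, i] = 1.0 - T[i].sum()
    return T

# Assumed toy target distribution on 4 states.
pi_target = np.array([0.1, 0.2, 0.3, 0.4])
T = metropolis_transition_matrix(pi_target)

# Steps 1-2: some power of T has all positive entries (here already T itself),
# which is sufficient for irreducibility and aperiodicity.
all_positive = bool(np.all(T > 0))

# Step 4: detailed balance, pi_i * T_ij == pi_j * T_ji for all i, j.
flow = pi_target[:, None] * T
detailed_balance = bool(np.allclose(flow, flow.T))

# Step 5: pi_target is stationary (pi T = pi) ...
stationary = bool(np.allclose(pi_target @ T, pi_target))

# ... and limiting: every row of T^n approaches pi_target for large n.
T_limit = np.linalg.matrix_power(T, 200)
limiting = bool(np.allclose(T_limit, np.tile(pi_target, (len(pi_target), 1))))

print(all_positive, detailed_balance, stationary, limiting)
```

In a continuous state space these matrix checks are not available, which is why the sufficient condition on the proposal distribution is useful in practice.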

---

#### Samplers with Asymptotic Guarantees Can Be Lemons!

1. **Burn-in:** The first chunk of samples from an MCMC sampler will be worthless, since the chain has not yet reached $\pi^{\text{target}}$. We need to discard these samples.
2. **Mixing:** How do we know when the sampler is finally sampling from $\pi^{\text{target}}$?
  - Look at the trace plots
    <img src="fig/mixing.jpg" style="height:150px;">
  - Trace plots can be misleading!
    <img src="fig/trace.pdf" style="height:150px;">
  - Look at the correlation plots
    <img src="fig/autocorr.jpg" style="height:350px;">
3. **Remember that all your metrics are flawed!**
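
The correlation plots mentioned above are built from the sample autocorrelation of the chain. Below is a minimal sketch of that computation. The two chains are synthetic stand-ins (independent draws for a well-mixed chain, an AR(1) process for a slowly mixing one) and are assumptions made for illustration, not output from a real sampler.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[:n - lag], x[lag:]) / (n * var)
                     for lag in range(max_lag + 1)])

rng = np.random.default_rng(0)
n = 5000

# Stand-in for a well-mixed chain: independent standard normal draws.
iid_chain = rng.normal(size=n)

# Stand-in for a slowly mixing chain: AR(1) with strong dependence.
sticky_chain = np.zeros(n)
for t in range(1, n):
    sticky_chain[t] = 0.95 * sticky_chain[t - 1] + rng.normal()

acf_iid = autocorrelation(iid_chain, max_lag=50)
acf_sticky = autocorrelation(sticky_chain, max_lag=50)

# Autocorrelation that decays quickly toward 0 suggests good mixing;
# slow decay means successive samples carry little new information.
print(acf_iid[:3], acf_sticky[:3])
```

Plotting `acf_iid` and `acf_sticky` against the lag gives the kind of correlation plot shown in the figure.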

#### More Rigorous Checks for Convergence

Look for:
1. Large segments of the ***chain*** (sequence of samples) should give similar statistics (mean, variance etc)
2. Low correlations between states of the chain
3. "Reasonably high" acceptance rate of proposed steps
4. Multiple chains initialized from different initial points give similar results
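
One common way to quantify whether multiple chains "give similar results" is the Gelman-Rubin statistic, which compares between-chain and within-chain variance. The notes do not name a specific diagnostic, so the sketch below is an illustrative choice, and the chains are synthetic draws rather than real sampler output.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for an array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)

# Four synthetic chains that all explore the same distribution.
agree = rng.normal(0.0, 1.0, size=(4, 2000))

# Same chains, but two of them are stuck around a different location.
disagree = agree + np.array([[0.0], [0.0], [3.0], [3.0]])

r_agree = gelman_rubin(agree)
r_disagree = gelman_rubin(disagree)

# Values close to 1 are consistent with convergence; values well above 1
# indicate the chains disagree. A value near 1 is not a proof of convergence.
print(r_agree, r_disagree)
```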

Best practices:
1. Always run multiple chains initialized from very different random starting points
2. Always run your chains for as long as you can, then discard the burn-in and thin
3. Always check all relevant convergence diagnostics
4. Never be too certain: **remember that there is no "proof" of convergence for finite chains!**
5. Keep reading about best practice!
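
The first two practices can be sketched as follows: run a random-walk Metropolis sampler from several widely separated starting points, then burn and thin each chain. This is a minimal sketch under stated assumptions: the target (a standard normal), the step size, the burn-in length and the thinning interval are arbitrary illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_samples, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    samples = np.empty(n_samples)
    x, log_p = x0, log_target(x0)
    n_accepted = 0
    for t in range(n_samples):
        proposal = x + step_size * rng.normal()
        log_p_prop = log_target(proposal)
        # accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            n_accepted += 1
        samples[t] = x
    return samples, n_accepted / n_samples

def log_std_normal(x):
    # Unnormalized log density of N(0, 1), an assumed toy target.
    return -0.5 * x ** 2

rng = np.random.default_rng(0)
starts = [-10.0, 0.0, 10.0]   # very different starting points
burn_in, thin = 1000, 10      # illustrative values, not recommendations

kept, rates = [], []
for x0 in starts:
    samples, rate = random_walk_metropolis(log_std_normal, x0, 20000, 2.5, rng)
    kept.append(samples[burn_in::thin])   # discard burn-in, keep every 10th sample
    rates.append(rate)

kept = np.array(kept)
chain_means = kept.mean(axis=1)
chain_stds = kept.std(axis=1)

# Chains started far apart should report similar statistics and
# a "reasonably high" acceptance rate.
print(chain_means, chain_stds, rates)
```

Agreement between the chains is only evidence of convergence, not a proof, so the remaining diagnostics still apply.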

---

#### What Are the Broader Impacts of MCMC Theory?

1. How much technical know-how does it take to understand how MCMC samplers work?
  - Do ML researchers need to understand how MCMC samplers work? How much do they need to understand?
  - Do data scientists need to understand how MCMC samplers work? How much do they need to understand?
  - Do domain experts (like clinicians) need to understand how MCMC samplers work? How much do they need to understand?
  - How would each group of people know when the sampler has failed? <br><br>
  
2. How much do you **think** you need to understand in order to comfortably use MCMC samplers? Does your perceived technical overhead of understanding MCMC discourage you from using this method? Innovating in this field?<br><br>

3. What kinds of power hierarchies are created by us using fancy models with fancy theory -- i.e. how much power do affected communities have in changing the technology or detecting/attributing fault? How can we equalize this distribution of power?