# Sampling
A sample from a distribution $p(x)$ is a single realization $x$ whose probability distribution is $p(x)$.

## Ancestral Sampling
Given a DAG and the ability to sample from each of its factors given its parents, we can sample from the joint distribution over all the nodes by ancestral sampling: start with the root nodes and, at each step, sample from any conditional distribution that has not been visited yet and whose parents have all been sampled.
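
The procedure above can be sketched on a toy network. The four binary nodes, their edges, and the probabilities below are hypothetical values chosen only for illustration; `ancestral_sample` is likewise an illustrative name, not a library function.

```python
import numpy as np

# Hypothetical toy DAG over binary variables:
# cloudy -> rain, cloudy -> sprinkler, (rain, sprinkler) -> wet.
# Each node maps to (parents, function returning P(node = 1 | parents)).
cpds = {
    "cloudy": ((), lambda: 0.5),
    "rain": (("cloudy",), lambda c: 0.8 if c else 0.2),
    "sprinkler": (("cloudy",), lambda c: 0.1 if c else 0.5),
    "wet": (("rain", "sprinkler"), lambda r, s: 0.99 if (r or s) else 0.01),
}

def ancestral_sample(cpds, rng):
    """Draw one joint sample, visiting only nodes whose parents are sampled."""
    sample = {}
    remaining = set(cpds)
    while remaining:
        # A node is ready once every one of its parents has a sampled value.
        ready = [n for n in remaining if all(p in sample for p in cpds[n][0])]
        if not ready:
            raise ValueError("graph has a cycle or an undefined parent")
        for node in sorted(ready):  # sorted only to make runs reproducible
            parents, prob_one = cpds[node]
            p = prob_one(*(sample[q] for q in parents))
            sample[node] = int(rng.random() < p)
            remaining.remove(node)
    return sample

rng = np.random.default_rng(0)
samples = [ancestral_sample(cpds, rng) for _ in range(5000)]
# Under the assumed tables, P(rain = 1) = 0.5 * 0.8 + 0.5 * 0.2 = 0.5.
rain_rate = np.mean([s["rain"] for s in samples])
```

Each returned dictionary is one draw from the joint distribution, so marginals such as `rain_rate` can be read off by averaging.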

## Simple Monte Carlo
Given independent samples $\{x^{(r)}\}^R \sim p(x)$, we can define the estimator $\hat E = R^{-1}\sum^R f(x^{(r)}) \approx E_{x\sim p}[f(x)]$. 

$\hat E$ is an unbiased estimator, since
\begin{align*}
E[\hat E] &= E\bigg[R^{-1} \sum^R f(x^{(r)})\bigg]\\
&= R^{-1}\sum^R E[f(x^{(r)})]\\
&= R^{-1}\sum^R E_{x\sim p}[f(x)]\\
&= E_{x\sim p}[f(x)]
\end{align*}

and its variance decreases in proportion to $1/R$
\begin{align*}
var(\hat E) &= var\bigg[R^{-1}\sum^R f(x^{(r)})\bigg]\\
&= R^{-2}var\bigg[\sum^R f(x^{(r)})\bigg]\\
&= R^{-2}\sum^R var[f(x^{(r)})] &\text{samples are indep.}\\
&= R^{-1}var(f(x))
\end{align*}
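
As a rough numerical check of both properties, the sketch below repeats the estimator many times for two values of $R$. The choice of $p = N(0, 1)$ and $f(x) = x^2$ (so that the true value is 1 and $var(f(x)) = 2$) is an assumption made for illustration, as is the helper name `simple_monte_carlo`.

```python
import numpy as np

def simple_monte_carlo(f, sampler, R, rng):
    """Estimate E_{x~p}[f(x)] by averaging f over R independent draws from p."""
    x = sampler(rng, R)
    return np.mean(f(x))

rng = np.random.default_rng(0)
f = lambda x: x ** 2                               # E[x^2] = 1 under N(0, 1)
sampler = lambda rng, R: rng.standard_normal(R)    # assumed p = N(0, 1)

# Repeat the estimator to look at its mean and spread for two sample sizes.
est_small = np.array([simple_monte_carlo(f, sampler, 100, rng) for _ in range(1000)])
est_large = np.array([simple_monte_carlo(f, sampler, 1600, rng) for _ in range(1000)])

# Both means should sit near 1 (unbiasedness); the variance ratio should be
# roughly 1600 / 100 = 16, following var(E_hat) = var(f(x)) / R.
variance_ratio = est_small.var() / est_large.var()
```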

## Importance Sampling
Assume we want to estimate an expectation under some $p$ that we can only evaluate up to a constant through $\tilde p$, s.t. $p(x) = \tilde p(x)/ Z_p$. We further assume some simpler density $q$ from which it is easy to sample and which is easy to evaluate up to a constant through $\tilde q$, s.t. $q(x) = \tilde q(x) / Z_q$.

Then, generate $R$ samples from $q$, i.e. 
$$\{x^{(r)}\}^R \sim q(x)$$
If these points were samples from $p(x)$, we could estimate $\Phi$ by 
$$\Phi = E_{x\sim p}(\phi(x)) \approx R^{-1}\sum^R \phi(x^{(r)}) = \hat\Phi$$

#### Weights
Note that the samples are generated from $q$ instead of $p$, so we need weights $\tilde w_r$ to correct for this difference, i.e. 
$$\tilde w_r = \frac{\tilde p(x^{(r)})}{\tilde q(x^{(r)})}$$
so that 
\begin{align*}
\Phi &= \int \phi(x)p(x)dx\\
&= \int \phi(x)\frac{p(x)}{q(x)}q(x)dx \\
&\approx R^{-1}\sum^R \phi(x^{(r)})\frac{p(x^{(r)})}{q(x^{(r)})}\\
&= \frac{Z_q}{Z_p} \frac{1}{R}\sum_{r=1}^R \phi(x^{(r)}) \cdot \frac{\tilde p(x^{(r)})}{\tilde q(x^{(r)})} \\
&= \frac{Z_q}{Z_p} \frac{1}{R}\sum_{r=1}^R \phi(x^{(r)}) \cdot \tilde w_r \\
&\approx \frac{\frac{1}{R}\sum_{r=1}^R \phi(x^{(r)}) \cdot \tilde w_r}{\frac{1}{R}\sum_{r=1}^R \tilde w_r} \\
&= \frac{1}{R}\sum_{r=1}^R \phi(x^{(r)}) \cdot w_r \\
&= \hat \Phi_{iw}
\end{align*}

where the second approximation uses $\frac{Z_p}{Z_q} = E_{x\sim q}\big[\tilde p(x)/\tilde q(x)\big] \approx \frac{1}{R}\sum_{r=1}^R \tilde w_r$, and $w_r = \tilde w_r \big/ \frac{1}{R}\sum_{r'=1}^R \tilde w_{r'}$ are the normalized weights.

Note, however, that $\hat\Phi_{iw}$ is a biased estimator for finite $R$, although it is consistent.
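
A minimal sketch of the weighted estimator $\hat\Phi_{iw}$ follows. The target $\tilde p(x) = \exp(-x^2/2)$, the proposal $q = N(0, 2^2)$ with $\tilde q(x) = \exp(-x^2/8)$, and the function name `self_normalized_is` are all assumptions for illustration. Working with log-densities is an implementation choice for numerical stability, not something the derivation requires.

```python
import numpy as np

def self_normalized_is(phi, log_p_tilde, log_q_tilde, q_sampler, R, rng):
    """Estimate E_{x~p}[phi(x)] using R samples from q and unnormalized weights."""
    x = q_sampler(rng, R)
    log_w = log_p_tilde(x) - log_q_tilde(x)      # log of w_tilde_r
    # Subtracting the max rescales every weight by the same constant,
    # which cancels in the ratio below.
    w_tilde = np.exp(log_w - np.max(log_w))
    return np.sum(phi(x) * w_tilde) / np.sum(w_tilde)

rng = np.random.default_rng(0)
log_p_tilde = lambda x: -0.5 * x ** 2            # assumed target: N(0, 1) up to Z_p
log_q_tilde = lambda x: -x ** 2 / 8.0            # assumed proposal: N(0, 4) up to Z_q
q_sampler = lambda rng, R: rng.normal(0.0, 2.0, size=R)

# phi(x) = x^2, so the true value under the assumed target is 1.
estimate = self_normalized_is(lambda x: x ** 2, log_p_tilde, log_q_tilde,
                              q_sampler, 50_000, rng)
```

Neither $Z_p$ nor $Z_q$ appears anywhere in the code, which is the practical appeal of normalizing by the sum of the weights.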

## Rejection Sampling 
Similar to importance sampling, but in this case we find some simpler $q$ and a constant $c$ s.t. $$\forall x. c\tilde q(x) > \tilde p(x)$$
Then, the algorithm proceeds as follows.

For each iteration
1. generate $x \sim q(x)$
2. generate $u \sim Unif[0, c\tilde q(x)]$
3. evaluate $\tilde p(x)$
  - if $u > \tilde p(x)$, reject $x$
  - else accept $x$ by appending it to $\{x^{(r)}\}$
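
The three steps above can be sketched as follows. The target $\tilde p(x) = \exp(-x^2/2)$, the proposal $q = N(0, 2^2)$ with $\tilde q(x) = \exp(-x^2/8)$, and $c = 1.2$ are assumed values for illustration; for this pair $\tilde p(x)/\tilde q(x) \leq 1$, so any $c > 1$ satisfies the envelope condition.

```python
import numpy as np

def rejection_sample(p_tilde, q_tilde, q_sampler, c, n_iter, rng):
    """Run n_iter proposals and return the accepted points as an array."""
    accepted = []
    for _ in range(n_iter):
        x = q_sampler(rng)                       # step 1: x ~ q
        u = rng.uniform(0.0, c * q_tilde(x))     # step 2: u ~ Unif[0, c * q_tilde(x)]
        if u <= p_tilde(x):                      # step 3: keep x if u falls under p_tilde
            accepted.append(x)
    return np.array(accepted)

rng = np.random.default_rng(0)
p_tilde = lambda x: np.exp(-0.5 * x ** 2)        # assumed target: N(0, 1) up to Z_p
q_tilde = lambda x: np.exp(-x ** 2 / 8.0)        # assumed proposal: N(0, 4) up to Z_q
q_sampler = lambda rng: rng.normal(0.0, 2.0)

n_iter = 20_000
xs = rejection_sample(p_tilde, q_tilde, q_sampler, c=1.2, n_iter=n_iter, rng=rng)
# Expected acceptance rate here is Z_p / (c * Z_q) = 1 / (1.2 * 2), about 0.42.
acceptance_rate = len(xs) / n_iter
```

The accepted points should have mean near 0 and variance near 1 under these assumptions, and a larger $c$ lowers the acceptance rate proportionally.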

#### Problem
In high dimensions, a tight envelope $c\tilde q$ is harder to find, so $c$ must be huge and the acceptance rate typically becomes exponentially small in the number of dimensions.

## Metropolis-Hastings Method
Instead of using one fixed $q$, let the proposal $q$ depend on the current state $x^{(t)}$, for example, $N(x^{(t)}, \Sigma)$.

Then, for each iteration, 
 - generate $x' \sim q(x'|x^{(t)})$
 - compute $a = \frac{\tilde p(x')q(x^{(t)}|x')}{\tilde p(x^{(t)})q(x'|x^{(t)})}$
 - if $a\geq 1$, accept $x'$; otherwise accept it with probability $a$
 - update $x^{(t+1)} =\begin{cases} x' &\text{if accepted}\\x^{(t)} & \text{if rejected}\end{cases}$
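
A minimal one-dimensional sketch of the iteration above, assuming the symmetric Gaussian proposal $N(x^{(t)}, \text{step}^2)$, for which $q(x^{(t)}|x') = q(x'|x^{(t)})$ and the ratio reduces to $a = \tilde p(x')/\tilde p(x^{(t)})$. The target $\tilde p(x) = \exp(-x^2/2)$, the step size, and the burn-in length are illustrative assumptions, not part of the method as stated.

```python
import numpy as np

def metropolis_hastings(log_p_tilde, x0, step, n_steps, rng):
    """Random-walk MH with the symmetric proposal N(x_t, step^2)."""
    x = x0
    chain = np.empty(n_steps)
    n_accept = 0
    for t in range(n_steps):
        x_new = rng.normal(x, step)                   # x' ~ q(x' | x_t)
        # Symmetric proposal: the q terms cancel, leaving log a.
        log_a = log_p_tilde(x_new) - log_p_tilde(x)
        # log(u) < 0 always, so a >= 1 is always accepted;
        # otherwise the move is accepted with probability a.
        if np.log(rng.random()) < log_a:
            x = x_new
            n_accept += 1
        chain[t] = x                                  # on rejection, x_t is repeated
    return chain, n_accept / n_steps

rng = np.random.default_rng(0)
log_p_tilde = lambda x: -0.5 * x ** 2                 # assumed target: N(0, 1) up to Z_p
chain, accept_rate = metropolis_hastings(log_p_tilde, x0=0.0, step=2.0,
                                         n_steps=30_000, rng=rng)
kept = chain[2_000:]                                  # discard an assumed burn-in period
```

For an asymmetric proposal, the $q$ terms would have to be added back into `log_a` as in the formula for $a$ above.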
 
#### Problems with the MH method
 - The samples form a dependent sequence, so we cannot easily estimate the variance of the resulting estimates.
 - It is hard to know when the chain has "converged", i.e. when we have obtained enough samples that are effectively independent samples from $p$.