# 2.2 Bayesian Inference

## ***Vocabulary***

**frequentist view**
- a framework for statistical estimation that treats parameters as unknown but fixed quantities and estimates them from random data.

**subjective absolute**

**posterior distribution**

**prior**
- a distribution that we assign to a parameter before observing any data, encoding what we already believe about the parameter.

**Gaussian probability formula**
$$ P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{(x_i-\theta)^2}{2\sigma_1^2}\right) $$

# Lecture Notes #

## ***2.2.0 Bayesian Inference***

#### **Introduction**

<br>
<center>
    <img width="60%" src="images/2.2.1.png" alt="Professor Notes" />
</center>
<br>

Bayesian inference is a powerful set of techniques for parameter estimation. It has the advantage of incorporating prior information as well as quantifying uncertainty. While MLE takes the frequentist view, Bayesian inference takes a significantly different one.

In Bayesian inference, the key idea is that the unknown parameter **is not** viewed as a deterministic value. It is viewed as a random variable.

The idea behind this is that, even though theta is actually fixed, we don't actually observe theta directly. We only have very limited, partial information about theta. That is something called a **subjective absolute**. Essentially, to us, theta is a random variable.

In Bayesian inference, we assume theta is a random variable, then explicitly calculate its posterior distribution given the observations, using what's called Bayes' rule.

#### **Bayes' Rule**

First, the posterior distribution is notated as:

$$ p(\theta|D)$$

And the likelihood function is notated as:

$$ p(D|\theta) $$

And the prior (see vocabulary section) is notated as:

$$ p(\theta) $$

And the marginal distribution of the data is notated as:

$$ p(D) = \int p(D|\theta)\;p(\theta)\;d\theta $$

Putting it all together, Bayes' rule is as follows:

$$ p(\theta|D) = \frac{p(D|\theta)\;p(\theta)}{p(D)} $$

Where the marginal distribution $p(D)$ is used as a normalization constant to ensure the posterior distribution integrates to 1.

#### **Proving Bayes' Rule**

<br>
<center>
    <img width="60%" src="images/2.2.2.png" alt="Professor Notes" />
</center>
<br>
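For reference in text form, the standard argument can be written out as follows. This is the usual textbook derivation and may differ in notation from the handwritten notes above. Write the joint density of $\theta$ and $D$ in two ways using the definition of conditional probability:

$$ p(\theta, D) = p(\theta|D)\;p(D) = p(D|\theta)\;p(\theta) $$

Dividing both sides by $p(D)$ (assumed to be nonzero) gives Bayes' rule:

$$ p(\theta|D) = \frac{p(D|\theta)\;p(\theta)}{p(D)}, \qquad p(D) = \int p(D|\theta)\;p(\theta)\;d\theta $$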

#### **Bayesian Inference Illustrative Example**

<br>
<center>
    <img width="60%" src="images/2.2.3.png" alt="Professor Notes" />
</center>
<br>

In this example, we see that $\theta$ is a binary variable that outputs 1 if the sun exploded, and 0 otherwise. $x$ is also a binary variable that outputs 1 if the alarm goes off and 0 otherwise.

If the alarm fires, do we believe the device? We have two conflicting pieces of evidence. 
1. This device is very accurate, with $\alpha = 0.0001$ in this case (a value chosen by the professor).
2. The likelihood of the sun exploding today or any other day is infinitesimally small.

How can we combine these two pieces of evidence? Luckily, that is exactly what Bayesian inference can do.

---

First let's try MLE, which we find will not work in this case:

<br>
<center>
    <img width="60%" src="images/2.2.4.png" alt="Professor Notes" />
</center>
<br>

MLE fails here because we have only one data point and it does not use any prior knowledge.
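A minimal sketch of this failure is shown below. The likelihood model is an assumption, since the text does not spell it out: the alarm is taken to fire falsely with probability $\alpha$ when the sun has not exploded, and to fire correctly with probability $1-\alpha$ when it has. The function names are hypothetical.

```python
# Assumed likelihood model (not spelled out in the text):
#   P(x = 1 | theta = 0) = alpha      (false alarm)
#   P(x = 1 | theta = 1) = 1 - alpha  (correct detection)
ALPHA = 0.0001


def likelihood(x, theta, alpha=ALPHA):
    """P(x | theta) for the binary alarm model assumed above."""
    p_alarm = 1 - alpha if theta == 1 else alpha
    return p_alarm if x == 1 else 1 - p_alarm


def mle(x, alpha=ALPHA):
    """Return the theta in {0, 1} that maximizes P(x | theta)."""
    return max((0, 1), key=lambda theta: likelihood(x, theta, alpha))


theta_hat = mle(x=1)
print("MLE estimate when the alarm fires:", theta_hat)
```

Under this assumed model, MLE simply follows the alarm, because the likelihood is the only quantity it looks at.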

---

Now, using Bayesian inference:

<br>
<center>
    <img width="60%" src="images/2.2.5.png" alt="Professor Notes" />
</center>
<br>

Thus, the decision of whether $\theta$ should be 0 or 1 can be written as:

<br>
<center>
    <img width="60%" src="images/2.2.6.png" alt="Professor Notes" />
</center>
<br>

So we can say we should predict $\theta = 1$ if:

<br>
<center>
    <img width="60%" src="images/2.2.7.png" alt="Professor Notes" />
</center>
<br>

The condition for predicting $\theta = 0$ can be derived the same way. In this case the prediction is $\theta = 0$.
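The calculation can be sketched numerically. Two things here are assumptions rather than values from the notes: the alarm model (it fires falsely with probability $\alpha$ and correctly with probability $1-\alpha$), and the prior probability that the sun explodes, which is set to a hypothetical $10^{-10}$ purely for illustration.

```python
def posterior_sun_exploded(alpha, prior):
    """P(theta = 1 | x = 1) via Bayes' rule for the assumed alarm model."""
    numerator = (1 - alpha) * prior             # P(x=1 | theta=1) * P(theta=1)
    evidence = numerator + alpha * (1 - prior)  # P(x=1), summed over both theta values
    return numerator / evidence


ALPHA = 0.0001  # device error rate from the example
PRIOR = 1e-10   # hypothetical prior P(theta = 1); not a value from the notes

post = posterior_sun_exploded(ALPHA, PRIOR)
decision = 1 if post > 0.5 else 0  # predict the more probable value of theta
print("P(theta=1 | alarm) =", post)
print("decision: theta =", decision)
```

With these illustrative numbers the posterior probability is on the order of $10^{-6}$, so the tiny prior outweighs the accurate device and the decision is $\theta = 0$.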

## ***2.2.1 More Examples***

Recall the formula for Bayesian Inference:

<br>
<center>
    <img width="60%" src="images/2.2.8.png" alt="Professor Notes" />
</center>
<br>

Note that since the normalization constant, $P(D)$, does not depend on $\theta$, the posterior is proportional to the formula on the right for the purpose of estimating $\theta$.

### **Example 1**

<br>
<center>
    <img width="60%" src="images/2.2.9.png" alt="Professor Notes" />
</center>
<br>

In this example we would like to predict the commute time at our new apartment. We have some **prior knowledge** from our friend about the commute time, and we have a few **observations** we have made ourselves while testing the drive. 

In this problem, $\theta$ is the commute time. We can use a Gaussian distribution to capture the prior knowledge, where $\theta \sim \mathcal{N}(\mu_0, \sigma^2_0)$, with $\mu_0 = 30$ and $\sigma_0 = 10$. Our observations can be denoted $x_1, \dots, x_n$, where $x_i = \theta + \sigma_1 \xi_i$, $\xi_i \sim \mathcal{N}(0,1)$, $\sigma_1 = 5$ (in practice you can estimate $\sigma_1$).

Based on the assumption that we know $\sigma_1$, we can actually define the likelihood function:

$$ x_i \mid \theta \sim \mathcal{N}(\theta,\sigma_1^2) $$

Then, to estimate the posterior distribution:

$$ P(\theta|D) = \frac{P(D|\theta)\;P(\theta)}{P(D)}  \propto P(D|\theta)\;P(\theta)$$

Now, since our observations are a set of data points, we will take the product like we did for MLE:

$$ P(\theta|D) \propto \left[\prod_{i=1}^n P(x_i|\theta)\right]\;P(\theta) $$

Then, since we assumed the observations are Gaussian, we can use the formula for a Gaussian (see Vocabulary section). Because the leading factor $\frac{1}{\sqrt{2\pi}\sigma_1}$ does not depend on $\theta$, we can drop it:

$$ P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{(x_i-\theta)^2}{2\sigma_1^2}\right) \propto \exp\left(-\frac{(x_i-\theta)^2}{2\sigma_1^2}\right) $$

So the posterior distribution will be:

$$ P(\theta|D) \propto \left[\prod_{i=1}^n \exp\left(-\frac{(x_i-\theta)^2}{2\sigma_1^2}\right)\right] \exp\left(-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right) $$

$$ \propto \exp\left(-\sum_{i=1}^n \frac{(\theta-x_i)^2}{2\sigma_1^2} -\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right) $$

The exponent is a quadratic function of $\theta$, so we can expand it and collect terms by powers of $\theta$:

$$ = \exp\left(-\frac{1}{2}(A\theta^2-2B\theta+C)\right) $$

We can now solve for $A, B,$ and $C$ by matching coefficients with the posterior expression above:

$$ A = \sum_{i=1}^n\frac{1}{\sigma_1^2} + \frac{1}{\sigma_0^2} = \frac{n}{\sigma_1^2}+\frac{1}{\sigma_0^2} $$
$$ B = \sum_{i=1}^n \frac{x_i}{\sigma_1^2} + \frac{\mu_0}{\sigma^2_0} $$
$$ C = \text{a constant that does not depend on } \theta \text{, so we can ignore it} $$

Completing the square in the quadratic expression we created:

$$ \exp\left(-\frac{1}{2} A\left(\theta-\frac{B}{A}\right)^2+\text{const}\right) $$

We can recognize this as the density of a Gaussian distribution over $\theta$:

$$ \theta \mid D \sim \mathcal{N}\left(\frac{B}{A}, \frac{1}{A}\right) $$
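As a sanity check on completing the square, the sketch below normalizes the unnormalized posterior on a grid and compares its mean and variance with $B/A$ and $1/A$, where $A = \frac{n}{\sigma_1^2}+\frac{1}{\sigma_0^2}$ and $B = \sum_i \frac{x_i}{\sigma_1^2} + \frac{\mu_0}{\sigma_0^2}$. The observations `xs`, the grid bounds, and the function names are hypothetical choices made for illustration; only $\mu_0 = 30$, $\sigma_0 = 10$, and $\sigma_1 = 5$ come from the example.

```python
import math


def unnormalized_log_posterior(theta, xs, mu0, sigma0, sigma1):
    """Log of the Gaussian likelihood product times the Gaussian prior, constants dropped."""
    log_lik = -sum((x - theta) ** 2 for x in xs) / (2 * sigma1 ** 2)
    log_prior = -(theta - mu0) ** 2 / (2 * sigma0 ** 2)
    return log_lik + log_prior


def grid_posterior_moments(xs, mu0, sigma0, sigma1, lo=-50.0, hi=110.0, steps=16001):
    """Normalize the posterior on an evenly spaced grid and return (mean, variance)."""
    thetas = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    logs = [unnormalized_log_posterior(t, xs, mu0, sigma0, sigma1) for t in thetas]
    peak = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / total
    return mean, var


xs = [20.0, 25.0, 30.0]  # hypothetical commute observations (minutes)
mu0, sigma0, sigma1 = 30.0, 10.0, 5.0

A = len(xs) / sigma1 ** 2 + 1 / sigma0 ** 2
B = sum(xs) / sigma1 ** 2 + mu0 / sigma0 ** 2

grid_mean, grid_var = grid_posterior_moments(xs, mu0, sigma0, sigma1)
print("grid mean:", grid_mean, " closed form B/A:", B / A)
print("grid var: ", grid_var, " closed form 1/A:", 1 / A)
```

The grid approach needs no algebra, so agreement between the two pairs of numbers is evidence that the coefficients $A$ and $B$ were matched correctly.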

Using the above, and our solved values for $A$ and $B$, the posterior mean and variance are:

$$ \mu_p = \frac{B}{A} = \frac{\sum_{i=1}^n \frac{x_i}{\sigma_1^2} + \frac{\mu_0}{\sigma^2_0}}{\frac{n}{\sigma_1^2}+\frac{1}{\sigma_0^2}} $$

$$ \sigma_p^2 = \frac{1}{A} = \left(\frac{n}{\sigma_1^2}+\frac{1}{\sigma_0^2}\right)^{-1} $$
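A minimal sketch of these closed-form updates in code, using the example's $\mu_0 = 30$, $\sigma_0 = 10$, and $\sigma_1 = 5$. The observed commute times are hypothetical, since the notes do not list them, and the function name is an illustrative choice.

```python
def gaussian_posterior(xs, mu0, sigma0, sigma1):
    """Posterior mean and variance of theta for a Gaussian prior and Gaussian noise.

    A = n / sigma1^2 + 1 / sigma0^2
    B = sum(x_i) / sigma1^2 + mu0 / sigma0^2
    Returns (B / A, 1 / A).
    """
    A = len(xs) / sigma1 ** 2 + 1 / sigma0 ** 2
    B = sum(xs) / sigma1 ** 2 + mu0 / sigma0 ** 2
    return B / A, 1 / A


observations = [20.0, 25.0, 30.0]  # hypothetical test drives (minutes)
mu_p, var_p = gaussian_posterior(observations, mu0=30.0, sigma0=10.0, sigma1=5.0)
print("posterior mean:", mu_p)
print("posterior variance:", var_p)

# With no observations the posterior reduces to the prior.
prior_mean, prior_var = gaussian_posterior([], mu0=30.0, sigma0=10.0, sigma1=5.0)
print("no data:", prior_mean, prior_var)
```

Each new observation adds $1/\sigma_1^2$ to $A$, so the posterior variance $1/A$ shrinks as more drives are recorded.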

### **Example 2**

#### **Setup**

<br>
<center>
    <img width="60%" src="images/2.2.10.png" alt="Professor Notes" />
</center>
<br>

In this case we will be using Bayesian inference to solve linear regression, which is typically solved using least squares estimation. However, Bayesian inference has the advantage of quantifying the uncertainty in the data and the parameter estimation.
So if we want an uncertainty estimation about how accurate our estimate of $\theta$ might be, this is what Bayesian inference can provide.

#### **Initialize the Prior Distribution**

In this case, we will treat $\theta$ as a random variable, and assume the prior is a Gaussian distribution. In reality, you must decide what the prior will be, but we can typically assume a Gaussian so that is what we are using for this problem. $\mu_0$ and $\sigma_0$ need to be set by us, and this could come from some prior information. Or, if you don't have a lot of information, we can assume $\mu_0 = 0$ and $\sigma_0$ is some very large number.

**TL;DR:** If you have information about $\theta$, you can set a sharp prior. Otherwise, you can set a very broad, uninformative prior such as $\mu_0 = 0$ and $\sigma_0 =$ some very large number.
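To make the effect of the prior's width concrete, the sketch below uses the scalar Gaussian model from Example 1, where the posterior mean is $B/A$ with $A = \frac{n}{\sigma_1^2}+\frac{1}{\sigma_0^2}$ and $B = \sum_i \frac{x_i}{\sigma_1^2} + \frac{\mu_0}{\sigma_0^2}$. It is only an illustration of the idea, not the regression model itself, and all numeric values are hypothetical.

```python
def posterior_mean(xs, mu0, sigma0, sigma1):
    """Posterior mean B / A for a scalar Gaussian prior and Gaussian noise."""
    A = len(xs) / sigma1 ** 2 + 1 / sigma0 ** 2
    B = sum(xs) / sigma1 ** 2 + mu0 / sigma0 ** 2
    return B / A


xs = [20.0, 25.0, 30.0]  # hypothetical observations; their sample mean is 25
sample_mean = sum(xs) / len(xs)

# Sharp prior: small sigma0 keeps the posterior mean close to mu0.
sharp = posterior_mean(xs, mu0=30.0, sigma0=0.1, sigma1=5.0)

# Uninformative prior: mu0 = 0 with a very large sigma0 lets the data dominate.
flat = posterior_mean(xs, mu0=0.0, sigma0=1e6, sigma1=5.0)

print("sample mean:", sample_mean)
print("sharp prior ->", sharp)
print("flat prior  ->", flat)
```

With the flat prior the choice $\mu_0 = 0$ barely matters, which is why it is a reasonable default when little is known about $\theta$.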

<br>
<center>
    <img width="60%" src="images/2.2.12.png" alt="Professor Notes" />
</center>
<br>


#### **Determining the Likelihood**

We can assume that $y_i$ is equal to $x_i^T \theta$ plus some Gaussian noise:

$$ y_i = x_i^T \theta + \sigma_1\xi_i $$

Where $\sigma_1$ is the noise standard deviation, which we set or estimate from the data, and $\xi_i$ is standard Gaussian noise: $\xi_i \sim \mathcal{N}(0,1)$.

Then we can write the likelihood as:

$$ P(\{y_i, x_i\}\mid\theta) $$

Factoring with the chain rule, and assuming the inputs $x_i$ do not depend on $\theta$:

$$ = P(y_i\mid x_i, \theta)\;P(x_i) $$

And since the $P(x_i)$ is a constant in terms of $\theta$, we can ignore it in further calculations.
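Under the noise model above, each $y_i$ given $x_i$ and $\theta$ is Gaussian with mean $x_i^T\theta$ and variance $\sigma_1^2$, so the log-likelihood of i.i.d. data is a sum of Gaussian log-densities. The following is a minimal sketch; the function name and the toy data are hypothetical.

```python
import math


def log_likelihood(theta, xs, ys, sigma1):
    """Sum over i of log N(y_i | x_i^T theta, sigma1^2), dropping the P(x_i) factor."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(xj * tj for xj, tj in zip(x, theta))  # x_i^T theta
        total += -0.5 * math.log(2 * math.pi * sigma1 ** 2)
        total += -((y - pred) ** 2) / (2 * sigma1 ** 2)
    return total


# Hypothetical toy data generated exactly by theta = [1, 2] (intercept, slope).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ys = [1.0, 3.0, 5.0]

ll_true = log_likelihood([1.0, 2.0], xs, ys, sigma1=1.0)
ll_zero = log_likelihood([0.0, 0.0], xs, ys, sigma1=1.0)
print("log-likelihood at theta = [1, 2]:", ll_true)
print("log-likelihood at theta = [0, 0]:", ll_zero)
```

Working in log space turns the product over data points into a sum, which avoids numerical underflow when there are many observations.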

#### **Finding the Posterior Distribution**

Plugging in the probabilities (recall the data points were i.i.d.), then simplifying using the chain rule:

<br>
<center>
    <img width="60%" src="images/2.2.13.png" alt="Professor Notes" />
</center>
<br>

Then, since we used a Gaussian prior and a Gaussian likelihood, we can use the Gaussian probability formula:

<br>
<center>
    <img width="60%" src="images/2.2.14.png" alt="Professor Notes" />
</center>
<br>

Where we turn the prior into a quadratic equation in order to isolate and solve for $\mu$ and $\sigma^2$.
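For intuition, the sketch below works out the special case where $x_i$ and $\theta$ are scalars, assuming a prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and noise standard deviation $\sigma_1$. Collecting the $\theta^2$ and $\theta$ terms in the exponent gives $A = \sum_i \frac{x_i^2}{\sigma_1^2} + \frac{1}{\sigma_0^2}$ and $B = \sum_i \frac{x_i y_i}{\sigma_1^2} + \frac{\mu_0}{\sigma_0^2}$, with posterior $\mathcal{N}(B/A, 1/A)$. This scalar form is a simplification chosen for illustration and may differ in notation from the handwritten derivation; the data and function name are hypothetical.

```python
def scalar_regression_posterior(xs, ys, mu0, sigma0, sigma1):
    """Posterior (mean, variance) of a scalar slope theta in y_i = x_i * theta + noise."""
    A = sum(x * x for x in xs) / sigma1 ** 2 + 1 / sigma0 ** 2
    B = sum(x * y for x, y in zip(xs, ys)) / sigma1 ** 2 + mu0 / sigma0 ** 2
    return B / A, 1 / A


# Hypothetical noiseless data with true slope 2, and a broad prior centered at 0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slope_mean, slope_var = scalar_regression_posterior(
    xs, ys, mu0=0.0, sigma0=1000.0, sigma1=1.0
)
print("posterior mean of theta:", slope_mean)
print("posterior variance of theta:", slope_var)
```

The posterior variance is the uncertainty estimate that least squares alone does not provide, and with a very broad prior the posterior mean approaches the least-squares slope.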

# Personal Notes #