# Bayesian Linear Regression

## What is the problem?

Given inputs $X$ and outputs $\mathbf{y}$, we want to find parameters $\boldsymbol{\theta}$ such that the predictions $\hat{\mathbf{y}} = X\boldsymbol{\theta}$ estimate $\mathbf{y}$ well. In other words, we want the L2 norm of the errors, $||\hat{\mathbf{y}} - \mathbf{y}||_2$, to be as low as possible.

## Applying Bayes Rule

In this problem, we will {ref}`model the distribution of parameters <parameters-framework>`. 

\begin{equation}
\underbrace{p(\boldsymbol{\theta}|X, \mathbf{y})}_{\text{Posterior}} = \frac{\overbrace{p(\mathbf{y}|X, \boldsymbol{\theta})}^{\text{Likelihood}}}{\underbrace{p(\mathbf{y}|X)}_{\text{Evidence}}}\underbrace{p(\boldsymbol{\theta})}_{\text{Prior}}
\end{equation}

\begin{equation}
p(\mathbf{y}|X) = \int_{\boldsymbol{\theta}}p(\mathbf{y}|X, \boldsymbol{\theta})p(\boldsymbol{\theta})d\boldsymbol{\theta}
\end{equation}

We are interested in the posterior $p(\boldsymbol{\theta}|X, \mathbf{y})$. To derive it, we need the prior, likelihood and evidence terms. Let us look at them one by one.

### Prior

Let's assume a multivariate Gaussian prior over the $\boldsymbol{\theta}$ vector.

$$
\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\mu}_0, \Sigma_0)
$$

### Likelihood

Given a $\boldsymbol{\theta}$, our prediction is $X\boldsymbol{\theta}$. Our data $\mathbf{y}|X$ will contain some irreducible noise, which needs to be incorporated in the likelihood. Thus, we assume the likelihood over $\mathbf{y}$ to be a Gaussian centered at $X\boldsymbol{\theta}$, with i.i.d. homoskedastic noise of variance $\sigma^2$:

$$
\mathbf{y}|X, \boldsymbol{\theta} \sim \mathcal{N}(X\boldsymbol{\theta}, \sigma^2I)
$$
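The prior and likelihood together describe how data could be generated: draw $\boldsymbol{\theta}$ from the prior, then draw $\mathbf{y}$ around $X\boldsymbol{\theta}$. The following is a minimal NumPy sketch of that process. The design matrix, the values of `mu_0`, `Sigma_0` and `sigma`, and the random seed are illustrative assumptions, not values from this text.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Hypothetical design matrix: a bias column plus one feature
n, d = 30, 2
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])

# Assumed prior hyperparameters and noise level (illustrative only)
mu_0 = np.zeros(d)
Sigma_0 = np.eye(d)
sigma = 0.2

# Prior: theta ~ N(mu_0, Sigma_0)
theta = rng.multivariate_normal(mu_0, Sigma_0)

# Likelihood: y | X, theta ~ N(X theta, sigma^2 I),
# i.e. the prediction X theta plus i.i.d. Gaussian noise
y = X @ theta + sigma * rng.normal(size=n)

print("sampled theta:", theta)
print("first five outputs:", y[:5])
```

Each new draw of `theta` corresponds to a different line through the data space, which is what "modeling the distribution of parameters" means in practice.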

### Maximum Likelihood Estimation (MLE)

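Let us find the parameters that maximize the likelihood $p(\mathbf{y}|X, \boldsymbol{\theta})$ by differentiating w.r.t. $\boldsymbol{\theta}$ and setting the derivative to zero. Since the logarithm is monotonic, we can differentiate the log-likelihood instead. With $n$ observations, the likelihood is: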

\begin{equation}
p(\mathbf{y}|X, \boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi)^n |\sigma^2I|}}\exp \left( -\frac{1}{2}(\mathbf{y} - X\boldsymbol{\theta})^T(\sigma^2I)^{-1}(\mathbf{y} - X\boldsymbol{\theta}) \right)
\end{equation}

Simplifying the above equation:

\begin{equation}
p(\mathbf{y}|X, \boldsymbol{\theta}) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}\exp \left( -\frac{1}{2\sigma^2}(\mathbf{y} - X\boldsymbol{\theta})^T(\mathbf{y} - X\boldsymbol{\theta}) \right)
\end{equation}

Taking log to simplify further:

\begin{align}
\log p(\mathbf{y}|X, \boldsymbol{\theta}) &= -\frac{1}{2\sigma^2}(\mathbf{y} - X\boldsymbol{\theta})^T(\mathbf{y} - X\boldsymbol{\theta}) - \frac{n}{2}\log (2\pi\sigma^2)\\
\frac{d}{d\boldsymbol{\theta}} \log p(\mathbf{y}|X, \boldsymbol{\theta}) &= -\frac{1}{2\sigma^2}\frac{d}{d\boldsymbol{\theta}}(\mathbf{y} - X\boldsymbol{\theta})^T(\mathbf{y} - X\boldsymbol{\theta})\\
&= -\frac{1}{2\sigma^2}\frac{d}{d\boldsymbol{\theta}}(\mathbf{y}^T - \boldsymbol{\theta}^TX^T)(\mathbf{y} - X\boldsymbol{\theta})\\
&= -\frac{1}{2\sigma^2}\frac{d}{d\boldsymbol{\theta}} \left[ \mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\boldsymbol{\theta} - \boldsymbol{\theta}^TX^T\mathbf{y} + \boldsymbol{\theta}^TX^TX\boldsymbol{\theta}\right]\\
&= -\frac{1}{2\sigma^2}\left[-(\mathbf{y}^TX)^T - X^T\mathbf{y} + 2X^TX\boldsymbol{\theta}\right] = 0\\
\therefore X^TX\boldsymbol{\theta} &= X^T\mathbf{y}\\
\therefore  \boldsymbol{\theta}_{MLE} &= (X^TX)^{-1}X^T\mathbf{y}
\end{align}

We used some of the formulas from [this cheatsheet](http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf) but they can also be derived from scratch.
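The closed-form result $\boldsymbol{\theta}_{MLE} = (X^TX)^{-1}X^T\mathbf{y}$ can be checked numerically. The sketch below is a minimal example on synthetic data; the names `theta_true` and `sigma` and their values are hypothetical, and it assumes $X^TX$ is invertible. It solves the normal equations $X^TX\boldsymbol{\theta} = X^T\mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

def gaussian_log_likelihood(theta, X, y, sigma):
    """Log of N(y | X theta, sigma^2 I) evaluated at the observed y."""
    n = y.shape[0]
    residual = y - X @ theta
    return (-0.5 * residual @ residual / sigma**2
            - 0.5 * n * np.log(2 * np.pi * sigma**2))

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): bias column plus one feature
n = 100
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
theta_true = np.array([0.5, 2.0])  # hypothetical ground-truth parameters
sigma = 0.1                        # assumed noise standard deviation
y = X @ theta_true + sigma * rng.normal(size=n)

# Normal equations: X^T X theta = X^T y
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The bracketed gradient term from the derivation should vanish at theta_mle
grad = -2 * X.T @ y + 2 * X.T @ X @ theta_mle

print("theta_mle:  ", theta_mle)
print("theta_lstsq:", theta_lstsq)
print("max |gradient|:", np.abs(grad).max())
```

Note that $\sigma^2$ does not appear in the estimate: the noise level scales the log-likelihood but does not move its maximizer.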

### Maximum a posteriori estimation (MAP)

We know from {ref}`the previous discussion <MAP-1>` that:

$$
\arg \max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|X, \mathbf{y}) = \arg \max_{\boldsymbol{\theta}} p(\mathbf{y}|X, \boldsymbol{\theta})p(\boldsymbol{\theta})
$$
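The evidence $p(\mathbf{y}|X)$ does not depend on $\boldsymbol{\theta}$, so dividing by it cannot move the maximizer. The following sketch illustrates this on a grid for a single scalar parameter. Everything concrete here is an assumption made for illustration: the synthetic data, the scalar prior $\mathcal{N}(\mu_0, \sigma_0^2)$, the grid range, and the Riemann-sum approximation of the evidence integral.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic one-feature data, so theta is a scalar (illustrative assumption)
n = 20
x = rng.uniform(-1, 1, size=n)
sigma = 0.5
y = 1.5 * x + sigma * rng.normal(size=n)

mu0, sigma0 = 0.0, 1.0             # assumed prior N(mu0, sigma0^2)
grid = np.linspace(-3, 5, 4001)    # candidate theta values
dtheta = grid[1] - grid[0]

# Log-likelihood and log-prior on the grid, dropping theta-independent constants
residuals = y[None, :] - grid[:, None] * x[None, :]
log_lik = -0.5 * np.sum(residuals**2, axis=1) / sigma**2
log_prior = -0.5 * (grid - mu0) ** 2 / sigma0**2
log_joint = log_lik + log_prior

# Normalize with a Riemann-sum approximation of the evidence integral
joint = np.exp(log_joint - log_joint.max())  # subtract the max for numerical stability
posterior = joint / (joint.sum() * dtheta)

theta_map_unnormalized = grid[np.argmax(log_joint)]
theta_map_normalized = grid[np.argmax(posterior)]

print("argmax of likelihood x prior:", theta_map_unnormalized)
print("argmax of normalized posterior:", theta_map_normalized)
```

Both maximizers land on the same grid point, which is why MAP estimation can ignore the evidence term.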

Now, differentiating $p(\mathbf{y}|X, \boldsymbol{\theta})p(\boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$ by reusing some of the steps from MLE:

