# Linear Regression

This notebook builds a simple linear regression model. It:
- Derives the MLE solution for estimating weights under Gaussian noise
- Verifies the closed-form least squares solution using gradient descent
- Implements Bayesian inference with Gaussian priors
- Visualizes predictive uncertainty from the posterior

## Setup

In [1]:
import numpy as np

## MLE for Linear Regression with Gaussian Noise

We model the outputs $y \in \mathbb{R}^n$ as noisy linear combinations of the inputs $X \in \mathbb{R}^{n \times d}$:

$$y = X w + \epsilon$$

Where: 
- $x_i \in \mathbb{R}^d$ is the $i$-th row of $X$, representing one data point
- $w \in \mathbb{R}^d$ is the parameter vector
- Each output $y_i$ is given by $y_i = x_i^T w + \epsilon_i$
- The noise vector $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ has i.i.d. Gaussian entries $\epsilon_i$, each with mean $0$ and variance $\sigma^2$

Conditioned on $X$, the outputs are:

$$y \mid X \sim \mathcal{N}(X w, \sigma^2 I_n)$$
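To make the generative model concrete, here is a minimal sketch that samples synthetic data from it. The sample size, dimension, noise level, weights, and seed are all hypothetical values chosen for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n, d = 200, 3                          # hypothetical sample size and input dimension
sigma = 0.5                            # hypothetical noise standard deviation
w_true = np.array([1.5, -2.0, 0.7])    # hypothetical "true" weight vector

X = rng.normal(size=(n, d))            # each row is one data point x_i
eps = rng.normal(scale=sigma, size=n)  # i.i.d. Gaussian noise with variance sigma^2
y = X @ w_true + eps                   # y = Xw + eps
```

The residuals `y - X @ w_true` are exactly the sampled noise, so their empirical standard deviation should be close to `sigma`.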

When $d = 1$ we are fitting a line, and when $d = 2$ a plane. The model has no bias term, so both pass through the origin. A bias $b$ can be added by appending a column of ones to $X$ and one extra entry to $w$:

$$X' = \begin{bmatrix} X & \mathbf{1} \end{bmatrix}, \quad w' = \begin{bmatrix} w \\ b \end{bmatrix}$$
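The augmentation can be sketched in NumPy as follows. The toy inputs, weights, and bias are made-up values, and the names `X_aug` and `w_aug` are only illustrative stand-ins for $X'$ and $w'$.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # toy inputs with n = 3, d = 2
w = np.array([0.5, -1.0])    # hypothetical weights
b = 2.0                      # hypothetical bias

# X' = [X 1]: append a column of ones to the inputs
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
# w' = [w; b]: append the bias to the weight vector
w_aug = np.append(w, b)

# The augmented model reproduces Xw + b
predictions_match = np.allclose(X_aug @ w_aug, X @ w + b)
```

With this trick, every formula below applies unchanged to $X'$ and $w'$.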

### Likelihood

Because the noise terms are independent, the probability density of $y$ factorizes over the data points:

$$ p(y | X, w, \sigma^2) = \prod_{i=1}^{n} \mathcal{N}(y_i | x_i^T w, \sigma^2)$$

Where:

$$\mathcal{N}(y_i | x_i^T w, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(-\frac{1}{2 \sigma^2} (y_i - x_i^T w)^2\right)$$

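Taking the log and collecting the terms that do not depend on $w$ into a constant, we get the following.

$$\log p(y | X, w, \sigma^2) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - x_i^T w)^2 = -\frac{1}{2 \sigma^2} ||y - Xw||^2 + \text{const}$$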

This means that maximizing the log-likelihood is equivalent to minimizing the squared error.

$$\min_{w} ||y - Xw||^2$$
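As a numerical sanity check of this equivalence, the sketch below compares two candidate weight vectors: the difference in their log-likelihoods should equal the negative difference in squared errors, scaled by $1 / (2 \sigma^2)$. The data, the candidates `w_a` and `w_b`, and the helper names are hypothetical.

```python
import numpy as np

def log_likelihood(w, X, y, sigma):
    """Gaussian log-likelihood of y given X, weights w, and noise std sigma."""
    n = y.shape[0]
    resid = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (resid @ resid) / (2 * sigma**2)

def squared_error(w, X, y):
    """Squared error ||y - Xw||^2."""
    resid = y - X @ w
    return resid @ resid

rng = np.random.default_rng(1)  # arbitrary seed
sigma = 0.3                     # hypothetical noise level
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=sigma, size=50)

w_a = np.array([0.9, -1.1])     # candidate close to the generating weights
w_b = np.array([0.0, 0.0])      # candidate far from them

# The w-independent constant cancels in the difference
ll_diff = log_likelihood(w_a, X, y, sigma) - log_likelihood(w_b, X, y, sigma)
se_diff = squared_error(w_a, X, y) - squared_error(w_b, X, y)
equivalent = np.isclose(ll_diff, -se_diff / (2 * sigma**2))
```

The candidate with the lower squared error always has the higher log-likelihood, whatever the value of $\sigma$.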

### Closed-form Solution

We can find the weights that minimize the squared error in closed form. First, we expand the loss.

$$L(w) = ||y - Xw||^2 = y^T y - 2 w^T X^T y + w^T X^T X w$$

Next, we take the gradient of the loss with respect to $w$.

$$
\begin{align}
\frac{d}{dw} L(w) &= -2 X^T y + 2 X^T X w \\
&= -2 X^T (y - Xw)
\end{align}
$$
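One way to gain confidence in this gradient is to compare it against central finite differences of the loss. The sketch below does so on random data. The helper names `loss` and `grad` and the step size `h` are illustrative choices.

```python
import numpy as np

def loss(w, X, y):
    """Squared error L(w) = ||y - Xw||^2."""
    r = y - X @ w
    return r @ r

def grad(w, X, y):
    """Analytic gradient -2 X^T (y - Xw)."""
    return -2 * X.T @ (y - X @ w)

rng = np.random.default_rng(2)  # arbitrary seed
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

h = 1e-5  # finite-difference step (an assumption; other small values also work)
numeric = np.zeros_like(w)
for j in range(w.size):
    e = np.zeros_like(w)
    e[j] = h
    # Central difference along coordinate j
    numeric[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * h)

max_abs_diff = np.max(np.abs(numeric - grad(w, X, y)))
```

Because the loss is quadratic in $w$, central differences agree with the analytic gradient up to floating-point rounding.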

We can set the gradient to $0$ and solve for $w$. The last step assumes $X^T X$ is invertible, which holds when $X$ has full column rank.

$$
\begin{align}
-2 X^T (y - Xw) &= 0 \\
X^T (y - Xw) &= 0 \\
X^T y &= X^T Xw \\
w &= (X^T X)^{-1} X^T y
\end{align}
$$

Thus, the least squares solution is the following.

$$w = (X^T X)^{-1} X^T y$$
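A minimal NumPy sketch of this formula follows. It solves the normal equations $X^T X w = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is usually the more numerically stable choice, and cross-checks the result against `np.linalg.lstsq`. The function name `least_squares` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y.

    Assumes X has full column rank so that X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(3)   # arbitrary seed
w_true = np.array([2.0, -3.0])   # hypothetical generating weights
X = rng.normal(size=(100, 2))
y = X @ w_true + rng.normal(scale=0.1, size=100)

w_hat = least_squares(X, y)

# Cross-check against NumPy's built-in least squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the solution the gradient -2 X^T (y - Xw) should vanish
grad_at_solution = -2 * X.T @ (y - X @ w_hat)
```

If $X^T X$ is singular or badly conditioned, `np.linalg.lstsq` is the safer option because it does not require the inverse to exist.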

### Verifying using Gradient Descent

Up to a positive scale factor and constants that do not depend on $w$, the negative log-likelihood is the squared error, so we minimize:

$$L(w) = ||y - Xw||^2$$

The gradient of the loss tells us the direction of steepest increase in loss, so moving in the opposite direction leads us toward lower loss.

$$\frac{d}{dw} L(w) = -2X^T(y - Xw)$$
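The update rule implied by this gradient is $w \leftarrow w - \eta \, \frac{d}{dw} L(w)$. The sketch below runs it on synthetic data and compares the result with the closed-form solution. The step size rule (one over the largest eigenvalue of $2 X^T X$), the iteration count, and the function name `gradient_descent` are assumptions made for this example, not choices taken from the notebook.

```python
import numpy as np

def gradient_descent(X, y, lr, n_steps):
    """Minimize ||y - Xw||^2 with fixed-step gradient descent, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = -2 * X.T @ (y - X @ w)  # gradient of the squared error
        w = w - lr * grad              # step against the gradient
    return w

rng = np.random.default_rng(4)  # arbitrary seed
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# The Hessian of the loss is 2 X^T X; a step of 1 / (its largest eigenvalue)
# keeps the iteration stable for this quadratic loss.
lr = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X).max())

w_gd = gradient_descent(X, y, lr, n_steps=500)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form solution for comparison
```

On a well-conditioned problem like this one, `w_gd` and `w_closed` should agree to many decimal places. Poorly conditioned inputs would need more steps.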

In [None]:
def gaussian_pdf(x, mu, sigma):
    """
    Gaussian probability density function

    Args:
        x: point(s) at which to evaluate the density
        mu: mean of the normal distribution
        sigma: standard deviation of the normal distribution

    Returns:
        The probability density at x
    """
    coef = 1 / (np.sqrt(2 * np.pi * sigma ** 2))
    exponent = -((x - mu) ** 2 / (2 * sigma ** 2))
    return coef * np.exp(exponent)

np.float64(0.3520653267642995)