# Curve fitting: Probabilistic perspective

## Overview

The previous section introduced the linear regression model. We fitted this model to the given data by
using least squares, that is, by minimizing the sum of squared errors (_SSE_). In this section, we look at the same problem from
a probabilistic perspective.

## Curve fitting: Probabilistic perspective

We already know the goal of the linear regression model: to make predictions of the target variable $y$ given some new
input data $x$. Let's assume that, given $x$, the corresponding target $y$ follows a normal distribution with mean $\hat{y}(x, \mathbf{w})$ and variance $\sigma^2$, i.e.


$$p(y | x, \mathbf{w}, \sigma^2) = N(y | \hat{y}(x, \mathbf{w}), \sigma^2)$$



Recall that the vector $\mathbf{w}$ encompasses the model parameters.

Assuming the training points are drawn independently from this distribution, we can write the likelihood for the training inputs $\mathbf{x}$ and the corresponding targets $\mathbf{y}$ as

$$p(\mathbf{y} | \mathbf{x}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} N(y_i | \hat{y}(x_i, \mathbf{w}), \sigma^2)$$
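The likelihood is a product of many small numbers, so in practice it is usually evaluated through its logarithm. The following is a minimal sketch of that computation. The polynomial form of $\hat{y}(x, \mathbf{w})$, the function names `predict` and `gaussian_log_likelihood`, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def predict(x, w):
    """Polynomial model y_hat(x, w) = w[0] + w[1]*x + ... (an assumed form)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # np.vander with increasing=True builds the columns 1, x, x^2, ...
    return np.vander(x, N=w.size, increasing=True) @ w


def gaussian_log_likelihood(x, y, w, sigma2):
    """Logarithm of prod_i N(y_i | y_hat(x_i, w), sigma2)."""
    residuals = predict(x, w) - np.asarray(y, dtype=float)
    n = residuals.size
    sse = np.sum(residuals ** 2)
    return -0.5 * sse / sigma2 - 0.5 * n * np.log(2.0 * np.pi * sigma2)


# Synthetic data (hypothetical): a noisy line y = 1 + 2x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

# Parameters close to the generating ones should score a higher log-likelihood.
ll_true = gaussian_log_likelihood(x, y, w=[1.0, 2.0], sigma2=0.01)
ll_wrong = gaussian_log_likelihood(x, y, w=[0.0, 0.0], sigma2=0.01)
print("log-likelihood at w = [1, 2]:", ll_true)
print("log-likelihood at w = [0, 0]:", ll_wrong)
```

For a fixed $\sigma^2$, the only part of this quantity that changes with $\mathbf{w}$ is the sum of squared residuals.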

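To maximize the likelihood function, it is convenient to maximize its logarithm instead, which turns the product into a sum:

$$\ln p(\mathbf{y} | \mathbf{x}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( \hat{y}(x_i, \mathbf{w}) - y_i \right)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln (2\pi)$$

Only the first term depends on $\mathbf{w}$, so maximizing with respect to $\mathbf{w}$ to obtain $\mathbf{w}_{ML}$ is the same as minimizing the sum of squared errors. In other words, the _SSE_ error function arises as a consequence of maximizing the likelihood under the assumption of normally distributed noise [1].
We can also use maximum likelihood to determine the variance of the normal distribution. This gives, see [1],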

$$\sigma^2_{ML} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}(x_i, \mathbf{w}_{ML}) - y_i \right )^2$$
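The two maximum likelihood estimates can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: the polynomial features, the helper names `design_matrix` and `fit_maximum_likelihood`, and the synthetic data are all hypothetical.

```python
import numpy as np


def design_matrix(x, degree):
    """Columns 1, x, ..., x^degree (polynomial features are an assumption)."""
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)


def fit_maximum_likelihood(x, y, degree):
    """Return (w_ml, sigma2_ml) under the Gaussian noise model."""
    phi = design_matrix(x, degree)
    y = np.asarray(y, dtype=float)
    # Maximizing the likelihood over w is an ordinary least-squares problem.
    w_ml, *_ = np.linalg.lstsq(phi, y, rcond=None)
    # The variance estimate is the mean squared residual at w_ml
    # (it divides by N, not by N - 1).
    residuals = phi @ w_ml - y
    sigma2_ml = np.mean(residuals ** 2)
    return w_ml, sigma2_ml


# Synthetic data (hypothetical): a noisy line y = 1 + 2x with noise std 0.1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

w_ml, sigma2_ml = fit_maximum_likelihood(x, y, degree=1)
print("w_ML:", w_ml)
print("sigma^2_ML:", sigma2_ml)
```

With enough data, the estimated variance should land near the noise variance used to generate the targets (here $0.1^2 = 0.01$).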

Knowing $\sigma^2$ and $\mathbf{w}$ means that we have a probabilistic model at our disposal. We can thus obtain
a probability distribution over $y$ rather than only the point estimate given by $\hat{y}$.
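As a rough illustration of what a distribution over $y$ provides beyond a point estimate, the sketch below plugs fitted parameters into the Gaussian and reports an approximate 95% interval around each prediction. The straight-line model, the function name `predictive`, and the synthetic data are assumptions for the example. Note that this plug-in approach ignores the uncertainty in $\mathbf{w}$ itself.

```python
import numpy as np

# Hypothetical training data for a straight-line model y_hat(x, w) = w[0] + w[1] * x.
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 100)
y_train = 1.0 + 2.0 * x_train + rng.normal(scale=0.1, size=x_train.size)

# Maximum likelihood fit: least squares for w, mean squared residual for sigma^2.
phi = np.column_stack([np.ones_like(x_train), x_train])
w_ml, *_ = np.linalg.lstsq(phi, y_train, rcond=None)
sigma2_ml = np.mean((phi @ w_ml - y_train) ** 2)


def predictive(x_new, w, sigma2):
    """Mean and standard deviation of N(y | y_hat(x_new, w), sigma2)."""
    x_new = np.asarray(x_new, dtype=float)
    mean = w[0] + w[1] * x_new
    std = np.full_like(mean, np.sqrt(sigma2))
    return mean, std


mean, std = predictive([0.25, 0.75], w_ml, sigma2_ml)
# About 95% of a Gaussian's probability mass lies within 1.96 std of its mean.
lower, upper = mean - 1.96 * std, mean + 1.96 * std
for m, lo, hi in zip(mean, lower, upper):
    print(f"prediction {m:.3f}, approx. 95% interval [{lo:.3f}, {hi:.3f}]")
```

Because the variance is a single constant, the interval has the same width at every input; only its center moves with $x$.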

### Maximum posterior

Let's further assume the following prior distribution for $\mathbf{w}$

$$p(\mathbf{w} | \alpha) = N(\mathbf{w} | \mathbf{0}, \Sigma),~~ \Sigma = \alpha^{-1} \mathbf{I} $$

$\alpha$ is a hyperparameter that controls the precision of the prior and has to be specified beforehand. By Bayes' theorem, the posterior distribution of $\mathbf{w}$ is proportional to the product of the likelihood and the prior, see [1],

$$p(\mathbf{w}| \mathbf{x}, \mathbf{y}, \alpha, \sigma^2) \propto p(\mathbf{y} | \mathbf{x}, \mathbf{w}, \sigma^2) \, p(\mathbf{w} | \alpha)$$

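We can use maximum a posteriori (MAP) estimation to find the most probable value of $\mathbf{w}$ given the data. Taking the negative logarithm of the posterior and dropping the terms that do not depend on $\mathbf{w}$, the maximum of the posterior is attained at the minimum of the following expression, see [1],

$$\mathbf{w}_{MAP} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(\hat{y}(x_i, \mathbf{w}) - y_i \right )^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} \right]$$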

Thus maximizing the posterior distribution is equivalent to minimizing the _SSE_ plus a quadratic penalty on the weights, that is, regularized least squares with regularization parameter $\lambda = \alpha \sigma^2$.
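When the model is linear in $\mathbf{w}$, so that $\hat{y}(x_i, \mathbf{w}) = \boldsymbol{\phi}(x_i)^T \mathbf{w}$ for some feature vector $\boldsymbol{\phi}$, setting the gradient of the regularized objective to zero gives the linear system $(\Phi^T \Phi + \alpha \sigma^2 \mathbf{I})\,\mathbf{w} = \Phi^T \mathbf{y}$. The sketch below solves it with NumPy. The linear-in-parameters form, the degree-9 polynomial features, the function names, and the values of `alpha` and `sigma2` are assumptions chosen for illustration.

```python
import numpy as np


def fit_map(phi, y, alpha, sigma2):
    """MAP weights for a model linear in w, with prior N(w | 0, I / alpha)."""
    lam = alpha * sigma2  # effective regularization strength
    n_params = phi.shape[1]
    # Solve (Phi^T Phi + lam * I) w = Phi^T y instead of forming an inverse.
    return np.linalg.solve(phi.T @ phi + lam * np.eye(n_params), phi.T @ y)


def negative_log_posterior(w, phi, y, alpha, sigma2):
    """Objective minimized by MAP, up to terms that do not depend on w."""
    residuals = phi @ w - y
    return 0.5 * residuals @ residuals / sigma2 + 0.5 * alpha * w @ w


# Synthetic data (hypothetical): noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
phi = np.vander(x, N=10, increasing=True)  # degree-9 polynomial features

alpha, sigma2 = 1e-2, 0.04  # hypothetical hyperparameter and noise variance
w_map = fit_map(phi, y, alpha, sigma2)
w_ml, *_ = np.linalg.lstsq(phi, y, rcond=None)  # unregularized fit for comparison

print("||w_ML||  =", np.linalg.norm(w_ml))
print("||w_MAP|| =", np.linalg.norm(w_map))
```

Only the product $\alpha \sigma^2$ enters the solution, so different pairs of $\alpha$ and $\sigma^2$ with the same product yield the same $\mathbf{w}_{MAP}$. Setting $\alpha = 0$ removes the penalty and recovers the least-squares system.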

## References

1. Christopher M. Bishop, _Pattern Recognition and Machine Learning_, Springer, 2006.