## Least Squares is Maximum Likelihood

We've seen how to learn a linear model that minimizes the sum of squares error on the training dataset. For example, let's say we learn the true parameters of the dataset we have been working with:

\\[
\theta_0 = 100\\
\theta_1 = 5.0\\
\hat{y} = \theta_0 + \theta_1 x = 100 + 5.0 x
\\]

Now, we know from the training dataset that this relationship isn't always exactly true: practically no datapoints lie exactly on this line. For every training datapoint $x_i$, there is an error $y_i - \hat{y}_i$. My question is this: how do we account for this error? Why isn't $y_i$ exactly $\hat{y}_i$?

In one of the first readings, we said that we would assume that there was some normally distributed noise that was added to $\hat{y}_i$. The theory was that there were potentially many unobserved factors which impact $y_i$ in a way that cannot be accounted for by knowing $x_i$. We assume the error is normally distributed because the Central Limit Theorem says that the cumulative effect of many independent unobserved factors, none of which dominates the others, is approximately a single normally distributed variable.
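As a rough illustration of that Central Limit Theorem claim, here is a small simulation sketch. The choice of 30 factors, each uniform on $[-1, 1]$, is an assumption made purely for illustration; it says nothing about what the real unobserved factors look like.

```python
import math
import random

rng = random.Random(0)

NUM_FACTORS = 30      # hypothetical number of unobserved factors
NUM_SAMPLES = 20000

# Each sample is the sum of many small, independent, non-normal effects.
sums = [
    sum(rng.uniform(-1.0, 1.0) for _ in range(NUM_FACTORS))
    for _ in range(NUM_SAMPLES)
]

# A uniform(-1, 1) variable has variance 1/3, so the sum has variance NUM_FACTORS / 3.
std = math.sqrt(NUM_FACTORS / 3.0)

# For a normal distribution, roughly 68% of draws fall within one standard deviation.
fraction_within_one_std = sum(abs(s) <= std for s in sums) / NUM_SAMPLES
print(fraction_within_one_std)
```

Even though each individual factor is uniform rather than normal, the fraction printed should land close to the 68% expected of a normal variable.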

That is, we are saying that:

\\[
y \sim \theta_0 + \theta_1 x + \mathcal{N}(\mu = 0, \sigma = \sigma')
\\]

Here $\mathcal{N}(\mu = 0, \sigma = \sigma')$ means some noise that is generated from a normal distribution with mean zero and standard deviation $\sigma'$ (so the variance is $\sigma'^2$). The $\sim$ tilde means "is distributed as". This means:
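To make the generative story concrete, here is a minimal sketch that draws a dataset from this model. The value `sigma=10.0`, the input range `[0, 20]`, and the function name `simulate_dataset` are all assumptions for illustration; only $\theta_0 = 100$ and $\theta_1 = 5.0$ come from the example above.

```python
import random

def simulate_dataset(n, theta0=100.0, theta1=5.0, sigma=10.0, seed=0):
    """Draw n points (x, y) with y = theta0 + theta1 * x + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 20.0)        # assumed input range
        noise = rng.gauss(0.0, sigma)     # mean 0, standard deviation sigma
        data.append((x, theta0 + theta1 * x + noise))
    return data

dataset = simulate_dataset(1000)

# The errors y - (theta0 + theta1 * x) are exactly the noise draws,
# so their average should be close to zero.
residuals = [y - (100.0 + 5.0 * x) for x, y in dataset]
mean_residual = sum(residuals) / len(residuals)
print(mean_residual)
```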

\\[
\text{Pr}_\theta[Y = y | X = x]
=
\text{Pr}_\theta[\mathcal{N}(\mu = 0, \sigma = \sigma') =  y - (\theta_0 + \theta_1 x)]
\\]

By the definition of $\mathcal{N}$, that means (strictly speaking, since $y$ is continuous, the right-hand side is a probability *density*):

\\[
\text{Pr}_\theta[Y = y | X = x]
=
    \frac{1}{\sqrt{2\pi\sigma'^2}}
    \exp \left(
        -\frac{(y - (\theta_0 + \theta_1 x))^2}{2\sigma'^2}
    \right)
\\]

Don't let this big formula scare you. Let's note a couple things. First, this probability is maximized by $y = \theta_0 + \theta_1 x$. That means that $\theta_0 + \theta_1 x$ is the *most probable* value of $y$ given $x$.

Second, note that the probability that $y = \theta_0 + \theta_1 x + \epsilon$ is always equal to the probability that $y = \theta_0 + \theta_1 x - \epsilon$. That means that the probability distribution is *symmetric* around $\theta_0 + \theta_1 x$, so $\theta_0 + \theta_1 x$ is both the median and the mean value of $y$ given $x$.
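Both observations can be checked numerically. The sketch below evaluates the formula directly; the helper name `gaussian_density`, the value `sigma = 10.0`, and the test point `x = 3.0` are assumptions chosen for illustration.

```python
import math

def gaussian_density(y, x, theta0, theta1, sigma):
    """Density of y given x under y = theta0 + theta1 * x + N(0, sigma)."""
    residual = y - (theta0 + theta1 * x)
    normalizer = math.sqrt(2 * math.pi * sigma ** 2)
    return math.exp(-residual ** 2 / (2 * sigma ** 2)) / normalizer

theta0, theta1, sigma = 100.0, 5.0, 10.0
x = 3.0
center = theta0 + theta1 * x   # the line's prediction at x

peak = gaussian_density(center, x, theta0, theta1, sigma)
above = gaussian_density(center + 4.0, x, theta0, theta1, sigma)
below = gaussian_density(center - 4.0, x, theta0, theta1, sigma)

print(peak > above)    # the density is largest on the line
print(above == below)  # equal offsets above and below are equally likely
```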

Using this probability distribution, we can calculate a probability for the *entire training dataset* $\mathcal{D}$:

\\[
\text{Pr}_\theta[\mathcal{D}]
=
\prod_i \text{Pr}_\theta[Y = y_i | X = x_i]
= \prod_i 
    \frac{1}{\sqrt{2\pi\sigma'^2}}
    \exp \left(
        -\frac{(y_i - (\theta_0 + \theta_1 x_i))^2}{2\sigma'^2}
    \right)
\\]

I write $\text{Pr}_\theta[\mathcal{D}]$ because this is the probability of the dataset being generated, presuming that the true model is $y = \theta_0 + \theta_1 x + \mathcal{N}(0, \sigma')$. Different choices of $\theta$ would lead to different probability distributions.

We call $\text{Pr}_\theta[\mathcal{D}]$ the *likelihood* of $\theta$. It is not quite the same as the *probability* of $\theta$. We will explore that difference later when we learn about Bayesian statistics.

It is very common to consider the best $\theta$ to be the one that has the *maximum likelihood*. That is, we want to choose the $\theta$ under which it would be most probable to generate our dataset $\mathcal{D}$. This seems sensible: why would we want to prefer a $\theta'$ if it does a worse job at predicting a dataset like the one we trained on?

Whenever we have a big product of probabilities, it is common to work in the *negative log space*. This uses the following rule:

\\[
a \times b \times c
= \exp \left( \log \left(a \times b \times c\right)\right)
= \exp \left( \log a + \log b + \log c\right)
\\]

Likewise:

\\[
-\log \text{Pr}_\theta[\mathcal{D}]
=
- \sum_i \log \text{Pr}_\theta[Y = y_i | X = x_i]
\\]
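One practical reason for working with logs is numerical: a product of many small numbers underflows in floating point, while the sum of their logs does not. The sketch below shows this on simulated data; the dataset size, `sigma = 10.0`, and the helper names are assumptions for illustration.

```python
import math
import random

def density(y, x, theta0, theta1, sigma):
    """Normal density of y around the line theta0 + theta1 * x."""
    residual = y - (theta0 + theta1 * x)
    return math.exp(-residual ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def log_density(y, x, theta0, theta1, sigma):
    """Log of the same density, computed without exponentiating first."""
    residual = y - (theta0 + theta1 * x)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - residual ** 2 / (2 * sigma ** 2)

rng = random.Random(0)
xs = [rng.uniform(0.0, 20.0) for _ in range(1000)]
ys = [100.0 + 5.0 * x + rng.gauss(0.0, 10.0) for x in xs]

# Direct product of 1000 densities, each well below 1.
likelihood = 1.0
for x, y in zip(xs, ys):
    likelihood *= density(y, x, 100.0, 5.0, 10.0)

# Negative log likelihood as a sum of per-point terms.
nll = -sum(log_density(y, x, 100.0, 5.0, 10.0) for x, y in zip(xs, ys))

print(likelihood)  # underflows to 0.0
print(nll)         # a finite, usable number
```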

Now, the $\log$ function is a *monotonic transformation*: it is strictly increasing. That means that if $0 < a < b$, then $\log a < \log b$. Since I took the *negative* log, we have $a < b$ implies $-\log b < -\log a$. That means that maximizing the probability $\text{Pr}_\theta[\mathcal{D}]$ is the same as minimizing $-\log\text{Pr}_\theta[\mathcal{D}]$.
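Here is a small sketch of that equivalence on a toy dataset. The five data points, the fixed `THETA0` and `SIGMA`, and the list of candidate slopes are all made up for illustration; the dataset is kept tiny so the raw product does not underflow.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [103.0, 112.0, 113.0, 122.0, 124.0]   # made-up values near 100 + 5x
THETA0, SIGMA = 100.0, 10.0                # held fixed; only the slope varies

def likelihood(theta1):
    """Product of the per-point normal densities for a given slope."""
    product = 1.0
    for x, y in zip(xs, ys):
        residual = y - (THETA0 + theta1 * x)
        product *= math.exp(-residual ** 2 / (2 * SIGMA ** 2)) / math.sqrt(2 * math.pi * SIGMA ** 2)
    return product

def negative_log_likelihood(theta1):
    return -math.log(likelihood(theta1))

candidates = [3.0, 4.0, 5.0, 6.0, 7.0]
best_by_likelihood = max(candidates, key=likelihood)
best_by_nll = min(candidates, key=negative_log_likelihood)

print(best_by_likelihood, best_by_nll)  # the same candidate wins both ways
```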

So let us now work this out further:

\\[
\begin{align}
- \sum_i \log \text{Pr}_\theta[Y = y_i | X = x_i]
&=
-\sum_i \log
\Big[
    \frac{1}{\sqrt{2\pi\sigma'^2}}
    \exp \left(
        -\frac{(y_i - (\theta_0 + \theta_1 x_i))^2}{2\sigma'^2}
    \right)
\Big]
\\
&=
-\sum_i
\Big[
    \log \left( \frac{1}{\sqrt{2\pi\sigma'^2}} \right)
    -
    \left(
        \frac{(y_i - (\theta_0 + \theta_1 x_i))^2}{2\sigma'^2}
    \right)
\Big]
\end{align}
\\]
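Collecting terms, and writing $n$ for the number of training points (a symbol introduced here only for convenience), the same quantity can be written as:

\\[
-\log \text{Pr}_\theta[\mathcal{D}]
=
n \log \sqrt{2\pi\sigma'^2}
+
\frac{1}{2\sigma'^2} \sum_i (y_i - (\theta_0 + \theta_1 x_i))^2
\\]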

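For the purposes of minimization, we cannot change $\log \left( \frac{1}{\sqrt{2\pi\sigma'^2}} \right)$ by changing $\theta$. Therefore it is a constant: it shifts the value of the objective but not the location of its minimum. Likewise $2\sigma'^2$ is a positive constant, and dividing by it rescales the objective without moving the minimum. I will get rid of both, since they don't affect which $\theta$ is best: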

\\[
-\sum_i -(y_i - (\theta_0 + \theta_1 x_i))^2 = \sum_i (y_i - (\theta_0 + \theta_1 x_i))^2
\\]

And now I have gotten where I wanted to go. This is the sum of squares error. What this shows is that, under the assumption of normally distributed noise with a fixed $\sigma'$, minimizing the sum of squares error leads to the maximum likelihood estimate of $\theta$. That is: minimizing the sum of squares error will pick the $\theta$ which maximizes $\text{Pr}_\theta[\mathcal{D}]$.
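As a sanity check, the sketch below fits a line by least squares and confirms that nudging the fitted parameters in any direction only increases the negative log likelihood. It assumes the standard closed-form solution for simple linear regression as the least squares fitter, along with simulated data and `SIGMA = 10.0`; none of these specifics come from the text above.

```python
import math
import random

def fit_least_squares(xs, ys):
    """Closed-form simple linear regression (minimizes the sum of squares error)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

def negative_log_likelihood(theta0, theta1, xs, ys, sigma):
    """Gaussian negative log likelihood: a constant plus a scaled sum of squares."""
    n = len(xs)
    sse = sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(math.sqrt(2 * math.pi * sigma ** 2)) + sse / (2 * sigma ** 2)

rng = random.Random(0)
SIGMA = 10.0   # assumed noise standard deviation
xs = [rng.uniform(0.0, 20.0) for _ in range(1000)]
ys = [100.0 + 5.0 * x + rng.gauss(0.0, SIGMA) for x in xs]

theta0_hat, theta1_hat = fit_least_squares(xs, ys)
best_nll = negative_log_likelihood(theta0_hat, theta1_hat, xs, ys, SIGMA)

# Nudge the fitted parameters and recompute the negative log likelihood.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5), (0.5, -0.5)]
perturbed_nlls = [
    negative_log_likelihood(theta0_hat + d0, theta1_hat + d1, xs, ys, SIGMA)
    for d0, d1 in nudges
]

print(theta0_hat, theta1_hat)
print(all(best_nll < p for p in perturbed_nlls))
```

The least squares fit never looks at `SIGMA`, yet it still lands on the likelihood-maximizing $\theta$; this matches the derivation, where $\sigma'$ only enters through constants that were dropped.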