### Quadratic Optimization of One Variable

We've now seen the idea of Newton's Method. We saw how it works for a one-dimensional variable $x$, and we briefly mentioned how it might work for a multi-dimensional $x$.

Finding the best $\hat\theta_0, \hat\theta_1$ is solving a multi-dimensional minimization problem, even though our predictor variable is only one-dimensional. We will either have to use the multi-dimensional Newton's Method, or else Gradient Descent.

However, let's cheat for a moment and say we know that $\hat\theta_0 = 100$. That is the true value of $\theta_0$. If we pretend we magically know this, then we only need to find $\hat\theta_1$, which is a one-dimensional minimization problem.

Minimizing a quadratic function of one variable is ridiculously simple. Look:

\\[
\begin{align}
f(x) &= a x^2 + bx + c\\
\frac{\partial f}{\partial x} &= 2a x + b\\
\frac{\partial^2 f}{\partial x^2} &= 2a\\
\end{align}
\\]

Now look at that. Newton's method in optimization says you should find the zero of $f'$. The way it does this is by using the tangent line at $x_i$ and seeing where that hits zero. Newton's Method normally has to do multiple steps because the slope normally changes as you change $x_i$.

However, in the case of a quadratic function, the second derivative is *constant*: $2a$. That means the slope of $f'$ never changes, so the first derivative is truly a line. The tangent line to $f'$ at $x_0$ is then $f'$ itself, so the update $x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}$ is *exactly correct*! So univariate quadratic functions are special: when $a > 0$ (so that the parabola opens upward), a single step of Newton's Method finds the global minimum!
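Here is a minimal sketch of that claim in plain Python. The coefficients `a`, `b`, `c` and the starting point `x0` are arbitrary values chosen for illustration, and `newton_step` is a hypothetical helper name, not something from the course code.

```python
def newton_step(x, f_prime, f_double_prime):
    """One Newton update for minimization: x - f'(x) / f''(x)."""
    return x - f_prime(x) / f_double_prime(x)


# Illustrative coefficients for f(x) = a*x^2 + b*x + c, with a > 0.
a, b, c = 3.0, -12.0, 7.0


def f_prime(x):
    return 2 * a * x + b  # first derivative: a line in x


def f_double_prime(x):
    return 2 * a  # second derivative: constant


x0 = 50.0  # arbitrary starting point
x1 = newton_step(x0, f_prime, f_double_prime)  # first step
x2 = newton_step(x1, f_prime, f_double_prime)  # second step changes nothing

print(x1, x2, -b / (2 * a))  # all three are the vertex of the parabola
```

Whatever starting point you pick, the first step should land on the vertex $-b / (2a)$, and every later step stays there because $f'$ is zero at that point.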

### Finding $\hat\theta_1$ given $\hat\theta_0$

Okay, let's do this! Let's calculate the optimal $\hat\theta_1$.

To do this we need not only the first but also the second partial derivative of the error $E$ with respect to $\hat\theta_1$. Note:

\\[
\begin{alignat*}{3}
\frac{\partial E}{\partial \hat\theta_1} &= \sum_{i=1}^N 2 \left(
                                              \left( \hat\theta_0 + \hat\theta_1 x_i \right) - y_i
                                            \right) \big( x_i \big) &&= 0\\
\frac{\partial^2 E}{\partial \hat\theta_1^2} &= \sum_{i=1}^N 2 \big( x_i \big) \big( x_i \big)
\end{alignat*}
\\]
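The two derivatives above translate directly into a single Newton step for $\hat\theta_1$. The sketch below assumes $E$ is the sum of squared errors, which is what the first derivative implies. The dataset is synthetic and generated only for illustration: the sample size, the range of $x_i$, and the noise level are assumptions, not the values behind the animation.

```python
import random


def dE_dtheta1(theta0, theta1, xs, ys):
    """First partial derivative of the squared error w.r.t. theta1."""
    return sum(2 * ((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys))


def d2E_dtheta1(xs):
    """Second partial derivative w.r.t. theta1; it does not depend on theta1."""
    return sum(2 * x * x for x in xs)


# Synthetic data (assumed for illustration): y = 100 + 5x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [100.0 + 5.0 * x + rng.gauss(0, 25) for x in xs]

theta0 = 100.0       # the value we "magically" know
theta1_start = 0.0   # arbitrary initial guess

# One Newton step from the initial guess.
theta1_hat = theta1_start - dE_dtheta1(theta0, theta1_start, xs, ys) / d2E_dtheta1(xs)

# A second step should leave the estimate essentially unchanged.
theta1_again = theta1_hat - dE_dtheta1(theta0, theta1_hat, xs, ys) / d2E_dtheta1(xs)

print(theta1_hat, theta1_again)
```

Because $E$ is quadratic in $\hat\theta_1$, the first derivative evaluated at `theta1_hat` should be zero up to floating-point rounding, so the second step has nothing left to correct.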

In [1]:
%matplotlib inline
from examples.find_theta0_example import FindTheta0Animation

FindTheta0Animation.run()

Notice that we find the global minimum for $\hat\theta_1$ immediately after the first step. Nothing changes in subsequent steps. That's just like we said above!

Notice that the fitted $\hat\theta_1 = 5.05$ is not exactly the true $\theta_1 = 5.0$. The line with slope $5.05$ fits our data better than the real line! The average SSE is $588.69$ for $\hat\theta_1 = 5.05$ versus an average SSE of $596.33$ for the true $\theta_1 = 5.0$.

The reason is that a slightly wrong $\hat\theta_1$ is able to "explain away" some of what is really just Gaussian noise in the data. This is called *overfitting*, because the $\hat\theta_1$ has deviated from the true $\theta_1$ in order to reduce error that is simply the result of variation in unseen variables in the dataset.

Luckily, the overfitting is not very strong because we have a fairly large number of datapoints. Our estimate of $\theta_1$ is quite good.
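The "explaining away" effect can be reproduced on synthetic data. The sketch below is an illustration under assumed settings (sample size, range of $x_i$, noise level, and random seed are all made up), so it will not reproduce the $588.69$ and $596.33$ figures. It compares the squared error of the fitted slope with that of the true slope, using the closed form you get by setting $\partial E / \partial \hat\theta_1 = 0$ with $\hat\theta_0$ fixed.

```python
import random


def sse(theta0, theta1, xs, ys):
    """Sum of squared errors of the line theta0 + theta1 * x."""
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys))


# Synthetic data (assumed for illustration): y = 100 + 5x + Gaussian noise.
rng = random.Random(1)
true_theta0, true_theta1 = 100.0, 5.0
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [true_theta0 + true_theta1 * x + rng.gauss(0, 25) for x in xs]

# Solving dE/dtheta1 = 0 for theta1, with theta0 held at its true value.
theta1_hat = sum(x * (y - true_theta0) for x, y in zip(xs, ys)) / sum(x * x for x in xs)

sse_fitted = sse(true_theta0, theta1_hat, xs, ys)
sse_true = sse(true_theta0, true_theta1, xs, ys)

print("fitted slope:", theta1_hat)
print("SSE with fitted slope:", sse_fitted)
print("SSE with true slope:  ", sse_true)
```

The fitted slope minimizes the error on this particular sample, so its SSE can never be larger than the SSE of the true slope. The gap between the two is the part of the noise the fit has "explained away".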

Why does this happen? If we plotted $x_i$ vs just the *noise* $n_i$, then for our finite dataset, the line of best fit would have a small positive slope of $0.05$.

In [2]:
from examples.noise_dataset_example import NoiseDatasetAnimation

NoiseDatasetAnimation.run()

Now, we know that there isn't really any relationship between the $x_i$ and the noise: the noise is entirely independent of $x_i$, so $x_i$ is no use in predicting the noise. So the line of best fit having any non-zero slope is delusional.

However, for finite sample sizes, you will basically never see a correlation of exactly zero. You will see a non-zero "coincidental" correlation. Consider an extreme case: if you only had two examples in the dataset, then, since two points determine a line, you could always put a perfect line through them!

Luckily, this coincidental correlation shrinks as we collect more data. Because the spurious correlation is not "real," the chance patterns that produce it tend to average out as the sample grows.
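A small simulation can make this concrete. The sketch below fits an ordinary least-squares line (with intercept) to pure noise that is generated independently of $x_i$, and averages the absolute slope over many repetitions for several sample sizes. The sample sizes, number of trials, and noise level are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import random


def best_fit_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def mean_abs_noise_slope(n, trials, rng):
    """Average |slope| when fitting noise that is independent of x."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.uniform(0, 10) for _ in range(n)]
        noise = [rng.gauss(0, 25) for _ in range(n)]  # unrelated to xs
        total += abs(best_fit_slope(xs, noise))
    return total / trials


rng = random.Random(0)
results = {n: mean_abs_noise_slope(n, 200, rng) for n in (10, 100, 1000)}

for n, slope in results.items():
    print(f"N = {n:5d}: average |slope| of noise vs x = {slope:.3f}")
```

The true slope here is zero by construction, so any slope the fit reports is coincidental. Its typical size should fall as $N$ grows, roughly in proportion to $1/\sqrt{N}$.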