## Minimizing SSE

### optimal within a model
We can take the idea that a smaller SSE suggests a better model fit further.
Instead of using SSE to compare different models, we can use the SSE to evaluate different parameter values inside the same model.

Consider the same dataset as above and suppose we're fitting a simple linear regression.
Then our SSE becomes

\begin{align}
    \text{SSE}(y,\hat{y}_{i}) &= \sum_{i=1}^{N} [y_{i} - \hat{y}_{i}]^2\\
    \text{SSE}(y,x,\beta_{0},\beta_{1}) &= \sum_{i=1}^{N} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^2\\
\end{align}

where $y$ and $x$ are vectors of data.
The second line of the equation above replaces the predicted value $\hat{y}_{i}$ with the linear model used to make that prediction, $\beta_{0}+\beta_{1}x_{i}$.

Now our SSE is a function of the data, which cannot be changed, and of the parameters of our model, $\beta_{0}$ and $\beta_{1}$.
Changing $\beta_{0}$ or $\beta_{1}$ will change the value of the SSE.
One way to find a best-fit model is to find the parameter values that make the SSE as small as possible.
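To make this concrete, here is a minimal sketch in R that treats the SSE as a function of $\beta_{0}$ and $\beta_{1}$ while the data stay fixed. The data are simulated for illustration only (they are not the dataset used in this document), and the function name `sse` and the candidate parameter values are assumptions chosen for the example.

```r
# Hypothetical toy data, simulated for illustration (not this document's dataset).
set.seed(1)
x <- runif(50, -2, 2)
y <- 1 + 3 * x + rnorm(50, sd = 0.5)

# SSE as a function of the two parameters; the data x and y are held fixed.
sse <- function(beta0, beta1, x, y) {
  sum((y - (beta0 + beta1 * x))^2)
}

# Evaluate the SSE at a few candidate parameter pairs.
candidates <- data.frame(beta0 = c(0, 1, 1), beta1 = c(1, 1, 3))
candidates$SSE <- mapply(sse, candidates$beta0, candidates$beta1,
                         MoreArgs = list(x = x, y = y))
print(candidates)
```

The pair closest to the values used to simulate the data should give the smallest SSE of the three, which is the sense in which a smaller SSE points to better parameter values.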


### derivative

SSE is a function of $\beta_{0}$ and $\beta_{1}$, and can be minimized by taking the derivative with respect to each parameter and finding the point where both derivatives equal zero simultaneously.

We take the derivative with respect to $\beta_{0}$

\begin{align}
   \frac{ d \text{SSE}(\beta_{0},\beta_{1})}{d \beta_{0}} &= \frac{d}{d \beta_{0}} \sum_{i=1}^{N} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^2\\
   \frac{ d \text{SSE}(\beta_{0},\beta_{1})}{d \beta_{0}} &= \sum_{i=1}^{N} \frac{d}{d \beta_{0}} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^2\\
  \frac{ d \text{SSE}(\beta_{0},\beta_{1})}{d \beta_{0}}  &= \sum_{i=1}^{N} -2[y_{i} - (\beta_{0} + \beta_{1}x_{i})]\\
\end{align}

The above derivative can be set to zero and solved for $\beta_{0}$, our variable.

\begin{align}
    \sum_{i=1}^{N} -2[y_{i} - (\beta_{0} + \beta_{1}x_{i})] &= 0\\
    \sum_{i=1}^{N} y_{i} - N\beta_{0} - \beta_{1}\sum_{i=1}^{N} x_{i} &=0\\
    N\beta_{0} &= \sum_{i=1}^{N} y_{i} - \beta_{1}\sum_{i=1}^{N} x_{i}\\
     \beta_{0} &= \bar{y} - \beta_{1}\bar{x}\\
\end{align}

The value of $\beta_{0}$ that minimizes the SSE is the average of our $y$ values minus the optimal $\beta_{1}$ times the average of our $x$ values.
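As a quick numerical check of this result, the sketch below evaluates the derivative with respect to $\beta_{0}$ at $\beta_{0} = \bar{y} - \beta_{1}\bar{x}$ for a few arbitrary values of $\beta_{1}$. The data are simulated and the helper name `d_sse_d_beta0` is an assumption made for this example.

```r
# Hypothetical toy data, simulated for illustration.
set.seed(2)
x <- rnorm(30)
y <- 2 - x + rnorm(30)

# Derivative of the SSE with respect to beta0, as derived above.
d_sse_d_beta0 <- function(beta0, beta1) {
  sum(-2 * (y - (beta0 + beta1 * x)))
}

# For any slope, choosing beta0 = mean(y) - beta1 * mean(x) zeroes the derivative.
for (beta1 in c(-1, 0, 2.5)) {
  beta0 <- mean(y) - beta1 * mean(x)
  cat("beta1 =", beta1, " derivative =", d_sse_d_beta0(beta0, beta1), "\n")
}
```

The printed derivatives should be zero up to floating-point rounding, whatever slope is plugged in.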

We must also take the derivative with respect to $\beta_{1}$.

\begin{align}
   \frac{ d \text{SSE}(\beta_{0},\beta_{1})}{d \beta_{1}} &= \frac{d}{d \beta_{1}} \sum_{i=1}^{N} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^2\\
   \frac{ d \text{SSE}(\beta_{0},\beta_{1})}{d \beta_{1}} &= \sum_{i=1}^{N} -2x_{i}[y_{i} - (\beta_{0} + \beta_{1}x_{i})]\\
\end{align}

The above equation can also be set to zero and solved for $\beta_{1}$.

\begin{align}
    \sum_{i=1}^{N} -2x_{i}[y_{i} - (\beta_{0} + \beta_{1}x_{i})] &=0 \\
    \sum_{i=1}^{N} \left[ x_{i}y_{i} - x_{i}\beta_{0} - \beta_{1}x^{2}_{i} \right] &=0 \\
\end{align}    
   
At this point we can substitute the optimal value for $\beta_{0}$ we derived.

\begin{align}
    \sum_{i=1}^{N} \left[ x_{i}y_{i} - x_{i}(\bar{y} - \beta_{1}\bar{x}) - \beta_{1}x^{2}_{i} \right] &=0 \\
    \sum_{i=1}^{N} x_{i}y_{i} - \sum_{i=1}^{N} x_{i}\bar{y} + \beta_{1}\sum_{i=1}^{N} x_{i}\bar{x} - \beta_{1}\sum_{i=1}^{N} x^{2}_{i} &=0 \\
    \beta_{1} \left(\sum_{i=1}^{N} x^{2}_{i} - \sum_{i=1}^{N} x_{i}\bar{x}\right) &=\sum_{i=1}^{N} x_{i}y_{i} - \sum_{i=1}^{N} x_{i}\bar{y}\\
    \beta_{1} &= \frac{\sum_{i=1}^{N} x_{i}y_{i} - \sum_{i=1}^{N} x_{i}\bar{y}}{\sum_{i=1}^{N} x^{2}_{i} - \sum_{i=1}^{N} x_{i}\bar{x}}
\end{align}

This equation for $\beta_{1}$ doesn't look like anything we can recognize, but we can rewrite it to look more familiar.
The key fact is that deviations from the mean sum to zero: $\sum_{i=1}^{N} (x_{i}-\bar{x}) = 0$ and $\sum_{i=1}^{N} (y_{i}-\bar{y}) = 0$.
Shifting every $x_{i}$ and every $y_{i}$ by a constant moves the fitted line but does not change its slope, so the optimal $\beta_{1}$ is unaffected.

From each data point, let's subtract $\bar{x}$ and $\bar{y}$, called centering our data.
Written in terms of the centered data, the above equation becomes

\begin{align}
    \beta_{1} &= \frac{\sum_{i=1}^{N} x_{i}y_{i} - \sum_{i=1}^{N} x_{i}\bar{y}}{\sum_{i=1}^{N} x^{2}_{i} - \sum_{i=1}^{N} x_{i}\bar{x}}\\
              &=\frac{\sum_{i=1}^{N} (x_{i}-\bar{x})(y_{i}-\bar{y}) + \bar{x}\sum_{i=1}^{N} (y_{i}-\bar{y})}{\sum_{i=1}^{N} (x_{i}-\bar{x})^{2} + \bar{x} \sum_{i=1}^{N} (x_{i}-\bar{x}) }\\
              &=\frac{\sum_{i=1}^{N} (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{N} (x_{i}-\bar{x})^{2} }\\
              &= \frac{Cov(X,Y)}{Var(X)}
\end{align}

After centering our data, we see that the optimal $\beta_{1}$ is the covariance between $x$ and $y$ divided by the variance of $x$; the $1/(N-1)$ factors in the sample covariance and sample variance cancel in the ratio.
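The closed-form expressions for $\beta_{0}$ and $\beta_{1}$ can be checked against R's built-in `lm`. This is a small sketch on simulated data (an assumption for illustration, not the dataset used elsewhere in this document); `cov`, `var`, `mean`, and `lm` are standard R functions, while the variable names are made up for the example.

```r
# Hypothetical toy data, simulated for illustration.
set.seed(3)
x <- rnorm(40)
y <- 0.5 + 2 * x + rnorm(40)

# Closed-form estimates from the derivation above.
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Reference fit from lm() for comparison.
fit <- lm(y ~ x)

print(c(beta0 = beta0_hat, beta1 = beta1_hat))
print(coef(fit))
```

Both printouts should show the same intercept and slope, up to floating-point rounding.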


We can also write the above in matrix form.
The covariance between X and Y is written 

$$
    Cov(X,Y) = \frac{1}{N-1} X'Y
$$
where $X = x-\bar{x}$ and $Y=y-\bar{y}$ are the centered data vectors, and the variance of X is written 

$$
Var(X) = \frac{1}{N-1} X'X.
$$

The factors of $1/(N-1)$ cancel in the ratio, so the expression for $\beta_{1}$ is

$$
\beta_{1} = (X'X)^{-1}(X'Y)
$$


If we instead let $X$ hold a column of $1$s followed by the original (uncentered) $x$ values, and use the original $y$, the same expression gives both $\beta_{0}$ and $\beta_{1}$.
In fact, this expression works for any design matrix $X$ for which $X'X$ is invertible.

So we can write

$$
\beta = (X'X)^{-1}(X'y)
$$
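As an illustration, the sketch below builds a two-column design matrix (a column of ones and the uncentered $x$) and evaluates $(X'X)^{-1}X'y$ literally, then compares the result with the scalar formulas for $\beta_{0}$ and $\beta_{1}$. The data are simulated and all variable names are assumptions for this example.

```r
# Hypothetical toy data, simulated for illustration.
set.seed(4)
x <- rnorm(25)
y <- -1 + 0.7 * x + rnorm(25)

# Design matrix: a column of ones (intercept) and the uncentered x.
X <- cbind(ones = 1, x = x)

# Matrix expression (X'X)^{-1} X'y, written out literally.
beta_matrix <- solve(t(X) %*% X) %*% t(X) %*% y

# Scalar formulas derived earlier.
beta1_scalar <- cov(x, y) / var(x)
beta0_scalar <- mean(y) - beta1_scalar * mean(x)

print(beta_matrix)
print(c(beta0_scalar, beta1_scalar))
```

The first entry of `beta_matrix` should agree with the scalar intercept and the second with the scalar slope.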

To see this more clearly, let's generalize our derivations of $\beta_{0}$ and $\beta_{1}$ to multiple $\beta$s.

We first form our SSE for multiple linear regression

$$
\text{SSE}(y,X,\beta_{0},\beta_{1},\ldots,\beta_{n}) = \sum_{i=1}^{N} [y_{i} - (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in} )]^2
$$

where $x_{ij}$ is observation $i$ for variable $j$.
Taking the derivative with respect to every $\beta$, setting it equal to $0$, and dividing through by $-2$, we have the following equations.

For $\beta_{0}$

\begin{align}
\sum_{i=1}^{N} 1[y_{i} - (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in})] &= 0\\
\sum_{i=1}^{N} (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in}) &= \sum_{i=1}^{N} 1 y_{i}\\
\end{align}

For $\beta_{1}$

\begin{align}
\sum_{i=1}^{N} x_{i1}[y_{i} - (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in})] &=0\\
\sum_{i=1}^{N} x_{i1} (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in}) &=\sum_{i=1}^{N} x_{i1}y_{i}\\
\end{align}

For $\beta_{2}$

\begin{align}
\sum_{i=1}^{N} x_{i2}[y_{i} - (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in})] &=0 \\
\sum_{i=1}^{N} x_{i2} (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in}) &=\sum_{i=1}^{N} x_{i2}y_{i}\\
\end{align}

and so on, up to $\beta_{n}$

\begin{align}
\sum_{i=1}^{N} x_{in}[y_{i} - (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in})] &=0 \\
\sum_{i=1}^{N} x_{in} (\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in}) &=\sum_{i=1}^{N} x_{in}y_{i}\\
\end{align}

The right hand side of this system of equations can be rewritten as

$$
X'y,
$$

and the left hand side can be rewritten as 

$$
    (X'X)\beta.
$$

Our system of equations is then

$$
 (X'X)\beta = X'y.
$$
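To see why the sums collapse into these matrix products, it may help to write the pieces out. The layout below assumes the usual convention that the first column of $X$ is all $1$s, so that $x_{i0} = 1$ for every observation $i$.

$$
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1n}\\
1 & x_{21} & x_{22} & \cdots & x_{2n}\\
\vdots & \vdots & \vdots & & \vdots\\
1 & x_{N1} & x_{N2} & \cdots & x_{Nn}
\end{pmatrix},
\qquad
\beta = \begin{pmatrix} \beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{n} \end{pmatrix},
\qquad
y = \begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{N} \end{pmatrix}
$$

Row $i$ of $X\beta$ is $\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{n}x_{in}$, and multiplying by $X'$ takes the inner product of each column of $X$ with that vector. Entry $j$ of each side is therefore

$$
\left[(X'X)\beta\right]_{j} = \sum_{i=1}^{N} x_{ij}(\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{n}x_{in}),
\qquad
\left[X'y\right]_{j} = \sum_{i=1}^{N} x_{ij}y_{i},
$$

which is exactly the equation obtained for $\beta_{j}$, with $j = 0, 1, \ldots, n$.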

We can solve for $\beta$ by left-multiplying each side by $(X'X)^{-1}$ (assuming $X'X$ is invertible)

$$
\beta = (X'X)^{-1}X'y
$$

which is the same solution for multiple linear regression that we found with simple linear regression.
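One way to convince ourselves that this $\beta$ really is a stationary point of the SSE is to check that the gradient, $-2X'(y - X\beta)$, vanishes there. The sketch below does this on simulated cubic data; the data-generating coefficients and all variable names are assumptions for illustration and are unrelated to any data file used in this document.

```r
# Hypothetical cubic data, simulated for illustration.
set.seed(5)
N <- 60
x <- runif(N, -1.5, 1.5)
y <- 0.3 + 2 * x + x^2 - 1.5 * x^3 + rnorm(N)

# Design matrix with an intercept column and polynomial terms.
X <- cbind(ones = 1, x = x, x2 = x^2, x3 = x^3)

# Solve the system (X'X) beta = X'y.
beta <- solve(t(X) %*% X, t(X) %*% y)

# Gradient of the SSE at beta: -2 X'(y - X beta). Each entry should be ~0.
gradient <- -2 * t(X) %*% (y - X %*% beta)
print(gradient)

# Moving away from beta in any direction should not lower the SSE.
sse <- function(b) sum((y - X %*% b)^2)
print(c(at_beta = sse(beta), perturbed = sse(beta + 0.01)))
```

The gradient entries should be zero up to floating-point rounding, and the perturbed SSE should be larger than the SSE at `beta`.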

We can verify that the above equation $(X'X)^{-1}X'y$ recovers the optimal $\beta$s using R.

```r
data <- read.csv('polynomialData.csv')
head(data)

# Fit a cubic regression with lm() for comparison.
cubicRegression <- lm(y ~ x + I(x^2) + I(x^3), data = data)

print("CUBIC REGRESSION")
print(cubicRegression)

# Build the design matrix by hand: a column of ones plus x, x^2, and x^3.
y    <- data$y
ones <- rep(1, length(data$x))
x    <- data$x
x2   <- data$x^2
x3   <- data$x^3

print("DESIGN MATRIX")
X <- cbind(ones, x, x2, x3)
print(head(X))

print("OPTIMAL BETAS")
# Solving (X'X) beta = X'y is the same as solving our system of equations.
optimalBetas <- solve(t(X) %*% X, t(X) %*% y)
print(round(optimalBetas, 4))
```

| x `<dbl>` | y `<dbl>` |
|---|---|
| 0.9958723 | 8.2420054 |
| -0.6556163 | 2.3114202 |
| -0.9176787 | 4.0842076 |
| 0.1963727 | -5.5386897 |
| 1.0309346 | 2.5166174 |
| 1.2610719 | -0.5388713 |


```
[1] "CUBIC REGRESSION"

Call:
lm(formula = y ~ x + I(x^2) + I(x^3), data = data)

Coefficients:
(Intercept)            x       I(x^2)       I(x^3)  
     0.2994       2.2877       1.0962      -1.4120  

[1] "DESIGN MATRIX"
     ones          x         x2           x3
[1,]    1  0.9958723 0.99176166  0.987667975
[2,]    1 -0.6556163 0.42983269 -0.281805304
[3,]    1 -0.9176787 0.84213426 -0.772808705
[4,]    1  0.1963727 0.03856225  0.007572574
[5,]    1  1.0309346 1.06282620  1.095704329
[6,]    1  1.2610719 1.59030227  2.005485460
[1] "OPTIMAL BETAS"
        [,1]
ones  0.2994
x     2.2877
x2    1.0962
x3   -1.4120
```


The beta coefficients we found from running a cubic regression match the beta coefficients from solving the system of equations above that minimize the SSE. 