<a href="https://colab.research.google.com/github/jjennings955/CSE5368-Spring-2019/blob/master/Mean_squared_error_Gradient.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Formulation 
$X = \{ x^{(1)}, x^{(2)}, ..., x^{(n)} \}$

$Y = \{ y^{(1)}, y^{(2)}, ..., y^{(n)} \}$


$\hat y = w^T x + b$

$p(y|x) = \mathcal{N}(y; \mu = \hat y, \sigma^2)$


Using this setup, we derived Mean Squared Error from maximum likelihood (derivation omitted).
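
For reference, here is a sketch of that omitted derivation. It assumes the samples are independent and that $\sigma^2$ is a fixed constant (both are assumptions added here, not stated above). The log-likelihood of the data is:

$\sum_{i} \log p(y^{(i)}|x^{(i)}) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i} (y^{(i)} - \hat y^{(i)})^2$

Only the last term depends on $w$ and $b$, so maximizing the log-likelihood is equivalent to minimizing the sum of squared errors. Dividing by the number of samples does not change the minimizer and gives the MSE below.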

# Objective function
$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} ||\hat y^{(i)} - y^{(i)}||^2$

Here $N$ is the number of samples (the $n$ in the definitions of $X$ and $Y$ above).
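
The objective can be sketched in a few lines of plain Python. The function names `predict` and `mse` and the toy data are hypothetical, chosen only for illustration.

```python
def predict(w, b, x):
    """Linear model for one sample: y_hat = w^T x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def mse(w, b, X, Y):
    """Mean of the squared differences between predictions and targets."""
    n = len(X)
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(X, Y)) / n


# Hypothetical toy data: three samples with two features each.
X = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
Y = [5.0, 2.0, 2.0]

print(mse([1.0, 2.0], 0.0, X, Y))  # weights that reproduce Y exactly
print(mse([0.0, 0.0], 0.0, X, Y))  # all-zero weights give a larger error
```

The first call uses weights that fit the toy data perfectly, so its error is zero; any other choice of weights gives a positive value.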

# 2D Example - Formulation as Optimization Problem
$w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$
$x^{(i)} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$

$w^*, b^* = \text{argmin}_{w,b} \frac{1}{N} \sum_{i} ||\hat y^{(i)} - y^{(i)}||^2$

In English: find the weights and bias values that minimize our mean squared error.

# Substitute the model equation for $\hat y$

$w^*, b^* = \text{argmin}_{w,b} \frac{1}{N} \sum_{i} ||w^T x^{(i)} + b - y^{(i)}||^2$

# The gradient of a sum is the sum of the gradients

From Calculus I: $\frac{d}{dx} (a+b) = \frac{d}{dx} (a) + \frac{d}{dx}(b)$

The same is true for partial derivatives: $\frac{\partial}{\partial x} (a+b) = \frac{\partial}{\partial x} (a) + \frac{\partial}{\partial x}(b)$

And for gradients: $\nabla_x (a+b) = \nabla_x (a) + \nabla_x(b)$

This implies $\nabla_x [\sum_i f_i(x)] = \sum_i \nabla_x [f_i(x)]$



# Complete gradient calculation
Using this, we obtain:

$\nabla_{w,b} (\frac{1}{N} \sum_{i} ||w^T x^{(i)} + b - y^{(i)}||^2) =  \frac{1}{N} \sum_{i} \nabla_{w,b} [w^T x^{(i)} + b - y^{(i)} ]^2 $

We solve this for a single term of the summation to determine a formula.

Since we are solving this for a 2D example, we expand the dot product ($w^T x = w_1 x_1 + w_2 x_2$):

$\nabla_{w,b} [w^T x^{(i)} + b - y^{(i)} ]^2 = \nabla_{w,b} [w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ]^2$

The gradient for a single term in this summation is calculated below using the chain rule:

$\frac{\partial}{\partial w_1}[w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ]^2 = 2[w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ] x^{(i)}_1$

$\frac{\partial}{\partial w_2}[w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ]^2 = 2[w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ] x^{(i)}_2$

$\frac{\partial}{\partial b}[w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ]^2 = 2[w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + b - y^{(i)} ]$
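
One way to sanity-check these three partial derivatives is to compare them against central finite differences. The sketch below does that for a single sample; the function names and the numeric values are hypothetical and only serve as an illustration.

```python
def squared_error(w1, w2, b, x1, x2, y):
    """One term of the sum: (w1*x1 + w2*x2 + b - y)^2."""
    return (w1 * x1 + w2 * x2 + b - y) ** 2


def analytic_partials(w1, w2, b, x1, x2, y):
    """Chain-rule partials with respect to w1, w2 and b."""
    err = w1 * x1 + w2 * x2 + b - y
    return [2 * err * x1, 2 * err * x2, 2 * err]


def numeric_partials(w1, w2, b, x1, x2, y, h=1e-6):
    """Central finite differences, used only as an independent check."""
    params = [w1, w2, b]
    partials = []
    for k in range(3):
        up = list(params)
        down = list(params)
        up[k] += h
        down[k] -= h
        diff = squared_error(*up, x1, x2, y) - squared_error(*down, x1, x2, y)
        partials.append(diff / (2 * h))
    return partials


# Hypothetical values chosen only for illustration.
w1, w2, b = 0.5, -1.0, 0.25
x1, x2, y = 2.0, 3.0, 1.0

analytic = analytic_partials(w1, w2, b, x1, x2, y)
numeric = numeric_partials(w1, w2, b, x1, x2, y)
print(analytic)
print(numeric)
```

The two printed lists should agree up to small floating-point error, which is a quick way to catch an indexing mistake in a hand-derived gradient.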



# Simplification

We can define our 'error':

$E(x, y) = (w^T x + b - y)$

Then our mean squared error is simply:

$\text{MSE} = \frac{1}{N} \sum_{i} E(x^{(i)}, y^{(i)})^2 $

Using the chain rule $(f(g(w)))' = f'(g(w))g'(w)$

Consider $g(w) = E(x, y)$, viewed as a function of $w$, and $f(u) = u^2$

$\nabla_w \text{MSE} = \begin{bmatrix} \frac{\partial \text{MSE}}{\partial w_1} \\ \frac{\partial \text{MSE}}{\partial w_2} \end{bmatrix} = \frac{1}{N} \sum_{i} [2 E(x^{(i)}, y^{(i)}) x^{(i)}]$

And similarly for the bias:

$\frac{\partial \text{MSE}}{\partial b} = \frac{1}{N} \sum_{i} [2 E(x^{(i)}, y^{(i)})]$
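
The two formulas above translate almost line by line into code. This is a minimal sketch in plain Python; `error`, `mse_gradient`, and the toy data are hypothetical names and values, not part of the course material.

```python
def error(w, b, x, y):
    """E(x, y) = w^T x + b - y for one sample."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b - y


def mse_gradient(w, b, X, Y):
    """Return (grad_w, grad_b): the averages of 2*E*x and 2*E over all samples."""
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(X, Y):
        e = error(w, b, x, y)
        for j, x_j in enumerate(x):
            grad_w[j] += 2 * e * x_j / n
        grad_b += 2 * e / n
    return grad_w, grad_b


# Hypothetical toy data: three samples with two features each.
X = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
Y = [5.0, 2.0, 2.0]

grad_w, grad_b = mse_gradient([0.0, 0.0], 0.0, X, Y)
print(grad_w, grad_b)

# At weights that reproduce Y exactly, every error is zero, so the gradient vanishes.
print(mse_gradient([1.0, 2.0], 0.0, X, Y))
```

With all-zero weights every prediction is too small, so all components of the gradient are negative: increasing the weights and the bias would reduce the error.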




# Further simplification ('the bias trick')

Some texts place the bias in the same vector as the weights, using the following trick:

$w = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix}$
$x = \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix}$

Now our model simply becomes

$\hat y = w^T x$

And our gradient becomes unified as follows:

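$\nabla_w \text{MSE} = \begin{bmatrix} \frac{\partial \text{MSE}}{\partial w_1} \\ \frac{\partial \text{MSE}}{\partial w_2} \\ \frac{\partial \text{MSE}}{\partial b} \end{bmatrix} = \frac{1}{N} \sum_{i} [2 E(x^{(i)}, y^{(i)}) x^{(i)}]$

Here the error is $E(x, y) = w^T x - y$, because $b$ is already contained in $w$. The last component of $x^{(i)}$ is 1, so the last row reproduces the bias derivative from the previous section.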
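
As an illustration of the bias trick, the sketch below appends a constant 1 to every sample and computes one gradient vector whose last entry is the bias derivative. The helper names `augment` and `mse_gradient_augmented` and the toy data are hypothetical.

```python
def augment(x):
    """Append a constant 1 so that the last weight acts as the bias."""
    return list(x) + [1.0]


def mse_gradient_augmented(w_aug, X_aug, Y):
    """Gradient of the MSE with respect to the augmented weight vector."""
    n = len(X_aug)
    grad = [0.0] * len(w_aug)
    for x, y in zip(X_aug, Y):
        # The bias is inside w_aug, so the error is simply w^T x - y.
        e = sum(w_j * x_j for w_j, x_j in zip(w_aug, x)) - y
        for j, x_j in enumerate(x):
            grad[j] += 2 * e * x_j / n
    return grad


# Hypothetical toy data: three samples with two features each.
X = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
Y = [5.0, 2.0, 2.0]

X_aug = [augment(x) for x in X]
grad = mse_gradient_augmented([0.0, 0.0, 0.0], X_aug, Y)
print(grad)  # first two entries: weight derivatives, last entry: bias derivative
```

The design benefit is that a single loop handles weights and bias uniformly, at the cost of storing one extra constant per sample.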

# Some clarifications

In some cases I may have swapped the order of $y$ and $\hat y$ in the $\text{MSE}$ formula. This makes no difference to the result.

$\text{MSE} = \frac{1}{N} \sum_{i} ||\hat y^{(i)} - y^{(i)}||^2$  vs $\text{MSE} = \frac{1}{N} \sum_{i} || y^{(i)} - \hat y^{(i)}||^2$

You can verify this by expanding $(\hat y - y)^2$ and $(y - \hat y)^2$.

The first gives $\hat y^2 - 2 \hat y y + y^2$ and the second gives $y^2 - 2 \hat y y +\hat y^2$, which are equal.



# Some other variations
In many texts and deep learning libraries, the convention is to write data as row vectors (as you would in an Excel spreadsheet), where each row represents one of $n$ **samples** and each column represents one of $m$ **features** of that sample.

So $x^{(1)} = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & ... & x^{(1)}_m \end{bmatrix}$ represents a single sample, for example the attributes of a single person: $x^{(1)}_1$ might be that person's height, $x^{(1)}_2$ that person's weight, and so on.

Using this convention, our model becomes

$\hat y = x w + b$

Additionally, this allows us to stack all of our data (or a batch of data for **Stochastic Gradient Descent**) into a single matrix and efficiently compute our model outputs using a single matrix multiplication. You may also use the bias trick here by appending a column of 1s, with $b$ as the last entry of $w$:

$X = \begin{bmatrix} 
  x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_m & 1 \\
  x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_m & 1 \\
  \vdots & \vdots & \ddots & \vdots & \vdots \\
  x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_m & 1
  \end{bmatrix}
$




We may then compute our model output for all $n$ samples with a single matrix-vector product:
 
$\hat Y = Xw$

Our result will be an $n \times 1$ vector, with one prediction per sample.
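
The row-vector convention can be sketched as follows. To stay self-contained this uses plain Python lists; `add_ones_column` and `matvec` are hypothetical helpers, and the data and weights are made up for illustration.

```python
def add_ones_column(X):
    """Append a 1 to every row so the last weight plays the role of the bias."""
    return [list(row) + [1.0] for row in X]


def matvec(X, w):
    """Matrix-vector product: one dot product per row (per sample)."""
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) for row in X]


# Hypothetical data: n = 3 samples (rows), m = 2 features (columns).
X = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]]
w = [1.0, 2.0, 0.5]  # [w_1, w_2, b]

Y_hat = matvec(add_ones_column(X), w)
print(Y_hat)  # [5.5, 2.5, 2.5]
```

With an array library such as NumPy, the same product is typically written `X @ w`, which is where the efficiency gain of this convention comes from.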

This is the more common convention in deep learning implementations; however, I consider it an *implementation detail*, and we will likely continue using $w^T x + b$ throughout the course.