# Matrix Algebra with Gradient Descent

### Introduction

In the last lesson, we saw how to descend a multidimensional cost curve with gradient descent.

In [136]:
import pandas as pd
url = "./cost_curve_three_d.json"

# df = pd.read_json(url)

# go.Figure(df.to_dict('records'))

And from there, we described our strategy as leaning in each direction, and then walking in proportion to the steepness in each direction.

<img src="./walk-downhill.png" width="30%">

We can think of this as the biggest bang for our buck approach -- the steeper the slope descends, the more we move in that direction.

### A review, and the gradient

We saw that this idea of leaning to the left and leaning to the right really meant taking the partial derivative with respect to each parameter.  So when our linear function is:

$z = w*x + b$

taking the partial derivative with respect to each parameter means taking the partial derivative first with respect to $w$ and then with respect to $b$.  Remember that we treat any variable we *are not* taking the partial derivative with respect to as just a number.

* So when taking the partial derivative $\frac{\delta z}{\delta w}$, we treat $b$ as if it were just a number, which gives us the following: 

    * $z = w*x + b$ and
    * $\frac{\delta z}{\delta w} = w^0*x + 0 = x$

* And when taking the partial derivative $\frac{\delta z}{\delta b}$, we treat $w$ as if it were just a number, which gives us the following:

    * $z = w*x + b$ and
    
    * $\frac{\delta z}{\delta b} = 0 + b^{0} = 1$
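
The two partial derivatives above can be checked numerically. The following is a minimal sketch, not part of the original lesson: the helper names `z_fn`, `partial_w`, and `partial_b` and the values of `w`, `x`, and `b` are made up for illustration. Each helper nudges one parameter while holding the others fixed, which is exactly the "treat it as just a number" idea.

```python
# Hypothetical helpers and arbitrary values, for illustration only.
def z_fn(w, x, b):
    return w * x + b

def partial_w(w, x, b, h=1e-6):
    # central difference: nudge w, hold x and b fixed
    return (z_fn(w + h, x, b) - z_fn(w - h, x, b)) / (2 * h)

def partial_b(w, x, b, h=1e-6):
    # central difference: nudge b, hold w and x fixed
    return (z_fn(w, x, b + h) - z_fn(w, x, b - h)) / (2 * h)

w, x, b = 2.0, 3.0, 1.0
dz_dw = partial_w(w, x, b)
dz_db = partial_b(w, x, b)
print(dz_dw, dz_db)  # approximately x (3.0) and 1.0
```

The estimate for $w$ should land close to the value of $x$, and the estimate for $b$ close to $1$, whatever starting values we pick.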

The partial derivatives that we found above can be represented as a vector, one that holds the rate of change in each direction.  We write this vector with the symbol nabla, $\nabla$, which represents our gradient.  So our gradient looks like the following:

$$\nabla z(w,b) = \begin{bmatrix}
\frac{\delta z}{\delta w} \\
  \frac{\delta z}{\delta b} 
\end{bmatrix}$$

Or, applying this to our linear function above, $z(w, b) = wx + b$, we get the following:

$$\nabla z(w,b) = \begin{bmatrix}
x \\
 1 
\end{bmatrix}$$

> The **gradient** of a function is a vector whose entries are the partial derivatives of the function.  It is the direction of fastest increase.  For gradient descent, we move in the direction of the negative gradient -- or the direction of greatest decrease.

So moving in the direction of greatest descent, applied to our function $z$ above, we get the following:

$$ - \nabla z(w,b) = \begin{bmatrix}
 -x \\
  -1 
\end{bmatrix}$$

Each entry here is the negative of the corresponding partial derivative in our gradient vector: $\nabla z(w,b) = \begin{bmatrix}
\frac{\delta z}{\delta w} \\
  \frac{\delta z}{\delta b} 
\end{bmatrix}$.
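
To see that stepping along the negative gradient really does lower $z$, here is a small sketch with arbitrary values. The step size `eta` and the starting values are assumptions chosen only for illustration.

```python
# Arbitrary illustrative values; eta is a hypothetical step size.
w, x, b = 2.0, 3.0, 1.0
eta = 0.1

grad = (x, 1.0)            # gradient of z = w*x + b with respect to (w, b)
z_before = w * x + b

# move each parameter against its partial derivative
w_new = w - eta * grad[0]
b_new = b - eta * grad[1]
z_after = w_new * x + b_new

print(z_before, z_after)   # z_after is smaller than z_before
```

Because $z$ is linear in $w$ and $b$, the drop in $z$ is exactly `eta * (x**2 + 1)`, the step size times the squared length of the gradient.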

### Extending to multiple weights 

Now remember that our linear function will often have multiple weights -- one for each feature in an observation.  So let's assume that we have a linear function that looks like the following: 

$z = w_1*x_1 + w_2*x_2 + b$

Then taking the partial derivative with respect to each parameter, we get the following:

*  $\frac{\delta z}{\delta w_1} = w_1^0*x_1 + 0 + 0 = x_1$

*  $\frac{\delta z}{\delta w_2} = 0 + w_2^0*x_2 + 0 = x_2$

*  $\frac{\delta z}{\delta b} = 0 + 0 + b^0 = 1$

Because we treat every parameter other than the one we are differentiating with respect to as just a number, each term that does not contain that parameter becomes zero.  So if we want to express our gradient, it looks like the following:

$\nabla z(w_1,w_2, b) = \begin{bmatrix}
\frac{\delta z}{\delta w_1} \\
\frac{\delta z}{\delta w_2} \\
  \frac{\delta z}{\delta b} 
\end{bmatrix} = \begin{bmatrix}
x_1 \\
x_2 \\
  1
\end{bmatrix}$

So when we move to a linear function with multiple parameters, this is our gradient: the entries of the feature vector $x$, followed by a $1$ for the bias term.
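
The same pattern holds for any number of features. As a sketch, assuming $n$ features $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ (the index $j$ below is just a label for any one of the weights):

$$z = \sum_{i=1}^{n} w_i x_i + b, \qquad \frac{\delta z}{\delta w_j} = x_j, \qquad \frac{\delta z}{\delta b} = 1$$

$$\nabla z(w_1, \dots, w_n, b) = \begin{bmatrix}
x_1 \\
\vdots \\
x_n \\
1
\end{bmatrix}$$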

### Confirming in Pytorch

We can confirm this if we move to Pytorch.  We can translate our linear function above:

$z = w_1*x_1 + w_2*x_2 + b$

into an expression that uses the dot product (the code below uses three features rather than two):

$z = x \cdot w + b$

In [156]:
import torch

x = torch.tensor([1., 2., 3.])

w = torch.tensor([2., 4., 8.], requires_grad = True)

b = torch.tensor(3., requires_grad = True)

In [157]:
z = x @ w + b
z

tensor(37., grad_fn=<AddBackward0>)

And the derivative should just be the vector $x$ followed by the number 1 for the bias term.  We can see this in Pytorch.

In [158]:
z.backward()

In [159]:
w.grad

tensor([1., 2., 3.])

> So above, this is saying that the gradient of $z$ with respect to $w$ is just the vector $x$, $[x_1, x_2, x_3]$.

And if we look at the gradient of $b$, we see this is 1.

In [160]:
b.grad

tensor(1.)
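
As an aside, the same gradients can be requested directly with `torch.autograd.grad`, which returns them as a tuple rather than storing them on `.grad`. This is only a sketch that reuses the values from the cells above; the names `dz_dw` and `dz_db` are ours.

```python
import torch

x = torch.tensor([1., 2., 3.])
w = torch.tensor([2., 4., 8.], requires_grad=True)
b = torch.tensor(3., requires_grad=True)

z = x @ w + b

# returns one gradient per input, in the same order as the inputs list
dz_dw, dz_db = torch.autograd.grad(z, [w, b])
print(dz_dw, dz_db)
```

Again the gradient with respect to `w` matches `x`, and the gradient with respect to `b` is 1.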

### Extending To The Cost Curve

So above, we saw how to calculate the gradient of our linear function $z = w_1*x_1 + w_2*x_2 + b$, which we saw was a vector of partial derivatives with respect to each of our parameters:

$\nabla z(w_1,w_2, b) = \begin{bmatrix}
\frac{\delta z}{\delta w_1} \\
\frac{\delta z}{\delta w_2} \\
  \frac{\delta z}{\delta b} 
\end{bmatrix} = \begin{bmatrix}
x_1 \\
x_2 \\
  1
\end{bmatrix}$

Now this gradient indicates how the output of the linear function changes as we nudge any of our parameters.  We saw that for our weight vector $w$ this was just the vector $x$, and for the bias term it was just the number $1$.  Notice that this is essentially what we wrote in the previous lesson.  

In [164]:
def delta_z_delta_w(x):
    return x

In [162]:
def delta_z_delta_b():
    return 1

But here, `delta_z_delta_w` returns the entire vector $x$. 

In [165]:
delta_z_delta_w(x)

tensor([1., 2., 3.])

Now ultimately, we don't want to know how each of our parameters alters the linear function, but how each one alters the cost function.  A change in a parameter changes the output of our linear function, which changes the output of our activation function, which changes our cost.  By the chain rule, this looks like the following:

$\nabla J(w_1,w_2, b) = \begin{bmatrix}
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w_1} \\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w_2} \\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta b}
\end{bmatrix} = \begin{bmatrix}
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z}*x_1\\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z}*x_2\\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z}* 1
\end{bmatrix}$

We calculated the other two components in the previous lesson: how a change in the linear function's output affects the activation function, and how a change in the activation function's output affects the cost function.

In [163]:
def deriv_sigma(z):
    return torch.sigmoid(z)*(1 - torch.sigmoid(z))

def delta_J_delta_sigma(y_hat, y):
    return torch.sum(2*(y_hat - y))
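
As a quick sanity check on `deriv_sigma`, we can compare it with the derivative that autograd computes for the sigmoid. This is a sketch at one arbitrarily chosen test point, `z = 0.4`; the variable name `manual` is ours.

```python
import torch

def deriv_sigma(z):
    return torch.sigmoid(z) * (1 - torch.sigmoid(z))

z = torch.tensor(0.4, requires_grad=True)   # arbitrary test point
sigma = torch.sigmoid(z)
sigma.backward()                            # autograd's d(sigma)/dz lands in z.grad

manual = deriv_sigma(z.detach())            # hand-written formula, no graph tracking
print(manual, z.grad)
```

The two values should agree up to floating point rounding.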

So starting with our feature vector, our target, and our initial parameters:

In [210]:
x = torch.tensor([1., 2., 3.])
y = torch.tensor(1.)

w = torch.tensor([.5, .3, .1])

b = torch.tensor(-1.)

In [211]:
def linear_fn(w, x, b):
    return x @ w + b 

def activation_fn(z):
    return torch.sigmoid(z)

In [212]:
z = linear_fn(w, x, b)

y_hat = activation_fn(z)
y_hat

tensor(0.5987)

Multiplying the three derivatives together gives the gradient of the cost with respect to the weights, which is what we should update by:

In [234]:
grad_J_w = delta_z_delta_w(x)*deriv_sigma(z)*delta_J_delta_sigma(y_hat, y)
grad_J_w

tensor([-0.1928, -0.3857, -0.5785], grad_fn=<MulBackward0>)

So this tells us how we should update each of our terms: $w_1$, $w_2$, and $w_3$.

In [236]:
grad_J_b = delta_z_delta_b()*deriv_sigma(z)*delta_J_delta_sigma(y_hat, y)
grad_J_b

tensor(-0.1928, grad_fn=<MulBackward0>)
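
One way to sanity-check a hand-derived gradient like the one above is to estimate it numerically. The sketch below is an assumption-laden illustration: `cost_fn`, `h`, and the `numeric_grad_*` names are ours, and it uses `float64` so that the finite-difference estimate stays accurate. It nudges one parameter at a time and measures how the cost responds.

```python
import torch

def cost_fn(w, x, b, y):
    # same pipeline as in the lesson: linear function, sigmoid, squared error
    y_hat = torch.sigmoid(x @ w + b)
    return torch.sum((y - y_hat) ** 2)

# float64 keeps the finite-difference estimate accurate
x = torch.tensor([1., 2., 3.], dtype=torch.float64)
y = torch.tensor(1., dtype=torch.float64)
w = torch.tensor([.5, .3, .1], dtype=torch.float64)
b = torch.tensor(-1., dtype=torch.float64)

h = 1e-6
numeric_grad_w = torch.zeros_like(w)
for i in range(len(w)):
    nudge = torch.zeros_like(w)
    nudge[i] = h
    # central difference in the direction of weight i
    numeric_grad_w[i] = (cost_fn(w + nudge, x, b, y) - cost_fn(w - nudge, x, b, y)) / (2 * h)

numeric_grad_b = (cost_fn(w, x, b + h, y) - cost_fn(w, x, b - h, y)) / (2 * h)
print(numeric_grad_w, numeric_grad_b)
```

The estimates should be close to the chain-rule results shown above for the weights and the bias.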

### Checking with Pytorch

We can also get Pytorch to calculate this gradient with the `backward` method.

In [228]:
x = torch.tensor([1., 2., 3.])
y = torch.tensor(1.)

w = torch.tensor([.5, .3, .1], requires_grad = True)

b = torch.tensor(-1., requires_grad = True)

In [229]:
z = linear_fn(w, x, b)
y_hat = activation_fn(z)

Then we compute the cost, here the squared error, and call `backward` on it:

In [230]:
cost = torch.sum((y - y_hat)**2)

In [231]:
cost.backward()

In [232]:
w.grad

tensor([-0.1928, -0.3857, -0.5785])

In [233]:
b.grad

tensor(-0.1928)
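
Rather than comparing the printed tensors by eye, the agreement can be checked in code. The following sketch recomputes both versions side by side; the names `shared`, `manual_grad_w`, and `manual_grad_b` are ours, and the chain-rule pieces are written inline rather than through the helper functions.

```python
import torch

x = torch.tensor([1., 2., 3.])
y = torch.tensor(1.)
w = torch.tensor([.5, .3, .1], requires_grad=True)
b = torch.tensor(-1., requires_grad=True)

z = x @ w + b
y_hat = torch.sigmoid(z)
cost = torch.sum((y - y_hat) ** 2)
cost.backward()                      # autograd fills w.grad and b.grad

# hand-derived chain rule, computed without tracking gradients
with torch.no_grad():
    shared = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # dJ/dsigma * dsigma/dz
    manual_grad_w = shared * x                       # times dz/dw = x
    manual_grad_b = shared * 1                       # times dz/db = 1

print(torch.allclose(manual_grad_w, w.grad), torch.allclose(manual_grad_b, b.grad))
```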

So this is what occurs: we repeatedly calculate the gradient to see how to update each parameter.  If we call the `backward` function on the cost, like we did above, then Pytorch calculates how nudging each of our tensors affects the cost function.  From there, we just perform our gradient descent procedure like we did previously, repeatedly using the gradient to update our parameters.  Mathematically, we represent the set of our parameters as $\theta$, and our gradient descent formula looks like the following: 

$$ \theta = \theta - \eta \nabla J(\theta) $$

This just means that we repeatedly update each parameter by subtracting the partial derivative of the cost with respect to that parameter, like we saw above, multiplied by a learning rate $\eta$.  In code, we perform this with the following:

```python
w = torch.tensor([.5, .3, .1])
b = torch.tensor(-1.)

eta = .00005

for i in range(10):
    # each pass, recompute z and y_hat from the current parameters
    z = linear_fn(w, x, b)
    y_hat = activation_fn(z)

    # chain rule: dJ/dsigma * dsigma/dz, shared by every parameter
    delta = delta_J_delta_sigma(y_hat, y)*deriv_sigma(z)

    # x and y hold a single observation, so there is nothing to average over
    w = w - eta*delta*delta_z_delta_w(x)
    b = b - eta*delta*delta_z_delta_b()
```
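
The same procedure can also be driven by autograd instead of the hand-written derivative functions. The following is a minimal sketch, not the lesson's own code: the learning rate `eta = 0.1` is an assumption, chosen larger than the one above so that the drop in cost is visible within ten steps, and the `costs` list is ours.

```python
import torch

x = torch.tensor([1., 2., 3.])
y = torch.tensor(1.)
w = torch.tensor([.5, .3, .1], requires_grad=True)
b = torch.tensor(-1., requires_grad=True)

eta = 0.1      # hypothetical learning rate, for illustration
costs = []

for i in range(10):
    y_hat = torch.sigmoid(x @ w + b)
    cost = torch.sum((y - y_hat) ** 2)
    costs.append(cost.item())

    cost.backward()                 # fills w.grad and b.grad

    with torch.no_grad():           # update without recording the step in the graph
        w -= eta * w.grad
        b -= eta * b.grad

    # gradients accumulate across backward calls, so reset them each pass
    w.grad.zero_()
    b.grad.zero_()

print(costs[0], costs[-1])          # the cost shrinks over the ten steps
```

Zeroing the gradients matters here: without it, each call to `backward` would add to the values already stored in `w.grad` and `b.grad`.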

### Summary

In this lesson, we learned that to take the gradient means to take the partial derivative with respect to each parameter in the function.  In calculating the gradient, we started by calculating the gradient of the linear layer.  That is, we calculated the amount our linear function's output would change as we nudged each parameter.

For our function: 

$z = w_1x_1 + w_2x_2 + b$, the gradient is:

$\nabla z(w_1,w_2, b) = \begin{bmatrix}
\frac{\delta z}{\delta w_1} \\
\frac{\delta z}{\delta w_2} \\
  \frac{\delta z}{\delta b} 
\end{bmatrix} = \begin{bmatrix}
x_1 \\
x_2 \\
  1
\end{bmatrix}$

So one way to represent the gradient is simply as the feature vector $x$ followed by a $1$ for the bias term.

And we saw this in Pytorch:

In [241]:
w = torch.tensor([.5, .3, .1], requires_grad = True)
b = torch.tensor(-1., requires_grad = True)
x = torch.tensor([1., 2., 3.])

z = x @ w + b

In [243]:
z.backward()

In [244]:
w.grad, b.grad

(tensor([1., 2., 3.]), tensor(1.))

Then, to see how our cost function changes with a nudge in each parameter, we still took the partial derivative with respect to each parameter, but this time we also needed to apply the chain rule.  

$\nabla J(w_1,w_2, b) = \begin{bmatrix}
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w_1} \\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w_2} \\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta b}
\end{bmatrix} = \begin{bmatrix}
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z}*x_1\\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z}*x_2\\
\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z}* 1
\end{bmatrix}$

The derivatives $\frac{\delta J}{\delta \sigma}$ and $\frac{\delta \sigma}{\delta z}$ are exactly what we calculated in the previous lesson.  And so we were able to calculate the gradient of our cost function by making use of the previously determined derivatives:

In [245]:
grad_J_w = delta_z_delta_w(x)*deriv_sigma(z)*delta_J_delta_sigma(y_hat, y)
grad_J_w

tensor([-0.1928, -0.3857, -0.5785], grad_fn=<MulBackward0>)

And again, we saw the same results when we checked our work in Pytorch, this time by calling `backward` on our cost and then reading the gradient from `w.grad`.

```python
z = linear_fn(w, x, b)
y_hat = activation_fn(z)
cost = torch.sum((y - y_hat)**2)

cost.backward()
w.grad # tensor([-0.1928, -0.3857, -0.5785])
```