## setup 

```
x = [x1, x2]     # a single observation has 2 features
h = [h1, h2, h3] # hidden layer with 3 neurons with states h1, h2, h3
```

The weights of each of the three neurons form one column of `W`:

```
W = [[W11, W12, W13],
     [W21, W22, W23]]
```

Then we can compute the hidden layer

```
h = xW = [h1, h2, h3] # shape(1x3)
```

* x:  shape(1x2)
* W: shape(2x3)
* h: shape(1x3)

Then we run `h` through an output layer with weights `v=[v1, v2, v3]`, which first computes a weighted sum of the hidden states:

```
out = h@v = [h1 h2 h3] @ [v1 v2 v3] = h1*v1 + h2*v2 + h3*v3
```

To get a probability, we put the result through a sigmoid activation function:

```
prob = sigmoid(out)
```

We can put everything together:


```
f(x, W, v) = sigmoid(out)
           = sigmoid(h @ v)
           = sigmoid(xW @ v)
```
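As a minimal sketch, the forward pass above can be written out in plain Python so that every sum is visible. The numeric values of `x`, `W`, and `v` and the helper names `sigmoid` and `forward` are made up for illustration and are not part of the original notes.

```python
import math

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, v):
    """Return (h, out, prob) for one observation x.

    x: list of 2 features, W: 2x3 nested list, v: list of 3 output weights.
    """
    # h_k = sum_j x_j * W_jk  (row vector times matrix, shape 1x3)
    h = [sum(x[j] * W[j][k] for j in range(len(x))) for k in range(len(v))]
    # out = h . v  (a scalar)
    out = sum(h_k * v_k for h_k, v_k in zip(h, v))
    return h, out, sigmoid(out)

# Made-up values, chosen only for illustration
x = [1.0, 2.0]
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
v = [1.0, -1.0, 0.5]

h, out, prob = forward(x, W, v)
print("h    =", h)
print("out  =", out)
print("prob =", prob)
```

With these numbers `h` has three entries (one per neuron), `out` is a single number, and `prob` lies strictly between 0 and 1.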

Our goal is now to find `df/dW` and `df/dv`, because they are needed to update our weights `W` and `v`.

We first write down the local derivative of each step:

```
(1)   dprob/dout = sigmoid(out)*(1-sigmoid(out))
(2.1) dout/dv    = h
(2.2) dout/dh    = v
(3)   dh/dW      = x
```

Applying the chain rule to get `df/dW` and `df/dv`:

```
df/dv = dprob/dout * dout/dv
      = s(out)*(1-s(out)) * h

df/dW = dprob/dout * dout/dh * dh/dW
      = s(out)*(1-s(out)) * v * x
```
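A quick numerical check of step (1): the derivative of the sigmoid at `z` is `s(z) * (1 - s(z))`, where the second factor is one minus the sigmoid value, not the sigmoid of `1 - z`. The sketch below compares that formula with a central finite difference. The helper names and the test points are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, eps=1e-6):
    """Numerical derivative of a scalar function f at z."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Arbitrary test points
for z in (-2.0, 0.0, 0.45, 3.0):
    analytic = sigmoid_prime(z)
    numeric = central_difference(sigmoid, z)
    print(f"z={z:5.2f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to several decimal places at every test point.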

It is not obvious how the vector-vector product `v*x = [v1 v2 v3] * [x1 x2]` is defined. `df/dW` must have the same shape as `W`, i.e. (2x3). The only way to get that from a 3-element and a 2-element vector is the *outer* product `shape(2x1) @ shape(1x3) = shape(2x3)`, hence `x.T @ v.T` (treating `v` as a column vector). Why the chain rule produces an outer product here is not yet clear. Maybe it has to do with *Jacobians*?
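One way to see where the (2x3) shape comes from is to write the scalar out with indices: $\text{out} = \sum_k \sum_j x_j W_{jk} v_k$, so $\frac{\partial\, \text{out}}{\partial W_{jk}} = x_j v_k$, which is exactly the $(j,k)$ entry of the outer product of `x` and `v`. The sketch below checks this against finite differences in plain Python. The numeric values and helper names are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x, W, v):
    """Forward pass: sigmoid(x W v) for one observation."""
    h = [sum(x[j] * W[j][k] for j in range(len(x))) for k in range(len(v))]
    return sigmoid(sum(h_k * v_k for h_k, v_k in zip(h, v)))

def grad_W_analytic(x, W, v):
    """df/dW[j][k] = s'(out) * x[j] * v[k] (scaled outer product of x and v)."""
    p = f(x, W, v)
    ds = p * (1.0 - p)  # sigmoid derivative evaluated at out
    return [[ds * x_j * v_k for v_k in v] for x_j in x]

def grad_W_numeric(x, W, v, eps=1e-6):
    """Central differences, perturbing one entry of W at a time."""
    grad = [[0.0] * len(v) for _ in x]
    for j in range(len(x)):
        for k in range(len(v)):
            W_plus = [row[:] for row in W]
            W_minus = [row[:] for row in W]
            W_plus[j][k] += eps
            W_minus[j][k] -= eps
            grad[j][k] = (f(x, W_plus, v) - f(x, W_minus, v)) / (2.0 * eps)
    return grad

# Made-up values, chosen only for illustration
x = [1.0, 2.0]
W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
v = [1.0, -1.0, 0.5]

analytic = grad_W_analytic(x, W, v)
numeric = grad_W_numeric(x, W, v)
max_diff = max(abs(a - n)
               for row_a, row_n in zip(analytic, numeric)
               for a, n in zip(row_a, row_n))
print("shape:", len(analytic), "x", len(analytic[0]))
print("max |analytic - numeric|:", max_diff)
```

The analytic gradient has the same (2x3) shape as `W`, and the difference to the numerical gradient should be tiny.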

## example

<img style="max-width:400px;" src="https://i.imgur.com/sG5UBMq.png"></img>

$\vec h$ is a vector-valued function of $W$. We are looking for the derivative $\frac{\partial \vec h}{\partial W}$, which requires the partial derivative of *each* element of $\vec h$ with respect to *each* element of $W$. In total, that is `3*(2*3)=18` partial derivatives. We can store them in three (2x3) matrices, one per component: $\frac{\partial h_1}{\partial W}, \frac{\partial h_2}{\partial W}, \frac{\partial h_3}{\partial W}$.

<img style="max-width:500px;" src="https://i.imgur.com/oPLzvtc.png"></img>

When computing the elements of the matrices, we note that the following pattern emerges:

<img style="max-width:500px;" src="https://i.imgur.com/YZ7T3SQ.png"></img>

Using this, we can compute the partial derivatives:

<img style="max-width:500px;" src="https://i.imgur.com/thOmApu.png"></img>

Note that most elements of the three matrices are zero. In fact, the only non-zero elements of $\frac{\partial h_i}{\partial W_{jk}}$ are those of column $k=i$. Thus we could define a new matrix $J_{j,i}=\frac{\partial h_i}{\partial W_{ji}}$ that would store the *same* non-trivial information of the three (2x3) matrices, *efficiently* in a *single* (2x3) matrix:

<img style="max-width:500px;" src="https://i.imgur.com/BK4NXCP.png"></img>

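Strictly speaking, the full **Jacobian** of $\vec h$ with respect to $W$ is the collection of all 18 partial derivatives, i.e. the three (2x3) matrices together. The compact matrix $J$ equals the sum of those three matrices, which is the **vector-Jacobian product** (PyTorch calls it the *Jacobian product*) for a vector of ones.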

For $\vec x=[1, 2]$ this compact matrix is:

<img style="max-width:500px;" src="https://i.imgur.com/khVlJIY.png"></img>
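The sparsity pattern can be reproduced in a few lines of plain Python. This is only a sketch: the helper names `dh_dW` and `compact` are made up here, and it relies on $h_i = \sum_j x_j W_{ji}$, so that $\frac{\partial h_i}{\partial W_{jk}}$ is $x_j$ when $k = i$ and zero otherwise.

```python
def dh_dW(x, n_hidden):
    """Return n_hidden matrices; entry [i][j][k] is dh_i / dW_jk.

    Since h_i = sum_j x_j * W_ji, the derivative is x_j when k == i, else 0.
    """
    return [[[x[j] if k == i else 0.0 for k in range(n_hidden)]
             for j in range(len(x))]
            for i in range(n_hidden)]

def compact(full):
    """Keep only column i of matrix i: J[j][i] = dh_i / dW_ji."""
    n_hidden = len(full)
    n_in = len(full[0])
    return [[full[i][j][i] for i in range(n_hidden)] for j in range(n_in)]

x = [1.0, 2.0]
full = dh_dW(x, n_hidden=3)
for i, matrix in enumerate(full, start=1):
    print(f"dh{i}/dW =", matrix)

J = compact(full)
print("compact J =", J)

# Summing the three sparse matrices gives the same (2x3) matrix,
# because their non-zero columns do not overlap.
summed = [[sum(full[i][j][k] for i in range(3)) for k in range(3)]
          for j in range(2)]
print("sum of the three matrices =", summed)
```

Only 6 of the 18 entries are non-zero, and both the compact matrix and the element-wise sum come out as two rows filled with `x1` and `x2` respectively.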


```
# The same example in PyTorch
# https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html#optional-reading-tensor-gradients-and-jacobian-products

import torch

x = torch.tensor([1.0, 2.0]).reshape((1, -1))
W = torch.randn((2, 3), requires_grad=True)
h = x @ W
print(f"x: {x.shape}\nW: {W.shape}\nh: {h.shape}")

# Normally, we want the gradient of a *scalar* function w.r.t. some parameters.
# In that case we can simply call f.backward().
# Here, h is *vector-valued*, so h.backward(v) computes the *Jacobian product*
# v.T @ J for the given vector v, and not the full Jacobian, see
# https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html#optional-reading-tensor-gradients-and-jacobian-products
v = torch.ones_like(h)  # a vector of ones, not the output-layer weights from the setup
h.backward(v)
J = W.grad  # holds v.T @ J for v = (1, 1, 1)
print(f"\n=> Jacobian/Jacobian Product:\n{J}")
```

```
x: torch.Size([1, 2])
W: torch.Size([2, 3])
h: torch.Size([1, 3])

=> Jacobian/Jacobian Product:
tensor([[1., 1., 1.],
        [2., 2., 2.]])
```
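To compare the Jacobian product with the full Jacobian, the sketch below uses `torch.autograd.functional.jacobian`, which returns all partial derivatives at once. The variable names `full`, `ones`, and `vjp` are made up here, and the `einsum` contraction is just one assumed way of spelling out the sum over output components.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([[1.0, 2.0]])  # shape (1, 2)
W = torch.randn(2, 3)

# Full Jacobian of h = x @ W with respect to W.
# Its shape is output shape + input shape: (1, 3) + (2, 3) = (1, 3, 2, 3).
full = jacobian(lambda W_: x @ W_, W)
print(full.shape)

# Contract the output dimensions with a vector of ones. This is the
# vector-Jacobian product that h.backward(torch.ones_like(h)) accumulates
# into W.grad.
ones = torch.ones(1, 3)
vjp = torch.einsum("ab,abjk->jk", ones, full)
print(vjp)
```

The full Jacobian holds all 18 partial derivatives, while the contraction collapses them into a single (2x3) matrix with the same shape as `W`.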


**Why is the Jacobian product *not* the actual gradient?**

See https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html#optional-reading-tensor-gradients-and-jacobian-products

* "the gradient is subset of the Jacobian." 
* "the gradient can be seen as special case of the Jacobian, i.e. when the function is scalar"
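* "The Jacobian matrix is the matrix formed by the partial derivatives of a vector function. Its vectors are the gradients of the respective components of the function." => With the usual convention $J_{ij} = \frac{\partial f_i}{\partial x_j}$, row $i$ of the Jacobian is the gradient of component $f_i$. A scalar function has a single component, so its Jacobian is a single row: the gradient.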
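The relationship between gradient and Jacobian can be illustrated with a small finite-difference sketch in plain Python. The functions `g` and `scalar_fn` and the evaluation point are hypothetical examples, not taken from the quoted sources.

```python
def numerical_jacobian(func, x, eps=1e-6):
    """Jacobian of func at x by central differences.

    Row i holds the partial derivatives of output component i,
    i.e. row i is the gradient of that component.
    """
    n_out = len(func(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        x_plus = x[:]
        x_minus = x[:]
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac

# Hypothetical vector-valued function R^2 -> R^2
def g(p):
    a, b = p
    return [a * b, a + 3.0 * b]

# Hypothetical scalar function, wrapped as a 1-component vector
def scalar_fn(p):
    a, b = p
    return [a * a + b]

point = [2.0, 5.0]
J_g = numerical_jacobian(g, point)          # rows approximate [b, a] and [1, 3]
J_s = numerical_jacobian(scalar_fn, point)  # single row approximates [2a, 1]
print("Jacobian of g:        ", J_g)
print("Jacobian of scalar_fn:", J_s)
```

For the scalar function the Jacobian has exactly one row, and that row is its gradient.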