In [3]:
import torch
from torch.nn.functional import relu
from torch import sigmoid

torch.set_printoptions(sci_mode=False) # disable scientific notation

## (Helpful) References

* http://cs231n.stanford.edu/vecDerivs.pdf
* https://web.stanford.edu/class/cs224n/readings/gradient-notes.pdf

## Prerequisite 1: Vector-Matrix Gradient

Suppose we have 

* observation $\vec x$ shaped (1, D)
* weight matrix $W$ shaped (D, H)
* output $\vec h=\vec xW$ shaped (1, H)

and we're looking for $\frac{\partial \vec h}{\partial W}$

---

<img style="max-width:400px;" src="https://i.imgur.com/sG5UBMq.png"></img>

$\vec h$ is a vector-valued function. We're looking for the derivative $\frac{\partial \vec h}{\partial W}$. This requires us to calculate partial derivatives of *each* element of h with respect to *each* element of W. That is, we have to calculate $\frac{\partial h_1}{\partial W}, \frac{\partial h_2}{\partial W}, \frac{\partial h_3}{\partial W}$. 

Each derivative $\frac{\partial h_i}{\partial W}$ for $i=1,2,3$ consists of `(2*3)=6` partial derivatives because $W$ has `2*3` elements. These partial derivatives w.r.t. every element of matrix $W$ can be stored in a matrix of the same shape as $W$. Because $\vec h$ has three elements $h_1, h_2, h_3$, we will get *three* of those matrices: $\frac{\partial h_1}{\partial W}, \frac{\partial h_2}{\partial W}, \frac{\partial h_3}{\partial W}$.

<img style="max-width:500px;" src="https://i.imgur.com/oPLzvtc.png"></img>

Note that $h_i=\vec x \cdot W_{:,i}$. The calculation of $h_i$ only involves $x$ and column $i$ of $W$. Thus, when we take the derivative $\frac{\partial h_i}{\partial W}$, only the partial derivatives $\frac{\partial h_i}{\partial W_{j,k}}$ where $i=k$ are non-zero. We can summarise this as follows:

<img style="max-width:500px;" src="https://i.imgur.com/YZ7T3SQ.png"></img>

Using this, we can compute the partial derivatives:

<img style="max-width:500px;" src="https://i.imgur.com/thOmApu.png"></img>

These three matrices can now be used to update `W`. To update `W`, we would need to do three steps:

1. $W = W - \text{learning_rate}\cdot \frac{\partial h_1}{\partial W}$
1. $W = W - \text{learning_rate}\cdot \frac{\partial h_2}{\partial W}$
1. $W = W - \text{learning_rate}\cdot \frac{\partial h_3}{\partial W}$

or, equivalently,

$$
W = W - \text{learning_rate}*(\frac{\partial h_1}{\partial W} + \frac{\partial h_2}{\partial W} + \frac{\partial h_3}{\partial W})
$$

But there is a trick (see Section 3 of http://cs231n.stanford.edu/vecDerivs.pdf) that (1) simplifies the update step and (2) stores the three matrices more efficiently. Note that most elements of the three matrices are zero: the only non-zero elements $\frac{\partial h_i}{\partial W_{jk}}$ of $\frac{\partial h_i}{\partial W}$ are those where $k=i$. We can therefore define a new matrix $J_{j,i}=\frac{\partial h_i}{\partial W_{ji}}$ that stores the *same* non-trivial information of all three matrices in a *single* 2D matrix:

<img style="max-width:500px;" src="https://i.imgur.com/BK4NXCP.png"></img>

This matrix is the efficiently stored **Jacobian** $J=\frac{\partial \vec h}{\partial W}$. Using this matrix, the update-step simplifies to:

$$
W = W - \text{learning_rate}\cdot J
$$

---

For example, for input vector $\vec x=[1, 2]$ the efficiently stored Jacobian is:

<img style="max-width:500px;" src="https://i.imgur.com/khVlJIY.png"></img>

In [80]:
# The same example in pytorch
# https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html#optional-reading-tensor-gradients-and-jacobian-products

x = torch.Tensor([1, 2]).reshape((1, -1))
W = torch.randn((2, 3), requires_grad=True)
h = x@W
print(f"x: {x.shape}\nW: {W.shape}\nh: {h.shape}")

h.backward(torch.ones_like(h))
dhdW = W.grad 
print(f"dh/dW:\n{dhdW}")

x: torch.Size([1, 2])
W: torch.Size([2, 3])
h: torch.Size([1, 3])
dh/dW:
tensor([[1., 1., 1.],
        [2., 2., 2.]])


**Why is the jacobian product *not* the actual gradient?**

See https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html#optional-reading-tensor-gradients-and-jacobian-products

* "the gradient is subset of the Jacobian." 
* "the gradient can be seen as special case of the Jacobian, i.e. when the function is scalar"
* "The Jacobian matrix is the matrix formed by the partial derivatives of a vector function. Its vectors are the gradients of the respective components of the function." => The Jacobian stores the GRADIENTS of the components of the function in its rows!
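
To make the difference concrete, here is a minimal sketch that builds the *full* Jacobian of $\vec h=\vec xW$ and compares it with what `backward` stores. It assumes `torch.autograd.functional.jacobian` is available in the installed PyTorch version; the names `J_full` and `vjp` are only for illustration.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([[1., 2.]])   # shape (1, D) with D=2
W = torch.randn(2, 3)          # shape (D, H) with H=3

# Full Jacobian of h = x @ W w.r.t. W:
# one (D, H) matrix of partial derivatives per element of h
J_full = jacobian(lambda W_: x @ W_, W)
print(J_full.shape)            # output shape (1, H) followed by input shape (D, H)

# Summing over the output dimensions gives the matrix that
# h.backward(torch.ones_like(h)) stores in W.grad
vjp = J_full.sum(dim=(0, 1))
print(vjp)
```

So `W.grad` after `h.backward(torch.ones_like(h))` is not the full Jacobian but the Jacobian contracted with a vector of ones, i.e. the gradient of the scalar `h.sum()`.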

### Extension to multiple data points stored in a data matrix $X$

Previously, we had observation $\vec x$ and weight matrix $W$, and wanted to compute $\vec h = \vec x W$.

Now we have data matrix $X$ shaped $(N, D)$, which stores $N$ observations $\vec x_1, \dots, \vec x_N$ in its rows.

We have the same goal of calculating $H=XW$. 

Note that $H$ is now a (N, H) matrix. Each row of $H$ contains the hidden state $\vec h_i=\vec x_i W$, calculated using observation $i$.

In [391]:
N, D, n_hidden = 2, 2, 3

# create data matrix X
x1 = torch.tensor([1., 2.])
x2 = torch.tensor([10., 20.])
X = torch.cat([x1 , x2]).reshape(N, D)
X

tensor([[ 1.,  2.],
        [10., 20.]])

In [392]:
# init weight matrix W
W = torch.randn((D, n_hidden), requires_grad=True)
W

tensor([[ 0.4463, -0.7549, -0.3479],
        [-0.3383,  2.6471, -0.3953]], requires_grad=True)

In [393]:
# calculate matrix containing hidden states h1, h2
H = X@W
H

tensor([[ -0.2304,   4.5393,  -1.1385],
        [ -2.3036,  45.3927, -11.3849]], grad_fn=<MmBackward>)

In [394]:
# dH/dW 
H.backward(torch.ones_like(H)) # the upstream gradient must have the shape of H, not W
W.grad

tensor([[11., 11., 11.],
        [22., 22., 22.]])

=> `dH/dW` is the sum of the Jacobians of $\vec h_1, \vec h_2$.

Recall that

$$
H = 
\begin{bmatrix}
- \vec{h_1} - \\
- \vec{h_2} -
\end{bmatrix}
$$

where 

* $\vec h_1$ is the hidden-state computed using observation $\vec x_1$ (row 1 of $X$)
* $\vec h_2$ is the hidden-state computed using observation $\vec x_2$ (row 2 of $X$)

We previously derived the Jacobian of a vector-matrix product. We use this to get the Jacobians of $\vec h_1, \vec h_2$:

$$
\frac{\partial \vec h_1}{\partial W}=
\begin{bmatrix}
X_{1,1} & X_{1,1} & X_{1, 1} \\
X_{1,2} & X_{1,2} & X_{1, 2}
\end{bmatrix}
$$ 

and

$$
\frac{\partial \vec h_2}{\partial W}=
\begin{bmatrix}
X_{2,1} & X_{2,1} & X_{2, 1} \\
X_{2,2} & X_{2,2} & X_{2, 2}
\end{bmatrix}
$$ 

So that:

$$
\begin{align}
\frac{\partial H}{\partial W} & =
\begin{bmatrix}
X_{1,1} & X_{1,1} & X_{1, 1} \\
X_{1,2} & X_{1,2} & X_{1, 2}
\end{bmatrix}
+
\begin{bmatrix}
X_{2,1} & X_{2,1} & X_{2, 1} \\
X_{2,2} & X_{2,2} & X_{2, 2}
\end{bmatrix}
\\
& =
\begin{bmatrix}
X_{1,1}+X_{2,1} & X_{1,1}+X_{2,1} & X_{1,1}+X_{2, 1} \\
X_{1,2}+X_{2,2} & X_{1,2}+X_{2,2} & X_{1,2}+X_{2, 2}
\end{bmatrix}
\\
& = 
\begin{bmatrix}
1+10 & 1+10 & 1+10 \\
2+20 & 2+20 & 2+20
\end{bmatrix}
\\
& = 
\begin{bmatrix}
11 & 11 & 11 \\
22 & 22 & 22
\end{bmatrix}
\end{align}
$$

In [395]:
dH_dW = torch.sum(X.T, dim=1).repeat(n_hidden, 1).T # sum of jacobians
dH_dW

tensor([[11., 11., 11.],
        [22., 22., 22.]])
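
The sum of Jacobians can also be written as a single matrix product. A minimal sketch, assuming the upstream gradient passed to `backward` is called `G` (a name introduced here, not used elsewhere in these notes): the result stored in `W.grad` should equal `X.T @ G`.

```python
import torch

X = torch.tensor([[1., 2.], [10., 20.]])     # (N, D)
W = torch.randn(2, 3, requires_grad=True)    # (D, H)

H = X @ W
G = torch.ones_like(H)                        # upstream gradient, here all ones
H.backward(G)

# (D, N) @ (N, H) -> (D, H); the sum over the N observations
# happens inside the matrix product
manual = X.T @ G
print(torch.allclose(W.grad, manual))
```

With `G` all ones this reduces to the column sums of `X` repeated `n_hidden` times, which is what the `repeat` expression above computes.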

In [454]:
# extension where I push H through output layer
N, D, n_hidden = 2, 2, 3

X = torch.tensor([[1., 2.], [10., 20.]])
W = torch.randn((D, n_hidden), requires_grad=True)
wout = torch.randn((n_hidden), requires_grad=True)

# forward with storing gradients along the way
H = X@W
dH_dW = torch.sum(X.T, dim=1).repeat(n_hidden, 1).T # sum of N Jacobians

Hact = relu(H)
dHact_dH = (H > 0).to(torch.float32)

score = Hact@wout
dscore_dHact = wout.repeat(1, 2).reshape((2, 3))

score.backward(torch.ones_like(score)) # upstream gradient must match score's shape (N,)
W.grad

tensor([[  0.0000, -20.3211,  18.7920],
        [  0.0000, -40.6421,  37.5840]])

In [455]:
print(dscore_dHact)
print(dHact_dH)
print(dH_dW)
dscore_dW = dH_dW * dHact_dH * dscore_dHact
print(dscore_dW)

tensor([[-1.3943, -1.8474,  1.7084],
        [-1.3943, -1.8474,  1.7084]], grad_fn=<ViewBackward>)
tensor([[0., 1., 1.],
        [0., 1., 1.]])
tensor([[11., 11., 11.],
        [22., 22., 22.]])
tensor([[ -0.0000, -20.3211,  18.7920],
        [ -0.0000, -40.6421,  37.5840]], grad_fn=<MulBackward0>)
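
Note that the element-wise product above matches PyTorch only because both rows of `dHact_dH` are identical here (`x2` is 10 times `x1`, so both observations activate the same neurons). A hedged sketch of a form that should also hold when the ReLU masks differ per observation; the data values are chosen only for illustration:

```python
import torch
from torch.nn.functional import relu

torch.manual_seed(0)
X = torch.tensor([[1., 2.], [-3., 4.]])      # rows are not multiples of each other
W = torch.randn(2, 3, requires_grad=True)
wout = torch.randn(3, requires_grad=True)

H = X @ W
score = relu(H) @ wout                        # shape (N,)
score.backward(torch.ones_like(score))

mask = (H > 0).to(torch.float32)              # dHact/dH, one row per observation
dscore_dH = (mask * wout).detach()            # (N, H): wout broadcast over rows, masked per row
dscore_dW = X.T @ dscore_dH                   # (D, H): sums over the N observations
print(torch.allclose(W.grad, dscore_dW))
```

The key difference is that the per-observation mask is applied *before* summing over observations, instead of multiplying the already-summed `dH_dW`.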


In [94]:
def vector_matrix_derivative(W, x):
    """
    Function for computing the derivative of y=xW w.r.t. W,
    ie dy/dW.
        
    Args:
        W (torch.tensor): matrix shaped (D, H)
        x (torch.tensor): vector shaped (D) or matrix shaped (N, D)
    
    Returns:
        gradient dy/dW, a torch.tensor of the same shape as W
    
    Note:
        y=xW: output shaped (1, H) or (N, H)
        
    Explanation: 
        dy/dW will be the sum of the scalar-matrix derivatives
        (=Jacobians) dyi/dW, where yi is the i'th element of y.
    """
    # if x is a single observation vector shaped (D)
    if len(x.shape) == 1:
        dy_dW = x.reshape(-1, 1).repeat(1, W.shape[1]) 
    # if x is a data matrix shaped (N, D)
    elif len(x.shape) == 2:
        dy_dW = torch.sum(x, dim=0).reshape(-1, 1).repeat(1, W.shape[1])
    else:
        raise ValueError("x must be a (D,) vector or an (N, D) data matrix")
    return dy_dW

In [82]:
# test the function
x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
W = torch.randn((3, 4), requires_grad=True)
y = x@W
dy_dW = vector_matrix_derivative(W, x)
y.backward(torch.ones_like(y))
dy_dW_pytorch = W.grad

print(f"pytorch grad: \n {dy_dW_pytorch}\n our grad: \n {dy_dW}")

pytorch grad: 
 tensor([[5., 5., 5., 5.],
        [7., 7., 7., 7.],
        [9., 9., 9., 9.]])
 our grad: 
 tensor([[5., 5., 5., 5.],
        [7., 7., 7., 7.],
        [9., 9., 9., 9.]])


## Part 2: backprop of a single observation $\vec x$

<img style="max-width: 500px;" src="https://i.imgur.com/Y6R3i5L.png"></img>

* The hidden layer takes in $\vec x$ and calculates $\vec h=\vec xW$.
* Then $\vec h$ is activated: $\vec h_{act}=ReLU(\vec h)$.
* Then $\vec h_{act}$ is run through the output layer: $score=\vec h_{act}\vec w^{out}$.
* Then $score$, a scalar, is passed through a sigmoid function to get a probability $prob=\sigma(score)$.
* Then $prob$, a scalar, is compared with the target via the loss function $Loss(prob, target)=(prob-target)^2$.
* Updating the weights $W$ and $\vec w^{out}$ is then a matter of finding the derivatives `dLoss/dW` and `dLoss/dwout`.
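
Written out element by element, the chain rule for this network gives the following. This is a sketch that follows from the definitions in the list above; the indices $j$ (input feature) and $k$ (hidden neuron) are introduced here for illustration.

$$
\frac{\partial Loss}{\partial w^{out}_k}
= \underbrace{2(prob-target)}_{\partial Loss/\partial prob}
\cdot \underbrace{prob\,(1-prob)}_{\partial prob/\partial score}
\cdot \underbrace{h_{act,k}}_{\partial score/\partial w^{out}_k}
$$

$$
\frac{\partial Loss}{\partial W_{jk}}
= 2(prob-target)
\cdot prob\,(1-prob)
\cdot \underbrace{w^{out}_k}_{\partial score/\partial h_{act,k}}
\cdot \underbrace{\mathbb{1}[h_k>0]}_{\partial h_{act,k}/\partial h_k}
\cdot \underbrace{x_j}_{\partial h_k/\partial W_{jk}}
$$

Every factor is a scalar once the indices are fixed, which is why plain element-wise multiplication with broadcasting reproduces the result for a single observation.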

In [95]:
N, D, n_hidden = 1, 2, 3

# input vector and target
x = torch.Tensor([1, 2])
t = torch.Tensor([1])

# init the network
W = torch.randn((D, n_hidden), requires_grad=True)
wout = torch.randn(n_hidden, requires_grad=True)

# forward pass
h = x@W
hact = relu(h)
score = hact@wout
prob = sigmoid(score)
loss = (prob - t)**2

In [99]:
# forward pass, but storing gradients on the way
h = x@W
# dh_dW = x.repeat(n_hidden).reshape((n_hidden, D)).T 
dh_dW = vector_matrix_derivative(W, x)

hact = relu(h)
dhact_dh = (h > 0).to(torch.float32) # dReLU(x)/dx=1 if x>0 else 0

score = hact@wout
dscore_dwout = hact # required for dLoss/dwout
dscore_dhact = wout # required for dLoss/dW

prob = sigmoid(score)
dprob_dscore = sigmoid(score)*(1-sigmoid(score))

loss = (prob - t)**2
dloss_dprob = 2*(prob-t)

In [100]:
# compute the gradient dloss/dwout

# required intermediate derivatives for chain rule
print(f"dloss/dprob  = {dloss_dprob}")
print(f"dprob/dscore = {dprob_dscore}")
print(f"dscore/dwout = {dscore_dwout}")

# apply chain rule
dloss_dwout = dloss_dprob * dprob_dscore * dscore_dwout
print(f"=== RESULT ===\ndloss/dwout = {dloss_dwout}")

dloss/dprob  = tensor([-1.7710], grad_fn=<MulBackward0>)
dprob/dscore = 0.10139695554971695
dscore/dwout = tensor([0.0000, 1.9175, 0.0000], grad_fn=<ReluBackward0>)
=== RESULT ===
dloss/dwout = tensor([-0.0000, -0.3443, -0.0000], grad_fn=<MulBackward0>)


In [101]:
# compute the gradient dloss/dW

# required intermediate derivatives
print(f"dloss/dprob  \t= {dloss_dprob}")
print(f"dprob/dscore \t= {dprob_dscore}")
print(f"dscore/dhact \t= {dscore_dhact}")
print(f"dhact/dh \t= {dhact_dh}")
print(f"dh/dW \t\t=\n{dh_dW}")

# apply chain rule
dloss_dW = dloss_dprob * dprob_dscore * dscore_dhact * dhact_dh * dh_dW
print(f"=== RESULT ===\ndloss/dW = \n{dloss_dW}")

dloss/dprob  	= tensor([-1.7710], grad_fn=<MulBackward0>)
dprob/dscore 	= 0.10139695554971695
dscore/dhact 	= tensor([ 0.7842, -1.0667, -0.1411], requires_grad=True)
dhact/dh 	= tensor([0., 1., 0.])
dh/dW 		=
tensor([[1., 1., 1.],
        [2., 2., 2.]])
=== RESULT ===
dloss/dW = 
tensor([[-0.0000, 0.1916, 0.0000],
        [-0.0000, 0.3831, 0.0000]], grad_fn=<MulBackward0>)


In [102]:
# compare our manually computed gradients vs pytorch

# zero gradients & re-do forward pass
W.grad = None
wout.grad = None
h = x@W
hact = relu(h)
score = hact@wout
prob = sigmoid(score)
loss = (prob - t)**2
loss.backward()

# compare our gradients vs pytorch
print("====== dLoss/dwout ======")
print(f"pytorch grad: \t{wout.grad}\nour grad: \t{dloss_dwout}")
print("\n====== dLoss/dW ======")
print(f"pytorch grad: \n{W.grad}\n\nour grad: \n{dloss_dW}\n")

pytorch grad: 	tensor([-0.0000, -0.3443, -0.0000])
our grad: 	tensor([-0.0000, -0.3443, -0.0000], grad_fn=<MulBackward0>)

pytorch grad: 
tensor([[0.0000, 0.1916, 0.0000],
        [0.0000, 0.3831, 0.0000]])

our grad: 
tensor([[-0.0000, 0.1916, 0.0000],
        [-0.0000, 0.3831, 0.0000]], grad_fn=<MulBackward0>)



**=> WE GET THE SAME GRADIENTS AS PYTORCH**

Notes: 

* TODO: understand how pytorch implements `scalar * vector` and `vector * matrix`, and how this translates to actual mathematical notation on paper! Otherwise I don't fully understand what happens when we apply the chain rule, where I multiply differently shaped scalars/vectors/matrices.
* ReLU activation function
    * The derivative of ReLU(x) is interesting. It's either 1 or 0. If it's 1, the gradient is backpropagated *as is* from that point further backwards. If it's 0, the gradient becomes 0 and nothing is backpropagated from that point further backwards.
    * ReLU(x) only lets gradients of *activated* neurons through. Here, activated means x>0.
    * ReLU mitigates the **vanishing gradient problem**. The vanishing gradient problem exists especially in deeper networks that use e.g. sigmoid activation functions, because the *derivative* of the sigmoid is ~0 for both very high and very low x (and never exceeds 0.25), so gradients shrink as they are multiplied through the layers. The derivative of ReLU is 0 only for x<0 and exactly 1 for x>0.
    * We want neurons to fire when they detect their pattern or "concept". When the pattern is *very* present, we want big positive numbers. When it is not present, the neuron should not fire at all. Intuitively, a pattern cannot be "less" than not present. And if 0 represents "not present", then a value *below* 0 does not make sense. Hence we can intuitively justify setting ReLU(x)=0 if x<0.
    * **dying relu problem**: when there is e.g. a *large negative bias term* that always makes the pre-activation negative. In that case, we always get ReLU(a)=0 for that neuron, and so the neuron *never* learns.
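
A small sketch to illustrate the vanishing-gradient point: it compares the derivative of the sigmoid with the derivative of ReLU at a few hand-picked inputs (the input values are arbitrary).

```python
import torch

x = torch.tensor([-10., -1., 0.5, 1., 10.])

s = torch.sigmoid(x)
dsigmoid = s * (1 - s)               # close to 0 for large |x|, never above 0.25
drelu = (x > 0).to(torch.float32)    # exactly 1 for every x > 0, else 0

print(dsigmoid)
print(drelu)
```

Multiplying many sigmoid derivatives together across layers shrinks the gradient quickly, while the ReLU derivative passes it through unchanged wherever the neuron is active.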

## backprop of multiple observations stored in data matrix X

* repeat the stuff we did with a single observation, but for N observations.

In [140]:
# input matrix and targets (now a batch of 2 observations)
X = torch.Tensor([[1, 2], [3, 4]])
t = torch.arange(X.shape[0])

N, D = X.shape
n_hidden = 3

# init the network (same architecture as before)
W = torch.randn((D, n_hidden), requires_grad=True)
wout = torch.randn(n_hidden, requires_grad=True)

# forward pass
h = X@W               # X: shape(N, D), W: shape(D, H)   => H: shape(N, H)
hact = relu(h)        # H: shape(N, H)                   => Hact: shape(N, H)
score = hact@wout     # Hact: shape(N, H), wout=shape(H) => score: shape(N)
prob = sigmoid(score) # score: shape(N)                  => prob: shape(N)
loss = (prob - t)**2  # prob: shape(N), t: shape(N)      => loss: shape(N)

In [143]:
# forward pass, but storing gradients along the way
H = X@W
dH_dW = vector_matrix_derivative(W, X)

Hact = relu(H) # shape(N, H)
dHact_dH = (H > 0).to(torch.float32)

score = Hact@wout # shape(N)
dscore_dwout = Hact
dscore_dHact = vector_matrix_derivative(Hact, wout)
print(score.shape, dscore_dHact.shape)

prob = sigmoid(score)
dprob_dscore = sigmoid(score)*(1-sigmoid(score))

loss = (prob - t)**2
dloss_dprob = 2*(prob-t)

torch.Size([2]) torch.Size([3, 3])


In [130]:
# compute the gradient dloss/dwout

# required intermediate derivatives
print(f"dloss/dprob  = {dloss_dprob}")
print(f"dprob/dscore = {dprob_dscore}")
print(f"dscore/dwout =\n {dscore_dwout}\n")

# apply chain rule
dloss_dwout = (dloss_dprob * dprob_dscore * dscore_dwout.T).T
dloss_dwout = torch.sum(dloss_dwout, dim=0) # assumption: loss.sum()!
print(f"dloss/dwout =\n {dloss_dwout}")

dloss/dprob  = tensor([ 0.8328, -1.2002], grad_fn=<MulBackward0>)
dprob/dscore = tensor([0.2430, 0.2400], grad_fn=<MulBackward0>)
dscore/dwout =
 tensor([[0.7286, 0.0000, 0.7915],
        [1.0827, 0.0000, 1.6188]], grad_fn=<ReluBackward0>)

dloss/dwout =
 tensor([-0.1644,  0.0000, -0.3061], grad_fn=<SumBackward1>)


In [131]:
# compare with pytorch
W.grad = None
wout.grad = None
h = X@W
hact = relu(h)
score = hact@wout
prob = sigmoid(score)
loss = (prob - t)**2
loss.sum().backward()

# compare our gradients vs pytorch
print("====== dLoss/dwout ======")
print(f"pytorch grad: \t{wout.grad}\nour grad: \t{dloss_dwout}")

pytorch grad: 	tensor([-0.1644,  0.0000, -0.3061])
our grad: 	tensor([-0.1644,  0.0000, -0.3061], grad_fn=<SumBackward1>)


In [138]:
# compute the gradient dloss/dW

# required intermediate derivatives
print(f"dloss/dprob  \t= {dloss_dprob}")
print(f"dprob/dscore \t= {dprob_dscore}")
print(f"dscore/dhact =\n {dscore_dHact}") # this could be wrong...
print(f"dhact/dh =\n {dHact_dH}")
print(f"dh/dW =\n {dH_dW}\n")

# apply chain rule
dloss_dW = dloss_dprob * dprob_dscore * dscore_dHact * dHact_dH * dH_dW
print(f"dloss/dW = \n{dloss_dW}")

dloss/dprob  	= tensor([ 0.8328, -1.2002], grad_fn=<MulBackward0>)
dprob/dscore 	= tensor([0.2430, 0.2400], grad_fn=<MulBackward0>)
dscore/dhact =
 tensor([[-0.6984, -0.6984, -0.6984],
        [ 0.4187,  0.4187,  0.4187],
        [ 0.2163,  0.2163,  0.2163]], grad_fn=<RepeatBackward>)
dhact/dh =
 tensor([[1., 0., 1.],
        [1., 0., 1.]])
dh/dW =
 tensor([[4., 4., 4.],
        [6., 6., 6.]])



RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
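
The error comes from mixing shapes: `dscore_dHact` is (3, 3) while the other factors are (N,), (N, H) and (D, H). One way that appears to resolve it, sketched under the assumption that the batch loss is `loss.sum()` (as in the PyTorch comparison cells): keep every per-observation derivative in shape (N, ...) and only sum over observations at the very end via `X.T @ ...`. The intermediate names `dloss_dscore` and `dloss_dH` are introduced here for illustration.

```python
import torch
from torch.nn.functional import relu

torch.manual_seed(0)
X = torch.tensor([[1., 2.], [3., 4.]])        # (N, D)
t = torch.arange(X.shape[0])                  # targets, shape (N,)
W = torch.randn(2, 3, requires_grad=True)     # (D, H)
wout = torch.randn(3, requires_grad=True)     # (H,)

H = X @ W
prob = torch.sigmoid(relu(H) @ wout)
loss = (prob - t) ** 2
loss.sum().backward()                         # assumption: batch loss is the sum

with torch.no_grad():
    dloss_dscore = 2 * (prob - t) * prob * (1 - prob)   # (N,)
    dscore_dH = (H > 0).float() * wout                  # (N, H), wout masked per observation
    dloss_dH = dloss_dscore[:, None] * dscore_dH        # (N, H)
    dloss_dW = X.T @ dloss_dH                           # (D, H), sums over observations

print(torch.allclose(W.grad, dloss_dW, atol=1e-6))
```

In this form `dscore/dHact` is simply `wout` for every observation, so no call to `vector_matrix_derivative` is needed for that step.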

In [136]:
# compare our gradients vs pytorch
W.grad = None
wout.grad = None
h = X@W
hact = relu(h)
score = hact@wout
prob = sigmoid(score)
loss = (prob - t)**2
loss.sum().backward()
print("====== dLoss/dW ======")
print(f"pytorch grad:\n{W.grad}\nour grad: \n{dloss_dW}")

pytorch grad:
tensor([[ 0.4621,  0.0000, -0.1432],
        [ 0.5220,  0.0000, -0.1617]])
our grad: 
tensor([[-0.0000, 0.1916, 0.0000],
        [-0.0000, 0.3831, 0.0000]], grad_fn=<MulBackward0>)


In [306]:
x = torch.tensor([0, 1])
A = torch.tensor([[1, 2, 3], [0, 10, 20]])
A.T

tensor([[ 1,  0],
        [ 2, 10],
        [ 3, 20]])

## what i know

* I know how to do backpropagation for a single example $\vec x$, but I cannot do it for a batch of observations $X$
* The problem lies in the calculation of derivatives. Probably I'm doing Vector-by-Matrix gradients wrong. 
    * http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds02.pdf
    

## old notes I don't want to delete yet

Here are the steps visualized:

```
x = [x1, x2]     # a single observation has 2 features
h = [h1, h2, h3] # hidden layer with 3 neurons with states h1, h2, h3
```

The weights of each of the three neurons correspond to a column

```
W = [[W11, W12, W13],
     [W21, W22, W23]]
```

Then we can compute the hidden layer

```
h = xW = [h1, h2, h3] # shape(1x3)
```

* x:  shape(1x2)
* W: shape(2x3)
* h: shape(1x3)

Then we run `h` through an output layer with a sigmoid activation function. That output layer has weights `v=[v1, v2, v3]`.

```
out = h@v = [h1 h2 h3] @ [v1 v2 v3] = h1*v1 + h2*v2 + h3*v3
```

To get a probability, we put the result through a sigmoid activation function:

```
prob = sigmoid(out)
```

We can put everything together:


```
f(x, W, v) = sigmoid(out)
           = sigmoid(h @ v)
           = sigmoid(xW @ v)
```

Our goal is now to find `dfdW` and `dfdv` because they are needed to update our weights `W` and `v`.

Using the chain rule, we find `dfdW` as follows:

```
(1)   dprob/dout = sigmoid(out)*(1-sigmoid(out))
(2.1) dout/dv    = h
(2.2) dout/dh    = v
(3)   dh/dW = x
```

Applying the chain rule to get `df/dW` and `df/dv`:

```
df/dv = dprob/dout * dout/dv 
      = s(out)*(1-s(out)) * h

df/dW = dprob/dout * dout/dh * dh/dW
      = s(out)*(1-s(out)) * v * x
```


Now it's unclear how the vector-vector product `v*x = [v1 v2 v3] * [x1 x2]` is defined. In the end, we must get a (2x3) matrix, the same shape as `W`. The only way to get that from a 3-element and a 2-element vector is the outer product `shape(2x1) @ shape(1x3) = shape(2x3)`, hence `x.T @ v.T` (with `x` as a (1x2) row vector and `v` as a (3x1) column vector). But it's unclear why... Maybe it has to do with *Jacobians*?
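
A possible explanation, offered as a sketch rather than a full derivation: `out` depends on `W[j, k]` only through `h[k]`, so `dout/dW[j, k] = dout/dh[k] * dh[k]/dW[j, k] = v[k] * x[j]`. Collecting these entries for all `j, k` is exactly the outer product of `x` and `v`. The check below leaves out the activation functions to isolate this one step and uses `torch.outer`.

```python
import torch

x = torch.tensor([1., 2.])                    # (D,)
W = torch.randn(2, 3, requires_grad=True)     # (D, H)
v = torch.randn(3, requires_grad=True)        # (H,)

out = (x @ W) @ v                             # scalar; no sigmoid, to isolate dout/dW
out.backward()

# manual[j, k] = x[j] * v[k], shape (D, H)
manual = torch.outer(x, v.detach())
print(torch.allclose(W.grad, manual))
```

With the sigmoid included, the scalar `dprob/dout` simply multiplies this outer product.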