Consider $y = x \cdot w + b$. Applying the rules of differentiation is straightforward:

$$dy/dx = w$$
$$dy/dw = x$$
$$dy/db = 1$$

The change in $y$ is proportional to the change in $x$: the larger $w$ is, the larger the change in $y$ for the same change in $x$.
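The three derivatives above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `linear` and `central_diff` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def linear(x, w, b):
    # Scalar linear function y = x * w + b
    return x * w + b

def central_diff(f, v, eps=1e-6):
    # Central finite-difference approximation of df/dv at v
    return (f(v + eps) - f(v - eps)) / (2 * eps)

x, w, b = 2.0, 3.0, 0.5  # arbitrary sample values

dy_dx = central_diff(lambda v: linear(v, w, b), x)  # should be close to w
dy_dw = central_diff(lambda v: linear(x, v, b), w)  # should be close to x
dy_db = central_diff(lambda v: linear(x, w, v), b)  # should be close to 1

print(dy_dx, dy_dw, dy_db)
```

Doubling `w` in this sketch doubles `dy_dx`, which is the proportionality described above.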

Let $L = f(y) = f(y(x))$. Applying the chain rule is equally straightforward:

$$dL/dx = df(y)/dy \cdot dy/dx = df(y)/dy \cdot w$$ 
$$dL/dw = df(y)/dy \cdot dy/dw = df(y)/dy \cdot x$$ 
$$dL/db = df(y)/dy \cdot dy/db = df(y)/dy \cdot 1$$
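As a worked instance of the chain rule, the sketch below assumes a concrete outer function $f(y) = y^2$ (the text leaves $f$ unspecified) and compares the analytic products with finite differences. All function names here are hypothetical.

```python
def y_of(x, w, b):
    # Inner function y = x * w + b
    return x * w + b

def f(y):
    # Assumed outer function for illustration: L = y ** 2
    return y ** 2

def df_dy(y):
    # Derivative of the assumed outer function
    return 2 * y

def central_diff(g, v, eps=1e-6):
    # Central finite-difference approximation of dg/dv at v
    return (g(v + eps) - g(v - eps)) / (2 * eps)

x, w, b = 2.0, 3.0, 0.5
y = y_of(x, w, b)
upstream = df_dy(y)  # df(y)/dy, shared by all three gradients

# Chain rule: upstream derivative times the local derivative
dL_dx = upstream * w
dL_dw = upstream * x
dL_db = upstream * 1.0

# Numerical estimates of the same quantities
num_dx = central_diff(lambda v: f(y_of(v, w, b)), x)
num_dw = central_diff(lambda v: f(y_of(x, v, b)), w)
num_db = central_diff(lambda v: f(y_of(x, w, v)), b)

print(dL_dx, dL_dw, dL_db)
```

The factor `upstream` is computed once and reused, which is the same pattern the `.g` attributes follow in the multidimensional case.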

The multidimensional case is not so simple. A function with multiple inputs and multiple outputs has one partial derivative per input-output pair, and these need to be arranged and stored properly. Applying this to batches of data complicates the picture even more.

Let $y = x \cdot W + b$ and $L = f(y)$

where

- $y$, $x$ and $b$ are row vectors and $W$ is a matrix;
- $x$ holds the $m$ input features;
- $W$ is a weight matrix with $m$ rows and $h$ columns;
- $b$ is a bias with $h$ elements;
- $y$ has $h$ features (or nodes);
- $x$ and $y$ represent the input and output features (variables, nodes in the NN). Adding a dimension (multiple rows) lets each row represent one data sample: the inputs and outputs become matrices $X$ and $Y$ whose last dimension indexes the features (the $x$ and $y$ of the corresponding data point).

A gradient is attached to each variable and parameter of the model. In matrix form, the transposes are what make the shapes match:

$y.g = \partial{L}/\partial{y}$

$x.g = \partial{L}/\partial{x} = \partial{L}/\partial{y} \cdot \partial{y}/\partial{x} = y.g \cdot W^T$

$W.g = \partial{L}/\partial{W} = x^T \cdot y.g$

$b.g = \partial{L}/\partial{b} = \partial{L}/\partial{y} \cdot \partial{y}/\partial{b} = y.g$

For a batch, where $x$ and $y$ become the matrices $X$ and $Y$, the same formulas hold, except that $b.g$ is the sum of the rows of $y.g$ because the same bias is added to every sample.

The shape of each gradient is the same as the shape of the corresponding variable (parameter), e.g. `x.g.shape ≡ x.shape`
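The shape rule and the matrix gradient formulas can be verified with finite differences. This is a sketch under stated assumptions: NumPy is used only to keep the check free of any autograd machinery, the loss $L = \frac{1}{2}\sum Y^2$ is a hypothetical choice (so that $\partial L/\partial Y = Y$), and the names `loss` and `numeric_grad` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h = 4, 3, 2                # batch size, input features, output features
X = rng.normal(size=(n, m))
W = rng.normal(size=(m, h))
b = rng.normal(size=h)

def loss(X, W, b):
    # Assumed loss for illustration: L = 0.5 * sum(Y ** 2), so dL/dY = Y
    Y = X @ W + b
    return 0.5 * np.sum(Y ** 2)

# Analytic gradients from the matrix formulas
Y = X @ W + b
Y_g = Y                          # dL/dY for the assumed loss
X_g = Y_g @ W.T                  # shape (n, m)
W_g = X.T @ Y_g                  # shape (m, h)
b_g = Y_g.sum(axis=0)            # shape (h,), summed over the batch

def numeric_grad(f, A, eps=1e-6):
    # Central differences, perturbing one entry of A at a time (in place)
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = f()
        A[idx] = old - eps
        down = f()
        A[idx] = old             # restore the original entry
        G[idx] = (up - down) / (2 * eps)
    return G

f = lambda: loss(X, W, b)
print(np.allclose(X_g, numeric_grad(f, X), atol=1e-5))
print(np.allclose(W_g, numeric_grad(f, W), atol=1e-5))
print(np.allclose(b_g, numeric_grad(f, b), atol=1e-5))
```

Each gradient has the shape of the array it belongs to, which is why the transposes appear where they do: they are the only arrangement in which the matrix products are defined.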

![Simple NN](../nn-mini.png)

The structure of a fully connected neural network with a single hidden layer can be represented as follows:

![NN with one hidden layer](../nn.png)

In [1]:
from pathlib import Path

In [1]:
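def lin_grad(x, w, b, y):
    # Backward pass of y = x @ w + b; y.g must already hold dL/dy.
    b.g = y.g.sum(dim=0)  # sum over the batch dimension, same shape as b
    w.g = x.T @ y.g       # shape (m, h), same as w
    x.g = y.g @ w.T       # shape (n, m), same as x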
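One way to gain confidence in a hand-written backward pass like this is to compare it with PyTorch autograd. The sketch below assumes the tensors are PyTorch tensors (suggested by `sum(dim=0)`), restates `lin_grad` with a matrix product for `x.g` so that the block is self-contained, and uses a hypothetical loss $L = \sum y^2$; the sizes and seed are arbitrary.

```python
import torch

def lin_grad(x, w, b, y):
    # Backward pass of y = x @ w + b; y.g must already hold dL/dy.
    b.g = y.g.sum(dim=0)
    w.g = x.T @ y.g
    x.g = y.g @ w.T

torch.manual_seed(0)
n, m, h = 5, 3, 2  # batch size, input features, output features
x = torch.randn(n, m, dtype=torch.float64, requires_grad=True)
w = torch.randn(m, h, dtype=torch.float64, requires_grad=True)
b = torch.randn(h, dtype=torch.float64, requires_grad=True)

# Reference gradients from autograd for the assumed loss L = sum(y ** 2)
y = x @ w + b
loss = (y ** 2).sum()
loss.backward()

# Manual gradients on detached copies, stored in the .g attributes
xd, wd, bd, yd = x.detach(), w.detach(), b.detach(), y.detach()
yd.g = 2 * yd      # dL/dy for the assumed loss
lin_grad(xd, wd, bd, yd)

print(torch.allclose(xd.g, x.grad, atol=1e-10))
print(torch.allclose(wd.g, w.grad, atol=1e-10))
print(torch.allclose(bd.g, b.grad, atol=1e-10))
```

Using an element-wise `*` instead of `@` for `x.g` would fail here with a shape error whenever `n != m`, since `y.g` has shape `(n, h)` and `w.T` has shape `(h, m)`.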