# Fully Connected Layer

All definitions use a batch of $S$ examples, meaning the forward pass is done on $S$ examples from the training dataset at once.
The weight matrix $\textbf{W}^{[l]}$ has one row per node of the layer ($M$ rows) and one column per node of the previous layer ($N$ columns, or the $N$ features of the training dataset for the first layer).
The layer's activation matrix $\textbf{A}^{[l]}$ has one row per example ($S$ rows) and one column per node of that layer ($M$ columns).

In summary, a layer's input matrix is $S \times N$, its weight matrix is $M \times N$, and its activation matrix is $S \times M$.

In all of the expressions below:
* $[l]$ - index of the layer in the network
* $m$ - index of a node in the layer ($1 \le m \le M$)
* $n$ - index of an input to a node ($1 \le n \le N$)
* $s$ - index of an example in the batch ($1 \le s \le S$)

The activation of node $m$ within layer $l$, given a batch of inputs $\textbf{A}^{[l-1]}$:
$$ \textbf{z}^{[l]}_{:, m} = \textbf{A}^{[l-1]} \cdot \textbf{w}^{[l]T}_{m, :} + b^{[l]}_m $$

$$ \textbf{a}^{[l]}_{:, m} = f(\textbf{z}^{[l]}_{:, m}) $$
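
As a rough illustration of the per-node expressions above, here is a minimal NumPy sketch. The sigmoid used for $f$ and the names `A_prev`, `W`, `b` and `node` are assumptions made for the example; the text does not fix a particular activation function.

```python
import numpy as np

rng = np.random.default_rng(0)

S, N, M = 10, 5, 3                 # batch size, inputs per node, nodes in the layer
A_prev = rng.random((S, N))        # A^{[l-1]}: one row per example
W = rng.random((M, N))             # W^{[l]}: one row per node
b = rng.random((1, M))             # b^{[l]}: one bias per node


def sigmoid(z):
    """Assumed activation function f (any elementwise function would do)."""
    return 1.0 / (1.0 + np.exp(-z))


node = 1                                 # zero-based index of node m
z_m = A_prev @ W[node, :] + b[0, node]   # z_{:, m}, shape (S,)
a_m = sigmoid(z_m)                       # a_{:, m}, shape (S,)

print(z_m.shape, a_m.shape)
```

Each entry of `z_m` is the pre-activation of the same node for a different example in the batch.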

The activation matrix of layer $l$, given a batch of inputs $\textbf{A}^{[l-1]}$ (the bias $\textbf{b}^{[l]}$ is a $1 \times M$ row vector added to every row):

$$ \textbf{Z}^{[l]} = \textbf{A}^{[l-1]} \cdot \textbf{W}^{[l]T}  + \textbf{b}^{[l]}  $$

$$ \textbf{A}^{[l]} = f(\textbf{Z}^{[l]}) $$

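```python
import numpy as np

n = 5   # inputs per node (N)
m = 3   # nodes in the layer (M)
s = 10  # examples in the batch (S)

X = np.random.rand(s, n)  # input batch, S x N
W = np.random.rand(m, n)  # weight matrix, M x N
b = np.random.rand(1, m)  # bias row vector, broadcast over the S rows

Z = np.dot(X, W.T) + b    # pre-activation, S x M

print(Z.shape)  # (10, 3)
```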


# Example for the 1st Layer

For the 1st layer's 3rd neuron, the calculation for a batch of $\textbf{X}$:

$$ \textbf{z}^{[1]}_{:, 3} = \textbf{X} \cdot \textbf{w}^{[1]T}_{3, :} + b^{[1]}_3 $$

$$ \textbf{a}^{[1]}_{:, 3} = f(\textbf{z}^{[1]}_{:, 3}) $$


The activation matrix for the whole 1st layer given a batch of $\textbf{X}$:

$$ \textbf{Z}^{[1]} = \textbf{X} \cdot \textbf{W}^{[1]T}  + \textbf{b}^{[1]}  $$

$$ \textbf{A}^{[1]} = f(\textbf{Z}^{[1]}) $$
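
The relationship between the single-neuron and whole-layer expressions for the 1st layer can be sketched as follows. Note that the 3rd neuron is index `2` in NumPy, since its indices start at 0. The use of `np.tanh` for $f$ and the names `W1`, `b1`, `Z1` and `A1` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

S, N, M = 10, 5, 3
X = rng.random((S, N))      # batch of inputs
W1 = rng.random((M, N))     # W^{[1]}
b1 = rng.random((1, M))     # b^{[1]}, broadcast across the S rows

f = np.tanh                 # assumed activation function

# Whole layer at once.
Z1 = X @ W1.T + b1          # shape (S, M)
A1 = f(Z1)

# 3rd neuron only: index 2, because NumPy indices start at 0.
z1_3 = X @ W1[2, :] + b1[0, 2]
a1_3 = f(z1_3)

# The 3rd column of the layer result should match the single-neuron result.
print(np.allclose(Z1[:, 2], z1_3), np.allclose(A1[:, 2], a1_3))
```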

# Derivatives

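The derivative of the pre-activation for a single example with respect to one of a node's weights is the corresponding input, that is, the activation of node $n$ in the previous layer for that example (or the example's $n$-th feature in the first layer):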

$$ \frac{\partial z^{[l]}_{s, m}}{\partial w^{[l]}_{m, n}} = a^{[l-1]}_{s, n} $$

The derivative of the activation for a single example with respect to its pre-activation is the activation function's derivative evaluated at that pre-activation:
$$ \frac {\partial a^{[l]}_{s, m}} {\partial z^{[l]}_{s, m}} = f'(z^{[l]}_{s, m}) $$
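
Both local derivatives can be sanity-checked numerically with finite differences. The sketch below assumes a sigmoid for $f$; the indices `s`, `m`, `n` and the step `eps` are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

S, N, M = 4, 5, 3
A_prev = rng.random((S, N))
W = rng.random((M, N))
b = rng.random((1, M))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    # Analytic derivative of the sigmoid: f'(z) = f(z) * (1 - f(z)).
    return sigmoid(z) * (1.0 - sigmoid(z))


s, m, n = 1, 2, 3           # zero-based example, node, and input indices
eps = 1e-6

# dz[s, m] / dw[m, n]: nudge one weight and see how z[s, m] moves.
Z = A_prev @ W.T + b
W_plus = W.copy()
W_plus[m, n] += eps
Z_plus = A_prev @ W_plus.T + b
dz_dw = (Z_plus[s, m] - Z[s, m]) / eps

# da[s, m] / dz[s, m]: central difference around z[s, m].
z = Z[s, m]
da_dz = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.isclose(dz_dw, A_prev[s, n]), np.isclose(da_dz, sigmoid_prime(z)))
```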

The derivative of the cost of a batch with respect to a node's weight follows from applying the chain rule to each example's cost $C_s$ and summing over the batch (this takes $C = \sum_{s=1}^S C_s$):

$$
\frac {\partial C} {\partial w^{[l]}_{m, n}}
= \sum_{s=1}^S \big[
    \frac {\partial C_s} {\partial a^{[l]}_{s, m}} 
    \cdot \frac {\partial a^{[l]}_{s, m}} {\partial z^{[l]}_{s, m}} 
    \cdot \frac {\partial z^{[l]}_{s, m}} {\partial w^{[l]}_{m, n}} 
\big]
$$

Substituting the two derivatives above:

$$
\frac {\partial C} {\partial w^{[l]}_{m, n}}
= \sum_{s=1}^S \big[
    \frac {\partial C_s} {\partial a^{[l]}_{s, m}} 
    \cdot f'(z^{[l]}_{s, m})
    \cdot a^{[l-1]}_{s, n} 
\big]
$$
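
The sum over examples can be computed for every weight at once with a single matrix product. The sketch below is one way to do it, not a definitive implementation. It assumes a sigmoid for $f$ and a squared-error cost $C_s = \frac{1}{2} \sum_m (a_{s, m} - y_{s, m})^2$ with hypothetical targets `Y`; neither choice comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(3)

S, N, M = 6, 5, 3
A_prev = rng.random((S, N))
W = rng.random((M, N))
b = rng.random((1, M))
Y = rng.random((S, M))      # hypothetical targets, only needed for the assumed cost


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(W):
    # Assumed cost: C = sum_s C_s, with C_s = 0.5 * sum_m (a[s, m] - y[s, m])^2.
    A = sigmoid(A_prev @ W.T + b)
    return 0.5 * np.sum((A - Y) ** 2)


Z = A_prev @ W.T + b
A = sigmoid(Z)
dC_dA = A - Y                       # dC_s / da[s, m] for the assumed cost
f_prime = A * (1.0 - A)             # f'(z[s, m]) for the sigmoid

# Sum over examples of dC_s/da * f'(z) * a_prev, for every (m, n) at once.
dC_dW = (dC_dA * f_prime).T @ A_prev        # shape (M, N)

# Same sum written out for a single weight, mirroring the formula.
m, n = 2, 3
explicit = sum(dC_dA[s, m] * f_prime[s, m] * A_prev[s, n] for s in range(S))

# Finite-difference check on the same weight.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[m, n] += eps
W_minus[m, n] -= eps
numeric = (cost(W_plus) - cost(W_minus)) / (2 * eps)

print(dC_dW.shape, np.isclose(dC_dW[m, n], explicit), np.isclose(dC_dW[m, n], numeric))
```

The result `dC_dW` has the same $M \times N$ shape as the weight matrix, so each entry lines up with the weight it differentiates.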