---
categories: [neural-nets]
date: '2022-11-13'
description: Implementing the forward and backward pass for a simple neural net.
output-file: 2021-11-13-backward-pass.html
title: Backward Pass
bibliography: ../references.bib
csl: ../control-and-automation.csl
toc: true
use_math: true

---

# Notation
* $N$ : number of training examples
* $d$ : number of features
* $\mathbf{x}^{(i)}$ is the $i$-th training example $$
\mathbf{x}^{(i)} = \begin{bmatrix}
x^{(i)}_1 \\
x^{(i)}_2 \\
\vdots \\
x^{(i)}_d
\end{bmatrix}
$$ 
* $\mathbf{w}$ is the vector of weights for a single neuron $$
\mathbf{w} = \begin{bmatrix}
w_1 \\
w_2 \\
\vdots \\
w_d
\end{bmatrix}
$$ 
* $b$ is the bias term for a single neuron
* $\mathbf{X}$ is an $N \times d$ matrix with the training examples stacked in rows
$$
\mathbf{X} = \begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_d\\
x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_d\\
\vdots & \vdots & \ddots & \vdots\\
x^{(N)}_1 & x^{(N)}_2 & \cdots & x^{(N)}_d
\end{bmatrix}
$$ 
* $\{y^{(i)}\}_{i=1}^{N}$ are the targets for each of the $N$ training examples
* $z^{(i)} = \mathbf{w}^T\mathbf{x}^{(i)} + b$ is the pre-activation output of our single neuron for the $i$-th training example
* $a^{(i)} = \phi(z^{(i)})$ is the activation obtained by passing $z^{(i)}$ through an activation function $\phi$
    * $\{a^{(i)}\}_{i=1}^{N}$ are the activations for all $N$ training examples, each produced by the single neuron followed by the activation function
* $J\left(\{y^{(i)}\}_{i=1}^{N},\{a^{(i)}\}_{i=1}^{N}\right) = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)}-a^{(i)})^{2}$ is the mean squared error across the $N$ training examples
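
The notation above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used elsewhere in the post: the function names `forward`, `mse`, and `sigmoid`, the choice of a sigmoid for $\phi$, and the toy numbers are all assumptions made for the example.

```python
import math

def forward(X, w, b, phi):
    """Return pre-activations z and activations a for every row of X."""
    # z^(i) = w^T x^(i) + b, computed one training example (row) at a time
    z = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b for x in X]
    # a^(i) = phi(z^(i))
    a = [phi(z_i) for z_i in z]
    return z, a

def mse(y, a):
    """Mean squared error J = (1/N) * sum_i (y^(i) - a^(i))^2."""
    N = len(y)
    return sum((y_i - a_i) ** 2 for y_i, a_i in zip(y, a)) / N

def sigmoid(t):
    """One possible activation function phi (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-t))

# Toy data: N = 3 training examples, d = 2 features (hypothetical values)
X = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
w = [0.5, -0.25]
b = 0.1
y = [1.0, 0.0, 1.0]

z, a = forward(X, w, b, sigmoid)
J = mse(y, a)
```

Here `X` plays the role of $\mathbf{X}$ with examples stacked in rows, so `z` and `a` each hold $N$ values, one per training example.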

# Derivatives 

## Loss with respect to the Activations
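How does the loss change as the $i$-th activation changes? Only the $i$-th term of the sum depends on $a^{(i)}$, so every other term vanishes under differentiation: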

$$
\begin{aligned}
\frac{\partial J}{\partial a^{(i)}} &= \frac{1}{N}\sum_{j=1}^{N} \frac{\partial (y^{(j)}-a^{(j)})^{2}}{\partial a^{(i)}} = \frac{1}{N}\frac{\partial (y^{(i)}-a^{(i)})^{2}}{\partial a^{(i)}} \\
&= \frac{2}{N}(y^{(i)}-a^{(i)})\frac{\partial (y^{(i)}-a^{(i)})}{\partial a^{(i)}} = \frac{2}{N}(y^{(i)}-a^{(i)})(0-1) = \frac{2}{N}(a^{(i)}-y^{(i)})
\end{aligned}
$$

Stacking these partial derivatives over all $N$ training examples gives the gradient of the loss with respect to the activations, an $N \times 1$ column vector:

$$
\frac{\partial J}{\partial \mathbf{a}} = \begin{bmatrix}
\frac{\partial J}{\partial a^{(1)}} \\
\frac{\partial J}{\partial a^{(2)}} \\
\vdots \\
\frac{\partial J}{\partial a^{(N)}}
\end{bmatrix}
= \begin{bmatrix}
\frac{2}{N}\left(a^{(1)}-y^{(1)}\right) \\
\frac{2}{N}\left(a^{(2)}-y^{(2)}\right) \\
\vdots \\
\frac{2}{N}\left(a^{(N)}-y^{(N)}\right)
\end{bmatrix}
$$ 
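
One way to sanity-check this result is to compare the analytic gradient against a finite-difference estimate. The sketch below is a hedged illustration in plain Python: the helper names `mse_grad` and `numeric_grad`, the step size `eps`, and the toy values of `y` and `a` are assumptions introduced for the example.

```python
def mse(y, a):
    """Mean squared error J = (1/N) * sum_i (y^(i) - a^(i))^2."""
    N = len(y)
    return sum((y_i - a_i) ** 2 for y_i, a_i in zip(y, a)) / N

def mse_grad(y, a):
    """Analytic gradient: dJ/da^(i) = (2/N) * (a^(i) - y^(i))."""
    N = len(y)
    return [2.0 / N * (a_i - y_i) for y_i, a_i in zip(y, a)]

def numeric_grad(y, a, eps=1e-6):
    """Central finite differences, perturbing one activation at a time."""
    grad = []
    for i in range(len(a)):
        a_plus = a[:i] + [a[i] + eps] + a[i + 1:]
        a_minus = a[:i] + [a[i] - eps] + a[i + 1:]
        grad.append((mse(y, a_plus) - mse(y, a_minus)) / (2 * eps))
    return grad

# Hypothetical targets and activations for N = 3 training examples
y = [1.0, 0.0, 1.0]
a = [0.8, 0.3, 0.4]

analytic = mse_grad(y, a)
numeric = numeric_grad(y, a)

# The two estimates should agree up to floating-point error
max_diff = max(abs(g_a - g_n) for g_a, g_n in zip(analytic, numeric))
```

Because $J$ is quadratic in each $a^{(i)}$, the central difference has no truncation error here, so `max_diff` should be limited only by floating-point rounding.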