## ResNet (Residual neural networks)

The previous notebook described how image classification performance improved as the
depth of convolutional networks was extended from eight layers (AlexNet) to nineteen
layers (VGG). This led to experimentation with even deeper networks. However, performance decreased again when many more layers were added.

This notebook introduces *residual blocks*. Here, each network layer computes an *additive change* to the current representation instead of transforming it directly. This allows
deeper networks to be trained but causes an exponential increase in the activation magnitudes at initialization. Residual blocks compensate for this with **batch normalization**,
which re-centers and rescales the activations at each layer.

Residual blocks with batch normalization allow much deeper networks to be trained,
and these networks improve performance across a variety of tasks. Architectures that
combine residual blocks to tackle image classification, medical image segmentation, and
human pose estimation are described.
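
As a rough check of the claim about activation magnitudes, the sketch below stacks residual blocks of the form $\mathbf{h} + f[\mathbf{h}]$ at initialization and tracks the variance of the activations, with and without batch normalization. The width `D`, depth `K`, batch size `N`, the choice of He initialization, and the ordering batch norm → ReLU → linear are all assumptions made for illustration. The batch norm here has no learned scale or offset and the biases are omitted (both would be at their neutral values at initialization anyway).

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 100, 10, 1000  # width, number of residual blocks, batch size (assumed)

def batch_norm(h, eps=1e-5):
    # Re-center and rescale each hidden unit using statistics over the batch axis.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def run(use_batch_norm):
    h = rng.standard_normal((N, D))  # inputs with roughly unit variance
    variances = [h.var()]
    for _ in range(K):
        W = rng.standard_normal((D, D)) * np.sqrt(2.0 / D)  # He initialization
        z = batch_norm(h) if use_batch_norm else h
        h = h + np.maximum(z, 0.0) @ W  # residual block: h + f[h]
        variances.append(h.var())
    return np.array(variances)

var_plain = run(use_batch_norm=False)
var_bn = run(use_batch_norm=True)
print("without batch norm:", var_plain[-1])
print("with batch norm:   ", var_bn[-1])
```

Under these assumptions, each block without batch norm adds a term whose variance is about the same as its input, so the variance should roughly double per block. With batch norm, each block adds a term of roughly unit variance, so the growth should be closer to linear in the number of blocks.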

Every network we have seen so far processes the data sequentially; each layer receives
the previous layer’s output and passes the result to the next. For example,
a three-layer network is defined by:

$$\begin{align*}
\mathbf{h}_1 &= f_1[\mathbf{x}, \phi_1] \\
\mathbf{h}_2 &= f_2[\mathbf{h}_1, \phi_2] \\
\mathbf{h}_3 &= f_3[\mathbf{h}_2, \phi_3] \\
\mathbf{y} &= f_4[\mathbf{h}_3, \phi_4],
\end{align*}$$

where $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$ denote the intermediate hidden layers, $\mathbf{x}$ is the network input, $\mathbf{y}$
is the output, and the functions $f_k[\bullet,\phi_k]$ perform the processing.
In a standard neural network, each layer consists of a linear transformation followed
by an activation function, and the parameters $\phi_k$ comprise the weights and biases of the linear transformation. In a convolutional network, each layer consists of a set of convolutions followed by an activation function, *and the parameters comprise the convolutional kernels and biases.*
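
The sequential equations above can be sketched in NumPy as follows. This is a minimal illustration for the fully connected case, not a reference implementation: the layer sizes, the ReLU activation, the He-style weight scaling, and the choice to leave the final layer linear are all assumptions, and `layer` and `init` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(h, phi, activate=True):
    # One layer f_k[h, phi_k]: linear transformation, then an optional ReLU.
    W, b = phi
    z = h @ W + b
    return np.maximum(z, 0.0) if activate else z

def init(d_in, d_out):
    # phi_k = (weights, biases); the He-style scaling is an assumption.
    W = rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
    return W, np.zeros(d_out)

sizes = [4, 8, 8, 8, 2]  # input, three hidden layers, output (hypothetical)
phis = [init(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((5, 4))          # batch of 5 inputs
h1 = layer(x, phis[0])                   # h1 = f1[x, phi1]
h2 = layer(h1, phis[1])                  # h2 = f2[h1, phi2]
h3 = layer(h2, phis[2])                  # h3 = f3[h2, phi3]
y = layer(h3, phis[3], activate=False)   # y = f4[h3, phi4], left linear here
print(y.shape)  # (5, 2)
```

Each layer only ever sees the output of the layer before it, which is the purely sequential structure that residual connections modify.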


#### shattered gradient:
<img src=../images/shattered_gradient.png width=650>

Residual or skip connections are branches in the computational path, whereby the input
to each network layer $f[\bullet]$ is added back to the output. By analogy to the sequential network above, the residual network is defined as

$$
\begin{align*}
\mathbf{h}_1 &= \mathbf{x} + f_1[\mathbf{x}, \phi_1] \\
\mathbf{h}_2 &= \mathbf{h}_1 + f_2[\mathbf{h}_1, \phi_2] \\
\mathbf{h}_3 &= \mathbf{h}_2 + f_3[\mathbf{h}_2, \phi_3] \\
\mathbf{y} &= \mathbf{h}_3 + f_4[\mathbf{h}_3, \phi_4],
\end{align*}
$$

where the first term on the right-hand side of each line is the residual connection. Each
function $f_k$ learns an additive change to the current representation. It follows that their
outputs must be the same size as their inputs. Each additive combination of the input
and the processed output is known as a residual block or residual layer.
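
A minimal NumPy sketch of the residual network defined above is given below. The width `D`, the form of each $f_k$ (ReLU followed by a linear transformation), and the initialization are assumptions for illustration, and `make_block` is a hypothetical helper. Every $f_k$ maps `D` dimensions to `D` dimensions, which is what makes the addition well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6  # every block maps D -> D so the input can be added to the output

def make_block():
    # f_k[h, phi_k] = ReLU(h) W + b, with phi_k = (W, b); this form is assumed.
    W = rng.standard_normal((D, D)) * np.sqrt(2.0 / D)
    b = np.zeros(D)
    return lambda h: np.maximum(h, 0.0) @ W + b

f1, f2, f3, f4 = (make_block() for _ in range(4))

x = rng.standard_normal((3, D))  # batch of 3 inputs
h1 = x + f1(x)      # residual connection: the input is added back
h2 = h1 + f2(h1)
h3 = h2 + f3(h2)
y = h3 + f4(h3)
print(x.shape, y.shape)  # (3, 6) (3, 6)
```

If a block needed to change the size of the representation, the plain addition would no longer work and some extra mapping on the skip path would be required; that case is not covered by this sketch.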

#### the overall layout for residual connections:
<img src=../images/residual_connection.png width=650>

One interpretation is that residual connections turn the original network into an *ensemble* of
smaller networks (the input itself plus one network for each line of the residual equations) whose outputs are summed to compute the result.
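
Unrolling the residual equations by substituting each line into the next makes the sum explicit. The parameters $\phi_k$ are omitted here for brevity:

$$
\begin{align*}
\mathbf{y} = \mathbf{x} &+ f_1[\mathbf{x}] \\
&+ f_2\big[\mathbf{x} + f_1[\mathbf{x}]\big] \\
&+ f_3\big[\mathbf{x} + f_1[\mathbf{x}] + f_2[\mathbf{x} + f_1[\mathbf{x}]]\big] \\
&+ f_4\big[\mathbf{x} + f_1[\mathbf{x}] + f_2[\mathbf{x} + f_1[\mathbf{x}]] + f_3[\mathbf{x} + f_1[\mathbf{x}] + f_2[\mathbf{x} + f_1[\mathbf{x}]]]\big]
\end{align*}
$$

Each line after the first is one of the smaller networks, and the argument of each $f_k$ is exactly $\mathbf{h}_{k-1}$ written out in terms of $\mathbf{x}$.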

### further research:
- shattered gradients??