## ResNets (Residual neural networks)

The previous notebook described how image classification performance improved as the
depth of convolutional networks was extended from eight layers (AlexNet) to nineteen
layers (VGG). This led to experimentation with even deeper networks. However, performance decreased again when many more layers were added.
This notebook introduces *residual blocks*. Here, each network layer computes an *additive change* to the current representation instead of transforming it directly. This allows
deeper networks to be trained but causes an exponential increase in the activation magnitudes at initialization. Residual blocks employ **batch normalization** to compensate for
this, which re-centers and rescales the activations at each layer.
Residual blocks with batch normalization allow much deeper networks to be trained,
and these networks improve performance across a variety of tasks. Architectures that
combine residual blocks to tackle image classification, medical image segmentation, and
human pose estimation are described.

Every network we have seen so far processes the data sequentially; each layer receives
the previous layer’s output and passes the result to the next. For example,
a three-layer network is defined by:

$$\begin{align*}
\mathbf{h}_1 &= f_1[\mathbf{x}, \phi_1] \\
\mathbf{h}_2 &= f_2[\mathbf{h}_1, \phi_2] \\
\mathbf{h}_3 &= f_3[\mathbf{h}_2, \phi_3] \\
\mathbf{y} &= f_4[\mathbf{h}_3, \phi_4],
\end{align*}$$

where $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$ denote the intermediate hidden layers, $\mathbf{x}$ is the network input, $\mathbf{y}$
is the output, and the functions $f_k[\bullet,\phi_k]$ perform the processing.
In a standard neural network, each layer consists of a linear transformation followed
by an activation function, and the parameters $\phi_k$ comprise the weights and biases of the linear transformation. In a convolutional network, each layer consists of a set of convolutions followed by an activation function, *and the parameters comprise the convolutional kernels and biases.*


#### shattered gradient:
<img src=../images/shattered_gradient.png width=650>

Residual or skip connections are branches in the computational path, whereby the input
to each network layer $f[\bullet]$ is added back to the output. By analogy to the sequential network above, the residual network is defined as

$$
\begin{align*}
\mathbf{h}_1 &= \mathbf{x} + f_1[\mathbf{x}, \phi_1] \\
\mathbf{h}_2 &= \mathbf{h}_1 + f_2[\mathbf{h}_1, \phi_2] \\
\mathbf{h}_3 &= \mathbf{h}_2 + f_3[\mathbf{h}_2, \phi_3] \\
\mathbf{y} &= \mathbf{h}_3 + f_4[\mathbf{h}_3, \phi_4],
\end{align*}
$$

where the first term on the right-hand side of each line is the residual connection. Each
function $f_k$ **learns an additive change** to the current representation. It follows that their
outputs must be the same size as their inputs. Each additive combination of the input
and the processed output is known as a residual block or residual layer.
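
The two definitions above can be compared in a minimal, dependency-free Python sketch. The toy layer (`make_layer`, a random linear map followed by ReLU with no bias) and the helper names are illustrative assumptions, not part of the original text; the only point is that the residual version adds each layer's output back to its input, which requires matching sizes.

```python
import random

def make_layer(dim, rng):
    """Return a toy layer f[h] = ReLU(W h) with a random dim x dim matrix W (illustrative only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def layer(h):
        return [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]

    return layer

def sequential_forward(x, layers):
    """h_k = f_k[h_{k-1}]: each layer replaces the representation."""
    h = x
    for f in layers:
        h = f(h)
    return h

def residual_forward(x, layers):
    """h_k = h_{k-1} + f_k[h_{k-1}]: each layer adds a change to the representation."""
    h = x
    for f in layers:
        update = f(h)
        assert len(update) == len(h), "residual branch must preserve the size"
        h = [a + b for a, b in zip(h, update)]
    return h

rng = random.Random(0)
layers = [make_layer(4, rng) for _ in range(4)]
x = [1.0, -2.0, 0.5, 3.0]
y_seq = sequential_forward(x, layers)
y_res = residual_forward(x, layers)
```

With an empty list of layers, `residual_forward` returns the input unchanged, which is the identity path that the skip connections provide.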

#### the overall layout for residual connections:
<img src=../images/residual_connection.png width=650>

One interpretation is that residual connections turn the original network into an *ensemble* of
smaller networks, one for each path through the blocks, whose outputs are summed to compute the result.
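
To make the ensemble view concrete, each line of the residual network can be substituted into the next. Omitting the parameters $\phi_k$ for brevity, this gives:

$$
\begin{align*}
\mathbf{y} = \mathbf{x} &+ f_1[\mathbf{x}] \\
&+ f_2\big[\mathbf{x} + f_1[\mathbf{x}]\big] \\
&+ f_3\big[\mathbf{x} + f_1[\mathbf{x}] + f_2[\mathbf{x} + f_1[\mathbf{x}]]\big] \\
&+ f_4\big[\mathbf{x} + f_1[\mathbf{x}] + f_2[\mathbf{x} + f_1[\mathbf{x}]] + f_3[\mathbf{x} + f_1[\mathbf{x}] + f_2[\mathbf{x} + f_1[\mathbf{x}]]]\big]
\end{align*}
$$

The output is the input plus four terms, and each term is a network that sees the input through a different combination of the earlier blocks.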

***He initialization of network parameters:***

We initialize the network parameters so that the expected variance of the
activations (in the forward pass) and gradients (in the backward pass) remains the same
between layers. *He initialization* achieves this for ReLU activations by
initializing the biases $\beta$ to zero and choosing normally distributed weights $\Omega$ with mean
zero and variance $2/{D_h}$ where $D_h$ is the number of hidden units in the previous layer.
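
As a rough check of this rule, the sketch below pushes a random vector through a stack of ReLU layers initialized this way and records the mean squared pre-activation at each layer. The helper names (`he_init`, `layer`, `second_moment`) and the width and depth are illustrative assumptions; with a finite width the values fluctuate from layer to layer, but they should neither collapse toward zero nor blow up.

```python
import math
import random

def he_init(d_in, d_out, rng):
    """Weights ~ Normal(0, 2 / d_in), biases zero."""
    std = math.sqrt(2.0 / d_in)
    weights = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]
    biases = [0.0] * d_out
    return weights, biases

def layer(pre_act, weights, biases):
    """Apply ReLU, then the linear map, giving the next layer's pre-activations."""
    act = [max(0.0, v) for v in pre_act]
    return [b + sum(w * a for w, a in zip(row, act)) for row, b in zip(weights, biases)]

def second_moment(values):
    """Mean of squares; equals the variance when the mean is zero."""
    return sum(v * v for v in values) / len(values)

rng = random.Random(0)
width, depth = 100, 10
f = [rng.gauss(0.0, 1.0) for _ in range(width)]
moments = [second_moment(f)]
for _ in range(depth):
    weights, biases = he_init(width, width, rng)
    f = layer(f, weights, biases)
    moments.append(second_moment(f))
```

For comparison, using variance $1/D_h$ instead of $2/D_h$ would be expected to roughly halve the second moment at every layer, because ReLU discards about half of it.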

Now consider a residual network. We do not have to worry about the intermediate
values or gradients vanishing with network depth since there exists a path whereby
each layer directly contributes to the network output.
However, even if we use He initialization within the residual block, the values in the
forward pass increase exponentially as we move through the network.
To see why, consider that we add the result of the processing in the residual block back
to the input. Each branch has some (uncorrelated) variability. Hence, the overall variance
increases when we recombine them. With ReLU activations and He initialization, the
expected variance is unchanged by the processing in each block. Consequently, when
we recombine with the input, the variance doubles, growing exponentially
with the number of residual blocks. This limits the possible network depth before floating
point precision is exceeded in the forward pass. A similar argument applies to the
gradients in the backward pass of the backpropagation algorithm.
Hence, residual networks still suffer from unstable forward propagation and exploding
gradients even with He initialization. One approach that would stabilize the forward and
backward passes would be to use He initialization and then multiply the combined output
of each residual block by $1/\sqrt{2}$ to compensate for the doubling. However,
it is more usual to use batch normalization.
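
The doubling argument can be checked numerically. The sketch below assumes a simplified residual block, $\mathbf{h} \leftarrow \mathbf{h} + \Omega\,\text{ReLU}[\mathbf{h}]$ with He-initialized $\Omega$ and no biases; the function name `residual_variances` and the sizes are hypothetical. It tracks the mean squared activation with and without the $1/\sqrt{2}$ rescaling mentioned above.

```python
import math
import random

def residual_variances(num_blocks, width, rng, rescale=False):
    """Track the mean squared activation through blocks h <- h + W ReLU(h)."""
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = [sum(v * v for v in h) / width]
    std = math.sqrt(2.0 / width)  # He initialization for the weights W
    for _ in range(num_blocks):
        act = [max(0.0, v) for v in h]
        # Each weight is drawn on the fly, which is equivalent to a fresh random W.
        update = [sum(rng.gauss(0.0, std) * a for a in act) for _ in range(width)]
        h = [a + b for a, b in zip(h, update)]
        if rescale:
            h = [v / math.sqrt(2.0) for v in h]  # compensate for the doubling
        history.append(sum(v * v for v in h) / width)
    return history

rng = random.Random(0)
plain = residual_variances(8, 128, rng)
scaled = residual_variances(8, 128, rng, rescale=True)
growth_plain = plain[-1] / plain[0]    # expected to be on the order of 2**8
growth_scaled = scaled[-1] / scaled[0]  # expected to stay on the order of 1
```

The exact numbers depend on the random draw, but the unscaled network should grow by roughly a factor of two per block while the rescaled one stays roughly constant.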

<img src=../images/BN_in_ResNet.png width=650>

Batch normalization or BatchNorm shifts and rescales each activation $h$ so that its mean
and variance across the batch $B$ become values that are learned during training. First,
the empirical mean $m_h$ and standard deviation $s_h$ are computed:

$$
m_h = \frac{1}{|B|} \sum_{i \in B} h_i
$$
$$
s_h = \sqrt{\frac{1}{|B|} \sum_{i \in B} (h_i - m_h)^2}
$$
where all quantities are scalars. Then we use these statistics to *standardize* the batch activations to have mean zero and unit variance:
$$
h_i \leftarrow \frac{h_i - m_h}{s_h + \epsilon} \quad \forall i \in B
$$
where $\epsilon$ is a small number that prevents division by zero if $h_i$ is the same for every
member of the batch and $s_h = 0$.

Finally, the normalized variable is scaled by $\gamma$ and shifted by $\delta$:
$$
h_i \leftarrow \gamma h_i + \delta \quad \forall i \in B
$$

After this operation, the activations have mean $\delta$ and standard deviation $\gamma$ across all
members of the batch.
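
The three steps above (statistics, standardization, scale and shift) can be sketched for a single hidden unit as follows. The function name `batch_norm` and the example values are illustrative; $\epsilon$ is added to $s_h$ as in the equations above, whereas some libraries add it to the variance inside the square root instead.

```python
import math

def batch_norm(h, gamma=1.0, delta=0.0, eps=1e-5):
    """Batch-normalize one hidden unit; h holds its activation for each batch member."""
    m_h = sum(h) / len(h)                                     # empirical mean
    s_h = math.sqrt(sum((v - m_h) ** 2 for v in h) / len(h))  # empirical standard deviation
    standardized = [(v - m_h) / (s_h + eps) for v in h]       # mean 0, roughly unit variance
    return [gamma * v + delta for v in standardized]          # learned scale and shift

# This batch has mean 5 and standard deviation 2.
batch = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
out = batch_norm(batch, gamma=3.0, delta=-1.0)

out_mean = sum(out) / len(out)                                       # close to delta = -1
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))  # close to gamma = 3
```

If every member of the batch has the same activation, $s_h = 0$ and the standardized values are all zero, so the output is simply $\delta$.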

Batch normalization is applied independently to each hidden unit. In a standard
neural network with $K$ layers, each containing $D$ hidden units, there would be $KD$
learned offsets $\delta$ and $KD$ learned scales $\gamma$. In a convolutional network, the normalizing
statistics are computed over both the batch and the spatial position. If there were $K$
layers, each containing $C$ channels, there would be $KC$ offsets and $KC$ scales. At test
time, we do not have a batch from which we can gather statistics. To resolve this, the
statistics $m_h$ and $s_h$ are calculated across the whole training dataset (rather than just a
batch) and frozen in the final network.
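
A small sketch of the convolutional case, assuming activations stored as nested lists indexed `[batch][channel][row][col]`. The function names `channel_statistics` and `normalize_with_frozen_stats` are hypothetical. Statistics are pooled over the batch and all spatial positions, giving one mean and one standard deviation per channel, and the same frozen values are then reused on a single test example.

```python
import math

def channel_statistics(x):
    """Return one (mean, std) pair per channel, pooled over batch and spatial positions."""
    stats = []
    for c in range(len(x[0])):
        values = [v for sample in x for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        stats.append((mean, std))
    return stats

def normalize_with_frozen_stats(x, stats, gammas, deltas, eps=1e-5):
    """Apply batch normalization using precomputed statistics (as done at test time)."""
    return [
        [
            [[gammas[c] * (v - stats[c][0]) / (stats[c][1] + eps) + deltas[c] for v in row]
             for row in sample[c]]
            for c in range(len(sample))
        ]
        for sample in x
    ]

# Two training examples, two channels, 2x2 spatial positions.
train = [
    [[[0.0, 2.0], [4.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]],
    [[[8.0, 10.0], [12.0, 14.0]], [[3.0, 3.0], [3.0, 3.0]]],
]
stats = channel_statistics(train)  # channel 0: mean 7, std sqrt(21); channel 1: mean 2, std 1

# A single test example is normalized with the frozen training statistics.
test_example = [[[[7.0, 7.0], [7.0, 7.0]], [[3.0, 3.0], [3.0, 3.0]]]]
test_out = normalize_with_frozen_stats(test_example, stats, gammas=[1.0, 1.0], deltas=[0.0, 0.0])
```

Here the statistics come from a tiny stand-in for the training set; a real implementation would have to aggregate them over the full dataset.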

Batch normalization makes the network invariant to rescaling the weights and biases that
contribute to each activation; if these are doubled, then the activations also double, the
estimated standard deviation $s_h$ doubles, and the standardization step above compensates for these changes. This happens separately for each hidden unit. Consequently,
there will be a large family of weights and biases that all produce the same effect. Batch
normalization also adds two parameters, $\gamma$ and $\delta$, at every hidden unit, which makes the
model somewhat larger. Hence, it both creates redundancy in the weights and biases and
adds extra parameters to compensate for that redundancy. This is obviously inefficient,
but batch normalization also provides several benefits.
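
The rescaling invariance is easy to verify for one hidden unit with a scalar input. In this sketch (the names `batch_norm` and `pre_activations`, and the numbers, are illustrative), doubling the weight and bias doubles every pre-activation, and the normalized outputs agree up to a tiny difference caused by $\epsilon$.

```python
import math

def batch_norm(h, gamma=1.0, delta=0.0, eps=1e-5):
    """Standardize h across the batch, then scale by gamma and shift by delta."""
    m_h = sum(h) / len(h)
    s_h = math.sqrt(sum((v - m_h) ** 2 for v in h) / len(h))
    return [gamma * (v - m_h) / (s_h + eps) + delta for v in h]

def pre_activations(inputs, weight, bias):
    """One hidden unit with a scalar input: weight * x + bias for each batch member."""
    return [weight * x + bias for x in inputs]

inputs = [-1.5, 0.2, 0.7, 2.4, 3.1]
weight, bias = 0.8, -0.3

original = batch_norm(pre_activations(inputs, weight, bias))
doubled = batch_norm(pre_activations(inputs, 2 * weight, 2 * bias))

# The two outputs differ only through the small eps term in the denominator.
max_difference = max(abs(a - b) for a, b in zip(original, doubled))
```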

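Benefits of BatchNorm:
- **Stable forward propagation:** if the offsets $\delta$ are initialized to zero and the scales $\gamma$ to one, each normalized activation has unit variance. In a residual network, each block then adds roughly one unit of variance, so the variance grows linearly with depth rather than exponentially.
- **Higher learning rates:** batch normalization tends to make the loss surface and its gradient change more smoothly, which permits larger learning rates.
- **Regularization:** the batch statistics depend on which examples happen to share a batch, so normalization injects noise into training, which can improve generalization.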
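
The first benefit can be illustrated numerically. The sketch below assumes a simplified residual block in which the branch applies batch normalization (with $\gamma = 1$, $\delta = 0$), then ReLU, then a He-initialized linear map; the function names and sizes are hypothetical. With normalization the mean squared activation should grow by roughly one unit per block, and without it the value roughly doubles per block.

```python
import math
import random

def batch_norm_columns(h, eps=1e-5):
    """Standardize each hidden unit (column) across the batch (rows); gamma=1, delta=0."""
    n, width = len(h), len(h[0])
    out = [[0.0] * width for _ in range(n)]
    for j in range(width):
        col = [row[j] for row in h]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            out[i][j] = (h[i][j] - mean) / (std + eps)
    return out

def second_moment(h):
    """Mean squared activation over the whole batch."""
    return sum(v * v for row in h for v in row) / (len(h) * len(h[0]))

def run(num_blocks, batch, width, rng, use_batch_norm):
    """Track the mean squared activation through blocks h <- h + W ReLU(branch input)."""
    h = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    history = [second_moment(h)]
    std = math.sqrt(2.0 / width)  # He initialization
    for _ in range(num_blocks):
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        pre = batch_norm_columns(h) if use_batch_norm else h
        act = [[max(0.0, v) for v in row] for row in pre]
        h = [
            [h_val + sum(w * a for w, a in zip(w_row, a_row))
             for h_val, w_row in zip(h_row, weights)]
            for h_row, a_row in zip(h, act)
        ]
        history.append(second_moment(h))
    return history

with_bn = run(8, 32, 64, random.Random(0), use_batch_norm=True)
without_bn = run(8, 32, 64, random.Random(0), use_batch_norm=False)
```

After eight blocks, `with_bn[-1]` should be on the order of 9 (one plus eight), while `without_bn[-1]` should be on the order of $2^8$; the exact values depend on the random draw.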

Residual blocks were first used in convolutional networks for image classification. The
resulting networks are known as residual networks, or ResNets for short. In ResNets, each
residual block contains a batch normalization operation, a ReLU activation function, and
a convolutional layer. This is followed by the same sequence again before being added back to the input. Trial and error have shown that this order of operations
works well for image classification.
For very deep networks, the number of parameters may become undesirably large.
*Bottleneck residual blocks* make more efficient use of parameters using three convolutions.
The first has a 1×1 kernel and reduces the number of channels. The second is a regular
3×3 kernel, and the third is another 1×1 kernel to increase the number of channels back
to the original amount. In this way, we can integrate information over a
3×3 pixel area using fewer parameters.
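
The parameter saving can be estimated by counting kernel weights. The sketch below compares a single 3×3 convolution that keeps the channel count with a bottleneck of the form described above. The function names and the reduction factor of 4 are assumptions chosen for illustration, and biases and batch normalization parameters are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of kernel weights in a 2D convolution (biases excluded)."""
    return kernel_size * kernel_size * in_channels * out_channels

def plain_conv_weights(channels):
    """A single 3x3 convolution that keeps the number of channels."""
    return conv_weights(3, channels, channels)

def bottleneck_weights(channels, reduction=4):
    """1x1 reduce, 3x3 at the reduced width, 1x1 expand; `reduction` is an assumed factor."""
    inner = channels // reduction
    return (
        conv_weights(1, channels, inner)
        + conv_weights(3, inner, inner)
        + conv_weights(1, inner, channels)
    )

plain = plain_conv_weights(256)       # 9 * 256 * 256 = 589824
bottleneck = bottleneck_weights(256)  # 16384 + 36864 + 16384 = 69632
```

Under these assumptions the bottleneck uses roughly an eighth of the weights of the plain 3×3 convolution while still covering a 3×3 pixel area.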

<img src=../images/bottleneck_ResNet_block.png width=700>

The **DenseNet** architecture uses concatenation so that the input to a layer comprises
the concatenated outputs from all previous layers. These are processed to
create a new representation that is itself concatenated with the previous representation
and passed to the next layer. This concatenation means there is a direct contribution
from earlier layers to the output, so the loss surface behaves reasonably.

DenseNet is built around the idea of "dense connectivity": each layer receives the feature maps of all preceding layers and passes its own feature maps to all subsequent layers. Concatenation is the mechanism that makes this possible. At each layer, the newly produced feature maps are concatenated with those from all previous layers, rather than being added to them as in ResNet. Concatenation therefore keeps every feature intact instead of merging features together, which lets DenseNet retain a more diverse set of features and can improve its ability to learn complex patterns. Dense connectivity enables feature reuse, better gradient flow, and parameter efficiency, which can yield an effective and compact architecture.
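
The difference between concatenating and adding can be shown with plain lists. In this sketch, `toy_layer` is a made-up stand-in for a real layer that produces two new features from everything it receives; the names are hypothetical. The representation grows with every layer under concatenation, whereas a residual connection would keep its size fixed.

```python
def toy_layer(features):
    """Stand-in for a real layer: produces two new features from all inputs."""
    return [sum(features), max(features)]

def dense_forward(x, layers):
    """Each layer sees the input concatenated with the outputs of all earlier layers."""
    features = list(x)
    for layer in layers:
        new_features = layer(features)
        features = features + new_features  # list concatenation, not element-wise addition
    return features

out = dense_forward([1.0, 2.0], [toy_layer, toy_layer])
# After layer 1: [1.0, 2.0, 3.0, 2.0]
# After layer 2: [1.0, 2.0, 3.0, 2.0, 8.0, 3.0]
```

The original input values are still present, unchanged, at the start of the final representation, which is the sense in which concatenation keeps earlier features intact.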

### further research:
- shattered gradients??