## Convolutional networks
Convolutional networks are built from more specialized components than fully connected networks: layers with sparser connections, shared weights, and parallel processing paths.

### convolutional neural networks for 1D inputs:
Convolutional layers are network layers based on the convolution operation. In 1D, a
convolution transforms an input vector $x$ into an output vector $z$ so that each output $z_i$
is a weighted sum of nearby inputs. The same weights are used at every position and
are collectively called the *convolution kernel* or *filter*. The size of the region over which
inputs are combined is termed the *kernel size*.

$$z_i = w_1 x_{i-1} + w_2 x_i + w_3 x_{i+1}$$
where $\mathbf{w} = [w_1, w_2, w_3]^T$ is the kernel (here with kernel size three). Notice that the convolution operation is equivariant with respect to translation: if we translate the input $x$, then the corresponding output $z$ is translated in the same way.
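The translation equivariance described above can be checked numerically. The following is a minimal sketch in plain Python, assuming zero padding at the boundaries; the function names `conv1d_size3` and `shift_right` and the example values are illustrative choices, not part of the original notes.

```python
def conv1d_size3(x, w):
    """Compute z_i = w1*x_{i-1} + w2*x_i + w3*x_{i+1} with zero padding."""
    padded = [0.0] + list(x) + [0.0]  # out-of-range inputs are treated as zero
    return [
        w[0] * padded[i] + w[1] * padded[i + 1] + w[2] * padded[i + 2]
        for i in range(len(x))
    ]


def shift_right(v, steps=1):
    """Translate a sequence to the right, filling the gap with zeros."""
    return [0.0] * steps + list(v[:len(v) - steps])


# Illustrative input with zeros at both ends so nothing falls off the edge.
x = [0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0]
w = [0.5, 1.0, -0.5]

z = conv1d_size3(x, w)
z_of_shifted = conv1d_size3(shift_right(x), w)

# Convolving the shifted input gives the shifted output.
print(z_of_shifted == shift_right(z))
```

The equality is exact here only because the non-zero part of the input stays away from the boundary; near the edges, zero padding breaks strict equivariance.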

Convolution operations are distinguished by their *stride*, *kernel size*, and *dilation rate*. When we evaluate the output
at every position, we term this a stride of one. However, it is also possible to shift the
kernel by a stride greater than one. If we have a stride of two, we create roughly half
the number of outputs.
The kernel size can be increased to integrate over a larger area. However, it typically remains an odd number so that it can be centered around the current
position. Increasing the kernel size has the disadvantage of requiring more weights. This
leads to the idea of dilated or atrous convolutions, in which the kernel values are interspersed with zeros. For example, we can turn a kernel of size five into a dilated kernel of
size three by setting the second and fourth elements to zero. We still integrate information from a larger input region but only require three weights to do this.
The number of zeros we intersperse between the weights determines the *dilation rate*.
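To make stride and dilation concrete, here is a hedged sketch of a 1D convolution that supports both. It assumes zero padding and an odd number of kernel weights; the function name `conv1d` and the parameter `zeros_between` (the number of zeros inserted between adjacent weights, following the description above) are hypothetical names chosen for this example.

```python
def conv1d(x, w, stride=1, zeros_between=0):
    """1D convolution with zero padding, a stride, and a dilated kernel.

    zeros_between is the number of zeros interspersed between adjacent
    kernel weights (0 gives an ordinary, undilated kernel).
    Assumes len(w) is odd so the kernel can be centered.
    """
    step = zeros_between + 1        # spacing between kernel weights
    half = (len(w) // 2) * step     # reach of the kernel on each side
    out = []
    for i in range(0, len(x), stride):      # positions where we evaluate
        total = 0.0
        for j, wj in enumerate(w):
            k = i + j * step - half         # input index for weight j
            if 0 <= k < len(x):             # zero padding outside the range
                total += wj * x[k]
        out.append(total)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [1.0, 1.0, 1.0]

dense = conv1d(x, w)                       # stride one: 8 outputs
strided = conv1d(x, w, stride=2)           # stride two: 4 outputs
dilated = conv1d(x, w, zeros_between=1)    # three weights, region of five

# The dilated kernel matches a size-five kernel with zeros in slots 2 and 4.
same_as_size_five = dilated == conv1d(x, [1.0, 0.0, 1.0, 0.0, 1.0])
```

Note that frameworks such as PyTorch parameterize dilation as the spacing between kernel elements, so their `dilation=1` corresponds to `zeros_between=0` in this sketch.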

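A convolutional layer computes each hidden unit by convolving the input, adding a bias $\beta$, and passing the result through an activation function $a[\cdot]$. With kernel size three and stride one, the $i$th hidden unit is:

$$h_i = a \left[ \beta + w_1 x_{i-1} + w_2 x_i + w_3 x_{i+1} \right] = a \left[ \beta + \sum_{j=1}^{3} w_j x_{i+j-2} \right]$$
where the bias $\beta$ and kernel weights $w_1, w_2, w_3$ are trainable parameters, and (with zero-padding) we treat the input $x$ as zero when it is out of the valid range. This is a special
case of a fully connected layer, which computes the $i$th hidden unit as:
$$h_i = a \left[ \beta_i + \sum_{j=1}^{D} w_{ij} x_j \right]$$
The convolutional layer corresponds to setting most of the weights $w_{ij}$ to zero and constraining the remaining weights (and the biases $\beta_i$) to take the same values at every position.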
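The claim that a convolutional layer is a special case of a fully connected layer can be illustrated by building the equivalent weight matrix explicitly. This is a sketch under stated assumptions: ReLU as the activation, zero padding, kernel size three, and as many hidden units as inputs. All function names and values are illustrative.

```python
def relu(v):
    return max(0.0, v)


def conv_layer(x, w, beta):
    """h_i = a[beta + w1*x_{i-1} + w2*x_i + w3*x_{i+1}] with zero padding."""
    padded = [0.0] + list(x) + [0.0]
    return [
        relu(beta + sum(w[j] * padded[i + j] for j in range(3)))
        for i in range(len(x))
    ]


def conv_as_dense_weights(w, D):
    """Build the D x D fully connected weight matrix equivalent to kernel w."""
    W = [[0.0] * D for _ in range(D)]
    for i in range(D):
        for j in range(3):
            col = i + j - 1          # input index touched by kernel weight j
            if 0 <= col < D:
                W[i][col] = w[j]     # the same three values on every row
    return W


def dense_layer(x, W, biases):
    """h_i = a[beta_i + sum_j W[i][j] * x_j]."""
    return [
        relu(b + sum(wij * xj for wij, xj in zip(row, x)))
        for row, b in zip(W, biases)
    ]


x = [1.0, -2.0, 0.5, 3.0, -1.0]
w = [0.5, -1.0, 0.25]
beta = 0.5

W = conv_as_dense_weights(w, len(x))
h_conv = conv_layer(x, w, beta)
h_dense = dense_layer(x, W, [beta] * len(x))   # one shared bias, repeated

nonzero = sum(1 for row in W for v in row if v != 0.0)
print(h_conv, h_dense, nonzero, "of", len(x) ** 2, "weights are non-zero")
```

The dense matrix has 25 entries, but only 13 are non-zero and they take just three distinct values, which is the sense in which the convolutional layer is both sparse and weight-sharing.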

If we only apply a single convolution, information will likely be lost; we are averaging
nearby inputs, and the ReLU activation function clips results that are less than zero.
Hence, it is usual to compute several convolutions in *parallel*. Each convolution produces
a new set of hidden variables, termed a *feature map* or *channel*.

Typically, multiple convolutions are applied to the input $x$ and the results are stored in channels. In general, the input and the hidden layers all have multiple channels. If
the incoming layer has $C_i$ channels and we select a kernel size $K$ per channel, the hidden units in each output channel are computed as a weighted sum over all $C_i$ channels and $K$
kernel entries using a weight matrix $\Omega \in \mathbb{R}^{C_i \times K}$ and one bias. Hence, if there are $C_o$
channels in the next layer, then we need $\Omega \in \mathbb{R}^{C_i \times C_o \times K}$ weights and $\beta \in \mathbb{R}^{C_o}$ biases.
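The channel bookkeeping above can be sketched as follows. The code assumes zero padding, stride one, and an odd kernel size, and it returns pre-activations (no activation function is applied). The weights are indexed as `omega[c_out][c_in][k]`, which is a layout choice made for this example rather than the ordering in the text; `conv1d_multichannel` and `parameter_count` are hypothetical names.

```python
def conv1d_multichannel(x, omega, beta):
    """Multi-channel 1D convolution (pre-activations only).

    x:     C_i input channels, each a list of length N
    omega: weights indexed as omega[c_out][c_in][k]
    beta:  one bias per output channel
    """
    C_i, N = len(x), len(x[0])
    K = len(omega[0][0])
    half = K // 2
    out = []
    for co, kernels in enumerate(omega):
        channel = []
        for i in range(N):
            total = beta[co]
            # Weighted sum over all input channels and all kernel entries.
            for ci in range(C_i):
                for k in range(K):
                    pos = i + k - half
                    if 0 <= pos < N:      # zero padding outside the range
                        total += kernels[ci][k] * x[ci][pos]
            channel.append(total)
        out.append(channel)
    return out


def parameter_count(C_i, C_o, K):
    """C_i * C_o * K weights plus C_o biases."""
    return C_i * C_o * K + C_o


# Two input channels of length four, three output channels, kernel size three.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
omega = [
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],   # sums both channels over the window
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],   # passes input channel 0 through
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # passes input channel 1 through
]
beta = [0.0, 10.0, -1.0]

h = conv1d_multichannel(x, omega, beta)
n_params = parameter_count(C_i=2, C_o=3, K=3)   # 18 weights + 3 biases
```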

The *receptive field* of a hidden unit in the network is the region of the original input that
feeds into it. Consider a convolutional network where each convolutional layer has kernel
size three. The hidden units in the first layer take a weighted sum of the three closest
inputs, so have receptive fields of size three. The units in the second layer take a weighted
sum of the three closest positions in the first layer, which are themselves weighted sums
of three inputs. Hence, the hidden units in the second layer have a receptive field of size
five. In this way, the receptive field of units in successive layers increases, and information
from across the input is gradually integrated.
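The growth of the receptive field can be computed layer by layer. The sketch below uses the standard recurrence for undilated convolutions: each layer adds $(K - 1)$ times the current spacing between adjacent units, and a stride multiplies that spacing. The stride handling goes beyond the stride-one example in the text and is included as an assumption; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field size (in input positions) after each layer.

    Assumes ordinary (undilated) convolutions applied in sequence.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    size = 1   # a single input position sees only itself
    jump = 1   # spacing, in input positions, between adjacent units
    sizes = []
    for k, s in zip(kernel_sizes, strides):
        size += (k - 1) * jump
        jump *= s
        sizes.append(size)
    return sizes


# Four layers with kernel size three and stride one, as in the text.
stride_one = receptive_field([3, 3, 3, 3])

# The same kernel size with stride two at every layer grows much faster.
stride_two = receptive_field([3, 3, 3], strides=[2, 2, 2])

print(stride_one, stride_two)
```

With stride one the receptive field grows linearly (three, five, seven, nine), matching the reasoning above; downsampling between layers is one way to make it grow faster.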

**Weighted sum**: In the context of deep neural networks, a weighted sum is a fundamental mathematical operation used to compute the input to a neuron (or node) in a network layer. It combines the inputs from the previous layer (or input data) with a set of weights, which represent the strength or importance of each input, and often includes a bias term. This operation is central to how neural networks process and transform data.

The difference in performance between a convolutional network and a fully connected network is probably not due to the difference in the number of parameters;
we know overparameterization usually improves performance. The likely
explanation is that the convolutional architecture has a *superior inductive bias* (i.e.,
interpolates between the training data better) because we have embodied some prior
knowledge in the architecture; we have forced the network to process each position in
the input in the same way. We know that the data were created by starting with a
template that is (among other operations) randomly translated, so this is sensible.

The fully connected network has to learn what each digit template looks like at every
position. In contrast, the convolutional network shares information across positions and
hence learns to identify each category more accurately. Another way of thinking about
this is that when we train the convolutional network, we search through a smaller family
of input/output mappings, all of which are plausible. Alternatively, the convolutional
structure can be considered a regularizer that applies an infinite penalty to most of the
solutions that a fully connected network can describe.

### convolutional neural networks for 2D images:
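For 2D inputs such as images, the kernel becomes a 2D array of weights. A $3 \times 3$ kernel applied to a 2D input with elements $x_{ij}$ computes a single layer of hidden units $h_{ij}$ as:
$$h_{ij} = a \left[ \beta + \sum_{m=1}^{3} \sum_{n=1}^{3} w_{mn} x_{i+m-2, j+n-2} \right]$$
where $w_{mn}$ are the entries of the convolutional kernel. This is a weighted sum over a square $3 \times 3$ region of the input, and the same kernel is applied at every position.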
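The 2D formula translates directly into nested loops. The following is a minimal sketch assuming zero padding, stride one, and ReLU as the activation $a$; the function name `conv2d_3x3` and the edge-detecting kernel are illustrative choices. With 0-based indices, the offsets $i+m-2$ and $j+n-2$ become `i + m - 1` and `j + n - 1`.

```python
def conv2d_3x3(x, w, beta):
    """h_ij = a[beta + sum_m sum_n w_mn * x_{i+m-2, j+n-2}], a = ReLU.

    x is a 2D list (rows x cols); w is a 3 x 3 list of kernel weights.
    Inputs outside the image are treated as zero (zero padding).
    """
    rows, cols = len(x), len(x[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = beta
            for m in range(3):
                for n in range(3):
                    r, c = i + m - 1, j + n - 1   # 0-based version of i+m-2
                    if 0 <= r < rows and 0 <= c < cols:
                        total += w[m][n] * x[r][c]
            h[i][j] = max(0.0, total)             # ReLU activation
    return h


# A 4 x 4 image that is dark on the left and bright on the right.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]

# A kernel that responds to dark-to-bright transitions along each row.
kernel = [[0.0, 0.0, 0.0],
          [-1.0, 0.0, 1.0],
          [0.0, 0.0, 0.0]]

h = conv2d_3x3(image, kernel, beta=0.0)
for row in h:
    print(row)
```

Each output row is non-zero only around the columns where the intensity steps up, and the response is the same in every row because the same kernel is applied at every position.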

further research: 
- **max pooling and mean (average) pooling**
- downsampling and upsampling
- transposed convolutions
- *cutout* (a dropout equivalent for CNNs)

### summary:
In convolutional layers, each hidden unit is computed by taking a weighted sum of the
nearby inputs, adding a bias, and applying an activation function. The weights and
the bias are the same at every spatial position, so there are far fewer parameters than
in a fully connected network, and the number of parameters doesn’t increase with the
input image size. To ensure that information is not lost, this operation is repeated with different weights and biases to create multiple channels at each spatial position.
Typical convolutional networks consist of convolutional layers interspersed with layers
that downsample by a factor of two. As a data example passes through the network, the
spatial dimensions usually decrease by factors of two, and the channels increase by factors
of two. At the end of the network, there are typically one or more fully connected layers
that integrate information from across the entire input and create the desired output. If
the output is an image, a mirrored *“decoder”* upsamples back to the original size.
The translational equivariance of convolutional layers imposes a useful inductive bias
that increases performance for image-based tasks relative to fully connected networks.
We described image classification, object detection, and semantic segmentation networks.
Image classification performance was shown to improve as the network became deeper.
However, subsequent experiments showed that increasing the network depth indefinitely
doesn’t continue to help; after a certain depth, the system becomes difficult to train.
This is the motivation for residual connections, which are the topic of the next notebook.

Notes:
1. Convolutional networks are typically initialized using
Xavier initialization (Glorot & Bengio, 2010) or He initialization (He et al., 2015), as described
in section 7.5. However, the ConvolutionOrthogonal initializer (Xiao et al., 2018a) is specialized for convolutional networks. Networks of up to 10,000 layers can be trained using this
initialization without the need for residual connections.
2. Dropout is effective for fully connected networks but less so for convolutional layers (Park &
Kwak, 2016). This may be because neighboring image pixels are highly correlated, so if a hidden
unit drops out, the same information is passed on via adjacent positions. This is the motivation
for spatial dropout and cutout. In spatial dropout (Tompson et al., 2015), entire feature maps
are discarded instead of individual pixels. This circumvents the problem of neighboring pixels
carrying the same information. Similarly, DeVries & Taylor (2017b) propose cutout, in which a
square patch of each input image is masked at training time. Wu & Gu (2015) modified max
pooling for dropout layers using a method that involves sampling from a probability distribution
over the constituent elements rather than always taking the maximum.
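
The two regularizers in note 2 can be sketched in a few lines. This is a simplified illustration, not the implementation from the cited papers. It makes two assumptions: surviving feature maps are rescaled by $1/(1-p)$ (the common "inverted dropout" convention), and the cutout patch is placed fully inside the image. The function names are hypothetical.

```python
import random


def spatial_dropout(feature_maps, drop_prob, rng):
    """Zero entire feature maps with probability drop_prob.

    Assumption: kept maps are rescaled by 1 / (1 - drop_prob).
    """
    keep_prob = 1.0 - drop_prob
    out = []
    for fmap in feature_maps:                 # fmap is a 2D list (H x W)
        if rng.random() < drop_prob:
            out.append([[0.0] * len(row) for row in fmap])      # drop channel
        else:
            out.append([[v / keep_prob for v in row] for row in fmap])
    return out


def cutout(image, patch_size, rng):
    """Mask one randomly placed square patch of a 2D image with zeros.

    Assumption: the patch lies entirely inside the image.
    """
    H, W = len(image), len(image[0])
    top = rng.randrange(H - patch_size + 1)
    left = rng.randrange(W - patch_size + 1)
    return [
        [0.0 if (top <= r < top + patch_size and left <= c < left + patch_size)
         else image[r][c]
         for c in range(W)]
        for r in range(H)
    ]


rng = random.Random(0)                        # seeded for repeatability

maps = [[[1.0] * 4 for _ in range(4)] for _ in range(8)]   # 8 channels, 4 x 4
dropped = spatial_dropout(maps, 0.5, rng)

image = [[1.0] * 6 for _ in range(6)]
masked = cutout(image, 3, rng)
```

Both functions are meant to be applied at training time only; at test time the feature maps and images are left unchanged.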