# COSE474-2024: Deep Learning HW2

# 7.1 From Fully Connected Layers to Convolutions

## Discussions

### 7.1.1 Invariance

- Convolutional neural networks (CNNs) systematize the idea of spatial invariance, exploiting it to learn useful representations with fewer parameters
  - Spatial invariance: if an object occurs in any image it will be detected irrespective of its position

A few desiderata guide the design of a neural network architecture suitable for computer vision:
1. **Translation invariance** (or translation equivariance): In the earliest layers, our network should respond similarly to the same patch, regardless of where it appears in the image.
2. **Locality principle**: The earliest layers of the network should focus on local regions, without regard for the contents of the image in distant regions.
3. As we proceed, deeper layers should be able to capture longer-range features of the image, in a way similar to higher-level vision in nature.

### 7.1.2 Constraining the MLP

We can consider an MLP with two-dimensional images **X** as inputs and their immediate hidden representations **H** similarly represented as matrices of the same shape.
For every hidden unit to receive input from every pixel, the weights must form a fourth-order tensor **W**. **U** contains the biases.

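Fully connected layer:
$$
[\mathbf{H}]_{i,j} = [\mathbf{U}]_{i,j} + \sum_{k} \sum_{l} [\mathbf{W}]_{i,j,k,l}[\mathbf{X}]_{k,l} = [\mathbf{U}]_{i,j} + \sum_{a} \sum_{b} [\mathbf{V}]_{i,j,a,b}[\mathbf{X}]_{i+a,j+b}
$$

The switch from **W** to **V** is purely a re-indexing with $k = i+a$ and $l = j+b$, so that $[\mathbf{V}]_{i,j,a,b} = [\mathbf{W}]_{i,j,i+a,j+b}$. The indices $a$ and $b$ run over both positive and negative offsets, covering the entire image. Under this parametrization, mapping a 1000 x 1000 image to a 1000 x 1000 hidden representation requires $10^{12}$ parameters, because each of the $10^6$ hidden units has one weight per input pixel.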

**Translation Invariance**

This implies that a shift in the input **X** should simply lead to a shift in the hidden representation **H**. This is only possible if **V** and **U** do not actually depend on (i,j).

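Writing $[\mathbf{V}]_{i,j,a,b} = [\mathbf{V}]_{a,b}$ and replacing **U** with a constant $u$, we can simplify the definition for **H**:
$$
[\mathbf{H}]_{i,j} = u + \sum_{a} \sum_{b} [\mathbf{V}]_{a,b}[\mathbf{X}]_{i+a,j+b}
$$

This is a convolution. **V** no longer depends on the location within the image, so the number of parameters drops from $10^{12}$ to about 4 x $10^6$: the offsets $a$ and $b$ each still range over roughly 2000 values.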
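The claim that a shift in **X** leads to the same shift in **H** can be checked numerically. The following is a minimal sketch in plain Python, not code from the course material: the function names `translation_invariant_layer` and `shift`, the dictionary encoding of **V**, and the choice to treat pixels outside the image as zero are all assumptions made for illustration.

```python
def translation_invariant_layer(X, V, u=0):
    """Compute H[i][j] = u + sum over (a, b) of V[(a, b)] * X[i+a][j+b].

    X is a list of rows; V maps offsets (a, b) to weights (hypothetical encoding).
    Pixels outside the image are treated as 0 (an assumption about the boundary).
    """
    h, w = len(X), len(X[0])
    H = [[u] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for (a, b), weight in V.items():
                r, c = i + a, j + b
                if 0 <= r < h and 0 <= c < w:
                    H[i][j] += weight * X[r][c]
    return H


def shift(X, di, dj, fill=0):
    """Move the contents of X down by di rows and right by dj columns."""
    h, w = len(X), len(X[0])
    out = [[fill] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r, c = i - di, j - dj
            if 0 <= r < h and 0 <= c < w:
                out[i][j] = X[r][c]
    return out


# A 7 x 7 image that is zero except for a small patch near the top-left corner.
X = [[0] * 7 for _ in range(7)]
X[1][1], X[1][2], X[2][1], X[2][2] = 1, 2, 3, 4

# The same weights are used at every location (i, j).
V = {(-1, 0): 1, (0, 0): -2, (0, 1): 3, (1, -1): 5}
u = 1

shift_then_layer = translation_invariant_layer(shift(X, 2, 3), V, u)
layer_then_shift = shift(translation_invariant_layer(X, V, u), 2, 3, fill=u)
print(shift_then_layer == layer_then_shift)
```

The two results agree here because the patch and its response stay inside the image after the shift. Near the border the zero-padding assumption would make them differ, which is a property of this sketch rather than of the derivation.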

**Locality**

We should not have to look very far away from location $(i,j)$ in order to glean relevant information to assess what is going on at $[\mathbf{H}]_{i,j}$. Outside some range $|a| > \Delta$ or $|b| > \Delta$, we therefore set $[\mathbf{V}]_{a,b} = 0$:

$$
[\mathbf{H}]_{i,j} = u + \sum_{a= -\Delta}^{\Delta} \sum_{b= - \Delta}^{\Delta} [\mathbf{V}]_{a,b}[\mathbf{X}]_{i+a,j+b}
$$

The number of parameters is now roughly $4\Delta^2$ (exactly $(2\Delta+1)^2$ weights), where $\Delta$ is typically smaller than 10.

- Convolutional neural networks are a special family of neural networks that contain convolutional layers.
- **V** is referred to as a *convolution kernel*, a *filter*, or simply the layer's weights, which are learnable parameters.
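The parameter counts quoted in this section can be reproduced with a few lines of arithmetic. This is a rough sketch under my own accounting: the function name `parameter_counts` is hypothetical, biases are ignored, and the translation-invariant count assumes the offsets $a$ and $b$ run over every value needed to reach any pixel from any location.

```python
def parameter_counts(height, width, delta):
    """Count the weights (biases excluded) at each stage of the derivation."""
    pixels = height * width
    return {
        # One weight for every (hidden unit, input pixel) pair.
        "fully_connected": pixels * pixels,
        # Offsets a in -(height-1)..(height-1) and b in -(width-1)..(width-1).
        "translation_invariant": (2 * height - 1) * (2 * width - 1),
        # Offsets a, b in -delta..delta, both ends included.
        "local_exact": (2 * delta + 1) ** 2,
        # The rounded figure used in the notes.
        "local_approx": 4 * delta ** 2,
    }


counts = parameter_counts(height=1000, width=1000, delta=5)
for name, value in counts.items():
    print(f"{name:>22}: {value:,}")
```

For a 1000 x 1000 image and $\Delta = 5$ this gives $10^{12}$ weights for the fully connected layer, 3,996,001 (about 4 x $10^6$) after imposing translation invariance, and 121 (about $4\Delta^2 = 100$) after imposing locality.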

### 7.1.3 Convolutions

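In mathematics, the *convolution* between two functions, say $f, g : \mathbb{R}^d \rightarrow \mathbb{R}$, is defined as
$$(f*g)(\mathbf{x}) = \int f(\mathbf{z})g(\mathbf{x} - \mathbf{z})\,d\mathbf{z}$$
That is, we measure the overlap between $f$ and $g$ when one function is flipped and shifted by $\mathbf{x}$. Whenever we have discrete objects, the integral turns into a sum.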
- For vectors from the set of square-summable infinite-dimensional vectors with index running over $\mathbb{Z}$: $(f*g)(i) = \sum_a f(a)g(i-a)$
- For two-dimensional tensors, we have a corresponding sum with indices (a,b) for $f$ and $(i-a,j-b)$ for $g$, respectively: $(f*g)(i,j) = \sum_a \sum_b f(a,b)g(i-a,j-b)$
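The two discrete sums above can be written out directly for finite inputs. This is a minimal sketch, assuming that $f$ and $g$ are zero outside the index ranges of the given lists. The names `conv1d` and `conv2d` are illustrative and do not refer to any library function.

```python
def conv1d(f, g):
    """(f*g)(i) = sum over a of f(a) * g(i - a), for finite lists f and g."""
    out = []
    for i in range(len(f) + len(g) - 1):
        total = 0
        for a in range(len(f)):
            if 0 <= i - a < len(g):  # g is taken to be 0 outside its list
                total += f[a] * g[i - a]
        out.append(total)
    return out


def conv2d(f, g):
    """(f*g)(i, j) = sum over a, b of f(a, b) * g(i - a, j - b)."""
    rows = len(f) + len(g) - 1
    cols = len(f[0]) + len(g[0]) - 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for a in range(len(f)):
                for b in range(len(f[0])):
                    if 0 <= i - a < len(g) and 0 <= j - b < len(g[0]):
                        out[i][j] += f[a][b] * g[i - a][j - b]
    return out


print(conv1d([1, 2, 3], [1, 1]))
print(conv2d([[1, 2], [3, 4]], [[1, 0], [0, 1]]))
```

Note the sign difference from the layer defined in 7.1.2: that layer indexes **X** at $(i+a, j+b)$, whereas the convolution here uses $(i-a, j-b)$. The two forms can be matched by flipping the kernel along both axes, so the distinction does not matter when the kernel is learned.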

### 7.1.4 Channels

- The convolutional layer picks windows of a given size and weighs intensities according to the filter **V**
- Images are third-order tensors, characterized by height, width, and channel. The channel can be regarded as assigning a multidimensional representation to each pixel location.
  - We index **X** as [**X**]$_{i,j,k}$. The convolutional filter has to adapt accordingly: instead of [**V**]$_{a,b}$ we now have [**V**]$_{a,b,c}$.
- We want an entire vector of hidden representations corresponding to each spatial location. These are sometimes called *channels* or *feature maps*.
- To support multiple channels in both inputs (**X**) and hidden representations (**H**), we can add a fourth coordinate to **V**:

$$
[\mathbf{H}]_{i,j,d} = u + \sum_{a= -\Delta}^{\Delta} \sum_{b= - \Delta}^{\Delta}\sum_c [\mathbf{V}]_{a,b,c,d}[\mathbf{X}]_{i+a,j+b,c}
$$

where $d$ indexes the output channels in the hidden representations **H**.
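The multi-channel formula translates into nested loops almost term by term. The sketch below is illustrative only: the name `conv_layer`, the nested-list layout of **X** and **V**, and the zero treatment of pixels outside the image are assumptions, and a practical implementation would use a tensor library instead.

```python
def conv_layer(X, V, delta, u=0):
    """H[i][j][d] = u + sum over a, b, c of V[a][b][c][d] * X[i+a][j+b][c].

    X[i][j][c] has shape height x width x in_channels.
    V[a + delta][b + delta][c][d] stores the weight for offset (a, b),
    input channel c and output channel d (hypothetical layout).
    Pixels outside the image are treated as 0 (an assumption).
    """
    h, w, c_in = len(X), len(X[0]), len(X[0][0])
    c_out = len(V[0][0][0])
    H = [[[u] * c_out for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    r, s = i + a, j + b
                    if not (0 <= r < h and 0 <= s < w):
                        continue
                    for c in range(c_in):
                        for d in range(c_out):
                            H[i][j][d] += V[a + delta][b + delta][c][d] * X[r][s][c]
    return H


# A 3 x 3 image with two input channels: a ramp 0..8 and a constant 1.
X = [[[3 * i + j, 1] for j in range(3)] for i in range(3)]

delta = 1
size = 2 * delta + 1
# Output channel 1 sums both input channels over the whole window.
V = [[[[0, 1] for c in range(2)] for b in range(size)] for a in range(size)]
# Output channel 0 copies input channel 0 at the centre offset (0, 0).
V[delta][delta][0][0] = 1

H = conv_layer(X, V, delta)
print(H[1][1], H[0][0])
```

With two input channels, two output channels and $\Delta = 1$, the kernel holds $3 \times 3 \times 2 \times 2 = 36$ weights, independent of the image size.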

- Translation invariance in images implies that all patches of an image will be treated in the same manner.
- Locality means that only a small neighborhood of pixels will be used to compute the corresponding hidden representations.
- Some of the earliest references to CNNs are in the form of the Neocognitron.
- Adding channels allowed us to bring back some of the complexity that was lost due to the restrictions imposed on the convolutional kernel by locality and translation invariance.