
---

Lecture 6
=======

Outline
---------

### I Introduction to the convolution operation

### II Building blocks of ConvNets
> A. Convolutional kernels
>
> B. Pooling
>
> C. Strides and downsampling
>
> D. Equivariance and Invariance of the Convolutional Filters
>
> E. Schematic picture of the CNN architecture at work

### III First applications of CNN
> A. The first use case: MNIST
>
> B. Real CNN test: Dogs vs. Cats dataset
>
> C. Dataset augmentation
>
> D. VGG16 and feature-extraction learning
---


---

I Introduction to the convolution operation
----------------

### Convolution in physics dates back at least to Huygens and the superposition principle
> A. Given a signal x(t) and a filter w(t-a), we define their convolution as

![Image](gfx/convDefinition.png "Convolution definition")
> Simple interpretation:
>> We average (or smear out) the input signal x(a) with the weights w(t-a), which depend on the offset t-a
>>
>> For example, this can be a time-decaying average of a time series up to the present time t
---
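The time-decaying average mentioned above can be sketched with a discrete convolution in NumPy. The step signal, the decay rate, the filter length and the helper name `causal_convolve` are all illustrative assumptions, and `a` is taken to be the summation variable.

```python
import numpy as np

def causal_convolve(x, w):
    """Discrete convolution s[t] = sum_a x[a] * w[t - a], truncated to len(x)."""
    return np.convolve(x, w, mode="full")[: len(x)]

# Hypothetical signal: a unit step switched on at t = 3.
x = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Hypothetical filter: exponentially decaying weights, normalized to sum to 1,
# so that recent samples count more than older ones.
decay = 0.5
w = decay ** np.arange(4)
w /= w.sum()

s = causal_convolve(x, w)
print(s)  # the sharp step in x is smeared out over the filter length
```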

---

### For NN image processing one has a discrete 2D convolution:

![Image](gfx/conv2D.png "Convolution in 2D")

> A. In principle this linear operation can be generalized to arbitrary tensors
>
> B. For finite index ranges, care must be taken with the boundary elements (more on this later: padding)
>
> C. The above case transforms one 2D map into another 2D map and is realized by a single convolution 'kernel' K
>
> D. In general one can consider collections of kernels, or kernel tensors
---
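A minimal sketch of a discrete 2D convolution with a single kernel, evaluated only where the kernel fits entirely inside the image (so the boundary question from point B does not arise). The function name `conv2d_valid`, the `flip` switch and the test image are assumptions for illustration. With `flip=False` the function computes a cross-correlation, which is what most deep-learning libraries implement under the name "convolution".

```python
import numpy as np

def conv2d_valid(image, kernel, flip=True):
    """2D convolution over all positions where the kernel fits in the image."""
    if flip:
        kernel = kernel[::-1, ::-1]  # a true convolution flips the kernel
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the local patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical input: a linear intensity ramp on a 5 x 5 grid.
image = np.arange(25, dtype=float).reshape(5, 5)

# Discrete Laplacian kernel: it vanishes on any linear ramp.
kernel = np.array([[0.0, 1.0, 0.0],
                   [1.0, -4.0, 1.0],
                   [0.0, 1.0, 0.0]])

result = conv2d_valid(image, kernel)
print(result.shape)  # one 2D map in, one smaller 2D map out
```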

---

II Building blocks of Convolutional Networks
----------------

### Consider 2D image data processing
> A. Suppose we have images represented as 3D tensors V_{i,l,m} within 4D batches, with i labelling the channels (e.g. RGB) and l, m labelling the spatial positions
>
> We wish to apply to it a convolution represented by a 4D kernel tensor K_{i,j,m,n}, which yields another 3D tensor: the layer output
>
> The aim is to transform one multi-channel representation of the data into another, preserving the channelled 2D image structure
> 
> B. For the case at hand we have the output tensor Z_{i,j,k}:

![Image](gfx/convNetFormula.png "2D CNN Convolution")
>
> Here i and j label the input and output layer channels, while m and n label the width and height offsets between the corresponding pixels
>
> C. This algebraic structure replaces the Dense layers used so far and defines the new input-output maps of the CNN
>
> D. For a 3 x 3 kernel without padding, the output shrinks by two pixels in each direction (H and W). Zero-padding the boundary preserves the size.
>
> E. Numerous constraints imposed on the structure of K_{i,j,m,n} define different types of convolutional networks
>
> F. The foremost demand in CNNs is that K is a block matrix dominated by a small number of identical sub-kernel families {K_1, K_2, ...}_{i,j,m,n} << K_{i,j,m,n}, and sparse (zero) otherwise
>
> G. In most cases a further demand is introduced: for a given layer i, all the sub-kernels are identical, K_1 = K_2 = ...
---
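A minimal NumPy sketch of the multi-channel convolution described above, for a single image (the batch axis is omitted). The index names (o for the output channel, c for the input channel), the shapes and the function name `conv_layer` are illustrative assumptions and need not match the slide's formula index for index. Like most deep-learning libraries, the sketch does not flip the kernel.

```python
import numpy as np

def conv_layer(V, K, pad=0):
    """Z[o, j, k] = sum_{c, m, n} V[c, j + m, k + n] * K[o, c, m, n]."""
    if pad:
        # Zero-pad the two spatial axes only, never the channel axis.
        V = np.pad(V, ((0, 0), (pad, pad), (pad, pad)))
    out_ch, in_ch, kh, kw = K.shape
    H = V.shape[1] - kh + 1
    W = V.shape[2] - kw + 1
    Z = np.zeros((out_ch, H, W))
    for m in range(kh):
        for n in range(kw):
            # Contract the input channels for one kernel offset (m, n).
            Z += np.einsum("oc,cjk->ojk", K[:, :, m, n], V[:, m:m + H, n:n + W])
    return Z

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 28, 28))   # hypothetical RGB image, channels first
K = rng.normal(size=(8, 3, 3, 3))  # 8 output channels, 3 x 3 kernels

Z_valid = conv_layer(V, K)         # no padding: two pixels lost per direction
Z_same = conv_layer(V, K, pad=1)   # zero-padding: spatial size preserved
print(Z_valid.shape, Z_same.shape)
```

The same 8 x 3 x 3 x 3 = 216 weights are reused at every spatial position, which is the constraint on K described in points F and G.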

---
### How is it different from the Dense layer?
> A. The most crucial difference is the sparse connectivity with parameter sharing along the layer
>
> B. We demand that, for a given layer index, the kernel contains only a few distinct parameters, replicated many times, with only limited local connections to the input data tensor
>
> In practice this means that the kernel consists of small local sets of connections, as opposed to a Dense layer, and that each set of connections has the same set of weights.
>
> C. These sets are called receptive fields and constitute learnable filters, the core objects of the CNN architecture
>
> D. Further, we output many channels corresponding to the label i, not just one as in the Dense case
>
> E. Each output channel is the result of the local filter action and represents the presence of the corresponding feature in the data sample
>
> F. Such an architecture has a dramatically reduced space (storage) complexity and a much lower inference time cost
>
> G. For example, processing a picture of 320 x 280 pixels for vertical edge detection:
>> Direct matrix multiplication implementation: 320 x 280 x 319 x 280 ~ 8 x 10^9 parameters
>>
>> Convolutional implementation: a 2-element kernel, i.e. 2 parameters, applied with 2 x 319 x 280 ~ 180 x 10^3 multiplications
>
> H. The network can still represent non-local correlations by employing deep layers ('shadow casting')

---
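The edge-detection arithmetic can be checked with a short NumPy sketch. It counts the kernel weights separately from the multiplications. The test image (dark left half, bright right half) and the variable names are illustrative assumptions.

```python
import numpy as np

height, width = 280, 320
image = np.zeros((height, width))
image[:, width // 2:] = 1.0  # hypothetical image: dark left, bright right

# Convolutional version: one 2-element kernel (-1, 1) shared by every output
# pixel, written here as a difference of horizontally neighbouring pixels.
edges = image[:, 1:] - image[:, :-1]  # shape (280, 319)

conv_params = 2                            # the two kernel weights
conv_multiplications = 2 * edges.size      # two products per output pixel
dense_params = image.size * edges.size     # one weight per (input, output) pair

print(conv_params, conv_multiplications, dense_params)
```

The only non-zero responses sit in the single column where the intensity jumps, which is the vertical edge.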

---
### Parameter sharing consequences
> A. An important property emerges in ConvNets: local feature maps (results of convolutions) become approximately translationally equivariant
>
> B. This means that a given layer is in effect a characteristic function of a given feature and will detect it anywhere in the input data tensor
>
> C. Other possible symmetries are less obvious to implement

---
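Translational equivariance can be illustrated in one dimension: shifting the input shifts the feature map by the same amount. This is a sketch under assumed values (signal length, feature shape, shift), and `correlate1d_valid` is a hypothetical helper. The equivariance is only approximate in general because features pushed across the boundary are lost.

```python
import numpy as np

def correlate1d_valid(x, w):
    """Slide the filter w along x and record the response at each position."""
    n_out = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n_out)])

x = np.zeros(12)
x[3:6] = [1.0, 2.0, 1.0]           # a small feature located at position 3
w = np.array([1.0, 2.0, 1.0])      # shared filter matched to that feature

shift = 4
x_shifted = np.roll(x, shift)      # the same feature, now at position 7

y = correlate1d_valid(x, w)
y_shifted = correlate1d_valid(x_shifted, w)

# The strongest response follows the feature.
print(np.argmax(y), np.argmax(y_shifted))
```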

---
### Pooling, strides and downsampling
> A. Convolutional layers work in ConvNets together with Pooling layers
>
> B. The role of Pooling is to select the most prominent features, as observed by the Conv layer, and to substantially reduce the output layer size by a striding operation parametrized by an integer s
>
> C. The stride is the shift between neighbouring filter applications: s > 1 skips some pixels
> 
> D. There are numerous schemes for Pooling feature selection: 
>> Max Pooling
>>
>> Average Pooling
>>
>> Clustering
>>
>> Weighted geometric averaging
>>
>> Learnable Pooling
>
> E. In any case the result is a new output layer of size about N/s per spatial dimension, with N the input size along that dimension
>
> F. This operation effectively downsamples the feature maps, further reducing the spatio-temporal complexity of the model
---
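Max and Average Pooling with a stride can be sketched in a few lines of NumPy. The function name `pool2d`, its parameters and the 4 x 4 feature map are assumptions for illustration only.

```python
import numpy as np

def pool2d(x, size=2, stride=2, reducer=np.max):
    """Apply `reducer` to size x size windows placed every `stride` pixels."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            out[i, j] = reducer(x[top:top + size, left:left + size])
    return out

feature_map = np.arange(16, dtype=float).reshape(4, 4)

max_pooled = pool2d(feature_map)                   # keeps the largest value
avg_pooled = pool2d(feature_map, reducer=np.mean)  # keeps the window mean
print(max_pooled)
print(avg_pooled)
```

With s = 2 the 4 x 4 map is reduced to 2 x 2, i.e. N/s per direction.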

---
### Pooling and downsampling consequences

> A. A crucial property emerging from MaxPooling is the (approximate, local) translational invariance of the feature detection
>
> B. The CNN loses precise information on the location of a given feature in favour of detecting its presence: 'if it is, not where it is'
>
> C. This in fact is an assumption about the dataset, and can be wrong!
>
> D. Convolutions and Pooling may lead to underfitting if their invariances are absent from the dataset.
---
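A small one-dimensional sketch of what MaxPooling keeps and what it discards. The helpers `max_pool1d` and `one_hot`, the signal length and the feature positions are hypothetical. Shifts inside one pooling window leave the output unchanged, shifts across a window boundary do not, and pooling over the whole signal removes the position entirely.

```python
import numpy as np

def max_pool1d(x, size):
    """Non-overlapping max pooling: the stride equals the window size."""
    usable = len(x) - len(x) % size  # drop an incomplete last window
    return x[:usable].reshape(-1, size).max(axis=1)

def one_hot(position, length=8):
    """Signal with a single detected feature at `position`."""
    x = np.zeros(length)
    x[position] = 1.0
    return x

a, b, c = one_hot(4), one_hot(6), one_hot(3)

print(max_pool1d(a, 4), max_pool1d(b, 4))  # same window: identical outputs
print(max_pool1d(c, 4))                    # neighbouring window: output changes
print(max_pool1d(a, 8), max_pool1d(c, 8))  # global pooling: position is lost
```

If the position of the feature matters for the task, the last line shows exactly the information the model can no longer use.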

---

Schematic picture of CNN at work
-----------


### A. The idea: hierarchical image decomposition preserving local spatial correlations:

![Image](gfx/cnn-cat-assembly.jpg "CNN Cat Assembly")

### B. In practice, introduce local receptive fields: patches observing small portions of the feature map

![Image](gfx/cnn-action.jpg "CNN Action")

### C. Basic ConvNet stack for images

> ### 1. Start from a 3D (not 1D) input image tensor (Height, Width, Channels): (28, 28, 1) or (H, W, 3) for RGB
>
> ### 2. Introduce new elements: many small "convolutional kernels" defining local trainable feature filters, typically shaped (3, 3) or (5, 5). There can be a great many of them in a given layer.
>
> ### 3. Scan each pixel of the image with the kernels and produce the corresponding output-map pixels, one channel per kernel
>
> ### 4. Downsample the image by MaxPooling: select the most prominent local features
>
> ### 5. Repeat the layer stack as many times as needed
>
> ### 6. Add Dense layers and an output layer to build a classifier mapping to categories
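The stack in steps 1 to 6 can be followed with plain shape and parameter bookkeeping, without any deep-learning library. The layer sizes below (32, 64, 64 filters, 2 x 2 pooling, Dense layers of 64 and 10 units) are illustrative assumptions for a (28, 28, 1) input, not a prescribed architecture, and the helper names are hypothetical.

```python
def conv_block(shape, filters, kernel=3):
    """Unpadded convolution: returns the new shape and the parameter count."""
    h, w, c = shape
    params = kernel * kernel * c * filters + filters  # weights plus biases
    return (h - kernel + 1, w - kernel + 1, filters), params

def max_pool(shape, size=2):
    """Pooling has no parameters; it only shrinks the spatial size."""
    h, w, c = shape
    return (h // size, w // size, c)

def dense(n_in, n_out):
    """Fully connected layer: returns the output size and parameter count."""
    return n_out, n_in * n_out + n_out

shape = (28, 28, 1)  # step 1: (Height, Width, Channels)
total_params = 0

# Steps 2-5: (number of filters, whether MaxPooling follows) per block.
for filters, pool in [(32, True), (64, True), (64, False)]:
    shape, params = conv_block(shape, filters)
    total_params += params
    if pool:
        shape = max_pool(shape)
    print(shape)

# Step 6: flatten, then Dense layers ending in 10 categories.
features = shape[0] * shape[1] * shape[2]
for units in (64, 10):
    features, params = dense(features, units)
    total_params += params

print(total_params)
```

The spatial size shrinks from 28 x 28 to 3 x 3 while the channel count grows, so the final Dense layers only see 576 inputs.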

### D. Convolution action in detail

![Image](gfx/cnn-conv-layer-decomposition.png "CNN Conv Layer Decomposition")

### E. Local patches: valid convolution points with (3, 3) kernel

![Image](gfx/cnn-conv-padding.png "CNN Valid Convolution")

### F. Zero-Padding for boundary effects mitigation with (3, 3) kernel

![Image](gfx/cnn-conv-padding2.png "CNN Padding")

### G. Receptive field action: same size, filtered output

![Image](gfx/cnn-receptive-field.jpg "CNN receptive field")

### H. Strides: 
> ### Downsampling, used with pooling and sometimes with kernels
>
> ### This step is crucial for reducing the network's parameter count and for giving deep receptive fields a wide scope: 'hyperbolic' vs linear structure of the network (like AdS space!)
>
> ### Translation invariance and global correlations arise from selecting the locally maximal features

![Image](gfx/cnn-conv-stride.png "CNN Stride - Downsampling")
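
The effect of kernel size, zero-padding and stride on the output size along one direction follows the standard formula floor((N + 2p - k) / s) + 1. The sketch below only evaluates that formula. The function name `conv_output_size` and the example sizes are assumptions.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length for input length n along one spatial direction."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(28, 3))                       # valid 3 x 3 convolution
print(conv_output_size(28, 3, padding=1))            # zero-padded, size kept
print(conv_output_size(28, 3, padding=1, stride=2))  # strided convolution
print(conv_output_size(26, 2, stride=2))             # 2 x 2 pooling, stride 2
```

The same formula covers pooling, since a pooling window is placed on the feature map exactly like a kernel.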

### I. Cat image activations in a deeper (4th) layer
![Image](gfx/cnn-4th-layer-activations.png "4th layer activations")

---

III Applications
-------

### MNIST
### Cats vs Dogs
### Transfer learning with VGG16