# Chap 9: Convolutional Networks
**Convolutional networks (CNNs)**: a specialized kind of neural network for processing data that has a known, grid-like topology.
Ex: time-series data (a 1D grid) and images (a 2D grid).

Convolutional networks are simply neural networks that use **convolution** in place of general matrix multiplication in at least one of their layers.

## The Convolution Operation
Convolution is an operation on two functions of a real-valued argument:

$s(t) = \int x(a)w(t − a)da = (x * w)(t)$

- First argument: x - **input**

- Second argument: w - **kernel**

- Output: **feature map**

Discrete convolution:

$s(t) = (x ∗ w)(t) = \sum_{a=-\infty}^{\infty} x(a)w(t − a)$
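
The discrete sum above can be sketched directly in Python. This is a minimal illustration, assuming finite sequences whose values are zero outside the stored range, so the infinite sum reduces to a finite one. The helper name `conv1d` and the sample values are made up for this example.

```python
import numpy as np

def conv1d(x, w):
    """Discrete convolution s(t) = sum_a x(a) w(t - a) for finite sequences.

    Anything outside the stored range of x or w is treated as zero.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):      # skip terms where w(t - a) is zero
                s[t] += x[a] * w[t - a]
    return s

x = [1.0, 2.0, 3.0, 4.0]   # input
w = [0.5, 0.5]             # kernel: a two-point moving average
s = conv1d(x, w)           # feature map: [0.5, 1.5, 2.5, 3.5, 2.0]
```

The result agrees with `np.convolve(x, w)`, which computes the same sum in its default "full" mode.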

We often use convolutions over more than one axis at a time. Ex: a 2D image $I$ as input with a 2D kernel $K$:

$S(i, j) = (I ∗ K)(i, j) = \sum_m \sum_n I(m, n)K(i − m, j − n)$

Convolution is commutative

-> $S(i, j) = (K ∗ I)(i, j) = \sum_m \sum_n I(i − m, j − n)K(m, n)$

**Cross-correlation** is the same as convolution but without flipping the kernel:

$S(i, j) = (I ∗ K)(i, j) = \sum_m \sum_n I(i + m, j + n)K(m, n)$

![convolution][convolution]
Many machine learning libraries implement cross-correlation but call it convolution.
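
To make the difference concrete, here is a small sketch of both operations on the region where the kernel fits entirely inside the image. The function names and the sample arrays are hypothetical, chosen only for this illustration. Convolution is implemented as cross-correlation with the kernel flipped along both axes.

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum_m sum_n I(i + m, j + n) K(m, n), kernel fully inside I."""
    I = np.asarray(I, dtype=float)
    K = np.asarray(K, dtype=float)
    kh, kw = K.shape
    out_h = I.shape[0] - kh + 1
    out_w = I.shape[1] - kw + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolve2d(I, K):
    """Convolution: cross-correlation with the kernel flipped on both axes."""
    K = np.asarray(K, dtype=float)
    return cross_correlate2d(I, K[::-1, ::-1])

I = np.arange(16.0).reshape(4, 4)      # toy 4x4 "image"
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])             # asymmetric kernel
S_corr = cross_correlate2d(I, K)       # S_corr[0, 0] == 34.0
S_conv = convolve2d(I, K)              # S_conv[0, 0] == 16.0
```

With a symmetric kernel the two results coincide. When the kernel is learned, the flip only changes which learned kernel you end up with, which is likely why libraries can get away with the simpler operation.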

Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several entries constrained to be equal to other entries. 
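
A quick sketch of this matrix view for the 1D case, assuming "full" convolution with zeros outside the input. The helper `conv_matrix` is a name invented for this example. Each column holds a copy of the kernel shifted down by one row, so every diagonal repeats one kernel value (a Toeplitz structure).

```python
import numpy as np

def conv_matrix(w, n):
    """Build T with T[t, a] = w[t - a], so that T @ x equals x * w (full mode)."""
    w = np.asarray(w, dtype=float)
    k = len(w)
    T = np.zeros((n + k - 1, n))
    for a in range(n):
        T[a:a + k, a] = w      # same kernel values in every column, shifted down
    return T

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -1.0, 2.0])
T = conv_matrix(w, len(x))     # shape (6, 4), mostly zeros
s = T @ x                      # same values as np.convolve(x, w)
```

Of the 24 entries in `T`, only 12 are nonzero, and those 12 take just 3 distinct values: the tied entries are the kernel.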

## Motivation
Convolution leverages three important ideas that can help improve a ML system: 
- sparse interactions (sparse connectivity or sparse weights)
- parameter sharing
- equivariant representations

### Sparse interactions
Sparse interactions are achieved by making the kernel smaller than the input.
Ex: The input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels.

**Advantages**:
- We need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency.
- Computing the output requires fewer operations.

-> These improvements in efficiency are usually quite large
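
As a rough count, assume a layer with $m$ inputs and $n$ outputs, where each output in the sparse case is connected to only $k$ inputs (these symbols are introduced here for illustration):

$\text{dense: } m \times n \text{ parameters}, \quad O(m \times n) \text{ runtime per example}$

$\text{sparse: } k \times n \text{ parameters}, \quad O(k \times n) \text{ runtime per example}$

The saving comes from $k$ being much smaller than $m$, e.g. a kernel of tens of pixels against an input of thousands or millions.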

![sparse-interaction][sparse-interaction]

### Parameter sharing
Parameter sharing refers to using the same parameter for more than one function in a model.

In a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some boundary pixels, depending on how the boundary is handled).

-> Rather than learning a separate set of parameters for every location, we learn only one set.

Convolution is thus dramatically more efficient than dense matrix multiplication in terms of memory requirements and statistical efficiency.

The figure below shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image:
![edge-detection][edge-detection]
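
The edge-detection idea can be sketched with a two-element kernel that subtracts each pixel's left neighbour. The image size (280 x 320) and the synthetic half-dark, half-bright image are assumptions made for this illustration, not values stated in these notes.

```python
import numpy as np

def horizontal_edges(image):
    """Difference between each pixel and its left neighbour (kernel [-1, 1])."""
    image = np.asarray(image, dtype=float)
    return image[:, 1:] - image[:, :-1]

height, width = 280, 320              # assumed image size
image = np.zeros((height, width))
image[:, width // 2:] = 1.0           # dark left half, bright right half
edges = horizontal_edges(image)       # shape (280, 319), nonzero only at the boundary

# A general two-element kernel costs two multiplications and one addition
# per output pixel.
conv_flops = edges.size * 3           # 267,960
# A dense matrix for the same linear map needs one entry per
# (input pixel, output pixel) pair.
dense_entries = image.size * edges.size   # 8,003,072,000
```

The kernel has 2 parameters, while the equivalent dense matrix would have about 8 billion entries, almost all of them zero.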

### Equivariance
The particular form of parameter sharing leads to equivariance to translation.

A function is equivariant if, when the input changes, the output changes in the same way.

A function $f(x)$ is equivariant to a function $g$ if $f(g(x)) = g(f(x))$.

If we let $g$ be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to $g$.

Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.
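
The translation case can be checked numerically. In this sketch, $g$ shifts a 1D signal one step to the right by prepending a zero and $f$ is full discrete convolution. Both helper names and the sample values are invented for the example.

```python
import numpy as np

def shift_right(signal, steps=1):
    """g: translate a 1D signal by prepending zeros."""
    return np.concatenate([np.zeros(steps), signal])

def convolve(signal, kernel):
    """f: full discrete convolution."""
    return np.convolve(signal, kernel)

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0])
w = np.array([1.0, -1.0])

shifted_then_convolved = convolve(shift_right(x), w)   # f(g(x))
convolved_then_shifted = shift_right(convolve(x, w))   # g(f(x))
same = np.array_equal(shifted_then_convolved, convolved_then_shifted)  # True
```

Shifting the input and then convolving gives the same feature map as convolving and then shifting, which is the property $f(g(x)) = g(f(x))$.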

Some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data.

## Pooling 
A typical layer of a convolutional network consists of three stages:
1. The layer performs several convolutions in parallel to produce a set of linear activations.
2. Each linear activation is run through a nonlinear activation function, such as the rectified linear activation function (the detector stage).
3. A **pooling function** modifies the output of the layer further.

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.

-> Pooling helps make the representation approximately invariant to small translations of the input.

Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.
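
A small sketch of this effect, assuming max pooling over windows of width 3 with stride 1. The detector values are made up, and `np.roll` is used as a stand-in for a one-pixel shift (it wraps the last value around to the front).

```python
import numpy as np

def max_pool_1d(values, width=3):
    """Max pooling with stride 1: each output is the max of one window."""
    values = np.asarray(values, dtype=float)
    return np.array([values[i:i + width].max()
                     for i in range(len(values) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3])   # detector-stage outputs
shifted = np.roll(detector, 1)                         # input moved right by one

pooled = max_pool_1d(detector)            # [1.0, 1.0, 0.2, 0.3]
pooled_shifted = max_pool_1d(shifted)     # [1.0, 1.0, 1.0, 0.2]

inputs_changed = int(np.sum(detector != shifted))          # 6 of 6
outputs_changed = int(np.sum(pooled != pooled_shifted))    # 2 of 4
```

Every detector value changes after the shift, but only half of the pooled values do, because max pooling responds to the largest value in a neighbourhood rather than to its exact position.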





## Convolution and Pooling as an Infinitely Strong Prior







## The Neuroscientific Basis for Convolutional Networks

Convolutional networks:
- are an example of biologically inspired AI
- draw key design principles from neuroscience
- trace back to neuroscientific experiments
    - David Hubel and Torsten Wiesel established basic facts about how the mammalian vision system works, for which they won a Nobel Prize
        - They recorded the activity of individual neurons in cats.
        - They observed how neurons in the cat's brain responded to images projected in precise locations on a screen in front of the cat.
    - -> Neurons in the early visual system responded most strongly to very specific patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns.
       
We focus on a part of the brain called V1, also known as the primary visual cortex.
![V1](v1.jpg)

V1 is the first area of the brain that begins to perform significantly advanced processing of visual input.

Images are formed by light arriving in the eye and stimulating the retina. The neurons in the retina perform some simple preprocessing of the image but do not substantially alter the way it is represented.

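A convolutional network layer is designed to capture three properties of V1:
- V1 is arranged in a spatial map. It has a 2D structure mirroring the structure of the image in the retina.
- V1 contains many **simple cells**, whose activity can be approximated by a linear function of the image in a small, spatially localized receptive field.
- V1 also contains many **complex cells**, which respond to similar features but are invariant to small shifts in the position of the feature.

Beyond V1, "grandmother" cells have been shown to actually exist in the human brain, in a region called the medial temporal lobe.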

There are many differences between convolutional networks and the mammalian vision system:
- The human eye is mostly very low resolution, except for a tiny patch called the fovea.  Most convolutional networks actually receive large full resolution photographs as input.
- Convolutional networks are purely visual, whereas the human visual system is integrated with other factors such as moods and thoughts.
- The human visual system does much more than just recognize objects

Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions.
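
For intuition, a Gabor function is a Gaussian envelope multiplied by a cosine grating. The sketch below uses one common parameterization. The function name, parameter names, and default values are choices made for this illustration and are not taken from these notes.

```python
import numpy as np

def gabor_weights(size, alpha=1.0, beta_x=0.1, beta_y=0.1, f=1.0, phi=0.0, tau=0.0):
    """Weights w(x, y) = alpha * exp(-beta_x x'^2 - beta_y y'^2) * cos(f x' + phi).

    (x', y') are the grid coordinates rotated by the angle tau, with the
    receptive field centred on the middle of a size x size grid (size odd).
    """
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = xs * np.cos(tau) + ys * np.sin(tau)
    y_rot = -xs * np.sin(tau) + ys * np.cos(tau)
    envelope = np.exp(-beta_x * x_rot**2 - beta_y * y_rot**2)   # Gaussian gate
    return alpha * envelope * np.cos(f * x_rot + phi)           # times a grating

weights = gabor_weights(size=9)                  # grating varies along x
rotated = gabor_weights(size=9, tau=np.pi / 2)   # same pattern, varies along y
```

The envelope limits the cell to a local receptive field, while the cosine factor makes it respond to oriented brightness changes, much like the oriented bars described earlier.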

The response of a simple cell to an image:

$s(I) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} w(x, y) I(x, y)$

[convolution]: convolution.jpg
[sparse-interaction]: sparse-interaction.png
[edge-detection]: edge-detection.png