# Convolutional Neural Networks

Convolutional neural networks (CNNs) are a specialized kind of neural network for processing data that has a grid-like topology.

CNNs have been very successful in image processing and computer vision applications.

Convolutional neural networks utilize a specialized type of linear operation called __convolution__.

You could define convolutional neural networks as neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

## The convolution operation

In its most general form, convolution is an operation on two functions of a real-valued argument.

Let $x(t)$ and $w(t)$ be two functions of $t\in\mathbb R$. The __convolution__  operation is defined as:

$s(t)=(x \ast w)(t)= \int_{-\infty}^{\infty} x(a)w(t-a)da$

The first argument $x$ is often referred to as the __input__ and the second argument $w$ as the __kernel__.

If we assume that $x$ and $w$ are only defined on integers $t\in \mathbb Z$, we can define the __discrete convolution__:

$s(t)=(x \ast w)(t)=\sum_{a=-\infty}^{\infty} x(a)w(t-a)$
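As a minimal sketch of the discrete formula, the NumPy snippet below evaluates the sum directly for two short sequences. The function name `conv1d_full` and the sample values are illustrative assumptions, and the code assumes that both sequences are zero outside the indices they store, so the infinite sum reduces to a finite one.

```python
import numpy as np

def conv1d_full(x, w):
    """Evaluate s(t) = sum_a x(a) * w(t - a) for two finite sequences.

    Assumption: both sequences are zero outside the indices they store.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):  # w(t - a) is zero outside its stored range
                s[t] += x[a] * w[t - a]
    return s

x = [1.0, 2.0, 3.0]   # input (illustrative values)
w = [0.0, 1.0, 0.5]   # kernel (illustrative values)
s = conv1d_full(x, w)

# NumPy's built-in 1D convolution computes the same sum.
assert np.allclose(s, np.convolve(x, w))
```

The explicit loops are there only to mirror the formula; in practice a library routine such as `np.convolve` would be used instead.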

## The convolution operation on tensors

In machine learning, the input is usually a tensor of data and the kernel a tensor of parameters. Since both tensors are finite, we adapt the general form of convolution by replacing the infinite summation with a summation over a finite number of tensor elements; the values are assumed to be zero everywhere except at the finite set of points stored in the tensors.

We can define a convolution operation on tensors as follows:

Let $\mathbf X$ and $\mathbf W$ be two-dimensional matrices (rank 2 tensors). The convolution operation yields another two-dimensional matrix whose elements are defined as:

$z_{i,j}=(\mathbf X \ast \mathbf W)_{i,j}=\sum_m\sum_n x_{m,n} w_{i-m,j-n}$. 

Notice that as $m$ and $n$ increase, the index into the input increases, but the index into the kernel decreases. In essence, the kernel is __flipped__ relative to the input. Flipping the kernel makes convolution commutative, meaning we can equivalently write:

$z_{i,j}=(\mathbf W \ast \mathbf X)_{i,j}=\sum_m\sum_n x_{i-m,j-n} w_{m,n}$.
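The commutative property can be checked numerically. The sketch below is one possible direct implementation of the two-dimensional sum, assuming zero values outside the stored elements (a "full" convolution); the function name `conv2d_full` and the test matrices are illustrative, not part of any library.

```python
import numpy as np

def conv2d_full(X, W):
    """z[i, j] = sum_m sum_n X[m, n] * W[i - m, j - n].

    Assumption: X and W are zero outside their stored elements.
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    x_rows, x_cols = X.shape
    w_rows, w_cols = W.shape
    Z = np.zeros((x_rows + w_rows - 1, x_cols + w_cols - 1))
    for m in range(x_rows):
        for n in range(x_cols):
            # X[m, n] contributes X[m, n] * W[p, q] to output index (m + p, n + q),
            # which is the term with i - m = p and j - n = q in the sum above.
            Z[m:m + w_rows, n:n + w_cols] += X[m, n] * W
    return Z

X = np.arange(12, dtype=float).reshape(3, 4)   # illustrative input
W = np.array([[1.0, 0.0], [0.0, -1.0]])        # illustrative kernel

# Swapping the roles of input and kernel gives the same result.
assert np.allclose(conv2d_full(X, W), conv2d_full(W, X))
```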

Most neural network libraries implement a related function called the __cross-correlation__, which is the same as convolution but without flipping the kernel, yet refer to it as __convolution__:

$z_{i,j}=(\mathbf W \ast \mathbf X)_{i,j}=\sum_m\sum_n x_{i+m,j+n} w_{m,n}$.
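To make the relationship concrete, the sketch below computes the cross-correlation only at positions where the kernel fits entirely inside the input (often called "valid" mode, an assumption made here for simplicity) and then obtains a true convolution by flipping the kernel along both axes. The function names and sample values are hypothetical.

```python
import numpy as np

def cross_correlate2d_valid(X, W):
    """z[i, j] = sum_m sum_n X[i + m, j + n] * W[m, n], where the kernel fits."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    kh, kw = W.shape
    out_h = X.shape[0] - kh + 1
    out_w = X.shape[1] - kw + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Elementwise product of the kernel with the patch starting at (i, j).
            Z[i, j] = np.sum(X[i:i + kh, j:j + kw] * W)
    return Z

def convolve2d_valid(X, W):
    """Convolution expressed as cross-correlation with a flipped kernel."""
    W = np.asarray(W, dtype=float)
    return cross_correlate2d_valid(X, W[::-1, ::-1])

X = np.arange(16, dtype=float).reshape(4, 4)   # illustrative input
W = np.array([[1.0, 2.0], [3.0, 4.0]])         # asymmetric kernel

corr = cross_correlate2d_valid(X, W)
conv = convolve2d_valid(X, W)

# With an asymmetric kernel the two operations give different feature maps.
assert not np.allclose(corr, conv)
```

When the kernel is learned, the distinction rarely matters in practice: a network trained with cross-correlation simply learns the flipped version of the kernel it would have learned with true convolution.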

In machine learning, the output of a convolution is often referred to as a feature map.

The convolution (cross-correlation) formulas as defined in this section can be generalized to tensors of an arbitrary rank. In deep learning applications, the most common are 1D, 2D, 3D, and 4D convolutions.




## Variants of the basic convolution function

In practical CNN applications, the input is usually a grid of vector-valued observations, e.g. the RGB channels of a color image. We also usually want to perform multiple convolutions in parallel, because a single kernel can extract only one kind of feature, albeit at many spatial locations, and we usually want each layer of the network to extract many kinds of features at many locations. Finally, we almost always process inputs in batches.

As a result, CNNs usually perform multichannel convolutions. For example, let the input be an RGB image encoded as a rank 3 tensor (sometimes referred to as a volume) $\mathbf X\in \mathbb R^{h \times w \times C_{in}}$, where $x_{i,j,k}$ is the pixel value at row $i$, column $j$, and color channel $k$. If we want to convolve the input image with a set of kernels to generate $C_{out}$ feature maps, the kernel tensor $\mathbf W$ needs to be a rank 4 tensor $\mathbf W\in \mathbb R^{k_h \times k_w \times C_{in} \times C_{out}}$. The output volume of the multichannel convolution is defined as:

$\mathbf Z[:,:,l]=\sum_{k=1}^{C_{in}}\mathbf W[:,:,k,l] \ast \mathbf X[:,:,k]+\mathbf b[l]\quad$ where $l\in\{1,\dots,C_{out}\}$ and $\mathbf b\in\mathbb R^{C_{out}}$ holds one bias term per output feature map.
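The multichannel formula can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: it uses cross-correlation (as most libraries do), a stride of 1, and no padding, so the output has spatial size $(h-k_h+1)\times(w-k_w+1)$. The function name `conv2d_multichannel` and the random test data are hypothetical.

```python
import numpy as np

def conv2d_multichannel(X, W, b):
    """Multichannel cross-correlation with stride 1 and no padding (assumptions).

    X: (h, w, C_in) input volume
    W: (k_h, k_w, C_in, C_out) kernel tensor
    b: (C_out,) one bias per output feature map
    Returns Z with shape (h - k_h + 1, w - k_w + 1, C_out).
    """
    h, w, c_in = X.shape
    kh, kw, c_in_w, c_out = W.shape
    if c_in != c_in_w:
        raise ValueError("input and kernel disagree on the number of input channels")
    out_h, out_w = h - kh + 1, w - kw + 1
    Z = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i:i + kh, j:j + kw, :]  # shape (k_h, k_w, C_in)
            # Sum over kernel rows, kernel columns, and input channels at once,
            # producing one value per output channel, then add the biases.
            Z[i, j, :] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2])) + b
    return Z

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 6, 3))      # a 5x6 "image" with 3 channels
W = rng.standard_normal((2, 2, 3, 4))   # 2x2 kernels, 3 input and 4 output channels
b = rng.standard_normal(4)

Z = conv2d_multichannel(X, W, b)
print(Z.shape)  # one feature map per output channel
```

Each output channel `Z[:, :, l]` is the sum over input channels of a single-channel operation with the kernel slice `W[:, :, k, l]`, plus the scalar `b[l]`, matching the formula above. Batch processing would add one more leading axis to `X` and `Z`.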
