In [2]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

In [2]:
%run beautify_plots.py

In [3]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import os 


# Convolutional Neural Networks: in pictures

It may be easier to grasp the workings of a CNN in pictures.

We start with the simplest case of input $\x$ and pattern $\kernel$
- one non-feature dimension: a 1D vector
- one input feature
- one output feature

We work our way up to a more complicated case
- two non-feature dimensions: 2D matrix
- a number of input features
- a number of output  features (possibly different from the input)

The [notebook](CNN_pictorial.ipynb) illustrates the various possibilities.

In the remainder of this notebook: we explain the pictures.

# Preliminaries

# Behavior of a CNN layer

Layer $\ll$ in a Sequential NN transforms input $\y_{(\ll-1)}$ to output $\y_\llp$
- $\y_\llp$ is called a *feature map*, for all layers $\ll$
    - for each location in $\y_{(\ll-1)}$
    - it measures the intensity of the pattern match when the pattern is centered at that location 
    
So we write the input as $\y_{(\ll-1)}$ rather than the $\x$ we had used previously.

The size of all quantities in the convolution can vary by layer
- so we add a parenthesized subscript to indicate the layer

We write
- the kernel size as $f_\llp$ (can vary by layer) rather than the $f$ used previously
- the collection of kernels for layer $\ll$ as $\W_\llp$

In general a layer $\ll$ output $\y_\llp$ will have
- $N_\llp \gt 0$ non-feature dimensions
    - non-feature dimension $i$ has length (number of indices) $d_{\llp,i}$
        - for dimensions $0 \le i \lt N_\llp$
    - the set of indices in dimension $i$ is written as $D_i$
        - usually equal to $0, \ldots, d_{\llp,i} - 1$
- one feature dimension

A CNN Layer $\ll$
- preserves the non-feature dimensions (when padding is used)
$$
\begin{array}{lcll}
N_{(\ll-1)} & = & N_\llp \\
d_{(\ll-1),i} & = & d_{\llp,i} & 0 \le i \lt N_{(\ll-1)} \\
\end{array}
$$
- changes the length of the feature dimension
    - from $n_{(\ll-1)}$ to $n_\llp$

Thus the shapes of the input $\y_{(\ll-1)}$ and the output $\y_\llp$ may differ only in the length of the feature dimension
- provided padding is used
    - in the absence of padding: $\lfloor \frac{f_\llp}{2} \rfloor$ locations are lost at each boundary

Thus, for CNN layer $\ll$, the input and output shapes are

$$
\begin{array}{lcll}
|| \y_{(\ll-1)} || & = & (d_{(\ll-1),0} \times d_{(\ll-1),1} \times \ldots \times d_{(\ll-1), N_{(\ll-1)}-1}, & \mathbf{n_{(\ll-1)}} ) \\
|| \y_\llp || &  = & (d_{(\ll-1),0} \times d_{(\ll-1),1} \times \ldots \times d_{(\ll-1),N_{(\ll-1)}-1},  &\mathbf{n_\llp} )
\end{array}
$$

We write 
$$\y_{\llp, \idxb, j}$$ 
to denote feature $j$ of layer $\ll$ at non-feature dimension location $\idxb$

## Channel Last/First

We have adopted the convention of using the final dimension as the feature dimension.
- This is called *channel last* notation.

Alternatively: one could adopt a convention of the first dimension being the feature dimension.
- This is called *channel first* notation.

When using a programming API: make sure you know which notation is the default
- Channel last is the default for TensorFlow, but other toolkits may use channel first.
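
As a small illustration of the two conventions, here is a NumPy sketch that converts a channel-last array into a channel-first one. The array name and the $8 \times 8 \times 3$ shape are assumptions chosen for the example, not something fixed by any toolkit.

```python
import numpy as np

# Hypothetical color image in channel-last layout: (rows, cols, features)
image_channel_last = np.zeros((8, 8, 3))

# Move the feature axis to the front: (features, rows, cols)
image_channel_first = np.moveaxis(image_channel_last, -1, 0)

# Moving it back recovers the channel-last layout
round_trip = np.moveaxis(image_channel_first, 0, -1)

print(image_channel_last.shape, image_channel_first.shape, round_trip.shape)
```

Only the order of the axes changes; the data is the same in both layouts.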


## Kernel, Filter

There is one pattern per output feature.

A pattern is also called a *kernel*.

The kernels of layer $\ll$ are just the weights of the layer.

So kernel $j$ ($\kernel_{\llp,j}$) is just element $\W_{\llp,j}$ of the weights of layer $\ll$.

There is one kernel per output feature, so $n_\llp$ kernels
- $\kernel_{\llp,1}, \ldots, \kernel_{\llp, n_\llp}$

The length of the feature dimension of a kernel matches that of its input, i.e., $n_{(\ll-1)}$

The weight vector $\W_\llp$ therefore has multiple dimensions.  Our convention for each dimension is
- $\mathbf{W}_{\llp, j', \ldots,j}$
    - layer $\ll$
    - output feature $j$
    - spatial location: $\ldots \in \{1,2,3\}$
    - input feature $j'$
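
To make the multiple dimensions of the weights concrete, here is a minimal NumPy sketch for a 1D layer. The axis order (spatial location, input feature $j'$, output feature $j$) and the sizes are assumptions made for this illustration; a particular toolkit may order the axes differently.

```python
import numpy as np

f, n_in, n_out = 3, 2, 4          # kernel size, input features, output features (illustrative)
rng = np.random.default_rng(0)

# Assumed layout: (spatial location, input feature j', output feature j)
W = rng.normal(size=(f, n_in, n_out))

# Kernel for output feature j = 0: one weight per (spatial location, input feature)
kernel_0 = W[..., 0]

# Total number of weights in the layer (ignoring any bias term)
n_weights = W.size

print(kernel_0.shape, n_weights)
```

Each of the `n_out` kernels has a feature dimension of length `n_in`, matching the input.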

## Padding

Convolution centers the pattern at each location of the non-feature dimensions of the input.

But what happens when we try to center a pattern over the first/last location?
- the pattern may extend beyond the boundaries of the input

In such a case, we can choose to *pad* the input
- create a special padding input at the locations of the input beyond the original boundary

We will see this in pictures below.
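
Before the pictures, a small NumPy sketch of the effect of padding on the number of output locations may help. It assumes zero-valued padding and an odd kernel size; the variable names are illustrative only.

```python
import numpy as np

f = 3                                   # kernel size (assumed odd)
y_prev = np.arange(1.0, 7.0)            # input with 6 locations

# Zero padding: floor(f/2) extra locations at each boundary
padded = np.pad(y_prev, f // 2, mode="constant", constant_values=0.0)

# Number of locations at which the kernel fits entirely inside the data
n_without_padding = len(y_prev) - f + 1
n_with_padding = len(padded) - f + 1

print(padded, n_without_padding, n_with_padding)
```

Without padding, $\lfloor f/2 \rfloor = 1$ location is lost at each boundary; with padding, the output has as many locations as the input.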

## Activation of a CNN layer

Just like the Fully Connected layer, a CNN layer is usually paired with an activation.

The default activation $a_\llp$ in Keras is "linear"
- That is: it returns its input (the dot product) unchanged
- Always know what is the default activation for a layer; better yet: always specify !
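
The difference between a "linear" activation and a non-linear one can be sketched in plain NumPy. ReLU is used here only as a familiar example of a non-linear activation; it is an assumption of this sketch, not a choice made elsewhere in this notebook.

```python
import numpy as np

def linear(z):
    """'Linear' activation: returns the dot products unchanged."""
    return z

def relu(z):
    """ReLU activation (illustrative non-linear choice): negative values become 0."""
    return np.maximum(z, 0.0)

dot_products = np.array([-2.0, 0.5, 3.0])   # made-up pre-activation values
print(linear(dot_products), relu(dot_products))
```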

# Conv 1D: single feature to single feature

Convolutions pictured: sliding a pattern over the input

A *Convolution* is often depicted as
- A filter/kernel
- That is slid over each location in the non-feature dimensions of the input
- Producing a corresponding output for that location

Here's a picture with a kernel of size $f_\llp = 3$


<div><br>
    <center><strong>Conv 1D, single feature: sliding the filter</strong></center>
    <br>
<img src=images/W9_L1_S19_Conv1d_sliding.png width="80%">
    <!-- edX: Original: <img src="images/Conv1d_sliding.png"> replace by EdX created image --> 
</div>

After sliding the kernel over the whole of $\y_{(\ll-1)}$ we get the output feature map $\y_{\llp,1}$ for the first (and only) feature:

<div>
    <br>
    <center><strong>Conv 1D, single feature: output feature map</strong></center>
    <br>
<img src=images/W9_L1_S22_Conv1d.png width="80%">
    <!-- edX: Original: <img src="images/Conv1d.png"> replace by EdX created image --> 
    </div>


Element $j$ of output $\y_{\llp, \ldots, 1}$ (i.e., $\y_{\llp,j,1}$)
- Is colored (e.g., $j=1$ is colored Red)
- Is computed by applying the *same* $\W_{\llp,1}$ to 
    - The $f_\llp$ elements of $\y_{(\ll-1),1}$, centered at $\y_{(\ll-1),j,1}$
    - Which have the same color as the output

Note however that,  at the "ends" of $\y_{(\ll-1)}$
the kernel
may extend beyond the input vector.

In that case $\y_{(\ll-1)}$ may be extended with *padding* (elements with $0$ value typically)
- illustrated with the boxes with broken-line edges
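
The sliding picture can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an odd kernel size and zero padding; the function name `conv1d_single` and the numbers are made up for the example.

```python
import numpy as np

def conv1d_single(y_prev, kernel):
    """Slide `kernel` over the 1D input, zero-padding so the output has the same length."""
    f = len(kernel)
    padded = np.pad(y_prev, f // 2)          # zeros beyond both boundaries
    # Output j is the dot product of the kernel with the f inputs centered at j
    return np.array([np.dot(padded[j:j + f], kernel) for j in range(len(y_prev))])

y_prev = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])          # the same weights are used at every location

y = conv1d_single(y_prev, kernel)
print(y)
```

Note that the last output differs from the others only because the kernel overlaps a padded (zero) element at the end of the input.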

# Conv1d transforming single feature to multiple features

When there are multiple output features ($n_\llp > 1$) there is *one kernel per output feature*
- $\kernel_{\llp,1}, \ldots, \kernel_{\llp, n_\llp}$

Here are the 2 kernels for two output features, assuming $n_{(\ll-1)} = 1$

<div>
    <br>
    <center><strong>Conv 1D: 1 input feature, 2 output features</strong></center>
    <br>
<img src=images/Conv1d_2_to_1feature_kernel.png width="35%">
    <br>
    </div>

- $\mathbf{W}_{\llp, j', \ldots,j}$
    - layer $\ll$
    - output feature $j$
    - spatial location: $\ldots \in \{1,2,3\}$
    - input feature $j'$

Here is a [picture](CNN_pictorial.ipynb#Conv-1D:-single-feature-to-multiple-features) of a Convolutional layer $\ll$
transforming 
- a  1-dimensional input layer $(\ll-1)$  consisting of a single feature 
    - $N_{(\ll-1)} = 1, n_{(\ll-1)} =1$
- into a 1-dimensional output layer $\ll$  consisting of *multiple* features 
    - $N_\llp = 1, n_\llp  > 1$
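
A minimal NumPy sketch of this case, with one kernel per output feature. It relies on `np.correlate` with `mode="same"`, which for an odd-length kernel centers the kernel at each location and treats out-of-range inputs as zero; the kernel values are made up for the illustration.

```python
import numpy as np

y_prev = np.array([1.0, 2.0, 3.0, 4.0])      # single input feature, 4 locations

# One kernel per output feature: shape (n_out, f) with n_out = 2, f = 3
kernels = np.array([[1.0, 0.0, -1.0],
                    [0.5, 0.5, 0.5]])

# Apply each kernel to the same input, then stack the feature maps (channel last)
y = np.stack([np.correlate(y_prev, k, mode="same") for k in kernels], axis=-1)

print(y.shape)
print(y)
```

The spatial length (4) is preserved, while the feature dimension grows from 1 to 2.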
    

# Conv1d transforming multiple features to multiple features

What happens when the input layer has multiple features ?
- e.g., applying Convolutional layer $(\ll+1)$ to the $n_\llp$ features created by Convolutional layer $\ll$

The answer is 
- The kernels of layer $\ll$ also have a *feature* dimension
    - Kernel dimensions are $(f_\llp \times n_{(\ll-1)})$ for one spatial dimension
- This kernel is applied
    - at each spatial location
    - to *all features* of layer $(\ll-1)$
    - Computing a generalized "dot product": sum of element-wise products



When the input $\y_{(\ll-1)}$ has more than one feature ($n_{(\ll-1)} > 1$)
- the kernel for each output feature must have feature dimension of length $n_{(\ll-1)}$

Here is the kernel for the first output feature, assuming $n_{(\ll-1)} = 2$
- its feature dimension has length 2.

There would be a similar kernel for each of the output features.


<div>
    <br>
    <center><strong>Conv 1D: 2 input features: kernel 1</strong></center>
    <br>
<img src=images/Conv1d_2feature_kernel.png width="35%">
    <!-- edX: obsolete:: <img src="images/W9_L2_S23_Conv1d_2feature_kernel.png "> replace by EdX created image --> 
    <br>
    </div>

- $\mathbf{W}_{\llp, j', \ldots,j}$
    - layer $\ll$
    - output feature $j$
    - spatial location: $\ldots \in \{1,2,3\}$
    - input feature $j'$

Notice that (apart from combining spatial locations)
- multiple feature maps from layer $(\ll-1)$ are combined into one feature map at layer $\ll$.
- This is how the "left" half-smile and "right" half-smile features combine into the single "smile" feature

Here is a [picture](CNN_pictorial.ipynb#Conv-1D:-Multiple-features-to-multiple-features) of a Convolutional layer $\ll$
transforming 
- a  1-dimensional input layer $(\ll-1)$  consisting of 2 features 
    - $N_{(\ll-1)} = 1, n_{(\ll-1)} = 2$
- into a 1-dimensional output layer $\ll$  consisting of 3 features 
    - $N_\llp = 1, n_\llp  = 3$


With an input layer having $N$ spatial dimensions, a Convolutional Layer $\ll$ producing $n_\llp$ features
- Preserves the "spatial" dimensions of the input
- Replaces the channel/feature dimension

That is
$$
\begin{array}{lcll}
|| \y_{(\ll-1)} || & = & (d_{(\ll-1),1} \times d_{(\ll-1),2} \times \ldots \times d_{(\ll-1),N }, & \mathbf{n_{(\ll-1)}} ) \\
|| \y_\llp || &  = & (d_{(\ll-1),1} \times d_{(\ll-1),2} \times \ldots \times d_{(\ll-1),N},  &\mathbf{n_\llp} )
\end{array}
$$
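
The multiple-features-to-multiple-features case for $N = 1$ can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions: channel-last input, zero padding, odd kernel size, and a weight layout of (spatial location, input feature, output feature). The function name `conv1d` is hypothetical.

```python
import numpy as np

def conv1d(y_prev, W):
    """y_prev: (d, n_in), channel last.  W: (f, n_in, n_out).  Returns (d, n_out)."""
    d, n_in = y_prev.shape
    f, n_in_w, n_out = W.shape
    assert n_in == n_in_w, "kernel feature dimension must match the input"
    pad = f // 2
    padded = np.pad(y_prev, ((pad, pad), (0, 0)))   # pad spatial axis only
    out = np.empty((d, n_out))
    for i in range(d):
        window = padded[i:i + f]                    # (f, n_in): all features, f locations
        # Generalized dot product: sum over locations and input features, per output feature
        out[i] = np.einsum("fc,fco->o", window, W)
    return out

y_prev = np.ones((5, 2))       # 5 locations, 2 input features
W = np.ones((3, 2, 3))         # f = 3, 2 input features, 3 output features
y = conv1d(y_prev, W)
print(y.shape)
print(y)
```

With all-ones inputs and weights, interior locations sum $3 \times 2 = 6$ products, while the two boundary locations see one padded row and sum to 4.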




# Conv2d: Two dimensional convolution ($N = 2$)

Thus far, the number of spatial dimensions has been $N = 1$.

Generalizing  to $N = 2$ is straightforward.
- The number of spatial dimensions (elements denoted by $\ldots$) expands from $1$ to $2$

When $N = 1$ and $\dim_1 =1$
- we have our case of $n_\llp$ features at a single location

We have shown that permuting the order of features has no effect on a Dense layer
- There is no ordering relationship among features

But when $\dim_1 > 1$, there is a *spatial ordering*.  For example
- a 2D image
- time ordered data

We need some terminology to distinguish the final dimension from the non-final dimensions

Suppose $\y_\llp$ is $(N_\llp+1)$ dimensional of shape 
$$
|| \y_{\llp} || = (\dim_{\llp,1} \times \dim_{\llp,2} \times \ldots \dim_{\llp,N_\llp} \,\, \times n_{\llp} )
$$

(Thus far: $N_\llp = 1$ and $n_{\llp} = 1$ but that will soon change)

The first $N_\llp$ dimensions $(\dim_{\llp,1} \times \dim_{\llp,2} \times \ldots \dim_{\llp,N_\llp} )$
- Are called the *spatial* dimensions of layer $\ll$

The last dimension (of size $ n_{\llp}$)
- Indexes the  features i.e., varies over the number of features
- Called the *feature* or *channel* dimension


**Notation**

- $N_\llp$ denotes the *number* of spatial dimensions of layer $\ll$
- $n_\llp$ denotes the *number of features* in layer $\ll$
- We elide the spatial dimensions as necessary, writing 
$$\y_{\llp, \ldots, j}
$$
to denote *feature map* $j$ of layer $\ll$
    - where the dots ($\ldots$) indicate the $N_\llp$ spatial dimensions
    - e.g., the feature map detecting a "smile" in the image of a face

For example
- A grey-scale image
    - $N = 2, n_\llp = 1$
    - Each pixel in the image has one feature
        - the grey-scale intensity 
    - There is an ordering relationship between 2 pixels
        - "left/right", "above/below"
- A color image
    - $N = 2, n_\llp = 3$ 
    - Each pixel in the image has 3 features/attributes
        - the intensity of each of the colors

One can imagine even higher dimensional data ($N > 2$)
- Equity data with "spatial location" identified by (Month, Day, Time)
    - With attributes: $\{ $ Open, High, Low, Close $\}$
    - Month/Day/Time are ordered

Note the distinction between the cases
- When layer $\ll$ has dimension $(\dim_\llp \times 1)$
    - a single feature
    - at $\dim_\llp = \dim_{(\ll-1)}$  *spatial* locations
- When layer $\ll$ has dimension $(1 \times \dim_\llp)$
    - (which is how we have implicitly been considering vectors when discussing the Dense layer type)
    - $\dim_\llp = \dim_{(\ll-1)}$ features
    - at a single spatial location

$n_\llp$ will always refer to the *number of features* of a layer $\ll$

Here is a [picture](CNN_pictorial.ipynb#Conv-1D:-single-feature) of a Convolutional layer $\ll$
transforming 
- a  1-dimensional input layer $(\ll-1)$  consisting of a single feature 
    - $N_{(\ll-1)} = 1, n_{(\ll-1)} =1$
- into a 1-dimensional output layer $\ll$  consisting of a single feature 
    - $N_\llp = 1, n_\llp =1$

We will generalize Convolution to deal with
- $N_\llp > 1$ spatial dimensions
- $n_\llp > 1$ features

As a preview of concepts to be introduced, consider
- the input layer $(\ll-1)$ is a two-dimensional ($N_{(\ll-1)} = 2$) grid of pixels
- $n_{(\ll-1)} = 1$
- layer $\ll$ is a Convolutional Layer identifying $n_\llp = 3$ features

<table>
    <tr>
        <th><center>Convolution: 1 input feature to 3 output features</center></th>
    </tr>
    <tr>
        <td><img src="images/Conv2d_multifeature_shape.png" width=80%></td>
    </tr>
</table>

Layer $(\ll-1)$ is a three-dimensional tensor: $8 \times 8 \times 1$
- Spatial dimension $8 \times 8$
- 1 feature map (channel dimension $= 1$)

- Kernel $\kernel_{\llp,j}$ is applied to each spatial location of layer $(\ll-1)$
- Detecting the presence of the pattern (defined by the kernel) at that location
    - kernel $\kernel_{\llp,1}$ detects an eye
- Which results in feature map $\y_{\llp,\dots,j}$ being created at layer $\ll$
    - $\y_{\llp,\dots,1}$ are indicators of the presence of an "eye" feature

**Convolutional Layer description**

With this terminology we can say that Convolutional Layer $\ll$:
- Transforms the $n_{(\ll-1)}$ feature maps of layer $(\ll-1)$
- Into $n_\llp$ feature maps of layer $\ll$
- Preserving the spatial dimensions: $\dim_{\llp,p} = \dim_{(\ll-1),p}$ for $1 \le p \le N_{(\ll-1)}$
- Uses a different kernel $\kernel_{\llp,j}$ for each output feature/channel $1 \le j \le n_\llp$
- Applies this kernel to *each* element in the *spatial* dimensions
- Recognizing a single feature at each location within the spatial dimension


<div>
    <br>
    <center><strong>Conv 2D: single input feature: kernel 1</strong></center>
    <br>
<img src=images/Conv2d_singlefeature_input_kernel.png width="35%">
    <!-- edX: obsolete: W9_L2_S37_Conv2d_singlefeature_input_kernel.png EdX created image --> 
    <br>
    </div>
    
- $\mathbf{W}_{\llp, j', \ldots,j}$
    - layer $\ll$
    - output feature $j$
    - spatial location: $\ldots \in \{ ( \alpha, \alpha' ) \in (\dim_{(\ll-1),1} \times \dim_{(\ll-1),2}) \}$
    - input feature $j'$    

Here is a [picture](CNN_pictorial.ipynb#Conv-2D:-single-feature-to-single-feature) of a Convolutional layer $\ll$
transforming 
- a  2-dimensional input layer $(\ll-1)$  consisting of 1 feature 
    - $N_{(\ll-1)} = 2, n_{(\ll-1)} = 1$
- into a 2-dimensional output layer $\ll$  consisting of 1 feature
    - $N_\llp = 2, n_\llp  = 1$



We can further generalize to producing multiple output features

Here is a [picture](CNN_pictorial.ipynb#Conv-2D:-single-feature-to-multiple-features) of a Convolutional layer $\ll$
transforming 
- a  2-dimensional input layer $(\ll-1)$  consisting of 1 feature 
    - $N_{(\ll-1)} = 2, n_{(\ll-1)} = 1$
- into a 2-dimensional output layer $\ll$  consisting of 2 features
    - $N_\llp = 2, n_\llp  = 2$


Dealing with multiple input features works the same way as for $N=1$:
- The dot product
- Is over a spatial region that now has a "depth" $n_{(\ll-1)}$ equal to the number of input features
- Which means the kernel has a depth $n_{(\ll-1)}$

<div>
    <br>
    <center><strong>Conv 2D: multiple input features: kernel 1</strong></center>
    <br>
<img src=images/Conv2d_multifeature_input_kernel.png width="50%">
    <!-- edX: obsolete W9_L2_S46_Conv2d_multifeature_input_kernel.png EdX created image --> 
    <br>
    </div>

Here is a [picture](CNN_pictorial.ipynb#Conv-2D:-multiple-features-to-single-feature) of a Convolutional layer $\ll$
transforming 
- a  2-dimensional input layer $(\ll-1)$  consisting of multiple features 
    - $N_{(\ll-1)} = 2, n_{(\ll-1)} = 2$
- into a 2-dimensional output layer $\ll$  consisting of 1 feature
    - $N_\llp = 2, n_\llp  = 1$


And finally: the general case for 2 spatial dimensions
    
Here is a [picture](CNN_pictorial.ipynb#Conv-2D:-multiple-features-to-multiple-features) of a Convolutional layer $\ll$
transforming 
- a  2-dimensional input layer $(\ll-1)$  consisting of multiple features 
    - $N_{(\ll-1)} = 2, n_{(\ll-1)} = 3$
- into a 2-dimensional output layer $\ll$  consisting of multiple features
    - $N_\llp = 2, n_\llp  = 2$
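
The general 2D case can also be sketched directly in NumPy. As with the other sketches, this assumes channel-last input, zero padding, an odd square kernel, and a weight layout of (row, column, input feature, output feature); the function name `conv2d` and the $8 \times 8$ size are illustrative only.

```python
import numpy as np

def conv2d(y_prev, W):
    """y_prev: (rows, cols, n_in), channel last.  W: (f, f, n_in, n_out).
    Returns (rows, cols, n_out)."""
    rows, cols, n_in = y_prev.shape
    f, n_out = W.shape[0], W.shape[-1]
    assert W.shape == (f, f, n_in, n_out), "kernel depth must match the input features"
    pad = f // 2
    padded = np.pad(y_prev, ((pad, pad), (pad, pad), (0, 0)))   # pad spatial axes only
    out = np.empty((rows, cols, n_out))
    for r in range(rows):
        for c in range(cols):
            window = padded[r:r + f, c:c + f, :]                # (f, f, n_in)
            # Dot product over the spatial region and its full depth, per output feature
            out[r, c] = np.tensordot(window, W, axes=3)
    return out

y_prev = np.ones((8, 8, 3))        # 8 x 8 locations, 3 input features
W = np.ones((3, 3, 3, 2))          # 3 x 3 kernels of depth 3, 2 output features
y = conv2d(y_prev, W)
print(y.shape)
```

The spatial dimensions ($8 \times 8$) are preserved and only the feature dimension changes, from 3 to 2. With all-ones inputs and weights, an interior location sums $3 \times 3 \times 3 = 27$ products, while a corner location overlaps padding and sums to 12.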



In [5]:
print("Done")

Done
