<h1>CS4618: Artificial Intelligence I</h1>
<h1>Convolutional Neural Networks</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [13]:
%load_ext autoreload
%autoreload 2
%matplotlib inline



In [14]:
from tensorflow.keras import Model
from tensorflow.keras import Input
from tensorflow.keras.layers import Rescaling
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten

from tensorflow.keras.optimizers import RMSprop

from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.datasets import mnist

<h1>Acknowledgement</h1>
<ul>
    <li>The first image is scanned from Figure 13-1 in: A. G&eacute;ron: 
        <i>Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edn)</i>, O'Reilly, 2019</li>
    <li>The final image was produced by adapting the code from 
        <a href="https://github.com/gwding/draw_convnet">https://github.com/gwding/draw_convnet</a>
    </li>
</ul>

<h1>Primate Vision</h1>
<ul>
    <li>In the primate vision system, there seems to be a hierarchy of neurons within the visual cortex:
        <figure>
            <img src="images/locality.png" />
        </figure>
    </li>
    <li>In the lowest layers, 
        <ul>
            <li>neurons have small local receptive fields, i.e. they respond to stimuli in a limited
                region of the visual field; and
            </li>
            <li>they respond to, e.g., spots of light.</li>
        </ul>
    </li>
    <li>In higher layers,
        <ul>
            <li>they combine the outputs of neurons in the lower layers;</li>
            <li>they have larger receptive fields; and</li>
            <li>they respond to, e.g., lines at particular orientations.</li>
        </ul>
    </li>
    <li>In the highest layers,
        <ul>
            <li>they respond to ever more complex combinations, such as shapes and objects.
            </li>
        </ul>
    </li>
    <li>There are perhaps as many as 8 layers in the visual cortex alone:
        <figure>
            <img src="images/vision.gif" />
        </figure>
    </li>
</ul>

<h1>Convolutional Neural Networks</h1>
<ul>
    <li>Convolutional Neural Networks (convnets) are widely used in computer vision and in other
        perceptual problems including speech recognition and natural language processing.
    </li>
    <li>We will use 2D convnets, which are widely used for datasets of images.</li>
    <li>They have nice properties, some of which resemble the visual cortex in primates:
        <ul>
            <li>They learn features that are <b>translation invariant</b>:
                <ul>
                    <li>A feature map in a convolutional layer will recognize that feature anywhere in
                        the image: bottom-left, top-right, &hellip;
                    </li>
                </ul>
            </li>
            <li>They learn <b>spatial hierarchies</b> of features:
                <ul>
                    <li>from small local features such as lines in lower layers up to larger shapes
                        in higher layers.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h1>MNIST Example</h1>

<ul>
    <li>This is how we <em>were</em> preprocessing the MNIST dataset. Note the flattening (reshaping):
    </li>
</ul>

In [15]:
# MNIST dataset

# Load MNIST into four Numpy arrays
(mnist_x_train, mnist_y_train), (mnist_x_test, mnist_y_test) = mnist.load_data()
mnist_x_train = mnist_x_train.reshape((60000, 28 * 28))
mnist_x_test = mnist_x_test.reshape((10000, 28 * 28))

<ul>
    <li>But below is what we will do from now on. Note the reshaping is now different:
    </li>
</ul>

In [16]:
# MNIST dataset

# Load MNIST into four Numpy arrays
(mnist_x_train, mnist_y_train), (mnist_x_test, mnist_y_test) = mnist.load_data()
mnist_x_train = mnist_x_train.reshape((60000, 28, 28, 1))
mnist_x_test = mnist_x_test.reshape((10000, 28, 28, 1))

<h1>Images are Rank 3 Tensors</h1>
<ul>
    <li>Grayscale images:
        <ul>
            <li>A grayscale image has a certain height $h$ and width $w$. Therefore, it makes sense to
                represent it as a rank 2 tensor (matrix) of integers in $[0, 255]$.
            </li>
            <li>Up to now, however, we have reshaped them into rank 1 tensors (vectors):
                <pre>
mnist_x_train = mnist_x_train.reshape((60000, 28 * 28))
                </pre>
                <figure>
                    <img src="images/reshape.png" />
                </figure>
                What is the disadvantage of this: what information gets destroyed?
            </li>
            <li>So, henceforth, we will not flatten them in this way.</li>
            <li>In fact, for consistency with colour images, we will treat grayscale images as rank 3 tensors
                of shape $(h, w, 1)$
                <pre>
mnist_x_train = mnist_x_train.reshape((60000, 28, 28, 1))                
                </pre>
                <figure>
                    <img src="images/grayscale.png" />
                </figure>
            </li>
        </ul>
    </li>
    <li>Colour images:
        <ul>
            <li>These will be rank 3 tensors: height $h$, width $w$, and channels (or depth) $d$.</li>
            <li>$d = 3$. Why?
                <figure>
                    <img src="images/rgb.png" />
                </figure>
            </li>
        </ul>
    </li>
    <li>Datasets of images:
        <ul>
            <li>Datasets of images will be rank 4 tensors: $(m, h, w, d)$.</li>
            <li>What is $m$?</li>
            <li>$m$ is the number of images in the dataset (or, during training, in the mini-batch).</li>
        </ul>
    </li>
    <li>Why will datasets of videos be rank 5 tensors?</li>
    <li>Because each video is a sequence of frames (images), which adds one more dimension.
    </li>
</ul>
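
<ul>
    <li>A minimal NumPy sketch of the ranks and shapes above. The array <code>images</code> is a
        hypothetical stand-in of five blank images, not the real MNIST data. The last lines hint at
        what flattening does to pixels that are vertical neighbours:
    </li>
</ul>

```python
import numpy as np

# Hypothetical stand-in for MNIST: 5 blank 28x28 grayscale images.
images = np.zeros((5, 28, 28), dtype=np.uint8)

flat = images.reshape((5, 28 * 28))      # each image becomes a rank 1 tensor
tensor = images.reshape((5, 28, 28, 1))  # each image becomes a rank 3 tensor

print(flat.ndim, flat.shape)             # the dataset of vectors is rank 2
print(tensor.ndim, tensor.shape)         # the dataset of images is rank 4
print(tensor[0].ndim, tensor[0].shape)   # one image is rank 3

# After flattening, pixel (row, col) lands at index row * 28 + col, so the
# vertical neighbours (10, 7) and (11, 7) end up 28 positions apart.
idx_a = np.ravel_multi_index((10, 7), (28, 28))
idx_b = np.ravel_multi_index((11, 7), (28, 28))
print(idx_b - idx_a)
```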

<h1>A Convnet for MNIST</h1>

In [17]:
inputs = Input(shape=(28, 28, 1))
x = Rescaling(scale=1./255)(inputs)
x = Conv2D(filters=32, kernel_size=(3, 3), activation="relu")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(filters=64, kernel_size=(3, 3), activation="relu")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(filters=64, kernel_size=(3, 3), activation="relu")(x)
x = Flatten()(x)
x = Dense(64, activation="relu")(x)
outputs = Dense(10, activation="softmax")(x)
convnet = Model(inputs, outputs)
convnet.compile(optimizer=RMSprop(learning_rate=0.0001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

<ul>
    <li>Note the input shape.</li>
    <li>Note the three numbers that configure convolutional layers:
        <ul>
            <li>number of feature maps (filters); and</li>
            <li>height and width of a window (sometimes called 'convolutional kernel'), which corresponds
                roughly to the idea of a receptive field.
            </li>
        </ul>
        There may be strides and padding.
    </li>
    <li>Note the numbers that configure max pooling layers:
        <ul>
            <li>height and width of a window (sometimes called the 'pooling window').</li>
        </ul>
        Again there may be strides and padding.
    </li>
    <li>Notice the flattening, similar to the reshaping we were doing on the MNIST dataset before.
        This enables us to have some layers at the 'top' of the network that are densely connected,
        in the familiar way.
    </li>
    <li>In particular, the output layer is determined by the task: here we're doing multi-class
        classification.
    </li>
</ul>

In [18]:
convnet.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 rescaling_1 (Rescaling)     (None, 28, 28, 1)         0         
                                                                 
 conv2d_3 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 13, 13, 32)       0         
 2D)                                                             
                                                                 
 conv2d_4 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                       

<ul>
    <li>Although we don't understand it fully yet, we'll train it.</li>
     <li>Training takes some time (unsurprising when we look at the number of parameters, above)
        but accuracy is now even higher.
    </li>
    <li>Memory requirements for the network and for all the results that get stored during training are high,
        which is one reason to reduce mini-batch size.
    </li>
</ul>

In [19]:
convnet.fit(mnist_x_train, mnist_y_train, epochs=20, batch_size=32, 
            verbose=0, validation_split=0.2, 
            callbacks=[EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)])

<keras.callbacks.History at 0x7f8ec8d4cf10>

In [20]:
test_loss, test_acc = convnet.evaluate(mnist_x_test, mnist_y_test)
test_acc



0.9876000285148621

<h1>Convolutional Layers</h1>
<ul>
    <li>Consider a neural network whose inputs are images (each is a rank 3 tensor).</li>
    <li>A 2D convolutional layer is a rank 3 tensor of neurons, whose shape is $(h, w, d)$:
        <ul>
            <li>where $d$, the depth, is the number of <b>feature maps</b></li>
        </ul>
    </li>
    <li>For simplicity to begin with, let's assume $d = 1$.</li>
    <li>Connections:
        <ul>
            <li>In the case of a dense layer, we saw that every neuron in that layer has connections from
                every neuron in the preceding layer.
            </li>
            <li>But in the case of a convolutional layer, every neuron in that
                layer has connections from only a small rectangular <b>window</b> of neurons
                in the preceding layer, typically $3 \times 3$ or $5 \times 5$.
                <figure style="text-align: center;">
                    <img src="images/rectangles.png" /><br />
                    <img src="images/conv_S1P0.gif" width="450px" />
                    <figcaption>
                        Animated GIF from <a href="www.MLinGIFS.aqeel-anwar.com">www.MLinGIFS.aqeel-anwar.com</a>
                    </figcaption>
                </figure>
            </li>
        </ul>
    </li>
</ul>

<h2>Convolutional layers: height and width</h2>
<ul>
    <li>Suppose the shape of the preceding layer is $(28, 28, 1)$ and the windows in the convolutional
        layer are $3 \times 3$
    </li>
    <li>This gives a convolutional layer whose height is 26 and whose width is 26. Why?</li>
    <li>Extra details that you can ignore in CS4618:
        <ul>
            <li>In fact, if we wish, we can make the convolutional layer have the same height and width
                as the preceding layer:
                <ul>
                    <li>Padding: add a border of zeros around the previous layer.
                    </li>
                </ul>
            </li>
            <li>And, if we wish, we can make the convolutional layer have even smaller height and width
                than the preceding layer:
                <ul>
                    <li>Strides: instead of overlapping windows, we can introduce a distance between
                        successive windows.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
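
<ul>
    <li>The height and width arithmetic can be sketched in plain Python. The helper
        <code>conv_output_size</code> is a hypothetical name introduced here for illustration. It
        assumes that <code>padding</code> is the number of zeros added on each side and that any
        incomplete final window is dropped:
    </li>
</ul>

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Number of window positions along one dimension of size n.

    k is the window size; padding zeros are added on each side.
    """
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(28, 3))             # 26: no padding, stride 1
print(conv_output_size(28, 3, padding=1))  # 28: a border of zeros keeps the size
print(conv_output_size(28, 3, stride=2))   # 13: a stride of 2 shrinks the layer further
```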

<h2>Convolutional layers: the weights of a feature map</h2>
<ul>
    <li>Continue to assume $d=1$: the convolutional layer consists of one feature map.</li>
    <li>The idea of a feature map is that it will learn a specific aspect (feature) of its input:
        <ul>
            <li>e.g. the presence of a vertical line;</li>
            <li>e.g. the presence of a pair of eyes.</li>
        </ul>
    </li>
    <li>Within one feature map, all neurons share the same weights!
        <figure>
            <img src="images/shared_weights.png" />
        </figure>
    </li>
    <li>Advantages:
        <ul>
            <li>This reduces the number of parameters that must be learned.</li>
            <li>More importantly, it means that the feature map will respond to the presence of
                that feature <em>no matter where it is in the input</em> (the <em>translational
                invariance</em>, mentioned earlier).
            </li>
            <li>You may see the word <b>filter</b> to refer to the weights of the neurons in
                a feature map.
            </li>
       </ul>
    </li>
</ul>
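
<ul>
    <li>A minimal NumPy sketch of weight sharing, leaving out the bias and the activation function.
        The function <code>feature_map</code> and the hand-picked (not learned) weights
        <code>vertical</code> are assumptions made for illustration. The same nine weights are applied
        at every window, so a vertical line gives the same peak response wherever it appears:
    </li>
</ul>

```python
import numpy as np


def feature_map(image, weights):
    """Slide one shared window of weights over a 2D image (no padding, stride 1)."""
    kh, kw = weights.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every output neuron uses the same weights, on a different window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * weights)
    return out


# Hand-picked weights that respond strongly to a vertical line.
vertical = np.array([[-1.0, 2.0, -1.0],
                     [-1.0, 2.0, -1.0],
                     [-1.0, 2.0, -1.0]])

left = np.zeros((8, 8))
left[:, 2] = 1.0             # vertical line in column 2
right = np.zeros((8, 8))
right[:, 5] = 1.0            # the same line, shifted to column 5

resp_left = feature_map(left, vertical)
resp_right = feature_map(right, vertical)

print(resp_left.shape)                                # (6, 6)
print(resp_left.max(), resp_left.argmax(axis=1)[0])   # peak 6.0 at output column 1
print(resp_right.max(), resp_right.argmax(axis=1)[0]) # peak 6.0 at output column 4
```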

<h2>Convolutional layers: stacks of feature maps</h2>
<ul>
    <li>Now consider the case where $d > 1$: the convolutional layer comprises a stack of $d$ feature maps.</li>
    <li>A neuron in a feature map in a convolutional layer is connected to a window of neurons
        in <em>each</em> of the feature maps of the previous layer (or, in the case of the first layer, in each
        of the channels of the input).
        <figure>
            <img src="images/hierarchy.png" />
        </figure>
    </li>
    <li>Note how this means that a feature map in one layer combines several feature maps (or channels) of
        the previous layer (the <em>spatial hierarchy</em>, mentioned earlier).
    </li>
</ul>
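
<ul>
    <li>A rough NumPy sketch of a layer with a stack of feature maps, using the shapes of the second
        convolutional layer in the MNIST convnet. The random values stand in for learned weights and
        for the outputs of the previous layer; they are assumptions, not taken from the trained model:
    </li>
</ul>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(13, 13, 32))         # previous layer: 32 feature maps
kernel = rng.normal(size=(3, 3, 32, 64))  # one 3x3x32 filter per output feature map
bias = np.zeros(64)                       # one bias per output feature map

out = np.zeros((11, 11, 64))
for i in range(11):
    for j in range(11):
        window = x[i:i + 3, j:j + 3, :]   # a 3x3 window in each of the 32 feature maps
        # Sum over height, width and input depth, for all 64 filters at once.
        out[i, j, :] = np.tensordot(window, kernel, axes=3) + bias
out = np.maximum(out, 0.0)                # ReLU activation

print(out.shape)                          # (11, 11, 64)
print(kernel.size + bias.size)            # 18496 parameters
```

<ul>
    <li>Each output value depends on all 32 feature maps of the previous layer, which is why the
        number of weights is $3 \times 3 \times 32 \times 64$ rather than $3 \times 3 \times 64$.
    </li>
</ul>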

<h1>Pooling Layers</h1>
<ul>
    <li>The goal is to have a layer that shrinks the number of neurons in higher layers:
        <ul>
            <li>to reduce the amount of computation;</li>
            <li>to reduce memory usage;</li>
            <li>to reduce the number of parameters to be learned, thus reducing the risk of
                overfitting; and
            </li>
            <li>to create a hierarchy in which higher convolutional layers contain information about
                the totality of the original input image.
            </li>
        </ul>
    </li>
    <li>Again, it works on rectangular windows: neurons in the pooling layer are connected to windows
        of neurons in the previous layer
        <ul>
            <li>typically $2 \times 2$;</li>
            <li>typically adjacent rather than overlapping.</li>
        </ul>
    </li>
    <li>E.g. if the previous layer has height $h$ and width $w$, and the pooling layer uses adjacent
        $2 \times 2$ pooling windows, then the pooling layer will have height $h/2$ and width $w/2$
        (rounded down when $h$ or $w$ is odd).
        <figure>
            <img src="images/pooling.png" />
        </figure>
    </li>
    <li>The depth of the pooling layer is the same as the depth of the previous layer.</li>
</ul>

<h2>Max pooling layers</h2>
<ul>
    <li>Pooling layers have no weights: nothing to learn.</li>
    <li>In a <b>max pooling layer</b>, 
        <ul>
            <li>a neuron in the pooling layer receives the outputs of the 
                neurons in the window 
                in the previous layer and outputs only the largest of them.
            </li>
        </ul>
    </li>
    <li>Pooling layers work on the feature maps independently, which is why they have the same depth
        as the previous layer.
    </li>
</ul>
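
<ul>
    <li>Max pooling with adjacent $2 \times 2$ windows can be sketched in NumPy as follows. The
        function <code>max_pool_2x2</code> is a hypothetical helper, not the Keras implementation;
        it assumes that a trailing odd row or column is simply dropped:
    </li>
</ul>

```python
import numpy as np


def max_pool_2x2(x):
    """Max pooling with adjacent 2x2 windows on a tensor of shape (h, w, d)."""
    h, w, d = x.shape
    h2, w2 = h // 2, w // 2
    x = x[:h2 * 2, :w2 * 2, :]            # drop a trailing odd row or column
    blocks = x.reshape(h2, 2, w2, 2, d)   # split height and width into 2x2 blocks
    return blocks.max(axis=(1, 3))        # largest value in each block, per feature map


a = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]]).reshape(4, 4, 1)

print(max_pool_2x2(a)[:, :, 0])                    # [[4 2]
                                                   #  [2 8]]
print(max_pool_2x2(np.zeros((11, 11, 64))).shape)  # (5, 5, 64): depth is unchanged
```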

<h1>Check Your Understanding</h1>
<ul>
    <li>Do you understand the numbers in the code?</li>
    <li>Do you understand the numbers in the output of <code>convnet.summary()</code>?</li>
    <li>Do you understand the diagram below?</li>
</ul>

<figure>
    <img src="images/mnist.png" />
</figure>
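
<ul>
    <li>One way to check the numbers is to trace shapes and parameter counts through the network by
        hand. The sketch below does this in plain Python for the layers in the code above. The helper
        functions are hypothetical and assume $3 \times 3$ windows with no padding, stride 1, and
        adjacent $2 \times 2$ pooling windows:
    </li>
</ul>

```python
def conv(shape, filters, k=3):
    """Output shape and parameter count of a convolutional layer."""
    h, w, d = shape
    params = k * k * d * filters + filters   # shared weights plus one bias per filter
    return (h - k + 1, w - k + 1, filters), params


def pool(shape, p=2):
    """Output shape and parameter count of a pooling layer."""
    h, w, d = shape
    return (h // p, w // p, d), 0            # nothing to learn


def dense(units_in, units_out):
    """Output size and parameter count of a dense layer."""
    return units_out, units_in * units_out + units_out


shape, total = (28, 28, 1), 0
layers = [("conv", 32), ("pool", None), ("conv", 64), ("pool", None), ("conv", 64)]
for kind, filters in layers:
    shape, params = conv(shape, filters) if kind == "conv" else pool(shape)
    total += params
    print(kind, shape, params)

n = shape[0] * shape[1] * shape[2]           # Flatten: 3 * 3 * 64 = 576
print("flatten", n, 0)
for units in (64, 10):
    n, params = dense(n, units)
    total += params
    print("dense", n, params)

print("total", total)                        # 93322
```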

<h1>Final Remarks</h1>
<ul>
    <li>Note how convolutional layers are computationally efficient:
        <ul>
            <li>They have fewer parameters than dense layers (although take care here, because each parameter is involved in more multiplications).</li>
            <li>They can be easily parallelised.</li>
        </ul>
        This is one reason for their popularity. 
    </li>
    <li>Consider tasks that involve audio, text and time series data. For these tasks, Recurrent Neural Networks would be the obvious choice (see CS4619 AI2). But, in some cases, 1D convolutional networks can be used successfully instead. (Although, to be fair, these days Transformers (also covered in CS4619 AI2) are displacing Recurrent Neural Networks and 1D convnets in many cases.)</li>
</ul>
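
<ul>
    <li>As a rough illustration of the first remark, compare the first convolutional layer of the
        MNIST convnet with a dense layer that has the same number of neurons. The dense layer is a
        thought experiment, not something built in this lecture:
    </li>
</ul>

```python
inputs = 28 * 28 * 1      # neurons in the preceding layer
outputs = 26 * 26 * 32    # neurons in the first convolutional layer

dense_params = inputs * outputs + outputs   # one weight per connection, plus biases
conv_params = 3 * 3 * 1 * 32 + 32           # one shared 3x3 filter per feature map, plus biases
conv_multiplications = outputs * 3 * 3      # each neuron multiplies 9 inputs by 9 weights

print(dense_params)           # 16981120
print(conv_params)            # 320
print(conv_multiplications)   # 194688
```

<ul>
    <li>So the 288 weights are each reused $26 \times 26 = 676$ times: few parameters, but many
        multiplications.
    </li>
</ul>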