For P5 we've been instructed to create a Convolutional Neural Network to classify the CIFAR-10 dataset using Keras. The notebook is divided into several parts:

<ol>
<li>1: Necessary preparations</li>
<li>2: Data Preparation</li>
<li>3: A first model</li>
<li>???</li>
<li>5: References</li>
</ol>

I'll work through each part in turn and elaborate as I go:

<h2>1: Necessary preparations</h2>

We need to take some simple steps to get ready to create our model. First we need to import some libraries and classes:

In [348]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, ZeroPadding2D, Activation, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

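TensorFlow's Keras draws a model's initial weights and biases from TensorFlow's own random generator, not only from the NumPy random module. To make the notebook's behaviour as reproducible as possible, we seed both generators with a fixed number.

In [349]:
np.random.seed(42)
tf.random.set_seed(42)  # seeds TensorFlow's global generator, which the Keras initializers draw from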
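The effect of a seed can be seen with NumPy alone. The snippet below is only a small sketch with made-up variable names (`first_draw`, `second_draw`): re-seeding the generator with the same number reproduces exactly the same "random" draws.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.rand(3)    # three pseudo-random numbers

np.random.seed(42)                # reset the generator to the same state
second_draw = np.random.rand(3)   # the same three numbers again

print(np.array_equal(first_draw, second_draw))
```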

Then we load the dataset. It is pre-split into a training and a test set, so we assign the features and targets of both sets to variables.

In [350]:
# Even though x_train and x_test are matrices and not vectors, and convention might expect the variable names to be capital letters, this is how the Keras examples do it, and how I'll do it.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

Unfortunately, my NVIDIA GPU is currently lined up for surgery and out of commission. Even though I do have an AMD GPU, using that for TensorFlow/Keras requires a Linux installation, and skills with said Linux installation. I have neither of those, so I'll be training on the CPU. Have a look:

In [351]:
tf.config.list_physical_devices("GPU")

[]

<h2>2: Data Preparation</h2>

Before any actual model can be trained, the dataset needs some work. Let's start with the images.

<h4>Images</h4>

In [352]:
x_train.shape

(50000, 32, 32, 3)

The images are represented as NumPy arrays. According to the Keras documentation, every picture is a 32x32 grid of pixels with 3 values per pixel, one for each RGB color channel, which matches the (32, 32, 3) shape above. Each value lies between 0 and 255. The training set contains 50,000 images, and the test set 10,000.
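To make the channels-last layout concrete, here is a small sketch on a made-up stand-in image (`fake_image` is not part of the dataset, just an array of the same shape and dtype). Indexing the last axis selects a color channel, while indexing the first two axes selects a pixel.

```python
import numpy as np

# A stand-in for one CIFAR-10 image: height x width x channels, dtype uint8.
fake_image = np.zeros((32, 32, 3), dtype=np.uint8)
fake_image[..., 0] = 255  # set the red channel of every pixel to its maximum

red_channel = fake_image[..., 0]    # a 32x32 matrix with only the red values
top_left_pixel = fake_image[0, 0]   # the RGB triple of a single pixel

print(red_channel.shape, top_left_pixel)
```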

In [353]:
x_train

array([[[[ 59,  62,  63],
         [ 43,  46,  45],
         [ 50,  48,  43],
         ...,
         [158, 132, 108],
         [152, 125, 102],
         [148, 124, 103]],

        [[ 16,  20,  20],
         [  0,   0,   0],
         [ 18,   8,   0],
         ...,
         [123,  88,  55],
         [119,  83,  50],
         [122,  87,  57]],

        [[ 25,  24,  21],
         [ 16,   7,   0],
         [ 49,  27,   8],
         ...,
         [118,  84,  50],
         [120,  84,  50],
         [109,  73,  42]],

        ...,

        [[208, 170,  96],
         [201, 153,  34],
         [198, 161,  26],
         ...,
         [160, 133,  70],
         [ 56,  31,   7],
         [ 53,  34,  20]],

        [[180, 139,  96],
         [173, 123,  42],
         [186, 144,  30],
         ...,
         [184, 148,  94],
         [ 97,  62,  34],
         [ 83,  53,  34]],

        [[177, 144, 116],
         [168, 129,  94],
         [179, 142,  87],
         ...,
         [216, 184, 140],
        

Neural networks tend to train best with small input values, for example in the range 0 to 1. The easiest way to prepare the dataset is to divide all pixel values by 255, so that they are properly scaled into the desired range. We accomplish this with NumPy's vectorized operations, which apply the specified operation element-wise:

In [354]:
x_train = x_train / 255
x_test = x_test / 255
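As a small illustration of what this division does, here is a sketch on a tiny made-up array (`raw` is hypothetical, not taken from the dataset). Dividing a uint8 array by 255 yields float64 values between 0 and 1; casting to float32 first is an optional variation that halves the memory footprint.

```python
import numpy as np

raw = np.array([[0, 59, 255]], dtype=np.uint8)  # three example pixel values

scaled = raw / 255                        # true division promotes to float64
scaled32 = raw.astype(np.float32) / 255   # optional: stay in float32 to save memory

print(scaled.dtype, scaled.min(), scaled.max())
print(scaled32.dtype)
```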

In [355]:
y_train.shape

(50000, 1)

In [356]:
x_train

array([[[[0.23137255, 0.24313725, 0.24705882],
         [0.16862745, 0.18039216, 0.17647059],
         [0.19607843, 0.18823529, 0.16862745],
         ...,
         [0.61960784, 0.51764706, 0.42352941],
         [0.59607843, 0.49019608, 0.4       ],
         [0.58039216, 0.48627451, 0.40392157]],

        [[0.0627451 , 0.07843137, 0.07843137],
         [0.        , 0.        , 0.        ],
         [0.07058824, 0.03137255, 0.        ],
         ...,
         [0.48235294, 0.34509804, 0.21568627],
         [0.46666667, 0.3254902 , 0.19607843],
         [0.47843137, 0.34117647, 0.22352941]],

        [[0.09803922, 0.09411765, 0.08235294],
         [0.0627451 , 0.02745098, 0.        ],
         [0.19215686, 0.10588235, 0.03137255],
         ...,
         [0.4627451 , 0.32941176, 0.19607843],
         [0.47058824, 0.32941176, 0.19607843],
         [0.42745098, 0.28627451, 0.16470588]],

        ...,

        [[0.81568627, 0.66666667, 0.37647059],
         [0.78823529, 0.6       , 0.13333333]

<h4>Targets</h4>

The targets need work too. Let's have a look:

In [357]:
y_train

array([[6],
       [9],
       [9],
       ...,
       [9],
       [1],
       [1]], dtype=uint8)

The targets consist of the numbers 0-9, each representing one class. A loss function might interpret these numbers as ordinal, discrete or continuous data, even though they are completely separate and non-ordered categories. We therefore one-hot encode them and make sure our later network has 10 outputs, which represents the meaning of the data more appropriately.

In [358]:
n_classes = np.unique(y_train).size

y_train, y_test = to_categorical(y_train, n_classes), to_categorical(y_test, n_classes)

In [359]:
y_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]], dtype=float32)

That works.
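For intuition, one-hot encoding can also be written in a few lines of NumPy. This is only a sketch of the idea, not how `to_categorical` is implemented; `labels` and `num_classes` are made-up names for the example.

```python
import numpy as np

labels = np.array([[6], [9], [1]], dtype=np.uint8)  # same (n, 1) layout as y_train
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels.ravel()]

recovered = one_hot.argmax(axis=1)  # argmax undoes the encoding

print(one_hot)
print(recovered)
```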

<h2>3: A first model</h2>

As mentioned before, we'll be constructing a Convolutional Neural Network. This is more appropriate for the task at hand than the "traditional" neural network that we made before, because it more easily learns the concept of "proximity" in the images.

The most important resource for determining the architecture is Aldewereld et al. (2022). From the suggested template for usual layer patterns, I have arrived at the following:

The pattern of 2 sequential convolutional layers followed by a max-pooling layer is repeated twice. This is followed by 3 fully connected layers. Each part is explained in more detail as it is added.

The number of kernels in every layer is a power of 2, which tends to suit vectorized operations well. According to Karpathy (2016), it's quite usual for layers to become deeper and deeper (more filters) further into the network.

In general, the convolutional part of the network maintains the dimensions of its inputs, as suggested by Karpathy (2016).

In [360]:
base_model = Sequential()

We start with 2 convolutional layers with small (3x3) filter sizes, in an attempt to pick up on small patterns in the images. Setting padding="same" ensures that the output shape (width and height) is the same as the input shape. Each layer is followed by a ReLU activation function, because it is cheap to compute and does not saturate for positive inputs.

In [361]:
base_model.add(Conv2D(filters=32, kernel_size=(3, 3), input_shape=(32, 32, 3), padding="same"))
base_model.add(Activation('relu'))

In [362]:
base_model.add(Conv2D(filters=64, kernel_size=(3, 3), padding="same"))
base_model.add(Activation('relu'))

We scale down from 32x32 to 16x16 using a max pooling layer. This is a fairly standard followup to one or more convolutional layers, according to both Aldewereld et al. and Karpathy.

In [363]:
base_model.add(MaxPool2D(pool_size=(2, 2)))

After this we add 2 more convolutional layers. This time, the kernel size is bigger, in an attempt to pick up on more general patterns. We zero-pad with 2 on each side this time (different padding because of the different kernel size) to ensure that the output height and width remain equal to the input. Because of a <a href="https://stackoverflow.com/questions/45013060/how-to-do-zero-padding-in-keras-conv-layer">Keras Quirk</a>, I do this explicitly with a ZeroPadding2D layer for P=2 (with a stride of 1, padding="same" should give the same output shape). These layers use ReLU again, for the same reasons as before.

In [364]:
base_model.add(ZeroPadding2D(2))
base_model.add(Conv2D(filters=128, kernel_size=(5, 5)))
base_model.add(Activation('relu'))

In [365]:
base_model.add(ZeroPadding2D(2))
base_model.add(Conv2D(filters=256, kernel_size=(5, 5)))
base_model.add(Activation('relu'))

And again, we scale down from 16x16 to 8x8 using a max-pooling layer.

In [366]:
base_model.add(MaxPool2D(pool_size=(2, 2)))
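The width and height after each layer can be checked with the usual output-size formula, (W - K + 2P) / S + 1. The sketch below traces the spatial size through the layers added so far, assuming the Keras defaults of stride 1 for the convolutions and a pooling stride equal to the pool size. `conv_output_size` is a hypothetical helper written for this example, not a Keras function.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


trace = [32]                                                     # input width/height
trace.append(conv_output_size(trace[-1], kernel=3, padding=1))   # 3x3 conv, "same"
trace.append(conv_output_size(trace[-1], kernel=3, padding=1))   # 3x3 conv, "same"
trace.append(conv_output_size(trace[-1], kernel=2, stride=2))    # 2x2 max pooling
trace.append(conv_output_size(trace[-1], kernel=5, padding=2))   # padded 5x5 conv
trace.append(conv_output_size(trace[-1], kernel=5, padding=2))   # padded 5x5 conv
trace.append(conv_output_size(trace[-1], kernel=2, stride=2))    # 2x2 max pooling

flattened = trace[-1] * trace[-1] * 256  # values per image after the last pooling layer

print(trace, flattened)
```

A 3x3 kernel needs P=1 and a 5x5 kernel needs P=2 to keep the size unchanged, which is why the padding differs between the two blocks.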

To move onto usage of a densely connected part of the network, we need to flatten everything. We can easily accomplish this by adding a flatten layer.

In [367]:
base_model.add(Flatten())  # 8 x 8 x 256 = 16,384

After flattening, each image is represented by 16,384 values (these are activations, not parameters). We'll scale down that number through subsequent fully connected layers (still with ReLU activation functions), using powers of 2 for performance reasons.

We'll end with an output layer of 10 neurons, because we're classifying 10 categories. These use the softmax activation function, so that the outputs form a probability distribution that represents how confident the network is about its predictions.

In [368]:
base_model.add(Dense(1024))
base_model.add(Activation("relu"))
base_model.add(Dense(512))
base_model.add(Activation("relu"))
base_model.add(Dense(n_classes))
base_model.add(Activation("softmax"))
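To show what softmax does to the raw outputs of the last layer, here is a minimal NumPy sketch. The `softmax` function and the example values are my own illustration, not the Keras implementation: every output becomes positive and the outputs sum to 1.

```python
import numpy as np


def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max()   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()


example_logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores for 3 classes
probabilities = softmax(example_logits)

print(probabilities, probabilities.sum())
```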

We'll compile the model with a loss function and a metric: categorical cross-entropy and accuracy. Categorical cross-entropy drives the actual learning: it decreases gradually as the network approaches a local minimum, so useful gradients can be computed from it. We'll use accuracy as the final metric, because it's easier for humans to interpret and more directly represents how well the model might perform in the real world. No optimizer is passed, so Keras falls back on its default, RMSprop.

In [369]:
base_model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
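The difference between the loss and the metric can be made concrete with a small NumPy sketch. The function below is a simplified, single-sample version of categorical cross-entropy written for this example (the `eps` clipping value is an assumption). Both made-up predictions are "correct" in terms of accuracy, but the more confident one gets a lower loss, which is what gives the optimizer something to work with.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one one-hot target and one predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))


target = np.array([0.0, 0.0, 1.0])          # the true class is class 2
hesitant = np.array([0.3, 0.3, 0.4])        # correct, but barely
confident = np.array([0.05, 0.05, 0.9])     # correct and confident

loss_hesitant = categorical_crossentropy(target, hesitant)
loss_confident = categorical_crossentropy(target, confident)

print(loss_hesitant, loss_confident)
```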

In [370]:
base_model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_37 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_58 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_38 (Conv2D)           (None, 32, 32, 64)        18496     
_________________________________________________________________
activation_59 (Activation)   (None, 32, 32, 64)        0         
_________________________________________________________________
max_pooling2d_17 (MaxPooling (None, 16, 16, 64)        0         
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 20, 20, 64)        0         
_________________________________________________________________
conv2d_39 (Conv2D)           (None, 16, 16, 128)     

We can see that the convolutional layers indeed preserve the width and height of their inputs, which is good.
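The parameter counts in the summary can be checked by hand: a convolutional layer has one weight per kernel position per input channel plus one bias, for every filter, and a dense layer has one weight per input plus one bias, for every unit. The helper names below are made up for this sketch.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units


first_conv = conv2d_params(3, 3, 3, 32)         # first Conv2D layer
second_conv = conv2d_params(3, 3, 32, 64)       # second Conv2D layer
first_dense = dense_params(8 * 8 * 256, 1024)   # Dense(1024) after flattening

print(first_conv, second_conv, first_dense)
```

The first two numbers match the 896 and 18496 shown in the summary. The first dense layer turns out to hold by far the most parameters.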

Even though we've been instructed not to add anything beyond the required basics for the model, I will take one liberty: training stops automatically as soon as the accuracy on the validation set fails to improve after an epoch. Training on my CPU takes forever, and I would also like to evaluate loss and accuracy on the test set before the model gets overfitted, without workarounds like saving the model to disk every epoch. I justify this as not actually changing anything in the model, just interrupting training early.

In [371]:
callback = EarlyStopping(monitor="val_accuracy", patience=1)
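The stopping rule can be sketched in plain Python. This is a simplified model of the idea (higher is better, any improvement counts), not the actual Keras implementation, and the validation accuracies are made up for the example.

```python
def epochs_until_stop(val_accuracies, patience=1):
    """Return how many epochs run before training would be interrupted."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracies)


made_up_history = [0.55, 0.64, 0.68, 0.67, 0.70]
stopped_after = epochs_until_stop(made_up_history, patience=1)
stopped_after_patient = epochs_until_stop(made_up_history, patience=2)

print(stopped_after, stopped_after_patient)
```

With patience=1 a single dip is enough to stop training, even if a later epoch would have improved again. That is the trade-off I accept here to save training time.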

We'll train for a maximum of 10 epochs (more is just unmanageable time-wise without my GPU), with a batch size of 128. We split off 20% of the training set as a validation set, to monitor for overfitting.

In [None]:
base_epochs = 10  # Store the number of epochs in a separate variable for plotting purposes later
base_hist = base_model.fit(x_train, y_train, batch_size=128, epochs=base_epochs, verbose=1,
                           validation_split=0.2, callbacks=[callback])

Epoch 1/10
 37/313 [==>...........................] - ETA: 3:02 - loss: 4.2358 - accuracy: 0.1237
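The 313 steps per epoch in the progress bar follow from the split and the batch size, as this small sketch shows (the variable names are my own). Note that Keras takes the validation samples from the end of the arrays passed to `fit`, before shuffling.

```python
import math

total_samples = 50_000
validation_split = 0.2
batch_size = 128

train_samples = round(total_samples * (1 - validation_split))  # samples left for training
validation_samples = total_samples - train_samples             # samples held out
steps_per_epoch = math.ceil(train_samples / batch_size)        # last batch is smaller

print(train_samples, validation_samples, steps_per_epoch)
```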

Even though we can already see the numbers, a better insight into the model's training journey might be provided by viewing a plot of the accuracy on both the training and validation set.

In [None]:
plt.plot(base_hist.history["accuracy"], "rx-", label="Training Dataset Accuracy")
plt.plot(base_hist.history["val_accuracy"], "bx-", label="Validation Dataset Accuracy")

if len(base_hist.history["val_accuracy"]) < base_epochs:  # Training ended before the maximum number of epochs.
    plt.vlines([len(base_hist.history["val_accuracy"]) - 1],
               colors="y", linestyles="dashed",
               ymin=0, ymax=1,
               label="Training interruption")
plt.legend(loc="lower right")
plt.ylabel("Accuracy")
plt.xlabel("Epochs")
plt.xticks(range(base_epochs))
plt.title("Accuracy after every epoch")

We can see that even though training dataset accuracy keeps rising, the accuracy on the validation set improves only by small amounts after 2 epochs. It does continue to rise when examining the exact numbers. After the 5th epoch, the validation dataset accuracy finally stops rising, and training is interrupted to prevent overfitting. We can examine the model's final performance by evaluating it against the test set.

In [None]:
_, accuracy = base_model.evaluate(x_test, y_test)
f"Final test-set accuracy is {accuracy:.1%}."  # accuracy is a fraction between 0 and 1, so format it as a percentage

We end up with an accuracy of 75%.
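For reference, this is roughly what the accuracy metric computes for one-hot targets: the fraction of samples where the most probable predicted class equals the true class. The arrays below are made up for the example and are not model output.

```python
import numpy as np

true_one_hot = np.array([[0, 0, 1],
                         [1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
predicted = np.array([[0.1, 0.2, 0.7],    # correct
                      [0.6, 0.3, 0.1],    # correct
                      [0.5, 0.4, 0.1],    # wrong: predicts class 0 instead of 1
                      [0.2, 0.2, 0.6]])   # correct

hits = predicted.argmax(axis=1) == true_one_hot.argmax(axis=1)
manual_accuracy = float(np.mean(hits))
formatted = f"{manual_accuracy:.1%}"

print(formatted)
```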

<h2>5: References</h2>

Keras Team. (n.d.). Keras documentation: CIFAR10 small images classification dataset. Keras. Retrieved March 26, 2022, from https://keras.io/api/datasets/cifar10/

Aldewereld, H., van der Bijl, B., Bunk, J., & van Moergestel, L. (2022). Machine Learning (reader). Utrecht: Hogeschool Utrecht.

Karpathy, A. [Andrej Karpathy]. (2016, January 28). CS231n Winter 2016: Lecture 7: Convolutional Neural Networks [Video]. YouTube. https://www.youtube.com/watch?v=LxfUGhug-iQ