<a href="https://colab.research.google.com/github/jpcompartir/dl_notebooks/blob/main/keras_computer_vision_conv_nets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



We'll look at simple convolutional neural networks (convnets), followed by more complex convnet architectures and advanced techniques. We use convnets for a variety of tasks in computer vision:
* image classification
* object detection
* image segmentation

To use convnets we first need to pre-process images into the dimensions each architecture accepts as input. Generally we represent each image as a grid of pixels with RGB values. For example, an image of a dog could be represented as an x by x matrix (2D), and regions of that matrix will contain the eyes, ears, paws etc. A convnet requires inputs of shape (image height, image width, number of channels).

The network then breaks down the image into its constituent parts with increasing levels of specificity, i.e. the first layers of the convnet will most resemble the original image and the later layers will least. In the early layers we find outlines of whole animals/structures; in later layers we find increasingly specific 'features' like the outline of eyebrows, nostrils etc.
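To make the (height, width, channels) input shape concrete, here is a minimal NumPy sketch. The 4 x 4 images are made up for illustration and are not part of any dataset used in this notebook.

```python
import numpy as np

# Hypothetical 4x4 grayscale image: a single channel of values 0-255.
gray = np.arange(16, dtype="uint8").reshape(4, 4, 1)

# Hypothetical 4x4 RGB image: three channels per pixel.
rgb = np.zeros((4, 4, 3), dtype="uint8")
rgb[..., 0] = 255  # fill the red channel only

# A batch of images adds a leading axis: (batch, height, width, channels).
batch = np.stack([rgb, rgb])

print(gray.shape)   # (4, 4, 1)
print(rgb.shape)    # (4, 4, 3)
print(batch.shape)  # (2, 4, 4, 3)
```

The leading batch axis is why Keras reports shapes such as (None, 28, 28, 1): None stands for a batch size that is not fixed in advance.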

One of the great strengths of a convnet is that the network learns these features in a context-independent manner, i.e. it can recognise features in similar-but-distinct images. If we think about the task of computer vision, this is very important: we need the network to recognise the features whether or not some small thing in the image changes.

As we go we will get into pre-processing, convolutional layers, pooling layers, strides, data augmentation, regularisation, residual connections to avoid vanishing gradients in deep networks, depthwise separable convolutions and transfer learning. We begin by following Chollet's introduction to convnets.

A basic convnet is a stack of Conv2D and MaxPooling2D layers.


In [4]:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape = (28, 28, 1))
x = layers.Conv2D(filters = 32, kernel_size = 3, activation = "relu")(inputs)
x = layers.MaxPooling2D(pool_size = 2)(x)
x = layers.Conv2D(filters = 64, kernel_size = 3, activation = "relu")(x)
x = layers.MaxPooling2D(pool_size = 2)(x)
x = layers.Conv2D(filters = 128, kernel_size = 3, activation = "relu")(x)
x = layers.Flatten()(x)

outputs = layers.Dense(10,  activation = "softmax")(x)

model = keras.Model(inputs = inputs, outputs = outputs)

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 13, 13, 32)       0         
 2D)                                                             
                                                                 
 conv2d_4 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_5 (Conv2D)           (None, 3, 3, 128)         73856 

Let's think about what happened:
We defined the dimensions of our inputs: image_height = 28, image_width = 28, channels = 1. Channels = 1 means that we are looking at a grayscale image, i.e. the image has 1 channel holding levels of gray. Channels = 3 would refer to RGB: 3 separate channels, 1 for each colour, or depth 3.
Then we feed our inputs into a Conv2D layer with 32 filters, a kernel size of 3 and a relu activation. kernel_size = 3 means that we will be looking at 3x3 blocks of our image. The relu activation we should be comfortable with by now, but recall that it returns 0 for any input <= 0 and the input itself for any input > 0, i.e. relu(input) = max(input, 0).
The output is then fed into a MaxPooling2D layer with pool_size = 2: we slide a 2x2 window across the feature map, by default with stride 2 (so the windows don't overlap), and take the max of the 4 values in each window. Think of this as downsampling each spatial dimension by a factor of 2 (if stride = 2).
Then we feed the pooled outputs into a new Conv2D layer, which has 2x as many filters as our previous layer, and repeat until the flatten and classification layers.
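As a rough illustration of the two operations just described, here is a NumPy sketch of relu followed by 2x2 max pooling with stride 2. The helper names and the 4 x 4 feature map are made up for this example; this is not how Keras implements the layers internally.

```python
import numpy as np

def relu(x):
    # Negative values become 0, positive values pass through unchanged.
    return np.maximum(x, 0)

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width) array."""
    h, w = feature_map.shape
    # Drop a trailing odd row/column, as pooling without padding does.
    h, w = h - h % 2, w - w % 2
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    # Take the max over the 2 rows and 2 columns of each block.
    return blocks.max(axis=(1, 3))

# Hypothetical 4x4 feature map.
feature_map = np.array([[ 1, -2,  3,  0],
                        [-1,  5, -3,  2],
                        [ 0,  0,  4, -4],
                        [ 7, -8,  1,  1]])

activated = relu(feature_map)
pooled = max_pool_2x2(activated)
print(pooled)
# [[5 3]
#  [7 4]]
```

Dropping the odd trailing row and column is also why the 11 x 11 feature map in the summary above becomes 5 x 5 after pooling.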

When we take a look at the model summary we notice the change in output shape at each layer and the increasing number of parameters, until we flatten the outputs. We flatten because our classification head accepts a vector as input. We use a softmax activation to output a probability distribution over the 10 classes.
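The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes what the model above uses: 3x3 kernels, no padding, stride 1 for the convolutions and 2x2 pooling. The helper names are made up for illustration.

```python
def conv_output_size(size, kernel_size):
    # Without padding, a kernel fits in (size - kernel_size + 1) positions.
    return size - kernel_size + 1

def conv_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * kernel_size * in_channels weights plus 1 bias.
    return (kernel_size * kernel_size * in_channels + 1) * filters

size = 28
size = conv_output_size(size, 3)  # 26
size = size // 2                  # 13 after pooling
size = conv_output_size(size, 3)  # 11
size = size // 2                  # 5 after pooling
size = conv_output_size(size, 3)  # 3
flattened = size * size * 128     # length of the vector fed to the Dense layer

print(conv_params(3, 1, 32))    # 320
print(conv_params(3, 32, 64))   # 18496
print(conv_params(3, 64, 128))  # 73856
print(flattened)                # 1152
```

The three parameter counts match the Conv2D rows of the summary, and 3 x 3 x 128 = 1152 is the length of the flattened vector.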

Now we need to get some data and compile our model with an optimizer, a loss function and some metrics before we fit our model to our data.

In [7]:
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape(60000, 28, 28, 1)
train_images = train_images.astype("float32") / 255

test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images.astype("float32") / 255

model.compile(optimizer = "rmsprop",
             loss = "sparse_categorical_crossentropy",
             metrics = ["accuracy"])

model.fit(train_images, train_labels,  epochs = 5, batch_size = 64)

test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: 0.991


We mentioned before that convnets learn representations that are context-independent; another way of saying this is that their learned representations are translation-invariant. We also said that their representations become increasingly specific the deeper the network becomes. We can think of the layers as hierarchical, or more specifically spatially hierarchical, i.e. the general representations of the first layer are broken down in the second layer into more specific representations, which are then broken down in the third layer into still more specific representations.

The representations of convnets are found in the 'filters': filters is a parameter we pass to each convolutional layer to determine the number of representations (output channels) the layer should output. See page ~205+ for a deeper & visual explanation of convnet architecture, including information on kernels, feature maps and padding/border effects.

We use padding to counter border effects and increase the number of x by x windows the network can create features from. The padding argument takes one of 2 values, "valid" and "same": "valid" adds no padding, while "same" pads such that the output has the same height and width as the input. See 208 for a visual explanation of why padding is important, but imagine a 5 x 5 feature map: there are only 9 tiles which a 3 x 3 window can be centered around.
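The 5 x 5 example can be worked through with a small helper. This is a sketch assuming stride 1 and an odd kernel size; window_positions is a made-up name, not a Keras function.

```python
def window_positions(size, kernel_size, padding="valid"):
    """Number of window positions along one axis, assuming stride 1."""
    if padding == "valid":
        # No padding: the window must fit entirely inside the input.
        return size - kernel_size + 1
    if padding == "same":
        # Enough padding is added so the output is as large as the input.
        return size
    raise ValueError(f"unknown padding: {padding!r}")

valid = window_positions(5, 3, "valid")
same = window_positions(5, 3, "same")

print(valid * valid)  # 9 window positions on a 5 x 5 map
print(same * same)    # 25 window positions once the map is padded
```

With "same" padding and a 3 x 3 window, one row or column of padding is added on each side, so every one of the 25 tiles can sit at the centre of a window.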

In [None]:
# Downloading the dogs vs cats data set in Colab (doesn't come with Keras)
!kaggle competitions download -c dogs-vs-cats
# Need to authenticate with Kaggle for this, so follow page 212 to set that up.

Once the dataset is downloaded etc. we take a sample of the 25k original images: 2 x 1k for train, 2 x 500 for validation and 2 x 1k for test (2 x meaning one set for cats and one for dogs). This makes the problem more realistic, as it's hard to get so much data in reality.

In [None]:
import os, shutil, pathlib

original_dir = pathlib.Path("dir_to_data")
new_base_dir = pathlib.Path("cats_vs_dogs_small")

def make_subset(subset_name, start_index, end_index):
    # Copy images start_index..end_index-1 of each category into
    # new_base_dir/<subset_name>/<category>/
    for category in ("cat", "dog"):
        subset_dir = new_base_dir / subset_name / category
        os.makedirs(subset_dir)
        fnames = [f"{category}.{i}.jpg" for i in range(start_index, end_index)]
        for fname in fnames:
            shutil.copyfile(src=original_dir / fname,
                            dst=subset_dir / fname)
            
make_subset("train", start_index = 0, end_index = 1000)
make_subset("validation", start_index = 1000, end_index = 1500)
make_subset('test', start_index = 1500, end_index = 2500)

Now we make a model similarly to how we did before; however, this time the model also rescales the images from the 0-255 range to 0-1.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape = (180, 180, 3))
x = layers.Rescaling(1./255)(inputs)
x = layers.Conv2D(filters = 32, kernel_size = 3, activation = "relu")(x)
x = layers.MaxPooling2D(pool_size = 2)(x)
x = layers.Conv2D(filters = 64, kernel_size = 3, activation = "relu")(x)
x = layers.MaxPooling2D(pool_size = 2)(x)
x = layers.Conv2D(filters = 128, kernel_size = 3, activation = "relu")(x)
x = layers.MaxPooling2D(pool_size = 2)(x)
x = layers.Conv2D(filters = 256, kernel_size = 3, activation = "relu")(x)
x = layers.MaxPooling2D(pool_size = 2)(x)
x = layers.Conv2D(filters = 256, kernel_size = 3, activation = "relu")(x)
x = layers.Flatten()(x)

outputs = layers.Dense(1, activation = "sigmoid")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

model.summary()

In [None]:
model.compile(loss = "binary_crossentropy",
             optimizer = "rmsprop",
             metrics = ["accuracy"])

Now our data needs to be pre-processed from the directories we created. So we must -
* Read the files
* Decode the JPEG content to RGB grids of pixels
* Convert to floating-point tensors
* Resize images to 180 x 180
* Batch them up - in this case 32 images per batch

We'll use the keras utility functions to carry out these steps:

In [None]:
from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(new_base_dir / "train",
                                             image_size = (180, 180),
                                             batch_size = 32)

validation_dataset = image_dataset_from_directory(new_base_dir / "validation",
                                                  image_size = (180, 180),
                                                  batch_size = 32)

test_dataset = image_dataset_from_directory(new_base_dir / "test",
                                            image_size = (180, 180),
                                            batch_size = 32)

# Can test the output of our datasets (TensorFlow Dataset objects - great for streaming data and dealing with big data)
for data_batch, labels_batch in train_dataset:
    print("data_batch_shape: ", data_batch.shape)
    print("labels_shape: ", labels_batch.shape)
    break
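To get a feel for what batching does to the shapes, here is a NumPy-only sketch that yields stand-in batches for the 2,000 training images. The iter_batches helper is hypothetical and only mimics the shapes; it reads no files and is not the Keras utility.

```python
import numpy as np

def iter_batches(num_images, batch_size, image_size=(180, 180), channels=3):
    """Yield (data, labels) batches of zeros with the expected shapes."""
    for start in range(0, num_images, batch_size):
        n = min(batch_size, num_images - start)  # the last batch may be smaller
        data = np.zeros((n, *image_size, channels), dtype="float32")
        labels = np.zeros((n,), dtype="int32")
        yield data, labels

# 2 x 1k training images, batched 32 at a time.
batch_shapes = [(d.shape, l.shape) for d, l in iter_batches(2000, 32)]

print(len(batch_shapes))  # 63
print(batch_shapes[0])    # ((32, 180, 180, 3), (32,))
print(batch_shapes[-1])   # ((16, 180, 180, 3), (16,))
```

Since 2,000 is not a multiple of 32, the final batch holds only the 16 images left over.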

Finally, we can fit the model to the data and monitor how well it converges:

In [None]:
callbacks = [
    # Keep only the model with the lowest validation loss seen so far
    keras.callbacks.ModelCheckpoint(
        filepath = "convnet_from_scratch.keras",
        save_best_only = True,
        monitor = "val_loss")
]

history = model.fit(
    train_dataset,
    epochs = 30,
    validation_data = validation_dataset,
    callbacks = callbacks)
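The effect of save_best_only with monitor = "val_loss" can be sketched in plain Python: the file is only overwritten when the validation loss improves on the best value seen so far. The loss values below are invented for illustration and are not results from this model.

```python
def best_checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint is written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # only save when validation loss improves
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses, one per epoch.
val_losses = [0.69, 0.65, 0.60, 0.62, 0.58, 0.61, 0.66]
saved = best_checkpoint_epochs(val_losses)

print(saved)      # [1, 2, 3, 5]
print(saved[-1])  # 5
```

In this made-up run the saved file ends up holding the model from epoch 5, even though training continues for two more epochs with worse validation loss.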