------------------------------------------------------------------------------------------------------------------------
### <b>Table of Contents</b>

0. Background

1. Import functions

2. Download ZIP file from Google Drive and unzip it into the local drive

3. Load image files

4. Define a CNN (Convolutional Neural Network)

    4-1. Initialize a Sequential model from Keras and add layers to it

    4-2. Compile the model with the f1 score as the evaluation metric

5. Train the CNN model and evaluate model performance

6. Save model for later use

7. Conclusion
------------------------------------------------------------------------------------------------------------------------

### <b>0. Background</b>

This project is for an artificial intelligence and computer vision company. One of its solutions is MonReader, which detects page flips in order to digitize documents through a mobile app.

We are given thousands of high-resolution video frames in sequential order with labels 'flip' and 'notflip', and the goal of the project is to train a model to predict if a given sequence of images is being flipped or not.

More details can be found in <a href="https://github.com/henryhyunwookim/MonReader#readme">README</a>.

### <b>1. Import functions</b>

In [1]:
from utils.load import download_from_gdrive, extract_zip_file, load_images, get_image_shape
from utils.stats_functions import f1_score
from utils.evaluate import return_result_table

import os
import sys
import json
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

from keras.models import Sequential, load_model
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import image_dataset_from_directory, custom_object_scope

### <b>2. Download ZIP file from Google Drive and unzip it into the local drive</b>

Download the source file from Google Drive if it does not already exist in the download path.

In [3]:
# Define source details.
with open('./data/source_details.json') as f:
  source_details = json.load(f)
file_id = source_details['file_id']
file_name = source_details['file_name']

# Define file path.
root_dir = sys.path[0]
download_dir = Path(root_dir) / 'data'
file_path = download_dir / file_name

# Download the file if it's not found in the file path.
download_from_gdrive(file_id, file_name, download_dir, root_dir)

File images.zip already exists in d:\OneDrive\GitHub\Apziva\MonReader\data.


Extract the zip file.

In [6]:
extract_zip_file(file_path, download_dir, file_name)

images.zip extracted in d:\OneDrive\GitHub\Apziva\MonReader\data.
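
The `extract_zip_file` helper comes from this project's `utils.load` module and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using only the standard library. The function name `extract_if_needed` and its skip-if-present behavior are assumptions made for illustration, not the project's actual code.

```python
import zipfile
from pathlib import Path


def extract_if_needed(zip_path, target_dir):
    """Extract zip_path into target_dir unless every member already exists.

    Hypothetical helper for illustration; not the project's extract_zip_file.
    Returns True if extraction ran, False if it was skipped.
    """
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()
        # Skip the work when all archive members are already on disk.
        if all((target_dir / name).exists() for name in members):
            return False
        archive.extractall(target_dir)
    return True
```

Called as `extract_if_needed(file_path, download_dir)`, a second run on an already extracted archive would return `False` without touching the files.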


### <b>3. Load image files</b>

Load the images as they are, without transformations such as converting them to arrays, to save time and memory.

In [8]:
array_dict = load_images(str(download_dir), as_array=False)

Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images
Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images\testing
Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images\testing\flip


100%|██████████| 290/290 [00:07<00:00, 36.94it/s]


Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images\testing\notflip


100%|██████████| 307/307 [00:04<00:00, 64.23it/s]


Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images\training
Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images\training\flip


100%|██████████| 1162/1162 [00:15<00:00, 77.03it/s]


Loading files in d:\OneDrive\GitHub\Apziva\MonReader\data\images\training\notflip


100%|██████████| 1230/1230 [00:14<00:00, 82.38it/s] 


Check the shape of the images.

In [9]:
image_shape = get_image_shape(array_dict)

Image shape: (1920, 1080, 3)


### <b>4. Define a CNN (Convolutional Neural Network)</b>

##### 4-1. Initialize a Sequential model from Keras and add layers to it

In [10]:
# 0. Initialize a Sequential model from Keras
model = Sequential()

# 1.  Add a convolutional layer. The first convolutional layer includes an input layer as specified by input_shape.
reduced_image_shape = (int(image_shape[0]/10), int(image_shape[1]/10), image_shape[2])
model.add(Conv2D(filters=8, kernel_size=(7, 7), activation='relu', input_shape=reduced_image_shape))

# 2. Add a max pooling layer
model.add(MaxPooling2D(pool_size=(3, 3)))

# Add another set of convolutional (with a different number of output filters) and pooling layers
model.add(Conv2D(filters=16, kernel_size=(7, 7), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3)))

# Add another set of convolutional (with a different number of output filters) and pooling layers
model.add(Conv2D(filters=64, kernel_size=(7, 7), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# 3. Add a flatten layer
model.add(Flatten())

# 4. Add a dense (i.e. fully connected) layer with 32 neurons and a ReLU activation function
model.add(Dense(units=32, activation='relu'))

# A dropout layer can be added to deal with overfitting.
# The line below would randomly drop 50% of the previous layer's outputs during training, which helps to reduce overfitting.
# To use it, Dropout must also be imported from keras.layers.
# model.add(Dropout(0.5))

# 5. Add an output layer, which is another dense layer with 1 neuron and a sigmoid activation function
model.add(Dense(units=1, activation='sigmoid'))

# Print out the summary of the model
# (model.summary() prints directly and returns None, so it is not wrapped in print())
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 186, 102, 8)       1184      
                                                                 
 max_pooling2d (MaxPooling2D  (None, 62, 34, 8)        0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 56, 28, 16)        6288      
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 18, 9, 16)        0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 12, 3, 64)         50240     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 6, 1, 64)         0

Here's an explanation of the architecture of the network. Simply put, it is a CNN with multiple convolutional and max pooling layers, followed by a flatten layer, a fully connected layer and a binary classification output layer. This architecture is commonly used for image classification tasks.

<b>0. Sequential model</b>

A Sequential model allows us to build a linear stack of layers where each layer has exactly one input tensor and one output tensor.

<b>1-1. Input layer</b>

An input layer accepts the image data, typically a 2D array for grayscale images or a 3D array for color images. Our images are 1920 x 1080 RGB pictures, so the original input shape is (1920, 1080, 3). To reduce training time and memory use, we shrink the first two dimensions to one-tenth of their original size, giving an input shape of (192, 108, 3).
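
The effect of the one-tenth reduction can be checked with a few lines of arithmetic. This is only a sketch that mirrors the `reduced_image_shape` computation in the code cell above; the variable `scale` is introduced here for illustration.

```python
# Original frame shape as reported by get_image_shape above.
image_shape = (1920, 1080, 3)

# Shrink height and width by a factor of 10; keep the 3 color channels.
scale = 10
reduced_image_shape = (image_shape[0] // scale, image_shape[1] // scale, image_shape[2])

pixels_before = image_shape[0] * image_shape[1]
pixels_after = reduced_image_shape[0] * reduced_image_shape[1]

print(reduced_image_shape)             # (192, 108, 3)
print(pixels_before // pixels_after)   # 100
```

Each image fed to the network therefore carries one hundredth of the original pixel count.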

<b>1-2. Convolutional layer</b>

Convolutional layers perform feature extraction by applying a set of filters to the input image. Each filter detects a specific feature, such as edges, corners, blobs, etc. The output of each filter is a feature map, which highlights the presence of that feature in different parts of the input image.

In our CNN, Conv2D from Keras is used, which stands for 2-dimensional convolution.

<b>Output filters</b>

The first parameter of Conv2D (i.e., filters) is the dimensionality of the output space, that is, the number of output filters in the convolution. In the code, the three Conv2D layers have 8, 16, and 64 filters, respectively. These filters are applied to the input image to extract features that are relevant to the classification task. Increasing the number of filters can help the model learn more complex and abstract features, but that will increase the number of parameters in the model, making training slower and more computationally intensive.
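
To make the link between filters and parameters concrete, the sketch below counts the trainable parameters of a standard 2D convolution: one weight per kernel cell and input channel, plus one bias, for each filter. The helper name `conv2d_param_count` is made up for this example.

```python
def conv2d_param_count(kernel_height, kernel_width, in_channels, filters):
    """Weights plus one bias per filter for a standard 2D convolution."""
    return (kernel_height * kernel_width * in_channels + 1) * filters


# (input channels, filters) for the three Conv2D layers; all use 7x7 kernels.
layers = [(3, 8), (8, 16), (16, 64)]
counts = [conv2d_param_count(7, 7, cin, f) for cin, f in layers]
print(counts)  # [1184, 6288, 50240]
```

These values match the `Param #` column of the model summary above.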

It is common to use a different number of filters in different convolutional layers of a CNN, usually increasing with depth. Earlier layers often use a small number of filters, such as 32 or 64, to extract simple and general features from the input images, while later layers often use a larger number, such as 128 or 256, to extract more complex and specific features. Our network follows the same increasing pattern at a smaller scale (8, 16, and 64 filters).

Using different numbers of filters in different convolutional layers can help the model learn more efficiently and effectively. It allows the network to identify simple and general features in the early layers, and then build on those features with more complex and specific features in the later layers. Additionally, using fewer filters in the early layers can help to reduce the number of parameters in the network, which can help to prevent overfitting.

<b>Kernel size</b>

The second parameter (i.e., kernel_size) specifies the height and width of the 2D convolution window. For binary image classification problems, typical kernel sizes for the first convolutional layer range from 3x3 to 7x7. Larger kernel sizes may be used for input images with larger spatial dimensions. Smaller kernels capture fine-grained details in the input image, while larger kernels capture more global features. In our CNN, every convolutional layer uses a 7x7 kernel.
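
The kernel size also determines how much the feature map shrinks. The sketch below assumes no padding and a stride of 1, which are the Keras `Conv2D` defaults used in the code above; `conv_output_size` is a name chosen for this example.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output length along one axis for a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


height, width = 192, 108
for k in (3, 5, 7):
    print(k, (conv_output_size(height, k), conv_output_size(width, k)))
# 3 (190, 106)
# 5 (188, 104)
# 7 (186, 102)
```

The 7x7 case reproduces the (186, 102) output of the first convolutional layer in the model summary.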

<b>Activation function</b>

The activation parameter specifies the non-linear function applied to the output of a layer. This non-linearity allows the model to learn more complex features from the input data. A layer first applies a linear transformation to its input using its weights and biases; the result is then passed through the activation function to produce the layer's output.

ReLU (Rectified Linear Unit) is a popular choice for most applications due to its simplicity and effectiveness in reducing the vanishing gradient problem, and sigmoid can be used for binary classification problems. Both activation functions are available in Keras and are used in our code.
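
For reference, both functions are short enough to write out by hand. This is a plain-Python sketch of the mathematical definitions, not the Keras implementation.

```python
import math


def relu(x):
    """Pass positive values through unchanged; clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


print([relu(v) for v in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
print(sigmoid(0.0))                          # 0.5
```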

<b>2. Pooling layer</b>

Pooling layers downsample the feature maps produced by the convolutional layers, commonly by taking the maximum or average value (i.e., max pooling or average pooling) within small regions of the feature maps. This helps reduce the dimensionality of the feature maps (i.e., the height and width dimensions while preserving the depth dimension) and makes the network more computationally efficient.

Max pooling takes the maximum value of each non-overlapping rectangular sub-region in the input volume and uses that as the output value for that region. This operation is called "max" pooling because it retains the largest (max) value from each region. Max pooling is useful for detecting the presence of a particular feature or pattern in an input volume, as it retains the strongest activation signal in each region.

Average pooling takes the average value of each non-overlapping rectangular sub-region in the input volume and uses that as the output value for that region. This operation is called "average" pooling because it takes the average value from each region. Average pooling is useful for reducing the spatial dimensions of an input volume while preserving the overall structure of the input, as it retains a more generalized representation of the input volume.

In general, max pooling is more commonly used in CNNs because it has been found to work better in practice, especially for tasks like object recognition. However, average pooling can also be useful in some cases, such as for tasks like semantic segmentation where spatial resolution is important.
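
The difference between the two pooling operations is easy to see on a tiny feature map. The sketch below uses plain Python lists and a hypothetical helper, `pool2d`, which applies a reducing function to each non-overlapping window.

```python
def pool2d(grid, window, reduce_fn):
    """Apply reduce_fn over non-overlapping window x window blocks of a 2D list."""
    rows, cols = len(grid) // window, len(grid[0]) // window
    return [
        [
            reduce_fn([
                grid[r * window + i][c * window + j]
                for i in range(window)
                for j in range(window)
            ])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [5, 4, 1, 1],
    [0, 2, 9, 6],
    [1, 1, 7, 8],
]
max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, lambda values: sum(values) / len(values))
print(max_pooled)  # [[5, 2], [2, 9]]
print(avg_pooled)  # [[3.25, 1.0], [1.0, 7.5]]
```

Max pooling keeps only the strongest activation in each block, while average pooling blends all four values.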

In our CNN, every pooling layer is a max pooling layer, with the window set by the pool_size parameter: the first two layers use a 3x3 window and the third uses a 2x2 window. Each layer therefore outputs the maximum value within each 3x3 (or 2x2) window.
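
Combining the convolution and pooling arithmetic gives the spatial size after every layer. The sketch assumes the Keras defaults used above: no padding, convolution stride 1, and a pooling stride equal to the pooling window. The helper names are chosen for this example.

```python
def conv_valid(size, kernel):
    """Output length of an unpadded, stride-1 convolution along one axis."""
    return size - kernel + 1


def max_pool(size, window):
    """Stride equals the window size; leftover rows or columns are dropped."""
    return size // window


shape = (192, 108)
trace = [shape]
# (kernel size, pool size) for the three convolution + pooling pairs.
for kernel, window in [(7, 3), (7, 3), (7, 2)]:
    shape = tuple(conv_valid(s, kernel) for s in shape)
    trace.append(shape)
    shape = tuple(max_pool(s, window) for s in shape)
    trace.append(shape)

print(trace)
# [(192, 108), (186, 102), (62, 34), (56, 28), (18, 9), (12, 3), (6, 1)]
```

The sequence agrees with the output shapes listed in the model summary, ending at a 6 x 1 feature map with 64 channels.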

<b>3. Flatten layer</b>

Flatten layers reshape the output of the previous layers into a 1D array (or one-dimensional vector), which can be fed into a fully connected layer. Without a flatten layer, the output of the final convolutional layer would be a 3D tensor with a fixed spatial structure, which cannot be directly fed into a dense (or fully connected) layer that expects a 1D tensor.
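
As an illustration, flattening simply reads out every value of a nested array in order. The recursive `flatten` helper below is a plain-Python sketch, not the Keras layer itself.

```python
def flatten(tensor):
    """Recursively unroll a nested list into a flat list of values."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


small = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # shape (2, 2, 2)
print(flatten(small))  # [1, 2, 3, 4, 5, 6, 7, 8]

# In our model the last pooling layer outputs (6, 1, 64) per image,
# so the flattened vector has this many values:
flattened_length = 6 * 1 * 64
print(flattened_length)  # 384
```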

<b>4. Dense (or fully connected) layer</b>

Dense layers combine the features extracted by the convolutional layers into a form suited to the final prediction. Fully connected means that every neuron in the previous layer is connected to every neuron in the current layer. In our CNN, this layer has 32 neurons with a ReLU activation; the probability score itself is produced by the output layer described next.
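
A fully connected layer is a weighted sum per neuron followed by an activation. The sketch below shows a forward pass in plain Python with made-up weights, and counts the parameters such a layer would have for a 384-value input (the flattened 6 x 1 x 64 output of the last pooling layer). The function names are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def dense(inputs, weights, biases, activation):
    """Forward pass of a fully connected layer.

    weights[j] lists the incoming weights of neuron j, one per input value.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


def dense_param_count(n_inputs, n_units):
    """One weight per input plus one bias, for each neuron."""
    return (n_inputs + 1) * n_units


# Toy example: 3 inputs, 2 neurons, illustrative weights.
toy_output = dense([1.0, 2.0, 3.0], [[0.5, -1.0, 0.0], [1.0, 1.0, 1.0]], [0.0, -1.0], relu)
print(toy_output)                          # [0.0, 5.0]
print(dense_param_count(6 * 1 * 64, 32))   # 12320
```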

<b>5. Output layer</b>

This layer produces the final binary classification output from the features computed by the previous layers. In our CNN, it is another dense layer with a single neuron and a sigmoid activation function. The sigmoid function squashes the output to a value between 0 and 1, which can be interpreted as the probability that the input image belongs to the positive class, that is, the class encoded as label 1. When labels are inferred from directory names, Keras assigns class indices in alphanumeric order, so in our case 'flip' is encoded as 0 and 'notflip' as 1.
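
Turning the sigmoid output into a class name requires a decision threshold. The sketch below assumes a threshold of 0.5 and the alphanumeric class order that Keras uses for inferred labels ('flip' as 0, 'notflip' as 1). Both are assumptions worth checking against the dataset's `class_names` attribute before relying on them.

```python
# Assumed class order: Keras sorts inferred class names alphanumerically,
# so 'flip' -> 0 and 'notflip' -> 1. Verify against the dataset's class_names.
CLASS_NAMES = ['flip', 'notflip']


def probability_to_label(probability, threshold=0.5):
    """Map a sigmoid output to a class name using a decision threshold."""
    return CLASS_NAMES[1 if probability >= threshold else 0]


print(probability_to_label(0.93))  # notflip
print(probability_to_label(0.12))  # flip
```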

##### 4-2. Compile the model with the f1 score as the evaluation metric

In [11]:
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=[f1_score]
    )

Binary cross-entropy is the most commonly used loss function for binary image classification tasks, where the output of the model is a probability distribution over two classes (i.e., flip or notflip in our case). Binary cross-entropy measures the difference between the predicted and true labels for each binary classification instance.
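
For intuition, the loss can be computed by hand on a couple of predictions. This is a plain-Python sketch of the standard formula, averaged over samples; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
print(round(binary_cross_entropy([1, 0], [0.5, 0.5]), 4))  # 0.6931
```

A model that outputs 0.5 for every image scores about 0.693, so a loss near that value indicates predictions no better than chance.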

Adam is a popular optimizer that is often used for binary classification problems. It is an adaptive learning rate optimization algorithm that is well-suited for large datasets and high-dimensional parameter spaces.

For evaluating model performance during training and testing, we use the f1 score since it's the success metric of the project.
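
The `f1_score` function passed to `compile` comes from this project's `utils.stats_functions` module and is not shown in this notebook. As a reminder of what the metric measures, here is a plain-Python sketch computed from hard 0/1 labels; the name `f1_from_labels` is hypothetical, and the project's Keras-compatible version may differ in details such as thresholding or batching.

```python
def f1_from_labels(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(f1_from_labels(y_true, y_pred))  # 0.75
```

In this example there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75.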

### <b>5. Train the CNN model and evaluate model performance</b>

In [12]:
train_data_dir = './data/images/training'
test_data_dir = './data/images/testing'
batch_size = 64
epochs = 10

train_data, validate_data = image_dataset_from_directory(
    directory=train_data_dir,
    labels='inferred',
    label_mode='binary',
    color_mode='rgb',
    batch_size=batch_size,
    image_size=reduced_image_shape[:2], # (height, width) to resize images to after they are read from disk
    shuffle=True,
    seed=1,

    # If True, resize the images without aspect ratio distortion.
    # When the original aspect ratio differs from the target aspect ratio,
    # the output image will be cropped so as to return the largest possible window 
    # in the image (of size `image_size`) that matches the target aspect ratio.
    # By default (i.e., 'crop_to_aspect_ratio=False'), aspect ratio may not be preserved.
    crop_to_aspect_ratio=True,
    
    validation_split=0.2, # 20% of the data will be reserved for validation
    subset='both', # Subset of the data to return. 'both' returns a tuple of the training and validation datasets.
)

test_data = image_dataset_from_directory(
    directory=test_data_dir,
    labels='inferred',
    label_mode='binary',
    color_mode='rgb',
    batch_size=batch_size,
    image_size=reduced_image_shape[:2],
    shuffle=True,
    seed=1,
    crop_to_aspect_ratio=True
)

results = model.fit(

    # Since we pass a dataset (i.e., train_data) to 'x',
    # 'y' should not be specified - targets will be obtained from 'x'.
    x=train_data,

    epochs=epochs,
    verbose=2, # This will output one line per epoch
    validation_data=validate_data
    
    # This argument is not supported when 'x' is a dataset, generator or 'keras.utils.Sequence' instance
    # validation_split=0.2

    # This is not required when the data is a dataset, since it already yields batches
    # batch_size=batch_size
)

Found 2392 files belonging to 2 classes.
Using 1914 files for training.
Using 478 files for validation.
Found 597 files belonging to 2 classes.
Epoch 1/10
30/30 - 66s - loss: 4.1552 - f1_score: 0.5085 - val_loss: 0.6662 - val_f1_score: 0.5768 - 66s/epoch - 2s/step
Epoch 2/10
30/30 - 104s - loss: 0.6409 - f1_score: 0.6208 - val_loss: 0.6251 - val_f1_score: 0.5826 - 104s/epoch - 3s/step
Epoch 3/10
30/30 - 92s - loss: 0.5423 - f1_score: 0.7400 - val_loss: 0.5190 - val_f1_score: 0.6680 - 92s/epoch - 3s/step
Epoch 4/10
30/30 - 39s - loss: 0.4118 - f1_score: 0.8152 - val_loss: 0.4141 - val_f1_score: 0.7763 - 39s/epoch - 1s/step
Epoch 5/10
30/30 - 55s - loss: 0.2705 - f1_score: 0.8963 - val_loss: 0.3306 - val_f1_score: 0.8391 - 55s/epoch - 2s/step
Epoch 6/10
30/30 - 61s - loss: 0.1927 - f1_score: 0.9315 - val_loss: 0.2981 - val_f1_score: 0.8410 - 61s/epoch - 2s/step
Epoch 7/10
30/30 - 96s - loss: 0.1464 - f1_score: 0.9502 - val_loss: 0.2005 - val_f1_score: 0.9282 - 96s/epoch - 3s/step
Epoch 8

The batch size in a Convolutional Neural Network (CNN) refers to the number of images that are processed in a single forward/backward pass. The choice of batch size can impact the performance of the model, as well as the training time and memory requirements. In general, batch sizes between 32 and 128 are commonly used for CNN models for image classification.

Here are some important features of smaller and larger batch sizes.
- Smaller batch size: uses less memory; suitable for small datasets and for models with many parameters or many layers, where the noisier updates can help limit overfitting
- Larger batch size: trains faster per epoch; suitable for large datasets or relatively simple models that are less prone to overfitting

During training (through the .fit method), the model iterates over the training data in batches, computes the gradients, and updates the model parameters to minimize the loss. The validation data is evaluated at the end of each epoch to measure performance on data the model was not trained on, which helps to detect overfitting.
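
The number of batches per epoch follows directly from the dataset size and the batch size. The sketch below uses the file counts printed in the training log above.

```python
import math

total_files = 2392   # "Found 2392 files belonging to 2 classes."
train_files = 1914   # "Using 1914 files for training."
val_files = 478      # "Using 478 files for validation."
batch_size = 64

steps_per_epoch = math.ceil(train_files / batch_size)
validation_steps = math.ceil(val_files / batch_size)
last_batch_size = train_files - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 30
print(validation_steps)  # 8
print(last_batch_size)   # 58
```

The 30 steps match the `30/30` shown for each epoch in the log; the final batch of each epoch is smaller than the others because 1914 is not a multiple of 64.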

Once the training is complete, we can use the .evaluate method to compute the final loss and f1 score on the test set.

In [13]:
test_loss, test_f1 = model.evaluate(test_data)



Here's a summary table of model performance, in terms of the loss and f1 score on different sets of data.

In [14]:
train_loss = results.history['loss']
train_f1 = results.history['f1_score']
val_loss = results.history['val_loss']
val_f1 = results.history['val_f1_score']
return_result_table(results = [train_loss, val_loss, test_loss, train_f1, val_f1, test_f1])

| | Train loss | Validate loss | Test loss | Train f1 | Validate f1 | Test f1 |
|---|---|---|---|---|---|---|
| Mean | 0.6568 | 0.343 | 0.075 | 0.841 | 0.808 | 0.9804 |
| Std | 1.1821 | 0.1962 | 0.0 | 0.1591 | 0.144 | 0.0 |
| Max | 4.1552 | 0.6662 | 0.075 | 0.9906 | 0.9663 | 0.9804 |
| Min | 0.0465 | 0.1106 | 0.075 | 0.5085 | 0.5768 | 0.9804 |


### <b>6. Save model for later use</b>

Load the previous model if it exists in the local drive, and evaluate its performance on the test data.

In [15]:
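if os.path.exists('mon_reader_model.h5'):
    with custom_object_scope({'f1_score': f1_score}):
        loaded_model = load_model('mon_reader_model.h5')

    # Evaluate only when a previous model was actually loaded;
    # otherwise 'loaded_model' would be undefined here.
    loaded_model_loss, loaded_model_test_f1 = loaded_model.evaluate(test_data)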



Save the model if no model exists in the local drive.

If the previous model exists, compare the performance between the previous and current models. Save the current model only if it performed better than the previous model.

In [16]:
if not os.path.exists('mon_reader_model.h5'):
    print('No model exists in local drive.')
    model.save('mon_reader_model.h5')
    print('Current model saved.')

elif test_f1 > loaded_model_test_f1:
    print('The newly trained model performed better than the previous model.')
    print(f'Previous f1 score: {round(loaded_model_test_f1, 4)}, New f1 score: {round(test_f1, 4)}')
    model.save('mon_reader_model.h5')
    print('Current model saved.')

else:
    print('The newly trained model didn\'t perform better than the previous model.')
    print(f'Previous f1 score: {round(loaded_model_test_f1, 4)}, New f1 score: {round(test_f1, 4)}')
    print('Not saving the current model.')

The newly trained model didn't perform better than the previous model.
Previous f1 score: 0.9832, New f1 score: 0.9804
Not saving the current model.
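
The decision rule in the cell above can be isolated into a small function, which makes it easy to check on its own. This is a sketch only; `should_save` is a hypothetical name, and `None` is used here to represent the case where no previous model exists.

```python
def should_save(new_score, previous_score=None):
    """Save when there is no previous model or the new score is strictly higher."""
    if previous_score is None:
        return True
    return new_score > previous_score


print(should_save(0.9804, 0.9832))  # False
print(should_save(0.9804))          # True
```

With the scores from the run above (0.9804 against a previous 0.9832), the rule declines to overwrite the saved model. Ties also keep the existing file because the comparison is strict.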


### <b>7. Conclusion</b>

By experimenting with different input shapes, numbers of filters, and kernel and pool sizes, we were able to construct an efficient and well-performing CNN model using more than two thousand high-resolution images (1920 x 1080 x 3). Fitting the model on the training data took less than 10 minutes, yet the performance was promising. The best model returned an f1 score of 0.9839, which is close to the perfect score of 1.0. The model is saved for later use, e.g., for similar image classification tasks.