# Machine Learning Project 2022: Plankton

### Authors:
- Bram Fresen
- Bram Huis
- Max Burger
- Moos Middelkoop

For the Machine Learning Project that concludes the minor Artificial Intelligence, we chose to tackle the plankton problem, originally published as the United States National Data Science Bowl in December 2014. The goal is to classify microscopic images of particles in water as one of 121 different classes of plankton. The dataset contains 30,000 images of varying sizes, and the classes are imbalanced.

To solve this problem, we use a convolutional neural network built with the TensorFlow library.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Import libraries

First, we import the required libraries and check whether we are running on a GPU.

In [None]:
import tensorflow as tf 
import numpy as np
import cv2
import os
import random
import matplotlib.pyplot as plt
import math
import seaborn as sns

from tensorflow import math as tfmath
from tensorflow.keras import layers, models, preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import class_weight

print(tf.config.list_physical_devices('GPU'))

## Loading the training data and training labels

We use the cv2 library to load the training images (which are .jpg files), and turn them into arrays. This piece of code was found online:

https://stackoverflow.com/questions/30230592/loading-all-images-using-imread-from-a-given-folder
https://drive.google.com/file/d/1hAaPzDMVEZ8X1tfRS2ieFEqi0R7Ww7uL/view

### Training data

The training data is sorted into folders by class. The next piece of code reads the images into a list and builds a list of class labels from the names of the folders. At the end, the labels are converted into a one-hot matrix so that TensorFlow can work with them.
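As a small illustration of the label step, the sketch below derives integer labels from folder names and one-hot encodes them with plain NumPy. The folder names and the `one_hot` helper are hypothetical stand-ins for illustration; the notebook itself relies on `tf.keras.utils.to_categorical`.

```python
import numpy as np

# Hypothetical class folder names (stand-ins for the real plankton classes)
folder_names = ["acantharia", "copepod", "diatom"]

# The folder each image was found in; its index in folder_names is the integer label
image_folders = ["copepod", "acantharia", "diatom", "copepod"]
labels = [folder_names.index(name) for name in image_folders]

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    matrix = np.zeros((len(labels), num_classes), dtype=np.float32)
    matrix[np.arange(len(labels)), labels] = 1.0
    return matrix

encoded = one_hot(labels, num_classes=len(folder_names))
print(encoded)
```

Each row has exactly one non-zero entry, at the column of the class index, which is the target format that a softmax output with categorical cross-entropy expects.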

We also drop the unnecessary channels at this stage: every input image has 3 channels with exactly the same values, so we keep only one of them to remove redundant data.
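The claim that the channels are identical can be checked before discarding two of them. The sketch below uses a synthetic array in place of an image returned by `cv2.imread`; the variable names are chosen for illustration only.

```python
import numpy as np

# Synthetic stand-in for a loaded image: the same grayscale values repeated over 3 channels
gray = np.arange(12, dtype=np.uint8).reshape(3, 4)
img = np.stack([gray, gray, gray], axis=2)

# Check that all three channels hold the same values before dropping two of them
channels_identical = bool(
    np.array_equal(img[:, :, 0], img[:, :, 1])
    and np.array_equal(img[:, :, 0], img[:, :, 2])
)

# Keep only the first channel, which turns (height, width, 3) into (height, width)
single_channel = img[:, :, 0]
print(channels_identical, img.shape, single_channel.shape)
```

Running a check like this over the real images would confirm that no information is lost by keeping a single channel.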

In [None]:
def read_data(folder, user="standard"):
    # Create empty lists for the not yet resized training data, the labels (not one hot encoded yet),
    # the class sizes and the class names
    train_data = []
    train_labels = []
    class_size_list = []
    categories_list = []
    offset = 0

    # Loop through the index (for the one hot matrix) and the categories
    for number, categories in enumerate(os.listdir(folder)):
        # If we come across a hidden folder (starting with ".") on mac os, we ignore it
        # before it is registered as a class name
        if categories[0] == ".":
            offset += 1
            continue

        class_size = 0
        categories_list.append(categories)
        print(number)

        # Loop through the images, read each image, add 1 to the class size and add the image to a list,
        # also add the index 'number' (corrected for hidden folders) to a list for the one hot matrix
        for image in os.listdir(f'{folder}/{categories}'):
            img = cv2.imread(os.path.join(folder, categories, image))
            # cv2.imread returns None for files it cannot read, so skip those
            if img is None:
                continue
            class_size += 1
            train_labels.append(number - offset)
            # All three channels hold the same values, so keep only the first one
            train_data.append(img[:, :, 0])

        # Append the size of the class to the class size list, in order to check the class sizes later, this way we
        # can analyze the degree of class imbalance
        class_size_list.append(class_size)

    # Create a one hot matrix from the train labels
    train_labels_one_hot = tf.keras.utils.to_categorical(train_labels, num_classes=121)

    return train_data, train_labels_one_hot, class_size_list, categories_list

train_data, train_labels, class_size_list, categories_list = read_data('data/train')


## Resize input images

Because the images all have different sizes, all input data must be resized to the same size before TensorFlow can work with it. The first cell below analyzes the image sizes, and the second cell does the actual resizing based on this analysis. An essential step is explicitly adding a third dimension of size 1 to the images; otherwise TensorFlow cannot work with the data. Lastly, the data is converted into NumPy arrays so that TensorFlow can work with them.
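The effect of adding the third dimension and converting to a NumPy array can be seen from the shapes alone. This is a minimal sketch that uses zero-filled arrays as stand-ins for images that have already been resized to 27x27.

```python
import numpy as np

# Synthetic stand-ins for three images that were already resized to 27x27
resized = [np.zeros((27, 27), dtype=np.uint8) for _ in range(3)]

# Add an explicit channel axis so each image has shape (height, width, 1)
with_channel = [np.expand_dims(img, axis=2) for img in resized]

# Stacking equally sized images gives one 4-D array: (samples, height, width, channels)
batch = np.array(with_channel)
print(with_channel[0].shape, batch.shape)
```

The resulting 4-D layout (samples, height, width, channels) is what a `Conv2D` layer with `input_shape = (27, 27, 1)` expects.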

In [None]:
# Start with an infinitely large number
value_1 = math.inf
value_2 = math.inf
sum_1 = 0
sum_2 = 0
count = 0

# This checks for the smallest image size in the first and second dimension
for image in train_data:
    count += 1
    sum_1 += image.shape[0]
    sum_2 += image.shape[1]
    if image.shape[0] < value_1:
        value_1 = image.shape[0]
        hold_1 = image.shape
    if image.shape[1] < value_2:
        value_2 = image.shape[1]
        hold_2 = image.shape

        
sum_11 = sum_1 / count
sum_22 = sum_2 / count

print(f'Average dimensions: {sum_11}, {sum_22}')

plt.imshow(train_data[0], cmap = 'gray')

print(f'Lowest first dimension image {hold_1}')
print(f'Lowest second dimension image {hold_2}')

In [None]:
# Create an empty list for the training data
train_data_resized = []

# Loop through the images in the training data and resize them to the lowest shape in the dataset, 
# add a third dimension of 1 to the images and then append the image to the 'train_data_resized' list
for image in train_data:
    img = cv2.resize(image, dsize = (27, 27), interpolation = cv2.INTER_AREA)
    img = np.expand_dims(img, axis = 2)
    train_data_resized.append(img)

# Test if the image is resized and show the image
print(train_data_resized[0].shape)
plt.imshow(train_data_resized[0], cmap = 'gray')

#Split the data into 70% training and 30% validation
im_train, im_val, lab_train, lab_val = train_test_split(train_data_resized, train_labels, train_size=0.7, random_state=1265599650)

In [None]:
# Convert the data to numpy arrays, so tensorflow can use them
image_train = np.array(im_train)
label_train = np.array(lab_train)
image_val = np.array(im_val)
label_val = np.array(lab_val)

# Test if the shapes are correct
print(image_train.shape)
print(label_train.shape)
print(image_val.shape)
print(label_val.shape)

## Convolutional network

We use the function 'train_and_evaluate' which, obviously, trains our model and then evaluates the trained model on the validation data. This function was reused from the CIFAR-assignment from module 6 of ML2.

Y_ints line: https://datascience.stackexchange.com/questions/13490/how-to-set-class-weights-for-imbalanced-classes-in-keras
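To make the class weighting concrete, the sketch below computes "balanced" weights in plain Python using the formula n_samples / (n_classes * class_count), which is the heuristic documented for scikit-learn's `compute_class_weight('balanced', ...)`. The helper name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Return a dict mapping class index to weight
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Toy imbalanced label vector: class 0 is common, class 2 is rare
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, so their misclassifications contribute more to the loss. Keras `model.fit` takes `class_weight` as a dictionary from class index to weight, which is why the result is returned in that form.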

In [None]:
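def train_and_evaluate(model, train_x, train_y, val_x, val_y, preprocess=None, epochs=20, augment=None):
    # Use None as default to avoid sharing mutable default arguments between calls
    preprocess = preprocess or {}
    augment = augment or {}

    # 'optimizer' is a global variable that is defined in the model cell
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

    train_gen = preprocessing.image.ImageDataGenerator(**preprocess, **augment)
    train_gen.fit(train_x)

    # The validation generator is fitted on the training data, so both use the same statistics
    val_gen = preprocessing.image.ImageDataGenerator(**preprocess)
    val_gen.fit(train_x)

    # Compute balanced class weights from the labels that were passed in,
    # so the weights also match the training split of each fold in k-fold
    y_ints = [y.argmax() for y in train_y]
    classes = np.unique(y_ints)
    weights = class_weight.compute_class_weight('balanced', classes=classes, y=y_ints)
    # Keras expects a dictionary that maps class index to weight
    class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

    history = model.fit(train_gen.flow(train_x, train_y), epochs=epochs,
                        validation_data=val_gen.flow(val_x, val_y), class_weight=class_weights)

    fig, axs = plt.subplots(1, 2, figsize=(20, 5))

    for i, metric in enumerate(['loss', 'accuracy']):
        axs[i].plot(history.history[metric])
        axs[i].plot(history.history['val_' + metric])
        axs[i].legend(['training', 'validation'], loc='best')

        axs[i].set_title('Model ' + metric)
        axs[i].set_ylabel(metric)
        axs[i].set_xlabel('epoch')

    plt.show()

    # Evaluate once and reuse the result for both printing and returning
    val_accuracy = model.evaluate(val_gen.flow(val_x, val_y))[1]
    print(f"Validation Accuracy: {val_accuracy}")
    return val_accuracy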

## The actual model

We start with a simple convolutional neural network with 2 convolutional layers, each followed by pooling, and two dense hidden layers. The kernel size, number of filters, and number of nodes are specified in the code cell. This first version of the model gives us a validation accuracy of approximately 64%.
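To see how the 27x27 input shrinks through the network, the spatial sizes can be traced by hand. The sketch below assumes the settings from the code cell (3x3 kernels with 'same' padding, 3x3 max pooling with the Keras default stride equal to the pool size, and 64 filters in the second convolutional layer); the helper functions are illustrative only.

```python
def conv_same(size):
    # A convolution with 'same' padding and stride 1 keeps the spatial size
    return size

def max_pool(size, pool):
    # Pooling with stride equal to the pool size and no padding
    return (size - pool) // pool + 1

size = 27
first_block = max_pool(conv_same(size), 3)          # after the first conv + pooling
second_block = max_pool(conv_same(first_block), 3)  # after the second conv + pooling

# The flatten layer sees height * width * number of filters in the last conv layer
flattened = second_block * second_block * 64
print(first_block, second_block, flattened)
```

Two rounds of 3x3 pooling reduce 27 to 9 and then to 3, so a third pooling block of the same kind would leave only a 1x1 feature map.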

In [None]:
first_layer_filters = 32
second_layer_filters = 64
third_layers_filters = 128

kernelsize = (3,3)
inputshape = (27,27, 1)
first_hidden_layer_nodes = 1024
second_hidden_layer_nodes = 512
output_nodes = 121
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
# optimizer = 'adam'

model_1 = models.Sequential()

model_1.add(layers.Conv2D(first_layer_filters, kernelsize, activation = 'relu', padding = 'same', input_shape = inputshape))
model_1.add(layers.MaxPooling2D((3, 3)))
model_1.add(layers.Dropout(0.2))


model_1.add(layers.Conv2D(second_layer_filters, kernelsize, activation = 'relu', padding = 'same'))
model_1.add(layers.MaxPooling2D((3, 3)))
model_1.add(layers.BatchNormalization())

model_1.add(layers.Flatten())
model_1.add(layers.Dropout(0.35))
model_1.add(layers.Dense(first_hidden_layer_nodes, activation = 'relu'))
model_1.add(layers.Dropout(0.35))
model_1.add(layers.Dense(second_hidden_layer_nodes, activation = 'relu'))
model_1.add(layers.Dropout(0.35))
model_1.add(layers.Dense(output_nodes, activation = 'softmax'))

train_and_evaluate(model_1, image_train, label_train, image_val, label_val, epochs = 80)

In [None]:
model_1.summary()


In [None]:
# Filter out the largest example (and possibly the second-largest) to illustrate resizing
storage_1 = 0
storage_2 = 0
store_img1 = None
store_img2 = None
for image in train_data:
    store = image.shape[0]
    if store > storage_1:
        # The previous largest image becomes the second-largest
        store_img2, storage_2 = store_img1, storage_1
        store_img1, storage_1 = image, store
    elif storage_2 < store < storage_1:
        store_img2, storage_2 = image, store
        

# Plot the different manners of interpolation for comparison
f, axarr = plt.subplots(2,3, figsize=(12, 12))
axarr[0,0].imshow(store_img1, cmap = 'gray')
axarr[0,0].set_title('Original picture:')
axarr[1,0].imshow(cv2.resize(store_img1, dsize = (32, 32), interpolation = cv2.INTER_LINEAR), cmap = 'gray')
axarr[1,0].set_title('Bilinear interpolation:')
axarr[0,1].imshow(cv2.resize(store_img1, dsize = (32, 32), interpolation = cv2.INTER_AREA), cmap = 'gray')
axarr[0,1].set_title('Pixel area relation interpolation:')
axarr[1,1].imshow(cv2.resize(store_img1, dsize = (32, 32), interpolation = cv2.INTER_NEAREST), cmap = 'gray')
axarr[1,1].set_title('Nearest-neighbor interpolation:')
axarr[0,2].imshow(cv2.resize(store_img1, dsize = (32, 32), interpolation = cv2.INTER_CUBIC), cmap = 'gray')
axarr[0,2].set_title('Bicubic interpolation:')
axarr[1,2].imshow(cv2.resize(store_img1, dsize = (32, 32), interpolation = cv2.INTER_LANCZOS4), cmap = 'gray')
axarr[1,2].set_title('Lanczos interpolation:')

#plt.savefig('resizingmethods.png')

In [None]:
# First compute the predicted class indices with the trained model
# (argmax over the softmax output, since predict_classes is not available in recent TensorFlow versions)
y_pred = np.argmax(model_1.predict(image_val), axis=1)

# Convert the one-hot validation labels to class indices for the confusion matrix
y_true = np.argmax(label_val, axis=1)

# Compute matrix
conf_matrix = tfmath.confusion_matrix(y_true, y_pred, num_classes=121)

# Remove the diagonal for clearer image
cnf_mtrx = np.array(conf_matrix)
np.fill_diagonal(cnf_mtrx, 0)

# Compute heatmap for absolute numbers

plt.figure(figsize=(60,60))
ax = sns.heatmap(cnf_mtrx, annot=True, fmt="d", xticklabels=categories_list, yticklabels=categories_list)
ax.set(xlabel='Predicted Class', ylabel='Actual Class')
plt.show()

# Compute the relative (%) cf matrix and corresponding heatmap

cnf_mtrx = np.nan_to_num(cnf_mtrx / cnf_mtrx.astype(float).sum(axis=0), nan=0.0)
plt.figure(figsize=(60,60))
ax = sns.heatmap(cnf_mtrx, annot=True, xticklabels=categories_list, yticklabels=categories_list)
ax.set(xlabel='Predicted Class', ylabel='Actual Class')

plt.show()

## Stratified K-folds

In the cell below, we train the model on K stratified folds. This lets us evaluate the model over several train/test splits that each preserve the distribution of classes.
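The idea behind stratification can be sketched in plain Python: group the sample indices by class and deal each group out over the folds in turn. This is a simplified illustration and not scikit-learn's exact `StratifiedKFold` algorithm; the function name and toy labels are assumptions.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so each fold gets a similar class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the samples of one class round-robin over the folds
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Toy imbalanced labels: six samples of class 0 and three of class 1
labels = [0] * 6 + [1] * 3
folds = stratified_folds(labels, n_splits=3)
for fold in folds:
    print(sorted(fold), Counter(labels[i] for i in fold))
```

Each fold keeps the 2:1 ratio of the full label set, whereas an unstratified split of sorted labels could leave a fold without any samples of the rare class.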

In [None]:
# Convert data to array before split to use in stratified kfolds
all_image = np.array(train_data_resized)
all_labels = np.array(train_labels)

# Convert one hot matrix back to label vector
y_ints_all = [y.argmax() for y in all_labels]
y_ints_all = np.array(y_ints_all)

n_splits = 5

# Define stratified kfold and number of splits
skf = StratifiedKFold(n_splits=n_splits)

val_accuracy_list = []

# Iterate over the splits
for train_index, test_index in skf.split(all_image, y_ints_all):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = all_image[train_index], all_image[test_index]
    y_train, y_test = all_labels[train_index], all_labels[test_index]

    
    first_layer_filters = 32
    second_layer_filters = 64
    third_layers_filters = 128

    kernelsize = (3,3)
    inputshape = (27,27, 1)
    first_hidden_layer_nodes = 1024
    second_hidden_layer_nodes = 512
    output_nodes = 121
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
    # optimizer = 'adam'

    model_1 = models.Sequential()

    model_1.add(layers.Conv2D(first_layer_filters, kernelsize, activation = 'relu', padding = 'same', input_shape = inputshape))
    model_1.add(layers.MaxPooling2D((3, 3)))
    model_1.add(layers.Dropout(0.2))


    model_1.add(layers.Conv2D(second_layer_filters, kernelsize, activation = 'relu', padding = 'same'))
    model_1.add(layers.MaxPooling2D((3, 3)))
    model_1.add(layers.BatchNormalization())

    model_1.add(layers.Flatten())
    model_1.add(layers.Dropout(0.35))
    model_1.add(layers.Dense(first_hidden_layer_nodes, activation = 'relu'))
    model_1.add(layers.Dropout(0.35))
    model_1.add(layers.Dense(second_hidden_layer_nodes, activation = 'relu'))
    model_1.add(layers.Dropout(0.35))
    model_1.add(layers.Dense(output_nodes, activation = 'softmax'))
    
    # Call train and evaluate for every split
    val_accuracy = train_and_evaluate(model_1, X_train, y_train, X_test, y_test, epochs = 80)
    val_accuracy_list.append(val_accuracy)
print("------------------------------------------------------------------")
print(f"The mean accuracy of {n_splits} folds is {np.mean(val_accuracy_list)}")