## Summary

This notebook details my process for using a convolutional neural network (CNN) to classify white blood cells by subtype. The dataset comes from Kaggle (https://www.kaggle.com/datasets/paultimothymooney/blood-cells) and includes approximately 12,000 augmented JPEG images of cells taken through a microscope. The images came prepackaged in train and test sets that were balanced across the four classes. Images were loaded with the Keras preprocessing class ImageDataGenerator to produce the image pixel arrays and labels. As part of my EDA, I examined the distribution of RGB pixel values for each class and ran an ANOVA test to check whether the classes differ in mean pixel intensity. Inspection of the distributions revealed slight differences, such as the higher density of high pixel values for Monocytes, the higher density of low pixel values for Lymphocytes, and the relatively similar distributions of Eosinophils and Neutrophils. My hypothesis was that the model would pick up on these differences, making classes 1 and 2 (Lymphocyte and Monocyte) easier to classify than classes 0 and 3 (Eosinophil and Neutrophil). The modeling process involved trialing different model complexities, batch sizes, image sizes, learning rates and regularization techniques. Results were evaluated using accuracy and categorical cross-entropy loss. Throughout the process, test and validation scores steadily increased but were significantly outpaced by training scores, a sign of overfitting. Regularization techniques such as Dropout layers and L2 regularization were trialed with varying degrees of success. Ultimately, the best model recorded a test accuracy of 81% and a loss of 0.545. My hypothesis about which classes would be easiest and most difficult to classify was confirmed by the confusion matrix and classification report. As next steps, I would like to add more images to the training set and explore additional regularization techniques to reduce overfitting.


In [None]:
#Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import f_oneway
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, classification_report

import tensorflow as tf
# import everything from tensorflow.keras rather than mixing it with the standalone keras package
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras import regularizers

## Load Data with ImageDataGenerator

In [None]:
#assign train and test os paths  

train_path = '/Users/mike/Flatiron/Blood_Cells/Data/dataset2-master/dataset2-master/images/TRAIN'
test_path = '/Users/mike/Flatiron/Blood_Cells/Data/dataset2-master/dataset2-master/images/TEST'

#generate image arrays and labels for train, validation and test

train_datagen = ImageDataGenerator(rescale=1./255, validation_split = .2)
train_generator = train_datagen.flow_from_directory(train_path, target_size =(256,256), batch_size = 64,
                                                   class_mode = 'categorical', subset ='training')

test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(test_path, target_size =(256,256), batch_size = 64,
                                                  class_mode= 'categorical', shuffle = False)

val_generator = train_datagen.flow_from_directory(train_path, target_size =(256,256), batch_size = 64,
                                                  class_mode= 'categorical', subset = 'validation', shuffle = False )

In [None]:
#confirm class balance for train and test

train_labels = train_generator.classes
test_labels = test_generator.classes

train_label, train_count = np.unique(train_labels, return_counts=True)
test_label, test_count = np.unique(test_labels, return_counts=True)

print('Train ~ {}'.format(list(zip(train_label, train_count))))
print('Test ~ {}'.format(list(zip(test_label, test_count))))

## EDA

For EDA, I will visualize the distribution of pixel intensities for each class to see if there are any noticeable differences. An ANOVA test will be used to check whether the classes have statistically significant differences in mean pixel intensity.
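As a reminder of what a one-way ANOVA measures, the sketch below computes the F statistic by hand with NumPy: the variance between group means divided by the variance within groups. The function name `one_way_anova_f` and the toy values are assumptions made for illustration; they are not part of the notebook and are not taken from the blood cell images.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# toy values for illustration only
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_groups)
print(f_stat)
```

A large F statistic means the group means are far apart relative to the spread inside each group, which is what leads to a small p-value.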

In [None]:
# initialize empty arrays to store pixel values and labels
pixels = []
labels = []

# iterate over batches generated by train_generator. Append values to respective lists
for x_batch, y_batch in train_generator:
    pixels.append(x_batch)
    labels.append(y_batch)
    # stop iteration when all batches have been processed
    if len(labels) * train_generator.batch_size >= train_generator.n:
        break

# combine pixel values and labels into single arrays
pixels = np.concatenate(pixels, axis=0)
labels = np.concatenate(labels, axis=0)

# group pixel values by class label
class_pixels = {}
for i in range(train_generator.num_classes):
    class_pixels[i] = pixels[labels[:, i] == 1]

# calculate mean and standard deviation for each class
class_stats = {}
for i in range(train_generator.num_classes):
    class_stats[i] = {}
    class_stats[i]['mean'] = np.mean(class_pixels[i], axis=0)
    class_stats[i]['std'] = np.std(class_pixels[i], axis=0)

In [None]:
#test whether mean pixel intensity differs across classes

alpha = .05

#flatten each class's mean image into a 1-D sample of mean pixel values
class_0_pixels = class_stats[0]['mean'].ravel()
class_1_pixels = class_stats[1]['mean'].ravel()
class_2_pixels = class_stats[2]['mean'].ravel()
class_3_pixels = class_stats[3]['mean'].ravel()

# perform a one-way ANOVA on the four classes (returns one F statistic and one p-value)
f_statistic, p_value = f_oneway(class_0_pixels, class_1_pixels, class_2_pixels, class_3_pixels)

np.set_printoptions(precision=10)

print("Overall p-value:", p_value)
if p_value < alpha:
    print('We reject the null hypothesis that the classes share the same mean pixel intensity')
else:
    print('We fail to reject the null hypothesis')


I will use seaborn to graph the distributions of the 4 classes on one plot, separated using hue. In order to do so, I will combine all the pixel values and labels into a dataframe and take a random sample of 5 million pixels from each class.
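The labelling and per-class sampling described here can also be sketched with NumPy alone, which avoids tracking index ranges by hand. This is a scaled-down illustration under stated assumptions: `toy_class_pixels` is a hypothetical stand-in for the real per-class pixel arrays, and the sample size is 200 rather than 5 million.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in for the per-class pixel arrays, with unequal sizes
toy_sizes = [1000, 900, 950, 1100]
toy_class_pixels = {i: rng.random(n) for i, n in enumerate(toy_sizes)}

# concatenate the values and build one class label per pixel value
values = np.concatenate([toy_class_pixels[i] for i in range(4)])
sizes = [toy_class_pixels[i].size for i in range(4)]
class_labels = np.repeat(np.arange(4), sizes)

# draw the same number of pixel values from every class, without replacement
sample_size = 200
sampled_values, sampled_labels = [], []
for c in range(4):
    idx = np.flatnonzero(class_labels == c)
    chosen = rng.choice(idx, size=sample_size, replace=False)
    sampled_values.append(values[chosen])
    sampled_labels.append(class_labels[chosen])

sampled_values = np.concatenate(sampled_values)
sampled_labels = np.concatenate(sampled_labels)
print(np.bincount(sampled_labels))
```

Because the labels are derived from the array sizes, they stay correct if the number of images per class changes between runs.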

In [None]:
# concatenate pixel values for each class into one array
x = np.concatenate([class_pixels[i].ravel() for i in range(train_generator.num_classes)])

# create a DataFrame with the concatenated pixel values and a holder value for class
data = pd.DataFrame({'Pixel Values': x, 'Class':0})

In [None]:
# count the pixel values in each class 
for i in range(train_generator.num_classes):
    print('There are {} pixels in class {}'.format(class_pixels[i].size, i))

In [None]:
# number of pixel values contributed by each class, in the order they were concatenated
# (computed from the arrays so the ranges always match the data that was loaded)
x, y, z, t = (class_pixels[i].size for i in range(train_generator.num_classes))

# Set the values of the 'Class' column
data.loc[:x-1, 'Class'] = 0
data.loc[x:x+y-1, 'Class'] = 1
data.loc[x+y:x+y+z-1, 'Class'] = 2
data.loc[x+y+z:x+y+z+t-1, 'Class'] = 3


In [None]:
# Create a new dataframe with a random sample of 5,000,000 rows from each class

#set variables and initialize a list to hold sampled rows
sample_size = 5000000
class_col = 'Class'
sampled_dataframes = []

# Sample from each class and concatenate the results
for class_val in range(train_generator.num_classes):
    class_subset = data[data[class_col] == class_val]
    class_sample = class_subset.sample(n=sample_size)
    sampled_dataframes.append(class_sample)

# Concatenate the sampled dataframes
sampled_df = pd.concat(sampled_dataframes, ignore_index=True)

In [None]:
# plot histogram of pixel values with each class separated by hue
sns.histplot(sampled_df, x='Pixel Values', hue='Class', alpha=0.5, kde=True)
plt.xlabel('Pixel Value')
plt.ylabel('Frequency')
plt.xlim(0.4, 1.0)
plt.ylim(0,120000)
plt.title('Pixel Value Frequency by Class')
plt.legend(['Eosinophil', 'Lymphocyte', 'Monocyte', 'Neutrophil'])
plt.show()

Inspection of the distribution shows a couple of interesting trends. First, Lymphocytes and Monocytes seem to have slightly wider distributions, with more pixel values pushed out towards the tails in opposite directions. This tendency toward more extreme pixel intensities may make them easier for the model to detect. Second, Eosinophil and Neutrophil are both more densely packed around the mean. This suggests to me that they are more similar to each other and will be more difficult to distinguish.

## Baseline Model

For the baseline model, I will create a very simple convolutional neural network (CNN) with only one convolutional layer, max pooling, and one dense layer.

In [None]:
#instantiate a model
base_model = Sequential()


# add the input layer  
base_model.add(Conv2D(filters=32,
                        kernel_size=(3, 3),
                        activation='relu',
                        input_shape=(256, 256, 3)))


# max pool in 2x2 window
base_model.add(MaxPooling2D(pool_size=(2, 2)))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
base_model.add(Flatten())
base_model.add(Dense(64, activation='relu'))
base_model.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
base_model.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
#fit the model with the training data. Start with 20 epochs at 20 steps per epoch  
history_base = base_model.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 20)

In [None]:
# evaluate_generator is deprecated; evaluate accepts generators directly
test_loss, test_acc = base_model.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

Baseline test accuracy is 45% and loss is 0.4796. Training accuracy is already very high, showing that the model is overfitting from the start. I will try adding complexity to increase accuracy before I attempt to fix the overfitting with regularization.
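One plausible contributor to the early overfitting is the sheer size of the first dense layer, which sits directly on top of a large flattened feature map. The sketch below counts the trainable parameters of the baseline architecture by hand, assuming the Keras defaults of `'valid'` padding and stride 1 for `Conv2D`, and non-overlapping windows for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square convolution kernel."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

# baseline: 256x256x3 input, one 3x3 convolution with 32 filters, 2x2 max pooling
conv_out = 256 - 3 + 1            # 'valid' padding shrinks each side to 254
pool_out = conv_out // 2          # 2x2 pooling halves it to 127
flat = pool_out * pool_out * 32   # length of the flattened feature vector

params = {
    'conv2d': conv2d_params(3, 3, 32),
    'dense_64': dense_params(flat, 64),
    'dense_4': dense_params(64, 4),
}
total = sum(params.values())
print(params, total)
```

Almost all of the roughly 33 million parameters belong to the first dense layer, which is far more than the number of images in the dataset.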

## Model 1

In the first model, I will add more convolutional and dense layers to increase the model's complexity.

In [None]:
#instantiate a model
model_1 = Sequential()


# add the input layer 
model_1.add(Conv2D(filters=64,
                        kernel_size=(3, 3),
                        activation='relu',
                        input_shape=(256, 256, 3)))
model_1.add(MaxPooling2D(pool_size=(2, 2)))

#add second convolutional layer
model_1.add(Conv2D(filters=32,
                        kernel_size=(3, 3),
                        activation='relu'))
model_1.add(MaxPooling2D(pool_size=(2, 2)))

#add third convolutional layer
model_1.add(Conv2D(filters=16,
                        kernel_size=(3, 3),
                        activation='relu'))
model_1.add(MaxPooling2D(pool_size=(2, 2)))


# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_1.add(Flatten())
model_1.add(Dense(128, activation='relu'))
model_1.add(Dense(64, activation='relu'))
model_1.add(Dense(32, activation='relu'))
model_1.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_1.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
#fit the model with the training data.  
history_1 = model_1.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 20)

Based on the validation scores, it is clear the added complexity did not help the model's performance.

## Model 2

In the second model, I will reduce the amount of added complexity and use fewer filters in the convolutional layers.

In [None]:
#instantiate a model
model_2 = Sequential()


# add the input layer  
model_2.add(Conv2D(filters=32,
                        kernel_size=(3, 3),
                        activation='relu',
                        input_shape=(256, 256, 3)))

model_2.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer (input_shape is only needed on the first layer)
model_2.add(Conv2D(filters=16,
                        kernel_size=(3, 3),
                        activation='relu'))
model_2.add(MaxPooling2D(pool_size=(2, 2)))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_2.add(Flatten())
model_2.add(Dense(128, activation='relu'))
model_2.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_2.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
history_2 = model_2.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 20)

There is a slight improvement over Model 1, but the validation scores remain fairly consistent.

## Model 3

My hypothesis is that since only one cell in each image is dyed, I can focus on the target cell more by using a smaller image size. I will also try smaller batch sizes, a narrower input layer and more epochs.
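When comparing runs with different batch sizes and epoch counts, it can help to work out how many images each configuration actually processes. The sketch below does that arithmetic for the settings used in Model 2 (batch size 64, 20 epochs) and Model 3 (batch size 32, 30 epochs), both with `steps_per_epoch = 20`. The helper `images_seen` is hypothetical, and the figures are upper bounds because a generator can yield a smaller final batch.

```python
def images_seen(batch_size, steps_per_epoch, epochs):
    """Return (images per epoch, images over the whole run) as upper bounds."""
    per_epoch = batch_size * steps_per_epoch
    return per_epoch, per_epoch * epochs

model_2_budget = images_seen(batch_size=64, steps_per_epoch=20, epochs=20)
model_3_budget = images_seen(batch_size=32, steps_per_epoch=20, epochs=30)

print('Model 2:', model_2_budget)
print('Model 3:', model_3_budget)
```

With these settings an epoch covers only part of the training set, and Model 3 processes fewer images in total than Model 2 despite running for more epochs.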

In [None]:
#regenerate the image data using 128x128 as the image size  

train_path = '/Users/mike/Flatiron/Blood_Cells/Data/dataset2-master/dataset2-master/images/TRAIN'
test_path = '/Users/mike/Flatiron/Blood_Cells/Data/dataset2-master/dataset2-master/images/TEST'

#generate image arrays and labels for train, validation and test

train_datagen = ImageDataGenerator(rescale=1./255, validation_split = .2)
train_generator = train_datagen.flow_from_directory(train_path, target_size =(128,128), batch_size = 32,
                                                   class_mode = 'categorical', subset ='training')

test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(test_path, target_size =(128,128), batch_size = 32,
                                                  class_mode= 'categorical', shuffle = False)

val_generator = train_datagen.flow_from_directory(train_path, target_size =(128,128), batch_size = 32,
                                                  class_mode= 'categorical', subset = 'validation', shuffle = False )

In [None]:
#instantiate a model
model_3 = Sequential()


# add the input layer 
model_3.add(Conv2D(filters=16,
                        kernel_size=(3, 3),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_3.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_3.add(Conv2D(filters=32,
                        kernel_size=(3, 3),
                        activation='relu'))
model_3.add(MaxPooling2D(pool_size=(2, 2)))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_3.add(Flatten())
model_3.add(Dense(128, activation='relu'))
model_3.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_3.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
history_3 = model_3.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 30)

In [None]:
test_loss, test_acc = model_3.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

Model 3 is significantly faster and performs better than any model previously run. Test accuracy was 57% with loss of 0.91.

## Model 4

The changes in Model 3 seem to have created improvements. In this next model, I will add more complexity, reduce the kernel_size, and add more epochs.

In [None]:
model_4 = Sequential()


# add the input layer 
model_4.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_4.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_4.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu'))
model_4.add(MaxPooling2D(pool_size=(2, 2)))

# add third convolutional layer
model_4.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu'))
model_4.add(MaxPooling2D(pool_size=(2, 2)))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_4.add(Flatten())
model_4.add(Dense(256, activation='relu'))
model_4.add(Dense(128, activation='relu'))
model_4.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_4.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
history_4 = model_4.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 50)

Model 4 showed the highest validation performance yet. Improvements were consistent all the way to the end, which leads me to believe I need to add more epochs to allow it to learn longer. Since the model is so fast, I don't see this as an issue right now.

## Model 5

Model 5 adds one more dense layer (64 units) to the Model 4 architecture and keeps the same convolutional layers. I will run this model for 100 epochs to allow more time for the model to learn.

In [None]:
model_5 = Sequential()


# add the input layer  
model_5.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_5.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_5.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu'))
model_5.add(MaxPooling2D(pool_size=(2, 2)))

# add third convolutional layer
model_5.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu'))
model_5.add(MaxPooling2D(pool_size=(2, 2)))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_5.add(Flatten())
model_5.add(Dense(256, activation='relu'))
model_5.add(Dense(128, activation='relu'))
model_5.add(Dense(64, activation = 'relu'))
model_5.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_5.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
history_5 = model_5.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 100)

In [None]:
test_loss, test_acc = model_5.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

Model 5 has the highest test accuracy yet at 73%. Test loss is 1.216. This is currently my best model. Training accuracy is up to 97%, so the model is overfitting relative to the validation and test sets. In my next model iteration, I will try adding some regularization techniques.

## Model 6

In this model, Dropout layers have been added after the second and third convolutional blocks, and a fourth dense layer (32 units) has been added.
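For readers unfamiliar with dropout, the sketch below shows the idea in plain NumPy: during training a random fraction of activations is zeroed, and the survivors are scaled up by `1 / (1 - rate)` so the expected sum is unchanged. This is an illustrative re-implementation under that assumption, not the Keras layer itself, and the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero a random fraction `rate` of x and rescale the kept values."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(42)
activations = np.ones((4, 8))
dropped = dropout(activations, rate=0.25, rng=rng)
print(dropped)
```

At inference time no units are dropped, which is why the rescaling during training matters.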

In [None]:
model_6 = Sequential()

# add the input layer 
model_6.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_6.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_6.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu'))
model_6.add(MaxPooling2D(pool_size=(2, 2)))
model_6.add(Dropout(0.25))

# add third convolutional layer
model_6.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu'))
model_6.add(MaxPooling2D(pool_size=(2, 2)))
model_6.add(Dropout(0.25))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_6.add(Flatten())
model_6.add(Dense(256, activation='relu'))
model_6.add(Dense(128, activation='relu'))
model_6.add(Dense(64, activation = 'relu'))
model_6.add(Dense(32, activation = 'relu'))
model_6.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_6.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])


In [None]:
history_6 = model_6.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 100)

In [None]:
test_loss, test_acc = model_6.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

Adding the Dropout layers significantly helped with the overfitting and raised the test accuracy to 77%. The test loss fell by half to 0.589. I will consider Model 6 my best model and keep the Dropout layers in subsequent iterations.

## Model 7

In the following model, I would like to experiment with the learning rate. I will use a step decay approach to decrease the learning rate at later epochs, allowing the model to fine-tune the weights after the majority of the accuracy gains have been achieved. Based on the learning progression from Model 6, this point seems to have been around 75 epochs, so that will be my target inflection point. I will change the learning rate using a schedule function and the LearningRateScheduler from Keras callbacks.
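To make the step decay concrete, the sketch below evaluates the schedule formula on its own, without Keras, using the settings applied in this model (initial rate 1e-3, decay factor 0.25, step size 75). The function name `step_decay` is hypothetical, and epochs are counted from 0 as Keras does.

```python
import math

def step_decay(epoch, initial_lr=1e-3, decay_factor=0.25, step_size=75):
    """Learning rate after floor(epoch / step_size) multiplicative drops."""
    return initial_lr * decay_factor ** math.floor(epoch / step_size)

# inspect the rate at the start, just before and after the drop, and at the end
rates = {epoch: step_decay(epoch) for epoch in (0, 74, 75, 99)}
for epoch, lr in rates.items():
    print(epoch, lr)
```

Over 100 epochs the rate therefore changes exactly once, falling to a quarter of its initial value from epoch 75 onward.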

In [None]:
def step_decay_schedule(initial_lr, decay_factor, step_size):
    '''
    Wrapper function to create a LearningRateScheduler with step decay schedule.
    '''
    def schedule(epoch):
        return initial_lr * (decay_factor ** np.floor(epoch/step_size))
    
    return LearningRateScheduler(schedule)

In [None]:
model_7 = Sequential()

# add the input layer 
model_7.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_7.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_7.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu'))
model_7.add(MaxPooling2D(pool_size=(2, 2)))
model_7.add(Dropout(0.25))

# add third convolutional layer
model_7.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu'))
model_7.add(MaxPooling2D(pool_size=(2, 2)))
model_7.add(Dropout(0.25))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_7.add(Flatten())
model_7.add(Dense(256, activation='relu'))
model_7.add(Dense(128, activation='relu'))
model_7.add(Dense(64, activation = 'relu'))
model_7.add(Dense(32, activation = 'relu'))
model_7.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_7.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])
lr_sched = step_decay_schedule(1e-3, decay_factor=0.25, step_size=75)

In [None]:
history_7 = model_7.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 100, callbacks = [lr_sched])

In [None]:
test_loss, test_acc = model_7.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

Model 7 returns the best results yet with 81% test accuracy and loss of 0.545.

## Model 8

In this model, I will try a different approach with the learning rate. I will lower it twice over the course of 200 epochs, after epochs 75 and 150, each time by 10% of the initial rate.
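The sketch below tabulates this two-drop schedule at its boundaries and contrasts it with a compounding alternative. The function name `two_drop_lr` is hypothetical; the thresholds and the 1e-3 initial rate follow the values used in this model.

```python
import math

def two_drop_lr(epoch, initial_lr=1e-3, decay_factor=0.10):
    """Subtract decay_factor * initial_lr once after epoch 75 and again after 150."""
    if epoch <= 75:
        return initial_lr
    if epoch <= 150:
        return initial_lr * (1 - decay_factor)
    return initial_lr * (1 - 2 * decay_factor)

for epoch in (75, 76, 150, 151):
    print(epoch, two_drop_lr(epoch))

# a compounding schedule would instead end at 1e-3 * 0.9 * 0.9
compounded_final = 1e-3 * 0.9 ** 2
print(compounded_final)
```

The final rate is 8e-4 because both drops are measured against the initial rate; compounding the 10% reductions would give 8.1e-4 instead.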

In [None]:
# named differently from the Model 7 helper so that the earlier three-argument
# step_decay_schedule is not overwritten and can still be called later
def two_step_decay_schedule(initial_lr, decay_factor):
    '''
    Wrapper function to create a LearningRateScheduler that lowers the rate twice.
    Each drop is decay_factor * initial_lr: once after epoch 75, again after epoch 150.
    '''
    def schedule(epoch):
        if epoch <= 75:
            return initial_lr
        elif epoch <= 150:
            return initial_lr - (initial_lr * decay_factor)
        else:
            return initial_lr - (initial_lr * (2*decay_factor))
    
    return LearningRateScheduler(schedule)

lr_sched = two_step_decay_schedule(1e-3, decay_factor=0.10)

In [None]:
model_8 = Sequential()

# add the input layer 
model_8.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_8.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_8.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu'))
model_8.add(MaxPooling2D(pool_size=(2, 2)))
model_8.add(Dropout(0.25))

# add third convolutional layer
model_8.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu'))
model_8.add(MaxPooling2D(pool_size=(2, 2)))
model_8.add(Dropout(0.25))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_8.add(Flatten())
model_8.add(Dense(256, activation='relu'))
model_8.add(Dense(128, activation='relu'))
model_8.add(Dense(64, activation = 'relu'))
model_8.add(Dense(32, activation = 'relu'))
model_8.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_8.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
history_8 = model_8.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 200, callbacks = [lr_sched])

In [None]:
test_loss, test_acc = model_8.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

Model 8's performance is similar to Model 7's, but its loss is larger and it is more computationally expensive. I will continue with Model 7 as my best model.

## Model 9

At this point, I will shift my focus to further addressing the overfitting. I will attempt to add L2 regularization to my convolutional layers.

In [None]:
model_9 = Sequential()

# add the input layer  
model_9.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_9.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_9.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu',
                  kernel_regularizer = 'l2'))
model_9.add(MaxPooling2D(pool_size=(2, 2)))
model_9.add(Dropout(0.25))

# add third convolutional layer
model_9.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu',
                   kernel_regularizer = 'l2'))
model_9.add(MaxPooling2D(pool_size=(2, 2)))
model_9.add(Dropout(0.25))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_9.add(Flatten())
model_9.add(Dense(256, activation='relu'))
model_9.add(Dense(128, activation='relu'))
model_9.add(Dense(64, activation = 'relu'))
model_9.add(Dense(32, activation = 'relu'))
model_9.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_9.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])
lr_sched = step_decay_schedule(1e-3, decay_factor=0.25, step_size=75)

In [None]:
history_9 = model_9.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 100, callbacks = [lr_sched])

The penalty appears to be too strong and holds the model back from reaching its full predictive power.

## Model 10

The default penalty value for the L2 regularizer is 0.01. I will reduce that to 0.001 and see how the model performs.
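The L2 penalty added to the loss is the regularization factor multiplied by the sum of the squared weights, so reducing the factor tenfold reduces the penalty tenfold for the same weights. The sketch below illustrates this with NumPy on a randomly generated kernel. The kernel values are invented; only its shape `(2, 2, 16, 32)` is chosen to match the second convolutional layer, and `l2_penalty` is a hypothetical helper.

```python
import numpy as np

def l2_penalty(weights, factor):
    """L2 regularization term: factor * sum of squared weights."""
    return factor * np.sum(np.square(weights))

rng = np.random.default_rng(0)
# invented weights with the shape of a 2x2 kernel, 16 input channels, 32 filters
kernel = rng.normal(scale=0.1, size=(2, 2, 16, 32))

default_penalty = l2_penalty(kernel, 0.01)
reduced_penalty = l2_penalty(kernel, 0.001)
print(default_penalty, reduced_penalty)
```

The penalty only bites relative to the size of the data loss, so the right factor generally has to be found by trial, as is done here.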

In [None]:
model_10 = Sequential()

# add the input layer 
model_10.add(Conv2D(filters=16,
                        kernel_size=(2, 2),
                        activation='relu',
                        input_shape=(128, 128, 3)))

model_10.add(MaxPooling2D(pool_size=(2, 2)))

# add second convolutional layer
model_10.add(Conv2D(filters=32,
                        kernel_size=(2, 2),
                        activation='relu',
                   kernel_regularizer = regularizers.L2(1e-3)))
model_10.add(MaxPooling2D(pool_size=(2, 2)))
model_10.add(Dropout(0.25))

# add third convolutional layer
model_10.add(Conv2D(filters=64,
                        kernel_size=(2, 2),
                        activation='relu',
                   kernel_regularizer = regularizers.L2(1e-3)))
model_10.add(MaxPooling2D(pool_size=(2, 2)))
model_10.add(Dropout(0.25))

# connect all nodes with dense layers. output for multi-categorical with 4 classes  
model_10.add(Flatten())
model_10.add(Dense(256, activation='relu'))
model_10.add(Dense(128, activation='relu'))
model_10.add(Dense(64, activation = 'relu'))
model_10.add(Dense(32, activation = 'relu'))
model_10.add(Dense(4, activation='softmax'))

#using adam optimizer, categorical_crossentropy to measure loss and accuracy as our metric  
model_10.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])

In [None]:
history_10 = model_10.fit(train_generator, steps_per_epoch = 20, verbose = 1, validation_data = val_generator,
                              epochs = 200)

In [None]:
test_loss, test_acc = model_10.evaluate(test_generator, verbose=2)
print('Test loss: {}'.format(test_loss))
print('Test accuracy: {}'.format(test_acc))

The reduced penalty did not help the overfitting at later epochs and performance remains similar to previous models. Due to time constraints, I will move forward with Model 7 as my best model and evaluate its results.

## Evaluation

In [None]:
#plot the change in loss for the train and validation sets 
model7_history = pd.DataFrame(history_7.history)
model7_history.index.name = 'Epochs'

col_list = ['loss', 'val_loss']
model7_history[col_list].plot()
plt.ylabel('Categorical Cross Entropy')
plt.title('Training Loss History')
plt.show()

In [None]:
#plot the change in accuracy for the train and validation sets 
col_list = ['accuracy', 'val_accuracy']
model7_history[col_list].plot()
plt.ylabel('Accuracy')
plt.title('Training Accuracy History')
plt.show()

In [None]:
#print confusion matrix and classification report to see how the model performed across classes 
predictions = model_7.predict(test_generator)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes
cm = confusion_matrix(true_classes, predicted_classes)
ConfusionMatrixDisplay(cm).plot()
print(classification_report(true_classes, predicted_classes))

The classification report confirms my initial hypothesis that classes 1 and 2 would be the easiest to classify due to their different pixel value distributions, while classes 0 and 3 would present more of a challenge. The confusion matrix supports this by showing that classes 0 and 3 are most commonly misclassified as each other. The test accuracy of 81% is a significant improvement over the baseline of 45% and warrants further exploration with the model. With more time, I would like to gather more training data to improve accuracy and experiment with different regularization techniques and values to reduce the overfitting.
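The reading of the confusion matrix described above can also be done programmatically. The sketch below computes per-class recall and finds the pair of classes most often confused with each other. The matrix `toy_cm` contains invented counts for illustration only, not the notebook's results; rows are true classes and columns are predictions, matching the layout returned by scikit-learn.

```python
import numpy as np

# invented counts for illustration only; rows = true class, columns = predicted class
toy_cm = np.array([
    [5, 0, 0, 3],
    [0, 8, 0, 0],
    [1, 0, 7, 0],
    [4, 0, 0, 4],
])

# recall per class: correct predictions divided by the number of true samples
recall = np.diag(toy_cm) / toy_cm.sum(axis=1)

# count confusions in both directions for every pair and ignore the diagonal
confusion = toy_cm + toy_cm.T
np.fill_diagonal(confusion, 0)
pair = np.unravel_index(np.argmax(confusion), confusion.shape)

print('Recall per class:', recall)
print('Most confused pair:', pair)
```

Applied to the real matrix `cm`, the same steps would give a numeric check on which two classes the model mixes up most.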