In [None]:
import numpy as np
import keras
import pandas as pd
from keras.models import Sequential 
from keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
from keras.utils import np_utils
import cv2
import os
import glob

This project builds a convolutional neural network (CNN) classifier that sorts almost 60,000 images into 10 classes: lion, rocket, tank, flower, streetcar, deer, plane, bird, car, and truck. The training and testing data were acquired from the following competition:

https://www.kaggle.com/competitions/stat940-winter-2022-dc1

# Step 1: Loading Data

In [None]:
#Loading Training Labels

y_train= pd.read_csv('/kaggle/input/stat940-winter-2022-dc1/train_labels.csv')
del y_train['id']

#y_train.describe()
y_train

i.e. The training labels (y_train) are loaded by reading 'train_labels.csv', and the 'id' column is dropped so that only the labels remain.

i.e. The training data (x_train) is organized, tested, and loaded.

Problem: The output labels (y_train) are organized, but the input data (.jpg images) for both the training set (x_train) and the test set (x_test) are not stored in order in their folders. Reading the files in whatever order the folder returns them could mismatch the training inputs (x_train) with the training outputs (y_train), which would cause a major problem when training the model.

Solutions suggested:

a. Download the training data locally, reorganize it, upload it again, read each image using the imread() function, and append it to a list. Finally, convert the list into a numpy array using "np.array(Variable)".

b. Read the images in the directory sequentially by building each file name with an f-string, and append each image to a list. Perform this procedure for both the testing data (x_test) and the training data (x_train).

Option chosen: Option b. Our goal is to read the images sequentially and store them in a variable. Downloading the whole dataset and reorganizing it for a later upload would be time-consuming and unnecessary when all we need is to read the files in sequence and store them in a list. This list can then be converted into a numpy array and used as input for the model.
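The idea behind option b can be sketched as follows. This is a minimal illustration, not the loader used in this notebook: the helper name `ordered_image_paths` is hypothetical, and it only builds and checks the file paths, using empty placeholder files in a temporary directory instead of real images.

```python
import os
import tempfile

def ordered_image_paths(directory, count):
    """Return the paths 0.jpg, 1.jpg, ..., (count-1).jpg in index order.

    Hypothetical helper: it raises if an expected file is missing, so a gap
    cannot silently shift the images relative to their labels.
    """
    paths = [os.path.join(directory, f"{i}.jpg") for i in range(count)]
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"{len(missing)} image(s) missing, e.g. {missing[0]}")
    return paths

# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for i in (2, 0, 10, 1):  # created out of order on purpose
        open(os.path.join(tmp, f"{i}.jpg"), "w").close()
    names = [os.path.basename(p) for p in ordered_image_paths(tmp, 3)]
    sorted_listing = sorted(os.listdir(tmp))

print(names)           # ['0.jpg', '1.jpg', '2.jpg']
print(sorted_listing)  # ['0.jpg', '1.jpg', '10.jpg', '2.jpg']
```

The second print shows why sorting a directory listing is not enough: alphabetical order places 10.jpg before 2.jpg, whereas building the names from the index keeps image i aligned with label i.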

In [None]:
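#Loading Training Data

x_train=[]

#Read the images in index order (0.jpg, 1.jpg, ...) by building each path with an f-string

for a in range(50000):
    img= cv2.imread(f'/kaggle/input/stat940-winter-2022-dc1/train/train/{a}.jpg')
    if img is None:  # cv2.imread returns None instead of raising when a file cannot be read
        raise FileNotFoundError(f'Could not read training image {a}.jpg')
    x_train.append(img)

x_train= np.array(x_train)
print(x_train.shape)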
        

The shape of the training data (x_train) is (50000,32,32,3), where 50000 is the number of images, the two 32s are the height and width of each image (32*32), and the 3 is the number of color channels. Note that cv2.imread returns the channels in BGR order rather than RGB.
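Because OpenCV stores the color channels as blue, green, red, an image has to be converted before a library that expects RGB (such as matplotlib) displays it correctly. A minimal sketch of the conversion on a made-up two-pixel image, using only NumPy; the array values are illustrative and not taken from the dataset:

```python
import numpy as np

# Synthetic 1x2 image in OpenCV's BGR layout: a pure-blue and a pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps B and R, giving the RGB layout matplotlib expects.
rgb = bgr[..., ::-1]

print(bgr.shape)   # (1, 2, 3): height, width, channels
print(rgb[0, 0])   # [  0   0 255]: the blue pixel, now in RGB order
```

Since the training and test images are loaded the same way, the channel order is consistent for the model; it only matters when the images are displayed.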

In [None]:
fig_size = [20, 20]                               # specify figure size
plt.rcParams["figure.figsize"] = fig_size         # set default figure size
plt.figure(figsize=fig_size)                      # create new figure at that size

#Plot the first 10 training images of the dataset
for i in range(0,10):                          
    ax = plt.subplot(10, 10, i+1)                 # Specify the i'th subplot of a 10*10 grid
    img = x_train[i,:,:,:]                        # Choose i'th image from train data
    ax.get_xaxis().set_visible(False)             # Disable plot axis.
    ax.get_yaxis().set_visible(False)
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # cv2.imread returns BGR; convert to RGB for display
    
plt.show()

#NOTE: This portion of the code for printing training images was taken from a tutorial from 
#the STAT 940 course at the University of Waterloo.

Because the images were read in index order and stored in the "x_train" variable, they line up with the training labels in "y_train". Thus, the training data "x_train" is ready for the model.

In [None]:
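#Loading Testing Data

x_test=[]

#Read the images in index order (0.jpg, 1.jpg, ...) by building each path with an f-string

for a in range(10000):
    img1= cv2.imread(f'/kaggle/input/stat940-winter-2022-dc1/test/test/{a}.jpg')
    if img1 is None:  # cv2.imread returns None instead of raising when a file cannot be read
        raise FileNotFoundError(f'Could not read test image {a}.jpg')
    x_test.append(img1)

x_test= np.array(x_test)
print(x_test.shape)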

i.e. The test data (x_test) is organized, read, and loaded.

The shape of the test data (x_test) is (10000,32,32,3), where 10000 is the number of images, the two 32s are the height and width of each image (32*32), and the 3 is the number of color channels (in BGR order, as returned by cv2.imread).

# Step 2: Setting Parameters

In [None]:
batch_size= 256
num_classes= 10
epochs= 50

i.e. A batch size of 256, 50 epochs, and 10 classes are defined for the model.
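To make these settings concrete, the short calculation below works out how many weight updates they imply. It assumes that all 50000 training images are used in every epoch, with no validation split, which matches the fit call later in this notebook.

```python
import math

num_train = 50000   # number of training images loaded above
batch_size = 256
epochs = 50

steps_per_epoch = math.ceil(num_train / batch_size)           # the last batch is smaller
last_batch = num_train - (steps_per_epoch - 1) * batch_size   # images left for that batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)  # 196 80 9800
```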

# Step 3: Preparing Data

In [None]:

x_train= x_train.astype('float32') #type casting 
x_test= x_test.astype('float32')

x_train /=255 # Normalization of data: Divide by 255
x_test /=255

i.e. The data samples are cast to float32 and normalized to the range [0, 1].
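The effect of the cast and the division can be checked on a tiny made-up array. The pixel values below are illustrative only; the cast comes first because an unsigned 8-bit array cannot hold the fractional results of the division.

```python
import numpy as np

# Three example pixel values covering the full 8-bit range.
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Cast to float32 first, then scale into [0, 1].
scaled = pixels.astype('float32') / 255

print(scaled.dtype)                 # float32
print(scaled.min(), scaled.max())   # 0.0 1.0
```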

In [None]:
y_train= keras.utils.to_categorical(y_train, num_classes) #one-hot encoding: each label becomes a vector of length num_classes
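One-hot encoding turns each integer label into a vector of length num_classes that has a 1 at the label's position and 0 everywhere else. The sketch below reproduces that idea with plain NumPy on made-up labels; it illustrates the encoding and is not how Keras implements it.

```python
import numpy as np

num_classes = 10
labels = np.array([3, 0, 9])   # made-up example labels

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot.shape)   # (3, 10)
print(one_hot[0])      # 1.0 at index 3, 0.0 elsewhere
```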

# Step 4: CNN Model Setup

In [None]:
model = Sequential()

#layer 1: 128 kernels of size 5*5 with relu activation

model.add(Conv2D(128, (5,5), padding='same', activation= 'relu', input_shape=x_train.shape[1:]))

#layer 2: Max pooling with pool size 2*2

model.add(MaxPooling2D(pool_size=(2,2)))

#layer 3: Dropout Layer with a rate 0.25(randomly selected neurons are ignored)

model.add(Dropout(0.25))

#layer 4: 128 kernels of size 2*2 with relu activation (input_shape is only needed on the first layer)

model.add(Conv2D(128, (2,2), padding='same', activation= 'relu'))

model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Dropout(0.25))

#Max Pooling Layer
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(150, activation ='relu'))

#model.add(Dropout(0.5))
model.add(Dense(num_classes, activation = 'softmax'))

model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
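The size of this network can be worked out by hand. The arithmetic below assumes the 32*32*3 input reported earlier, that convolutions with padding='same' keep the spatial size, and that each 2*2 max pooling layer halves it. The variable names are introduced here for illustration only.

```python
# Spatial size: 'same' convolutions keep it, each 2x2 max pool halves it.
size = 32
for _ in range(3):          # three MaxPooling2D layers in the model
    size //= 2
flat = size * size * 128    # Flatten after the second conv's 128 channels

# Trainable parameters: weights (kernel_h * kernel_w * in * out) plus biases.
conv1 = 5 * 5 * 3 * 128 + 128
conv2 = 2 * 2 * 128 * 128 + 128
dense1 = flat * 150 + 150
dense2 = 150 * 10 + 10
total = conv1 + conv2 + dense1 + dense2

print(size, flat, total)    # 4 2048 384252
```

Most of the parameters sit in the first dense layer (307,350 of 384,252). Calling model.summary() should report the same per-layer counts if the assumptions above hold.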

In [None]:
model.fit(x_train,y_train, batch_size= batch_size, epochs= epochs)


# Step 5: Prediction of Test Labels

In [None]:
y_testB= model.predict(x_test)
y_test= np.argmax(y_testB, axis=1)

The model predicts the output labels for the test images. The predictions are stored in y_testB as one row of 10 class probabilities per image (the softmax output), so they have to be converted back to the form of the original labels, with classes ranging from 0 to 9. Thus, argmax is used to pick the most probable class for each image.
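A small example of this conversion, using made-up probabilities for two images and three classes rather than real model output:

```python
import numpy as np

# Each row holds the predicted class probabilities for one image.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])

# argmax along axis 1 returns the index of the largest value in each row.
labels = np.argmax(probs, axis=1)

print(labels)   # [1 0]
```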

# Step 6: Exporting Predicted Labels to CSV File

In [None]:
res = pd.DataFrame(y_test)
#res.index = x_test.index # its important for comparison
res.columns = ["label"]
res.to_csv("test_results5.csv")

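A dataframe "res" is created and y_test is stored in it under the column name "label". The "res" dataframe is then written to the file "test_results5.csv"; its row index (0 to 9999) is written as an unnamed first column.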
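If the competition expects a named id column, like the one in train_labels.csv, the index column can be given a header with the index_label argument of to_csv. Whether the submission format requires this is an assumption; the sketch below uses made-up predictions and writes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

y_pred = np.array([3, 0, 9])            # made-up predictions
res = pd.DataFrame({"label": y_pred})

buffer = io.StringIO()
res.to_csv(buffer, index_label="id")    # name the index column "id"
csv_text = buffer.getvalue()

print(csv_text)
```

The first line of the output is the header id,label, followed by one row per prediction.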