# CMSC 422 Final Project

## Author: Josue Melendez


### Data Preprocessing

We will build an ANN from scratch to classify hand-drawn images from 62 labeled classes and see whether it can identify them. We begin by importing the necessary packages: os for locating the data, PIL to load, process, and convert the images, and NumPy so that data processing and floating-point operations are done efficiently.

In [73]:
import os
from PIL import Image, ImageOps
import numpy as np

After importing our packages, we define the desired image size, then load each image, convert it to grayscale, resize it, and append it to X (our image-storing variable) and its label to y (our label-storing variable). Once the images are loaded, we reshape X so that each image becomes a flat row of 4800 pixel values scaled to the range [0, 1], since the pixels themselves form the input layer of our ANN.

In [None]:
data_dir = '/home/jems/cmsc422/final/samples'  
image_size = (80, 60)

X = []  
y = [] 
# search every directory we unzipped
for sample_id in range(1, 63):  
    sample_dir = os.path.join(data_dir, f'Sample{sample_id:03d}')
    
    for image_file in os.listdir(sample_dir):
        if image_file.endswith('.png'):
            # load the image
            image_path = os.path.join(sample_dir, image_file)
            image = Image.open(image_path)
            # Convert the image to grayscale, then resize it to image_size (width, height)
            image = ImageOps.grayscale(image)
            image = image.resize(image_size, Image.BILINEAR)
            # Append the grayscale image to X as a (60, 80) pixel array
            X.append(np.asarray(image))
            # Append the label to y (could be sample_id or another way to represent the label)
            y.append(sample_id)

X = np.array(X)
y = np.array(y)

# Reshape X to match the input layer dimensions for the ANN
X = X.reshape(-1, 4800) / 255.0 # 80 * 60 = 4800

We previously had y be 1-indexed, but since our labels will be used as indices into an array, we shift y down by one to avoid indexing issues later on. We then add print statements as a sanity check on the dimensions: X should be (3410, 80*60) = (3410, 4800) and y should be (3410,).

In [None]:
print(X.shape)
y = y - 1
print(y)
print(y.shape)

(3410, 4800)
[ 0  0  0 ... 61 61 61]
(3410,)


### ANN Construction

Now that we have preprocessed our data, we will define an ANN class. This ANN class consists of 4 functions: init, sigmoid, forward, and backpropagation.

Init is simply the initialization function for the ANN and defines the sizes of the input, hidden, and output layers. As we already know the sizes of these layers, they're defined as constants. We then use Xavier initialization to create the weight matrices for our hidden and output layers. Storing the weights as matrices lets us compute each layer for a whole batch with a single matrix multiplication. We also define our biases for the hidden and output layers as arrays of 0's. These will quickly change as we conduct forward passes and backpropagations, so their initial values are of little consequence.
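
As a rough illustration of the scaling described above, the sketch below draws a weight matrix with standard deviation sqrt(2 / (fan_in + fan_out)) and compares the empirical spread with the target. The helper name `xavier_normal` and the fixed seed are assumptions made for this example only; they do not appear in the notebook.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) matrix with std sqrt(2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

# Fixed seed chosen only to make this illustration repeatable.
rng = np.random.default_rng(0)

# Same layer sizes as the hidden layer in the ANN class (4800 inputs, 100 hidden units).
W_hidden = xavier_normal(4800, 100, rng)
target_std = np.sqrt(2.0 / (4800 + 100))

print(W_hidden.shape)
print(f"target std: {target_std:.4f}, empirical std: {W_hidden.std():.4f}")
```

With this many weights, the empirical standard deviation should land very close to the target, which keeps the initial pre-activations in a range where the sigmoid is not saturated.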

Sigmoid is simply the activation function that was specified. It maps any input s to 1 / (1 + e^(-s)), a value between 0 and 1, and is applied element-wise to arrays.

Forward defines a forward pass through the ANN, saving each step as an attribute so it can be reused during backpropagation. We define z1 as the input multiplied by the hidden-layer weights, plus the hidden biases. A1 is the result of passing z1 through the activation function (the sigmoid function). We do the same for the output layer, defining z2 and A2 accordingly, and then return A2, which represents our predictions.

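Backpropagation defines the path we take to propagate our gradients back through the network. We first calculate the gradients for our output and hidden layers, then adjust the weights and biases of both accordingly. We then use the Mean Squared Error function to compute a loss and return it so we can gauge the progress of our ANN. Note that the output-layer term `A2 - Y` used in the code is the gradient of a cross-entropy loss with respect to z2, so the returned MSE value acts as a progress metric rather than the exact quantity the updates minimize.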
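
One way to see which loss a given output-layer gradient corresponds to is a finite-difference check. The sketch below is a standalone illustration, not part of the notebook's ANN class: the function names and the small example values are assumptions made for this check. It compares numerical gradients of two losses against the analytic expressions `a - y` and `(a - y) * a * (1 - a)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(z, y):
    # Summed binary cross-entropy on sigmoid outputs.
    a = sigmoid(z)
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

def half_squared_error(z, y):
    # Half the summed squared error on sigmoid outputs.
    a = sigmoid(z)
    return 0.5 * np.sum((a - y) ** 2)

def numerical_gradient(loss, z, y, eps=1e-6):
    # Central finite differences, one entry of z at a time.
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step.flat[i] = eps
        grad.flat[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)
    return grad

# Hypothetical pre-activations and a one-hot target for a 3-class toy case.
z = np.array([[0.3, -1.2, 2.0]])
y = np.array([[0.0, 1.0, 0.0]])
a = sigmoid(z)

ce_matches = np.allclose(numerical_gradient(cross_entropy, z, y), a - y, atol=1e-6)
se_matches = np.allclose(numerical_gradient(half_squared_error, z, y),
                         (a - y) * a * (1 - a), atol=1e-6)
print(ce_matches, se_matches)
```

Under these definitions, `a - y` matches the cross-entropy gradient, while the squared-error gradient carries the extra sigmoid-derivative factor `a * (1 - a)`.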

In [None]:
class ANN:
    def __init__(self):
        self.inputSize = 4800
        self.outputSize = 62
        self.hiddenSize = 100

        # Adjusted weight initialization to match original backprop dimensions
        self.hidden = np.random.randn(self.inputSize, self.hiddenSize) * np.sqrt(2 / (self.inputSize + self.hiddenSize))
        self.output = np.random.randn(self.hiddenSize, self.outputSize) * np.sqrt(2 / (self.hiddenSize + self.outputSize))

        self.hidden_biases = np.zeros((1, self.hiddenSize))
        self.output_biases = np.zeros((1, self.outputSize))
    
    def sigmoid(self, s):
        return 1 / (1 + np.exp(-s))


    def forward(self, X):
        # Forward pass
        # X shape: (batch_size, inputSize)
        self.z1 = np.dot(X, self.hidden) + self.hidden_biases  
        self.A1 = self.sigmoid(self.z1)  
        self.z2 = np.dot(self.A1, self.output) + self.output_biases 
        self.A2 = self.sigmoid(self.z2) 
        return self.A2  
    
    
    
    def backpropagation(self, X, Y, learning_rate):

        def sigmoid_derivative(s):
            # s is already a sigmoid output, so the derivative is s * (1 - s)
            return s * (1 - s)
         
        output_gradient = self.A2 - Y
        hidden_gradient = np.dot(output_gradient, self.output.T) * sigmoid_derivative(self.A1)
       

        # Update weights and biases
        self.output -= learning_rate * np.dot(self.A1.T, output_gradient)
        self.output_biases -= learning_rate * np.sum(output_gradient, axis=0, keepdims=True)
        self.hidden -= learning_rate * np.dot(X.T, hidden_gradient)
        self.hidden_biases -= learning_rate * np.sum(hidden_gradient, axis=0, keepdims=True)

        # return the MSE loss
        return (1/(2*Y.shape[0])) * np.sum((self.A2 - Y)**2)


    

### Sorting, Testing, and Training

We will now begin the process of sorting our data, training our ANN, and then testing our ANN. We begin by importing defaultdict and sorting our images by label. This gives a dictionary with the format {label: [img list]}. After this we define 4 variables and one function. Our variables train_X, train_y, test_X, and test_y store the images and associated labels for our train and test sets. split_data is the function that we define to split the data and fill these sets. split_data takes in all of the images associated with a single label, along with that label. From there we create four lists to separate the train and test image and label sets for this particular label. We use np.random.permutation() to shuffle the images before adding the first 5 to the test set and the remaining 50 to the train set. We also create the matching label lists. This results in something like: test_X = [imgarr] * 5 and test_y = [label] * 5.

Note: The defaultdict is not necessary, but it is more intuitive.

We then proceed to loop through every label and gather its images, creating the overall test and train sets to be used in our ANN. We then convert them to numpy arrays for easier use and add sanity checks in the form of printing the shapes of the resulting sets to ensure everything has gone according to plan, and it has.
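
The per-label split described above can also be expressed with index arrays. The sketch below is an illustrative alternative on synthetic labels, not the notebook's `split_data`: the function name `stratified_split_indices`, the seed, and the stand-in label array are assumptions for this example.

```python
import numpy as np

def stratified_split_indices(labels, n_test_per_class, rng):
    """Return (train_idx, test_idx), holding out n_test_per_class samples per label."""
    train_idx, test_idx = [], []
    for label in np.unique(labels):
        # Shuffle the positions of this label, then hold out the first few.
        idx = rng.permutation(np.flatnonzero(labels == label))
        test_idx.extend(idx[:n_test_per_class])
        train_idx.extend(idx[n_test_per_class:])
    return np.array(train_idx), np.array(test_idx)

rng = np.random.default_rng(0)  # fixed seed, only for repeatability here

# Synthetic stand-in for y: 62 classes with 55 samples each (3410 total).
labels = np.repeat(np.arange(62), 55)
train_idx, test_idx = stratified_split_indices(labels, 5, rng)

print(train_idx.shape, test_idx.shape)  # (3100,) (310,)
```

Holding out 5 of 55 samples for each of the 62 labels gives 310 test samples and 3100 training samples, which matches the shapes printed by the sanity checks.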

After this we define train_ann, our training function. It takes in our ANN model, a training image set (X_train), a training label set (y_train), a specified number of epochs, a learning rate, and a batch size. Our training function works through mini-batching. We create a loop that iterates as many times as there are epochs (so it iterates 100 times if epochs = 100), and we create mini-batches during each epoch to train the model. At the start of each epoch, we shuffle the training set using np.random.permutation(), which adds randomness to the training. From there we track the number of correct predictions, total predictions, and overall loss.

For every batch that we create in the epoch (so if we have 320 samples and a batch size of 32, then for 10 batches), we separate that set of images from the overall train set, take our predictions using ann.forward(), and then call ann.backpropagation() to update our parameters. This process takes a few steps. The first is to create one-hot backpropagation labels that match the dimensionality of our predictions so that we can propagate the error through the network; this is done with np.zeros() and np.arange(). From there we gather our predictions by calling ann.forward() and passing in our images. Once we have received these predictions, we gather our loss by calling ann.backpropagation(). An important note here is that the intermediate values of the forward pass are stored on the ANN and the weight and bias updates happen inside ann.backpropagation(), so we do not do either manually in our training function. After gathering this information, we take the argmax of our predictions to get the predicted class and check whether it is correct. Once the epoch has gone through all of its batches, we tally the accuracy and average loss and display them. Note: As the epochs number in the hundreds, we only display accuracy and loss every 10 epochs.
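
To make the label encoding and the batch arithmetic concrete, here is a small sketch. The helper name `one_hot` and the toy label values are assumptions for illustration; the indexing pattern itself is the np.zeros() and np.arange() approach described above.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels of shape (batch,) into a (batch, num_classes) 0/1 matrix."""
    encoded = np.zeros((labels.shape[0], num_classes))
    # Row i gets a 1 in column labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1
    return encoded

y_batch = np.array([2, 0, 3])  # toy labels for a 4-class example
print(one_hot(y_batch, 4))
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]

# Batch sizes produced by range(0, n, batch_size) for the training set used here.
n, batch_size = 3100, 32
batch_sizes = [min(batch_size, n - start) for start in range(0, n, batch_size)]
print(len(batch_sizes), batch_sizes[-1])  # 97 28
```

With 3100 training samples and a batch size of 32, each epoch runs 97 batches, and the last one holds only 28 samples, which is why the label matrix is sized from `y_batch.shape[0]` rather than from `batch_size`.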

After this process has repeated for every epoch, we return our trained ANN, which is now ready to test. We then define our testing function, which is much simpler. To test our model, we call ann.forward() on our testing set, use np.argmax() to pick the predicted class for each image, and use np.mean() to compute the fraction of predictions that were correct, printing and returning that accuracy as a percentage.
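
The accuracy computation can be checked by hand on a tiny example. The scores and labels below are made-up values, and `accuracy_percent` is a hypothetical helper that mirrors the argmax-then-mean steps described above.

```python
import numpy as np

def accuracy_percent(scores, labels):
    """Percentage of rows whose highest-scoring column equals the true label."""
    predicted = np.argmax(scores, axis=1)
    return float(np.mean(predicted == labels) * 100)

# Four samples, three classes; each row holds one sample's output scores.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.5, 0.4, 0.1]])
labels = np.array([1, 0, 2, 1])

print(accuracy_percent(scores, labels))  # 75.0
```

The predicted classes are 1, 0, 2, and 0, so three of the four match their labels.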

In [191]:
from collections import defaultdict

# Sorting data by labels
sorted_data = defaultdict(list)
for i in range(len(X)):
    sorted_data[y[i]].append(X[i])

# Split data into training and testing sets
def split_data(data, label):
    test_X = []
    test_y = []
    train_X = []
    train_y = []
    n = np.random.permutation(len(data))
    data = np.array(data)[n]
    for i in range(55):
        if i < 5:
            test_X.append(data[i])
            test_y.append(label)
        else:
            train_X.append(data[i])
            train_y.append(label)
    return train_X, train_y, test_X, test_y   

train_X = []
train_y = []
test_X = []
test_y = []

for label, data in sorted_data.items():
    train, train_y_, test, test_y_ = split_data(data, label)
    train_X.extend(train)
    train_y.extend(train_y_)
    test_X.extend(test)
    test_y.extend(test_y_)

train_X = np.array(train_X)
train_y = np.array(train_y)
test_X = np.array(test_X)
test_y = np.array(test_y)
print('THE SHAPES ARE IMPORTANT TO CHECK')
print('TRAIN SHAPES:')
print(train_X.shape)
print(train_y.shape)
print('TEST SHAPES:')
print(test_X.shape)
print(test_y.shape)

# ANN training function
def train_ann(ann, X_train, y_train, epochs=300, learning_rate=0.02, batch_size=32):
    n = X_train.shape[0]  # Number of samples
    for epoch in range(epochs):
        indices = np.random.permutation(n)
        X_train = X_train[indices]
        y_train = y_train[indices]

        
        correct = 0
        total = 0
        cumulated_loss = 0

        #print(f'SHAPE OF X_TRAIN: {X_train.shape}')
        for i in range(0, n, batch_size):
            X_batch = np.array(X_train[i:i + batch_size])
            y_batch = np.array(y_train[i:i + batch_size])

            #print(f'SHAPE OF X_BATCH: {X_batch.shape}')
            backprop_labels = np.zeros((y_batch.shape[0], ann.outputSize))
            backprop_labels[np.arange(y_batch.shape[0]), y_batch] = 1

            predictions = ann.forward(X_batch)
            loss = ann.backpropagation(X_batch, backprop_labels, learning_rate)
            cumulated_loss += loss

            predicted_classes = np.argmax(predictions, axis=1)
            correct += np.sum(predicted_classes == y_batch)
            total += y_batch.shape[0]
        accuracy = correct / total * 100
        if epoch % 10 == 0:
            print(f'Statistics for Epoch: {epoch}')
            print(f'Current Accuracy: {accuracy:.2f}%, Current Loss: {cumulated_loss/n:.8f}')
        
    return ann

# ANN testing function
def test_ann(ann, X_test, y_test):
    predictions = ann.forward(X_test)
    predicted_classes = np.argmax(predictions, axis=1)
    accuracy = np.mean(predicted_classes == y_test) * 100
    
    print(f"Test Accuracy: {accuracy:.2f}%")
    # print(f"Predicted Classes: {predicted_classes}")
    return accuracy

ann = ANN()
trained_ann = train_ann(ann, train_X, train_y, epochs=200, learning_rate=0.002, batch_size=32)
test_ann(trained_ann, test_X, test_y)


THE SHAPES ARE IMPORTANT TO CHECK
TRAIN SHAPES:
(3100, 4800)
(3100,)
TEST SHAPES:
(310, 4800)
(310,)
Statistics for Epoch: 0
Current Accuracy: 1.61%, Current Loss: 0.01881615
Statistics for Epoch: 10
Current Accuracy: 3.03%, Current Loss: 0.01537276
Statistics for Epoch: 20
Current Accuracy: 10.55%, Current Loss: 0.01516434
Statistics for Epoch: 30
Current Accuracy: 17.68%, Current Loss: 0.01484184
Statistics for Epoch: 40
Current Accuracy: 23.68%, Current Loss: 0.01446535
Statistics for Epoch: 50
Current Accuracy: 28.77%, Current Loss: 0.01409560
Statistics for Epoch: 60
Current Accuracy: 32.29%, Current Loss: 0.01370253
Statistics for Epoch: 70
Current Accuracy: 34.23%, Current Loss: 0.01332133
Statistics for Epoch: 80
Current Accuracy: 37.55%, Current Loss: 0.01293515
Statistics for Epoch: 90
Current Accuracy: 39.10%, Current Loss: 0.01260692
Statistics for Epoch: 100
Current Accuracy: 42.06%, Current Loss: 0.01229882
Statistics for Epoch: 110
Current Accuracy: 44.52%, Current Loss:

38.70967741935484

### Assessment

We have discussed the implementation of our model; let us now discuss the results. With a learning rate of 0.002 and 200 epochs, we have gone from a 1.61% training accuracy in epoch 0 to a 51.52% training accuracy in epoch 190. This is a drastic improvement and can largely be attributed to the many backpropagation steps we have done. After every forward call we saved the intermediate values used to predict our classes, then calculated gradients for every batch and moved slightly against these gradients. Rather than using the full magnitude of the gradients, we used a fraction of them (0.002) to keep each update small and avoid overshooting, and over time the model settled at a training accuracy of 51.52%.

When testing, we had an accuracy of 38.71%. A lower test accuracy is to be expected, as our model has seen the examples in our training set many times and has had time to adapt to them. As such, the discrepancy between our accuracies can be attributed to slight overfitting of the model caused by repeatedly training on the same data. This is natural, as there was not much data and so there were not many new examples that the model could be given over time. However, a 38.71% accuracy is still well above the roughly 1.6% expected from random guessing across 62 classes, which shows our model has learned meaningful structure and is within expectations. While it could be improved through more epochs or additional data to increase the variety of examples, it has performed well overall.