### Dimensionality
We would want to keep the embeddings of both the images and the feature vector to be about the same dimension. 

Current Features: 

Image: $255 \times 255 \times 3$

AncorPoints: $204$


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms

import matplotlib.pyplot as plt
from torchvision import datasets, transforms
from torch.utils.data import random_split, DataLoader

### Variables

In [139]:
### DATA ### 
trainSplit = 0.7
testSplit = 0.15
# (valSplit = 1 - trainSplit - testSplit)
numImageChannels = 3 # 1 if grayscale or anchor points, 3 if colored images


### CNN ARCHITECTURE ###
# Define image input dimensions and kernel size
inputImageOneDim = 128
kernelSize = 3
# Example: If our imput image is 128x128 pixels: inoutImageOneDim = 128

# Dimensionality Reduction: Reduce dimensions drastically with Pooling Layers?
poolingLayers = True
poolingLayerKernelSize = 3

### TRAINING ###
initalLearningRate = 1e-3
batchSize = 64
numTrainEpochs = 30

### LOSS FUNCTION ###
# OBS! Select only one:

lossFunction = nn.NLLLoss()
# lossFunction = nn.CrossEntropyLoss() 
# lossFunction = nn.BCELoss()
# triplet_Loss = True TODO

### Setup

In [140]:
useDevice = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device(useDevice)
print("Using Device: " + useDevice)

transform = transforms.Compose([
    transforms.ToTensor(),      
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])

dataset = datasets.ImageFolder(root="../Pre-processing/dataset/face_dataset/", transform=transform)

train_size = int(trainSplit * len(dataset))
test_size = int(testSplit * len(dataset))
val_size = len(dataset) - train_size - test_size
train_dataset, test_dataset, val_dataset = random_split(dataset, [train_size, test_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

# calculate steps per epoch for training and validation set
trainSteps = len(train_loader.dataset) // batchSize
valSteps = len(val_loader.dataset) // batchSize


num_classes = len(dataset.classes)

Using Device: cuda


### Model Structure

Trying to read up on our current experiment, I find it hard to find many sources or tips on how to construct a common architecture for two different inputs without changing the architecture between them. But let's begin with looking at the simple and most common image recognition architecture:


From what I read, when doing facial recognition with images, we usually want multiple different convolutional layers, such as
Conv Layer 1: To recognize edges
Conv Layer 2: Recognize Corners and parts of the face
Conv Layer 3: Recognize More concrete face features.

And in between these have a pooling lyare to reduce dimensionality
Thus the often proposed image recognition CNN is 3 Conv Layers, 3 pooling layers and then 1 or 2 fully connected layers.

Experiments that are the closest to what we are doing that I found are multimodal models, let's take a look at how we might tackle this if we wanted to do a multi-modal model:

What we could do is perform these steps on the images to reduce their dimensionality to a shared amount of nodes as the feature vectors and then have them share the final layers, such that we end up with something like the following architecture.

$$
\begin{vmatrix}
Image: 3x(conv -> pool)   \\
Anchor: Linear -> Linear 
\end{vmatrix}  -> Linear -> Linear/Output
$$


Thus we would share 2 layers in common with each other and it would be more of a multi modal model. Training two different models with these respective architectures would be to dissimilar however. We would end up with two completely different architectures:

$$3(conv -> pool) -> 2Linear$$ 
and
$$4 Linear$$

So our options are to either "give up" on many of the layers in the classical image architecture, but if we abondon the general architecture which seems to be how image FR works best, are we really making a fair comparason? Or are we just unintentionally reducing the performance of our image recognition just to make a "fair comparasson"? 

My conclusion is thus that we have three reasonable ways to go forward. 
1. The easiest course of action would be to try to keep the architecture we know works for traditional Image Recognition, but try to represent our feature vector as more of an image and thus try to use the same/as similar as possible of an architecure for our Images as our Feature Vectors.
2. Ignore the architecture problem and just try to create two models with as best performance/accuracy as possible but on the two different representations of the data. In this case, using a self-fitting CNN architecture as proposed in "The Loss Surfaces of Multilayer Networks" or similar could be a good idea.
3. Just use different known CNN models and train them both on the data.

Bellow I'll focus on 1.

In [141]:
### Given our number of features, how many dims do we need to represent our features as an image?

numberOfFeatures = 204
counter = 1
while True:
    calculation = counter*counter
    print(str(counter) + ": " + str(calculation))
    if(calculation > numberOfFeatures):
        break
    counter = counter + 1

1: 1
2: 4
3: 9
4: 16
5: 25
6: 36
7: 49
8: 64
9: 81
10: 100
11: 121
12: 144
13: 169
14: 196
15: 225


In [142]:
from torch.nn import Module
from torch.nn import Conv2d
from torch.nn import Linear
from torch.nn import MaxPool2d
from torch.nn import ReLU
from torch.nn import LogSoftmax
from torch import flatten

# Class taken and modified from https://pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/
class CNN(Module):
    def __init__(self, numChannels, classes, inputImageOneDim):
        super(CNN, self).__init__()
        
        # initialize first set of CONV => RELU => POOL layers
        self.conv1 = nn.Conv2d(in_channels=numChannels, out_channels=32, kernel_size=kernelSize)
        inputImageOneDim = inputImageOneDim-kernelSize+1
        self.relu1 = nn.ReLU()

        if poolingLayers:
            self.maxpool1 = nn.MaxPool2d(kernel_size=3)
            inputImageOneDim = int(inputImageOneDim/3)
        
        # initialize second set of CONV => RELU => POOL layers
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=kernelSize)
        inputImageOneDim = inputImageOneDim-kernelSize+1
        self.relu2 = nn.ReLU()

        if poolingLayers:
            self.maxpool2 = nn.MaxPool2d(kernel_size=3)
            inputImageOneDim = int(inputImageOneDim/3)
        
        # initialize third set of CONV => RELU => POOL layers
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=kernelSize)
        inputImageOneDim = inputImageOneDim-kernelSize+1
        self.relu3 = nn.ReLU()

        if poolingLayers:
            self.maxpool3 = nn.MaxPool2d(kernel_size=3)
            inputImageOneDim = int(inputImageOneDim/3)

        finalDim = inputImageOneDim*inputImageOneDim
        self.fc1 = Linear(in_features=123 * finalDim, out_features=finalDim)
        self.relu4 = ReLU()

        #Softmax
        self.fc2 = Linear(in_features=finalDim, out_features=classes)
        self.logSoftmax = LogSoftmax(dim=1)



    def forward(self, x):
        # Apply convolutional layers and pooling
        if poolingLayers:
            x = self.maxpool1(self.relu1(self.conv1(x)))
            x = self.maxpool2(self.relu2(self.conv2(x)))
            x = self.maxpool3(self.relu3(self.conv3(x)))
        else:
            x = self.relu1(self.conv1(x))
            x = self.relu2(self.conv2(x))
            x = self.relu3(self.conv3(x))
            
        # Flatten the output for fully connected layers
        x = x.view(x.size(0), -1)  # Flatten to [batch_size, finalDim]
        
        # Fully connected layers
        x = self.relu4(self.fc1(x))
        x = self.fc2(x)
        
        # Apply LogSoftmax for final output
        output = self.logSoftmax(x)
        return output


    

Initializing the model

In [143]:
from torch.optim import Adam
import time
import argparse

print("[INFO] initializing the CNN model...")
model = CNN(
	numChannels=numImageChannels,
	classes=train_size,
    inputImageOneDim=inputImageOneDim).to(device)
# initialize our optimizer and loss function
opt = Adam(model.parameters(), lr=initalLearningRate)

# initialize a dictionary to store training history
H = {
	"train_loss": [],
	"train_acc": [],
	"val_loss": [],
	"val_acc": []
}



[INFO] initializing the CNN model...


Train Model Over Epochs

In [144]:
# measure how long training is going to take
print("[INFO] training the network...")
startTime = time.time()

for epoch in range(0, numTrainEpochs):
	# set the model in training mode
	model.train()
	# initialize the total training and validation loss
	totalTrainLoss = 0
	totalValLoss = 0
	# initialize the number of correct predictions in the training and validation step
	trainCorrect = 0
	valCorrect = 0
	# loop over the training set
	for (x, y) in train_loader:
		# send the input to the device
		(x, y) = (x.to(device), y.to(device))
		# perform a forward pass and calculate the training loss
		pred = model(x)
		loss = lossFunction(pred, y)
		# zero out the gradients, perform the backpropagation step, and update the weights
		opt.zero_grad()
		loss.backward()
		opt.step()
		# add the loss to the total training loss so far and calculate the number of correct predictions
		totalTrainLoss += loss
		trainCorrect += (pred.argmax(1) == y).type(
			torch.float).sum().item()

[INFO] training the network...


RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x8192 and 1107x9)

Model Evaluation

In [None]:
# switch off autograd for evaluation
with torch.no_grad():
    # set the model in evaluation mode
    model.eval()
    # loop over the validation set
    for (x, y) in val_loader:
        # send the input to the device
        (x, y) = (x.to(device), y.to(device))
        # make the predictions and calculate the validation loss
        pred = model(x)
        totalValLoss += lossFunction(pred, y)
        # calculate the number of correct predictions
        valCorrect += (pred.argmax(1) == y).type(
            torch.float).sum().item()

NotImplementedError: Module [CNN] is missing the required "forward" function

Statistics FROM HERE BELLOW NOT TESTED

(At the moment just copied from the example code and variable changed to relevant for us)

In [None]:
# calculate the average training and validation loss
avgTrainLoss = totalTrainLoss / trainSteps
avgValLoss = totalValLoss / valSteps
# calculate the training and validation accuracy
trainCorrect = trainCorrect / len(train_loader.dataset)
valCorrect = valCorrect / len(val_loader.dataset)
# update our training history
H["train_loss"].append(avgTrainLoss.cpu().detach().numpy())
H["train_acc"].append(trainCorrect)
H["val_loss"].append(avgValLoss.cpu().detach().numpy())
H["val_acc"].append(valCorrect)
# print the model training and validation information
print("[INFO] EPOCH: {}/{}".format(e + 1, numTrainEpochs))
print("Train loss: {:.6f}, Train accuracy: {:.4f}".format(
    avgTrainLoss, trainCorrect))
print("Val loss: {:.6f}, Val accuracy: {:.4f}\n".format(
    avgValLoss, valCorrect))

ZeroDivisionError: division by zero

In [None]:
from sklearn.metrics import classification_report
import numpy as np

# finish measuring how long training took
endTime = time.time()
print("[INFO] total time taken to train the model: {:.2f}s".format(
	endTime - startTime))
# we can now evaluate the network on the test set
print("[INFO] evaluating network...")
# turn off autograd for testing evaluation
with torch.no_grad():
	# set the model in evaluation mode
	model.eval()
	
	# initialize a list to store our predictions
	preds = []
	# loop over the test set
	for (x, y) in test_loader:
		# send the input to the device
		x = x.to(device)
		# make the predictions and add them to the list
		pred = model(x)
		preds.extend(pred.argmax(axis=1).cpu().numpy())
# generate a classification report
print(classification_report(test_loader.targets.cpu().numpy(),
	np.array(preds), target_names=test_loader.classes))

In [None]:
# plot the training loss and accuracy
plt.style.use("ggplot")
plt.figure()
plt.plot(H["train_loss"], label="train_loss")
plt.plot(H["val_loss"], label="val_loss")
plt.plot(H["train_acc"], label="train_acc")
plt.plot(H["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy on Dataset")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
# plt.savefig(args["plot"])
# serialize the model to disk
# torch.save(model, args["model"])