In [2]:
#Imports used throughout the notebook
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

import os
import pandas as pd
import numpy as np

from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_image

import seaborn as sns
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

#Moving computations to the gpu if available
device = (
    torch.device("cuda")
    if torch.cuda.is_available()
    else torch.device("mps")
    if torch.backends.mps.is_available()
    else torch.device("cpu")
)
print(f"Using {device} device")

Using cuda device


# Problem 1:
### a: NLP
Please discuss the recent trend of rapidly increasing sizes of NLP architectures. 

### Answer:
NLP architectures, especially in combination with transformers, have been found to scale exceptionally well with size. To my knowledge, this trend that larger models perform better is basically ubiqutous across all topics in NLP. The reason we see these models explode due to the sheer amount of available AND accesible data, combined with the massive advances in GPUs and TPUs in recent years. Thus, it has been seen that models with more parameters can learn more complex patterns and are better at generalizing from training data.

While these improvements are generally considered to a step towards technological advancements, some people are concerned with the ethical and ecological dilemmas of such models.
Firstly, training these humongous models are very costly with some articles claiming the training of GPT-4 emitted upwards of 15 metric tons of CO2, and others claiming estimates of 43,200kg CO2 emitted daily while running chatGPT - more than the average emissions from 30 people taking a transatlantic flight.

Secondly, some people are scared that such intelligent models might eventually turn sentient if the trends continue. How does a model determine what is right and wrong, could this be abused? What if these models fall into the hands of the wrong people? These are questions shared by many individuals in todays world, and questions even people like Sam Altman, CEO of OpenAI, have talked about in podcasts and even hearings. 

I think there are a lot of opinions on this topic and its a very interesting ongoing debate. For this assignment though, I hope this was "enough" of a discussion, without deep diving into the topic completely.

### b: Transfer Learning
Classify the following example of transfer learning. More exactly, what are the domains and tasks, and which are being changed?

Source: Using a step counter to monitor exercise in healthy people.

Target: Using a step counter to indicate recovery progression in a patient

### Answer:
The domain D is a combination of the feature space and the marginal distribution. <br>
In this case, the feature space 
$$\chi_{s} = \text{steps counted in healthy people}$$ 
$$\chi_{t} = \text{steps counted in patients}$$ 
and the marginal distribution
$$P(\chi_{s}) = \mathbb{N}_0$$
$$P(\chi_{t}) = \mathbb{N}_0$$
While their distributions might be of the same form, its important to realise that recovering patients on avg. probably wont be having as many steps. This does not mean we can't use the information from the source domain to learn something about the target domain.<br>

The task T is then a combination of possible labels (y), and an unknown conditional probability function.<br>
Here, the possible labels arent explicit, but in this context, and with available information, I went with {exercising, not-exercising}. Other possible sets of labels could be:{fat, fit}, {rich, poor}, {old, young}, {male, female} etc. Its just about what information is available on the people providing step information, and I think this part of the exercise could have been worded better.<br>
$$y_{s} = \text{{exercising, not-exercising}}$$
$$y_{t} = \text{{recovered, not-recovered}}$$


Lastly, for the conditional probability function
$$P(y|\chi_{s}) = \text{predicting exercise levels in people}$$
$$P(y|\chi_{t}) = \text{predicting recovery progression in people}$$

Thus, following https://commons.wikimedia.org/w/index.php?curid=58812416 by By Emilie Morvant - Own work, CC BY-SA 4.0, we can see that we have the same source and target marginal distributions on $\chi$, but the tasks are different between the source and target domains.<br>

Meaning, we land in Inductive Transfer Learning.



### c: Attention

Assume sdotproduct attention, and that the hidden states of the encoder layer are [0,1,4],[-1,1,2],[1,1,1],[2,1,1]. If the activation for the previous decoder is [0.1,1,-2], what is the attention-context vector?

### Answer:
sdot prodcut attention is given as<br>
$$ a_{ij} = s^T_{i-1}h_j$$

We have the hidden states (h) of the encoder given<br>
$h_1 = [0, 1, 4]$<br>
$h_2 = [-1, 1, 2]$<br>
$h_3 = [1, 1, 1]$<br>
$h_4 = [2, 1, 1]$<br>
And the previous decoder activation $s_{i-1}$<br>
$s_{i-1} = [0.1, 1, -2]$<br>
Thus, we can calculate the sdot product $a_{ij}<br>
$ a_{i1} = [0.1, 1, -2]^T \sdot [0, 1, 4] = -7$<br>
$ a_{i2} = [0.1, 1, -2]^T \sdot [-1, 1, 2] = -3.1$<br>
$ a_{i3} = [0.1, 1, -2]^T \sdot [1, 1, 1] = -0.9$<br>
$ a_{i4} = [0.1, 1, -2]^T \sdot [2, 1, 1] = -0.8$<br>

Using these, we can calculate teh attention weights $\alpha_{ij}$.<br>
$\alpha_{ij}$ is given as<br>
$$\alpha_{ij} = \frac{e^{a_{ij}}}{\sum_{k}^{}e^{a_{ik}}}$$
since we have calculated our $a_{ij}$'s, we start of with calculating $\sum_{k}^{}e^{a_{ik}}$<br>
$\sum_{k}^{}e^{a_{ik}}=e^{-7}+e^{-3.1}+e^{-0.9}+e^{-0.8}=0.90185970821$<br>
Thus, we get 
$$\alpha_{i1} = \frac{e^{-7}}{e^{0.902}} = 0.0003700547$$
$$\alpha_{i2} = \frac{e^{-3.1}}{e^{0.902}} = 0.01828160879$$
$$\alpha_{i3} = \frac{e^{-0.9}}{e^{0.902}} = 0.16499176618$$
$$\alpha_{i4} = \frac{e^{-0.8}}{e^{0.902}} = 0.18234410171$$

Now we can calculate the context vector $c_i$ given as<br>
$$c_i=\sum_{j}^{} \alpha_{ij}h_j$$
As
$$c_i=0.0003700547 \sdot [0, 1, 4] + 0.01828160879 \sdot [-1, 1, 2] + 0.16499176618 \sdot [1, 1, 1] + 0.18234410171 \sdot [2, 1, 1] = [0.511398, 0.365988, 0.385379]$$
A bit prettier:
$$c_i=[0.511398, 0.365988, 0.385379]$$


### d: Transformers

Explain the 'positional encoding' step for transformers. Why is it done, how is it done?

### Answer:
Unlike RNNs or LSTMs, which inherently process data sequentially, transformers process input data in parallel. Thus, transformers need to incorporate the information given by the sequence order. This is why positional encoding is used. An example of two sentences that would be identical without positional encoding in transformers could be "I did very well in Deep Learning" and "very well in Deep Learning I did".

One way to do it, as in the original paper, a positional encoding tensor were added to the inputs, with the same dimension, such that they can be summed:
$$ \Chi \rightarrow \Chi + PE$$ 
Where PE is defined as<br>
$$PE(pos,2i) = \sin(\frac{pos}{10000 \sdot \frac{2i}{d_{model}}})$$
$$PE(pos,2i+1) = \cos(\frac{pos}{10000 \sdot \frac{2i}{d_{model}}})$$
Where pos is the position and i is the dimension.<br>
This is a "relative encoding" approach, as the values of the encoding does not scale with input length.<br>
The discussion on how to approach encoding is not yet over, so while this implementation showcased additive encoding, it might not be the best approach.


### e: Bounding box detection:
Given a dataset with two classes; cats and dogs, and the following detections:

TP = True positive
FP = False positive

cat_det = [TP, FP, TP, FP, TP]
pred_scores_cat = [0.7, 0.3, 0.5, 0.6, 0.55]

dog_det = [FP, TP, TP, FP, TP, TP]
pred_scores_dog = [0.4, 0.35, 0.95, 0.5, 0.6, 0.7]

There are in total 3 cats and 4 dogs in the images.

Calculate the mean average precision (mAP)

### Answer:
mAP is defined as
$$mAP = \frac{1}{N} \sum_{i=1}^{N}AP_i$$
Where AP is the average precision for each class.




### f: Semantic segmentation - FCN 1:
Given an image sized 1024x768x3 (width x height x channels), with 7 classes. What is the size of the target image if targets are one-hot encoded?



### g: Semantic segmentation - FCN 2: 
What is a fully-convolutional network? When is it useful?



### h:Residual Networks:
Explain residual layers and their advantage. 



### i: Intersection-Over-Union
Calculate the intersection over union in for these four bounding-boxes and target bounding boxes:

![](bounding-boxes.PNG)



### j: Variational autoencoders:
What are the strengths of a variational autoencoder (VAE) compared to an autoencoder (AE)?

# Problem 2:
Below is attached a script for generating a data set to learn translation of dates between multiple human readable and a machine readable format (ISO 8601). 

Task: Using an encoder-decoder setup, perform translation from human readable to machine readable formats. Please express the performance of your trained network in terms of average accuracy of the translated output (so, accuracy on a per-character basis). 

Restriction: we specifically demand that your presented solution does not include an attention layer. 

 Despite this restriction, the task can be solved in numerous different ways. Here are some examples of solutions of similar problems, for inspiration: 

https://jscriptcoder.github.io/date-translator/Machine%20Translation%20with%20Attention%20model.html

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

Script for generating data set below:

In [3]:
#Copied from dateTrans_student1.py file from brightspace
"""
Created on Mon Oct 18 17:47:38 2021

@author: au207178
"""

#https://www.kaggle.com/eswarchandt/neural-machine-translation-with-attention-dates

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date

from faker import Faker
fake = Faker()

Faker.seed(101)
random.seed(101)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")



#%% pytorch dataset

class datesDataset(torch.utils.data.Dataset):
    def __init__(self,locale='da',inputLength=40,outputLength=12, dataSetSize=100):
        
        self.inputLength=inputLength
        self.outputLength=outputLength
        self.length=dataSetSize
        self.lan=locale
        
        self.FORMATS= ['short', # d/M/YY
           'medium', # MMM d, YYY
           'long', # MMMM dd, YYY
           'full', # EEEE, MMM dd, YYY
           'd MMM YYY', 
           'd MMMM YYY',
           'dd/MM/YYY',
           'EE d, MMM YYY',
           'EEEE d, MMMM YYY']
        

        #generate vocabularies:
        alphabet=sorted(tuple('abcdefghijklmnopqrstuvwxyzæøå'))
        numbers=sorted(tuple('0123456789'))
        symbols=['<SOS>','<EOS>',' ',',','.','/','-','<unk>', '<pad>'];
        self.humanVocab=dict(zip(symbols+numbers+alphabet,
                            list(range(len(symbols)+len(numbers)+len(alphabet)))))
        self.machineVocab =dict(zip(symbols+numbers,list(range(len(symbols)+len(numbers)))))
        self.invMachineVocab= {v: k for k, v in self.machineVocab.items()}

    def string_to_int(self,string, length, vocab):
        string = string.lower()
        

        if not len(string)+2<=length: #+2 to make room for SOS and EOS
            print(len(string),string)
            print('Length:',length)
            
            raise AssertionError()

        
        rep = list(map(lambda x: vocab.get(x, '<unk>'),string))
        rep.insert(0,vocab['<SOS>']); rep.append(vocab['<EOS>']) #add start and of sequence
        
        if len(string) < length:
            rep += [vocab['<pad>']] * (length - len(rep))
        
        return rep        
        
    def __len__(self):
        return self.length
        
    def __getitem__(self, idx):
        dt = fake.date_object()

        date = format_date(dt, format=random.choice(self.FORMATS), locale=self.lan)
        human_readable = date.lower().replace(',', '')
        machine_readable = dt.isoformat()
        
        humanEncoded=torch.LongTensor(self.string_to_int(human_readable,self.inputLength,self.humanVocab))
        machineEncoded=torch.LongTensor(self.string_to_int(machine_readable,self.outputLength,self.machineVocab))
        

        
        return human_readable, machine_readable, humanEncoded,machineEncoded

e=datesDataset()
human_readable, machine_readable, humanEncoded,machineEncoded=e[0]

In [161]:
#Doing some basic testing to get a feel for the dataset

#e[x][0] are the human readable dates, e[x][1] are the machine readable dates. [x][2] encodes [x][0] and [x][3] encodes [x][1]
e[0]
#e.machineVocab
#e.humanVocab

('20 nov. 1989',
 '1989-11-20',
 tensor([ 0, 11,  9,  2, 32, 33, 40,  4,  2, 10, 18, 17, 18,  1,  8,  8,  8,  8,
          8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
          8,  8,  8,  8]),
 tensor([ 0, 10, 18, 17, 18,  6, 10, 10,  6, 11,  9,  1]))

In [345]:
#define network
#I spent a couple of hours trying with an LSTM implementation, but in the end I was having too many issues with the architecture, so, i ended up with a GRU implementation instead.
#The good thing about using GRU as the architecture, is its faster convergence compared to LSTM, and it's also less prone to overfitting.
#I used the tutorial from https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html as a guideline, both when trying with the LSTM and when implementing the GRU.

#https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden
        
#https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html        
class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(0) #SOS token is 0 in the vocabs
        decoder_hidden = encoder_hidden
        decoder_outputs = []

        for i in range(40): #Max length of input sequence is 40
            decoder_output, decoder_hidden  = self.forward_step(decoder_input, decoder_hidden)
            decoder_outputs.append(decoder_output)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return decoder_outputs, decoder_hidden, None # We return `None` for consistency in the training loop

    def forward_step(self, input, hidden):
        output = self.embedding(input)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output)
        return output, hidden
            
        

class Net(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Net, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.4):        
        # src: source sequence, trg: target sequence
         #Teacher forcing ratio is the probability of using teacher forcing, I tried different values, eg. 0.05, 0.1, 0.2, 0.5, 0.9, 1.0
         #but 0.4 seemed fine, while also not making the model rely on it too much.
        batch_size = src.size(0)
        max_len = trg.size(1)  #12 (len of longest output sequence)
        trg_vocab_size = self.decoder.embedding.num_embeddings #19 (len of machine vocab)
        
        
        # tensor to store decoder outputs
        outputs = torch.zeros(batch_size, max_len, trg_vocab_size).to(self.device)
        
        # encode the source sentence
        encoder_output, hidden = self.encoder(src) #As we're not doing attention, the encoder output is actually not used
        
        # first input to the decoder is the <sos> tokens
        input = torch.tensor([0]*batch_size).to(self.device)  # assuming <sos> token is 0
        input = input.unsqueeze(1)  # add batch dimension
        

        for t in range(1, max_len):
            output, hidden = self.decoder.forward_step(input, hidden) #output is the logits
            outputs[:, t] = output.squeeze(1) #Save the logits in the outputs tensor
            teacher_force = random.random() < teacher_forcing_ratio
            
            top1 = output.squeeze(1).max(1)[1] #Get the index of the highest logit
            input = (trg[:, t] if teacher_force else top1).unsqueeze(1) #If teacher forcing, use the actual target, else use the highest logit in the next iteration
        
        return outputs
        



In [346]:
#data processing
X, y, X_enc, y_enc = zip(*[e[i] for i in range(e.length)])

#Making the encoded data into tensors
X_enc = torch.stack(X_enc)
y_enc = torch.stack(y_enc)

In [347]:
#Splitting the data into train and test sets
#Defining the inputs and targets, creating the dataset
inputs = torch.tensor(X_enc, dtype=torch.long).to(device)
targets = torch.tensor(y_enc, dtype=torch.long).to(device)
dataset = torch.utils.data.TensorDataset(inputs, targets)
indices = list(range(len(dataset)))

#Batch size
batch_size = 16 #Tested with bs 4, 16, 24, 32, 64. 16 seemed to be the best, but 24 and 32 were also fine.

#Splitting the dataset into training, validation and test sets
train_indices, temp_indices = train_test_split(indices, test_size=0.2, random_state=42) #80% training, 20% to be split below
valid_indices, test_inddices = train_test_split(temp_indices, test_size=0.5, random_state=42) #10% validation, 10% test


#Create subsets of the data using the split indices
train_data = torch.utils.data.Subset(dataset, train_indices)
valid_data = torch.utils.data.Subset(dataset, valid_indices)
test_data = torch.utils.data.Subset(dataset, test_inddices)

# Create the dataloaders
train_dl = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=0)
valid_dl = DataLoader(valid_data, batch_size=batch_size, shuffle=True, num_workers=0)
test_dl = DataLoader(test_data, batch_size=batch_size, shuffle=True, num_workers=0)

  inputs = torch.tensor(X_enc, dtype=torch.long).to(device)
  targets = torch.tensor(y_enc, dtype=torch.long).to(device)


In [348]:
encoder = Encoder(input_size=48, hidden_size=512).to(device) #tested hidden size with 64, 128, 256. 128 performed best in test: 75 vs 80% accuracy
decoder = Decoder(hidden_size=512, output_size=19).to(device)
net = Net(encoder, decoder, device).to(device)
#Optimizer and loss function
optimizer = optim.Adam(net.parameters(), lr=0.001) #tested with lr 0.005, 0.002, 0.001, 0.0005, 0.0001. - 0.001 performed best
loss = nn.CrossEntropyLoss()
#Training
nEpochs = 2001
best_loss = 1000
best_acc = 0
best_valid_loss = 1000
best_valid_acc = 0

mod = 50 #Modulus for printing the best results

for iEpoch in range(nEpochs):
    net.train()
    totLoss=0
    cor_pred = 0
    tot_pred = 0
    cor_pred_val = 0
    tot_pred_val = 0
    for xbatch,ybatch in train_dl:
        xbatch=xbatch.to(device=device)
        ybatch=ybatch.to(device=device)

        y_pred = net(src=xbatch, trg=ybatch)  
        y_pred = y_pred.permute(0, 2, 1) ## permute to match CrossEntropyLoss input of (Batchsize, num_classes, seq_len)
        
        #Next section is to calculate the performance of average accuracy of the translated output
        #Convert the logits to predictions
        softmax = nn.Softmax(dim=1)
        predictions = torch.argmax(softmax(y_pred), dim=1)
        
        # Mask to filter out padding tokens in the targets
        mask = ybatch != 8
        # Use the mask to filter out padding in both predictions and targets
        predictions_masked = torch.masked_select(predictions, mask)
        targets_masked = torch.masked_select(ybatch, mask)

        # Correct predictions
        cor_pred += (predictions_masked == targets_masked).sum().item()
        
        # Total predictions in batch
        tot_pred += targets_masked.size(0)

        #Back to Calculating loss like normal
        loss_val = loss(y_pred, ybatch.long())
        totLoss+=loss_val
        
        # Zero the gradients before running the backward pass.
        net.zero_grad()
    
        loss_val.backward()

        #gradient clipping to deal with exploding gradients.
        torch.nn.utils.clip_grad_norm_(net.parameters(),10, norm_type=2.0)
        optimizer.step()
    accuracy_train = cor_pred / tot_pred
    if (accuracy_train > best_acc) & (iEpoch % mod == 0):
        best_acc = accuracy_train
        print('Train acc', iEpoch,":", best_acc)
        
    if (totLoss < best_loss) & (iEpoch % mod == 0):
        best_loss = totLoss
        print('Train loss',iEpoch,":", totLoss)
    
    net.eval()  # Set the model to evaluation mode
    valid_loss_sum = 0.0
    correct_predictions = 0
    #For this i just reused the code from training, some naming might be weird
    for inputs, labels in valid_dl:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = net(inputs, labels)
        outputs = outputs.permute(0, 2, 1)
        batch_loss = loss(outputs, labels.long())
        valid_loss_sum += batch_loss.item()
        
        #Convert the logits to predictions
        softmax = nn.Softmax(dim=1)
        predictions = torch.argmax(softmax(y_pred), dim=1)
        
        # Mask for non-padding tokens in the targets
        mask = ybatch != 8  # Create a mask for non-padding tokens
        # Use the mask to filter out padding in both predictions and targets
        predictions_masked = torch.masked_select(predictions, mask)
        targets_masked = torch.masked_select(ybatch, mask)

        # Correct predictions
        cor_pred_val += (predictions_masked == targets_masked).sum().item()
        
        # Total predictions in batch
        tot_pred_val += targets_masked.size(0)
        
    # Calculate accuracy
    valid_acc = cor_pred_val / tot_pred_val

    valid_loss = valid_loss_sum / len(valid_dl)

    if (valid_loss < best_valid_loss) & (iEpoch % mod == 0):
        best_valid_loss = valid_loss
        netImage = net.state_dict()
        print('Val loss', iEpoch,":", best_valid_loss)
        bestPred = outputs  # Be cautious about overwriting bestPred every epoch
    if (valid_acc > best_valid_acc) & (iEpoch % mod == 0):
        best_valid_acc = valid_acc
        netImage = net.state_dict()
        print('Val acc', iEpoch,":", best_valid_acc)
        bestPred = outputs  # Be cautious about overwriting bestPred every epoch

        

Train acc 0 : 0.11875
Train loss 0 : tensor(14.7818, device='cuda:0', grad_fn=<AddBackward0>)
Val loss 0 : 2.901568651199341
Val acc 0 : 0.14583333333333334
Train acc 50 : 0.43645833333333334
Train loss 50 : tensor(10.1085, device='cuda:0', grad_fn=<AddBackward0>)
Val loss 50 : 1.9745641946792603
Val acc 50 : 0.4479166666666667
Train acc 100 : 0.4479166666666667
Train loss 100 : tensor(9.0248, device='cuda:0', grad_fn=<AddBackward0>)
Val loss 100 : 1.756047010421753
Train acc 150 : 0.4875
Train loss 150 : tensor(8.5627, device='cuda:0', grad_fn=<AddBackward0>)
Val loss 150 : 1.7199493646621704
Val acc 150 : 0.53125
Train acc 200 : 0.559375
Train loss 200 : tensor(7.9359, device='cuda:0', grad_fn=<AddBackward0>)
Val loss 200 : 1.4749699831008911
Train acc 250 : 0.6114583333333333
Train loss 250 : tensor(7.1089, device='cuda:0', grad_fn=<AddBackward0>)
Val acc 250 : 0.625
Train acc 350 : 0.6864583333333333
Train loss 350 : tensor(5.9997, device='cuda:0', grad_fn=<AddBackward0>)
Val loss 

In [349]:
#Some testing
best_test_loss = 1000    
net.eval()  # Set the model to evaluation mode
test_loss_sum = 0.0
best_test_acc = 0
cor_pred_test = 0
tot_pred_test = 0
for inputs, labels in test_dl:
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = net(inputs, labels)
    outputs = outputs.permute(0, 2, 1)
    batch_loss = loss(outputs, labels.long())
    test_loss_sum += batch_loss.item()
    
    #Convert the logits to predictions
    softmax = nn.Softmax(dim=1)
    predictions = torch.argmax(softmax(outputs), dim=1)
    
    # Mask for non-padding tokens in the targets
    mask = labels != 8  # Create a mask for non-padding tokens
    # Use the mask to filter out padding in both predictions and targets
    predictions_masked = torch.masked_select(predictions, mask)
    targets_masked = torch.masked_select(labels, mask)

    # Correct predictions
    cor_pred_test += (predictions_masked == targets_masked).sum().item()
    
    # Total predictions in batch
    tot_pred_test += targets_masked.size(0)
        
    # Calculate accuracy
test_acc = cor_pred_test / tot_pred_test

test_loss = test_loss_sum / len(test_dl)

if test_loss < best_test_loss:
    best_loss = test_loss
    netImage = net.state_dict()
    print('updated best loss')
    bestPred = outputs  # Be cautious about overwriting bestPred every epoch

print('Test loss:', test_loss)
print('Test acc:', test_acc)

updated best loss
Test loss: 1.410487413406372
Test acc: 0.775


## Training and validation
So, we converge towards 1 in validation accuracy pretty fast, especially if we tune the hyperparameters a bit towards more hidden layers. The validation loss seems to converge around 1.1 and the training loss around 1.2x. With validation and training accuracy reaching 1.0 before 150 epochs.

## Test results
So a test accuracy of 80% was around the best i achieved (the highest i saw was 80,3). This was tested with different hyperparameters. It can be seen in the comments which values i tested with, but i tested on hidden_size, lr in the optimizer, and different batch sizes.
If I'd been able to implement the logic using LSTM, I suspect i would've gotten better accuracy, however, time was running from me. Furthermore, with the use of attention we also would expect significant improvement to the accuracy.
Maybe it could also have been interesting to look at the Date-level Accuracy, instead of only the Character accuracy.
Overall, I think we would like higher accuracy, as the task, on an intuition level, does not seem that hard. I am not entirely sure why I was only able to achieve 80% accuracy. Maybe I'm overfitting, however, when I was running fewer epochs, the results were worse overall.
Finally, I tried running 2000 epochs a few times, without any improvements. Also running with learning rates of down to 0.000001 without improvements.