## Qing Xiong -- MLE Intern Task

### Question 1: Suppose that we design a deep architecture to represent a sequence by stacking self-attention layers with positional encoding. What could be issues? (paragraph format)

#### Answer:

A deep network that represents a sequence by stacking self-attention layers with positional encoding can run into several issues. First, self-attention is computationally expensive: each layer compares every position with every other position, so compute and memory for the attention scores grow quadratically with sequence length and linearly with the number of layers. Second, such a deep architecture can overfit when the training dataset is small. Third, as the model gets deeper, a fixed positional encoding might be insufficient for certain tasks, and the model itself also becomes harder to interpret.
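To make the cost argument concrete, here is a rough sketch that counts how many attention-score entries a stack of layers builds. It assumes plain dense attention that materializes the full score matrix; the function name and the layer and head counts are illustrative choices, not part of the task.

```python
def attention_score_entries(seq_len: int, num_layers: int,
                            num_heads: int = 1, batch_size: int = 1) -> int:
    """Count the attention-score entries a stack of dense self-attention layers builds.

    Each layer forms one (seq_len x seq_len) score matrix per head and per
    batch element, so the count is quadratic in seq_len and linear in depth.
    """
    return batch_size * num_layers * num_heads * seq_len * seq_len


# Illustrative sizes only: doubling the sequence length quadruples the count.
for n in (128, 256, 512):
    print(n, attention_score_entries(seq_len=n, num_layers=12, num_heads=8))
```

Doubling the number of layers only doubles the count, which is why sequence length tends to be the first bottleneck.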

### Question 2: Can you design a learnable positional encoding method? (Create dummy dataset)

In [1]:
import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import random
from torch.nn.utils.rnn import pad_sequence


In [2]:
## dataset class
class DummyDataset(Dataset):
    def __init__(self, max_seq_len, num_samples):
        self.max_seq_len = max_seq_len
        self.num_samples = num_samples
        # create dummy data with randomized vector & length
        self.data = [torch.randint(0, 100, (random.randint(1, max_seq_len),)) for _ in range(num_samples)]

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        sequence = self.data[idx]
        positions = torch.arange(len(sequence))
        return sequence, positions

In [8]:
# Hyperparameters for the dataset
batch_size = 32
num_samples = 1000
max_seq_len = 20

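def collate_fn(batch):
    # Pad variable-length samples to the longest sequence in the batch
    sequences, positions = zip(*batch)
    # Note: 0 is also a valid token ID here, so padded slots are identified
    # through the position tensor (padded with -1), not through the tokens
    sequences_pad = pad_sequence(sequences, batch_first=True, padding_value=0)
    positions_pad = pad_sequence(positions, batch_first=True, padding_value=-1)
    return sequences_pad, positions_pad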

# Create the dataset and dataloader
dataset = DummyDataset(max_seq_len, num_samples)
dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_fn)

In [4]:
# Reference: https://arxiv.org/pdf/1705.03122.pdf, Section 3.1 (position embeddings)
# e = (w1 + p1, ..., wm + pm)
class LearnablePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        # Embedding layers for token and positional information
        self.token_embedding = nn.Embedding(100, d_model)  # Assuming tokens are integers from 0 to 99
        self.pos_embedding = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: [batch_size, seq_len]

        # Get token embeddings
        x = self.token_embedding(x)

        # Generate position indices
        pos_indices = torch.arange(x.size(1), device=x.device).unsqueeze(0).expand(x.size(0), -1)

        # Return the sum of token and positional embeddings
        return x + self.pos_embedding(pos_indices)

class PositionPredictor(nn.Module):
    def __init__(self, d_model, max_len):
        super().__init__()
        self.pos_enc = LearnablePositionalEncoding(d_model, max_len)
        self.linear = nn.Linear(d_model, 1)

    def forward(self, x):
        x = self.pos_enc(x)

        return self.linear(x).squeeze(-1)
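
A quick sanity check of the idea above, as a stand-alone sketch. `TinyLearnablePositions` is a hypothetical name used only for this illustration; it assumes the same vocabulary size (100) and `max_len` (20) as the notebook. It shows that the position table is an ordinary trainable weight and that a learned table cannot index positions beyond `max_len`.

```python
import torch
from torch import nn


class TinyLearnablePositions(nn.Module):
    """Hypothetical stand-alone version of e_i = w_i + p_i."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # One learned vector per position, trained like any other weight
        self.pos_embedding = nn.Embedding(max_len, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch_size, seq_len]
        positions = torch.arange(tokens.size(1), device=tokens.device)
        # [seq_len, d_model] broadcasts over the batch dimension
        return self.token_embedding(tokens) + self.pos_embedding(positions)


torch.manual_seed(0)
enc = TinyLearnablePositions(vocab_size=100, d_model=16, max_len=20)
out = enc(torch.randint(0, 100, (4, 7)))
print(out.shape)                               # torch.Size([4, 7, 16])
print(enc.pos_embedding.weight.requires_grad)  # True

# A learned table has no entry for positions >= max_len
try:
    enc(torch.randint(0, 100, (1, 21)))
except IndexError as err:
    print("sequence longer than max_len:", err)
```

The length limit is the main design trade-off here: the table is flexible within `max_len`, but the model has nothing to look up for longer sequences.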


In [9]:
# Instantiate model and optimizer
model = PositionPredictor(d_model=16, max_len=20)
optimizer = optim.Adam(model.parameters())

# Define loss function
criterion = nn.MSELoss()

# Start training (for 25 epochs)
for epoch in range(25):
    for sequences, positions in dataloader:
        # Forward pass: one predicted position per token, shape [batch_size, seq_len]
        outputs = model(sequences)

        # Mask to ignore padded positions
        mask = positions != -1
        outputs = outputs[mask]
        positions = positions[mask]

        loss = criterion(outputs, positions.float())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Reports the loss of the last batch in this epoch, not an epoch average
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Epoch 1, Loss: 58.27162551879883
Epoch 2, Loss: 55.42763137817383
Epoch 3, Loss: 52.194889068603516
Epoch 4, Loss: 48.48335647583008
Epoch 5, Loss: 44.3350944519043
Epoch 6, Loss: 39.89034652709961
Epoch 7, Loss: 35.34430694580078
Epoch 8, Loss: 30.909711837768555
Epoch 9, Loss: 26.77927589416504
Epoch 10, Loss: 23.08452606201172
Epoch 11, Loss: 19.875179290771484
Epoch 12, Loss: 17.128015518188477
Epoch 13, Loss: 14.781341552734375
Epoch 14, Loss: 12.767078399658203
Epoch 15, Loss: 11.02596378326416
Epoch 16, Loss: 9.511139869689941
Epoch 17, Loss: 8.186941146850586
Epoch 18, Loss: 7.026272296905518
Epoch 19, Loss: 6.008105754852295
Epoch 20, Loss: 5.115661144256592
Epoch 21, Loss: 4.335185527801514
Epoch 22, Loss: 3.6551496982574463
Epoch 23, Loss: 3.065659523010254
Epoch 24, Loss: 2.557969093322754
Epoch 25, Loss: 2.1240713596343994
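
The masking step in the training loop can be checked in isolation on a tiny hand-made batch. This is a minimal sketch; `masked_mse` is a hypothetical helper and the tensors are made-up values, not results from the run above.

```python
import torch


def masked_mse(outputs: torch.Tensor, positions: torch.Tensor,
               pad_value: int = -1) -> torch.Tensor:
    """Mean squared error over non-padded entries only."""
    mask = positions != pad_value
    return ((outputs[mask] - positions[mask].float()) ** 2).mean()


# Two sequences of lengths 3 and 1, padded to length 3 with -1
outputs = torch.tensor([[0.0, 1.0, 5.0], [0.5, 9.0, 9.0]])
positions = torch.tensor([[0, 1, 2], [0, -1, -1]])

# Valid squared errors: 0, 0, 9, 0.25 -> mean 2.3125; the two 9.0 outputs
# at padded slots do not contribute
print(masked_mse(outputs, positions).item())  # 2.3125
```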
