# 3. Hybrid Transformers (maybe) on MNIST

### About this notebook

This is a bonus notebook that was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.1 (29/08/2023)

**Requirements:**
- Python 3 (tested on v3.11.4)
- Matplotlib (tested on v3.7.2)
- Numpy (tested on v1.25.2)
- Torch (tested on v2.0.1+cu118)
- Torchvision (tested on v0.15.2+cu118)
- We also strongly recommend setting up CUDA on your machine! (At this point, honestly, it is almost mandatory).

### Imports and CUDA

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.datasets as datasets
CUDA = torch.cuda.is_available()
device = torch.device("cuda" if CUDA else "cpu")

### Load MNIST

At this point, do I really need to explain what this does?

In [2]:
# Load the MNIST dataset and prepare dataloaders as usual
mnist_train = datasets.MNIST(root='.', train = True, download = True,
                             transform = transforms.ToTensor())
mnist_test = datasets.MNIST(root='.', train = False, download = True,
                            transform = transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size = 32, shuffle = True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size = 32, shuffle = False)
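As a quick sanity check on the loaders, the number of batches per epoch follows from the dataset size and the batch size. The sketch below assumes the standard MNIST split (60,000 training and 10,000 test images). It uses dummy `TensorDataset` placeholders (hypothetical stand-ins named `dummy_train` and `dummy_test`, not the real data), so it runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins with the same number of samples as the MNIST splits
# (assumption: 60,000 training images and 10,000 test images).
dummy_train = TensorDataset(torch.zeros(60000, dtype=torch.long))
dummy_test = TensorDataset(torch.zeros(10000, dtype=torch.long))

dummy_train_loader = DataLoader(dummy_train, batch_size=32, shuffle=True)
dummy_test_loader = DataLoader(dummy_test, batch_size=32, shuffle=False)

# Number of batches per epoch is ceil(n_samples / batch_size)
print(len(dummy_train_loader))  # 1875
print(len(dummy_test_loader))   # 313
```

The value 1875 is the step count that shows up in the training log further down.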

### Define self-attention layer, and Transformer model

We will have to flatten the images to process them with Linear operations and attention operations.

In [3]:
# Define a self-attention layer implementation
class SelfAttentionLayer(nn.Module):
    def __init__(self, in_features):
        super(SelfAttentionLayer, self).__init__()
        self.in_features = in_features
        self.query = nn.Linear(in_features, in_features)
        self.key = nn.Linear(in_features, in_features)
        self.value = nn.Linear(in_features, in_features)

    def forward(self, x):
        batch_size = x.size(0)
        query = self.query(x).view(batch_size, -1, self.in_features)
        key = self.key(x).view(batch_size, -1, self.in_features)
        value = self.value(x).view(batch_size, -1, self.in_features)
        attention_weights = F.softmax(torch.bmm(query, key.transpose(1, 2))/(self.in_features**0.5), dim = 2)
        out = torch.bmm(attention_weights, value).view(batch_size, -1)
        return out
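One thing worth checking about this layer: it receives a flattened `(batch_size, in_features)` tensor, so the `.view(batch_size, -1, self.in_features)` calls produce sequences of length 1. The sketch below re-traces the same forward pass with standalone `nn.Linear` projections (the names `query_proj`, `key_proj` and `value_proj` are hypothetical, chosen for this illustration) and inspects the attention weights on a random batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, in_features = 4, 128
x = torch.randn(batch_size, in_features)  # random stand-in for flattened features

# Standalone projections mirroring the query/key/value layers of the class above
query_proj = nn.Linear(in_features, in_features)
key_proj = nn.Linear(in_features, in_features)
value_proj = nn.Linear(in_features, in_features)

# Same reshaping as in forward(): each sample becomes a sequence of length 1
query = query_proj(x).view(batch_size, -1, in_features)
key = key_proj(x).view(batch_size, -1, in_features)
value = value_proj(x).view(batch_size, -1, in_features)

scores = torch.bmm(query, key.transpose(1, 2)) / (in_features ** 0.5)
attention_weights = F.softmax(scores, dim=2)
out = torch.bmm(attention_weights, value).view(batch_size, -1)

print(attention_weights.shape)                     # torch.Size([4, 1, 1])
print(torch.all(attention_weights == 1.0).item())  # True
print(torch.allclose(out, value_proj(x)))          # True
```

With a single token per sample, the softmax is taken over one score and always returns 1, so in this setting the layer output equals the value projection of the input. Attention across several positions would require keeping a sequence dimension longer than 1, for instance by splitting the image into patches.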

In [4]:
# Neural network definition using self-attention
class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.attention1 = SelfAttentionLayer(128)
        self.fc2 = nn.Linear(128, 64)
        self.attention2 = SelfAttentionLayer(64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = F.relu(self.fc1(x))
        x = self.attention1(x)
        x = F.relu(self.fc2(x))
        x = self.attention2(x)
        x = self.fc3(x)
        return x

### Try out our model

Create the model and inspect its structure.

In [5]:
# Create model
model = Transformer()
print(model)

Transformer(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (attention1): SelfAttentionLayer(
    (query): Linear(in_features=128, out_features=128, bias=True)
    (key): Linear(in_features=128, out_features=128, bias=True)
    (value): Linear(in_features=128, out_features=128, bias=True)
  )
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (attention2): SelfAttentionLayer(
    (query): Linear(in_features=64, out_features=64, bias=True)
    (key): Linear(in_features=64, out_features=64, bias=True)
    (value): Linear(in_features=64, out_features=64, bias=True)
  )
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)


### Simple trainer like before

Again, very similar to what we have done in Week 4...

In [6]:
# Create model
model = Transformer()
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)

# Train the model
n_epochs = 5
for epoch in range(n_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Flatten image
        images = images.reshape(-1, 28*28)
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Display
        if (i + 1) % 100 == 0:
            print(f"Epoch [{epoch + 1}/{n_epochs}], "
                  f"Step [{i + 1}/{len(train_loader)}], "
                  f"Loss: {loss.item():.4f}")

Epoch [1/5], Step [100/1875], Loss: 0.5645
Epoch [1/5], Step [200/1875], Loss: 0.6769
Epoch [1/5], Step [300/1875], Loss: 0.1388
Epoch [1/5], Step [400/1875], Loss: 0.5357
Epoch [1/5], Step [500/1875], Loss: 0.2873
Epoch [1/5], Step [600/1875], Loss: 0.2454
Epoch [1/5], Step [700/1875], Loss: 0.0574
Epoch [1/5], Step [800/1875], Loss: 0.0829
Epoch [1/5], Step [900/1875], Loss: 0.2409
Epoch [1/5], Step [1000/1875], Loss: 0.1930
Epoch [1/5], Step [1100/1875], Loss: 0.1293
Epoch [1/5], Step [1200/1875], Loss: 0.3335
Epoch [1/5], Step [1300/1875], Loss: 0.1581
Epoch [1/5], Step [1400/1875], Loss: 0.1583
Epoch [1/5], Step [1500/1875], Loss: 0.2497
Epoch [1/5], Step [1600/1875], Loss: 0.1110
Epoch [1/5], Step [1700/1875], Loss: 0.3590
Epoch [1/5], Step [1800/1875], Loss: 0.1453
Epoch [2/5], Step [100/1875], Loss: 0.0219
Epoch [2/5], Step [200/1875], Loss: 0.1266
Epoch [2/5], Step [300/1875], Loss: 0.1361
Epoch [2/5], Step [400/1875], Loss: 0.0613
Epoch [2/5], Step [500/1875], Loss: 0.1208
...

### Evaluate model

We get a test accuracy of about 97.7% after only 5 epochs of training!

In [7]:
# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        # Flatten images
        images = images.reshape(-1, 28 * 28)
        # Forward pass and accuracy calculation
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    # Final display
    print("Test Accuracy: {} %".format(100*correct/total))

Test Accuracy: 97.73 %
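To make the accuracy computation concrete, here is a minimal worked example on made-up logits (4 samples, 3 classes). The values are invented for illustration and have nothing to do with the trained model; only the `torch.max` and counting logic mirrors the evaluation loop above.

```python
import torch

# Toy logits for 4 samples and 3 classes (made-up values for illustration)
outputs = torch.tensor([[2.0, 0.1, -1.0],
                        [0.3, 0.2, 1.5],
                        [0.0, 3.0, 0.5],
                        [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 2, 1, 2])

# torch.max along dim 1 returns (max values, indices);
# the indices are the predicted classes
_, predicted = torch.max(outputs, 1)
correct = (predicted == labels).sum().item()
total = labels.size(0)
accuracy = 100 * correct / total

print(predicted)  # tensor([0, 2, 1, 0])
print(accuracy)   # 75.0
```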


### Quick question

Could we obtain better performance by combining convolutional operations and attention ones?

Would the layer below do the trick?

In [8]:
# Define a convolutional attention layer implementation
class ConvAttentionLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super(ConvAttentionLayer, self).__init__()
        self.query_conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)
        self.key_conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)
        self.value_conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        query = self.query_conv(x)
        key = self.key_conv(x)
        value = self.value_conv(x)
        batch_size, channels, height, width = query.size()
        query = query.view(batch_size, channels, -1)
        key = key.view(batch_size, channels, -1)
        value = value.view(batch_size, channels, -1)
        attention_weights = F.softmax(torch.bmm(query.transpose(1, 2), key), dim=2)
        out = torch.bmm(value, attention_weights).view(batch_size, channels, height, width)
        return out
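Before using this layer, it may help to trace the tensor shapes through its forward pass. The sketch below replaces the three convolution outputs with random tensors of the same shape (an assumption made to keep the example short) and repeats the two `torch.bmm` calls from the layer.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, channels, height, width = 2, 16, 28, 28

# Random stand-ins for the outputs of query_conv, key_conv and value_conv,
# flattened over the spatial dimensions: (batch, channels, height * width)
query = torch.randn(batch_size, channels, height, width).view(batch_size, channels, -1)
key = torch.randn(batch_size, channels, height, width).view(batch_size, channels, -1)
value = torch.randn(batch_size, channels, height, width).view(batch_size, channels, -1)

# Same operations as in ConvAttentionLayer.forward()
attention_weights = F.softmax(torch.bmm(query.transpose(1, 2), key), dim=2)
out = torch.bmm(value, attention_weights).view(batch_size, channels, height, width)

print(attention_weights.shape)  # torch.Size([2, 784, 784])
print(out.shape)                # torch.Size([2, 16, 28, 28])

# The softmax normalizes along dim 2, but bmm(value, attention_weights)
# sums over dim 1 of the attention matrix
row_sums = attention_weights.sum(dim=2)
col_sums = attention_weights.sum(dim=1)
print(torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-5))  # True
print(torch.allclose(col_sums, torch.ones_like(col_sums), atol=1e-5))
```

Two observations follow from this trace. Each sample needs a 784 x 784 attention matrix, so memory grows quadratically with the number of pixels. Also, the weights are normalized over one axis and summed over the other, and the scores are not scaled as they are in `SelfAttentionLayer`. Both points are worth keeping in mind when answering the question above.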

Could we then use it to assemble a Convolutional Transformer?

In [9]:
# Neural network definition using convolutional attention
class ConvTransformer(nn.Module):
    def __init__(self):
        super(ConvTransformer, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size = 3, padding = 1)
        self.attention1 = ConvAttentionLayer(16, 16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size = 3, padding = 1)
        self.attention2 = ConvAttentionLayer(32, 32)
        self.fc = nn.Linear(32*28*28, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.attention1(x)
        x = F.relu(self.conv2(x))
        x = self.attention2(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Create model
conv_model = ConvTransformer()
print(conv_model)

ConvTransformer(
  (conv1): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (attention1): ConvAttentionLayer(
    (query_conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (key_conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (value_conv): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (attention2): ConvAttentionLayer(
    (query_conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (key_conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (value_conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (fc): Linear(in_features=25088, out_features=10, bias=True)
)


**Open question:** Would that train and obtain better performance than the "Linear" transformer we trained earlier?
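A first step towards answering this is to check that the model trains at all, in the mechanical sense. The sketch below copies the two class definitions from above and runs a single optimizer step on a random batch. The random images and labels are placeholders (an assumption, not MNIST data), so this says nothing about the accuracy the model would reach; it only checks shapes, gradients and model size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Class definitions copied from the notebook cells above
class ConvAttentionLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super(ConvAttentionLayer, self).__init__()
        self.query_conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)
        self.key_conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)
        self.value_conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        query = self.query_conv(x)
        key = self.key_conv(x)
        value = self.value_conv(x)
        batch_size, channels, height, width = query.size()
        query = query.view(batch_size, channels, -1)
        key = key.view(batch_size, channels, -1)
        value = value.view(batch_size, channels, -1)
        attention_weights = F.softmax(torch.bmm(query.transpose(1, 2), key), dim=2)
        out = torch.bmm(value, attention_weights).view(batch_size, channels, height, width)
        return out

class ConvTransformer(nn.Module):
    def __init__(self):
        super(ConvTransformer, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.attention1 = ConvAttentionLayer(16, 16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.attention2 = ConvAttentionLayer(32, 32)
        self.fc = nn.Linear(32*28*28, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.attention1(x)
        x = F.relu(self.conv2(x))
        x = self.attention2(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

torch.manual_seed(0)
conv_model = ConvTransformer()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(conv_model.parameters(), lr=0.001)

# Random placeholder batch with MNIST-like shapes (not real data)
images = torch.rand(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

# One forward/backward pass and one optimizer step
outputs = conv_model(images)
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

n_params = sum(p.numel() for p in conv_model.parameters())
print(outputs.shape)                 # torch.Size([8, 10])
print(torch.isfinite(loss).item())
print(n_params)
```

To compare the two models on actual performance, the same training and evaluation loops used earlier could be reused with `conv_model`, feeding the images in their original `(batch_size, 1, 28, 28)` shape instead of flattening them.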