# `016` Transformers

Requirements: 014 Attention and dropout, 015 Residual connections

☢️☢️ WIP ☢️☢️

Attention mechanisms were proposed as a way to give RNN inputs information about the other elements in the sequence, so that every element can be contextualized with information from the other tokens. However, an architecture called the transformer, proposed by [Vaswani et al., 2017](https://arxiv.org/pdf/2002.04745v1.pdf), took the world by surprise.

Basically, the authors removed the RNN layers and used just a stack of linear layers, attention mechanisms, residual connections and normalization. Applied to English-to-German translation, the architecture achieved a better BLEU score than any previously published model. Furthermore, training it on general text corpora produced an impressive level of generalization that keeps improving as the model grows.
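The attention operation at the core of this architecture fits in a few lines. The following is a minimal single-head sketch of scaled dot-product attention; the function name `scaled_dot_product` and the toy shapes are assumptions made for illustration, not code from the paper.

```python
import math
import torch

def scaled_dot_product(q, k, v):
    # q, k, v: (batch, context, channels)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, context, context)
    weights = scores.softmax(-1)  # every row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 16, 64)
out, weights = scaled_dot_product(x, x, x)  # self-attention: q, k and v are the same tensor
print(out.shape, weights.shape)
```

Every output position is a weighted average of all the value vectors, which is how each token gets contextualized by the rest of the sequence without any recurrence.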

In this notebook I will define a transformer block, build a model from a stack of them, and train it on the Spanish novels corpus also used in the Embeddings notebook.

In [7]:
from matplotlib import pyplot as plt
from re import sub
from time import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [8]:
tokens = ' !(),-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz¡¿ÁÉÍÑÓÚÜáéíñóúü'
t2i = {c: i for i, c in enumerate(tokens)}
i2t = {i: c for i, c in enumerate(tokens)}

with open('custom-data/spanish-novels.txt', encoding='utf-8') as fp:
	data = fp.read()

data = sub(r'\s*\n\s*', ' ', data)
data = [t2i[c] for c in data if c in tokens]
data = torch.tensor(data, dtype=torch.long).to(device)
print(f'dataset size: {data.shape[0]} (that\'s like {100 * data.shape[0] / 20000000000000:.6f}% of GPT-4\'s training data)')

dataset size: 19263904 (that's like 0.000096% of GPT-4's training data)
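To make the character-level tokenization concrete, here is a small round trip through `t2i` and `i2t`. The helpers `encode` and `decode` and the sample sentence are hypothetical names introduced only for this sketch.

```python
tokens = ' !(),-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz¡¿ÁÉÍÑÓÚÜáéíñóúü'
t2i = {c: i for i, c in enumerate(tokens)}
i2t = {i: c for i, c in enumerate(tokens)}

def encode(text):
    # characters outside the vocabulary are silently dropped, like in the cell above
    return [t2i[c] for c in text if c in t2i]

def decode(ids):
    return ''.join(i2t[i] for i in ids)

sample = '¿Dónde está la biblioteca?'
ids = encode(sample)
print(ids[:5], decode(ids))
```

Since every character is its own token, the vocabulary stays tiny, at the cost of longer sequences for the same amount of text.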


In [15]:
def get_batch(size=32, context_size=16):
	starts = torch.randint(0, len(data) - context_size, (size,))
	x = torch.stack([data[s:s+context_size] for s in starts])
	y = torch.stack([data[s+context_size] for s in starts])
	return x, y

x, y = get_batch()
print(x.shape, y.shape)

torch.Size([32, 16]) torch.Size([32])
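A quick way to check what a batch contains is to run the same sampling over a toy sequence where every token equals its own position. This is only a sketch: `toy` is a stand-in for `data`, and this variant of `get_batch` takes the data as a parameter, which the notebook version does not.

```python
import torch

toy = torch.arange(100)  # stand-in for `data`: each token equals its position

def get_batch(data, size=4, context_size=16):
    starts = torch.randint(0, len(data) - context_size, (size,))
    x = torch.stack([data[s:s + context_size] for s in starts])
    y = torch.stack([data[s + context_size] for s in starts])
    return x, y

x, y = get_batch(toy)
print(x[0].tolist(), '->', y[0].item())
```

Each target is the single token that comes right after its context window, so the model is trained to predict one next character per sample.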


In [28]:
class TransformerBlock(torch.nn.Module):
	def __init__(self, channels=64, num_heads=4, dropout=0.1):
		super().__init__()
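		# batch_first=True because the inputs are shaped (batch, context, channels);
		# the default layout is (context, batch, channels), which would attend across samples
		self.attention_heads = torch.nn.MultiheadAttention(channels, num_heads, dropout, batch_first=True)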
		self.norm1 = torch.nn.RMSNorm(channels)
		self.ff = torch.nn.Linear(channels, channels)
		self.norm2 = torch.nn.RMSNorm(channels)

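	def forward(self, x):
		# self-attention sub-layer with a residual connection, then normalization
		x = x + self.attention_heads(x, x, x)[0]
		x = self.norm1(x)
		# feed-forward sub-layer with a residual connection, ReLU applied after the sum
		x = x + self.ff(x)
		x = torch.relu(x)
		x = self.norm2(x)
		return x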
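`torch.nn.MultiheadAttention` expects inputs shaped `(context, batch, channels)` unless `batch_first=True` is passed. The following sketch, with dropout disabled and arbitrary toy shapes, checks whether the output for one sample depends on the other samples in the batch under each layout.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 16, 64)  # (batch, context, channels)

batch_first = torch.nn.MultiheadAttention(64, 4, dropout=0.0, batch_first=True).eval()
default = torch.nn.MultiheadAttention(64, 4, dropout=0.0).eval()

with torch.no_grad():
    # the first sample processed inside the batch versus on its own
    same = torch.allclose(batch_first(x, x, x)[0][:1], batch_first(x[:1], x[:1], x[:1])[0], atol=1e-4)
    mixed = not torch.allclose(default(x, x, x)[0][:1], default(x[:1], x[:1], x[:1])[0], atol=1e-4)

print('batch_first=True keeps samples independent:', same)
print('default layout mixes samples:', mixed)
```

With the default layout, the first axis is treated as the sequence, so a `(batch, context, channels)` tensor ends up attending across unrelated samples instead of across the tokens of each context.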

In [30]:
class LanguageModel(torch.nn.Module):
	def __init__(self, vocab_size, context_size=16, hidden_size=64, num_blocks=8, num_heads=4, dropout=0.1):
		super().__init__()
		self.tok_emb = torch.nn.Embedding(vocab_size, hidden_size)
		self.pos_emb = torch.nn.Embedding(context_size, hidden_size)
		self.blocks = torch.nn.Sequential(*[
			TransformerBlock(hidden_size, num_heads, dropout)
			for _ in range(num_blocks)
		])
		self.out = torch.nn.Linear(hidden_size, vocab_size)

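	def forward(self, x):
		# position indices must be on the same device as the input tokens
		positions = torch.arange(x.size(-1), device=x.device)
		x = self.tok_emb(x) + self.pos_emb(positions)
		x = self.blocks(x)
		# return raw logits: cross_entropy applies log-softmax itself
		return self.out(x)

model = LanguageModel(len(tokens)).to(device)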
print(f'Model has {sum(p.numel() for p in model.parameters())} parameters')

Model has 179929 parameters
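The parameter count can be reproduced by hand, which is a useful sanity check on where the capacity goes. This sketch assumes the default hyperparameters above and a vocabulary of 89 characters (the length of `tokens`).

```python
vocab_size, context_size, hidden_size, num_blocks = 89, 16, 64, 8

tok_emb = vocab_size * hidden_size
pos_emb = context_size * hidden_size
# MultiheadAttention: fused q/k/v projection plus output projection, both with biases
attention = (3 * hidden_size * hidden_size + 3 * hidden_size) + (hidden_size * hidden_size + hidden_size)
norms = 2 * hidden_size  # each RMSNorm has one weight per channel
ff = hidden_size * hidden_size + hidden_size
block = attention + norms + ff
out = hidden_size * vocab_size + vocab_size

total = tok_emb + pos_emb + num_blocks * block + out
print(block, total)  # 20928 179929
```

About 93% of the parameters sit in the transformer blocks, and within each block the attention projections take roughly four times as many parameters as the feed-forward layer.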


In [41]:
def train(model, epochs=10000, batch_size=32, context_size=16, lr=1e-3):
	model.train()
	optimizer = torch.optim.Adam(model.parameters(), lr=lr)
	losses = []
	start = time()
	for epoch in range(epochs):
		x_batch, y_batch = get_batch(size=batch_size, context_size=context_size)
		logits = model(x_batch)[:, -1, :]
		loss = torch.nn.functional.cross_entropy(logits, y_batch)
		optimizer.zero_grad()
		loss.backward()
		optimizer.step()
		losses.append(loss.item())
		if epoch % 2000 == 0 and epoch > 0 or epoch == 50:
			remaining = (time() - start) * (epochs - epoch) / (epoch + 1)
			print(f'epoch {epoch:4d}, loss: {loss.item():.6f}, remaining {remaining:.0f}s')
	return losses
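`torch.nn.functional.cross_entropy` applies log-softmax internally, so it expects raw logits. The sketch below uses a hand-built prediction that is fully confident and correct to show what happens when probabilities are passed instead; the 89-class setup mirrors the vocabulary size, and the target index is arbitrary.

```python
import math
import torch
from torch.nn.functional import cross_entropy

vocab_size = 89
target = torch.tensor([3])

# a prediction that is certain and correct: a huge logit on the target class
logits = torch.zeros(1, vocab_size)
logits[0, 3] = 100.0

loss_from_logits = cross_entropy(logits, target).item()
loss_from_probs = cross_entropy(logits.softmax(-1), target).item()  # softmax applied twice
floor = math.log(vocab_size - 1 + math.e) - 1  # best reachable loss when inputs are in [0, 1]

print(loss_from_logits, loss_from_probs, floor, math.log(vocab_size))
```

When probabilities are fed in, even a perfect prediction cannot push the loss below roughly 3.51, while a uniform guess scores ln(89) ≈ 4.49, so the loss is squeezed into a narrow band and the gradients become very small.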

In [42]:
print('Gross training')
losses = train(model)
print('Fine training')
losses.extend(train(model, batch_size=64, lr=1e-4))

Gross training
epoch   50, loss: 4.348711, remaining 202s
epoch 2000, loss: 4.226552, remaining 162s


KeyboardInterrupt: 