# Prompting
In this notebook we continue with the GPT2 architecture we developed earlier, but this time we focus on strategies for generating samples from a trained model (i.e. prompting).

In [None]:
# All dependencies for the entire notebook
from matplotlib import pyplot as plt
from tqdm.auto import tqdm, trange
from tokenizers import (
    decoders,
    models,
    trainers,
    Tokenizer,
)

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import RandomSampler

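# Use the GPU when available, otherwise fall back to the CPU
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')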

## Data
We use the tiny shakespeare dataset because it is small and has interesting structure (i.e. lots of newlines, character names in all capital letters).

In [None]:
!wget -nc https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Below we specify a `TextDataset` class. However, instead of using a character-level tokenizer (like before), we train a byte-pair encoding (BPE) tokenizer with the Hugging Face `tokenizers` library.

We use our trained tokenizer to tokenize the entire dataset.
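
To make the idea behind byte-pair encoding concrete, here is a minimal sketch of the core merge loop in plain Python: start from single characters and repeatedly fuse the most frequent adjacent pair into a new token. It only illustrates the principle and is not how the `tokenizers` library implements training; the helper names `most_frequent_pair`, `merge_pair` and `toy_bpe` are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` by a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text: str, n_merges: int) -> list[str]:
    """Start from characters and apply `n_merges` greedy pair merges."""
    tokens = list(text)
    for _ in range(n_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens


text = "the thane thanks thee"
tokens = toy_bpe(text, n_merges=3)
print(tokens)                       # fewer, longer tokens than characters
print("".join(tokens) == text)      # merging never loses information
```

Each merge adds one entry to the vocabulary, so a larger `vocab_size` in the real trainer roughly corresponds to more merges and shorter token sequences.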

In [None]:
class TextDataset:
    def __init__(self, tokens: list[int], tokenizer, context_size: int=256):
        self.tokens = tokens
        self.tokenizer = tokenizer
        self.vocab_size = len(tokenizer.get_vocab())
        self.context_size = context_size

    def __repr__(self):
        n_tokens = len(self.tokens)
        vocab_size = self.vocab_size
        context_size = self.context_size
        return f'TextDataset({n_tokens=}, {vocab_size=}, {context_size=})'

    @classmethod
    def from_textfile(cls, filename: str, tokenizer, context_size: int=256) -> 'TextDataset':
        """Load a textfile and tokenize its full contents with the given (already trained) tokenizer"""
        with open(filename, 'r') as fh:
            data = fh.read()
            tokens = tokenizer.encode(data).ids
            return cls(tokens, tokenizer, context_size=context_size)

    def train_test_split(self, train_percentage: float=0.9) -> tuple['TextDataset','TextDataset']:
        n_train_chars = int(train_percentage * len(self.tokens))

        train_tokens = self.tokens[:n_train_chars]
        train_dataset = TextDataset(train_tokens, self.tokenizer, self.context_size)

        test_tokens = self.tokens[n_train_chars:]
        test_dataset = TextDataset(test_tokens, self.tokenizer, self.context_size)

        return train_dataset, test_dataset

    def __len__(self) -> int:
        return len(self.tokens) - self.context_size

    def __getitem__(self, pos: int) -> tuple[torch.Tensor, torch.Tensor]:
        """Return tokens starting at pos up to pos + context_size, targets are shifted by one position"""
        # grab a chunk of context_size tokens from the data
        tokens = self.tokens[pos:pos + self.context_size + 1]
        # convert to tensor
        tokens = torch.tensor(tokens, dtype=torch.long)
        # targets are shifted one position from input
        return tokens[:-1], tokens[1:]

# Specify a BPE tokenizer using Huggingface's tokenizers
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.decoder = decoders.ByteLevel()

# Train on shakespeare
trainer = trainers.BpeTrainer(vocab_size=512)
tokenizer.train(["input.txt"], trainer=trainer)

dataset = TextDataset.from_textfile('./input.txt', tokenizer=tokenizer)
train_dataset, test_dataset = dataset.train_test_split()

len(train_dataset), len(test_dataset)
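
The `__getitem__` method above returns an input window and a target window that is shifted by one position, so that every input token is paired with the token that follows it. A small sketch with plain lists (the function name `next_token_pairs` and the token values are made up for illustration):

```python
def next_token_pairs(tokens: list[int], pos: int, context_size: int) -> tuple[list[int], list[int]]:
    """Slice context_size + 1 tokens and split them into inputs and shifted targets."""
    chunk = tokens[pos:pos + context_size + 1]
    return chunk[:-1], chunk[1:]


tokens = [10, 11, 12, 13, 14, 15]
inputs, targets = next_token_pairs(tokens, pos=1, context_size=3)
print(inputs)   # [11, 12, 13]
print(targets)  # [12, 13, 14]

# Only len(tokens) - context_size start positions yield a full window,
# which is why __len__ subtracts the context size.
n_windows = len(tokens) - 3
print(n_windows)  # 3
```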

## Model
Below is a full implementation of a GPT2-style transformer, except we use PyTorch's fused `F.scaled_dot_product_attention` kernel to calculate the dot-product attention.
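
As a reminder of what that kernel computes, here is a deliberately slow single-head sketch in plain Python. It assumes the standard definition of causal scaled dot-product attention (scores divided by the square root of the head dimension, softmax over the current and earlier positions only); the function name `causal_attention` and the toy values are made up, and dropout is left out.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention on lists of vectors (one vector per position)."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # position i may only attend to positions 0..i (the causal mask)
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d) for j in range(i + 1)]
        # numerically stable softmax over the visible positions
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted average of the visible value vectors
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # the first position can only see itself, so this equals v[0]
```

Because of the mask, changing a later token never changes the output at an earlier position, which is what allows the model to be trained on all positions of a sequence at once.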

The biggest difference compared to our earlier model is in the `generate` method of the `GPT` class, which now includes several options to configure the way we sample tokens:
- Greedy generation: at each sampling iteration, pick the token with the highest probability.
- Stochastic generation: at each iteration, sample from a distribution with probabilities determined by our model (this is what we have done so far).
- Top-K sampling: sample tokens from the model's distribution, but only the k most likely tokens get a non-zero probability.
- Temperature scaling: divide the logits by a 'temperature' value before the softmax; lower temperatures lead to sampling from a more 'spiked' probability distribution.
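
The four options can be tried out on a toy logit vector without any model. The sketch below uses only the standard library and mirrors the order used in `generate` (top-k filter first, then temperature, then sampling); the helper names `softmax`, `top_k_filter` and `sample_token` and the logit values are made up for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (probability zero)."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [l if i in keep else -math.inf for i, l in enumerate(logits)]


def sample_token(logits, greedy=False, top_k=None, temperature=1.0, rng=random):
    """Pick the next token index using one of the strategies listed above."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k:
        logits = top_k_filter(logits, top_k)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
rng = random.Random(0)

print(sample_token(logits, greedy=True))                          # always index 0
print([sample_token(logits, rng=rng) for _ in range(10)])         # any index can appear
print([sample_token(logits, top_k=2, rng=rng) for _ in range(10)])  # only indices 0 and 1
print([round(p, 3) for p in softmax(logits, temperature=0.5)])    # sharper than T=1
print([round(p, 3) for p in softmax(logits, temperature=5.0)])    # flatter than T=1
```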

Executing the code block below generates a sample from an untrained model and prints the individual tokens to highlight that we are not working at the character level anymore.

In [None]:
class MLP(nn.Module):
    """Simple multi-layer perceptron with two linear layers and a relu non-linearity in between"""
    def __init__(self, embedding_dim: int, dropout: float):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features=embedding_dim, out_features=4*embedding_dim),
            nn.ReLU(),
            nn.Linear(in_features=4*embedding_dim, out_features=embedding_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x: torch.tensor) -> torch.tensor:
        return self.mlp(x)

class MultiHeadDotProductAttention(nn.Module):
    """Multi Head Dot Product attention"""
    def __init__(self, embedding_dim: int, n_heads: int, dropout: float):
        super().__init__()
        if embedding_dim % n_heads != 0:
            raise ValueError('embedding_dim must be divisible by n_heads')

        self.n_heads = n_heads

        # attention input projections
        self.w_q = nn.Linear(in_features=embedding_dim, out_features=embedding_dim)
        self.w_k = nn.Linear(in_features=embedding_dim, out_features=embedding_dim)
        self.w_v = nn.Linear(in_features=embedding_dim, out_features=embedding_dim)

        # output projection
        self.out_project = nn.Linear(in_features=embedding_dim, out_features=embedding_dim)

        #dropouts
        self.attention_dropout = nn.Dropout(dropout)
        self.projection_dropout = nn.Dropout(dropout)
        self.dropout = dropout

    def forward(self, x: torch.tensor) -> torch.tensor:
        # B, L, N
        # N = n_heads x head_size
        batch_dim, input_length, embedding_dim = x.size()

        # calculate input projections and divide over heads
        # 'view' and 'transpose' reorder in subtly different ways and we need both
        # (B, L, n_heads, head_dim) -> (B, n_heads, L, head_dim)
        q = self.w_q(x).view(batch_dim, input_length, self.n_heads, embedding_dim // self.n_heads).transpose(1,2)
        k = self.w_k(x).view(batch_dim, input_length, self.n_heads, embedding_dim // self.n_heads).transpose(1,2)
        v = self.w_v(x).view(batch_dim, input_length, self.n_heads, embedding_dim // self.n_heads).transpose(1,2)

        # use fast cuda kernel for attention
        pred = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)

        # reshape multiple heads back into contiguous representation
        pred = pred.transpose(1, 2).contiguous().view(batch_dim, input_length, embedding_dim)

        # return linear projection
        return self.projection_dropout(self.out_project(pred))

class TransformerBlock(nn.Module):
    """Transformer block that combines attention and MLP, both with pre-layernorm and residual connections"""
    def __init__(self, embedding_dim: int, n_heads:int, dropout:float):
        super().__init__()
        # pre-layernorm: normalize the input before each sub-layer
        self.attention = nn.Sequential(
            nn.LayerNorm(embedding_dim),
            MultiHeadDotProductAttention(
                embedding_dim=embedding_dim,
                n_heads=n_heads,
                dropout=dropout,
            ),
        )
        self.mlp = nn.Sequential(
            nn.LayerNorm(embedding_dim),
            MLP(embedding_dim=embedding_dim, dropout=dropout),
        )

    def forward(self, x: torch.tensor) -> torch.tensor:
        """Calculate attention and communication between channels, both with residual connections"""
        # Communicate between positions (i.e. attention)
        attn = self.attention(x) + x
        # Communicate between embedding dimensions
        res = self.mlp(attn) + attn
        return res

class AdditivePositionalEmbedding(nn.Module):
    """Wrapper class to add positional encoding to already embedded tokens"""
    def __init__(self, context_size: int, embedding_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=context_size, embedding_dim=embedding_dim)

    def forward(self, x: torch.tensor) -> torch.tensor:
        """Add positional embeddings based on input dimensions, use residual connection"""
        pos = torch.arange(0, x.size(1), dtype=torch.long, device=x.device)
        return self.embedding(pos) + x

class GPT(nn.Module):
    def __init__(
        self,
        context_size: int=None,
        vocab_size: int=None,
        n_layers: int=6,
        n_heads: int=8,
        embedding_dim: int=384,
        dropout: float=0.1,
    ):
        super().__init__()
        self.context_size = context_size
        self.vocab_size = vocab_size

        self.transformer = nn.Sequential(
            nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim),
            AdditivePositionalEmbedding(context_size, embedding_dim),
            nn.Dropout(dropout),
            nn.Sequential(*[TransformerBlock(embedding_dim=embedding_dim, n_heads=n_heads, dropout=dropout) for _ in range(n_layers)]),
            nn.LayerNorm(embedding_dim),
            nn.Linear(in_features=embedding_dim, out_features=vocab_size)
        )

        # weight tying of input embedding and output projection (https://paperswithcode.com/method/weight-tying)
        self.transformer[0].weight = self.transformer[-1].weight

        # init all weights
        self.apply(self._init_weights)

    def forward(self, idx: torch.tensor, targets: torch.tensor=None) -> torch.tensor:
        logits = self.transformer(idx)
        loss = None if targets is None else F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        accuracy = None if targets is None else (logits.argmax(dim=-1) == targets).sum() / targets.numel()
        return logits,loss,accuracy

    def _init_weights(self, module: nn.Module) -> None:
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    @torch.no_grad()  # sampling does not need gradients
    def generate(
        self,
        prompt:str='\n',
        sample_length: int=256,
        greedy: bool=True,
        top_k: int=None,
        temperature: float=1.0
    ) -> list[int]:
        """Generate sample tokens using a variety of strategies: greedy, stochastic, top-k. Temperature scales model logits"""
        device = next(self.parameters()).device
        prompt_tokens = dataset.tokenizer.encode(prompt).ids
        idx = torch.tensor(prompt_tokens, dtype=torch.long, device=device)

        for _ in trange(sample_length, desc='Generating sample'):
            logits,_,_ = self(idx[-self.context_size:][None])
            logits = logits[0,-1,:]

            if greedy:
                idx_next = logits.argmax()[None]
            else:
                if top_k:
                    logits[logits.argsort()[:-top_k]] = -torch.inf
                if temperature:
                    logits = logits / temperature
                probs = F.softmax(logits, dim=0)
                idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next])

        return idx.tolist()


model = GPT(context_size=dataset.context_size, vocab_size=dataset.vocab_size)
sample = model.generate('I am', sample_length=16)
print([dataset.tokenizer.decode([s]) for s in sample])

## Training
We specify a GPT transformer with a context size of 256 tokens, 8 `TransformerBlock` layers, and 128 embedding dimensions divided over 16 attention heads. We train for 2000 steps using a batch size of 32, which should take a few minutes on a GPU.

In [None]:
dataset = TextDataset.from_textfile('./input.txt', tokenizer=tokenizer, context_size=256)

model = GPT(
    context_size=dataset.context_size,
    vocab_size=dataset.vocab_size,
    embedding_dim=128,
    n_layers=8,
    n_heads=16,
    dropout=0.2
)
model.to(DEVICE)
model.train()

train_steps = 2_000
batch_size = 32

train_dataset, test_dataset = dataset.train_test_split()

train_dataloader = DataLoader(
    dataset=train_dataset,
    sampler=RandomSampler(train_dataset, num_samples=train_steps * batch_size),
    batch_size=batch_size,
)
test_dataloader = DataLoader(
    dataset=test_dataset,
    sampler=RandomSampler(test_dataset, replacement=True),
    batch_size=batch_size,
)
test_dataloader = iter(test_dataloader)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-3)
model.train()

train_losses = []
train_accuracies = []
test_losses = []
test_accuracies = []

for i, (train_x, train_y) in enumerate(tqdm(train_dataloader)):
    # forward the model
    _,train_loss,train_accuracy = model(train_x.to(DEVICE), train_y.to(DEVICE))

    # save losses on train and test every 20 iterations
    if i % 20 == 0:
        train_losses.append(train_loss.item())
        train_accuracies.append(train_accuracy.item())
        test_x, test_y = next(test_dataloader)
        # evaluate without dropout and without tracking gradients
        model.eval()
        with torch.no_grad():
            _,test_loss,test_accuracy = model(test_x.to(DEVICE), test_y.to(DEVICE))
        model.train()
        test_losses.append(test_loss.item())
        test_accuracies.append(test_accuracy.item())

    # backprop and update the parameters
    model.zero_grad(set_to_none=True)
    train_loss.backward()

    # prevent gradients from exploding
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

fig,[ax1, ax2] = plt.subplots(ncols=2, figsize=(12,4))
ax1.plot(train_losses, label='train loss')
ax1.plot(test_losses, label='test loss')
ax1.legend()
ax2.plot(train_accuracies, label='train accuracy')
ax2.plot(test_accuracies, label='test accuracy')
ax2.legend()

### Exercise 1
Train a GPT on the tinyshakespeare dataset using the configuration provided in the notebook. Use the codeblock below to perform several prompting and sampling experiments.
- Generate a sample using greedy decoding, and a few using stochastic decoding (i.e. `greedy=False`). What difference do you find?
- Generate a sample using stochastic decoding, but specify `top_k=1`. What do you notice?
- Generate two samples using stochastic decoding, one with `temperature=0.01` and one with `temperature=10.0`. What do you notice?
- Experiment with specifying a different prompt, use stochastic decoding with `top_k=5` and `temperature=1.0`. Is there variation in the quality of the results when you specify different prompts?

In [None]:
model.eval()
sample = model.generate(prompt='Aluminium', greedy=False, top_k=5)
print(dataset.tokenizer.decode(sample))

## Answers
### Exercise 1
- The greedy sample is not very good and is always the same. Stochastic samples often (but not always) capture more of the structure found in the tinyshakespeare dataset (like double whitespaces and character names in all capital letters). A stochastic sample feels more 'real'.
- `top_k=1` is the same as greedy decoding!
- The low-temperature sample is very similar to greedy decoding, while the high-temperature sample looks a lot like the output of an untrained model.
- Our model is still not producing very readable text, so it can be difficult to judge, but for example prompting with 'DUKE ' often leads to samples that contain the word 'lord' a lot. Prompting with 'Be gone ' is almost always followed by the words 'of' or 'to the'. Prompting 'KING RICHARD ' sometimes results in varieties of the name that are in the training set, such as 'KING RICHARD II' and 'KING RICHARD III', but also in reasonable-sounding varieties that are not in the data, such as 'KING RICHARD OF YORK'. Prompting 'Aluminium' never gets recognized as a word that should be followed by a space.
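
The temperature answer can be checked numerically. The sketch below measures how 'spread out' the softmax distribution is at the two temperatures from the exercise by computing its entropy: a value near zero means one token takes almost all probability (like greedy decoding), while a value near `log(n)` means the distribution is almost uniform (like an untrained model). The logit values are made up and the helpers are written only for this illustration.

```python
import math


def softmax(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: 0 for a one-hot distribution, log(n) for a uniform one."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [3.0, 1.5, 0.2, -0.7, -2.0]  # made-up scores for a 5-token vocabulary
for t in (0.01, 1.0, 10.0):
    probs = softmax(logits, t)
    print(f"T={t}: max prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
print(f"uniform entropy: {math.log(len(logits)):.3f}")
```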