# Course 3: Language Modeling

The course slides are available [here](https://github.com/NathanGodey/AdvancedNLP/raw/main/slides/pdf/course3_lm.pdf).

## Part 1: Homemade Transformers

In this section, we will reproduce the forward pass of a Transformer from scratch. **Don't forget to enable the GPU.**

In [None]:
import torch
import torch.nn as nn

### Question 1
Given a Query, Key or Value tensor of shape `batch_size x sequence_length x hidden_dim`, design a function (in PyTorch) that adds a head dimension for `num_heads` heads.

In [None]:
def split_into_heads(input_tensor, num_heads):
  ...

In [None]:
test_Q = torch.randn(8, 128, 64)
assert split_into_heads(test_Q, 8).shape == (8, 128, 8, 8)
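
As a hint, here is one possible sketch rather than a reference solution. Since `hidden_dim = num_heads * head_dim`, reshaping the last dimension is enough. The divisibility check is an addition made here for illustration.

```python
import torch

def split_into_heads(input_tensor, num_heads):
  # (batch_size, sequence_length, hidden_dim)
  #   -> (batch_size, sequence_length, num_heads, head_dim)
  batch_size, seq_len, hidden_dim = input_tensor.shape
  if hidden_dim % num_heads != 0:
    raise ValueError("hidden_dim must be divisible by num_heads")
  return input_tensor.reshape(batch_size, seq_len, num_heads, hidden_dim // num_heads)

test_Q = torch.randn(8, 128, 64)
heads = split_into_heads(test_Q, 8)
```

With this layout, head `i` holds the slice `[i * head_dim : (i + 1) * head_dim]` of the original hidden dimension.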

### Question 2
Given a split Query, Key or Value tensor of shape `batch_size x sequence_length x num_heads x head_hidden_dim`, design the function (in PyTorch) that removes the head dimension.

In [None]:
def concat_heads(input_tensor):
  ...

In [None]:
concat_heads(split_into_heads(test_Q, 8)) - test_Q
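
A minimal sketch of the inverse operation, assuming the heads were created by a plain reshape of the last dimension. The names `split_Q`, `merged_Q` and `round_trip` are illustrative only.

```python
import torch

def concat_heads(input_tensor):
  # (batch_size, sequence_length, num_heads, head_dim)
  #   -> (batch_size, sequence_length, num_heads * head_dim)
  batch_size, seq_len, num_heads, head_dim = input_tensor.shape
  return input_tensor.reshape(batch_size, seq_len, num_heads * head_dim)

split_Q = torch.randn(8, 128, 8, 8)
merged_Q = concat_heads(split_Q)
# Splitting again with a plain reshape recovers the original tensor exactly.
round_trip = merged_Q.reshape(8, 128, 8, 8)
```

If the split and the merge are consistent, the difference displayed by the cell above should be a tensor of zeros.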

### Question 3
Given Query, Key and Value tensors of shape `batch_size x sequence_length x num_heads x head_hidden_dim`, design the function that performs the self-attention product. Test it with random inputs.
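
For reference, the usual scaled dot-product formulation, computed independently for each head, is recalled below. The symbol $d_{\text{head}}$ is introduced here to denote `head_hidden_dim`.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_{\text{head}}}}\right) V
$$

The softmax is taken over key positions, so the weights received by each query position sum to 1.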

In [None]:
def head_level_self_attention(Q, K, V):
  ...

### Question 4
Rewrite the function from Question 3 allowing the use of a causal mask:

In [None]:
def head_level_self_attention(Q, K, V, causal=True):
  ...
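
The sketch below shows one way the causal mask could be applied, assuming the `batch_size x sequence_length x num_heads x head_hidden_dim` layout from the previous questions. It is an illustration under that assumption, not the expected solution.

```python
import math
import torch

def head_level_self_attention(Q, K, V, causal=True):
  # Inputs: (batch_size, sequence_length, num_heads, head_dim)
  # Move heads before the sequence axis: (batch_size, num_heads, sequence_length, head_dim)
  q, k, v = (t.transpose(1, 2) for t in (Q, K, V))
  scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
  if causal:
    seq_len = scores.shape[-1]
    # True strictly above the diagonal: keys that lie in the future of the query
    future = torch.triu(torch.ones(seq_len, seq_len, device=scores.device), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))
  weights = torch.softmax(scores, dim=-1)
  # Back to (batch_size, sequence_length, num_heads, head_dim)
  return (weights @ v).transpose(1, 2)

Q, K, V = (torch.randn(2, 5, 4, 8) for _ in range(3))
out = head_level_self_attention(Q, K, V, causal=True)
```

A quick sanity check: with the mask enabled, the first position can only attend to itself, so `out[:, 0]` should equal `V[:, 0]`.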

### Question 5
Create a `CustomTransformer` class (`nn.Module` subclass) using the previous functions. The forward pass goes as follows:
1. Use a first LayerNorm
2. Compute (Q, K, V) with a single linear projection `hidden_dim -> 3*hidden_dim` (with bias)
3. Compute self-attention
4. Do a linear projection keeping dimension (with bias)
5. Add original input to current result
6. Use a second LayerNorm
7. Do a linear projection (with bias) to some `intermediate_dim`
8. Apply a given activation function (argument of the class)
9. Do a linear projection (with bias) back to `hidden_dim`
10. Add output of step 5 to current result


In [None]:
class CustomTransformer(nn.Module):
  ...
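
To make the ten steps concrete, here is a compact, self-contained sketch. The class name `PreLNBlockSketch`, the attribute names, the GELU default and the always-causal attention are assumptions made for illustration. Your own class should reuse the functions from the previous questions.

```python
import math
import torch
import torch.nn as nn

class PreLNBlockSketch(nn.Module):
  # Hypothetical name; follows the 10 steps listed above with causal attention.
  def __init__(self, hidden_dim, num_heads, intermediate_dim, activation=None):
    super().__init__()
    self.num_heads = num_heads
    self.ln_1 = nn.LayerNorm(hidden_dim)
    self.qkv_proj = nn.Linear(hidden_dim, 3 * hidden_dim)
    self.out_proj = nn.Linear(hidden_dim, hidden_dim)
    self.ln_2 = nn.LayerNorm(hidden_dim)
    self.up_proj = nn.Linear(hidden_dim, intermediate_dim)
    # GELU is an arbitrary default chosen for this sketch
    self.activation = activation if activation is not None else nn.GELU()
    self.down_proj = nn.Linear(intermediate_dim, hidden_dim)

  def attention(self, q, k, v):
    b, s, h = q.shape
    # (b, s, h) -> (b, num_heads, s, head_dim)
    q, k, v = (t.reshape(b, s, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    future = torch.triu(torch.ones(s, s, device=q.device), diagonal=1).bool()
    weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
    # Merge heads back: (b, num_heads, s, head_dim) -> (b, s, h)
    return (weights @ v).transpose(1, 2).reshape(b, s, h)

  def forward(self, x):
    q, k, v = self.qkv_proj(self.ln_1(x)).chunk(3, dim=-1)  # steps 1-2
    x = x + self.out_proj(self.attention(q, k, v))          # steps 3-5
    mlp_out = self.down_proj(self.activation(self.up_proj(self.ln_2(x))))  # steps 6-9
    return x + mlp_out                                      # step 10

block = PreLNBlockSketch(hidden_dim=64, num_heads=8, intermediate_dim=256)
y = block(torch.randn(2, 10, 64))
```

Because attention is causal and every other operation acts on each position separately, changing the last input token should leave the outputs at earlier positions unchanged. This makes a convenient test.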

### Question 6

Create a `CustomInputEmbedding` class (`nn.Module` child) that generates input embeddings from batched input tokens ids. It will provide one token embedding for each input token and add an absolute positional embedding.

In [None]:
class CustomInputEmbedding(nn.Module):
  ...
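
A minimal sketch, assuming learned absolute positional embeddings stored in a second `nn.Embedding` table. The class name, argument names and attribute names are illustrative, not imposed by the exercise.

```python
import torch
import torch.nn as nn

class InputEmbeddingSketch(nn.Module):
  # Hypothetical name; token embedding plus learned absolute positional embedding
  def __init__(self, vocab_size, max_positions, hidden_dim):
    super().__init__()
    self.token_emb = nn.Embedding(vocab_size, hidden_dim)
    self.pos_emb = nn.Embedding(max_positions, hidden_dim)

  def forward(self, input_ids):
    # input_ids: (batch_size, sequence_length) integer tensor
    positions = torch.arange(input_ids.shape[1], device=input_ids.device)
    # pos_emb(positions) is (sequence_length, hidden_dim) and broadcasts over the batch
    return self.token_emb(input_ids) + self.pos_emb(positions)

emb = InputEmbeddingSketch(vocab_size=100, max_positions=32, hidden_dim=16)
ids = torch.randint(0, 100, (4, 10))
out = emb(ids)
```

Note that the sequence length must not exceed `max_positions`, otherwise the positional lookup fails.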

### Question 7

The GPT-2 model family was designed the following way:
- Embed input tokens adding an absolute positional embedding
- Pass through $N$ Transformer layers
- Apply a final LayerNorm
- Use a Linear LM head to make a prediction

Using all previous classes, create a `CustomGpt2` module. Test it on random inputs.

In [None]:
class CustomGpt2(nn.Module):
  ...

## Part 2: Weight conversion

In this section, we import the weights of the original GPT-2 (small version) and we convert them into our custom format.

### Question 7b
Download the `gpt2` model from HuggingFace as an `AutoModelForCausalLM`. Print it and find out its hyper-parameters. Instantiate a similar `CustomGpt2` model.

### Question 8
Create a function that converts a `Conv1D` layer into a `nn.Linear` layer. Check that the `Conv1D` and its `Linear` counterpart give the same results on random inputs, and compare their speed.

In [None]:
def conv2linear(conv_layer):
  ...
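
The sketch below assumes that the `Conv1D` layer stores its weight with shape `(in_features, out_features)` and computes `x @ weight + bias`. This is worth verifying on the downloaded model. To stay self-contained, it uses a hypothetical `Conv1DStandIn` class instead of the HuggingFace layer.

```python
import torch
import torch.nn as nn

class Conv1DStandIn(nn.Module):
  # Hypothetical stand-in with the assumed layout: weight is (in_features, out_features)
  def __init__(self, out_features, in_features):
    super().__init__()
    self.weight = nn.Parameter(torch.randn(in_features, out_features) * 0.02)
    self.bias = nn.Parameter(torch.randn(out_features) * 0.02)

  def forward(self, x):
    return x @ self.weight + self.bias

def conv2linear(conv_layer):
  in_features, out_features = conv_layer.weight.shape
  linear = nn.Linear(in_features, out_features)
  with torch.no_grad():
    # nn.Linear stores its weight as (out_features, in_features), hence the transpose
    linear.weight.copy_(conv_layer.weight.T)
    linear.bias.copy_(conv_layer.bias)
  return linear

conv = Conv1DStandIn(out_features=12, in_features=4)
linear = conv2linear(conv)
x = torch.randn(3, 5, 4)
```

Comparing `conv(x)` and `linear(x)` with `torch.allclose` then checks the conversion.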

### Question 9
Create a `convert_weights` function that sets all equivalent parameters in your `CustomGpt2` model to the values of their HuggingFace counterpart. Make a real-life prediction to check that their outputs are similar.

In [None]:
def convert_weights(original_gpt2, custom_gpt2):
  ...

## Part 3: Generation

Let's now use our model in generation mode.

### Question 10

Write a `greedy_generate` function that uses your custom GPT2 and performs greedy generation. Try it on a short sentence (don't forget a stopping condition).

In [None]:
def greedy_generate(model, sentence, ...):
  ...

In [None]:
sentence = "..."
tokens = greedy_generate(model, sentence, ...)
print(tokenizer.decode(tokens))
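
The greedy loop itself can be sketched on token ids, independently of the tokenizer. In this illustration, `greedy_generate_ids`, `max_new_tokens`, `eos_token_id` and `ToyModel` are hypothetical names. The model is assumed to return logits of shape `(1, sequence_length, vocab_size)`, and the batch size is assumed to be 1.

```python
import torch

@torch.no_grad()
def greedy_generate_ids(model, input_ids, max_new_tokens=20, eos_token_id=None):
  # input_ids: (1, sequence_length)
  for _ in range(max_new_tokens):  # first stopping condition: length budget
    logits = model(input_ids)
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=1)
    # second stopping condition: end-of-sequence token
    if eos_token_id is not None and next_id.item() == eos_token_id:
      break
  return input_ids[0]

class ToyModel(torch.nn.Module):
  # Hypothetical stand-in: always predicts (last token + 1) modulo the vocabulary size
  def __init__(self, vocab_size=10):
    super().__init__()
    self.vocab_size = vocab_size

  def forward(self, input_ids):
    targets = (input_ids + 1) % self.vocab_size
    return torch.nn.functional.one_hot(targets, self.vocab_size).float()

tokens = greedy_generate_ids(ToyModel(), torch.tensor([[3]]), max_new_tokens=10, eos_token_id=7)
```

With this toy model, generation starts from token 3, counts upwards and stops as soon as token 7 is produced.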

### Question 11

Write a `topk_generate` function that uses your custom GPT2 and performs top-k generation (sampling among the k most probable tokens).

In [None]:
def topk_generate(model, sentence, k, ...):
  ...
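
A single top-k sampling step could look like the sketch below. The helper name `sample_top_k` is hypothetical and it works on the logits of one position only. The full generation loop is left to you.

```python
import torch

def sample_top_k(logits, k):
  # logits: (vocab_size,) scores for the next token
  top_values, top_indices = torch.topk(logits, k)
  probs = torch.softmax(top_values, dim=-1)  # renormalise over the k kept tokens
  choice = torch.multinomial(probs, num_samples=1)
  return top_indices[choice].item()

logits = torch.tensor([0.1, 3.0, -1.0, 2.5, 0.0])
samples = {sample_top_k(logits, k=2) for _ in range(50)}
```

Here only tokens 1 and 3 can ever be drawn, since they carry the two largest logits.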

### Question 12

Write a `nucleus_generate` function that uses your custom GPT2 and performs top-p generation (sampling among the most probable tokens, kept in decreasing order of probability until their cumulative probability exceeds `p`).

In [None]:
def nucleus_generate(model, sentence, p, ...):
  ...
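
One possible nucleus sampling step is sketched below. It assumes that the token which makes the cumulative probability cross `p` is itself kept. The helper name `sample_top_p` is hypothetical.

```python
import torch

def sample_top_p(logits, p):
  # logits: (vocab_size,) scores for the next token
  probs = torch.softmax(logits, dim=-1)
  sorted_probs, sorted_indices = torch.sort(probs, descending=True)
  cumulative = torch.cumsum(sorted_probs, dim=-1)
  # Keep a token if the mass accumulated *before* it is still below p,
  # so the token that crosses the threshold is included.
  keep = (cumulative - sorted_probs) < p
  kept_probs = sorted_probs * keep
  choice = torch.multinomial(kept_probs / kept_probs.sum(), num_samples=1)
  return sorted_indices[choice].item()

# Next-token probabilities 0.5, 0.3, 0.15, 0.05
logits = torch.log(torch.tensor([0.5, 0.3, 0.15, 0.05]))
samples = {sample_top_p(logits, p=0.7) for _ in range(50)}
```

With `p=0.7`, the nucleus contains tokens 0 and 1 (cumulative probability 0.8), so the other two tokens are never sampled.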

### Question 13

Write a `beam_generate` function that uses your custom GPT2 and performs beam-search generation.

In [None]:
def beam_generate(model, sentence, num_beams, ...):
  ...
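
The following deliberately simple sketch illustrates the idea on token ids. It assumes a batch size of 1 and omits both length normalisation and end-of-sequence handling. `beam_search_ids` and `ToyBigram` are hypothetical names. The toy model is built so that the greedy choice is not the best two-token continuation.

```python
import torch

@torch.no_grad()
def beam_search_ids(model, input_ids, num_beams=2, max_new_tokens=2):
  # Each beam is (token ids of shape (1, length), cumulative log-probability)
  beams = [(input_ids, 0.0)]
  for _ in range(max_new_tokens):
    candidates = []
    for ids, score in beams:
      log_probs = torch.log_softmax(model(ids)[0, -1], dim=-1)
      top_lp, top_ids = torch.topk(log_probs, num_beams)
      for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
        new_ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
        candidates.append((new_ids, score + lp))
    # Keep the num_beams best partial sequences
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
  return beams[0][0][0]

class ToyBigram(torch.nn.Module):
  # Hypothetical stand-in: the next-token distribution depends only on the last token
  def __init__(self):
    super().__init__()
    self.register_buffer("probs", torch.tensor([[0.05, 0.55, 0.40],
                                                [0.34, 0.33, 0.33],
                                                [0.90, 0.05, 0.05]]))

  def forward(self, input_ids):
    return torch.log(self.probs[input_ids])  # (1, length, vocab_size)

start = torch.tensor([[0]])
best = beam_search_ids(ToyBigram(), start, num_beams=2, max_new_tokens=2)
greedy = beam_search_ids(ToyBigram(), start, num_beams=1, max_new_tokens=2)
```

With two beams the search returns `[0, 2, 0]` (probability 0.40 × 0.90 = 0.36), whereas a single beam, which is equivalent to greedy decoding, returns `[0, 1, 0]` (probability 0.55 × 0.34 ≈ 0.19).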

### Question 14

Using the `%timeit` magic command, measure and compare the throughput of each generation method.

## Part 4: KV caching

To make our model faster, we implement KV caching in this section.

### Question 15

Re-implement the `head_level_self_attention` function so it can include a KV cache. Careful: Q, K and V should now correspond only to inputs that are not in the cache. This function should return the attention output and the updated cache.

In [None]:
def head_level_self_attention(Q, K, V, causal=True, cached_kv=None):
  ...
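
A sketch of how the cache could enter the attention computation is given below. It assumes the cache is a `(cached_K, cached_V)` pair using the same `batch_size x sequence_length x num_heads x head_hidden_dim` layout. The function name `attention_with_cache` and this cache format are illustrative choices, not requirements.

```python
import math
import torch

def attention_with_cache(Q, K, V, causal=True, cached_kv=None):
  # Q, K, V: (batch_size, new_length, num_heads, head_dim), for tokens not yet in the cache
  if cached_kv is not None:
    # Prepend cached keys and values along the sequence axis
    K = torch.cat([cached_kv[0], K], dim=1)
    V = torch.cat([cached_kv[1], V], dim=1)
  q, k, v = (t.transpose(1, 2) for t in (Q, K, V))
  scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
  if causal:
    new_len, total_len = scores.shape[-2], scores.shape[-1]
    # Query i sits at absolute position (offset + i) and may see keys up to that position
    offset = total_len - new_len
    future = torch.triu(torch.ones(new_len, total_len, device=scores.device), diagonal=offset + 1).bool()
    scores = scores.masked_fill(future, float("-inf"))
  out = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2)
  return out, (K, V)

Q, K, V = (torch.randn(1, 6, 2, 4) for _ in range(3))
full, _ = attention_with_cache(Q, K, V)
# Process the first 5 positions, then the 6th one using the cache
_, cache = attention_with_cache(Q[:, :5], K[:, :5], V[:, :5])
step, new_cache = attention_with_cache(Q[:, 5:], K[:, 5:], V[:, 5:], cached_kv=cache)
```

The output for the last position computed with the cache (`step[:, 0]`) should match the corresponding row of the full computation (`full[:, 5]`).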

### Question 16

Implement the `CustomTransformerWithCache` inheriting from `CustomTransformer`, with a forward function that takes `cached_kv` as an argument, and returns the updated KV cache.

In [None]:
class CustomTransformerWithCache(CustomTransformer):
  ...

### Question 17

Create the `CustomGpt2WithCache` class using the `CustomTransformerWithCache` block. Instantiate a `CustomGpt2WithCache` object with the weights of the original GPT-2. The forward will return the KV caches of each Transformer layer in a tuple.

In [None]:
class CustomGpt2WithCache(nn.Module):
  ...

### Question 18

Test the KV cache behaviour by simulating two steps of greedy generation with the cache system:
- Forward a whole sequence and keep the KV cache (step 1). Add the next predicted token to the sequence.
- Feed the new sequence **and the KV cache** to your GPT-2.

Compare the resulting prediction with and without cache, and check that they are similar.

### Question 19

Implement the greedy generation function with KV caching. Compare it to the vanilla greedy generation without cache with `%timeit`.

In [None]:
def greedy_generate_with_cache(model, sentence, ...):
  ...

## Bonus: Streaming LLM

The paper [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/pdf/2309.17453.pdf) proposes a KV caching method that allows models to generate beyond their context window with minimal performance loss. Implement their approach in your KV cache system and measure the resulting performance gaps.