# Natural Language Processing

### 2.3 RNN and Transformers

In this tutorial, we will cover:

- Recurrent Neural Networks
- Self-Attention
- Transformers

Prerequisites:

- Python
- numpy

<br>Prof. Iacopo Masi and Prof. Stefano Faralli

TA: Robert Adrian Minut

# Let us download a zip file with some code and data

In [None]:
# unzip colab zip file with Tiny Shakespeare dataset and model checkpoints
! wget -c https://github.com/iacopomasi/NLP/raw/main/course/AA2324/2_05_transformers_bert/colab.zip
! unzip colab.zip

--2024-09-07 16:02:22--  https://github.com/iacopomasi/NLP/raw/main/course/AA2324/2_05_transformers_bert/colab.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/iacopomasi/NLP/main/course/AA2324/2_05_transformers_bert/colab.zip [following]
--2024-09-07 16:02:23--  https://raw.githubusercontent.com/iacopomasi/NLP/main/course/AA2324/2_05_transformers_bert/colab.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3835010 (3.7M) [application/zip]
Saving to: ‘colab.zip’


2024-09-07 16:02:23 (46.0 MB/s) - ‘colab.zip’ saved [3835010/3835010]

Archive:  colab.zip
   creating: ckpts/
  inflating: ckpt

# The data!

In [None]:
! head -n 100 data/input.txt

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
import numpy as np

# data I/O
data = open('data/input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

data has 1115393 characters, 65 unique.
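
To make the two lookup tables concrete, here is a minimal sketch of the encode/decode round trip and of how inputs and targets are offset by one character. The string `toy_text` and the helpers `encode` and `decode` are illustrative assumptions, not part of the notebook; `sorted` is used only so that the toy mapping is reproducible.

```python
# Toy stand-in for data/input.txt (assumption: any plain-text string works).
toy_text = "hear me speak"

# Same construction as in the cell above, but sorted so the mapping is deterministic.
toy_chars = sorted(set(toy_text))
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}

def encode(text):
    """Map a string to a list of integer indices (hypothetical helper)."""
    return [toy_char_to_ix[ch] for ch in text]

def decode(indices):
    """Map a list of integer indices back to a string (hypothetical helper)."""
    return ''.join(toy_ix_to_char[i] for i in indices)

ids = encode(toy_text)
# For next-character prediction, the target at step t is the input at step t+1.
toy_inputs, toy_targets = ids[:-1], ids[1:]

print(len(toy_chars), 'unique characters')
print(decode(ids) == toy_text)
```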


# Neural Language Modeling

## 1. Recurrent Neural Networks (RNNs)

**Introduction to Recurrent Neural Networks (RNNs)**

Welcome to the exciting world of Recurrent Neural Networks (RNNs)!

RNNs are a class of neural networks that are powerful for modeling sequence data such as time series or natural language.

**What are RNNs?**

RNNs are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, or numerical time series data emanating from sensors, stock markets and government agencies.

**How do RNNs work?**

Unlike traditional neural networks, RNNs have a "memory" which captures information about what has been calculated so far. In essence, they have loops to allow information to persist. This loop structure allows them to take in not just an individual data point, but also a sequence of data points.
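
The "loop" can be sketched as a single function applied repeatedly, where the output state of one step becomes the input state of the next. The sizes, the seeded generator, and the names `rnn_step`, `W_xh`, `W_hh`, and `b_h` below are illustrative assumptions chosen for a toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_hidden, toy_vocab = 4, 3  # toy sizes (assumptions, much smaller than a real model)

W_xh = rng.standard_normal((toy_hidden, toy_vocab)) * 0.01   # input to hidden
W_hh = rng.standard_normal((toy_hidden, toy_hidden)) * 0.01  # hidden to hidden (the "loop")
b_h = np.zeros((toy_hidden, 1))

def rnn_step(x, h_prev):
    """One recurrence step: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

h = np.zeros((toy_hidden, 1))  # empty memory before the sequence starts
for token in [0, 2, 1]:
    x = np.zeros((toy_vocab, 1))
    x[token] = 1        # one-hot encoding of the current token
    h = rnn_step(x, h)  # the same weights are reused at every step

print(h.shape)
```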

**Applications of RNNs**

RNNs are used in a variety of applications where sequential data is involved. This includes:

- Language modeling and generating text
- Speech recognition
- Machine translation
- Image captioning
- Time series prediction

**Advantages of RNNs**

- **Ability to process sequences of variable length:** Unlike feedforward neural networks, RNNs can process inputs of any length.
- **Modeling of temporal dynamics:** Because the hidden state is carried from one step to the next, the prediction at each step can depend on the order and context of what the network has seen so far.

**Challenges with RNNs**

- **Difficulty in training (vanishing gradients):** RNNs are notoriously difficult to train because of the vanishing gradient problem, which is when gradients shrink as they backpropagate through time.
- **Limited memory:** In practice, vanilla RNNs struggle to use information from more than a limited number of steps in the past.
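
The vanishing gradient effect can be observed numerically. The sketch below pushes a pretend gradient backwards through ten time steps of a tanh recurrence and records its norm. The weight scale, the random hidden state, and the all-ones starting gradient are assumptions made for illustration; a trained network will behave differently in the details.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 100
W_hh = rng.standard_normal((size, size)) * 0.01  # small random recurrent weights (assumption)
h = np.tanh(rng.standard_normal((size, 1)))      # a stand-in hidden state (assumption)

grad = np.ones((size, 1))  # pretend gradient arriving at the last time step
norms = []
for _ in range(10):
    # one step of backpropagation through time: through tanh, then through W_hh
    grad = W_hh.T @ ((1 - h * h) * grad)
    norms.append(float(np.linalg.norm(grad)))

print(norms[0], norms[-1])
```

With these assumptions the norm shrinks at every step, so the earliest time steps receive almost no learning signal.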

In the following sections, we will dive deeper into the architecture of RNNs, explore how they are trained, and implement them step by step.

Code credits: [Andrej Karpathy](https://github.com/karpathy).

In [None]:
# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

### 1.1 **EXERCISE 1 💻** RNN forward pass and Loss Function

The loss function is a critical component in training Recurrent Neural Networks (RNNs). It measures the difference between the network's predictions and the data it is trying to model. In character-level language modeling, the RNN predicts a probability distribution over the next character at every step, and the loss is the negative log-probability assigned to the character that actually comes next.

Implement the `lossFun` function, which runs the forward pass of the RNN and accumulates the loss along the way:

- The function takes a sequence of input characters, target characters, and the initial hidden state of the RNN.
- It then performs a forward pass through the network, calculating the loss at each step using a softmax cross-entropy formula.
- The cumulative loss across all steps gives an overall measure of the RNN’s performance.

😱 Note that the backward pass is already implemented, since it can be very difficult to code in plain Python.

⚠ Note: In this code we are reimplementing an RNN using bare numpy and Python. While this can be an interesting academic task, do not use it in production; rely instead on tools such as [PyTorch](https://pytorch.org) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).


In [None]:
# @title 🧑🏿‍💻 Your code here

def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is Hx1 array of initial hidden state

  We first forward the inputs through the model and compute:
    - loss
    - gradients on the parameters
    - the hidden state at the last time step

  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0

  # forward pass
  for t in range(len(inputs)):
    ## Add here the code to perform the
    ## forward pass and loss calculation in the RNN.
    pass

  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

# test the loss function
inputs = [char_to_ix[ch] for ch in data[:seq_length]]
targets = [char_to_ix[ch] for ch in data[1:seq_length+1]]
hprev = np.zeros((hidden_size,1)) # reset RNN memory
loss, dWxh, dWhh, dWhy, dbh, dby, _ = lossFun(inputs, targets, hprev)
print('loss:', loss)


In [None]:
# @title 👀 Solution

def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is Hx1 array of initial hidden state

  We first forward the inputs through the model and compute:
    - loss
    - gradients on the parameters
    - the hidden state at the last time step

  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

# test the loss function
inputs = [char_to_ix[ch] for ch in data[:seq_length]]
targets = [char_to_ix[ch] for ch in data[1:seq_length+1]]
hprev = np.zeros((hidden_size,1)) # reset RNN memory
loss, dWxh, dWhh, dWhy, dbh, dby, _ = lossFun(inputs, targets, hprev)
print('loss:', loss)

loss: 104.35454377851322
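
The printed value is close to what a model with no knowledge of the data would score. With weights initialized near zero, the softmax is almost uniform over the vocabulary, so each step costs about `-log(1/vocab_size)`. The sketch below redoes that arithmetic; the names `vocab_size_example` and `seq_length_example` are illustrative and simply copy the values reported in this notebook.

```python
import math

vocab_size_example = 65   # number of unique characters reported above
seq_length_example = 25   # value of seq_length used in this notebook

# A model that assigns probability 1/vocab_size to every character pays
# -log(1/vocab_size) per step; the sequence loss is the sum over all steps.
per_char_loss = -math.log(1.0 / vocab_size_example)
uniform_loss = per_char_loss * seq_length_example

print(round(per_char_loss, 4), round(uniform_loss, 4))
```

The result is roughly 104.36, which is why an untrained network starts at a loss near 104.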


### 1.2 **EXERCISE 2** 💻 Inference: Forward pass + Sampling



**Objective**

- The goal of this exercise is to implement a function that can sample a sequence of integers from the model, which represents a sequence of characters in text generation.

**Key Components**

- `sample` function: This function takes the last hidden state `h`, a seed index `seed_ix`, and the number of characters `n` to generate.

- Sampling Process:
  - Initializes an empty list `ixes` to store the generated sequence.
  - For each step in the range `n`, it performs the following:
    - Updates the hidden state `h`.
    - Computes the output probabilities for the next character.
    - Samples a new character index from the probability distribution.
    - Appends the new index to the list `ixes`.
  - Returns the list `ixes` as the output sequence.

**Implementation Details**

The function uses a for loop to iterate over the desired number of characters to generate.

At each iteration, it updates the hidden state and calculates the output probabilities using the model’s parameters.
It then samples a new character index based on these probabilities and appends it to the sequence.
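
The sampling step on its own can be sketched as follows. The logits are made-up numbers, and the seeded `default_rng` generator is an assumption used only to keep the sketch repeatable; the notebook itself relies on `np.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

logits = np.array([[2.0], [1.0], [0.1]])     # toy unnormalized scores (assumption)
p = np.exp(logits) / np.sum(np.exp(logits))  # softmax, as in the forward pass

# Draw many indices: each index should appear roughly in proportion to its probability.
draws = rng.choice(len(p), size=10_000, p=p.ravel())
freq = np.bincount(draws, minlength=len(p)) / len(draws)

# Deterministic alternative (not what the exercise asks for): always take the most likely index.
greedy_ix = int(np.argmax(p))

print(p.ravel().round(3), freq.round(3), greedy_ix)
```

Sampling keeps some randomness in the generated text, whereas the greedy choice returns the same index every time for the same probabilities.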

In [None]:
# @title 🧑🏿‍💻 Your code here

def sample(h, seed_ix, n):
  """
  sample a sequence of integers from the model
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for _ in range(n):
    pass
  return ixes

In [None]:
# @title 👀 Solution

def sample(h, seed_ix, n):
  """
  sample a sequence of integers from the model
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for _ in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes

### Testing the inference...

In [None]:
# test sampling
hprev = np.zeros((hidden_size,1)) # reset RNN memory
sample_ix = sample(hprev, char_to_ix['a'], 200)
txt = ''.join(ix_to_char[ix] for ix in sample_ix)
print('----\n %s \n----' % (txt, ))

----
 WMRE,VA3GbzALEC.XXaTkZvEIhCGFUj
 zjc?$wHVhxGxLVjHaqds;YOGONu-AePahka.WkqJBC;CndKiG'vCJWx
DcQkbgvD
bAtbs?apa.ImmMdCREi;GzPcyoGJdlKYzwIicXjTnuVxRo?t
c,V;WP il?il$,a.gymQdPIvdRKNgoCYw::3PRYf:-jFp,T!VRke! 
----


### 1.3 Optimization

Here's a breakdown of the optimization steps:

- **Initialization**: Adagrad needs one memory variable per parameter (`mWxh`, `mWhh`, `mWhy` for the weight matrices and `mbh`, `mby` for the biases). Each one accumulates the running sum of squared gradients for its parameter.
- **Loss Initialization**: `smooth_loss` is initialized to the negative log-likelihood of a uniform distribution over the vocabulary, summed over `seq_length` steps.
- **Training Loop**: The section includes an infinite `while` loop, so training runs until you interrupt it manually.
  - **Data Preparation**: Sequences of inputs and targets are prepared for each training step.
  - **Sampling**: Occasionally, the model samples outputs to provide a glimpse of the learning progress.
  - **Forward and Backward Passes**: The model performs forward and backward passes to compute the loss and gradients.
  - **Parameter Update**: An Adagrad update is applied to the parameters using the computed gradients.
  - **Progress Tracking**: The `smooth_loss` is updated to track the progress of training.
  - **Checkpointing**: Every 1000 iterations, the parameters are saved to `ckpts/rnn.npz` if `smooth_loss` has improved on the best value seen so far.
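
The Adagrad update in isolation can be sketched on a toy problem. The helper `adagrad_step` and the quadratic objective are illustrative assumptions; only the two update lines mirror what the training loop does for each parameter.

```python
import numpy as np

def adagrad_step(param, grad, mem, lr=1e-1, eps=1e-8):
    """In-place Adagrad update; `mem` accumulates squared gradients (illustrative helper)."""
    mem += grad * grad
    param += -lr * grad / np.sqrt(mem + eps)

# Toy problem (assumption): minimize f(w) = sum(w**2), whose gradient is 2*w.
w = np.array([5.0, -3.0])
mem = np.zeros_like(w)
start = float(np.sum(w ** 2))
for _ in range(500):
    adagrad_step(w, 2 * w, mem)

print(start, float(np.sum(w ** 2)))
```

Because `mem` only grows, the effective step size of each coordinate shrinks over time, which is the defining property of Adagrad.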

**Forward Pass**

![Forward Pass Computations](https://github.com/iacopomasi/NLP/blob/main/course/AA2324/2_03_seq_processing/figs/rnn_glance.png?raw=true)

**Backward Pass and BPTT**

![Backward Pass Computations](https://github.com/iacopomasi/NLP/blob/main/course/AA2324/2_03_seq_processing/figs/rnn_backprop/Slide58.png?raw=true)

## The Main Loop

In [None]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = best_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while True:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0:
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # print progress
  if n % 1000 == 0 and smooth_loss < best_loss:
      best_loss = smooth_loss
      kw_save_params = {k:v for v, k in zip([Wxh, Whh, Why, bh, by],
                                            ['Wxh','Whh','Why','bh','by'])}
      np.savez_compressed('ckpts/rnn.npz', **kw_save_params)

  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                [dWxh, dWhh, dWhy, dbh, dby],
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter

----
 nJ3
OHzAjd
3v$!T-?z-?$cvnQsBADckDa LekqxPtJv!ZfAIJqFwU.QaU;3:W xSE!oiD!OscUeufB$Gc$'gDmiVxo?yTl-,yz
z-lOu!wvXtIbQ!YWYiAREV?&J$GnQcMtjrCcnE3etjR dKbofzI&i,DZ?L:t!ysF:sOC-ZPEzIJRC;GODGL!M:M!l'mHjxNtgN.z 
----
iter 0, loss: 104.359677
----
 retiea.nreatd totiaSuasi.t  n
devr sh v   nsh rhgYbgc y .Y sr tasiososwyah;ruSt?eowsi
t aepch tLyoefo    hw
l rhwtegoetukzxctgdhshashssnrQ en tCSe ih WeSithh 
oeWrrh : iheg wthwha iahostHgdat:Ttboh s  
----
iter 100, loss: 104.381504
----
  nbeB'egeeaewi hahyT oknThithTwle b meyfmagediN ? rT,onult
:Ens
srdlfiuaI u:nrh  ovscoi ke
T  eselaTslerh,eMMoe,akeTr,le   ?doe
rtWinaeruu eca fahoemvl
m

daettlbWe,oh HheTc h
Ae
s
esoiaestoosw nene t 
----
iter 200, loss: 102.661118
----
  feliIth He Hoss 
urmeihrek tsah wy ;?sh siEd
hyMAdsvshtn 
WRitbt

teeto r
nsayn gdmst ih oMwyit 
z if ansehb nbewf's l piiten
s 
iety tto etocgf!
  ,a r, btgltMpq i t
y

Mfkli Ud ,nisoYheih,o wira
se 
----
iter 300, loss: 101.022314
----
 acm,
Xtenheace -zsdLyiuweinoneswlloi

KeyboardInterrupt: 

## Generate

In [None]:
# test sampling after training
hprev = np.zeros((hidden_size,1)) # reset RNN memory
sample_ix = sample(hprev, char_to_ix['a'], 200)
txt = ''.join(ix_to_char[ix] for ix in sample_ix)
print('----\n %s \n----' % (txt, ))

----
 kethue hedare heat; ein a prous frumos.

QUEEN:
Stor your'st and 'To cray beorstes lecks;
It are, and deatte my faite exouth
'Beat. wonch soke you.

GLUCUS:
Whate's chank, atinother,
Prane deat
Be dom 
----


## Load params

In [None]:
# load parameters from npz file if needed
rnn_params = np.load('ckpts/rnn.npz')
Wxh, Whh, Why, bh, by = [rnn_params[k] for k in ['Wxh','Whh','Why','bh','by']]

## 2. Transformers (MiniGPT)


We will now tackle the same task with a more recent architecture for Language Modeling (LM).
Instead of an RNN, we will use a Transformer architecture called MiniGPT.

In [None]:
"""
Credits: Andrej Karpathy. Reference: https://github.com/karpathy/minGPT/blob/master/mingpt/model.py
"""

import math

import torch
import torch.nn as nn
from torch.nn import functional as F

### 2.1 Activation Function

The GELU (Gaussian Error Linear Unit) activation function is a non-linear activation function used in neural networks, particularly in the field of Natural Language Processing (NLP). It was introduced in the paper “Gaussian Error Linear Units (GELUs)” by Dan Hendrycks and Kevin Gimpel.

Reasoning Behind GELU:

- Smooth Approximation to ReLU: GELU provides a smooth curve that approximates the ReLU (Rectified Linear Unit) function. Unlike ReLU, which abruptly cuts off values below zero, GELU allows for a small gradient when the input is negative or around zero. This can help mitigate the “dying ReLU” problem where neurons stop learning during training due to zero gradients.
- Stochastic Regularization: GELU itself is deterministic, but it can be derived as the expected value of a stochastic regularizer that multiplies the input by a random 0/1 mask, where the probability of keeping the input grows with its value. In this sense it combines ideas from dropout and ReLU in a single activation.
- Non-monotonicity: GELU is non-monotonic for negative inputs: it dips slightly below zero for moderately negative values and returns toward zero as the input becomes very negative. This curvature can help the model learn more complex patterns.

Comparison to Other Activation Functions:

- ReLU: GELU is smoother than ReLU and doesn’t suffer from the dying ReLU problem. However, ReLU is simpler and computationally less expensive.
- Sigmoid/Tanh: Unlike Sigmoid and Tanh, GELU does not saturate for large positive inputs, where it behaves like the identity, so gradients there are not squashed during backpropagation. For large negative inputs it does flatten toward zero, much like ReLU.
- Leaky ReLU/PReLU: While Leaky ReLU and Parametric ReLU (PReLU) also address the dying ReLU problem by allowing a small, non-zero gradient when the input is negative, GELU provides a probabilistic interpretation and smoother transition.
- Swish: Swish is another smooth activation function similar to GELU. Both have been shown to perform well in deep learning models, but GELU has gained popularity in transformer-based architectures like BERT and GPT.

<br>

$$
\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715x^3\right)\right]\right)
$$


✋ Note that, unlike before, we will now use **PyTorch** as the main framework to implement the Transformer.

So, for example, we will derive a class from `nn.Module`.



### **EXERCISE 3** 💻
Implement the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).

    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415

Please implement it in PyTorch, NOT numpy, as the gradients have to flow through the computational graph.

So use `torch.tanh` and do NOT use `np.tanh`, etc.

In [None]:
# @title 🧑🏿‍💻 Your code here

class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return None

# test the NewGELU activation function
x = torch.ones(1)
gelu = NewGELU()
print(gelu(x))

None


In [None]:
# @title 👀 Solution

class NewGELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

# test the NewGELU activation function
x = torch.ones(1)
gelu = NewGELU()
print(gelu(x))

tensor([0.8412])
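
As a sanity check, the tanh formula can be compared with the exact definition of GELU, `x * Phi(x)`, where `Phi` is the standard normal CDF. The sketch below uses plain Python floats rather than tensors, so it is only a numerical check and not a replacement for the PyTorch module; the function names `gelu_exact` and `gelu_tanh` are illustrative.

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation: the same formula as in the solution, on plain floats."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, round(gelu_exact(x), 4), round(gelu_tanh(x), 4))
```

At `x = 1` both versions agree with the value printed by the solution cell to about three decimal places.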


### 2.2 Self-Attention

<img src="https://github.com/iacopomasi/NLP/blob/main/course/AA2324/2_05_transformers_bert/figs/self-attn.png?raw=1" alt="self-attention compute diagram" width="800"/>

Source: [magazine.sebastianraschka.com](https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention)

**Exercise**

Let's implement the `CausalSelfAttention` class, which is a crucial component of the Transformer architecture. This class handles the **multi-head masked self-attention mechanism**, allowing the model to focus on different parts of the input sequence when making predictions.

Here's a breakdown of what you are expected to code:

- **Initialization**: We already defined for you the `__init__` method to initialize the key, query, and value projections for all heads (`self.c_attn`), the output projection (`self.c_proj`), dropout layers for attention and residuals (`self.attn_dropout`, `self.resid_dropout`), and a causal mask to ensure attention is only applied to the left in the input sequence.

- **Forward Pass**: In the `forward` method, you should:
  - Reshape and transpose the input `x` to get separate projections for query, key, and value for each attention head.
  - Perform the dot product between queries and keys, scale it, and apply the causal mask.
  - Apply softmax to the scaled dot product to get the attention weights, followed by the dropout.
  - Multiply the attention weights by the values to get the output of the self-attention layer.
  - Project the concatenated output of all heads through the final linear layer and apply dropout.

- **Causal Mask**: The causal mask is **created to ensure that during self-attention, each position can only attend to previous positions and itself**, preventing information flow from future positions.

The solution code provides a detailed implementation of these steps, which students can use as a reference to understand the mechanics of self-attention in the Transformer model. The code is expected to be written within the PyTorch framework, utilizing classes and methods from the `torch.nn` module.
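
The effect of the causal mask can be sketched on a tiny example. The block below uses NumPy as a stand-in purely to show the arithmetic; the exercise itself should stay in PyTorch. The sequence length and the all-zero scores are assumptions chosen so that the result is easy to read.

```python
import numpy as np

T = 4  # toy sequence length (assumption)
scores = np.zeros((T, T))  # pretend scaled dot products, all equal for readability

# Lower-triangular matrix of ones: row i marks the positions that token i may attend to.
mask = np.tril(np.ones((T, T)))

# Future positions (strictly above the diagonal) get -inf, so softmax gives them zero weight.
masked = np.where(mask == 0, -np.inf, scores)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)

print(weights)
```

Row `i` of `weights` spreads its attention evenly over positions `0..i` and assigns exactly zero to every later position.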


In [None]:
# @title 🧑🏿‍💻 Your code here

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        y = None

        # 1. Calculate query, key, values for all heads in batch and move head forward to be the batch dim

        # 2. Causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)

        # 3. Mask out the upper triangle of the dot product matrix (future positions), excluding the diagonal

        # 4. Apply the softmax and the attention dropout, multiply by the values, and re-assemble the heads

        # 5. Apply the output projection followed by the residual dropout

        return y

In [None]:
# @title 👀 Solution

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        # B = batch size
        # nh = number of heads
        # T = sequence length
        # hs = head size, i.e. the per-head embedding dimension (C // nh)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        # att is always quadratic (T x T) with respect to the sequence length

        # mask out the upper triangle of the dot product matrix (future positions), excluding the diagonal
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))

        # apply the softmax and dropout, then take the weighted sum of the values
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection followed by residual dropout
        y = self.resid_dropout(self.c_proj(y))

        return y

In [None]:
def test_self_attention():
    config = type('Config', (object,), {})()
    config.n_embd = 768
    config.n_head = 12
    config.block_size = 128
    config.attn_pdrop = 0.1
    config.resid_pdrop = 0.1
    attn = CausalSelfAttention(config)
    x = torch.ones(1, 128, 768)
    y = attn(x)
    print(y.shape)
    print(y[0, 0, :10])

test_self_attention()

torch.Size([1, 128, 768])
tensor([ 0.0032,  0.1618, -0.2629, -0.0000, -1.0349,  0.2537,  0.5518, -0.1566,
        -0.0092, -0.2230], grad_fn=<SliceBackward0>)
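
The reshaping that splits the embedding into heads, and later merges the heads back, is often the most confusing part of the layer. The sketch below replays only that bookkeeping in NumPy, with `reshape` and `transpose` standing in for the `view` and `transpose` calls of the solution; all sizes are toy values chosen for illustration.

```python
import numpy as np

B, T, C, n_head = 2, 5, 12, 3   # toy sizes (assumptions); C must be divisible by n_head
hs = C // n_head                # per-head size

x = np.arange(B * T * C, dtype=float).reshape(B, T, C)

# Split the channel axis into (n_head, hs), then move the head axis next to the batch axis.
heads = x.reshape(B, T, n_head, hs).transpose(0, 2, 1, 3)   # (B, nh, T, hs)

# Head 0 sees the first hs channels of every token, head 1 the next hs, and so on.
first_head = heads[:, 0]                                     # (B, T, hs)

# Undo the split: this mirrors y.transpose(1, 2).contiguous().view(B, T, C) in the solution.
merged = heads.transpose(0, 2, 1, 3).reshape(B, T, C)

print(heads.shape, merged.shape)
```

Splitting and merging are exact inverses, so no information is lost; each head simply works on its own slice of the channels.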


### 2.3 EXERCISE 4 💻 Transformer Block


You will now put together what we've seen so far into a single block of the Transformer model, which includes self-attention and a feed-forward neural network.

**Components**

- Layer Normalization: Applied before self-attention and before the feed-forward network for stable training. Use it straight from PyTorch with `nn.LayerNorm`.
- Causal Self-Attention: A multi-head self-attention mechanism that ensures the model only attends to earlier positions in the sequence.
- Feed-Forward Neural Network (MLP):
  - Consists of two linear transformations with a GELU non-linearity in between.
  - The first linear transformation expands the dimensionality, and the second one projects it back.
  - Dropout is applied after the second linear transformation.
- Residual Connections: Used around both the self-attention and the feed-forward network to help with gradient flow.
- Forward Pass:
  - The input goes through layer normalization, followed by self-attention.
  - The output of self-attention is then passed through another layer normalization and the feed-forward network.
  - Residual connections are added at both stages.
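
The pattern "normalize, apply a sublayer, add the result back to the input" can be sketched independently of attention. The NumPy block below is only an illustration of that pattern: `layer_norm` omits the learned scale and shift that `nn.LayerNorm` has by default, and `toy_sublayer` is a made-up stand-in for attention or the MLP.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_residual(x, sublayer):
    """Residual connection around a sublayer that sees the normalized input."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))   # (batch, sequence, embedding), toy sizes

def toy_sublayer(z):
    """Stand-in for attention or the MLP (assumption)."""
    return 0.1 * z

out = pre_ln_residual(x, toy_sublayer)
print(out.shape)
```

A Transformer block applies this pattern twice in a row, first with self-attention and then with the feed-forward network as the sublayer.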

In [None]:
# @title 🧑🏿‍💻 Your code here

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        # init all the params you need here

    def forward(self, x):
        #   - Forward Pass:
        #   - The input goes through layer normalization, followed by self-attention.
        #   - The output of self-attention is then passed through another layer
        #     normalization and the feed-forward network.
        #   - Residual connections are added at both stages.
        return x

In [None]:
# @title 👀 Solution

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward

    def forward(self, x):
        x = x + self.attn(self.ln_1(x)) # x + ... is the residual connection
        x = x + self.mlpf(self.ln_2(x))
        return x
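
To get a feel for the size of one block, its learnable parameters can be counted by hand. The helper `block_param_count` below is an illustrative sketch that follows the layer shapes of the `Block` solution; the causal mask is a buffer, and the activation and dropout layers have no parameters, so none of them are counted.

```python
def block_param_count(n_embd):
    """Count learnable parameters in one Block (weights + biases), following the solution."""
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd   # c_attn
    attn_proj = n_embd * n_embd + n_embd          # c_proj of the attention layer
    mlp_fc = n_embd * 4 * n_embd + 4 * n_embd     # c_fc
    mlp_proj = 4 * n_embd * n_embd + n_embd       # c_proj of the MLP
    layer_norms = 2 * (2 * n_embd)                # ln_1 and ln_2: scale + shift each
    return attn_qkv + attn_proj + mlp_fc + mlp_proj + layer_norms

for n_embd in (48, 192, 768):
    print(n_embd, block_param_count(n_embd))
```

The total simplifies to `12 * n_embd**2 + 13 * n_embd`, so the cost of a block grows quadratically with the embedding size.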

## 3. GPT Model

In [None]:
# @title GPT model boilerplate

class GPT(nn.Module):
    """ GPT Language Model """

    @staticmethod
    def get_default_config():
        '''C = {'model_type': None, 'n_layer': None, 'n_head': None, 'n_embd': None, # either model_type or (n_layer, n_head, n_embd) must be given in the config
             'vocab_size': None, 'block_size': None, # these options must be filled in externally
             'embd_pdrop': 0.1, 'resid_pdrop': 0.1, 'attn_pdrop': 0.1,} # dropout hyperparameters'''

        class C:
            def __init__(self):
                self.model_type = None
                self.n_layer = None
                self.n_head = None
                self.n_embd = None
                self.vocab_size = None
                self.block_size = None
                self.embd_pdrop = 0.1
                self.resid_pdrop = 0.1
                self.attn_pdrop = 0.1

            def merge_from_dict(self, kwargs):
                for k, v in kwargs.items():
                    setattr(self, k, v)

        return C()

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.block_size = config.block_size

        type_given = config.model_type is not None
        params_given = all([config.n_layer is not None, config.n_head is not None, config.n_embd is not None])
        assert type_given ^ params_given # exactly one of these (XOR)
        if type_given:
            # translate from model_type to detailed configuration
            config.merge_from_dict({
                # names follow the huggingface naming conventions
                # GPT-1
                'openai-gpt':   dict(n_layer=12, n_head=12, n_embd=768),  # 117M params
                # GPT-2 configs
                'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
                'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
                'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
                'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
                # Gophers
                'gopher-44m':   dict(n_layer=8, n_head=16, n_embd=512),
                # (there are a number more...)
                # I made these tiny models up
                'gpt-mini':     dict(n_layer=6, n_head=6, n_embd=192),
                'gpt-micro':    dict(n_layer=4, n_head=4, n_embd=128),
                'gpt-nano':     dict(n_layer=3, n_head=3, n_embd=48),
            }[config.model_type])

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # init all weights, and apply a special scaled init to the residual projections, per GPT-2 paper
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.zeros_(module.bias)
            torch.nn.init.ones_(module.weight)

    @classmethod
    def from_pretrained(cls, model_type):
        """
        Initialize a pretrained GPT model by copying over the weights
        from a huggingface/transformers checkpoint.
        """
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel

        # create a from-scratch initialized minGPT model
        config = cls.get_default_config()
        config.model_type = model_type
        config.vocab_size = 50257 # openai's model vocabulary
        config.block_size = 1024  # openai's model block_size
        model = GPT(config)
        sd = model.state_dict()

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        keys = [k for k in sd_hf if not k.endswith('attn.masked_bias')] # ignore these
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla nn.Linear.
        # this means that we have to transpose these weights when we import them
        assert len(keys) == len(sd)
        for k in keys:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def configure_optimizers(self, train_config):
        """
        This long function is unfortunately doing something very simple and is being very defensive:
        We are separating out all parameters of the model into two buckets: those that will experience
        weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
        We are then returning the PyTorch optimizer object.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name
                # random note: because named_modules and named_parameters are recursive
                # we will see the same tensors p many many times. but doing it this way
                # allows us to know which parent module any tensor p belongs to...
                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

### 3.1 EXERCISE 5 💻 Forward Pass and Generation


Let's implement the most important methods of our GPT model.

**Forward Pass**

- **Input**: The forward pass takes a batch of token indices of shape `(b, t)` and, optionally, target ids of the same shape.
- **Positional Encoding**: The model looks up a learned position embedding for each position `0..t-1` and adds it to the token embedding, so that the order of the tokens is preserved.
- **Transformer Blocks**: The summed embeddings are passed through a stack of transformer blocks, each consisting of self-attention and feed-forward layers.
- **Output Logits**: After a final layer normalization, the language-model head maps every token position to a vector of logits, i.e. unnormalized scores over the vocabulary; a softmax turns them into a probability distribution over the next token.
- **Loss**: If target ids are provided, compute the loss between the predicted logits and the targets using `F.cross_entropy`.
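
As a minimal sketch of the loss step, assuming toy sizes and random logits that are not part of the notebook: the snippet flattens the `(b, t, vocab_size)` logits to `(b*t, vocab_size)` before calling `F.cross_entropy`, and uses `ignore_index=-1` to skip positions without a target. The manual computation at the end is only there to illustrate what the call averages over.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, t, vocab_size = 2, 4, 5  # toy sizes, chosen only for illustration

logits = torch.randn(b, t, vocab_size)          # stand-in for the model output
targets = torch.randint(0, vocab_size, (b, t))  # stand-in for the target ids
targets[:, -1] = -1                             # mark the last position as "no target"

# cross_entropy expects (N, C) scores and (N,) class ids, so flatten batch and time
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1), ignore_index=-1)

# manual check: mean negative log-probability of the target over non-ignored positions
log_probs = F.log_softmax(logits, dim=-1)
mask = targets != -1
picked = log_probs[mask].gather(1, targets[mask].unsqueeze(1)).squeeze(1)
manual = -picked.mean()

print(loss.item(), manual.item())
```

The two printed values should agree up to floating-point error, since the default `reduction='mean'` averages only over the positions that are not ignored.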

**Generation**

- **Sampling**: The generation process involves sampling new tokens based on the output logits. This can be done greedily by picking the most likely next token or stochastically by sampling from the probability distribution.
- **Temperature**: A temperature parameter can be used to control the randomness of the sampling process. A lower temperature makes the model more confident (less random), while a higher temperature makes the sampling more diverse (more random).
- **Top-k Sampling**: Optionally, the model can limit the sampling pool to the top-k most likely next tokens to prevent unlikely tokens from being selected.
- **Sequence Completion**: The model keeps sampling and appending new tokens, feeding the extended sequence back in at each step, until it reaches a specified maximum number of new tokens (`max_new_tokens` in this exercise).
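
To make the effect of temperature and top-k concrete, here is a small sketch on a hypothetical 4-token vocabulary. The helper name `next_token_probs` and the logit values are assumptions made for illustration; they are not part of the exercise code.

```python
import torch
import torch.nn.functional as F

# toy scores for a 4-token vocabulary (made-up values), shape (batch=1, vocab=4)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Hypothetical helper: temperature-scale the logits, optionally keep the top k, then softmax."""
    scaled = logits / temperature
    if top_k is not None:
        # value of the k-th largest logit in each row
        kth = torch.topk(scaled, k=top_k, dim=-1).values[:, [-1]]
        # everything below that threshold gets probability zero after the softmax
        scaled = scaled.masked_fill(scaled < kth, float('-inf'))
    return F.softmax(scaled, dim=-1)

p_cold = next_token_probs(logits, temperature=0.5)  # sharper distribution
p_base = next_token_probs(logits, temperature=1.0)
p_hot = next_token_probs(logits, temperature=2.0)   # flatter distribution
p_top2 = next_token_probs(logits, top_k=2)          # only the two best tokens survive

print(p_cold, p_base, p_hot, p_top2, sep='\n')

# stochastic decoding draws from the distribution, greedy decoding takes the argmax
sampled = torch.multinomial(p_top2, num_samples=1)
greedy = p_base.argmax(dim=-1, keepdim=True)
```

Lowering the temperature moves probability mass toward the highest-scoring token, while raising it spreads the mass out. With `top_k=2`, the two lowest-scoring tokens can never be sampled.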

The model that you receive as input is of type `GPT`, defined above.

The components you have to use are:
```python
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
```
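
The following sketch shows, under toy sizes that are assumed purely for illustration, how a token embedding table and a position embedding table like `wte` and `wpe` combine. The position indices have shape `(1, t)`, so the position embeddings broadcast over the batch dimension when added to the token embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 10, 8, 4  # toy sizes, not the notebook's configuration

wte = nn.Embedding(vocab_size, n_embd)  # one vector per token id
wpe = nn.Embedding(block_size, n_embd)  # one vector per position

idx = torch.tensor([[1, 2, 3],
                    [4, 5, 6]])                       # token ids, shape (b=2, t=3)
pos = torch.arange(idx.size(1), dtype=torch.long,
                   device=idx.device).unsqueeze(0)    # positions 0..t-1, shape (1, 3)

tok_emb = wte(idx)     # shape (2, 3, 4)
pos_emb = wpe(pos)     # shape (1, 3, 4)
x = tok_emb + pos_emb  # broadcasts to shape (2, 3, 4)

print(tok_emb.shape, pos_emb.shape, x.shape)
```

Every sequence in the batch receives the same position vectors; only the token vectors differ between rows.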

In [None]:
# @title 🧑🏿‍💻 Your code here

def forward(model, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= model.block_size, f"Cannot forward sequence of length {t}, block size is only {model.block_size}"

    logits = None
    loss = None

    # 1. Get the position of each token in the sequence
    # use torch.arange

    # 2. Forward to the model itself

    # 3. If given some desired targets, also calculate the loss

    return logits, loss

In [None]:
# @title 👀 Solution

def forward(model, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= model.block_size, f"Cannot forward sequence of length {t}, block size is only {model.block_size}"

    pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

    # forward the GPT model itself
    tok_emb = model.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
    pos_emb = model.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
    x = model.transformer.drop(tok_emb + pos_emb) # embeddings dropout
    for block in model.transformer.h: # loop over transformer blocks
        x = block(x)

    x = model.transformer.ln_f(x) # LN on the last block
    logits = model.lm_head(x) # decoder head

    # if we are given some desired targets also calculate the loss
    loss = None
    if targets is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

    return logits, loss

In [None]:
# @title 🧑🏿‍💻 Your code here

def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    """
    Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
    the sequence max_new_tokens times, feeding the predictions back into the model each time.
    Most likely you'll want to make sure to be in model.eval() mode of operation for this.
    """
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # 1. If the sequence context is growing too long we must crop it at block_size

            # 2. Forward the model to get the logits for the index in the sequence

            # 3. Pluck the logits at the final step and scale by desired temperature

            # 4. Optionally crop the logits to only the top k options

            if top_k is not None:
                pass

            # 5. Apply softmax to convert logits to (normalized) probabilities

            # 6. Either sample from the distribution or take the most likely element
            if do_sample:
                pass
            else:
                pass

            # 7. Append sampled index to the running sequence and continue

    return idx

In [None]:
# @title 👀 Solution

def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    """
    Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
    the sequence max_new_tokens times, feeding the predictions back into the model each time.
    Most likely you'll want to make sure to be in model.eval() mode of operation for this.
    """
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # 1. If the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= model.block_size else idx[:, -model.block_size:]
            # 2. Forward the model to get the logits for the index in the sequence
            logits, _ = forward(model, idx_cond)
            # 3. Pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # 4. Optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, k=top_k, dim=-1)
                logits[logits < v[:, [-1]]] = -float('Inf')
            # 5. Apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # 6. Either sample from the distribution or take the most likely element
            if do_sample:
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                _, idx_next = torch.topk(probs, k=1, dim=-1)
            # 7. Append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

    return idx

In [None]:
# @title Forward Test

def test_forward():
    config = GPT.get_default_config()
    config.model_type = 'gpt-nano'
    config.vocab_size = 256
    config.block_size = 128

    torch.manual_seed(42) # set torch seed for reproducibility

    model = GPT(config)

    idx = torch.randint(0, 256, (1, 128))
    with torch.no_grad():
        logits, loss = forward(model, idx)

    print(logits.shape, loss)
    print(logits[0, 0, :10])

test_forward()

number of parameters: 0.10M
torch.Size([1, 128, 256]) None
tensor([-0.0288,  0.1416, -0.0897,  0.1571,  0.0139,  0.1899,  0.1623,  0.0366,
        -0.1775, -0.1000])


In [None]:
# @title Generate Test

def test_generate():
    config = GPT.get_default_config()
    config.model_type = 'gpt-nano'
    config.vocab_size = 256
    config.block_size = 128

    torch.manual_seed(42) # set torch seed for reproducibility

    model = GPT(config)
    model.eval()

    idx = torch.randint(0, 256, (1, 128))
    idx = generate(model, idx, 20)

    print(idx.shape)
    print(idx)

test_generate()

number of parameters: 0.10M
torch.Size([1, 148])
tensor([[156,  33,  99,  94,  60, 118, 220, 213, 198,  56, 200,  17,  11, 210,
         226, 194, 255, 234,  31, 199, 193,   9,  56,  98, 138, 219,  79,  50,
         125, 241, 168, 164,  94, 253,  26, 159, 111,  31, 199,   2, 255, 246,
         228, 189, 205,  76, 203, 229,  81, 130,   4, 200,   4, 166, 189,  46,
          52,  78, 241,  50, 230,  85, 222,  84,  68, 159,  10,  79,  25,  58,
          95, 222,  79, 121, 159, 155,   2, 211, 193,  15, 116,   8, 241, 244,
         176,  30, 195,  20, 198,  74,  90, 170, 124,  65,  37, 243,  60, 222,
          72, 213, 102, 212,  28, 166,   7, 237,  59,  97, 183, 126, 193,  87,
         217,  97, 179, 166,  56,  44, 118, 167,   2, 145, 163,  68, 225, 149,
          65, 182, 129,  22,  46,  31, 132, 132, 132, 132, 132, 132, 132, 132,
         132, 132, 132, 132, 132, 132, 132, 132]])


In [None]:
# @title Putting it all together: GPT model

class GPT(nn.Module):
    """ GPT Language Model """

    @staticmethod
    def get_default_config():
        '''C = {'model_type': None, 'n_layer': None, 'n_head': None, 'n_embd': None, # either model_type or (n_layer, n_head, n_embd) must be given in the config
             'vocab_size': None, 'block_size': None, # these options must be filled in externally
             'embd_pdrop': 0.1, 'resid_pdrop': 0.1, 'attn_pdrop': 0.1,} # dropout hyperparameters'''

        class C:
            def __init__(self):
                self.model_type = None
                self.n_layer = None
                self.n_head = None
                self.n_embd = None
                self.vocab_size = None
                self.block_size = None
                self.embd_pdrop = 0.1
                self.resid_pdrop = 0.1
                self.attn_pdrop = 0.1

            def merge_from_dict(self, kwargs):
                for k, v in kwargs.items():
                    setattr(self, k, v)

        return C()

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.block_size = config.block_size

        type_given = config.model_type is not None
        params_given = all([config.n_layer is not None, config.n_head is not None, config.n_embd is not None])
        assert type_given ^ params_given # exactly one of these (XOR)
        if type_given:
            # translate from model_type to detailed configuration
            config.merge_from_dict({
                # names follow the huggingface naming conventions
                # GPT-1
                'openai-gpt':   dict(n_layer=12, n_head=12, n_embd=768),  # 117M params
                # GPT-2 configs
                'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
                'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
                'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
                'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
                # Gophers
                'gopher-44m':   dict(n_layer=8, n_head=16, n_embd=512),
                # (there are a number more...)
                # I made these tiny models up
                'gpt-mini':     dict(n_layer=6, n_head=6, n_embd=192),
                'gpt-micro':    dict(n_layer=4, n_head=4, n_embd=128),
                'gpt-nano':     dict(n_layer=3, n_head=3, n_embd=48),
            }[config.model_type])

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # init all weights, and apply a special scaled init to the residual projections, per GPT-2 paper
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.zeros_(module.bias)
            torch.nn.init.ones_(module.weight)

    @classmethod
    def from_pretrained(cls, model_type):
        """
        Initialize a pretrained GPT model by copying over the weights
        from a huggingface/transformers checkpoint.
        """
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel

        # create a from-scratch initialized minGPT model
        config = cls.get_default_config()
        config.model_type = model_type
        config.vocab_size = 50257 # openai's model vocabulary
        config.block_size = 1024  # openai's model block_size
        model = GPT(config)
        sd = model.state_dict()

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        keys = [k for k in sd_hf if not k.endswith('attn.masked_bias')] # ignore these
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla nn.Linear.
        # this means that we have to transpose these weights when we import them
        assert len(keys) == len(sd)
        for k in keys:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def configure_optimizers(self, train_config):
        """
        This long function is unfortunately doing something very simple and is being very defensive:
        We are separating out all parameters of the model into two buckets: those that will experience
        weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
        We are then returning the PyTorch optimizer object.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name
                # random note: because named_modules and named_parameters are recursive
                # we will see the same tensors p many many times. but doing it this way
                # allows us to know which parent module any tensor p belongs to...
                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # either sample from the distribution or take the most likely element
            if do_sample:
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                _, idx_next = torch.topk(probs, k=1, dim=-1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

### 3.2 EXERCISE 6 💻 GPT Model Training


**Objective**

This exercise guides you through training a GPT-like language model. It involves setting up the training loop, managing data inputs, updating the model parameters based on the loss function, and periodically saving a checkpoint of the model.

**Components**

- **Data Preparation**: Organize the input data into batches suitable for training the model.
- **Loss Function**: Calculate the loss value to evaluate the model's predictions against the actual target outputs.
- **Optimizer**: Set up an optimizer that will adjust the model's weights based on the computed gradients to minimize the loss.
- **Training Loop**: Create a loop that repeatedly feeds data into the model, computes the loss, updates the model's parameters, and periodically saves a checkpoint of the model.
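
For next-token prediction, the target at each position is the token that follows it in the input. Here is a minimal sketch on a hypothetical toy batch (the token ids are made up): `roll(-1, dims=1)` shifts each row left by one. Because `roll` wraps the first token around to the last position, this sketch masks that position with `-1`, the value that `GPT.forward` passes as `ignore_index` to `F.cross_entropy`. The masking step is an optional refinement assumed here for illustration.

```python
import torch

# toy batch of two sequences of token ids (made-up values), shape (b=2, t=4)
batch = torch.tensor([[10, 11, 12, 13],
                      [20, 21, 22, 23]])

# target at position i is the input token at position i + 1
targets = batch.roll(-1, dims=1)  # roll returns a new tensor; batch is unchanged

# the last column now holds the wrapped-around first token, which is not a real "next token"
targets[:, -1] = -1               # -1 is skipped by cross_entropy(..., ignore_index=-1)

print(batch)
print(targets)
```

Without the masking line, the last position of every sequence would be trained to predict that sequence's first token.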

In [None]:
# @title 🧑🏿‍💻 Your code here

def train(model, optimizer, train_data, n_epochs=10, batch_size=32, device='cpu'):
    """
    Train the model on the given data for n_epochs.
    """
    from tqdm import tqdm

    bar = tqdm(range(n_epochs))

    for epoch in bar:
        # 1. Shuffle the data

        # 2. Build the targets: the inputs shifted left by one position (next-token prediction)

        model.train()
        for i in range(0, len(train_data), batch_size):
            loss = None

            # 3. Zero the gradients

            # 4. Forward pass and loss calculation

            # 5. Backward pass

            # 6. Update the weights


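            # loss stays None until steps 3-6 are implemented, so guard the progress-bar update
            loss_value = loss.item() if loss is not None else 0.0
            bar.set_description(f'epoch {epoch}, loss {loss_value:.4f}')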

        if epoch % 10 == 0 or epoch == n_epochs - 1:
            # 7. Every 10 epochs we will sample some text to see how the model is doing
            model.eval()
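            # create the prompt on the same device as the training data (and hence the model)
            idx = char_to_ix['H'] * torch.ones(1, 1, dtype=torch.long, device=train_data.device)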
            idx = model.generate(idx, 100, temperature=0.9, do_sample=True, top_k=50)
            print('generated sample:', ''.join([ix_to_char[ix] for ix in idx[0].cpu().numpy()]), flush=True)

            # 8. Save the model
            torch.save(model.state_dict(), 'ckpts/GPT-char.pth')

In [None]:
# @title 👀 Solution

def train(model, optimizer, train_data, n_epochs=10, batch_size=32, device='cpu'):
    """
    Train the model on the given data for n_epochs.
    """
    from tqdm import tqdm

    bar = tqdm(range(n_epochs))

    for epoch in bar:
        # 1. Shuffle the data
        train_data = train_data[torch.randperm(len(train_data))]
        targets = train_data.clone()

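        # 2. Targets are the inputs shifted left by one position (next-token prediction).
        #    Note: roll wraps the first token of each row into the last target position.
        targets = targets.roll(-1, dims=1)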

        model.train()
        for i in range(0, len(train_data), batch_size):
            # 3. Zero the gradients
            optimizer.zero_grad()

            # 4. Forward pass and loss calculation
            idx = train_data[i:i+batch_size]
            targets_batch = targets[i:i+batch_size]
            _, loss = model(idx, targets=targets_batch)

            # 5. Backward pass
            loss.backward()

            # 6. Update the weights
            optimizer.step()
            bar.set_description(f'epoch {epoch}, loss {loss.item():.4f}')

        if epoch % 10 == 0 or epoch == n_epochs - 1:
            # 7. Every 10 epochs we will sample some text to see how the model is doing
            model.eval()
            idx = char_to_ix['\n'] * torch.ones(1, 1).long()

            # move to device
            idx = idx.to(device)

            idx = model.generate(idx, 100, temperature=0.9, do_sample=True, top_k=50)
            print('generated sample:', ''.join([ix_to_char[ix] for ix in idx[0].cpu().numpy()]), flush=True)

            # 8. Save the model
            torch.save(model.state_dict(), 'ckpts/GPT-char.pth')

In [None]:
# tokenize data into a list of ids
tokenized_data = torch.tensor([char_to_ix[c] for c in data], dtype=torch.long)

# split the data into sequences of block_size
block_size = 256
train_data = torch.cat([tokenized_data[i:i+block_size][None] for i in
                        range(0, tokenized_data.size(0) - block_size, block_size)],
                        dim=0)

In [None]:
# Let's test the training function

def test_train(train_data):
    import torch.optim as optim

    config = GPT.get_default_config()
    config.model_type = 'gpt-micro'
    config.vocab_size = 65 # number of unique characters in the data, 65 for Tiny Shakespeare
    config.block_size = 256

    torch.manual_seed(42) # set torch seed for reproducibility

    model = GPT(config)

    optimizer = optim.AdamW(model.parameters(), lr=1e-3)

    # move the model to the GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # move the data to the GPU if available
    train_data = train_data.to(device)

    train(model, optimizer, train_data, n_epochs=150, batch_size=64, device=device)

test_train(train_data)

number of parameters: 0.83M


epoch 0, loss 4.1876:   0%|          | 0/150 [00:00<?, ?it/s]

epoch 0, loss 2.5420:   0%|          | 0/150 [00:01<?, ?it/s]

generated sample: 
RINind
ERICI ssQ:
I le bouthe, hirf HCVII
Beineat, tobl's acanowe ath buhy w es r, afollind:
Adel ch


epoch 10, loss 1.7225:   7%|▋         | 10/150 [00:19<04:02,  1.73s/it]

generated sample: 
GLOUCESTER:
Ay all gove but all.

LADY ANNE:
A crust angues to grace! O, more,
That good thl.

LAUDI


epoch 20, loss 1.4943:  13%|█▎        | 20/150 [00:36<03:42,  1.71s/it]

generated sample: 

Afd rawn, and still prodites; to the sumbless to be the follow's son
Is suimist to continued of the


epoch 30, loss 1.3758:  20%|██        | 30/150 [00:53<03:25,  1.71s/it]

generated sample: 
KING RICHARD III:
By thy ungrace shall it yet at ever thy malicians.

CLARENCE:
Woeful thou art me h


epoch 40, loss 1.3210:  27%|██▋       | 40/150 [01:10<03:07,  1.70s/it]

generated sample: 
SOMERSET:
Come, and thou wert thy father's company.

QUEEN MARGARET:
You have heard to our counself:


epoch 50, loss 1.5244:  33%|███▎      | 50/150 [01:28<02:49,  1.70s/it]

generated sample: 
What the wing?

LUCIO:
Is goddes, thus have no mine.

ILLIUS:
No; he's you hear not but to go.

BIAN


epoch 60, loss 1.3234:  40%|████      | 60/150 [01:45<02:32,  1.70s/it]

generated sample: 
As he's hours overful escept that had been
The ignorance of whom I have safety:
Therefore I was near


epoch 70, loss 1.3444:  47%|████▋     | 70/150 [02:02<02:16,  1.70s/it]

generated sample: 

The people.

GONZALO:
No more sweet princely.

SEBASTIAN:
'Tis a man,
But spoke to make him his cha


epoch 80, loss 1.3217:  53%|█████▎    | 80/150 [02:19<01:59,  1.71s/it]

generated sample: 

LEONTES:
He's not by you: not well; if I had send with a
way the to plucket. You prithee me: good u


epoch 90, loss 1.2000:  60%|██████    | 90/150 [02:36<01:42,  1.71s/it]

generated sample: 
EEN MARCIUS:
Now, then a daughter, or the chose forth
of his name, and best that he bears doth herb



epoch 100, loss 1.3273:  67%|██████▋   | 100/150 [02:54<01:25,  1.70s/it]

generated sample: 
That you should as London as I say.

ESCALUS:
How but she did: so the crutchers of thy womb,
And nev


epoch 110, loss 1.2062:  73%|███████▎  | 110/150 [03:11<01:08,  1.71s/it]

generated sample: 
CLARENCE:
So she news but is seen with what treason me?

GLOUCESTER:
Sir, come, we will have stony t


epoch 120, loss 1.1883:  80%|████████  | 120/150 [03:28<00:51,  1.70s/it]

generated sample: 
VOLUMNIA:
Hair sir! I said the shame!

COMINIUS:
I have known you, sir, friend Marcius, the mother
A


epoch 130, loss 1.2260:  87%|████████▋ | 130/150 [03:45<00:34,  1.71s/it]

generated sample: 
BUCKINGHAM:
Well, God I to play not my children:
And must not so remember it;
And to his needful pri


epoch 140, loss 1.2707:  93%|█████████▎| 140/150 [04:02<00:17,  1.70s/it]

generated sample: 

PETRUCHIO:
Sir, by the deserts attend me; now and with
a single are all scurried for the matter and


epoch 149, loss 1.0982:  99%|█████████▉| 149/150 [04:18<00:01,  1.71s/it]

generated sample: 
And by adverse; but she should mark
A shire in the world of her green.

KING RICHARD III:
Ay, as I h


epoch 149, loss 1.0982: 100%|██████████| 150/150 [04:18<00:00,  1.72s/it]
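
To put the logged losses in context, the short calculation below compares them with a uniform guess over the 65 characters. It assumes the reported loss is the mean cross-entropy per character in nats, which is the usual convention but is not confirmed by the output above.

```python
import math

vocab_size = 65       # from the config above
first_loss = 4.1876   # first loss logged in epoch 0
final_loss = 1.0982   # loss logged at epoch 149

# Cross-entropy of a model that assigns probability 1/65 to every character.
uniform_loss = math.log(vocab_size)
# Perplexity: the effective number of equally likely choices per character.
first_perplexity = math.exp(first_loss)
final_perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.4f}")      # 4.1744
print(f"perplexity at start:   {first_perplexity:.1f}")  # 65.9
print(f"perplexity at end:     {final_perplexity:.2f}")  # 3.00
```

Under that assumption, the untrained model starts at roughly the uniform baseline, and after 150 epochs it is about as uncertain as a choice among three characters per step. This is a training loss, so it says nothing about held-out text.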
