<a href="https://colab.research.google.com/github/parthasarathydNU/gen-ai-coursework/blob/main/advanced-llms/Bigram_Language_Model_and_Generative_Pretrained_Transformer_(GPT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this assignment is to train a simplified transformer model. The primary differences between this implementation and production LLMs are:
* tokenizer (we use a character-level encoder for simplicity and because of compute constraints)
* size (we use one consumer-grade GPU hosted on Colab and a small dataset; in practice, models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.




In [3]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn
import torch.nn.functional as F

## Part 1: Bigram MLP for TinyShakespeare (35 points)

1a) (1 point). Create a list `chars` that contains all unique characters in `text`

1b) (2 points). Implement `encode(s: str) -> list[int]`

1c) (2 points). Implement `decode(ids: list[int]) -> str`

1d) (5 points). Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```

1e) (10 points). Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`

Note: Use the `torch.nn.functional.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. Activation functions are applied to every layer except the last.

1f) (5 points). Train the BigramOneHotMLP for 1000 steps.

1g) (5 points). Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h) (5 points). Implement and train BigramEmbeddingMLP, a 2 layer mlp that predicts the next token. Specifically, implement the constructor, forward, and generate functions. The output dimension of the first layer should be 8. Use `torch.optim`.



Note: the output will look like gibberish


In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-07-01 23:38:05--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-07-01 23:38:06 (131 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# For the bigram model, let's use the first 5000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:5000]

In [None]:
text

"First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us kill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be done: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citizens, the patricians good.\nWhat authority surfeits on would relieve us: if they\nwould yield us but the superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we are too dear: the leanness that\nafflicts us, the object of our misery, is as an\ninventory to particularise their abundance; our\nsufferance is a gain to them Let us revenge this with\nour pikes, ere we become rakes: for the gods know I\nspeak this in hunger 

In [None]:
# Get unique characters from the text
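# To have unique values, we can use the set data structure
# (set iteration order for strings can change between Python sessions, so ids may differ across runs)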
# Reference https://stackoverflow.com/questions/13902805/list-of-all-unique-characters-in-a-string
chars = list(set(text))

In [None]:
len(chars)

53

> We have 53 unique characters

In [None]:

def encode(string: str) -> list[int]:
    """
    Given a string, encode returns a list of integers that represent the characters
    in the string.
    """
    return [chars.index(s) for s in string]

def decode(ids: list[int]) -> str:
    """
    Given a list of integers, decode returns the characters in the list as a string.
    """
    return "".join(chars[i] for i in ids)

# Testing the encode and decode functions

encoded = encode('hello')
decoded = decode(encoded)
print(encoded)
print(decoded)

[14, 21, 3, 3, 29]
hello


In [None]:
# import torch.nn.functional as F

def create_one_hot_inputs_and_outputs(text: str) -> tuple[list[torch.Tensor], list[torch.Tensor]]:
    """
    For a given text we need to generate all pairs of consecutive characters.
    For example, for the word 'hello', we should create the following input-output pairs
    he
    el
    ll
    lo

    Additionally, we need to convert the input and output to one-hot
    encoded tensors.

    With len(chars) unique characters, each character id is turned into
    a one-hot tensor of shape (1, len(chars)).

    Say we have the word hello, we need to return the following tensors
    [one_hot_encoded_h][one_hot_encoded_e]
    [one_hot_encoded_e][one_hot_encoded_l]
    [one_hot_encoded_l][one_hot_encoded_l]
    [one_hot_encoded_l][one_hot_encoded_o]

    All the tensors should be of shape (1, len(chars)).
    All the tensors on the left go into inputs_one_hot,
    and all the tensors on the right go into outputs_one_hot.
    """
    inputs_one_hot = []
    outputs_one_hot = []

    # encode() tells us which index each character maps to

    for i in range(len(text) - 1):
        input_char = text[i]
        output_char = text[i+1]

        # One-hot encoding the input character
        input_one_hot = torch.zeros(1, len(chars))
        input_one_hot[0][encode(input_char)] = 1
        inputs_one_hot.append(input_one_hot)

        # One-hot encoding the output character
        output_one_hot = torch.zeros(1, len(chars))
        output_one_hot[0][encode(output_char)] = 1
        outputs_one_hot.append(output_one_hot)

    return inputs_one_hot, outputs_one_hot

# Example usage
inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs(text="hi")
print(inputs_one_hot)
print(outputs_one_hot)

[tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])]
[tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])]


In [None]:
inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs(text)

In [None]:
print(len(inputs_one_hot))
print(len(outputs_one_hot))
print(inputs_one_hot[0].shape)
print(outputs_one_hot[0].shape)

4999
4999
torch.Size([1, 53])
torch.Size([1, 53])


> We have one-hot encoded both the inputs and the outputs. Each entry is of shape (1, 53).
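The per-character loop above can also be written without an explicit loop. The following is a minimal sketch, assuming a short stand-in string `sample` and a sorted stand-in vocabulary `vocab` (both hypothetical, not the notebook's `text` and `chars`); it uses `torch.nn.functional.one_hot` and produces the stacked `[N - 1, vocab]` tensors directly.

```python
import torch
import torch.nn.functional as F

sample = "hello"                 # hypothetical stand-in for `text`
vocab = sorted(set(sample))      # hypothetical stand-in for `chars`
ids = torch.tensor([vocab.index(c) for c in sample])

# Inputs are every character except the last; targets are shifted by one.
inputs_one_hot = F.one_hot(ids[:-1], num_classes=len(vocab)).float()
outputs_one_hot = F.one_hot(ids[1:], num_classes=len(vocab)).float()

print(inputs_one_hot.shape)   # one row per consecutive pair
print(outputs_one_hot.shape)
```

Because the result is already a single tensor, no later `torch.vstack` call would be needed with this variant.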

In [None]:
"""
  Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token.
  Specifically, implement the constructor, forward, and generate.
  The output dimension of the first layer should be 8. Use torch.optim.
  The activation function for the first layer should be nn.LeakyReLU()
  Note: Use the torch.nn.functional.cross_entropy loss.
  Read the docs about how this loss function works.
  The logits are the output of a network WITHOUT an activation
  function applied to the last layer. Activation functions
  are applied to every layer except the last.
"""


class BigramOneHotMLP(nn.Module):
    def __init__(self):
        super().__init__() # Calls the init function on the nn.Module class
        # 2 layer MLP that predicts the next token
        self.fc1 = nn.Linear(len(chars), 8) # takes len(chars) values and outputs 8 values
        self.fc2 = nn.Linear(8, len(chars)) # takes 8 values and outputs len(chars) logits

    def forward(self, x):
        # x: [*, len(chars)], any batch size, one-hot encoded
        out = self.fc1(x) # [*, 8]
        out = F.leaky_relu(out) # [*, 8] leaky ReLU activation after the first layer
        out = self.fc2(out) # [*, len(chars)] unnormalized logits over the next character
        return out

bigram_one_hot_mlp = BigramOneHotMLP()

### Defining the loss function and optimizer

Reference : https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html#torch.nn.functional.cross_entropy

```python
torch.nn.functional.cross_entropy(
  input, target, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0
)
```

Parameters
- input (Tensor) – Predicted unnormalized logits;
- target (Tensor) – Ground truth class indices or class probabilities;
- reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'
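To make the parameter description concrete, here is a small sketch of how `cross_entropy` treats its inputs. It assumes a vocabulary of 53 classes (the size reported earlier) and all-zero logits, which are made-up values for illustration. It shows that class-index targets and one-hot probability targets give the same loss, that the function applies `log_softmax` internally, and that a uniform prediction costs `ln(num_classes)`.

```python
import math
import torch
import torch.nn.functional as F

num_classes = 53                        # vocabulary size reported earlier
logits = torch.zeros(4, num_classes)    # made-up uniform prediction for 4 samples
target_ids = torch.tensor([0, 5, 10, 52])

# Targets given as class indices.
loss_from_ids = F.cross_entropy(logits, target_ids)

# Targets given as class probabilities (here: one-hot rows).
target_probs = F.one_hot(target_ids, num_classes=num_classes).float()
loss_from_probs = F.cross_entropy(logits, target_probs)

# cross_entropy applies log_softmax internally, so this is equivalent.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target_ids)

print(loss_from_ids.item(), loss_from_probs.item(), manual.item())
print(math.log(num_classes))  # loss of a uniform guess
```

A starting loss close to `ln(num_classes)` is therefore a useful sanity check for a freshly initialized model.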

```python
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
```

### Training the model

In [None]:
# Set the model to training mode
bigram_one_hot_mlp.train()

# Define the loss function and optimizer
optimizer = torch.optim.SGD(bigram_one_hot_mlp.parameters(), lr=0.01)

# Compute the cross entropy loss between input logits and target.
loss_fn = nn.CrossEntropyLoss()

# We already have the inputs and outputs as lists of one-hot tensors of shape (1, len(chars)).
# Stack them into single float tensors of shape [N, len(chars)] so the model can process them in one batch.

inputs_one_hot = torch.vstack(inputs_one_hot).float()
outputs_one_hot = torch.vstack(outputs_one_hot).float()

In [None]:
inputs_one_hot.shape

torch.Size([4999, 53])

In [None]:
target_indices = torch.argmax(outputs_one_hot, dim=1)

# training loop
for epoch in range(5000):
  # Flush the gradients from prev iteration
  optimizer.zero_grad()

  # forward pass
  outputs = bigram_one_hot_mlp(inputs_one_hot)
  loss = loss_fn(outputs, target_indices)

  # backward pass and updating the weights
  loss.backward()
  optimizer.step()

  if (epoch + 1) % 50 == 0:
    print(f'Epoch {epoch + 1}, Loss: {loss.item()}')

Epoch 50, Loss: 3.289602756500244
Epoch 100, Loss: 3.2819199562072754
Epoch 150, Loss: 3.2746875286102295
Epoch 200, Loss: 3.2678771018981934
Epoch 250, Loss: 3.261469602584839
Epoch 300, Loss: 3.2554359436035156
Epoch 350, Loss: 3.2497506141662598
Epoch 400, Loss: 3.2443928718566895
Epoch 450, Loss: 3.2393407821655273
Epoch 500, Loss: 3.234571933746338
Epoch 550, Loss: 3.230064868927002
Epoch 600, Loss: 3.225799798965454
Epoch 650, Loss: 3.221761465072632
Epoch 700, Loss: 3.2179362773895264
Epoch 750, Loss: 3.2143027782440186
Epoch 800, Loss: 3.2108397483825684
Epoch 850, Loss: 3.207523822784424
Epoch 900, Loss: 3.2043662071228027
Epoch 950, Loss: 3.2013511657714844
Epoch 1000, Loss: 3.1984596252441406
Epoch 1050, Loss: 3.1956849098205566
Epoch 1100, Loss: 3.1930253505706787
Epoch 1150, Loss: 3.190463066101074
Epoch 1200, Loss: 3.187995672225952
Epoch 1250, Loss: 3.1856091022491455
Epoch 1300, Loss: 3.1832964420318604
Epoch 1350, Loss: 3.1810507774353027
Epoch 1400, Loss: 3.1788628101

In [None]:
def generate(model, start='a', max_new_tokens=5) -> str:
    """
    Generate text given a starting character
    """

    model.eval() # set the model to eval mode
    with torch.no_grad(): # disables gradient calculation

        word = start
        current_char = start
        for _ in range(max_new_tokens):

            # One-hot encode the current character
            input_one_hot = torch.zeros(1, len(chars))
            input_one_hot[0][encode(current_char)] = 1

            # Pass this through the model to get the logits (unnormalized scores)
            # for the next char, in the same shape as the input
            output = model(input_one_hot) # [1, len(chars)]

            # Greedy decoding: pick the index with the largest logit.
            # .item() converts the single-value tensor to a Python number.
            next_char_id = torch.argmax(output).item()

            # Look up the character for that id
            next_char = chars[next_char_id]

            # Update the word and the current character
            current_char = next_char
            word += current_char

        return word

In [None]:
print(generate(model=bigram_one_hot_mlp, start='r', max_new_tokens=100))

re e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 


### Observations and thoughts

It looks like the one-hot encoded input does not help the model much in predicting the next token.
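A likely contributor to the repetition in the sample above is greedy decoding: `generate` always takes the argmax, and a bigram model conditions only on the current character, so each character maps to one fixed successor and the text must eventually cycle. The sketch below contrasts argmax with sampling from the softmax distribution via `torch.multinomial`. The three-character `vocab` and the `logits` values are hypothetical stand-ins for the model output, not results from this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab = ['a', 'b', 'c']                     # hypothetical tiny vocabulary
logits = torch.tensor([[2.0, 1.0, 0.5]])    # stand-in for model output, shape [1, vocab]

# Greedy decoding always returns the same id for the same input.
greedy_id = torch.argmax(logits, dim=1).item()

# Sampling draws from the softmax distribution instead.
probs = F.softmax(logits, dim=1)
sampled_ids = torch.multinomial(probs, num_samples=20, replacement=True)[0].tolist()

print(vocab[greedy_id])
print("".join(vocab[i] for i in sampled_ids))
```

Sampling does not make a bigram model more accurate, but it lets less likely successors appear, which breaks the fixed cycle.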

Here are some thoughts regarding one hot encoded vectors vs embeddings:

Embeddings provide a powerful alternative to one-hot encoded inputs, especially in models dealing with natural language processing, recommendation systems, and other domains where categorical variables play a crucial role. Here are the primary reasons why embeddings often result in better model performance compared to one-hot encoded inputs:

### 1. **Dimensionality Reduction**
- **One-hot encoding** produces very high-dimensional vectors where the length is equal to the number of categories in the vocabulary. For example, a vocabulary with 10,000 words results in 10,000-dimensional vectors, with only one non-zero element. This high dimensionality can lead to computational inefficiency and sparsity issues.
- **Embeddings**, on the other hand, map these large one-hot vectors into a much smaller dimensional space (e.g., 50, 100, or 300 dimensions). This dense representation reduces the model's memory footprint and accelerates computation.

### 2. **Semantic Information Preservation**
- **One-hot vectors** are orthogonal and equidistant from each other, implying no relationship between different categories or words. This means one-hot encoding does not capture any form of similarity or semantic relationship between the categories.
- **Embedding vectors** are learned during training and can capture complex relationships between categories. Words or items that are used in similar contexts will have embeddings that are closer in the vector space, which helps in capturing semantic meanings and relationships that are not possible with one-hot encoding.

### 3. **Better Gradient Flow**
- In neural networks, **one-hot inputs** activate only the weights tied to the 'hot' position, so only that slice of the first layer's weights receives a gradient for each example. Multiplying a mostly-zero vector by a full weight matrix also wastes computation.
- **Embeddings** perform the same selection as a direct table lookup and skip the matrix multiplication. For a single linear layer without bias the two are mathematically equivalent, so the gain is mainly in speed and memory rather than in a different gradient signal.
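It can help to check numerically how the two input types relate. A one-hot row multiplied by a linear layer's weight matrix selects a single column of that matrix, which is the same computation an `nn.Embedding` lookup performs. The sketch below assumes a vocabulary of 53 and a hidden size of 8 to mirror the models in this notebook; the token ids are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
vocab_size, hidden = 53, 8

linear = nn.Linear(vocab_size, hidden, bias=False)
embedding = nn.Embedding(vocab_size, hidden)

# Copy the linear weights into the embedding table.
# nn.Linear stores weight as [out_features, in_features], so transpose it.
with torch.no_grad():
    embedding.weight.copy_(linear.weight.T)

ids = torch.tensor([3, 17, 42])     # arbitrary token ids
one_hot = F.one_hot(ids, num_classes=vocab_size).float()

out_linear = linear(one_hot)        # matrix multiply with one-hot rows
out_embedding = embedding(ids)      # direct row lookup

print(torch.allclose(out_linear, out_embedding))
```

Under these assumptions the two paths produce the same activations, so any difference in training results comes from model size, initialization, or compute cost rather than from the input representation itself.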

### 4. **Generalization**
- **Embeddings** can help the model generalize better to new, unseen examples. Since embeddings capture semantic relationships, a well-trained model can perform reasonably well even when encountering new words or categories similar to those seen during training.
- With **one-hot encoding**, any new category not seen during training would require expanding the encoding scheme, which might not be feasible and does not leverage any learned contextual relationships.

### 5. **Network Depth and Complexity**
- Using **embeddings** allows deeper network architectures since the input data is more manageable in size and richer in information. Embeddings can be easily integrated into various layers of a neural network, facilitating more complex interactions in the model.
- With **one-hot encoded inputs**, adding network depth can be less effective due to the sparsity and high dimensionality, which might not add meaningful information through deeper layers.

### Example in NLP:
In natural language processing, word embeddings like Word2Vec, GloVe, or those learned by an embedding layer in a deep learning model allow words with similar meanings to have similar representations. This is not achievable with one-hot encoding where each word is isolated with no shared information.

In summary, embeddings provide a compact, dense, and semantically rich representation of categorical data, making them more suitable for complex models and large datasets where one-hot encoding would be inefficient both computationally and in terms of learning capability.

### Embedding Bigram MLP

In [None]:
def create_embedding_inputs_and_outputs() -> tuple[list[int], list[torch.Tensor]]:
    """
    This function returns input_ids and outputs_one_hot.
    Here we are trying to perform better than the one-hot encoded bigram MLP.

    In the one-hot encoded version we built pairs of
    [one_hot_encoded_h][one_hot_encoded_e]
    [one_hot_encoded_e][one_hot_encoded_l]

    Here the inputs are plain token ids instead
    [id_h][one_hot_encoded_e]
    [id_e][one_hot_encoded_l]
    [id_l][one_hot_encoded_l]
    [id_l][one_hot_encoded_o]
    """
    input_ids = []
    outputs_one_hot = []

    # `chars` is the list of unique characters
    # `text` is the dataset that we train on

    for i in range(len(text) - 1):
        input_char = text[i]
        output_char = text[i+1]

        # Id of the input character
        input_id = chars.index(input_char)
        input_ids.append(input_id)

        # One-hot encoding the output character
        output_one_hot = torch.zeros(1, len(chars))
        output_one_hot[0][encode(output_char)] = 1
        outputs_one_hot.append(output_one_hot)

    return input_ids, outputs_one_hot


input_ids, outputs_one_hot = create_embedding_inputs_and_outputs()

In [None]:
input_ids[0]

21

In [None]:
chars[21]

'F'

In [None]:
print(len(input_ids))
print(len(outputs_one_hot))
print(input_ids[0])
print(outputs_one_hot[0])

print(input_ids[1])
print(outputs_one_hot[1])

4999
4999
tensor(21)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])
tensor(49)
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


In [None]:
# Converting them to tensors
input_ids = torch.tensor(input_ids)
outputs_one_hot = torch.vstack(outputs_one_hot)

# printing the shape
print(input_ids.shape)
print(outputs_one_hot.shape)

torch.Size([4999])
torch.Size([4999, 53])


In [None]:
"""
We are training a model to learn how to predict the
next character given the previous character.
"""

class BigramEmbeddingMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(len(chars), 5)
        self.fc1 = nn.Linear(5, 8) # takes the 5-dimensional embedding and outputs 8 values
        self.fc2 = nn.Linear(8, len(chars)) # takes in 8 values and outputs len(chars) logits

    def forward(self, x):
        embeddedX = self.token_embedding(x)
        out = self.fc1(embeddedX)
        out = F.leaky_relu(out)
        out = self.fc2(out)
        return out

    def generate(self, start='a', max_new_tokens=100) -> str:
        self.eval() # set the model to eval mode
        with torch.no_grad(): # disables gradient calculation
          word = start
          current_char = start
          # we want to generate max_new_tokens
          for _ in range(max_new_tokens):
            output = self.forward(torch.tensor([chars.index(current_char)]))
            next_char_id = torch.argmax(output).item()
            next_char = chars[next_char_id]
            word += next_char
            current_char = next_char
          return word



bigram_embedding_mlp = BigramEmbeddingMLP()

optimizer = torch.optim.SGD(bigram_embedding_mlp.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

bigram_embedding_mlp.train()

# training loop
for step in range(5000):
    optimizer.zero_grad()
    outputs = bigram_embedding_mlp(input_ids)
    loss = loss_fn(outputs, outputs_one_hot)
    loss.backward()
    optimizer.step()

    if step % 50 == 0:
        print(f'Epoch {step + 1}, Loss: {loss.item()}')

Epoch 1, Loss: 4.001214027404785
Epoch 51, Loss: 3.9568183422088623
Epoch 101, Loss: 3.914968490600586
Epoch 151, Loss: 3.8750500679016113
Epoch 201, Loss: 3.8365707397460938
Epoch 251, Loss: 3.798905372619629
Epoch 301, Loss: 3.761927843093872
Epoch 351, Loss: 3.725783109664917
Epoch 401, Loss: 3.6902477741241455
Epoch 451, Loss: 3.6553354263305664
Epoch 501, Loss: 3.6211750507354736
Epoch 551, Loss: 3.5879199504852295
Epoch 601, Loss: 3.5557920932769775
Epoch 651, Loss: 3.5250096321105957
Epoch 701, Loss: 3.4957902431488037
Epoch 751, Loss: 3.4684042930603027
Epoch 801, Loss: 3.442491292953491
Epoch 851, Loss: 3.4182868003845215
Epoch 901, Loss: 3.3958003520965576
Epoch 951, Loss: 3.374884843826294
Epoch 1001, Loss: 3.355443000793457
Epoch 1051, Loss: 3.337252616882324
Epoch 1101, Loss: 3.3200581073760986
Epoch 1151, Loss: 3.3037796020507812
Epoch 1201, Loss: 3.28837251663208
Epoch 1251, Loss: 3.2737858295440674
Epoch 1301, Loss: 3.2598977088928223
Epoch 1351, Loss: 3.246689081192016

In [None]:
print(bigram_embedding_mlp.generate(start='a'))
print(bigram_embedding_mlp.generate(start='b'))
print(bigram_embedding_mlp.generate(start='c'))
print(bigram_embedding_mlp.generate(start='d'))
print(bigram_embedding_mlp.generate(start='e'))
print(bigram_embedding_mlp.generate(start='f'))
print(bigram_embedding_mlp.generate(start='u'))
print(bigram_embedding_mlp.generate(start='v'))

at t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t 
b t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t
ce t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t 
d t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t
e t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t
f t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t
u t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t
v t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t


### Observations

The output looks about the same as that of the one-hot encoded bigram model. The loss decreased over training in both cases.

**One hot encoded bigram**:
- Epoch 5000, Loss: 3.050889253616333

**Embedding bigram**:
- Epoch 2650, Loss: 3.0476651191711426
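For context on these loss values, it may help to know the lowest training loss any bigram model can reach. A model that sees only the previous character cannot do better on the training data than the count-based conditional distribution `P(next | current)`. The sketch below computes that floor and the uniform-guess loss for a hypothetical stand-in string `sample`; running it on the notebook's `text` would give the corresponding numbers for this dataset.

```python
import math
from collections import Counter

sample = "hello hello hello"   # hypothetical stand-in for `text`

pair_counts = Counter(zip(sample, sample[1:]))   # counts of (current, next)
first_counts = Counter(sample[:-1])              # counts of the current char

# Average negative log-likelihood under the count-based bigram distribution
# P(next | current) = count(current, next) / count(current).
total_pairs = len(sample) - 1
nll = -sum(
    n * math.log(n / first_counts[cur])
    for (cur, _), n in pair_counts.items()
) / total_pairs

print(f"count-based bigram loss: {nll:.4f}")
print(f"uniform-guess loss:      {math.log(len(set(sample))):.4f}")
```

Comparing a trained model's loss against these two reference points shows how much of the available bigram signal it has actually learned.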

## Part 2: Generative Pretrained Transformer (65 points)

For this part, it is best to use a GPU. In the menu at the top, go to Runtime -> Change Runtime Type and select T4 GPU.

In [2]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Mon Jul  1 23:38:17 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string. (1 point)
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids (1 point)
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string (1 point)


In [4]:
chars = []

# List of unique characters in the string
for char in text:
    if char not in chars:
        chars.append(char)

def encode(s: str) -> list[int]:
    """
    Takes in a string and returns a list of ids (ints)
    """
    return [chars.index(char) for char in s]

def decode(ids: list[int]) -> str:
    """
    Takes in a list of ids (ints) and returns a string
    """
    return ''.join([chars[id] for id in ids])

In [6]:
print(f"Length of characters : {len(chars)}")
print(f"First 10 characters : {chars[:10]}")
print(f"Encoded text : {encode('hello')}")
print(f"Decoded text : {decode(encode('hello'))}")

Length of characters : 65
First 10 characters : ['F', 'i', 'r', 's', 't', ' ', 'C', 'z', 'e', 'n']
Encoded text : [22, 8, 28, 28, 14]
Decoded text : hello
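`chars.index` scans the list for every character, which adds up when encoding the full text of over a million characters. A common alternative is a pair of lookup dictionaries. The sketch below is one way to do this; the names `stoi`, `itos`, `encode_fast`, and `decode_fast` are hypothetical, and `sample` stands in for `text`. `dict.fromkeys` keeps first-occurrence order, matching the loop above.

```python
sample = "First Citizen"   # hypothetical stand-in for `text`

# dict.fromkeys keeps first-occurrence order, like the append loop.
vocab = list(dict.fromkeys(sample))
stoi = {ch: i for i, ch in enumerate(vocab)}   # char -> id
itos = dict(enumerate(vocab))                  # id -> char

def encode_fast(s: str) -> list[int]:
    """Map each character to its id with a constant-time dict lookup."""
    return [stoi[ch] for ch in s]

def decode_fast(ids: list[int]) -> str:
    """Map each id back to its character."""
    return "".join(itos[i] for i in ids)

print(vocab)
print(encode_fast("First Citi"))
print(decode_fast(encode_fast("First Citi")))
```

The ids are the same as those produced by the list-based version for the same vocabulary order; only the lookup cost changes.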


> cuda(device=None, non_blocking=False, memory_format=torch.preserve_format) -> Tensor
Returns a copy of this object in CUDA memory.
If this object is already in CUDA memory and on the correct device,
then no copy is performed and the original object is returned.

In [15]:
# Converting the input data to a tensor
encoded_text = encode(text)
print(f"First 10 chars '{text[0:10]}'")
print(f"encoded text first 10 chars {encoded_text[0:10]}")
print(f"Total character count {len(text)}")

First 10 chars 'First Citi'
encoded text first 10 chars [0, 1, 2, 3, 4, 5, 6, 1, 4, 1]
Total character count 1115394


In [12]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [14]:
data.shape

torch.Size([1115394])

In [16]:
# We can think of this as a window of characters that we use as the prefix to predict the next character
block_size = 16
data[:block_size+1] # first 17 entities in the tensor ( characters )

tensor([ 0,  1,  2,  3,  4,  5,  6,  1,  4,  1,  7,  8,  9, 10, 11, 12,  8],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [18]:
# Here let's just pick the first block of size `block_size` and try to
# visualize how the transformer learns to predict the next character
# We design the system to only learn from tokens that are before the token that has to be predicted

x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is: {target}")

when input is tensor([0], device='cuda:0') the target is: 1
when input is tensor([0, 1], device='cuda:0') the target is: 2
when input is tensor([0, 1, 2], device='cuda:0') the target is: 3
when input is tensor([0, 1, 2, 3], device='cuda:0') the target is: 4
when input is tensor([0, 1, 2, 3, 4], device='cuda:0') the target is: 5
when input is tensor([0, 1, 2, 3, 4, 5], device='cuda:0') the target is: 6
when input is tensor([0, 1, 2, 3, 4, 5, 6], device='cuda:0') the target is: 1
when input is tensor([0, 1, 2, 3, 4, 5, 6, 1], device='cuda:0') the target is: 4
when input is tensor([0, 1, 2, 3, 4, 5, 6, 1, 4], device='cuda:0') the target is: 1
when input is tensor([0, 1, 2, 3, 4, 5, 6, 1, 4, 1], device='cuda:0') the target is: 7
when input is tensor([0, 1, 2, 3, 4, 5, 6, 1, 4, 1, 7], device='cuda:0') the target is: 8
when input is tensor([0, 1, 2, 3, 4, 5, 6, 1, 4, 1, 7, 8], device='cuda:0') the target is: 9
when input is tensor([0, 1, 2, 3, 4, 5, 6, 1, 4, 1, 7, 8, 9], device='cuda:0') the

Revisiting some basics:

Terms:
- Block Size: The number of characters that the system has been trained to take into consideration while learning to predict the next character

In [21]:
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    """
    This function is responsible for creating a batch of batch_size
    For training a GPT model

    """
    # Here we generate a tensor `ix` containing `batch_size` random
    # indices within the range `0` to `len(data) - block_size`
    # we subtract `block_size` from the end so that the last
    # selected block stays within the list of available characters
    # in the text
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # Here we create a stack of tensors (batch size) each of length
    # block_size that start from the
    # above picked random indices
    x = torch.stack([data[i:i+block_size] for i in ix])

    # creates the target tensor y similarly, but shifted
    # one position to the right, representing the next
    # character to predict for each position in x.
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
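A quick way to confirm the input/target relationship is to run the same indexing on toy data. This is a sketch under assumed values: `toy_data`, `toy_block_size`, and `toy_batch_size` are hypothetical stand-ins for `data`, `block_size`, and `batch_size`, and everything stays on the CPU.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)            # hypothetical stand-in for `data`
toy_block_size, toy_batch_size = 8, 4   # hypothetical small sizes

# Random start positions, leaving room for a full block plus one target.
ix = torch.randint(len(toy_data) - toy_block_size, (toy_batch_size,))
x = torch.stack([toy_data[i:i + toy_block_size] for i in ix])
y = torch.stack([toy_data[i + 1:i + toy_block_size + 1] for i in ix])

print(x.shape, y.shape)
# Every target row is the input row shifted left by one position.
print(torch.equal(x[:, 1:], y[:, :-1]))
```

Each row of `y` holds, at position `t`, the token that follows the first `t + 1` tokens of the matching row of `x`, which is what lets one block supply `block_size` training examples.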


### Clarifying some more terms before we proceed to the next step:

### Block Size
- **Block Size** in a GPT model training code refers to the length of each segment of the input data that each training example consists of. This is directly analogous to what you might think of as "sequence length" in other contexts but is specifically termed "block size" in training scenarios for models like GPT.
- In the context of the Transformer attention head depicted in the image below, each input sequence processed through the attention head has a fixed length equal to the block size. The image does not label it as "block size", but the corresponding dimension is the middle dimension of the tensors, which is 32 in the example shape `(8, 32, 64)`.

### Sequence Length
- **Sequence Length** in more general contexts refers to the total length of sequences being processed, which may vary unless specifically pre-processed to be uniform. In models like Transformers (and as seen in the attention head diagram), sequence length is typically fixed to a specific size for each training or inference pass. This fixed length is crucial for attention calculations across the entire sequence uniformly.
- In the Transformer attention head diagram, this "sequence length" is manifest in each stage of the attention mechanism. It represents the number of positions (or tokens) in each sequence that the model processes simultaneously, marked as the second dimension in the tensors.

### How It Relates to the Diagram
- In the attention head diagram present below, all tensors maintain a consistent second dimension (32 in this case), reflecting the fixed sequence length or block size used for the calculations. This consistent dimensionality across layers and operations ensures that each token in the sequence can be related to every other token via the attention mechanism, a key feature enabling the model to capture complex dependencies across the input.
- The operations like matrix multiplication (`matmul`) between the transposed keys and queries, and subsequent operations like softmax and dropout, all depend on this fixed sequence length to compute the attention scores and ultimately the output sequence. This fixed length, as used in your GPT model training, allows the Transformer to utilize positional relationships effectively.

### Summary
In summary, in the Transformer model, as depicted by the attention head image below, block size and sequence length can be considered equivalent, referring to the fixed size of the input sequences used for training and inference. This term varies in usage depending on the model architecture but is crucial for models like Transformers that depend on a fixed dimension to compute relationships between all pairs of inputs within a sequence effectively.

1. **Batch Size**: This is the number of samples processed before the model is updated. For example, if we are dealing with a tensor of shape (8, 32, 64), the first dimension "8" typically represents the batch size. This means that the model processes 8 samples at a time.

2. **Sequence Length**: This is the length of the input sequences each sample in a batch contains. In the example tensor, "32" represents the sequence length / block size, indicating each sample consists of 32 sequential elements or **tokens**. ***For instance, in natural language processing, this could represent 32 words in a sentence. In this example since we will be predicting characters, each token represents a character from the list of unique characters available.***

3. **Feature Dimension**: This indicates the number of features each element of the sequence holds. The "64" in the example tensor suggests that each token or element of the sequence is represented by a vector of 64 features. These could be embeddings that encapsulate the token's meaning in a dense vector.

4. **Block Size**: This term isn't explicitly labelled in the diagram. In this notebook it means the same thing as sequence length: the fixed number of tokens in each training example (32 here). In other settings, such as memory management or GPU computation, "block" can instead refer to chunks of data that are processed together, but that is not the meaning used in this notebook.

Thus, in the below shown transformer model diagram:
- Batch size: 8 (number of samples processed together)
- Sequence length / block size: 32 (number of tokens, items, or steps per sample)
- Feature dimension: 64 (features per token or step in each sequence)

### Single Self Attention Head (5 points)
![](https://i.ibb.co/GWR1XG0/head.png)

### Explaining the above attention head setup

This diagram depicts the computational flow in a typical attention head of a Transformer neural network, commonly used in models like GPT and BERT. Here’s a breakdown of the operations and their significance:

1. **Input Tensor**: The input tensor has the shape (8, 32, 64), where 8 represents the batch size, 32 the sequence length / block size, and 64 the feature dimension of each token in the sequence.

2. **Linear Layers**: Three parallel linear transformations are applied to the input tensor. Each layer outputs a tensor of shape (8, 32, 16). These transformations typically generate the queries (Q), keys (K), and values (V) which are used in the attention mechanism.

3. **Transpose Operation**: The output of one of the linear layers (presumably representing keys, K) is transposed to change its shape from (8, 32, 16) to (8, 16, 32). This operation is necessary for matrix multiplication with the queries (Q). Note that only the second and third dimensions are swapped, because the first dimension is the batch dimension and each sample in the batch is processed independently.

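4. **Matrix Multiplication (matmul)**: The queries (Q), of shape (8, 32, 16), are matrix-multiplied with the output of the transpose operation (K transposed), of shape (8, 16, 32). ***This results in a shape of (8, 32, 32), representing the raw attention scores before they are normalized.*** Entry (i, j) of each 32 x 32 matrix is the dot product of the query at position i with the key at position j.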

5. **Multiplication (mul)**: This operation might be an element-wise multiplication used as part of scaling the attention scores by the square root of the dimension of the keys to stabilize gradients during training, although the typical square root scaling is not explicitly shown here. More on why we scale the attention values below ...

6. **Masked Fill**: This operation is used to apply masks to the attention scores. Masks are often used to ignore (or mask) padding tokens or future tokens during training in sequence models. The operation doesn't change the shape of the tensor. In this step the entries above the main diagonal (the upper right triangle of the matrix) are set to negative infinity, so that they become zeros after the softmax in the next step.

7. **Softmax**: The softmax function is applied across the last dimension (32) to normalize the attention scores to a probability distribution.

8. **Dropout**: Dropout is a regularization technique where random elements of the tensor are zeroed out during training to prevent overfitting. The shape remains unchanged.

9. **Matrix Multiplication (matmul)**: The normalized and possibly masked attention scores are then matrix-multiplied with the third linear output (V, values), resulting in an output tensor of shape (8, 32, 16). This operation computes the weighted sum of the values based on the attention scores.

10. **Output Tensor**: The final output tensor is generated with the shape (8, 32, 16), likely to be fed into subsequent layers of the Transformer or processed further depending on the specific architecture and task.

This detailed flow illustrates how attention mechanisms selectively focus on different parts of the input sequence, weighting input features by relevance, which is central to the success of Transformer models in handling various sequence-based tasks in natural language processing.
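The ten steps above can be traced with plain tensor operations. The following is a minimal sketch, assuming random inputs and freshly initialised linear layers; the names `to_q`, `to_k` and `to_v` are illustrative and do not come from the assignment. It only checks that each step produces the shape listed above.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
B, T, C, head_size = 8, 32, 64, 16   # sizes taken from the diagram

x = torch.randn(B, T, C)                       # 1. input tensor (8, 32, 64)
to_q = nn.Linear(C, head_size, bias=False)     # 2. three parallel projections
to_k = nn.Linear(C, head_size, bias=False)
to_v = nn.Linear(C, head_size, bias=False)
q, k, v = to_q(x), to_k(x), to_v(x)            # each (8, 32, 16)

k_t = k.transpose(1, 2)                        # 3. (8, 16, 32)
scores = q @ k_t                               # 4. (8, 32, 32) raw scores
scores = scores * head_size ** -0.5            # 5. scale by 1 / sqrt(d_k)

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))   # 6. hide future tokens

weights = F.softmax(scores, dim=-1)            # 7. each row sums to 1
weights = F.dropout(weights, p=0.1, training=False)   # 8. no-op outside training
out = weights @ v                              # 9. and 10. (8, 32, 16)

print(q.shape, k_t.shape, scores.shape, out.shape)
```

The first token in every sequence can only attend to itself, so the first row of each attention matrix has all of its weight on position 0.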

### The scaling of the attention scores

The scaling of the attention scores based on the dimension of the keys in the Transformer architecture addresses a specific challenge in training deep learning models that use softmax to calculate probabilities.

### Background on Dot Products and Their Scale
The attention mechanism computes the dot products between the query and all keys in the sequence. These dot products are a critical component because they determine the attention scores that indicate how much each part of the input should contribute to the output at each position. However, the magnitude of the dot products depends on the dimensionality of the keys and queries. Here's why:

- The magnitude of the dot product of two vectors grows with the number of dimensions. Specifically, if the components of the two vectors are independent with zero mean and constant variance, the variance of the dot product is proportional to the dimensionality of the vectors.
- As the dimension of the keys (and queries, since they are usually of the same dimension) increases, the typical magnitude of the dot products becomes larger. This can lead to extremely large positive or negative values, especially when working with high-dimensional data, which is common in models like Transformers.
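The growth in variance can be checked empirically. This is a small sketch, assuming queries and keys whose components are independent standard normal values (an idealisation, since trained projections will not be exactly like this). The dictionary `variances` is only a container for the results.

```python
import torch

torch.manual_seed(0)
n_samples = 10_000
variances = {}

for d_k in (4, 16, 64, 256):
    q = torch.randn(n_samples, d_k)      # unit-variance components
    k = torch.randn(n_samples, d_k)
    dots = (q * k).sum(dim=-1)           # one dot product per row
    variances[d_k] = dots.var().item()
    print(f"d_k={d_k:4d}  variance of q.k = {variances[d_k]:.1f}")
```

Under these assumptions the measured variance should come out close to `d_k` itself, so the standard deviation of the raw scores grows like the square root of `d_k`.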

### Impact on Softmax
The softmax function, which is used to convert these dot products into probabilities (or attention scores), is highly sensitive to changes in input values. Specifically:
- Large values in the softmax input can lead to a situation where the softmax function's output is close to zero for all inputs except the largest one (a phenomenon often referred to as the softmax function "saturating"). This saturation can significantly slow down learning, as it leads to very small gradients during backpropagation — essentially, the network is less able to learn from the input data.

### Why Scale by Square Root of Dimension?
The scaling factor used, \(\sqrt{d_k}\) (where \(d_k\) is the dimension of the keys), helps mitigate these effects:
- **Normalization**: By dividing the dot products by \(\sqrt{d_k}\), you effectively normalize them, bringing their variance back to a more manageable scale. This normalization helps maintain a more uniform scale across different model sizes and configurations.
- **Gradient Stability**: By keeping the dot products (and thus the inputs to the softmax) at a reasonable magnitude, the scaling prevents gradients under the softmax from becoming too small. This is crucial for efficient learning, as it ensures that each update step during training is informative enough to guide the model towards better performance without being too noisy or too minimal.
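The effect on softmax can be seen by comparing unscaled and scaled scores. This is a rough sketch, assuming random standard normal queries and keys with `d_k = 256` and 32 key positions; `peak_unscaled` and `peak_scaled` are illustrative names for the average largest attention weight per row.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k, n_queries, n_keys = 256, 1000, 32

q = torch.randn(n_queries, d_k)
k = torch.randn(n_keys, d_k)
scores = q @ k.T                                   # (1000, 32) raw dot products

unscaled = F.softmax(scores, dim=-1)
scaled = F.softmax(scores / d_k ** 0.5, dim=-1)    # divide by sqrt(d_k)

peak_unscaled = unscaled.max(dim=-1).values.mean().item()
peak_scaled = scaled.max(dim=-1).values.mean().item()
print(f"mean largest weight, unscaled: {peak_unscaled:.2f}")
print(f"mean largest weight, scaled:   {peak_scaled:.2f}")
```

Without scaling, most rows put nearly all of their weight on a single key, which is the saturated regime where gradients are small. With scaling, the weight is spread over several keys.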

### Conclusion
Scaling by the square root of the dimension of the keys is a practical approach to ensuring that the attention mechanism operates effectively across different settings and model scales, facilitating stable and efficient training. This method is particularly vital in deep learning architectures like Transformers, where models often deal with high-dimensional data and require careful handling of numerical stability during training operations.

### Terms relevant to constructing the SelfAttention Head:

In the context of Transformer architectures, the **head size** in an attention head refers to the dimension of the vectors used for each of the queries (Q), keys (K), and values (V) within a single attention head. This is a key parameter that defines how much information each attention head can capture.

### Definition and Calculation

- **Head Size**: The head size is essentially the dimensionality of the Q, K, and V vectors within each specific attention head. It is typically derived by dividing the total dimension of the model's embeddings (\(d_{\text{model}}\)) by the number of attention heads (\(\text{num\_heads}\)). This allows the model to distribute the embedding information across multiple heads, each focusing on different features or relationships in the data.

### Formula
The head size for queries and keys (\(d_k\) and \(d_q\)) is often the same and can be calculated as:
\[ d_k = d_q = \frac{d_{\text{model}}}{\text{num\_heads}} \]
For values (\(d_v\)), it is usually the same as \(d_k\) and \(d_q\), though this can vary depending on specific model architectures or design choices:
\[ d_v = \frac{d_{\text{model}}}{\text{num\_heads}} \]

### Example
If a Transformer model uses an embedding dimension (\(d_{\text{model}}\)) of 512 and has 8 attention heads:
\[ d_k = d_q = d_v = \frac{512}{8} = 64 \]
Thus, each head processes vectors of size 64 for queries, keys, and values.
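The same arithmetic in code, as a tiny sketch. The helper `head_size_for` is a hypothetical name used only here. The second call uses the sizes from this notebook's diagram (64 input features, 4 heads), which gives the head output width of 16 seen in the tensor shapes.

```python
def head_size_for(n_embd: int, num_heads: int) -> int:
    """Split the embedding dimension evenly across the attention heads."""
    if n_embd % num_heads != 0:
        raise ValueError("n_embd must be divisible by num_heads")
    return n_embd // num_heads

print(head_size_for(512, 8))   # 64
print(head_size_for(64, 4))    # 16
```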

### Importance
The choice of head size affects how finely the model can focus on different aspects of the input data. Each head can potentially learn to attend to different parts of the sequence or different types of relationships:
- **Smaller head sizes** can lead to a more focused and granular attention mechanism, where each head might specialize more distinctly.
- **Larger head sizes** provide more capacity to each head, which can be useful for capturing more complex patterns or dependencies, but may reduce the diversity of what different heads can learn.

Adjusting the head size is a balance between computational efficiency, capacity, and the diversity of information that the attention heads can capture. It's an important aspect of model tuning, especially in tasks requiring nuanced understanding of context or relationships within the data.

### How does this translate to code

Following this example: If a Transformer model uses an embedding dimension (\(d_{\text{model}}\)) of 512 and has 8 attention heads:
\[ d_k = d_q = d_v = \frac{512}{8} = 64 \]
Thus, each head processes vectors of size 64 for queries, keys, and values.

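Each head still sees every token in the sequence. What changes is the width of the vectors: the q, k and v layers each project a token's full embedding (of size \(d_{\text{model}}\), called `n_embd` in this notebook) down to a smaller vector of size head_size.

Hence the q, k and v linear layers each have n_embd input features and head_size output features. The batch size does not appear in the layer definition, because a linear layer is applied to the last dimension of its input.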
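A quick shape check makes this concrete. This sketch assumes `n_embd = 64` and `head_size = 16` as in the diagram; `proj` is an illustrative name. The weight of the projection depends only on the feature sizes, and the same layer works for any batch size.

```python
import torch
from torch import nn

n_embd, head_size = 64, 16
proj = nn.Linear(n_embd, head_size, bias=False)
print(proj.weight.shape)   # torch.Size([16, 64]): (out_features, in_features)

shapes = []
for batch_size in (1, 8):
    x = torch.randn(batch_size, 32, n_embd)   # (batch, tokens, features)
    shapes.append(tuple(proj(x).shape))       # only the last dimension changes
print(shapes)              # [(1, 32, 16), (8, 32, 16)]
```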


### The Mask

The code snippet provided below involves creating a triangular mask and then applying it to an attention matrix in a Transformer model, typically used in tasks like text processing or sequence modeling. Let’s break down the two lines to understand what’s happening:

### Line 1: Creating the Mask
```python
mask = torch.tril(torch.ones(timesteps, timesteps))
```
- **`torch.ones(timesteps, timesteps)`**: This function creates a 2D tensor (square matrix) filled with the value `1`, where the dimensions of the matrix are both `timesteps`. `timesteps` could be the length of a sequence being processed, such as the number of words in a sentence.
- **`torch.tril()`**: This function takes a tensor and returns a lower triangular part of the matrix. It zeroes out all elements above the main diagonal. The main diagonal and the elements below remain as they were, which in this case, are all `1`s due to the `torch.ones()` function. This triangular matrix is typically used in attention mechanisms to ensure that the attention calculation for a given timestep only considers that timestep and the ones before it (i.e., ensuring causality in models like GPT).

### Line 2: Applying the Mask to the Attention Matrix
```python
masked_attention = attention.masked_fill(mask == 0, float('-inf'))
```
- **`mask == 0`**: This operation compares each element of the `mask` tensor to `0`. Since `mask` is a lower triangular matrix with `1`s in the lower triangle and `0`s elsewhere, this operation generates a Boolean tensor where `True` corresponds to the positions where the mask had `0`s (i.e., the upper triangular part of the matrix) and `False` everywhere else (i.e., the lower triangular part).
- **`masked_fill()`**: This method is called on the `attention` tensor. It takes two arguments: a mask and a value to fill. The mask here is the Boolean tensor from `mask == 0`. Wherever the mask is `True`, the `attention` tensor is filled with `float('-inf')`. This effectively applies the mask by setting the attention scores in the upper triangle (those that should not be considered due to causality) to negative infinity.

### Why `float('-inf')`?
In attention mechanisms, especially when followed by a softmax operation, setting values to negative infinity before softmax ensures that those values have zero probability. When softmax is applied to a vector containing negative infinity, the exponential of negative infinity is zero, hence those positions do not contribute to the output of the softmax.

### Summary
- The `mask` tensor is used to enforce causality in the attention mechanism by preventing the model from attending to future timesteps in the sequence. This is essential in models like GPT where predictions for a given position should only depend on previous positions.
- The `mask == 0` operation identifies positions that should be ignored (in this context, future timesteps), and `masked_fill` applies this by setting such positions in the attention matrix to negative infinity, effectively removing them from consideration during attention normalization (softmax).
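The two lines can be run on a small example to see the result. This sketch assumes 4 timesteps and an attention matrix of all zeros (equal raw scores), chosen so that the softmax output is easy to read.

```python
import torch
import torch.nn.functional as F

timesteps = 4
attention = torch.zeros(timesteps, timesteps)   # equal raw scores everywhere

mask = torch.tril(torch.ones(timesteps, timesteps))
masked_attention = attention.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(masked_attention, dim=-1)

print(mask)
print(masked_attention)
print(weights)
```

Row `i` of `weights` spreads its probability evenly over positions `0..i` and assigns exactly zero to every later position: the first row is `[1, 0, 0, 0]` and the second is `[0.5, 0.5, 0, 0]`.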

In [4]:
class SelfAttentionHead(nn.Module):
  """
  This class implements a single causal self attention head
  For the input dimensions we have batch size , sequence length , feature dimension

  First we need to implement the K, Q and V layers
  These are three linear layers that map n_embd features to head_size features
  https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear
  """
  def __init__(self, head_size, n_embd=64):
    # n_embd is the feature dimension of the input. The default of 64 is an
    # assumption that matches the (8, 32, 64) input shown in the diagram.
    super().__init__() # required before registering layers on an nn.Module
    self.k = nn.Linear(n_embd, head_size, bias=False)
    self.q = nn.Linear(n_embd, head_size, bias=False)
    self.v = nn.Linear(n_embd, head_size, bias=False)

  def forward(self, x):
    """
    The forward step contains the following steps
    1. Pass x through the linear layers
    2. Perform transpose operation on k
    3. Calculate the raw attention scores q.k
    4. Scaling the attention scores
    5. Masked fill
    6. Softmax
    7. Dropout
    8. Matrix multiplication with v

    Batch, Tokens, Channels
    """
    B, T, C = x.shape
    k = self.k(x) # B, T, head_size
    q = self.q(x) # B, T, head_size
    v = self.v(x) # B, T, head_size

    k = k.transpose(1, 2) # Here we transpose dimensions 1 and 2
    # k is now of dimension B, head_size, T

    # Here we calculate the raw attention scores q.k
    attention = torch.matmul(q, k) # (B, T, head_size) @ (B, head_size, T) = (B, T, T)

    # Scaling the attention scores by the sq root of the key dimension (head_size),
    # not the input feature dimension C
    attention = attention * q.shape[-1]**-0.5 # this will be of dimension B, T, T

    # Masked fill
    # We create a mask of dimensions T, T on the same device as x
    mask = torch.tril(torch.ones(T, T, device=x.device))
    # We apply the mask to the attention matrix
    # float('-inf') is applied to all positions where the mask value is 0
    masked_attention = attention.masked_fill(mask == 0, float('-inf'))

    # Softmax
    # the dimension value has to be set to -1 to indicate the last dimension
    attention = F.softmax(masked_attention, dim=-1)

    # Dropout
    # training=self.training turns dropout off when the model is in eval mode
    attention = F.dropout(attention, p=0.1, training=self.training)

    # Matrix multiplication with v
    output = torch.matmul(attention, v) # B, T, head_size

    return output

### Multihead Self Attention (5 points)

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)
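The concatenation step in `forward` can be checked on its own. This is only a shape sketch, not a solution to the exercise: random tensors stand in for the outputs of the four heads, and the sizes (`n_embd = 64`, 4 heads) are assumed from the earlier diagram.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, n_embd, num_heads = 8, 32, 64, 4
head_size = n_embd // num_heads            # 16

# Stand-ins for the outputs of the four heads, each (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenate along the feature dimension: 4 * 16 = 64 features per token
concatenated = torch.cat(head_outputs, dim=-1)

proj = nn.Linear(n_embd, n_embd)           # final linear layer of the module
out = proj(concatenated)

print(concatenated.shape, out.shape)
```

Because `num_heads * head_size` equals `n_embd`, the concatenated tensor has the same shape as the input to the heads, which is what allows the output to be added back to the input in a residual connection.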

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        pass

    def forward(self, x):
        pass


## MLP (2 points)
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [None]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # implement
        pass

    def forward(self, x: torch.tensor) -> torch.tensor:
        # implement
        pass

## Transformer block (20 points)

Layer normalization helps training stability by normalizing the outputs of neurons within a single layer across all features for each individual data point, not across a full batch or a specific feature.

Dropout is a form of regularization to prevent overfitting.
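Both behaviours can be observed directly. This is a small sketch using `nn.LayerNorm` and `nn.Dropout` on made-up data; the tensor sizes follow the earlier diagram and the dropout probability of 0.5 is chosen only to make the effect easy to see.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 32, 64) * 5 + 3          # (batch, tokens, features), off-scale on purpose

layer_norm = nn.LayerNorm(64)               # normalises over the last (feature) dimension
y = layer_norm(x)
per_token_mean = y.mean(dim=-1)             # (8, 32), close to 0 for every token
per_token_var = y.var(dim=-1, unbiased=False)   # (8, 32), close to 1 for every token
print(per_token_mean.abs().max().item(), per_token_var.mean().item())

dropout = nn.Dropout(p=0.5)
ones = torch.ones(1000)
dropout.train()
dropped = dropout(ones)                     # each entry is 0 or 1 / (1 - p) = 2
dropout.eval()
kept = dropout(ones)                        # dropout is the identity in eval mode
print(dropped.unique().tolist(), torch.equal(kept, ones))
```

The statistics are computed separately for each of the 8 x 32 tokens, which is why layer normalization does not depend on the batch size.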

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [None]:
class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        pass

    def forward(self, x):
        pass

## GPT

`constructor` (5 points)

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. (5 points)

`forward` takes a batch of context ids as input of size (B, T) and returns the logits and the loss, if targets is not None. If targets is None, return the logits and None.
1. get the token embeddings by using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model input through the blocks, the layernorm layer, and the final linear layer
5. compute the loss
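The embedding sum and the loss computation can be sketched in isolation. This is not the full `forward`: the blocks and the layer norm are left out, and the sizes (`vocab_size = 65`, `block_size = 32`, `n_embd = 64`) and the names `token_emb`, `pos_emb` and `lm_head` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 32, 64     # illustrative sizes
B, T = 8, 32

token_emb = nn.Embedding(vocab_size, n_embd)    # one vector per token id
pos_emb = nn.Embedding(block_size, n_embd)      # one vector per position
lm_head = nn.Linear(n_embd, vocab_size)         # predicts the next token

idx = torch.randint(0, vocab_size, (B, T))      # random context ids
targets = torch.randint(0, vocab_size, (B, T))  # random next-token ids

tok = token_emb(idx)                                 # (B, T, n_embd)
pos = pos_emb(torch.arange(T, device=idx.device))    # (T, n_embd)
x = tok + pos                                        # pos is broadcast over the batch

logits = lm_head(x)                                  # (B, T, vocab_size)
# cross_entropy expects (N, num_classes) logits and (N,) targets
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))
print(logits.shape, loss.item())
```

Flattening the batch and time dimensions treats every position in every sequence as an independent next-token prediction.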

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` (5 points)
1. implement top p, top_k, and temperature for sampling
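One possible way to combine the three sampling controls is sketched below for a single 1-D vector of next-token logits. The function name `filter_logits` and the order in which the filters are applied are choices made for this sketch, not requirements of the assignment.

```python
import torch
import torch.nn.functional as F

def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, top-k and top-p filtering to 1-D next-token logits."""
    logits = logits / temperature                # <1 sharpens, >1 flattens
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        # Probability mass of all tokens ranked strictly above each token
        mass_before = sorted_probs.cumsum(dim=-1) - sorted_probs
        remove_sorted = mass_before >= top_p     # the best token is always kept
        remove = torch.zeros_like(remove_sorted)
        remove[sorted_idx] = remove_sorted       # back to vocabulary order
        logits = logits.masked_fill(remove, float("-inf"))
    return logits

torch.manual_seed(0)
logits = torch.log(torch.tensor([0.5, 0.3, 0.15, 0.05]))

probs = F.softmax(filter_logits(logits, top_p=0.7), dim=-1)
next_id = torch.multinomial(probs, num_samples=1).item()
print(probs, next_id)
```

With these logits, both `top_k=2` and `top_p=0.7` keep only the first two tokens, so the sampled id is always 0 or 1. Lowering the temperature below 1 shifts more probability onto the most likely token.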



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [None]:
class GPT(nn.Module):
    def __init__(self, n_embd, n_head):
        pass

    def forward(self, idx, targets=None):
        pass

    def generate(self, start_char, max_new_tokens, top_p, top_k, temperature):
        pass

### Training loop (15 points)

Implement the training loop.
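The general shape of such a loop is sketched below under heavy assumptions: random token ids stand in for the encoded text, an `nn.Embedding` lookup stands in for the GPT model, and the helper `sample_batch`, the learning rate and the step count are all illustrative. Only the structure (sample a batch, compute the loss, backpropagate, step the optimizer) is meant to carry over.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size, block_size, batch_size = 65, 32, 8

data = torch.randint(0, vocab_size, (10_000,))   # stand-in for the encoded text

def sample_batch():
    """Sample random windows; targets are the inputs shifted by one position."""
    starts = torch.randint(0, len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x.to(device), y.to(device)

model = nn.Embedding(vocab_size, vocab_size).to(device)   # stand-in for GPT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

losses = []
for step in range(200):
    xb, yb = sample_batch()
    logits = model(xb)                                     # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)                  # clear old gradients
    loss.backward()                                        # compute new gradients
    optimizer.step()                                       # update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Because the stand-in data is random, the loss can only fall towards the entropy of a uniform distribution over the vocabulary; on real text the model has actual structure to learn.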

In [None]:
model = GPT().to('cuda') # make sure you are running this on the GPU
max_iters = 5000

for iter in range(max_iters):
    pass

### Generate text


print some text that your model generates