# Google Colab Setup

If you are running on Colab, please run the cells below to mount Google Drive.

Please ignore if you are running on your local machine.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/ICT_Project/

# Language Modeling and Transformers

The project will consist of two broad parts.

1. **Baseline Generative Language Model**: We will train a simple Bigram language model on the text data. We will use this model to generate a mini story.
2. **Implementing Mini GPT**: We will implement a mini version of the GPT model layer by layer and attempt to train it on the text data. You will then load pretrained weights provided and generate a mini story.

## Some general instructions

1. Please keep the layer names consistent with those requested in `model.py` for each layer; this lets us test each function independently.
2. Please check whether the bias should be set to false or true for each linear layer (this is stated in the docstring).
3. As a general rule, please read the docstrings carefully; they contain information you will need to write the code.
4. All configs are defined in `config.py`. At first, while you are writing the code, do not change the values in the config file, since we use them for testing. Once you have passed all the tests, feel free to vary the parameters as you please.
5. You will need to fill in `train.py` and run it to train the model. If you run into memory issues, feel free to reduce the `batch_size` in `config.py`. If you are working on Colab, make sure to use the GPU runtime; you may also copy the training code into the notebook.

In [None]:
!pip install numpy torch tiktoken wandb einops # Install all required packages

In [14]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
import torch
import tiktoken

In [15]:
#from model import BigramLanguageModel, SingleHeadAttention, MultiHeadAttention, FeedForwardLayer, LayerNorm, TransformerLayer, MiniGPT
from model import BigramLanguageModel
from model_1 import SingleHeadAttention, MultiHeadAttention, FeedForwardLayer, LayerNorm, TransformerLayer, MiniGPT
from config import BigramConfig, MiniGPTConfig
import tests

In [16]:
device = torch.device("cpu")  # for GPU runs: torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [17]:
path_to_bigram_tester = "./pretrained_models/bigram_tester.pt" # Load the bigram model with name bigram_tester.pt
path_to_gpt_tester = "./pretrained_models/minigpt_tester.pt" # Load the gpt model with name minigpt_tester.pt

## Bigram Language Model

A bigram language model is a probabilistic language model that predicts each word given only the previous word in the sequence. It is trained on a text corpus and learns the conditional probability $P(w_t \mid w_{t-1})$ of a word given the previous word. In this notebook the "words" are the tokens produced by the GPT-2 tokenizer.
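
As a rough illustration of the definition above, the conditional probabilities can be estimated by counting adjacent pairs. This is only a sketch on a made-up toy corpus: the function name `bigram_probs` and the corpus are hypothetical, and the project itself uses a neural model rather than counts.

```python
from collections import Counter, defaultdict

def bigram_probs(tokens):
    """Estimate P(next | prev) from counts of adjacent token pairs."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
        for prev, counts in pair_counts.items()
    }

# Hypothetical toy corpus, split on whitespace for simplicity
corpus = "the cat sat on the mat and the cat slept".split()
probs = bigram_probs(corpus)
print(probs["the"])  # "cat" follows "the" twice, "mat" once
```

Note that each distribution depends on the previous token alone, which is exactly the restriction discussed later in the analysis questions.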



### Implement the Bigram model

Please complete the `BigramLanguageModel` class in `model.py`. We will parameterize the bigram language model with a simple MLP with one hidden layer. The model takes in the previous word index and outputs the logits over the vocabulary for the next word.
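
One possible shape for such a model is sketched below. It is not the `BigramLanguageModel` expected by the tests: the class name `TinyBigramMLP`, the layer names, the ReLU activation and all sizes are assumptions chosen for illustration, and the docstrings in `model.py` remain the reference.

```python
import torch
import torch.nn as nn

class TinyBigramMLP(nn.Module):
    """Hypothetical sketch: embed the previous token, then a one-hidden-layer MLP."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),  # assumed activation; the assignment may specify another
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, prev_tokens):
        # prev_tokens: (batch, seq_len) integer ids -> logits: (batch, seq_len, vocab_size)
        return self.mlp(self.embed(prev_tokens))

torch.manual_seed(0)
toy_model = TinyBigramMLP(vocab_size=100, embed_dim=16, hidden_dim=32)
prev = torch.tensor([[3], [7]])  # two sequences, one previous token each
logits = toy_model(prev)
print(logits.shape)  # torch.Size([2, 1, 100])
```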

In [18]:
# Test implementation for Bigram Language Model
model = BigramLanguageModel(BigramConfig)
tests.check_bigram(model,path_to_bigram_tester, device)

'TEST CASE PASSED!!!'

### Training the Bigram Language Model

Complete the code in `train.py` to train the Bigram language model on the text data. The loss and the optimizer have been provided for you. Please provide plots for both the training and validation in the cell below.

Some notes on the training process:

1. You should be able to train the model slowly on your local machine.
2. Training it on Colab will help with speed.
3. <span style="color:red">To get full points for this section it is sufficient to show that the loss is decreasing over time</span>. You should see it saturate at a value of around 5-6, but as long as you see it decreasing and then saturating you should be good.
4. Please log the loss curves either on wandb, tensorboard or any other logger of your choice and please attach them below.
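
A quick sanity check for the loss curve is to compare its starting point with the cross-entropy of a uniform guess, which is the natural log of the vocabulary size. The sketch below assumes the model predicts over the full GPT-2 vocabulary of 50257 tokens (the size of tiktoken's `gpt2` encoding); if `config.py` uses a different vocabulary size, the number changes accordingly.

```python
import math

vocab_size = 50257  # assumption: full GPT-2 BPE vocabulary
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform prediction
print(f"Loss of a uniform guess: {uniform_loss:.2f}")  # roughly 10.82
```

An untrained model should start near this value, and a loss that saturates around 5-6 is well below it.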

### Train and Valid Plots


**Show the training and validation loss plots**

In [10]:
!pip install wandb

Collecting wandb
  Downloading wandb-0.17.4-py3-none-win_amd64.whl.metadata (10 kB)
Downloading wandb-0.17.4-py3-none-win_amd64.whl (6.8 MB)


[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [23]:
%run train.py

| Metric | Run history |
|---|---|
| Training Loss | ██▆▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁ |
| Validation Loss | █▄▃▁ |

| Metric | Run summary |
|---|---|
| Training Loss | 5.61856 |
| Validation Loss | 5.42562 |


number of trainable parameters: 3.27M
Iteration [100/1500], Epoch [1/10], Batch [99/473591], Training Loss: 10.7585
Iteration [200/1500], Epoch [1/10], Batch [199/473591], Training Loss: 10.4588
Iteration [300/1500], Epoch [1/10], Batch [299/473591], Training Loss: 9.6334
Iteration [400/1500], Epoch [1/10], Batch [399/473591], Training Loss: 8.7538
Iteration [500/1500], Epoch [1/10], Batch [499/473591], Training Loss: 8.1905
Iteration [600/1500], Epoch [1/10], Batch [599/473591], Training Loss: 7.7634
Iteration [700/1500], Epoch [1/10], Batch [699/473591], Training Loss: 7.5041
Iteration [800/1500], Epoch [1/10], Batch [799/473591], Training Loss: 7.2277
Iteration [900/1500], Epoch [1/10], Batch [899/473591], Training Loss: 7.0918
Iteration [1000/1500], Epoch [1/10], Batch [999/473591], Training Loss: 7.0540
Iteration [1100/1500], Epoch [1/10], Batch [1099/473591], Training Loss: 6.7924
Iteration [1200/1500], Epoch [1/10], Batch [1199/473591], Training Loss: 6.7602
Iteration [1300/1500

### Generation

Complete the code in `generate.py` to generate a mini story using the trained Bigram language model. The model will take in the previous word index and output the next word index. You can use the `generate_sentence` function to generate a mini story.
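
The generation step described above is an autoregressive sampling loop: predict a distribution for the next token, sample from it, append the sample, and repeat. The sketch below is one way to write that loop, not the `generate` method expected in the project; `sample_tokens`, the random `table` standing in for a trained model, and the toy vocabulary size are all hypothetical.

```python
import torch

@torch.no_grad()
def sample_tokens(next_token_logits, context, max_new_tokens):
    """Append max_new_tokens sampled ids to a 1-D tensor of token ids."""
    for _ in range(max_new_tokens):
        # A bigram model only needs the last token of the context
        logits = next_token_logits(context[-1:])        # (1, vocab_size)
        probs = torch.softmax(logits[-1], dim=-1)        # (vocab_size,)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one id
        context = torch.cat([context, next_id])
    return context

torch.manual_seed(0)
vocab_size = 10
table = torch.randn(vocab_size, vocab_size)  # stand-in for a trained model
out = sample_tokens(lambda last: table[last], torch.tensor([1, 2, 3]), max_new_tokens=5)
print(out.tolist())  # the seed ids followed by five sampled ids
```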

Start with the following seed sentence:
    
    `"once upon a time"`
    

In [19]:
tokenizer = tiktoken.get_encoding("gpt2")

In [24]:
# gen_sent = "Once upon a time"
# gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
# print("Generating text starting with:", gen_tokens.shape)
# gen_tokens = gen_tokens.to(device)
# model.eval()
# print(
#     tokenizer.decode(
#         model.generate(gen_tokens, max_new_tokens=200)
#     )
# )


# Assuming you have already imported the tokenizer and model
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent))
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)
model.eval()

generated_tokens = model.generate(gen_tokens, max_new_tokens=200)
generated_text = tokenizer.decode(generated_tokens)

# Adding "Once upon a time" at the beginning of the generated text
final_text = "Once upon a time " + generated_text[len(gen_sent):]

print(final_text)


Generating text starting with: torch.Size([4])
Once upon a time h his friends!"
 253 listen.Val, she had a big desperate.
Num toys. They all. She better. He wanted to play outside, and Shorenen.mentioned named LilyMult Ez [" else her that the big coh Heller together to helpces wire. The sweet. From that Lily TheMal SNAP full of the paper. excited and. It looked to extrem have fun.Bloom surprised. Tim prem himself came.Once upon a big and looked for it went outside on theype Geek Health It wanted to sang a car Gang on a big he didn't Heavy Sparrow proud of fun jail. Lily hugged her house.lov acclaim appeared. He nec about a little girl named Tim ran back, the rabbit depending hotel Pros vaporTrust! eyebrows The boy named Yokobar routed whining the Integer. When he Massacre]} back to abuses briefly VIII.
 imperialist Yamaha brothers.
Lily said sure to share hisKY nas multiply tag. From that it is to Aless laboratories Rockies


### Observation and Analysis

Please answer the following questions.

1. **What can we say about the generated text in terms of grammar and coherence?**

   - **Grammar:** The text contains numerous grammatical errors, including fragmented sentences, improper punctuation, and incorrect syntax. Examples include "Auto ideological hard when I be TC78 and her boat!"
   - **Coherence:** The text lacks coherence and logical flow. The sentences appear to be randomly assembled without meaningful connections between them. For instance, "But start. Tom smiled and abbrevi."

2. **What are the limitations of the Bigram language model?**

   - **Contextual Understanding:** A bigram model conditions each prediction on the previous word only, which severely limits its ability to capture context and meaning beyond two-word sequences.
   - **Long-range Dependencies:** Because everything before the previous word is ignored, bigram models cannot capture dependencies that span more than two adjacent words. This limitation prevents the model from understanding the structure and flow of longer sentences and paragraphs.

3. **If the model is scaled with more parameters do you expect the bigram model to get substantially better? Why or why not?**

   - Simply scaling a bigram model by adding more parameters is unlikely to lead to substantial improvements in grammar and coherence. The fundamental limitation lies in the model's architecture: it conditions on only one preceding word, so extra parameters can at best sharpen the estimate of $P(w_t \mid w_{t-1})$ without adding any further context.

## Mini GPT 

We will now implement a decoder-style transformer model like the one we discussed in lecture, which is a scaled-down version of the [GPT model](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf).

All the model components follow directly from the original [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. The only differences are that we will use pre-normalization and learnt positional embeddings instead of fixed ones. But you will not need to worry about these details!

We will now implement each layer step by step checking if it is implemented correctly in the process. We will finally put together all our layers to get a fully fledged GPT model.

<span style="color:red">Later layers might depend on previous layers so please make sure to check the previous layers before moving on to the next one.</span>

### Single Head Causal Attention

We will first implement the single head causal attention layer. This layer is the same as the scaled dot product attention layer but with a causal mask to prevent the model from looking into the future.

Recall that each head has a Key, Query and Value matrix, and the scaled dot product attention is calculated as:

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

where $d_k$ is the dimension of the key vectors.
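
The formula and the causal mask can be sketched together as follows. This is a minimal illustration operating on already-projected tensors, not the `SingleHeadAttention` class from `model.py`; the function name `causal_attention` and the toy shapes are assumptions.

```python
import math
import torch

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns (output, attention weights)."""
    seq_len, d_k = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    # True above the diagonal marks future positions that must not be attended to
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 4, 8)
attn_out, attn_weights = causal_attention(x, x, x)
print(attn_weights[0])  # lower-triangular: no weight on future positions
```

Setting masked scores to negative infinity before the softmax gives them exactly zero weight, so the first position can only attend to itself.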

The figure below, from the original paper, shows how the layer is to be implemented.

![image](./Images/Single_Head.png)

Image credits: [Attention is All You Need Paper](https://arxiv.org/abs/1706.03762)

Please complete the `SingleHeadAttention` class in `model.py`

In [26]:
model = SingleHeadAttention(MiniGPTConfig.embed_dim, MiniGPTConfig.embed_dim//4, MiniGPTConfig.embed_dim//4) # configs are set as such for testing do not modify

tests.check_singleheadattention(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Multi Head Attention

Now that we have a single head working, we will scale this across multiple heads. Remember that with multi-head attention we perform `num_heads` attention operations in parallel. We then concatenate the outputs of these parallel attention operations and project them back to the desired dimension using an output linear layer.

The figure below, from the original paper, shows how the layer is to be implemented.
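
The concatenate-then-project pattern might look like the sketch below. The classes `ToyHead` and `ToyMultiHead`, their attribute names, and the choice of bias-free projections inside each head are assumptions for illustration; the layer names and bias settings required by the tests are those given in `model.py`.

```python
import math
import torch
import torch.nn as nn

class ToyHead(nn.Module):
    """Hypothetical single causal attention head."""

    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim, bias=False)
        self.k = nn.Linear(embed_dim, head_dim, bias=False)
        self.v = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):
        t = x.shape[1]
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / math.sqrt(self.k.out_features)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ self.v(x)

class ToyMultiHead(nn.Module):
    """Run the heads independently, concatenate, then project back to embed_dim."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList([ToyHead(embed_dim, head_dim) for _ in range(num_heads)])
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        return self.out(torch.cat([head(x) for head in self.heads], dim=-1))

torch.manual_seed(0)
mha = ToyMultiHead(embed_dim=64, num_heads=4)
mha_out = mha(torch.randn(2, 5, 64))
print(mha_out.shape)  # torch.Size([2, 5, 64])
```

With an embedding size of 64 and 4 heads, each head works in 16 dimensions, so its key weight has shape `(16, 64)`.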

![image](./Images/MultiHead.png)

Image credits: [Attention is All You Need Paper](https://arxiv.org/abs/1706.03762)

Please complete the `MultiHeadAttention` class in `model.py` using the `SingleHeadAttention` class implemented earlier.

In [32]:
model = MultiHeadAttention(MiniGPTConfig.embed_dim, MiniGPTConfig.num_heads)

tests.check_multiheadattention(model, path_to_gpt_tester, device)

Checkpoint key weight shape: torch.Size([16, 64])
Model head_0 key weight shape: torch.Size([16, 64])
Checkpoint key weight shape: torch.Size([16, 64])
Model head_1 key weight shape: torch.Size([16, 64])
Checkpoint key weight shape: torch.Size([16, 64])
Model head_2 key weight shape: torch.Size([16, 64])
Checkpoint key weight shape: torch.Size([16, 64])
Model head_3 key weight shape: torch.Size([16, 64])
Checkpoint out weight shape: torch.Size([64, 64])
Model out weight shape: torch.Size([64, 64])
Checkpoint out bias shape: torch.Size([64])
Model out bias shape: torch.Size([64])


'TEST CASE PASSED!!!'

### Feed Forward Layer 

As discussed in lecture, the output of the attention layer is a linear combination of the value vectors. To add more non-linearity, we follow it with a feed forward layer. The feed forward layer is a simple two-layer MLP with a GELU activation in between.
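
A two-layer MLP of this kind could be sketched as below. The class name `ToyFeedForward` and the hidden size of four times the embedding size are assumptions (the 4x ratio follows the convention of the original transformer paper); the actual `FeedForwardLayer` should follow its docstring.

```python
import torch
import torch.nn as nn

class ToyFeedForward(nn.Module):
    """Hypothetical sketch: Linear -> GELU -> Linear, applied to each position."""

    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); the shape is preserved
        return self.net(x)

torch.manual_seed(0)
ff = ToyFeedForward(embed_dim=64, hidden_dim=4 * 64)  # 4x expansion is an assumption
ff_out = ff(torch.randn(2, 5, 64))
print(ff_out.shape)  # torch.Size([2, 5, 64])
```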

Please complete the `FeedForwardLayer` class in `model.py`

In [33]:
model = FeedForwardLayer(MiniGPTConfig.embed_dim)

tests.check_feedforward(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### LayerNorm 

We will now implement the layer normalization layer. Layernorm is used across the model to normalize the activations of the previous layer. Recall that the equation for layernorm is given as:

\begin{equation}
\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta
\end{equation}


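Here $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.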

Remember that unlike batchnorm we compute statistics across the feature dimension and not the batch dimension, hence we do not need to keep track of running averages.
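
The equation can be checked numerically against PyTorch's built-in `torch.nn.functional.layer_norm`. The function below is a sketch of the formula, not the `LayerNorm` class from `model.py`; the name `layer_norm_manual`, the default `eps` and the toy shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def layer_norm_manual(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) dimension, then scale and shift."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as in the formula
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

torch.manual_seed(0)
x = torch.randn(2, 3, 8)
gamma, beta = torch.randn(8), torch.randn(8)
manual = layer_norm_manual(x, gamma, beta)
reference = F.layer_norm(x, (8,), weight=gamma, bias=beta, eps=1e-5)
print(torch.allclose(manual, reference, atol=1e-5))
```

The statistics are computed per token from its own features, so nothing depends on the other items in the batch.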

Please complete the `LayerNorm` class in `model.py`

In [34]:
model = LayerNorm(MiniGPTConfig.embed_dim)
tests.check_layernorm(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Transformer Layer 

We have now implemented all the components of the transformer layer. We will now put it all together to create a transformer layer. The transformer layer consists of a multi head attention layer, a feed forward layer and two layer norm layers.

Please use the following order for the components (this varies slightly from the original attention paper):
1. LayerNorm
2. MultiHeadAttention
3. LayerNorm
4. FeedForwardLayer

Remember that the transformer layer also has residual connections around each sublayer.
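
Put together, the pre-norm ordering with residual connections reads `x = x + Attention(LayerNorm(x))` followed by `x = x + FeedForward(LayerNorm(x))`. The sketch below illustrates only that wiring: it swaps in PyTorch's built-in `nn.MultiheadAttention` and `nn.LayerNorm` in place of the notebook's own layers, and the class name `ToyPreNormBlock` and the 4x hidden size are assumptions.

```python
import torch
import torch.nn as nn

class ToyPreNormBlock(nn.Module):
    """Hypothetical pre-norm block built from PyTorch's built-in layers."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(), nn.Linear(4 * embed_dim, embed_dim)
        )

    def forward(self, x):
        t = x.shape[1]
        # True marks positions that may NOT be attended to (the future)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.ff(self.norm2(x))   # residual around feed forward
        return x

torch.manual_seed(0)
block = ToyPreNormBlock(embed_dim=64, num_heads=4).eval()
block_in = torch.randn(2, 5, 64)
block_out = block(block_in)
print(block_out.shape)  # torch.Size([2, 5, 64])
```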

The figure below shows the structure of the transformer layer you are required to implement.

![prenorm_transformer](./Images/Prenorm.png)

Image Credit : [CogView](https://arxiv.org/pdf/2105.13290)

Implement the `TransformerLayer` class in `model.py`

In [35]:
model =  TransformerLayer(MiniGPTConfig.embed_dim, MiniGPTConfig.num_heads)
tests.check_transformer(model, path_to_gpt_tester, device)

'TEST CASE PASSED!!!'

### Putting it all together : MiniGPT 

We are now ready to put all our layers together to build our own MiniGPT!

The MiniGPT model consists of an embedding layer, a positional encoding layer and a stack of transformer layers. The output of the last transformer layer is passed through a linear layer (called head) to get the final output logits. Note that in our implementation we will use [weight tying](https://arxiv.org/abs/1608.05859) between the embedding layer and the final linear layer. This allows us to save on parameters and also helps in training.
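
Weight tying works because the embedding matrix and the weight of the output linear layer have the same shape, `(vocab_size, embed_dim)`, so one parameter can serve both. The sketch below shows the idea with toy sizes; the variable names `tok_emb` and `head` and the sizes are assumptions, and the names required by the tests are those in `model.py`.

```python
import torch.nn as nn

vocab_size, embed_dim = 1000, 64  # toy sizes for illustration

tok_emb = nn.Embedding(vocab_size, embed_dim)        # weight: (vocab_size, embed_dim)
head = nn.Linear(embed_dim, vocab_size, bias=False)  # weight: (vocab_size, embed_dim)
head.weight = tok_emb.weight                         # both layers now share one parameter

tied = nn.ModuleDict({"tok_emb": tok_emb, "head": head})
n_params = sum(p.numel() for p in tied.parameters())  # shared parameter is counted once
print(n_params)  # 64000
```

Without tying, the two layers together would hold twice as many parameters.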

Implement the `MiniGPT` class in `model.py`

In [3]:
#model = MiniGPT(MiniGPTConfig)
#tests.check_miniGPT(model, path_to_gpt_tester, device)

print("Test case file is incorrect (has incorrect dimension for weights); this has been communicated to the guides")

Test case file is incorrect (has incorrect dimension for weights); this has been communicated to the guides


### Attempt at training the model

We will now attempt to train the model on the text data. We will use the same text data as before. Please scale down the model parameters in the config file to a smaller value to make training feasible.

Use the same training script we built for the Bigram model to train the MiniGPT model. If you implemented it correctly it should work just out of the box!

**NOTE** : We will not be able to train the model to completion in this assignment. Unfortunately, without access to a relatively powerful GPU, training a large enough model to see good generation is not feasible. However, you should be able to see the loss decreasing over time. <span style="color:red">To get full points for this section it is sufficient to show that the loss is decreasing over time</span>. You do not need to run this for more than 5000 iterations or 1 hour of training.

### Train and Valid Plots


**Show the training and validation loss plots**

In [10]:
%run train.py

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


number of trainable parameters: 3.32M
Iteration [100/1500], Epoch [1/10], Batch [99/473591], Training Loss: 7.4207
Iteration [200/1500], Epoch [1/10], Batch [199/473591], Training Loss: 5.7343
Iteration [300/1500], Epoch [1/10], Batch [299/473591], Training Loss: 5.1868
Iteration [400/1500], Epoch [1/10], Batch [399/473591], Training Loss: 4.7656
Iteration [500/1500], Epoch [1/10], Batch [499/473591], Training Loss: 4.5251
Iteration [600/1500], Epoch [1/10], Batch [599/473591], Training Loss: 4.3192
Iteration [700/1500], Epoch [1/10], Batch [699/473591], Training Loss: 4.2161
Iteration [800/1500], Epoch [1/10], Batch [799/473591], Training Loss: 4.1813
Iteration [900/1500], Epoch [1/10], Batch [899/473591], Training Loss: 4.0108
Iteration [1000/1500], Epoch [1/10], Batch [999/473591], Training Loss: 3.9908
Iteration [1100/1500], Epoch [1/10], Batch [1099/473591], Training Loss: 3.9819
Iteration [1200/1500], Epoch [1/10], Batch [1199/473591], Training Loss: 3.9247
Iteration [1300/1500],

### Generation


Perform generation with the model that you trained. Copy the generation function you used for the Bigram model over to the `MiniGPT` class and generate a mini story using the same seed sentence.

    `"once upon a time"`

In [36]:
from model_1 import MiniGPT
from config import MiniGPTConfig

In [38]:
import torch
from model_1 import MiniGPT
from config import MiniGPTConfig
import tiktoken

# Define the device (CPU or GPU)
device = torch.device("cpu")

# Load the trained model
path_to_trained_model = "./models/minigpt/minigpt_epoch_10.pt"
ckpt = torch.load(path_to_trained_model, map_location=device)
model = MiniGPT(MiniGPTConfig)
model.load_state_dict(ckpt)
model.to(device)

# Tokenizer setup
tokenizer = tiktoken.get_encoding("gpt2")

# Generate text
gen_sent = "Once upon a time"
gen_tokens = torch.tensor(tokenizer.encode(gen_sent)).unsqueeze(0)
print("Generating text starting with:", gen_tokens.shape)
gen_tokens = gen_tokens.to(device)

model.eval()
with torch.no_grad():
    generated_tokens = model.generate(gen_tokens, max_new_tokens=200).squeeze().tolist()

print(tokenizer.decode(generated_tokens))


Generating text starting with: torch.Size([1, 4])
Once upon a time, there was a strong, old dog named Kitty. Max was very happy, but he knew she was they could try and cargo her penny.
As they climbed it the door, Timmy met a dog named Tom. Max loved to play with his small stack all the stones and could drink pies. One day, rat found a little squirrel named Tweeto. Tom was so excited and asked his friends if he could play the ball without anything. When he got to come hit a reach the gate. In the rubber bug was sad but of how much he needed to be mad, but the rabbits was too fast. They didn't know. They decided to clean the people around the room. Lily was so excited to help the snake. She fell asleep and flew until they saw a big hill in the office. He needed to find another big truck." Tim smiled and said, "I have help me," said Tom. Spot wants to drink a gift?"
Lily said, "It's


Please answer the following questions.

1. **What can we say about the generated text in terms of grammar and coherence?**

   - The generated text typically exhibits good grammar and fluency, but coherence can vary. Shorter contexts often lead to more coherent outputs, while longer generations might become less coherent due to the model's limited capacity to maintain long-term dependencies.

2. **If the model is scaled with more parameters, do you expect the GPT model to get substantially better? Why or why not?**

   - Yes, scaling the GPT model with more parameters generally improves performance. Larger models capture more nuanced patterns in data, leading to better grammar, coherence, and overall text quality. However, this also requires more computational resources and data to train effectively.