# Coding an LLM architecture

- Models like GPT, Gemma, Phi, Mistral, and Llama generate words sequentially and are based on the decoder part of the original transformer architecture, which is why they are referred to as decoder-only LLMs
- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code
- In this notebook, we consider embedding and model sizes akin to the smallest GPT-2 model (124 million parameters)
- Models like Llama and others are very similar to this model, since they are all based on the same core concepts

<img src="./metadata/05.png" alt="tokenization example" style="display: block; margin: 0 auto; width:600px; height:auto;" />

**The difference between the architectures**
**GPT vs Llama**

<img src="./metadata/06.png" alt="tokenization example" style="display: block; margin: 0 auto; width:800px; height:auto;" />


In [1]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.0,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}
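
To make the 124 million figure concrete, the parameter count can be estimated from the configuration alone. The following is a rough sketch, not a readout of the actual model: it assumes the standard GPT-2 block layout (a 4x feed-forward expansion, biases on the attention output projection and feed-forward layers, and two LayerNorms per block with scale and shift). The `TransformerBlock` imported from `supplementary` may differ, and the helper name `count_gpt_params` is hypothetical.

```python
# Rough parameter estimate from the config alone.
# Assumption: standard GPT-2 block layout; not read from the notebook's TransformerBlock.
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
       "n_heads": 12, "n_layers": 12, "qkv_bias": False}

def count_gpt_params(cfg):
    d, v = cfg["emb_dim"], cfg["vocab_size"]
    tok_emb = v * d                                   # token embedding table
    pos_emb = cfg["context_length"] * d               # positional embedding table
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    out_proj = d * d + d                              # attention output projection + bias
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)       # two linear layers with biases
    norms = 2 * (2 * d)                               # two LayerNorms (scale + shift)
    block = qkv + out_proj + ffn + norms
    final_norm = 2 * d
    out_head = d * v                                  # output projection, no bias
    total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head
    return total, total - out_head                    # (untied, tied) counts

total, tied = count_gpt_params(cfg)
size_mb = total * 4 / (1024 ** 2)                     # 4 bytes per float32 parameter

print(f"Total parameters: {total:,}")
print(f"With the output head tied to the token embedding: {tied:,}")
print(f"float32 size: {size_mb:.2f} MB")
```

Under these assumptions the untied total is about 163 million. The 124 million figure corresponds to reusing the token embedding matrix as the output layer (weight tying), as the original GPT-2 does, whereas a model with a separate output head carries the larger count.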

### 1. Coding the GPT model

- We are almost there: now let's plug the transformer block into the architecture we coded at the very beginning of this notebook so that we obtain a usable GPT architecture
- Note that the transformer block is repeated multiple times; in the case of the smallest 124M GPT-2 model, we repeat it 12 times

<img src="./metadata/07.png" alt=" " style="display: block; margin: 0 auto; width:700px; height:auto;" />

In [10]:
import torch
import torch.nn as nn 
from supplementary import TransformerBlock, LayerNorm

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg['vocab_size'], cfg['emb_dim'])
        self.pos_emb = nn.Embedding(cfg['context_length'], cfg['emb_dim'])
        self.drop_emb = nn.Dropout(cfg['drop_rate'])

        # Transformer blocks
        self.trf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg['n_layers'])])

        self.final_norm = LayerNorm(cfg['emb_dim'])
        self.out_head = nn.Linear(
            cfg['emb_dim'], cfg['vocab_size'], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embds = self.tok_emb(in_idx)
        # Positions 0..seq_len-1, created on the same device as the input
        pos_embds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embds + pos_embds  # Shape: [batch_size, num_tokens, emb_dim]

        # Pass x through dropout, the transformer blocks, and the final norm layer,
        # following the GPT architecture shown in the figure above
        x = self.drop_emb(x)       # Dropout
        x = self.trf_blocks(x)     # The transformer blocks
        x = self.final_norm(x)     # The final norm layer
        logits = self.out_head(x)  # Logits (softmax is applied to these later)
        return logits
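
The forward pass above is easier to follow with the tensor shapes written out. The sketch below only does shape bookkeeping in plain Python (no tensors are created); the helper `trace_shapes` is hypothetical, and it assumes that the transformer blocks and normalization layers preserve the shape of their input, which is the usual convention.

```python
# Shape bookkeeping for the forward pass (pure Python, no tensors).
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}

def trace_shapes(batch_size, seq_len, cfg):
    # The positional embedding table has only context_length rows,
    # so positions beyond that cannot be looked up.
    if seq_len > cfg["context_length"]:
        raise ValueError("sequence is longer than context_length")
    return {
        "in_idx": (batch_size, seq_len),
        "tok_embds": (batch_size, seq_len, cfg["emb_dim"]),
        "pos_embds": (seq_len, cfg["emb_dim"]),       # broadcast over the batch
        "x": (batch_size, seq_len, cfg["emb_dim"]),   # assumed unchanged by blocks and norm
        "logits": (batch_size, seq_len, cfg["vocab_size"]),
    }

shapes = trace_shapes(batch_size=2, seq_len=4, cfg=cfg)
for name, shape in shapes.items():
    print(f"{name:10s} {shape}")
```

For a batch of 2 sequences with 4 tokens each, the final entry is `(2, 4, 50257)`: one score per vocabulary entry for every token position.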

Using the configuration of the 124M parameter model, we can now instantiate this GPT model with random initial weights. First, we prepare a small batch of tokenized text to use as input:

In [11]:
import torch
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "Every effort moves you"
txt2 = "everyday holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))

# converting batch into tensor so that it can be passed into model as input
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[ 6109,  3626,  6100,   345],
        [16833,   820,  6622,   257]])


**Initialize the GPT-2 model**

In [13]:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

#Output
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

Input batch:
 tensor([[ 6109,  3626,  6100,   345],
        [16833,   820,  6622,   257]])

Output shape: torch.Size([2, 4, 50257])
tensor([[[ 0.0642,  0.2044, -0.1695,  ...,  0.1789,  0.2192, -0.5815],
         [ 0.3774, -0.4255, -0.6587,  ..., -0.2505,  0.4655, -0.2576],
         [ 0.8900, -0.1377,  0.1475,  ...,  0.1777, -0.1202, -0.1890],
         [-0.9728,  0.0973, -0.2542,  ...,  1.1035,  0.3764, -0.5901]],

        [[-0.0755, -0.5548,  0.3469,  ...,  0.1516,  0.5015, -1.1592],
         [-0.0472, -0.0415,  0.5132,  ..., -0.1308,  0.1974,  0.1685],
         [ 0.8582,  0.8072,  0.0093,  ...,  0.8838,  0.1597, -0.1031],
         [-0.0436,  0.0515,  0.5518,  ...,  1.1580, -0.2668,  0.0348]]],
       grad_fn=<UnsafeViewBackward0>)


### 2. Generating text

- The following `generate_simple_text` function implements greedy decoding, which is a simple and fast method to generate text
- In greedy decoding, at each step, the model chooses the word (or token) with the highest probability as its next output. Because softmax preserves the ordering of its inputs, the highest logit corresponds to the highest probability, so we technically wouldn't even have to compute the softmax function explicitly
- The figure below depicts how the GPT model, given an input context, generates the next word token.

<img src="./metadata/08.png" alt=" " style="display: block; margin: 0 auto; width:900px; height:auto;" />
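
The claim that the highest logit also has the highest probability can be checked with a few lines of plain Python. This is a minimal sketch with made-up logit values; the helpers `softmax` and `argmax` are written here for illustration and are not the PyTorch functions.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest element
    return max(range(len(values)), key=lambda i: values[i])

logits = [0.5, -1.2, 2.3, 0.1]  # made-up values for illustration
probas = softmax(logits)

print("argmax of logits:       ", argmax(logits))
print("argmax of probabilities:", argmax(probas))
```

Both calls select the same index, because softmax is a monotonically increasing transformation of each logit relative to the others.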

In [29]:
# Iterative generation: the function produces one new token per loop iteration

def generate_simple_text(model, idx, context_size, max_new_tokens=6):
    # idx is a (batch, n_tokens) tensor of token indices in the current context
    for _ in range(max_new_tokens):
        # Crop the current context if it exceeds the supported context size,
        # keeping only the most recent tokens.
        # E.g., if the LLM supports only 5 tokens and the context holds 10,
        # only the last 5 tokens are used as context.

        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step:
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities (optional for greedy decoding,
        # since argmax over the logits selects the same token)
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append the selected index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
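
The crop-predict-append loop can also be traced by hand on a toy example. The sketch below replaces the GPT model with a hypothetical stand-in, `toy_logits`, that always favors the token following the last one in a 5-token vocabulary; `greedy_generate` is likewise a simplified, single-sequence illustration using Python lists rather than tensors.

```python
def toy_logits(window, vocab_size=5):
    # Hypothetical stand-in for a model: gives the highest score
    # to the token that follows the last token in the window.
    favored = (window[-1] + 1) % vocab_size
    return [1.0 if t == favored else 0.0 for t in range(vocab_size)]

def greedy_generate(context, max_new_tokens, context_size):
    tokens = list(context)
    windows = []  # record what the "model" actually saw at each step
    for _ in range(max_new_tokens):
        window = tokens[-context_size:]   # keep only the most recent tokens
        windows.append(window)
        logits = toy_logits(window)
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        tokens.append(next_token)
    return tokens, windows

tokens, windows = greedy_generate([0, 1], max_new_tokens=4, context_size=3)
print("tokens: ", tokens)
print("windows:", windows)
```

With a context size of 3, the full sequence keeps growing while the window passed to the model never exceeds three tokens; the last window is `[2, 3, 4]` even though five tokens had been produced by then.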

### 3. Generate sample text

In [30]:
model.eval();  # disable dropout

In [31]:
start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])


In [32]:
out = generate_simple_text(
    model=model,
    idx=encoded_tensor, 
    max_new_tokens=6, 
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))

Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])
Output length: 10


In [33]:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)

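Hello, I am Featureiman Byeswickattribute argue

- The generated text is incoherent because the model has not been trained yet; its weights are still the random initial values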
