# 4. Pretraining LLMs

In [4]:
import tiktoken
import torch
import matplotlib 
import numpy

### 1. Using GPT to generate text

- Initialize the GPT model.
- We use a dropout rate of 0.1, although it is relatively common to train LLMs without dropout nowadays.
- Modern LLMs typically don't use bias vectors in the `nn.Linear` layers for the query, key, and value matrices, hence we set `"qkv_bias": False`.
- We reduce the context length (`context_length`) to only 256 tokens to lower the computational resource requirements for training the model, whereas the original 124-million-parameter GPT-2 model used 1024 tokens.
- Next, we use the `generate_text_simple` function from the previous chapter to generate text.
- In addition, we define two convenience functions, `text_to_token` and `token_ids_to_text`, for converting between text and token ID representations; we use them throughout this chapter.
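
As a rough picture of what a function like `generate_text_simple` does, the sketch below assumes greedy decoding: at each step the context is cropped to the last `context_size` tokens, the model scores every vocabulary entry, and the highest-scoring token is appended. `generate_greedy` and `toy_scores` are hypothetical stand-ins written for illustration; they are not the functions from `supplementary`, whose implementation may differ.

```python
def generate_greedy(next_token_scores, token_ids, max_new_tokens, context_size):
    """Append max_new_tokens token IDs, always picking the top-scoring one."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_size:]          # crop to the supported context
        scores = next_token_scores(context)    # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(next_id)
    return ids


def toy_scores(context, vocab_size=5):
    """Hypothetical 'model': prefers (last token + 1) modulo vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = generate_greedy(toy_scores, [3], max_new_tokens=4, context_size=2)
print(generated)  # [3, 4, 0, 1, 2]
```

A real model would return learned scores (logits) instead of the hand-written rule in `toy_scores`, but the loop structure would stay the same under the greedy-decoding assumption.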

In [5]:
import torch 
from supplementary import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval();  # Disable dropout during inference

<img src="./metadata/09.png" alt=" " style="display: block; margin: 0 auto; width:800px; height:auto;" />
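
To make the effect of the shorter context concrete, here is a back-of-the-envelope sketch. It assumes a learned positional-embedding table of shape `(context_length, emb_dim)` and one square attention-score matrix per head, which is the usual GPT-2-style layout; `GPTModel` from `supplementary` may differ in detail, and `context_cost` is a hypothetical helper.

```python
emb_dim = 768
n_heads = 12


def context_cost(context_length, emb_dim, n_heads):
    """Return (positional-embedding parameters, attention scores per layer)."""
    pos_emb_params = context_length * emb_dim
    # One (context_length x context_length) score matrix per head, per sequence
    attn_scores_per_layer = n_heads * context_length ** 2
    return pos_emb_params, attn_scores_per_layer


short = context_cost(256, emb_dim, n_heads)
full = context_cost(1024, emb_dim, n_heads)

print("Positional-embedding parameters:", short[0], "vs", full[0])
print("Attention scores per layer:", short[1], "vs", full[1])
print("Attention-score reduction factor:", full[1] // short[1])  # 16
```

Under these assumptions, cutting the context by a factor of 4 shrinks the attention-score matrices by a factor of 16, which is where most of the savings come from.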


In [6]:
import tiktoken
from supplementary import generate_text_simple

def text_to_token(text, tokenizer):
    # The GPT-2 end-of-text special token is spelled '<|endoftext|>'
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # Add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # Remove batch dim
    return tokenizer.decode(flat.tolist())

In [7]:
start_context = "every effor moves you"
tokenizer = tiktoken.get_encoding('gpt2')

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token(tokenizer=tokenizer, text=start_context),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text \n", token_ids_to_text(token_ids, tokenizer))

Output text 
 every effor moves you (% LGBT Telegram Superman communities Observatoryelse Constant berelevision


- Well, that is one way to generate text.
- As we can see above, the model does not produce good text because it has not been trained yet.
- How do we measure or capture what "good text" is, in numeric form, so that we can track it during training?
- The following sections introduce a loss metric for the generated outputs that we can use to measure training progress.
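
One common way to put a number on "good text", assumed here rather than taken from the notes above, is the cross-entropy loss: the average negative log-probability the model assigns to the correct next token. The sketch below uses plain Python and made-up probabilities; `cross_entropy` is a hypothetical helper, not a function from this chapter.

```python
import math


def cross_entropy(target_probs):
    """Average negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


vocab_size = 50257

# A model that spreads probability evenly over the whole vocabulary
uniform_loss = cross_entropy([1 / vocab_size] * 4)

# Made-up probabilities for a model that is usually confident and correct
confident_loss = cross_entropy([0.9, 0.8, 0.95, 0.85])

print(f"Uniform guess:   {uniform_loss:.2f}")
print(f"Confident model: {confident_loss:.2f}")
```

A model that guesses uniformly over 50,257 tokens scores about 10.82 (the natural log of the vocabulary size), so a freshly initialized model should start near that value and move toward zero as training improves its predictions.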

### 2. Preparing the dataset loaders
 
- We use a relatively small dataset for training the LLM (in fact, only one short story).

In [10]:
with open("datasets/the-verdict.txt", "r", encoding="utf-8") as file:
    text_data = file.read()

# First 50 characters
print(text_data[:50])

I HAD always thought Jack Gisburn rather a cheap g


In [11]:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters:", total_characters)
print("Tokens:", total_tokens)

Characters: 20479
Tokens: 5145
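
To get a feel for how small this dataset is, the sketch below turns the counts above into a number of training windows. It assumes a sliding-window scheme in which each sample holds `max_length` input tokens plus targets shifted by one position; the actual data loader is not shown in this section, so `count_windows` and the stride values are illustrative assumptions.

```python
total_characters = 20479
total_tokens = 5145
context_length = 256


def count_windows(total_tokens, max_length, stride):
    """Count start positions that leave room for inputs and shifted targets."""
    return len(range(0, total_tokens - max_length, stride))


chars_per_token = total_characters / total_tokens
non_overlapping = count_windows(total_tokens, context_length, stride=context_length)
overlapping = count_windows(total_tokens, context_length, stride=128)

print(f"Characters per token: {chars_per_token:.2f}")       # 3.98
print("Windows with stride 256:", non_overlapping)          # 20
print("Windows with stride 128:", overlapping)              # 39
```

With only about 20 non-overlapping windows of 256 tokens, this text is useful for demonstrating the training loop, but far too small to train a capable model.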
