<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/GenAI/BasicLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Small Language Model in Google Colab

## 1. Setup and Installation
First, we need to install the required libraries:

In [1]:
# Install the necessary libraries
!pip install transformers datasets torch accelerate

# transformers: Hugging Face's library for state-of-the-art NLP models
# datasets: Library for easily accessing and processing datasets
# torch: PyTorch deep learning framework
# accelerate: Library for device placement and distributed training (used by Trainer)


In [2]:
# Import required libraries and set up GPU
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# This is important because training LLMs requires significant computational resources
# Even a small model will train much faster on GPU than CPU

Using device: cpu


## 2. Model and Tokenizer Initialization
We'll use GPT-2 small, which has 124M parameters:

In [3]:
# Load the model and tokenizer
model_name = "gpt2"  # This is the smallest GPT-2 model
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Explanation:
# - GPT2Tokenizer: Converts text to tokens that the model can understand
# - GPT2LMHeadModel: The actual language model with a language modeling head
# - .to(device): Moves the model to GPU if available
# - pad_token: Required for batching sequences of different lengths
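
To make the padding point concrete, here is a minimal sketch in plain Python of what padding a batch involves. The helper `pad_batch` is hypothetical and only mimics the idea; it is not the `transformers` implementation. The value 50256 is GPT-2's end-of-text token ID, which the cell above reuses as the padding token.

```python
PAD_ID = 50256  # GPT-2's <|endoftext|> ID, reused here as the padding ID

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token IDs to a common length and build attention masks.

    Hypothetical helper for illustration only.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[10, 11, 12], [20]])
print(ids)
print(mask)
```

The attention mask is what lets the model tell real tokens from padding, which matters here because the padding ID is also a valid token ID.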


## 3. Data Preparation
Let's load and prepare a small dataset (Shakespeare text as an example):

In [4]:
# Download the Shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

from torch.utils.data import Dataset

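class TextDataset(Dataset):
    def __init__(self, file_path, tokenizer, block_size=128):
        # Read the text file
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()

        # Tokenize the entire text in one pass. The tokenizer may warn that the
        # sequence exceeds the model's maximum length; it is split into blocks below.
        token_ids = tokenizer(text)['input_ids']

        # Split the token stream into fixed-size blocks, dropping the short
        # remainder. Chunking manually avoids relying on return_overflowing_tokens,
        # which behaves differently for slow and fast tokenizers.
        self.examples = [
            token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        # The data collator adds the attention mask and the labels
        return {'input_ids': torch.tensor(self.examples[idx], dtype=torch.long)}

# Create the dataset
dataset = TextDataset('input.txt', tokenizer)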


--2025-01-06 06:55:22--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-01-06 06:55:23 (27.5 MB/s) - ‘input.txt’ saved [1115394/1115394]
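
One simple way to picture how a long token stream becomes fixed-length training examples is the toy sketch below. The function `chunk_tokens` and the integer "tokens" are hypothetical stand-ins; real token IDs come from the tokenizer, and dropping the leftover tokens is just one possible choice.

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat list of token IDs into full blocks, dropping the remainder."""
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

stream = list(range(10))                  # stand-in for real token IDs
blocks = chunk_tokens(stream, block_size=4)
print(len(blocks), "blocks:", blocks)     # the last 2 tokens do not fill a block
```

With `block_size=128`, the number of training examples is roughly the token count divided by 128.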



## 4. Training Configuration
Set up the training parameters and initialize the trainer:

In [5]:
# Configure training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-shakespeare",          # Directory to save model checkpoints
    overwrite_output_dir=True,               # Overwrite existing checkpoint directory
    num_train_epochs=1,                      # Number of training epochs
    per_device_train_batch_size=4,           # Batch size per GPU/CPU
    save_steps=500,                          # Save checkpoint every X steps
    save_total_limit=2,                      # Maximum number of checkpoints to keep
    logging_steps=100,                       # Log metrics every X steps
    learning_rate=3e-5,                      # Peak learning rate
    warmup_steps=50,                         # Short warmup: one epoch on this dataset is only a few hundred steps
    weight_decay=0.01,                       # Weight decay for regularization
    report_to="none",                        # Disable W&B and other loggers so training does not wait for an API key
)

# Create data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're not using masked language modeling (like BERT)
)

# Initialize trainer
trainer = Trainer(
    model=model,                         # The instantiated model to be trained
    args=training_args,                  # Training arguments
    data_collator=data_collator,        # Data collator for creating batches
    train_dataset=dataset,              # Training dataset
)
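
To see how `learning_rate`, `warmup_steps`, and the batch size interact, the sketch below reproduces the shape of a linear warmup followed by linear decay, which is the default schedule type in `Trainer`. The helper names and the example numbers are hypothetical, and the step count assumes a single device with no gradient accumulation.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch (single device, no gradient accumulation)."""
    return math.ceil(num_examples / batch_size)

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = steps_per_epoch(num_examples=4000, batch_size=4)   # hypothetical dataset size
for step in (0, 50, 100, 550, total):
    lr = linear_warmup_lr(step, base_lr=3e-5, warmup_steps=100, total_steps=total)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

If `warmup_steps` is close to the total number of steps, the model spends almost the whole run below the peak learning rate, so it is worth checking the two numbers against each other.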


## 5. Training Process
Now we can start the training:

In [None]:
# Start training
print("Starting training...")
trainer.train()
print("Training completed!")

# The trainer will:
# 1. Create batches from the dataset
# 2. Move data to GPU if available
# 3. Perform forward and backward passes
# 4. Update model parameters
# 5. Log metrics and save checkpoints
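
The five steps listed above can be sketched as a hand-written PyTorch loop. This is only an illustration under simplifying assumptions: `toy_model` is a tiny stand-in (an embedding plus a linear layer), the data is random, and the real `Trainer` additionally handles learning-rate scheduling, checkpointing, and more.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data: 32 random "token" sequences of length 9 over a 20-token vocabulary
vocab_size, seq_len = 20, 9
toy_tokens = torch.randint(0, vocab_size, (32, seq_len))
toy_loader = DataLoader(TensorDataset(toy_tokens), batch_size=4, shuffle=True)

# Tiny stand-in model: predicts the next token from the current token only
toy_model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size)).to(device)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(10):
    running = 0.0
    for (batch,) in toy_loader:                     # 1. batches from the dataset
        batch = batch.to(device)                    # 2. move data to the device
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = toy_model(inputs)                  # 3. forward pass
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                             # 3. backward pass
        optimizer.step()                            # 4. update parameters
        running += loss.item()
    epoch_losses.append(running / len(toy_loader))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.3f}")  # 5. log metrics
```

The targets are the inputs shifted by one position, which is the same next-token objective GPT-2 is trained on.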




## 6. Inference
Test the trained model with text generation:

In [None]:
def generate_text(prompt, max_length=100):
    """
    Generate text from a prompt using the trained model.

    Args:
        prompt (str): The input text to start generation from
        max_length (int): Maximum total length in tokens, including the prompt

    Returns:
        str: Generated text (the prompt followed by its continuation)
    """
    # Encode the prompt to token IDs and an attention mask
    inputs = tokenizer(prompt, return_tensors='pt').to(device)

    # Generate text
    model.eval()  # Disable dropout for inference
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=True,          # Enable sampling; without it, temperature/top_k/top_p are ignored
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # Prevent any 2-gram from repeating
        temperature=0.7,         # Control randomness (higher = more random)
        top_k=50,                # Sample from top K most likely tokens
        top_p=0.95,              # Nucleus sampling parameter
        pad_token_id=tokenizer.eos_token_id,  # Avoids the missing-pad-token warning
    )

    # Decode the generated tokens back to text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the model
prompt = "To be or not to be"
generated_text = generate_text(prompt)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")
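
To show what `temperature`, `top_k`, and `top_p` do to the next-token distribution, here is a simplified sketch in plain Python over four made-up logits. The helper names are hypothetical, and this is not the `transformers` implementation, only the same idea: rescale the logits, keep the `top_k` most likely tokens, then keep the smallest set whose cumulative probability reaches `top_p`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k, top_p):
    """Return {token_index: renormalized probability} after top-k, then top-p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p / k_total          # probability mass among the top-k tokens
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {idx: p / kept_total for idx, p in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]         # made-up scores for a 4-token vocabulary
probs = softmax(toy_logits, temperature=0.7)
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.9)
choice = random.choices(list(filtered), weights=list(filtered.values()))[0]
print("kept tokens:", filtered)
print("sampled token index:", choice)
```

None of these settings has any effect under greedy decoding, since greedy decoding always picks the single most likely token.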

## 7. Model Evaluation and Monitoring

Let's add some basic monitoring of the model's performance:

In [None]:
# Calculate perplexity on a test sequence
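def calculate_perplexity(text):
    encodings = tokenizer(text, return_tensors='pt').to(device)
    model.eval()  # Disable dropout for evaluation
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss;
        # without them, outputs.loss is None
        outputs = model(**encodings, labels=encodings['input_ids'])
        loss = outputs.loss
    # Perplexity is the exponential of the mean per-token loss
    return torch.exp(loss).item()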

# Test perplexity
test_text = "Friends, Romans, countrymen, lend me your ears;"
perplexity = calculate_perplexity(test_text)
print(f"Perplexity on test text: {perplexity}")

# Lower perplexity means the model assigns higher probability to the text
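
As a worked illustration of the metric itself, perplexity is the exponential of the mean negative log-likelihood the model assigns to each target token. The sketch below uses made-up probabilities, and the helper `perplexity_from_probs` is hypothetical.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the target tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is unsure among 4 equally likely tokens at every position
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
# A model that assigns high probability to each correct token
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])
print(f"uniform: {uniform_ppl:.2f}, confident: {confident_ppl:.2f}")
```

A perplexity of k can be read as the model being about as uncertain as if it were choosing uniformly among k tokens at each step.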

## Important Notes and Best Practices:

1. **Memory Management**:
   - Monitor GPU memory usage in Colab
   - Use smaller batch sizes if running out of memory
   - Consider gradient checkpointing for larger models

2. **Training Time**:
   - Even small models can take significant time to train
   - Start with small datasets for testing
   - Increase dataset size gradually

3. **Model Size**:
   - GPT-2 small already has 124M parameters
   - Larger models need more GPU memory and training time
   - Consider using quantization for larger models

4. **Hyperparameter Tuning**:
   - Learning rate is crucial for stable training
   - Adjust batch size based on available memory
   - Monitor loss to detect training issues
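
To put the memory notes above in numbers, here is a rough back-of-the-envelope sketch. It assumes 32-bit floats and an Adam-style optimizer that keeps two extra values per parameter, and it ignores activations, which grow with batch size and sequence length. The helper `training_memory_gb` is hypothetical, and the result is a lower bound rather than a measurement.

```python
def training_memory_gb(num_params, bytes_per_param=4, optimizer_states=2):
    """Approximate memory for weights, gradients, and optimizer states (no activations)."""
    copies = 1 + 1 + optimizer_states      # weights + gradients + Adam moment estimates
    return num_params * bytes_per_param * copies / 1024**3

gpt2_small_params = 124e6
weights_gb = gpt2_small_params * 4 / 1024**3
train_gb = training_memory_gb(gpt2_small_params)
print(f"weights only: {weights_gb:.2f} GB")
print(f"weights + gradients + optimizer states: {train_gb:.2f} GB")
```

The gap between the two figures is why a model that loads comfortably for inference can still run out of memory during training.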