<div align="center">
  <img src="logo_branding.png" width="250" alt="kavi.ai Logo">
  <h1>Language Modeling Foundations</h1>
  <p><b>A Premium Training Module by kavi.ai</b></p>
</div>

---

### 💎 **Smarter Overview**
Understand the mechanics of autoregressive language modeling, from subword tokenization (GPT-2 uses byte-level BPE) to the next-token-prediction training objective, by fine-tuning a pre-trained GPT-2 model on a literature corpus.

### 🚀 **Enterprise Use Case**
Building proprietary domain-specific tokenizers for non-English languages or specialized scientific notation where standard tokenizers fail.

### 📈 **Strategic Advantages**
- **Ground-Up Mastery**: Deep understanding of the Transformer bottleneck and attention mechanisms.
- **Custom Tokenization**: Optimizing input density for specialized data formats.
- **Raw Training Control**: Learn to manage the core optimization loop before moving to high-level APIs.

---

## Step 1: Install Dependencies

### **Purpose:**
To build our AI, we need three core pillars:
- `transformers`: Hugging Face's library for state-of-the-art Natural Language Processing.
- `datasets`: A high-performance library for loading large-scale NLP datasets.
- `torch`: The PyTorch deep learning framework that handles all the heavy tensor mathematics.

### **Line-by-Line Breakdown:**
- `!pip install transformers`: Installs the library containing the GPT-2 model logic.
- `!pip install datasets`: Installs the library to download and manage the training data.
- `!pip install torch`: Installs the engine that runs the tensor math on your CPU or GPU.

In [None]:
# Install the Hugging Face Transformers library for model architecture
!pip install transformers
# Install the Datasets library to fetch training data from the hub
!pip install datasets
# Install PyTorch, the core deep learning framework
!pip install torch

## Step 2: Import Libraries

### **Purpose:**
Loading the specific tools needed for timing, randomness, and neural networking.

### **Line-by-Line Breakdown:**
- `import datetime`: Used to track how long training takes.
- `import random`: Used to pick random starting points for text generation.
- `import torch`: Imports the core tensor operations.
- `from torch.utils.data import ...`: Imports tools to batch and shuffle data.
- `from transformers import ...`: Imports the actual GPT-2 building blocks.

In [None]:
import datetime # To calculate and format training duration
import random   # To add variability in sampling and seeds
import torch    # The primary library for tensor computations
# Tools for organizing data into batches and handling shuffling
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
# Specific GPT-2 classes for Model, Tokenizer, and Configuration
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config

## Step 3: Initialize Tokenizer

<img src="tokenization_detail.png" width="60%" align="right" style="margin-left: 20px;">

### **Purpose:**
Building the dictionary that translates text into token IDs. GPT-2 works on subword pieces, so a single word may map to one ID or to several.

### **Why 'pre-trained'?**
We use `from_pretrained('gpt2')` to ensure our data is tokenized exactly the way OpenAI's original model expects. If we used a different tokenizer, the token IDs would no longer line up with the embeddings the model learned, and its pre-trained knowledge would be useless.

### **Line-by-Line Breakdown:**
- `GPT2Tokenizer.from_pretrained('gpt2')`: Loads the standard vocabulary used by GPT-2.
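- `bos_token`: A special marker that tells the model where a new writing sample begins. `<|startoftext|>` is not part of the original GPT-2 vocabulary, so it is added as a new token.
- `eos_token`: A marker for the end of a sample. `<|endoftext|>` already exists in the GPT-2 vocabulary.
- `pad_token`: A filler token used to bring every sequence in a batch to the same length. `<|pad|>` is also added as a new token.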
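
To make the role of the three special tokens concrete, here is a minimal plain-Python sketch of how a batch is wrapped, padded, and masked. The toy vocabulary, its IDs, and the helper `encode_batch` are hypothetical stand-ins for illustration; the real GPT-2 tokenizer splits text into byte-level BPE pieces and uses different IDs.

```python
# Toy vocabulary: the IDs are made up for illustration, not GPT-2's real IDs.
TOY_VOCAB = {"<|startoftext|>": 0, "<|endoftext|>": 1, "<|pad|>": 2,
             "let's": 3, "tokenize": 4, "this": 5, "text": 6, "also": 7}


def encode_batch(sentences, vocab=TOY_VOCAB):
    """Wrap each sentence in BOS/EOS, then pad every row to the longest one."""
    rows = []
    for sentence in sentences:
        ids = [vocab["<|startoftext|>"]]
        ids += [vocab[word] for word in sentence.lower().split()]
        ids.append(vocab["<|endoftext|>"])
        rows.append(ids)

    longest = max(len(row) for row in rows)
    input_ids, attention_mask = [], []
    for row in rows:
        n_pad = longest - len(row)
        input_ids.append(row + [vocab["<|pad|>"]] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(row) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = encode_batch(["Let's tokenize this text", "also this"])
print(ids)   # [[0, 3, 4, 5, 6, 1], [0, 7, 5, 1, 2, 2]]
print(mask)  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
```

The shorter sentence receives two padding IDs, and the attention mask records which positions are real so the model can skip the rest.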

<div style="clear: both;"></div>

In [None]:
# Load the pre-trained GPT-2 tokenizer and define special control tokens
tokenizer = GPT2Tokenizer.from_pretrained('gpt2',
                                          bos_token='<|startoftext|>', # Beginning of sentence
                                          eos_token='<|endoftext|>',   # End of sentence
                                          pad_token='<|pad|>')         # Padding for batches

# Test the tokenizer with sample sentences; the cell output shows the
# input_ids and attention_mask produced for each sentence
tokenizer(["Let's tokenize this text",
           "also this"], return_attention_mask=True)

## Step 4: Load and Preprocess Dataset

<img src="dataset_prep.png" width="50%" align="right" style="margin-left: 20px;">

### **Purpose:**
Preparing the "Literature" dataset to be fed into the model for style adaptation.

### **Methodology:**
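1.  **Download**: Fetching the `fineweb-literature-100k` corpus and keeping the first 1,000 records so the notebook stays fast.
2.  **Filter**: Removing texts with `max_length` (512) or more words, which would be too long for our memory budget.
3.  **Tokenize**: Breaking text into IDs, adding our `<|startoftext|>` and `<|endoftext|>` markers to every sample, and truncating or padding each sample to exactly `max_length` tokens.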
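
The filter and tokenize steps can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's pipeline: whitespace splitting stands in for the real tokenizer, `MAX_LENGTH` is shrunk to 8 so the effect is visible, and the helpers `keep_example` and `wrap_and_truncate` are hypothetical names.

```python
MAX_LENGTH = 8  # tiny stand-in for the notebook's max_length = 512


def keep_example(example, max_length=MAX_LENGTH):
    """Filter step: keep texts with fewer than max_length whitespace-separated words."""
    return len(example["text"].split()) < max_length


def wrap_and_truncate(text, max_length=MAX_LENGTH):
    """Tokenize step, with whitespace splitting standing in for the real tokenizer."""
    tokens = ["<|startoftext|>"] + text.split() + ["<|endoftext|>"]
    tokens = tokens[:max_length]                         # like truncation=True
    tokens += ["<|pad|>"] * (max_length - len(tokens))   # like padding="max_length"
    return tokens


records = [
    {"text": "a short literary snippet"},
    {"text": "this one has far too many words to pass the length filter"},
    {"text": "seven words is just under the limit"},
]

kept = [r for r in records if keep_example(r)]
processed = [wrap_and_truncate(r["text"]) for r in kept]
for row in processed:
    print(len(row), row)
```

Note what happens to the third record: it passes the word-count filter, but once the two markers are added it exceeds the limit and truncation cuts off the end marker. The same can happen with the real tokenizer, where the token count is usually higher than the word count.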

<div style="clear: both;"></div>

In [None]:
from datasets import load_dataset
# 1. Download the literature dataset and select exactly 1000 records for speed on CPU
dataset = load_dataset("BEE-spoke-data/fineweb-literature-100k")
dataset["train"] = dataset["train"].select(range(1000))

# 2. Set a reasonable sequence length (512 is plenty for literature snippets)
max_length = 512

# 3. Drop examples that are too long: keep only texts with fewer than max_length words
#    (a rough proxy; the token count is usually higher, so truncation below still applies)
dataset = dataset.filter(lambda example: len(example["text"].split()) < max_length)

# 4. Process the text: using batched=True is MUCH faster
def tokenize_function(examples):
    return tokenizer(
        ['<|startoftext|>' + text + '<|endoftext|>' for text in examples['text']],
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )

dataset = dataset.map(tokenize_function, batched=True)


## Step 5: Training-Test Split

### **Purpose:**
Separating data for learning vs. verification.

### **Line-by-Line Breakdown:**
- `train_test_split(test_size=0.1)`: Takes 10% of the data and hides it from the model during training.
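
As a rough picture of what the split does, the sketch below shuffles a list of indices and holds out the first 10%. `toy_train_test_split` is a hypothetical helper written for this illustration; the library's own method also accepts a `seed` argument if you want the same split on every run.

```python
import random


def toy_train_test_split(items, test_size=0.1, seed=42):
    """Shuffle a copy of items, then hold out the first test_size fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


split = toy_train_test_split(range(1000), test_size=0.1)
print(len(split["train"]), len(split["test"]))  # 900 100
```

Because the held-out examples never appear in the training portion, the loss measured on them shows how well the model generalizes rather than how well it memorizes.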

In [None]:
# Split the data into 90% for training and 10% for validation/testing
splitted_ds = dataset["train"].train_test_split(test_size=0.1)

## Step 6: Define Custom Dataset Wrapper

### **Purpose:**
Wrapping the tokenized dataset so that PyTorch's `DataLoader` can pull individual examples from it as tensors.

### **Line-by-Line Breakdown:**
- `class GPT2Dataset(Dataset)`: Inherits from PyTorch's base data class.
- `__init__`: Saves our processed dataset inside the class.
- `__getitem__`: Returns a single training example as a numeric `tensor`.

In [None]:
class GPT2Dataset(Dataset):
    """Expose a tokenized Hugging Face dataset through PyTorch's Dataset interface."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Look the row up once, then return (input_ids, attention_mask) as tensors
        example = self.dataset[idx]
        return (torch.tensor(example["input_ids"]),
                torch.tensor(example["attention_mask"]))

## Step 7: Configure DataLoaders

### **Purpose:**
Managing the batching process to efficiently use the GPU.

### **Line-by-Line Breakdown:**
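- `DataLoader(...)`: Creates an iterable that groups examples into batches.
- `RandomSampler`: Visits the training examples in a new random order every epoch.
- `SequentialSampler`: Visits the validation examples in a fixed order, so evaluation is repeatable.
- `batch_size = 16`: Grabs 16 examples at once for parallel processing.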
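
The interplay of sampler and batch size can be sketched without PyTorch. The generator `iter_batches` below is a hypothetical stand-in that mimics the behavior described above; it is not how `DataLoader` is implemented.

```python
import random


def iter_batches(dataset, batch_size, shuffle, seed=None):
    """Yield lists of examples, visiting indices in random or sequential order."""
    indices = list(range(len(dataset)))
    if shuffle:  # plays the role of RandomSampler
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


toy_dataset = [f"example-{i}" for i in range(10)]

train_batches = list(iter_batches(toy_dataset, batch_size=4, shuffle=True, seed=0))
val_batches = list(iter_batches(toy_dataset, batch_size=4, shuffle=False))

print([len(b) for b in train_batches])  # [4, 4, 2]
print(val_batches[0])  # ['example-0', 'example-1', 'example-2', 'example-3']
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size, which is also the default behavior of `DataLoader`.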

In [None]:
# Pack our splits into the custom GPT2Dataset wrapper
train_dataset = GPT2Dataset(splitted_ds["train"])
val_dataset = GPT2Dataset(splitted_ds["test"])

# Set the number of examples to process simultaneously
batch_size = 16

# Create the training loader with random shuffling enabled
train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset),
            batch_size = batch_size
        )

# Create the validation loader with sequential (fixed) ordering
validation_dataloader = DataLoader(
            val_dataset,
            sampler = SequentialSampler(val_dataset),
            batch_size = batch_size
        )

## Step 8: Load Pre-trained Model

### **Purpose:**
Loading the "Pre-trained Brain" from OpenAI.

### **Why `.from_pretrained`?**
By using this command, we download the roughly 124 million weights that the smallest GPT-2 checkpoint learned during pre-training. This lets us start from a model that already knows English, rather than one that knows nothing.

### **Line-by-Line Breakdown:**
- `GPT2LMHeadModel.from_pretrained`: Loads the pre-trained neural network weights.
- `model.resize_token_embeddings`: Grows the embedding matrix so it covers the tokens we added to the tokenizer (`<|startoftext|>` and `<|pad|>`).
- `model.to(device)`: Moves the model to the GPU when one is available and keeps it on the CPU otherwise.
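
Conceptually, resizing appends rows to the embedding matrix while leaving the pre-trained rows untouched. The sketch below shows that idea with nested lists; `resize_embeddings` is a hypothetical helper, and the small random initialization of the new rows is an assumption, since the actual scheme depends on the library version. For scale, GPT-2's published vocabulary has 50,257 entries, so two added tokens bring it to 50,259.

```python
import random


def resize_embeddings(matrix, new_vocab_size, seed=0):
    """Return a copy of matrix with freshly initialized rows appended."""
    rng = random.Random(seed)
    hidden = len(matrix[0])
    resized = [row[:] for row in matrix]  # existing rows are preserved
    while len(resized) < new_vocab_size:
        # Assumed initialization: small Gaussian noise for each new row.
        resized.append([rng.gauss(0.0, 0.02) for _ in range(hidden)])
    return resized


old_vocab_size, hidden_size = 5, 4
embeddings = [[float(i)] * hidden_size for i in range(old_vocab_size)]

# Two tokens are new in this notebook: <|startoftext|> and <|pad|>.
resized = resize_embeddings(embeddings, old_vocab_size + 2)
print(len(embeddings), "->", len(resized))  # 5 -> 7
```

The new rows carry no learned meaning yet; fine-tuning is what gives the added tokens useful embeddings.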

In [None]:
# 1. Load the GPT-2 architecture configuration
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# 2. LOAD PRE-TRAINED WEIGHTS: This makes it a Fine-Tuning project, not from scratch
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# 3. Important: Grow the embedding matrix to cover the tokens added to the tokenizer
#    (<|startoftext|> and <|pad|>)
model.resize_token_embeddings(len(tokenizer))

# 4. Auto-detect device (GPU or CPU) and move the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 5. Set reproducibility seeds for consistent training results
seed_val = 42
random.seed(seed_val)
torch.manual_seed(seed_val)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed_val)

## Step 9: Optimizer & Fine-tuning Strategy

### **Purpose:**
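Defining how carefully the model should update its knowledge. During fine-tuning, we usually use a **lower learning rate** than in pre-training so we don't accidentally "erase" the knowledge the model already has. The value used here, `5e-4`, is on the high side for fine-tuning; if the loss becomes unstable, a smaller value such as `5e-5` is a common alternative.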
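
To see how the learning rate controls the size of each update, here is a minimal sketch of the AdamW update rule for a single scalar parameter, run on the toy loss `theta**2`. The function `adamw_step` is a hypothetical teaching aid rather than the optimizer's actual code, and its default hyperparameters are assumed to mirror common AdamW defaults.

```python
import math


def adamw_step(theta, grad, state, lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter."""
    beta1, beta2 = betas
    state["t"] += 1
    theta -= lr * weight_decay * theta                  # decoupled weight decay
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])      # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


final = {}
for lr in (5e-4, 5e-5):
    state = {"t": 0, "m": 0.0, "v": 0.0}
    theta = 1.0
    for _ in range(100):
        grad = 2 * theta  # gradient of the toy loss theta**2
        theta = adamw_step(theta, grad, state, lr)
    final[lr] = theta
    print(f"lr={lr:g}: theta moved from 1.0 to {theta:.4f}")
```

Because the update is normalized by the gradient's running magnitude, each step moves the parameter by roughly `lr`, so a tenfold larger learning rate moves the weights about ten times as far in the same number of steps.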

In [None]:
# Set the speed of the weight updates (5e-4 = 0.0005)
learning_rate = 5e-4
# Set how many times the model sees the entire dataset
epochs = 1
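# How often (in batches) to print the current loss; with at most 900 training
# examples and a batch size of 16 there are fewer than 100 batches per epoch
output_loss_every_steps = 10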

# Initialize the AdamW optimizer using the modern torch.optim implementation
optimizer = torch.optim.AdamW(model.parameters(),
                              lr = learning_rate)

## Step 10: The Fine-tuning Loop

<img src="training_loop.png" width="50%" align="right" style="margin-left: 20px;">

### **Purpose:**
Executing the style adaptation. The pre-trained brain reads the literature samples and slightly adjusts its neural pathways to match this new style.
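
The loss that drives this adaptation is next-token cross-entropy: the prediction at position `t` is scored against the token at position `t + 1`. The sketch below computes it in plain Python for one short sequence. `next_token_loss` is a hypothetical helper; the label value `-100` is the index that PyTorch's cross-entropy loss skips by default, and using it on padded positions is a common way to keep padding out of the loss.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss


def next_token_loss(logits, labels):
    """Mean cross-entropy where position t predicts labels[t + 1]."""
    total, counted = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue
        scores = logits[t]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # -log softmax(scores)[target]
        counted += 1
    return total / counted


# Vocabulary of 3 tokens; token 2 plays the role of <|pad|>.
input_ids = [0, 1, 0, 2, 2]
attention_mask = [1, 1, 1, 0, 0]
labels = [tok if m == 1 else IGNORE_INDEX for tok, m in zip(input_ids, attention_mask)]

# Uniform logits: the model has no preference, so the loss is log(vocab size).
uniform_logits = [[0.0, 0.0, 0.0] for _ in input_ids]
print(round(next_token_loss(uniform_logits, labels), 4))  # 1.0986
```

Only the two real next-token targets contribute here. Training lowers this number by raising the score of the correct next token at each position.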

<div style="clear: both;"></div>

In [None]:
from tqdm.auto import tqdm

training_stats = []  # One dict of metrics per epoch

for epoch_i in range(epochs):
    print(f'Starting Epoch {epoch_i + 1} / {epochs}')
    epoch_start = datetime.datetime.now()

    # --- TRAINING PHASE ---
    total_train_loss = 0
    model.train() # Enable dropout and other training-time behavior

    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader), desc="Training"):
        # 1. Move the batch to the same device as the model (GPU or CPU)
        input_ids = batch[0].to(device)
        masks = batch[1].to(device)
        # Labels are the inputs themselves (the model shifts them internally);
        # -100 on padded positions keeps <|pad|> out of the loss
        labels = input_ids.masked_fill(masks == 0, -100)

        # 2. Reset gradients so they don't accumulate across batches
        optimizer.zero_grad()
        # 3. Forward Pass: Model processes input and calculates loss against labels
        outputs = model(input_ids, attention_mask=masks, labels=labels)
        loss = outputs.loss

        # 4. Track the running loss total
        batch_loss = loss.item()
        total_train_loss += batch_loss

        # 5. Backward Pass: Compute the gradient of the loss for every weight
        loss.backward()
        # 6. Weight Update: Optimization algorithm nudges weights to improve
        optimizer.step()

        if step % output_loss_every_steps == 0 and step != 0:
            print(f"Batch Loss: {batch_loss}")

    # Calculate the average loss for this epoch
    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f"Training Loss: {avg_train_loss}")

    # --- VALIDATION PHASE ---
    model.eval() # Disable dropout for accurate evaluation
    total_eval_loss = 0
    for batch in tqdm(validation_dataloader, total=len(validation_dataloader), desc="Validating"):
        input_ids = batch[0].to(device)
        masks = batch[1].to(device)
        labels = input_ids.masked_fill(masks == 0, -100)
        with torch.no_grad(): # Disable gradient tracking to save memory
            outputs = model(input_ids, attention_mask=masks, labels=labels)
        total_eval_loss += outputs.loss.item()

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print(f"Validation Loss: {avg_val_loss}")

    # Record the metrics and the wall-clock duration of this epoch
    training_stats.append({
        'epoch': epoch_i + 1,
        'train_loss': avg_train_loss,
        'val_loss': avg_val_loss,
        'duration': str(datetime.datetime.now() - epoch_start),
    })



## Step 11: Style-Aware Text Generation

### **Purpose:**
Testing if the model has successfully adopted the literary style. We give it a random starting point and let it generate text using its combined pre-trained knowledge and newly fine-tuned stylistic adjustments.
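
Sampling with `top_k` and `top_p` narrows the candidate tokens before one is drawn at random. The sketch below shows roughly what the two filters do on a made-up five-token distribution. The helpers `top_k_top_p_filter` and `sample` are hypothetical, and the real implementation works on logits over the full vocabulary rather than on a small dictionary of probabilities.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total_k  # probability mass after the top-k cut
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize to sum to 1


def sample(probs, rng):
    """Draw one token according to the filtered probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


next_token_probs = {"the": 0.50, "a": 0.30, "moon": 0.15, "zebra": 0.04, "qux": 0.01}
filtered = top_k_top_p_filter(next_token_probs, top_k=4, top_p=0.95)
print(sorted(filtered))  # ['a', 'moon', 'the']
print(sample(filtered, random.Random(0)))
```

Here `top_k=4` removes the least likely token, and `top_p=0.95` then drops `zebra` because the first three tokens already cover more than 95% of the remaining probability mass.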

In [None]:
model.eval() # Switch to evaluation mode
for sample_i in range(5):
    # 1. Ask the model to generate a sequence based on a random starting token
    sample_outputs = model.generate(
        bos_token_id=random.randint(1, 30000), # Random starting token ID
        pad_token_id=tokenizer.pad_token_id,   # Use our <|pad|> token for padding
        do_sample=True,      # Sample instead of always taking the most likely token
        top_k=50,            # Consider only the 50 most likely tokens
        top_p=0.95,          # Then keep the smallest set covering 95% of the probability
        max_length=200,      # Cap the output at 200 tokens
        num_return_sequences=1
    )
    # 2. Decode the numeric output IDs into human language strings
    for sample_output in sample_outputs:
        print("{}: {}".format(sample_i, tokenizer.decode(sample_output, skip_special_tokens=True)))
