# Tiny LLM Story Generator — Training Notebook

**Purpose:** This notebook trains a compact GPT-2 style language model to generate short children’s stories using the **TinyStories** dataset. It covers data loading, tokenization, model configuration, custom training, checkpointing, and sampling from saved checkpoints.

## What this notebook does
1. **Setup (Colab + Dependencies):** Mount Google Drive for persistent storage and import core libraries (`transformers`, `datasets`, `torch`, etc.).  
2. **Data:** Load `roneneldan/TinyStories` via Hugging Face Datasets and perform lightweight preprocessing/tokenization suitable for small-context language modeling.  
3. **Model:** Initialize a small GPT-2 configuration (tokenizer + `GPT2LMHeadModel`) tailored for fast prototyping on limited resources.  
4. **Training Loop:** Train with `AdamW`, gradient clipping, and mini-batches using `DataLoader`/`IterableDataset`; track loss and save periodic checkpoints.  
5. **Logging & Plots:** Record training history (e.g., loss) and visualize progression to validate convergence.  
6. **Checkpointing:** Persist tokenizer/model to Drive for later reuse and reproducibility.  
7. **Inference:** Load a chosen checkpoint and generate stories to qualitatively evaluate results.

## Why TinyStories?
TinyStories is a synthetic corpus of short, simple stories, generated with GPT-3.5 and GPT-4, designed for training and evaluating small language models. Because the stories are short and use a limited vocabulary, it enables rapid experiments while still demonstrating end-to-end LM training and text generation.

## Requirements
- Python 3.x, PyTorch, Transformers, Datasets, TQDM, Matplotlib  
- A GPU is recommended (e.g., a Colab T4 or A100)

## Reproducibility & Tips
- Fix random seeds for consistent runs.  
- Start with a small context length and batch size; scale up gradually.  
- Monitor loss curves; stop early if overfitting.  
- Keep checkpoints versioned (e.g., `tinygpt2_epochN`).
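
The first tip can be sketched as a small helper. `set_seed` is a hypothetical name, not something defined elsewhere in this notebook; it seeds Python's `random` module and, when PyTorch is importable, `torch` as well. Streaming order and some GPU kernels can still introduce nondeterminism, so treat this as a starting point rather than a guarantee.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators this notebook relies on."""
    random.seed(seed)
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        return
    torch.manual_seed(seed)           # seeds the PyTorch generators
    torch.cuda.manual_seed_all(seed)  # explicit for all GPUs; ignored when CUDA is absent


set_seed(123)
first = random.random()
set_seed(123)
second = random.random()
print(first == second)  # True: same seed, same draw
```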

> **Reference Dataset:** `roneneldan/TinyStories` (Hugging Face Datasets).  
> **Author:** Ashish (Data Science Mentor) — YYYY-MM-DD.


### 1. Google Drive Mount

Mounts Google Drive in Colab to access and save files directly from your Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

### 2. Library Installation and Data Loading

- Installs the **`datasets`** library if needed (the `pip install` line is commented out; uncomment it when the library is missing).  
- Suppresses warning messages for cleaner output.  
- Imports essential libraries for data handling, tokenization, visualization, and model building.  
- Loads the **TinyStories** dataset in streaming mode for training.  


In [None]:
# !pip install datasets

import warnings
warnings.filterwarnings("ignore")

import re
import torch
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import GPT2Tokenizer

dataset = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
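
With `streaming=True`, `load_dataset` returns an iterable that yields one example at a time instead of downloading the full split first. The sketch below mimics that behaviour with a plain generator so it runs offline; `fake_story_stream` and its two stories are stand-ins invented for illustration, not TinyStories data. Each real example is a dict with a `text` field, which is what the dataset class in the next section reads.

```python
from itertools import islice


def fake_story_stream():
    """Hypothetical stand-in for the streamed split: yields dicts with a 'text' key."""
    stories = ["Once upon a time there was a cat.", "Tom had a red ball."]
    while True:  # endless, to show that nothing is materialised up front
        for story in stories:
            yield {"text": story}


# Peek at the first three examples without consuming the rest of the stream.
preview = list(islice(fake_story_stream(), 3))
for example in preview:
    print(example["text"])
```

The same `islice(dataset, 3)` pattern should work on the real streamed `dataset`, since it is also a plain iterable.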

### 3. TinyStoriesStreamDataset Class

- Creates a **streaming PyTorch dataset** for TinyStories text.  
- Steps performed for each story:
  1. **Skip short samples:** Stories shorter than `min_length` are ignored.  
  2. **Clean text:**  
     - Removes extra spaces and unwanted characters.  
     - Replaces fancy quotes with standard quotes.  
  3. **Tokenize:** Converts text into token IDs using a GPT-2 tokenizer.  
  4. **Prepare training inputs:**  
     - `input_ids`: The token IDs, truncated or padded to `block_size`.  
     - `labels`: A copy of `input_ids` with padded positions set to `-100`, which the loss ignores. `GPT2LMHeadModel` shifts the labels by one position internally, so no manual shift is needed; shifting by hand as well would train the model to predict two tokens ahead.  
     - `attention_mask`: Marks which tokens are real (`1`) vs. padding (`0`).  



#### Example
**Input text:**  
`"  “The dog runs!” said Tom.  "`  

**After cleaning:**  
`"The dog runs!" said Tom.`  

**Tokenization output (illustrative IDs, shown before padding):**  
`[50256, 464, 3290, 1101, 0, 616, 640, 13]`  

**Prepared for training:**  

| input_ids                                 | labels                                    |
|-------------------------------------------|-------------------------------------------|
| [50256, 464, 3290, 1101, 0, 616, 640, 13] | [50256, 464, 3290, 1101, 0, 616, 640, 13] |

Inside the model, the prediction at position *i* is scored against `labels[i + 1]`, so the model learns to predict the **next token** at each position.  

In [None]:
from torch.utils.data import IterableDataset

class TinyStoriesStreamDataset(IterableDataset):
    def __init__(self, dataset_stream, tokenizer, block_size=512, min_length=30):
        self.dataset = dataset_stream
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.min_length = min_length

    def __iter__(self):
        for sample in self.dataset:
            text = sample["text"].strip()
            if len(text) < self.min_length:
                continue

            # Collapse whitespace, normalize curly quotes, drop other symbols
            text = re.sub(r'\s+', ' ', text)
            text = re.sub(r'[“”]', '"', text)
            text = re.sub(r"[‘’]", "'", text)
            text = re.sub(r'[^a-zA-Z0-9.,!?\'"\s]', '', text)

            tokenized = self.tokenizer(
                text,
                truncation=True,
                add_special_tokens=True,
                padding="max_length",
                max_length=self.block_size,
                return_tensors="pt"
            )

            input_ids = tokenized["input_ids"][0]
            attention_mask = tokenized["attention_mask"][0]

            # GPT2LMHeadModel shifts labels internally, so pass an unshifted copy.
            # Padded positions get -100 so they are excluded from the loss.
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100

            yield {
                "input_ids": input_ids,
                "labels": labels,
                "attention_mask": attention_mask
            }
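
The next-token objective pairs every position with the token that follows it, and targets that fall on padding are usually excluded from the loss. The pure-Python sketch below shows that pairing on made-up IDs; `next_token_pairs` is a hypothetical helper for illustration, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def next_token_pairs(token_ids, attention_mask):
    """Pair each position with the token that follows it.

    Targets that fall on padding (mask == 0) are replaced by IGNORE_INDEX
    so they would not contribute to the loss.
    """
    pairs = []
    for i in range(len(token_ids) - 1):
        target = token_ids[i + 1] if attention_mask[i + 1] == 1 else IGNORE_INDEX
        pairs.append((token_ids[i], target))
    return pairs


# Made-up IDs: four real tokens followed by two padding tokens (ID 0 here).
ids = [11, 22, 33, 44, 0, 0]
mask = [1, 1, 1, 1, 0, 0]
for context_token, target in next_token_pairs(ids, mask):
    print(context_token, "->", target)
```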

### 4. Load Tokenizer, DataLoader, Model, and Optimizer Setup

1. **Training size & batching**
   - Define total samples and `batch_size`; compute `max_batches_per_epoch` for progress tracking.

2. **Tokenizer**
   - Load GPT-2 tokenizer and set the **pad token** to EOS for consistent padding.

3. **Streaming dataset → DataLoader**
   - Wrap `TinyStoriesStreamDataset` with a `DataLoader` to yield mini-batches for training.

4. **Model configuration**
   - Build a **small GPT-2**:
     - `vocab_size = len(tokenizer)`
     - Context length: `n_positions = n_ctx = 128` (matches the dataset's `block_size`)
     - Model width: `n_embd = 32`
     - Depth/heads: `n_layer = 1`, `n_head = 1`
     - Use tokenizer’s `pad_token_id`

5. **Device placement**
   - Move model to **GPU** if available; enable **DataParallel** when multiple GPUs exist.

6. **Optimizer**
   - Initialize **AdamW** with learning rate `5e-5` for stable transformer training.

In [None]:
from transformers import GPT2Tokenizer
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import GPT2Config, GPT2LMHeadModel
from tqdm.auto import tqdm
import torch


total_samples = 2119719
batch_size = 52  # Number of samples the DataLoader below stacks into each batch
max_batches_per_epoch = total_samples // batch_size  # Recomputed in the training cell


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

stream_dataset = TinyStoriesStreamDataset(dataset, tokenizer, block_size=128)  # block_size must not exceed n_positions
train_loader = DataLoader(stream_dataset, batch_size=batch_size)  # Batches are built on the fly from the stream


config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=128, # Reduced n_positions
    n_ctx=128, # Reduced n_ctx
    n_embd=32, # Drastically reduced embedding dimension
    n_layer=1, # Minimal number of layers
    n_head=1, # Minimal number of attention heads
    pad_token_id=tokenizer.pad_token_id)


model = GPT2LMHeadModel(config)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)


if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = torch.nn.DataParallel(model)

optimizer = AdamW(model.parameters(), lr=5e-5)
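
To get a feel for how the configuration values translate into model size, the parameter count of a GPT-2 style model can be estimated by hand. The helper below is a sketch (`gpt2_param_count` is a hypothetical name); it assumes the standard GPT-2 layout with a 4x MLP expansion and an output head tied to the token embedding, and uses 50257 as the GPT-2 vocabulary size.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters of a GPT-2 style model with a tied output head."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # token + position tables
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                        # two LayerNorms per block
    final_norm = 2 * d
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


print(gpt2_param_count(50257, 128, 32, 1))     # configuration used in the code cell above
print(gpt2_param_count(50257, 1024, 768, 12))  # standard GPT-2 small, for comparison
```

Under these assumptions nearly all parameters of the tiny configuration sit in the token embedding table (50257 × 32), and `n_head` does not enter the count at all.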

### 5. Training Loop, Checkpointing, and Sampling

1. **Setup**
   - Define a checkpoint folder on Google Drive.
   - Set number of epochs and initialize a loss history list.
   - Switch model to training mode.

2. **Epoch training**
   - For each epoch:
     - Iterate over mini-batches up to `max_batches_per_epoch`.
     - Move tensors to the selected device (CPU/GPU).
     - Compute loss with labels for next-token prediction and divide it by `gradient_accumulation_steps`.
     - Backpropagate every mini-batch; every `gradient_accumulation_steps` mini-batches, clip gradients (max norm = 1.0) → optimizer step → zero gradients.
     - Accumulate the unscaled batch losses for reporting.

3. **Track progress**
   - Compute and log **average loss** per epoch.
   - Append the epoch’s average loss to `history`.

4. **Checkpointing**
   - Create an epoch-specific folder (e.g., `tinygpt2_epochN`).
   - Save both the **model** and **tokenizer** to Drive after every epoch.

5. **Qualitative check (sampling)**
   - Temporarily switch to eval mode.
   - Generate a short continuation from the prompt *“Once upon a time”*.
   - Print the generated text to inspect model quality, then return to train mode.

6. **Persist training history**
   - Save the list of epoch losses to `training_history.json` on Drive for later plotting or review.
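
The training cell below accumulates gradients over several mini-batches before each optimizer step. The toy example here, a one-parameter least-squares model in plain Python with made-up data, sketches why the loss is divided by the number of accumulation steps: with equally sized micro-batches, the accumulated gradient then matches the gradient of one large batch.

```python
def mean_squared_error_grad(w, batch):
    """Gradient of mean((w*x - y)^2) over the batch with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]  # made-up (x, y) pairs

# One large batch of four samples.
full_grad = mean_squared_error_grad(w, data)

# Two micro-batches of two samples each; each contribution is divided by the
# number of accumulation steps before it is added, as in the training loop.
accumulation_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for micro in micro_batches:
    accumulated += mean_squared_error_grad(w, micro) / accumulation_steps

print(full_grad, accumulated)
```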


In [None]:
from pathlib import Path
import json
import torch
from tqdm.auto import tqdm
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import DataLoader

# Define checkpoint directory
checkpoint_dir = Path("/content/drive/MyDrive/TinyLLM/model/")

epochs = 10
history = []

total_samples = 2119719
batch_size = 8  # Micro-batch size held in memory at once
gradient_accumulation_steps = 8  # Micro-batches accumulated per optimizer step
simulated_batch_size = batch_size * gradient_accumulation_steps  # Effective batch size
max_batches_per_epoch = total_samples // batch_size  # The loop below counts micro-batches

# Rebuild the loader so the new batch_size actually takes effect
train_loader = DataLoader(stream_dataset, batch_size=batch_size)

# DataParallel hides save_pretrained/generate; keep a handle on the wrapped model
base_model = model.module if isinstance(model, torch.nn.DataParallel) else model

model.train()
optimizer.zero_grad()

for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    epoch_loss = 0.0
    num_batches = 0  # Micro-batches actually processed this epoch

    for batch in tqdm(train_loader, total=max_batches_per_epoch):
        if num_batches >= max_batches_per_epoch:
            break

        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model(input_ids=input_ids, labels=labels, attention_mask=attention_mask)
        raw_loss = outputs.loss.mean()  # DataParallel returns one loss per GPU
        (raw_loss / gradient_accumulation_steps).backward()  # Scale so accumulated gradients average

        num_batches += 1
        epoch_loss += raw_loss.item()  # Report the unscaled loss

        if num_batches % gradient_accumulation_steps == 0:
            clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()

    # Step once more if the last accumulation window was incomplete
    if num_batches % gradient_accumulation_steps != 0:
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()

    avg_loss = epoch_loss / max(num_batches, 1)
    history.append(avg_loss)
    print(f"Average Loss: {avg_loss:.4f}")

    # Save model after every epoch
    epoch_checkpoint = checkpoint_dir / f"tinygpt2_epoch{epoch+1}"
    epoch_checkpoint.mkdir(parents=True, exist_ok=True)
    base_model.save_pretrained(epoch_checkpoint)
    tokenizer.save_pretrained(epoch_checkpoint)
    print(f"Model checkpoint saved at {epoch_checkpoint}")

    # Generate sample output
    model.eval()
    sample_input = tokenizer.encode("Once upon a time", return_tensors="pt").to(device)
    generated_ids = base_model.generate(
        sample_input,
        max_length=50,  # Prompt + continuation; must not exceed n_positions
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Sample Output:\n{generated_text}")
    model.train()

history_path = Path("/content/drive/MyDrive/TinyLLM/training_history.json")
with open(history_path, "w") as f:
    json.dump(history, f)
print(f"\nTraining history saved to {history_path}")
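
The saved loss history can be turned into perplexity, which is often easier to interpret: perplexity is the exponential of the average cross-entropy loss. The sketch below uses a temporary file and made-up loss values purely for illustration; in the notebook the file would be `training_history.json` on Drive.

```python
import json
import math
import tempfile
from pathlib import Path

# Made-up epoch losses for illustration only; not results from this notebook.
example_history = [4.0, 3.0, 2.5]

with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "training_history.json"
    history_file.write_text(json.dumps(example_history))

    # Reload the list exactly as it was saved with json.dump.
    losses = json.loads(history_file.read_text())

perplexities = [math.exp(loss) for loss in losses]
for epoch, (loss, ppl) in enumerate(zip(losses, perplexities), start=1):
    print(f"epoch {epoch}: loss={loss:.2f}  perplexity={ppl:.1f}")
```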

### 6. Resume Training from Checkpoint

1. **Load checkpoint**
   - Restore the model and tokenizer from `tinygpt2_epoch6`.

2. **Configure training**
   - Recreate optimizer, device placement (GPU if available), and batching parameters.

3. **Continue epochs**
   - Train from epoch 7 onward (up to the target `epochs`), repeating the standard loop:
     - Forward pass → loss
     - Zero grads → backward pass
     - Gradient clipping (max norm = 1.0)
     - Optimizer step

4. **Checkpoint each epoch**
   - Save model and tokenizer to `tinygpt2_epoch{N}` after every epoch.

5. **Quick qualitative check**
   - Switch to eval, generate a short continuation from “Once upon a time”, print sample, then return to train mode.


In [None]:
from pathlib import Path
import torch
from tqdm.auto import tqdm
from torch.nn.utils import clip_grad_norm_
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load model and tokenizer from checkpoint (epoch 6)
checkpoint_path = Path("/content/drive/MyDrive/TinyLLM/model/tinygpt2_epoch6")
model = GPT2LMHeadModel.from_pretrained(checkpoint_path)
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint_path)

total_samples = 2119719
batch_size = 52
max_batches_per_epoch = total_samples // batch_size

# Rebuild the loader so it matches this cell's batch_size
train_loader = DataLoader(stream_dataset, batch_size=batch_size)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Keep a handle on the unwrapped model: DataParallel has no save_pretrained/generate
base_model = model
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = torch.nn.DataParallel(model)

# Optimizer (created fresh; AdamW state from the earlier run is not restored)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training parameters
checkpoint_dir = Path("/content/drive/MyDrive/TinyLLM/model/")
epochs = 12  # Train through epoch 12
start_epoch = 6  # Epochs 1-6 are done; range() below resumes at epoch 7

model.train()

for epoch in range(start_epoch, epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    epoch_loss = 0.0
    num_batches = 0  # Batches actually processed this epoch

    for batch in tqdm(train_loader, total=max_batches_per_epoch):
        if num_batches >= max_batches_per_epoch:
            break

        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model(input_ids=input_ids, labels=labels, attention_mask=attention_mask)
        loss = outputs.loss.mean()  # DataParallel returns one loss per GPU

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        epoch_loss += loss.item()
        num_batches += 1

    avg_loss = epoch_loss / max(num_batches, 1)
    print(f"Average Loss: {avg_loss:.4f}")

    # Save model after each epoch
    epoch_checkpoint = checkpoint_dir / f"tinygpt2_epoch{epoch+1}"
    epoch_checkpoint.mkdir(parents=True, exist_ok=True)
    base_model.save_pretrained(epoch_checkpoint)
    tokenizer.save_pretrained(epoch_checkpoint)
    print(f"Model checkpoint saved at {epoch_checkpoint}")

    # Generate sample output
    model.eval()
    sample_input = tokenizer.encode("Once upon a time", return_tensors="pt").to(device)
    generated_ids = base_model.generate(
        sample_input,
        max_length=50,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"Sample Output:\n{generated_text}")
    model.train()
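
Instead of hard-coding `tinygpt2_epoch6`, the most recent checkpoint can be discovered from the folder names. The helper below is a sketch (`latest_checkpoint` is a hypothetical name) that assumes checkpoints follow the `tinygpt2_epochN` naming used in this notebook; the demo creates empty folders in a temporary directory rather than touching Drive.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^tinygpt2_epoch(\d+)$")


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None if there is none."""
    found = []
    for child in Path(checkpoint_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Compare by epoch number so that epoch10 sorts after epoch9.
    return max(found, key=lambda item: item[0]) if found else None


with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        (Path(tmp) / f"tinygpt2_epoch{n}").mkdir()
    epoch, path = latest_checkpoint(tmp)
    print(epoch, path.name)
```

The returned epoch number could then serve as `start_epoch` in the resume cell.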

### 7. Generate Text from a Saved GPT-2 Checkpoint

1. **Load model and tokenizer**
   - Load tokenizer and model from a custom-trained checkpoint (`epoch_5`).

2. **Define generation function**
   - Encodes input text with attention masks.
   - Uses `model.generate` to produce a continuation up to `max_len`.

3. **Run examples**
   - Generate short story snippets for several starting prompts (e.g., "Once there was little boy", "Once there was a cute little").

- **Related Work:** A Kaggle-hosted version of this project is available here: [TinyStoryLLM by Ashish Jangra](https://www.kaggle.com/models/ashishjangra27/tinystoryllm)

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_directory = "epoch_5"

tokenizer = GPT2Tokenizer.from_pretrained(model_directory)
model = GPT2LMHeadModel.from_pretrained(model_directory)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default


def generate(input_text, max_len):
    """Return the prompt plus its continuation, at most max_len tokens in total."""
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        padding=True,
        return_attention_mask=True
    )

    output = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_len,
        pad_token_id=tokenizer.pad_token_id  # Set explicitly to avoid the pad-token warning
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate("Once there was little boy",30))
print(generate("Once there was little girl",30))
print(generate("Once there was a cute",30))
print(generate("Once there was a cute little",30))
print(generate("Once there was a handsome",30))
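
With default generation settings (no sampling, a single beam), `model.generate` performs greedy decoding: at every step it appends the highest-scoring next token until `max_length` or the end-of-sequence token is reached. The toy loop below sketches that idea; `NEXT_TOKEN_SCORES` is an invented lookup table standing in for the model, so its output says nothing about the trained checkpoint.

```python
# Hypothetical scores for the next word given the previous one.
NEXT_TOKEN_SCORES = {
    "once": {"there": 2.0, "upon": 1.5},
    "there": {"was": 3.0, "is": 1.0},
    "was": {"a": 2.5, "the": 2.0},
    "a": {"boy": 1.2, "girl": 1.1, "<eos>": 0.3},
    "boy": {"<eos>": 4.0},
}


def greedy_decode(prompt_tokens, max_length, eos_token="<eos>"):
    """Repeatedly append the highest-scoring next token."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:  # max_length counts the prompt too
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not scores:
            break  # no known continuation for this token
        best = max(scores, key=scores.get)
        if best == eos_token:
            break
        tokens.append(best)
    return tokens


print(" ".join(greedy_decode(["once", "there"], max_length=10)))
```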

### 8. Inference with Pretrained TinyStories Model

1. **Load pretrained models**
   - `AutoModelForCausalLM`: Loads the `roneneldan/TinyStories-3M` causal language model.  
   - `AutoTokenizer`: Uses `EleutherAI/gpt-neo-125M` tokenizer for text processing.

2. **Prepare input**
   - Tokenize each prompt together with its attention mask inside the `generate` helper.

3. **Generate text**
   - Call `model.generate` with `max_length=max_len` (30 tokens in the calls below) to produce a story continuation.

4. **Decode output**
   - Convert token IDs back to readable text and print the generated story.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('roneneldan/TinyStories-3M')

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # This tokenizer has no pad token by default

prompt = "Once upon a time there was"


def generate(input_text, max_len):
    """Return the prompt plus its continuation, at most max_len tokens in total."""
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        padding=True,
        return_attention_mask=True
    )

    output = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_len,
        pad_token_id=tokenizer.pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)


print(generate(prompt, 30))

print(generate("Once there was little boy",30))
print(generate("Once there was little girl",30))
print(generate("Once there was a cute",30))
print(generate("Once there was a cute little",30))
print(generate("Once there was a handsome",30))

### Assignment: Code-Focused Inference

Your task is to load a pre-trained GPT-2 model and configure it to answer *only* questions related to Python coding.

1. **Load Model and Tokenizer:** Load a suitable pre-trained GPT-2 model and its corresponding tokenizer. You can use `transformers.AutoModelForCausalLM` and `transformers.AutoTokenizer`. A smaller model like `gpt2` or `gpt2-medium` might be sufficient.
2. **Implement a Filtering Mechanism:** Before generating a response, check if the input prompt is related to Python coding. You can use simple keyword matching (e.g., "Python", "code", "function", "class", "import") or a more sophisticated approach using a text classification model (optional).
3. **Generate Response:** If the prompt is deemed a Python coding question, generate a response using the loaded GPT-2 model.
4. **Handle Non-Coding Questions:** If the prompt is not related to Python coding, return a predefined message indicating that the model can only answer coding questions.
5. **Test:** Test your implementation with various prompts, including both Python coding questions and non-coding questions, to ensure the filtering mechanism works correctly.

### Assignment: Solution

#### 1. Load Model and Tokenizer

We'll load the `gpt2` model and its tokenizer using the `transformers` library.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token id for generation
tokenizer.pad_token = tokenizer.eos_token

print(f"Loaded model: {model_name}")

#### 2. Implement Filtering Mechanism

A simple keyword-based filtering mechanism is implemented to check if the input prompt is related to Python coding.

In [None]:
import re

def is_python_coding_question(prompt):
    """Checks if the prompt is likely a Python coding question using keywords."""
    python_keywords = ["python", "code", "function", "class", "import", "def", "lambda", "list", "dict", "tuple", "set", "string", "int", "float", "bool", "loop", "if", "else", "elif", "for", "while", "try", "except", "finally", "with", "open", "module", "package", "install", "pip", "environment", "variable", "syntax", "error", "debug"]
    # Match whole words (optionally plural) so "int" does not fire on "interesting"
    # or "for" on "information"; common words like "if" or "for" can still match.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in python_keywords) + r")(?:s|es)?\b"
    return re.search(pattern, prompt.lower()) is not None

#### 3. Generate Response and Handle Non-Coding Questions

This function combines the filtering and generation steps. It generates a response only if the prompt is related to Python coding.

In [None]:
def generate_coding_response(prompt, max_length=100):
    """
    Generates a response to a Python coding question using GPT-2.
    Returns a predefined message for non-coding questions.
    """
    if not is_python_coding_question(prompt):
        return "I can only answer questions related to Python coding."

    # Encode the input prompt
    inputs = tokenizer(
        prompt,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length - 20, # Leave some space for generation
        return_attention_mask=True
    )

    # Generate a response
    output = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decode and return the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Post-process: Find the prompt in the generated text and return the continuation
    # This helps to remove the echoed prompt at the beginning of the generation
    prompt_index = generated_text.lower().find(prompt.lower())
    if prompt_index != -1:
        generated_text = generated_text[prompt_index + len(prompt):].strip()

    # Add a prefix to indicate it's a generated coding response
    return "Coding Response: " + generated_text
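
An alternative to searching for the prompt string is to drop the prompt by length: for a decoder-only model such as GPT-2, the sequence returned by `generate` starts with the prompt's token IDs, so the continuation is everything after the first `len(prompt_ids)` positions. The sketch below shows the slicing on plain lists with made-up IDs; `strip_prompt` is a hypothetical helper, not part of the notebook.

```python
def strip_prompt(output_ids, prompt_ids):
    """Return only the newly generated IDs, assuming output_ids starts with prompt_ids."""
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt")
    return output_ids[len(prompt_ids):]


prompt_ids = [101, 102, 103]           # made-up IDs for the prompt
output_ids = [101, 102, 103, 7, 8, 9]  # prompt followed by three new tokens
print(strip_prompt(output_ids, prompt_ids))  # [7, 8, 9]
```

With the real tensors this would correspond to decoding `output[0][inputs['input_ids'].shape[1]:]`, which avoids mismatches when decoding changes whitespace or casing.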

#### 4. Test the Implementation

Let's test with some example prompts.

In [None]:
# Test cases
coding_question_1 = "How to define a function in Python?"
coding_question_2 = "Explain list comprehensions in Python."
non_coding_question_1 = "What is the weather today?"
non_coding_question_2 = "Tell me a story about a dragon."
coding_question_3 = "import pandas as pd" # Test with just an import statement
coding_question_4 = "Write a Python code snippet for a for loop."


print(f"Prompt: {coding_question_1}")
print(generate_coding_response(coding_question_1))
print("-" * 30)

print(f"Prompt: {coding_question_2}")
print(generate_coding_response(coding_question_2))
print("-" * 30)

print(f"Prompt: {non_coding_question_1}")
print(generate_coding_response(non_coding_question_1))
print("-" * 30)

print(f"Prompt: {non_coding_question_2}")
print(generate_coding_response(non_coding_question_2))
print("-" * 30)

print(f"Prompt: {coding_question_3}")
print(generate_coding_response(coding_question_3))
print("-" * 30)

print(f"Prompt: {coding_question_4}")
print(generate_coding_response(coding_question_4))
print("-" * 30)

### Observations and Insights

*   The keyword-based filtering is simple but effective for basic cases. It can be easily fooled by prompts containing keywords but not related to coding. A more robust approach would involve a dedicated text classification model trained on coding vs. non-coding text.
*   The `gpt2` model, while relatively small, can generate plausible continuations for simple Python coding questions.
*   The `model.generate` parameters like `max_length` and `no_repeat_ngram_size` are important for controlling the output quality and preventing repetitive text.
*   Post-processing the generated text to remove the echoed prompt is necessary for cleaner output.
*   For more complex coding questions, a larger, fine-tuned model or a model specifically trained on code (like CodeGPT or similar) would be required to provide accurate and helpful responses.
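
To make the effect of `no_repeat_ngram_size` concrete: with a value of 2, any token that would complete a bigram already present in the generated sequence is banned at the next step. The function below is a simplified pure-Python sketch of that rule (`banned_next_tokens` is a hypothetical name; the library's own implementation works on batched tensors of token IDs rather than word lists).

```python
def banned_next_tokens(tokens, ngram_size):
    """Tokens that would repeat an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    current_prefix = tuple(tokens[len(tokens) - prefix_len:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:prefix_len] == current_prefix:
            banned.add(ngram[-1])
    return banned


sequence = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(sequence, 2))  # {'cat'}: "the cat" has already appeared
```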