# Lesson 5 — Fine-tune a Small Pretrained Model (Hugging Face)
**Goal:** Use the `transformers` library to fine-tune a compact GPT-style model on your custom corpus.

**How this connects to previous lessons**
- Lessons 1–4 taught you how to build each component from scratch. Now you’ll stand on the shoulders of a pretrained model and *nudge* it toward your style of text.
- Fine-tuning adjusts millions of weights slightly so the model prefers your data (e.g., Minecraft diaries or sci-fi mission logs).

**Vocabulary check**
- **Pretrained model:** A network already trained on a giant dataset. We just adapt it.
- **Checkpoint:** A saved bundle of model weights and configuration.
- **Dataset / Data collator:** The dataset holds your tokenized examples; the data collator pads and stacks them into batches with the shapes the trainer expects.
- **Learning rate:** Controls how big each weight update is. Too high → instability, too low → very slow progress.
- **Overfitting:** When the model memorizes the training sentences instead of learning their style. Monitor validation loss or sample generations to catch this.
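
To make the learning-rate entry concrete, here is a minimal sketch that runs plain gradient descent on the toy function f(w) = w². The function, the starting point, and the three rates are illustrative assumptions and have nothing to do with the real model; the point is only to see slow progress, healthy progress, and instability side by side.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w                 # derivative of w**2
        w -= learning_rate * grad    # one weight update
    return w

# The best possible answer is w = 0.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4f}")
```

The smallest rate barely moves away from the starting value, the middle one gets close to zero, and the largest one overshoots on every step so that `w` grows instead of shrinking.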

**Plan of attack**
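1. Pick a tiny pretrained model (e.g., `sshleifer/tiny-gpt2`, or the slightly larger `distilgpt2` used in the code cells) so it trains quickly on CPU.
2. Load the tokenizer that ships with the model. A pretrained model's embeddings are tied to its own vocabulary, so the tokenizer you made earlier only fits a model trained with it (such as the Lesson 4 Transformer).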
3. Create a dataset from your `.txt` files in `data/`, chunk it into blocks, and feed it into the Trainer API.
4. Configure training arguments (epochs, batch size, learning rate) and launch training.
5. After fine-tuning, generate text and compare it to the base model’s output. Does it sound more like your stories?
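
Step 3 mentions chunking the corpus into blocks. The idea can be sketched in plain Python, assuming the text has already been turned into one long list of token IDs. The helper name `chunk_into_blocks` and the fake IDs are hypothetical, chosen only for illustration.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split one long list of token IDs into equal-sized blocks, dropping the leftover tail."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

# Stand-in for real tokenizer output: 10 fake token IDs.
fake_ids = list(range(10))
blocks = chunk_into_blocks(fake_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Dropping the tail keeps every block the same length, so no padding is needed. The price is losing a few tokens at the end of the corpus.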

> 💡 Tip: Keep runs short at first (e.g., 100–300 steps). Quick experiments teach you how hyperparameters affect results without waiting forever.
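
To judge whether a run lands in that 100–300 step range, you can estimate the number of weight updates before launching it. The sketch below assumes a single device and no gradient accumulation; the helper name and the example sizes are hypothetical.

```python
import math

def estimated_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps: batches per epoch (rounded up) times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

print(estimated_steps(num_examples=240, batch_size=1, epochs=1))   # 240
print(estimated_steps(num_examples=1000, batch_size=4, epochs=3))  # 750
```

If the estimate is too large, `TrainingArguments` also accepts a `max_steps` value that caps the run regardless of the epoch count.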

> ⚠️ This lesson requires internet access (to download the model) and a Python environment with `torch`, `transformers`, `datasets`, and `accelerate`. A CPU is enough for a tiny run; a GPU is faster.
> If you *don’t* have internet, skip to the fallback at the bottom (use the tiny transformer from Lesson 4).

In [None]:
# If needed, add these dependencies with uv (run in a terminal or a notebook shell cell).
# The Trainer API runs on PyTorch, so torch is included:
# !uv add torch transformers datasets accelerate


In [None]:

from pathlib import Path
data_dir = Path("data")
corpus_files = ["space.txt","animals.txt","minecraft.txt"]
text = ""
for f in corpus_files:
    text += (data_dir/f).read_text(encoding="utf-8") + "\n"

(data_dir / "all_corpus.txt").write_text(text, encoding="utf-8")  # write_text closes the file for us
print("Prepared all_corpus.txt with length:", len(text))


In [None]:

from datasets import Dataset
# A single-row dataset whose one "text" entry holds the whole corpus
corpus_text = Path("data/all_corpus.txt").read_text(encoding="utf-8")
ds = Dataset.from_dict({"text": [corpus_text]})
ds


In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
model_name = "distilgpt2"  # ~82M parameters; swap in "sshleifer/tiny-gpt2" for the fastest CPU runs
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_name)


In [None]:

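# Tokenize the corpus into consecutive blocks. Plain truncation would keep only
# the first 512 tokens and silently drop the rest of the text.
def tokenize(examples):
    out = tokenizer(examples["text"], truncation=True, max_length=512,
                    return_overflowing_tokens=True)
    out.pop("overflow_to_sample_mapping", None)  # bookkeeping column, not a model input
    return out
tok_ds = ds.map(tokenize, batched=True, remove_columns=["text"])

# GPT-2 tokenizers ship without a padding token, and the collator needs one to build batches
tokenizer.pad_token = tokenizer.eos_token

# Data collator (mlm=False means causal language modeling: labels are copied from the input IDs)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)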

args = TrainingArguments(
    output_dir="../models/distilgpt2_finetune",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_steps=5,
    save_steps=50,
    save_total_limit=1,
    report_to=[],
)

trainer = Trainer(model=model, args=args, train_dataset=tok_ds, data_collator=collator)
trainer.train()
trainer.save_model("../models/distilgpt2_finetune")
tokenizer.save_pretrained("../models/distilgpt2_finetune")


In [None]:

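from transformers import pipeline
prompt = "In the village library"
# Generate from the base model and the fine-tuned one to compare their styles
for name in (model_name, "../models/distilgpt2_finetune"):
    gen = pipeline("text-generation", model=name, tokenizer=tokenizer)
    print(name, "->", gen(prompt, max_length=50, num_return_sequences=1)[0]["generated_text"])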


## Fallback (no internet)
If you don't have internet, reuse the **Lesson 4** tiny Transformer:
1. Train it on your base corpus and save checkpoints using `torch.save(model.state_dict(), ...)`.
2. When you have new text, reload the weights, continue training for a few more epochs, and sample again.
3. Compare outputs before and after the extra training. Even a scratch-built model can be fine-tuned!
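
The three steps above can be sketched as follows. This is a minimal illustration, not the Lesson 4 code: `make_model` is a hypothetical stand-in for your tiny Transformer, the token sequences are made up, and the checkpoint is written to a temporary folder. Any `nn.Module` can be saved and reloaded the same way.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)

def make_model():
    # Hypothetical stand-in for the Lesson 4 Transformer (vocabulary of 16 tokens).
    return nn.Sequential(nn.Embedding(16, 8), nn.Linear(8, 16))

def loss_on(model, token_ids):
    """Next-token cross-entropy on one sequence, without updating weights."""
    with torch.no_grad():
        return nn.functional.cross_entropy(model(token_ids[:-1]), token_ids[1:]).item()

def train(model, token_ids, steps=100, lr=0.1):
    """Train the model to predict each next token in the sequence."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(token_ids[:-1]), token_ids[1:])
        opt.zero_grad()
        loss.backward()
        opt.step()

base_ids = torch.tensor([1, 2, 3, 4, 1, 2, 3, 4])  # made-up "base corpus"
new_ids = torch.tensor([5, 6, 7, 8, 5, 6, 7, 8])   # made-up "new text"

# 1. Train on the base corpus and save a checkpoint.
model = make_model()
train(model, base_ids)
path = os.path.join(tempfile.mkdtemp(), "base.pt")
torch.save(model.state_dict(), path)

# 2. Reload the weights into a fresh model and continue training on new text.
reloaded = make_model()
reloaded.load_state_dict(torch.load(path))
before = loss_on(reloaded, new_ids)
train(reloaded, new_ids)
after = loss_on(reloaded, new_ids)

# 3. Compare before and after the extra training.
print(f"loss on new text: before={before:.3f}, after={after:.3f}")
```

With a real model you would compare sampled text as well as the loss, since a lower loss on the new text is only a rough signal that the style has shifted.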

> 📦 Bonus: Package your model by saving both the tokenizer (merge list) and weights so friends can try it on their machines.