
# Lesson 5 — Fine-tune a Small Pretrained Model (Hugging Face)
**Goal:** Use `transformers` to fine-tune a compact GPT-2 model (`distilgpt2`) on your custom corpus.

**What you'll learn**
- Tokenizer + model loading
- Build a dataset from local `.txt` files
- Fine-tune for a few hundred steps
- Generate text from your fine-tuned model

> ⚠️ This lesson requires internet access (to download the model) and a Python environment with `transformers`, `datasets`, and `accelerate` installed. A CPU is enough for a tiny run; a GPU is faster.
> If you *don’t* have internet access, skip to the fallback at the bottom, which uses the tiny transformer from Lesson 4.
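
Before running the cells below, it can help to confirm that the three packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration and is not part of any library.

```python
import importlib.util

# Packages this lesson depends on.
REQUIRED = ("transformers", "datasets", "accelerate")

def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, run the `pip install` cell below before continuing.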

In [None]:

# If needed:
# !pip install transformers datasets accelerate


In [None]:

from pathlib import Path

data_dir = Path("../data")
corpus_files = ["space.txt", "animals.txt", "minecraft.txt"]

# Concatenate the source files, separated by newlines, into one training corpus.
text = "\n".join((data_dir / f).read_text(encoding="utf-8") for f in corpus_files) + "\n"

# write_text opens, writes, and closes the file in one call.
(data_dir / "all_corpus.txt").write_text(text, encoding="utf-8")
print("Prepared all_corpus.txt with length:", len(text))
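
The cell above raises `FileNotFoundError` if any of the listed files is absent. As a hedged alternative, the sketch below concatenates only the files that exist and reports the rest. The function name `build_corpus` is hypothetical, chosen for this illustration.

```python
from pathlib import Path

def build_corpus(data_dir, filenames):
    """Join the files that exist with newlines; return (text, missing_names)."""
    parts, missing = [], []
    for name in filenames:
        path = Path(data_dir) / name
        if path.is_file():
            parts.append(path.read_text(encoding="utf-8"))
        else:
            missing.append(name)
    return "\n".join(parts), missing

# Same directory and file names as the cell above.
text, missing = build_corpus("../data", ["space.txt", "animals.txt", "minecraft.txt"])
print("Corpus length:", len(text), "| missing files:", missing)
```

Whether skipping a missing file is acceptable depends on your corpus; for a demo it keeps the notebook running, but for a real run you would probably want to stop and fix the path.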


In [None]:

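from pathlib import Path
from datasets import Dataset

# One row holding the whole corpus as a single string.
corpus_text = Path("../data/all_corpus.txt").read_text(encoding="utf-8")
ds = Dataset.from_dict({"text": [corpus_text]})
ds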


In [None]:

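from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilgpt2"  # small distilled GPT-2, good for a demo
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# GPT-2 tokenizers define no padding token, which the data collator needs; reuse end-of-text.
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)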


In [None]:

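# Tokenize into 512-token rows. Truncation alone would keep only the first 512 tokens
# of the corpus; return_overflowing_tokens=True emits the remainder as extra rows.
def tokenize(examples):
    enc = tokenizer(examples["text"], truncation=True, max_length=512, return_overflowing_tokens=True)
    return {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}

tok_ds = ds.map(tokenize, batched=True, remove_columns=["text"])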

# Data collator
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

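args = TrainingArguments(
    output_dir="../models/distilgpt2_finetune",
    num_train_epochs=1,              # one pass over the tokenized rows
    per_device_train_batch_size=1,   # keeps memory low on CPU
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_steps=5,                 # print the training loss every 5 steps
    save_steps=50,                   # write a checkpoint every 50 steps
    save_total_limit=1,              # keep only the most recent checkpoint
    report_to="none",                # disable external loggers (W&B, TensorBoard, ...)
)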

trainer = Trainer(model=model, args=args, train_dataset=tok_ds, data_collator=collator)
trainer.train()
trainer.save_model("../models/distilgpt2_finetune")
tokenizer.save_pretrained("../models/distilgpt2_finetune")
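
How many optimizer steps this run takes depends on the corpus size. The sketch below is a rough estimate, assuming the corpus is split into rows of at most 512 tokens (the `max_length` used above), a single device, and no gradient accumulation. The function name `estimate_steps` and the 100,000-token corpus size are illustrative assumptions, not values from this lesson.

```python
import math

def estimate_steps(n_tokens, block_size=512, batch_size=1, epochs=1):
    """Return (rows, optimizer_steps) for one device without gradient accumulation."""
    n_rows = math.ceil(n_tokens / block_size)          # rows of at most block_size tokens
    steps_per_epoch = math.ceil(n_rows / batch_size)   # one step per batch
    return n_rows, steps_per_epoch * epochs

# Hypothetical corpus of 100,000 tokens with the settings used above.
print(estimate_steps(100_000))  # (196, 196)
```

Under these assumptions, reaching "a few hundred steps" takes a corpus on the order of 100,000 tokens per epoch; with a smaller corpus you could raise `num_train_epochs` instead.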


In [None]:

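from transformers import pipeline

# Load the fine-tuned weights from disk and reuse the tokenizer from the earlier cell.
gen = pipeline("text-generation", model="../models/distilgpt2_finetune", tokenizer=tokenizer)
# max_length counts the prompt tokens as well as the generated ones.
print(gen("In the village library", max_length=50, num_return_sequences=1)[0]["generated_text"])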
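
When sampling is enabled (for example by passing `do_sample=True` together with `temperature` and `top_k` to the generator), each next token is drawn at random from the model's predicted distribution. The toy sketch below illustrates that idea on a hand-made score table. It is not how `transformers` implements sampling internally, and both the function name `sample_next` and the scores are invented for illustration.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token from a {token: score} mapping using temperature and top-k."""
    # Rank candidates by score and optionally keep only the k best.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [score / temperature for _, score in items]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax weights
    tokens = [tok for tok, _ in items]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented scores for three candidate next tokens.
toy_logits = {" books": 2.0, " shelves": 1.0, " dragons": 0.1}
print(sample_next(toy_logits, temperature=0.7, top_k=2))
```

With `top_k=1` the draw always returns the highest-scoring token, which corresponds to greedy decoding; larger `top_k` and higher `temperature` give more varied output.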



## Fallback (no internet)
If you don't have internet access, use the tiny transformer from **Lesson 4** instead. Either load its saved weights (if you saved them) and continue training on the new text, or re-run Lesson 4 with updated `data/*.txt` files.
