# Fine-Tuning an LLM with Hugging Face Transformers

This notebook demonstrates how to fine-tune a pretrained language model (e.g., GPT-2) on a custom text dataset using the Hugging Face `transformers` Trainer API.

## 1. Install Dependencies

In [None]:
# !pip install --upgrade transformers datasets accelerate

## 2. Imports

In [None]:
import os
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments,
    DataCollatorForLanguageModeling
)


## 3. Load and Inspect Dataset
Replace `'wikitext'` with your dataset or use local files.

In [None]:
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
dataset

## 4. Preprocess and Tokenize
Tokenize the text and group into blocks for language modeling.

In [None]:
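model_checkpoint = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# GPT-2 ships without a padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # truncation=True caps each example at the tokenizer's maximum length (1024 tokens for GPT-2)
    return tokenizer(examples['text'], truncation=True)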

tokenized = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text']
)

# Group texts into chunks of fixed length
from itertools import chain

block_size = 128

def group_texts(examples):
    # Concatenate every tokenized example in the batch into one long list per key
    concatenated = {
        k: list(chain.from_iterable(examples[k])) for k in ["input_ids", "attention_mask"]
    }
    # Drop the tail that does not fill a whole block
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    # Split each key into consecutive blocks of block_size tokens
    return {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }

lm_datasets = tokenized.map(
    group_texts,
    batched=True,
    remove_columns=tokenized["train"].column_names,  # remove all old columns on that split
)
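
To make the chunking step concrete, here is a small standalone sketch that applies the same concatenate-then-split idea to toy token IDs with a block size of 4. The helper name `chunk_token_ids` and the sample values are illustrative assumptions, not part of the notebook's pipeline.

```python
from itertools import chain

def chunk_token_ids(sequences, block_size):
    """Concatenate token-id lists and split them into equal-sized blocks.

    Hypothetical helper for illustration; it mirrors the grouping idea above
    on plain Python lists and drops the tail that does not fill a block.
    """
    flat = list(chain.from_iterable(sequences))
    usable = (len(flat) // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]

# Three toy "tokenized documents" of different lengths (10 tokens in total)
toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks = chunk_token_ids(toy_sequences, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that blocks can span document boundaries, and the leftover tokens (9 and 10 here) are discarded. In `dataset.map` with `batched=True`, the same trade-off applies to each batch.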


## 5. Initialize Model and Data Collator

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
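
With `mlm=False`, the collator prepares batches for causal language modeling: it copies `input_ids` into `labels` and replaces padding positions with `-100`, the index the loss ignores. The pure-Python sketch below illustrates that idea on a single toy sequence. `make_causal_labels` and the non-padding token values are hypothetical, and the real collator works on padded tensors rather than lists.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def make_causal_labels(input_ids, pad_token_id):
    """Copy input ids to labels, masking padding positions.

    Illustrative stand-in for the collator's behaviour with mlm=False;
    this is not the library's implementation.
    """
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# 50256 is GPT-2's end-of-text id, which this notebook also uses as the pad token
toy_input_ids = [464, 3290, 318, 50256, 50256]
labels = make_causal_labels(toy_input_ids, pad_token_id=50256)
print(labels)  # [464, 3290, 318, -100, -100]
```

Because the pad token is set to the end-of-text token in this notebook, genuine end-of-text tokens are masked too. The shift by one position that pairs each token with its successor happens inside the model, not in the collator.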


## 6. Training Arguments

In [None]:

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    bf16=True,          # bf16 instead of fp16 (chosen for MPS); needs a device and PyTorch build with bf16 support, else set to False
    push_to_hub=False,
)
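
Because `save_steps` and `logging_steps` are expressed in optimizer steps, it can help to estimate how many steps a run will take. The sketch below assumes a single device and no gradient accumulation. The figure of 10,000 training blocks is made up for illustration, so substitute `len(lm_datasets['train'])`, and `estimate_steps` is a hypothetical helper.

```python
import math

def estimate_steps(num_examples, batch_size, num_epochs):
    """Rough count of optimizer steps for one training run.

    Assumes one device, no gradient accumulation, and that the last
    partial batch of each epoch still counts as a step.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset size; batch size and epoch count match the cell above
total_steps = estimate_steps(num_examples=10_000, batch_size=4, num_epochs=1)
print(total_steps)         # 2500
print(total_steps // 500)  # 5 checkpoint saves triggered with save_steps=500
```

With `save_total_limit=2`, only the two most recent of those checkpoints would remain on disk.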


## 7. Initialize Trainer and Train

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets['train'],
    tokenizer=tokenizer,  # recent transformers releases rename this argument to processing_class
    data_collator=data_collator
)

trainer.train()


## 8. Save and Evaluate
```python
model.save_pretrained('fine-tuned-model')
tokenizer.save_pretrained('fine-tuned-model')
```
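
One common way to evaluate a causal language model is perplexity, the exponential of the mean cross-entropy loss. The sketch below is an assumption about how evaluation could be done here: `perplexity_from_loss` is a hypothetical helper, and the loss value is a placeholder rather than a measured result. In this notebook, the loss could come from `trainer.evaluate(eval_dataset=lm_datasets['validation'])['eval_loss']`.

```python
import math

def perplexity_from_loss(mean_loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_loss)

# Placeholder loss for illustration only; not an actual result from this notebook
example_loss = 3.0
print(f"perplexity: {perplexity_from_loss(example_loss):.2f}")  # perplexity: 20.09
```

Lower perplexity on held-out text indicates that the model assigns higher probability to that text.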

## 9. Generate Sample Text
```python
from transformers import pipeline
generator = pipeline('text-generation', model='fine-tuned-model', tokenizer='fine-tuned-model')
# max_length counts the prompt tokens as well as the newly generated ones
print(generator('Once upon a time', max_length=50))
```

## 10. Next Steps
- Use your own dataset via `load_dataset('text', data_files=...)`.
- Experiment with learning rates and batch sizes.
- Try different model checkpoints (e.g., `gpt2-medium`).