# Fine-Tuning a Language Model

## What You'll Learn
In this notebook, we will demonstrate how to fine-tune a pre-trained language model on a specific dataset. By the end, you'll understand:
- How to prepare data for fine-tuning
- How to customize a pre-trained model for your specific use case
- How to evaluate and test your fine-tuned model

## Prerequisites
- Basic Python knowledge
- Understanding of what language models are
- Familiarity with Jupyter notebooks

## What is Fine-Tuning?
Fine-tuning is the process of taking a pre-trained model (like GPT-2) and training it further on your specific dataset. This helps the model learn to respond in the style and format you want.

We'll use the `distilgpt2` model from Hugging Face Transformers library and adapt it to answer campus-related questions.

In [None]:
# Install required packages

# Optional: Install TensorFlow Keras
# %pip install tf-keras

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

print("Setting up the model and tokenizer...")

# Check if GPU is available and set device
# GPU training is much faster than CPU, but CPU will work fine for this demo
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the tokenizer and model
# Tokenizer: Converts text to numbers that the model can understand
# Model: The actual neural network that generates text
print("Loading DistilGPT-2 model...")
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
model = GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)

# GPT-2 doesn't have a padding token by default, so we need to add one
# Padding tokens help us process multiple texts of different lengths together
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

print(f"✓ Model loaded successfully!")
print(f"✓ Padding token set to: {tokenizer.pad_token}")
print(f"✓ Model is on: {device}")

## Dataset Preparation

We'll load a campus FAQ dataset containing questions and answers. The model will learn to answer questions in the same style as the training data.

In [None]:
import pandas as pd
import json

print("Loading campus FAQ dataset...")

# Load the dataset and extract questions and answers
data_path = '../data/campus_faq.json'

try:
    with open(data_path) as f:
        data = json.load(f)
    print("✓ Dataset loaded successfully!")
except FileNotFoundError:
    print("❌ Error: Could not find the dataset file. Please check the path.")
    raise

# Extract questions and answers from the nested structure
questions = []
answers = []

for item in data['faq']:
    questions.append(item['question'])
    answers.append(item['answer'])

# Create DataFrame for easier data manipulation
df = pd.DataFrame({
    'question': questions,
    'answer': answers
})

print(f"Dataset contains {len(df)} question-answer pairs")
print("\nFirst 3 examples:")
for i in range(min(3, len(df))):
    print(f"\nQ: {df.iloc[i]['question']}")
    print(f"A: {df.iloc[i]['answer']}")

print(f"\nDataset shape: {df.shape}")

In [None]:
print("Splitting data into training and validation sets...")
from sklearn.model_selection import train_test_split

# Split data: 80% for training, 20% for validation
train_questions, val_questions, train_answers, val_answers = train_test_split(
    df['question'].tolist(),
    df['answer'].tolist(),# Add this cell right after the training cell to debug the generation issue

    test_size=0.2,
    random_state=42  # For reproducible results
)

print(f"Training examples: {len(train_questions)}")
print(f"Validation examples: {len(val_questions)}")

## Fine-Tuning the Model

We'll create a custom dataset class and set up the training process using Hugging Face's Trainer API.

In [None]:
from transformers import Trainer, TrainingArguments

class QADataset(torch.utils.data.Dataset):
    """
    A simple dataset class for question-answer pairs.
    
    This class converts our Q&A data into a format the model can learn from.
    """
    
    def __init__(self, questions, answers, tokenizer, max_length=512):
        self.questions = questions
        self.answers = answers
        self.tokenizer = tokenizer
        self.max_length = max_length
        
        print(f"Created dataset with {len(questions)} examples")

    def __len__(self):
        """Return the number of examples in our dataset"""
        return len(self.questions)

    def __getitem__(self, idx):
        # Get the question and answer for the given index
        question = self.questions[idx]
        answer   = self.answers[idx]
        
        # Combine question and answer into a single string, following the prompt format used for training
        full_text = f"Question: {question} Answer: {answer}{self.tokenizer.eos_token}"
        
        # Tokenize the combined text, pad/truncate to max_length, and return as tensors
        enc = self.tokenizer(
            full_text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        # Squeeze to remove the batch dimension
        input_ids      = enc['input_ids'].squeeze()
        attention_mask = enc['attention_mask'].squeeze()
        
        # Create labels for language modeling
        labels = input_ids.clone()
        
        # Mask out the question part so that loss is only computed on the answer
        question_prefix = self.tokenizer.encode(f"Question: {question} Answer:")
        labels[:len(question_prefix)] = -100  # -100 is ignored by PyTorch loss
        
        # Mask out padding tokens in the labels as well
        labels[labels == self.tokenizer.pad_token_id] = -100
        
        # Return a dictionary of tensors for the Trainer
        return {
            'input_ids':      input_ids,
            'attention_mask': attention_mask,
            'labels':         labels,
        }

# Create dataset objects
train_dataset = QADataset(train_questions, train_answers, tokenizer)
val_dataset = QADataset(val_questions, val_answers, tokenizer)

Created dataset with 24 examples
Created dataset with 6 examples


In [20]:
# Set up training configuration
print("Setting up training parameters...")

training_args = TrainingArguments(
    output_dir='./results',           # Where to save the model
    num_train_epochs=3,               # How many times to go through all data
    per_device_train_batch_size=2,    # How many examples to process at once
    per_device_eval_batch_size=2,     # Batch size for validation
    warmup_steps=100,                 # Gradual learning rate increase
    weight_decay=0.01,                # Regularization to prevent overfitting
    logging_dir='./logs',             # Where to save training logs
    logging_steps=5,                  # Log progress every 5 steps
    evaluation_strategy='epoch',      # Evaluate after each epoch
    save_strategy='epoch',            # Save model after each epoch
    load_best_model_at_end=True,      # Load the best model when done
    metric_for_best_model='eval_loss', # Use validation loss to pick best model
    report_to=None,                   # Don't report to wandb/tensorboard
)

print("Training configuration set up!")
print(f"Will train for {training_args.num_train_epochs} epochs")
print(f"Batch size: {training_args.per_device_train_batch_size}")

Setting up training parameters...
Training configuration set up!
Will train for 3 epochs
Batch size: 2




### Training the Model

Now we'll create a Trainer instance and start the training process. This might take a few minutes depending on your hardware.

In [None]:
print("Creating trainer...")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print("Starting training... This might take a few minutes.")
print("You'll see training progress below:")

# Start training
training_result = trainer.train()

print("Training completed! 🎉")
print(f"Final training loss: {training_result.training_loss:.4f}")

Creating trainer...
Starting training... This might take a few minutes.
You'll see training progress below:


  trainer = Trainer(


  0%|          | 0/36 [00:00<?, ?it/s]

{'loss': 5.6355, 'grad_norm': 105.2732925415039, 'learning_rate': 2.5e-06, 'epoch': 0.42}
{'loss': 4.993, 'grad_norm': 88.20294189453125, 'learning_rate': 5e-06, 'epoch': 0.83}


  0%|          | 0/3 [00:00<?, ?it/s]

{'eval_loss': 4.415099143981934, 'eval_runtime': 0.0986, 'eval_samples_per_second': 60.868, 'eval_steps_per_second': 30.434, 'epoch': 1.0}
{'loss': 4.4254, 'grad_norm': 61.644622802734375, 'learning_rate': 7.5e-06, 'epoch': 1.25}
{'loss': 3.4686, 'grad_norm': 24.273456573486328, 'learning_rate': 1e-05, 'epoch': 1.67}


  0%|          | 0/3 [00:00<?, ?it/s]

{'eval_loss': 3.2102718353271484, 'eval_runtime': 0.1067, 'eval_samples_per_second': 56.258, 'eval_steps_per_second': 28.129, 'epoch': 2.0}
{'loss': 2.9814, 'grad_norm': 26.527381896972656, 'learning_rate': 1.25e-05, 'epoch': 2.08}
{'loss': 2.7553, 'grad_norm': 26.958595275878906, 'learning_rate': 1.5e-05, 'epoch': 2.5}
{'loss': 2.7174, 'grad_norm': 26.189979553222656, 'learning_rate': 1.75e-05, 'epoch': 2.92}


  0%|          | 0/3 [00:00<?, ?it/s]

{'eval_loss': 2.97796893119812, 'eval_runtime': 0.0673, 'eval_samples_per_second': 89.201, 'eval_steps_per_second': 44.6, 'epoch': 3.0}


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


{'train_runtime': 8.2583, 'train_samples_per_second': 8.719, 'train_steps_per_second': 4.359, 'train_loss': 3.83468324608273, 'epoch': 3.0}
Training completed! 🎉
Final training loss: 3.8347


## Evaluation

Now let's test our fine-tuned model! We'll create a function to generate answers and test it with various questions.

In [None]:
def generate_answer(question, max_new_tokens=100, temperature=0.7):
    """
    Generate an answer for a given question using our fine-tuned model.
    
    Args:
        question: The question to answer
        max_new_tokens: Maximum length of the answer
        temperature: Controls randomness (0.1 = focused, 1.0 = creative)
    """
    # Format the question the same way we did during training
    prompt = f"Question: {question} Answer:"
    
    # Convert text to model input
    inputs = tokenizer.encode_plus(
        prompt,
        return_tensors='pt',
        truncation=True,
        return_attention_mask=True
    )
    
    # Move to same device as model
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    # Generate answer
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,  # Mask to avoid attending to padding tokens
            max_new_tokens=max_new_tokens,  # Maximum number of tokens to generate
            min_new_tokens=5,               # Don't allow EOS until at least 5 tokens
            do_sample=True,                 # If True, sample from the distribution (more creative); if False, use greedy decoding
            temperature=temperature,        # Controls randomness of sampling
            num_beams=3,                    # Number of beams for beam search (higher = more thorough search, but slower)
            pad_token_id=tokenizer.eos_token_id,  # Token ID used for padding (set to EOS for GPT-2)
        )
    # Extract only the generated part (remove the input prompt)
    generated_tokens = outputs[0][len(input_ids[0]):]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    
    # Clean up the answer
    answer = answer.strip()
    if answer.startswith("Answer:"):
        answer = answer[7:].strip()
    
    return answer

In [27]:
# Test with questions from our training data
print("Testing the fine-tuned model with training examples:")
print("=" * 60)

training_questions = [
    "What are the library hours?",
    "How can I access campus Wi-Fi?",
    "Where is the cafeteria located?"
]

for question in training_questions:
    print(f"❓ Question: {question}")
    answer = generate_answer(question)
    print(f"🤖 Answer: {answer}")
    print("-" * 40)

# Test with new questions (not in training data)
print("\nTesting with NEW questions (not in training data):")
print("=" * 60)

new_questions = [
    "What time does the gym open?",
    "How do I contact the IT help desk?",
    "Where can I study late at night?"
]

for question in new_questions:
    print(f"❓ Question: {question}")
    answer = generate_answer(question)
    print(f"🤖 Answer: {answer}")
    print("-" * 40)

Testing the fine-tuned model with training examples:
❓ Question: What are the library hours?
🤖 Answer: The library is open every Monday through Friday from 9 a.m. to 5 p.m.
----------------------------------------
❓ Question: How can I access campus Wi-Fi?
🤖 Answer: You can access campus Wi-Fi through the campus Wi-Fi portal.
----------------------------------------
❓ Question: Where is the cafeteria located?
🤖 Answer: The cafeteria is located in the cafeteria area of the cafeteria area of the cafeteria area of the cafeteria area
----------------------------------------

Testing with NEW questions (not in training data):
❓ Question: What time does the gym open?
🤖 Answer: 7:00 a.m. to 8:00 p.m. to 9:00 p
----------------------------------------
❓ Question: How do I contact the IT help desk?
🤖 Answer: You can contact the IT support desk at the IT support desk at the IT support desk at the IT
----------------------------------------
❓ Question: Where can I study late at night?
🤖 Answer: Y

In [None]:
print("🔍 Debugging model generation...")

# Test the base model generation first
def debug_generation(question):
    """Debug function to see what's happening during generation"""
    prompt = f"Question: {question} Answer:"
    print(f"Input prompt: '{prompt}'")
    
    # Tokenize
    inputs = tokenizer.encode_plus(
        prompt,
        return_tensors='pt',
        return_attention_mask=True
    )
    
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    print(f"Input token IDs shape: {input_ids.shape}")
    print(f"Input tokens: {input_ids[0].tolist()}")
    print(f"Decoded input: '{tokenizer.decode(input_ids[0])}'")
    
    # Generate with simpler parameters first
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=20,
            min_new_tokens=5,           # don’t allow EOS until at least 5 tokens
            do_sample=False,            # you can mix with or without sampling
            num_beams=3,
            pad_token_id=tokenizer.eos_token_id,
        )

    
    print(f"Output shape: {outputs.shape}")
    print(f"Full output tokens: {outputs[0].tolist()}")
    print(f"Full decoded output: '{tokenizer.decode(outputs[0])}'")
    
    # Extract generated part
    generated_tokens = outputs[0][len(input_ids[0]):]
    print(f"Generated tokens only: {generated_tokens.tolist()}")
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    print(f"Generated answer: '{answer}'")
    
    return answer

# Test with a simple question
test_answer = debug_generation("What are the library hours?")
print(f"\nFinal answer: '{test_answer}'")

🔍 Debugging model generation...
Input prompt: 'Question: What are the library hours? Answer:'
Input token IDs shape: torch.Size([1, 10])
Input tokens: [24361, 25, 1867, 389, 262, 5888, 2250, 30, 23998, 25]
Decoded input: 'Question: What are the library hours? Answer:'
Output shape: torch.Size([1, 30])
Full output tokens: [24361, 25, 1867, 389, 262, 5888, 2250, 30, 23998, 25, 383, 5888, 318, 1280, 790, 3321, 832, 3217, 422, 860, 257, 13, 76, 13, 284, 642, 279, 13, 76, 13]
Full decoded output: 'Question: What are the library hours? Answer: The library is open every Monday through Friday from 9 a.m. to 5 p.m.'
Generated tokens only: [383, 5888, 318, 1280, 790, 3321, 832, 3217, 422, 860, 257, 13, 76, 13, 284, 642, 279, 13, 76, 13]
Generated answer: ' The library is open every Monday through Friday from 9 a.m. to 5 p.m.'

Final answer: ' The library is open every Monday through Friday from 9 a.m. to 5 p.m.'


## Conclusion and Next Steps

Congratulations! 🎉 You've successfully fine-tuned a language model. Here's what you accomplished:

### What You Learned:
1. **Data Preparation**: How to format Q&A data for training
2. **Model Setup**: Loading and configuring a pre-trained model
3. **Fine-tuning**: Training the model on your specific dataset
4. **Evaluation**: Testing the model with both seen and unseen questions

### How to Improve Your Model:
1. **More Data**: Add more question-answer pairs to your dataset
2. **Better Prompts**: Experiment with different question formats
3. **Hyperparameter Tuning**: Adjust learning rate, batch size, epochs
4. **Longer Training**: Train for more epochs (but watch for overfitting)
5. **Temperature Tuning**: Adjust temperature in generation for different creativity levels

### Understanding the Results:
- **Good answers on training questions**: Shows the model learned the training data
- **Reasonable answers on new questions**: Shows the model can generalize
- **Repetitive or strange answers**: May need more training data or different hyperparameters

### Next Steps:
- Try fine-tuning on a different domain (customer service, technical docs, etc.)
- Experiment with larger models like GPT-3.5 or Llama
- Learn about parameter-efficient fine-tuning (LoRA, QLoRA)
- Deploy your model as a web API

### Common Issues and Solutions:
- **Model gives strange answers**: Try adjusting temperature or adding more training data
- **Repetitive responses**: Increase `repetition_penalty` parameter
- **Too slow**: Reduce batch size or use a smaller model
- **Out of memory**: Reduce `max_length` or batch size
- **Empty answers**: Check your data formatting and prompt structure

Happy fine-tuning! 🚀