# DATASCI 315, Group Work 11: LLM Few-Shot Learning and Fine-Tuning

In this group work, we will learn about using open-source LLMs and adapting them to new tasks through few-shot learning and fine-tuning.

**Important:** Select GPU as the runtime for this assignment.

The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. During lab, feel free to flag down your GSI to ask questions at any point.

## Setup and Imports

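The required packages (`transformers`, `datasets`, `sentencepiece`, `accelerate`, and `evaluate`) are provided by the environment, so no installation step is needed.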

In [None]:
# transformers, datasets, sentencepiece, accelerate, evaluate are provided by the environment

In [None]:
import os

os.environ["WANDB_DISABLED"] = "true"

In [None]:
import torch

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
import evaluate
import torch
from datasets import load_dataset
from transformers import (
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5TokenizerFast,
    Trainer,
)

## Part 1: Few-Shot Learning

[Few-shot learning](https://huggingface.co/docs/transformers/tasks/prompting#few-shot-prompting) enables a pre-trained model to perform new tasks by providing a few examples directly in the prompt. Unlike fine-tuning, where model weights are updated through training on a dataset, few-shot learning relies on the model's ability to generalize from examples provided at inference time.

In this example, we use [FLAN-T5](https://huggingface.co/google/flan-t5-base), an instruction-tuned version of the [T5 (Text-to-Text Transfer Transformer)](https://huggingface.co/docs/transformers/model_doc/t5) model. FLAN-T5 was fine-tuned on a large mixture of tasks described via instructions, making it particularly good at following prompts and at few-shot learning. We use the `base` size (about 250M parameters), which balances capability with inference speed.

In [None]:
# Load model and tokenizer
model_name = "google/flan-t5-base"
tokenizer = T5TokenizerFast.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

We set the model to evaluation mode since we are *not* training (updating the weights).

In [None]:
model.eval()

Few-shot learning works by providing examples of the desired task in the prompt, then asking the model to perform the task on a new input. The following prompt demonstrates this for English-to-French translation:

In [None]:
few_shot_prompt = """
Translate English to French:

Example 1:
Input: I love apples.
Output: J'aime les pommes.

Example 2:
Input: How are you?
Output: Comment ça va?

Example 3:
Input: The weather is good today.
Output: Le temps est agréable aujourd'hui.

Now translate the following sentence:

Input: {new_sentence}

Output:
"""

Now we insert the sentence we want to translate into the prompt. The model uses the examples to understand the task and generate a translation.

In [None]:
new_sentence = "I would like to learn about transformers."
formatted_prompt = few_shot_prompt.format(new_sentence=new_sentence)
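
The hand-written template can also be assembled from a list of example pairs, which makes it easier to swap in a different task. The sketch below is one possible way to do this; `build_few_shot_prompt` and its parameters are hypothetical names for illustration and are not part of `transformers`.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble a few-shot prompt from (input, output) example pairs.

    Hypothetical helper; it mirrors the layout of the hand-written prompt.
    """
    parts = [instruction, ""]
    for i, (source, target) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Input: {source}", f"Output: {target}", ""]
    # The final block leaves "Output:" open for the model to complete
    parts += ["Now do the same for the following input:", "", f"Input: {new_input}", "", "Output:"]
    return "\n".join(parts)


examples = [
    ("I love apples.", "J'aime les pommes."),
    ("How are you?", "Comment ça va?"),
    ("The weather is good today.", "Le temps est agréable aujourd'hui."),
]
prompt = build_few_shot_prompt(
    "Translate English to French:", examples, "I would like to learn about transformers."
)
print(prompt)
```

Keeping the examples in a list also makes it straightforward to test how the model behaves with fewer or more examples.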

To generate inputs in the correct format for the model, we need to tokenize the prompt:

In [None]:
# Tokenize the prompt
input_ids = tokenizer.encode(
    formatted_prompt, return_tensors="pt", max_length=512, truncation=True
).to(device)

Now, let's generate the translation:

In [None]:
# Generate output using beam search
generated_ids = model.generate(
    input_ids,
    max_length=64,
    num_beams=5,
    early_stopping=True,
    eos_token_id=tokenizer.eos_token_id,
    repetition_penalty=2.5,
)
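
To build intuition for what `num_beams` controls, the following toy sketch runs beam search over a hand-written table of next-token probabilities. The table, the token names, and the `beam_search` function are all made up for illustration. `model.generate` searches over the model's own predicted probabilities and applies further adjustments (such as the repetition penalty) that are not modeled here.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Keep the `num_beams` highest-scoring partial sequences at each step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished; carry forward
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1]


greedy = beam_search(num_beams=1)
wide = beam_search(num_beams=2)
print("num_beams=1:", " ".join(greedy))
print("num_beams=2:", " ".join(wide))
```

With one beam the search commits to the locally best first token ("the") and ends with a sequence of probability 0.24. With two beams it keeps "a" alive and finds the higher-probability sequence "a cat" (0.36).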

We decode the generated tokens to get the final translation. The `skip_special_tokens=True` argument removes special tokens from the output.

In [None]:
translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("\nTranslation:", translation)

---

**Problem 1: Design Your Own Few-Shot Learning Task**

Design and implement your own few-shot learning task. You can use the same model or try a different model from [HuggingFace](https://huggingface.co/models).

**Requirements:**
1. Choose a task different from translation (be creative!)
2. Create a few-shot prompt with at least 3 examples
3. Demonstrate that the model successfully performs your task on new inputs

**Grading criteria:**
- **Creativity**: How interesting and original is your chosen task?
- **Success demonstration**: Does the model correctly perform the task on new examples?

**In your solution, include:**
- A description of the task you chose and why it's interesting
- Your few-shot prompt
- Test examples showing the model's outputs

Feel free to experiment with the generation parameters (`max_length`, `num_beams`, `repetition_penalty`, etc.).

In [None]:
# BEGIN SOLUTION
# Students design their own few-shot learning task
# Example: sentiment analysis, text summarization, etc.

# Define few-shot prompt with examples
few_shot_prompt = """
Example few-shot prompt here.
"""

# Generate and evaluate results
# END SOLUTION

In [None]:
# Test assertions
import re

# Verify that a few-shot prompt was created
assert "few_shot_prompt" in dir(), "few_shot_prompt should be defined"
assert isinstance(few_shot_prompt, str), "few_shot_prompt should be a string"
assert len(few_shot_prompt) > 100, "Prompt should be substantial (>100 chars for 3+ examples)"

# Check for example structure (at least 3 examples required)
# Count patterns that indicate examples
example_patterns = [
    r"example\s*\d",  # "Example 1", "example 2", etc.
    r"#\s*\d",  # "# 1", "# 2", etc.
    r"\d\s*[.):]\s*\w",  # "1. ", "1) ", "1: " followed by text
]
prompt_lower = few_shot_prompt.lower()
example_count = sum(len(re.findall(p, prompt_lower)) for p in example_patterns)

# Also count input/output pairs as evidence of examples
input_count = len(re.findall(r"input\s*:", prompt_lower))
output_count = len(re.findall(r"output\s*:", prompt_lower))
pair_count = min(input_count, output_count)

# Use the max of explicit example markers or input/output pairs
detected_examples = max(example_count, pair_count)
assert detected_examples >= 3, (
    f"few_shot_prompt should contain at least 3 examples "
    f"(detected {detected_examples}). Use numbered examples or Input:/Output: pairs."
)

# Verify it's not just the translation example from the walkthrough
assert (
    "J'aime les pommes" not in few_shot_prompt
), "Create your own task - do not reuse the translation example from the walkthrough"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify prompt has structure indicating few-shot learning
assert len(few_shot_prompt) > 200, "Prompt should be substantial for 3+ examples"
# END HIDDEN TESTS

## Part 2: Fine-Tuning for Grammar Correction

Unlike few-shot learning, [fine-tuning](https://huggingface.co/docs/transformers/training) updates the model's weights by training on a task-specific dataset. This allows even smaller models to achieve strong performance on specific tasks. In this problem, we fine-tune a T5 model to correct grammar mistakes in sentences.

The training data consists of sentence pairs:
- **Input**: A sentence with grammatical errors, prefixed with `grammar: `
- **Output**: The corrected sentence

### Data

Download the data files from Canvas:
- [`grammar_train.json`: training set](https://umich.instructure.com/files/40626314/download?download_frd=1)
- [`grammar_val.json`: validation set](https://umich.instructure.com/files/40626315/download?download_frd=1)
- [`grammar_test.json`: test set](https://umich.instructure.com/files/40626313/download?download_frd=1)

You may add your own examples to the training and validation sets to improve the model.
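
If you add your own examples, each record needs the same two fields used throughout this notebook: `input` (with the `grammar: ` prefix) and `output`. The sketch below builds a few such records and serializes them as JSON Lines. That on-disk layout is an assumption, so inspect the downloaded files first and match whatever layout they use. The example sentences and the `make_record` helper are hypothetical.

```python
import json


def make_record(incorrect, corrected):
    """Build one training record with the task prefix on the input."""
    return {"input": "grammar: " + incorrect, "output": corrected}


extra_examples = [
    make_record("She have two cats.", "She has two cats."),
    make_record("We was late for class.", "We were late for class."),
]

# One JSON object per line (JSON Lines); assumed layout, verify against the provided files
json_lines = "\n".join(json.dumps(record) for record in extra_examples)
print(json_lines)
```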

In [None]:
dataset = load_dataset(
    "json",
    data_files={"train": "./data/grammar_train.json", "validation": "./data/grammar_val.json"},
)

Examine the first training example:

In [None]:
dataset["train"][0]

Load the test dataset:

In [None]:
test_dataset = load_dataset("json", data_files={"test": "./data/grammar_test.json"})["test"]

The following function tokenizes the data for the model:

In [None]:
def preprocess_function(examples):
    # With batched=True, each field holds a list of strings
    input_texts = examples["input"]  # already contain the "grammar: " prefix
    target_texts = examples["output"]
    model_inputs = tokenizer(input_texts, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(target_texts, max_length=128, truncation=True, padding="max_length")
    # Replace pad token IDs in labels with -100 so they are ignored in the loss.
    # labels["input_ids"] is a list of sequences, so mask token by token within each one.
    model_inputs["labels"] = [
        [token if token != tokenizer.pad_token_id else -100 for token in sequence]
        for sequence in labels["input_ids"]
    ]
    return model_inputs
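
To see what the label masking is meant to achieve, here is a toy version that operates on plain lists rather than tokenizer output. The token IDs are made up, and `PAD_ID = 0` is an assumption for the example. Because `map(..., batched=True)` passes a batch of examples, the labels are a list of sequences, so the replacement has to be applied per token within each sequence.

```python
PAD_ID = 0  # assumed pad token ID for this toy example
IGNORE_INDEX = -100  # value that PyTorch's cross-entropy loss ignores by default


def mask_padding(batch_label_ids, pad_id=PAD_ID):
    """Replace pad IDs with -100 in every sequence of a batch."""
    return [
        [token if token != pad_id else IGNORE_INDEX for token in sequence]
        for sequence in batch_label_ids
    ]


batch = [
    [37, 1712, 1, 0, 0],  # made-up IDs, padded to length 5
    [27, 43, 192, 1, 0],
]
masked = mask_padding(batch)
print(masked)
```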

In [None]:
tokenized_train = dataset["train"].map(preprocess_function, batched=True)
tokenized_val = dataset["validation"].map(preprocess_function, batched=True)

### Evaluating Model Performance

The following function computes evaluation metrics:
- **exact_match**: Fraction of predictions that exactly match the reference sentence
- **bleu**: The [BLEU score](https://en.wikipedia.org/wiki/BLEU), a common metric for evaluating text generation that measures n-gram overlap between generated and reference text
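
To make the two metrics concrete, the sketch below computes exact match and a clipped n-gram precision, which is the building block of BLEU, on whitespace-tokenized toy sentences. It is a simplification: the full BLEU score combines precisions for several n-gram orders and applies a brevity penalty, and the `evaluate` implementation uses its own tokenizer. The function names and sentences are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """Return 1 if the strings are identical after stripping whitespace, else 0."""
    return int(prediction.strip() == reference.strip())


def clipped_ngram_precision(prediction, reference, n):
    """Fraction of predicted n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the reference.
    """

    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))

    pred_counts, ref_counts = ngrams(prediction), ngrams(reference)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total


reference = "He goes to school every day."
prediction = "He go to school every day."
print("exact match:", exact_match(prediction, reference))
print("1-gram precision:", clipped_ngram_precision(prediction, reference, n=1))
print("2-gram precision:", clipped_ngram_precision(prediction, reference, n=2))
```

A single wrong word gives an exact match of 0 but still leaves most n-grams intact, which is why BLEU is the more forgiving of the two metrics.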

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # The trainer passes NumPy arrays; convert them so torch.where can be used
    predictions = torch.as_tensor(predictions)
    labels = torch.as_tensor(labels)

    # Replace -100 (ignored positions) with pad_token_id so the IDs can be decoded
    predictions = torch.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = torch.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode predictions and labels
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute BLEU score
    bleu_result = bleu_metric.compute(
        predictions=decoded_preds, references=[[ref] for ref in decoded_labels]
    )

    # Compute exact match score
    exact_matches = [
        int(pred.strip() == ref.strip())
        for pred, ref in zip(decoded_preds, decoded_labels, strict=True)
    ]
    exact_match_score = sum(exact_matches) / len(exact_matches)

    return {"bleu": bleu_result["bleu"], "exact_match": exact_match_score}

The following function generates predictions, similar to what we used for few-shot learning:

In [None]:
def generate_prediction(example):
    input_text = example["input"]

    input_ids = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding="max_length",
    ).to(device)

    generated_ids = model.generate(
        input_ids,
        max_length=128,
        num_beams=5,
        early_stopping=True,
        repetition_penalty=2.5,
    )

    output_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return {"prediction": output_text}

### Load Model and Tokenizer

For fine-tuning, we use [`t5-small`](https://huggingface.co/google-t5/t5-small) (60M parameters) instead of the larger FLAN-T5 model from Part 1. Since we're updating the model weights to specialize on grammar correction, we don't need the instruction-following capabilities of FLAN-T5. The smaller model trains faster and uses less GPU memory, making it practical for experimentation.

In [None]:
model_name = "t5-small"
tokenizer = T5TokenizerFast.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

First, evaluate the model's BLEU score on the test set *before* fine-tuning:

In [None]:
# Generate predictions on test set
predicted_dataset = test_dataset.map(generate_prediction)

# Extract predictions and references
predictions = predicted_dataset["prediction"]
references = test_dataset["output"]

# Load the BLEU metric
bleu_metric = evaluate.load("bleu")

In [None]:
bleu_result = bleu_metric.compute(predictions=predictions, references=[[ref] for ref in references])
print(f"BLEU score on the test set (before fine-tuning): {bleu_result['bleu']}")

A perfect BLEU score is 1.0. Our goal is to achieve a score of 0.9 or higher after fine-tuning.

---

**Problem 2: Fine-Tune the Model to Achieve BLEU >= 0.9**

Adjust the training parameters below to achieve a BLEU score of 0.9 or higher on the test set.

**Hint:** Consider adjusting `learning_rate`, `num_train_epochs`, `per_device_train_batch_size`, and `weight_decay`. You may also add more training examples to the dataset.
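
When adjusting these settings, it can help to estimate how many optimizer updates a configuration produces, since batch size and epoch count jointly determine it. The sketch below does this arithmetic for a hypothetical training set of 1,000 examples (the real dataset size may differ), assuming a single device and no gradient accumulation. The function name and the configurations shown are illustrative only.

```python
import math


def num_update_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


NUM_EXAMPLES = 1000  # hypothetical; use len(dataset["train"]) in the notebook

for batch_size, num_epochs in [(8, 1), (16, 5)]:
    steps = num_update_steps(NUM_EXAMPLES, batch_size, num_epochs)
    print(f"batch_size={batch_size}, epochs={num_epochs}: {steps} steps")
```

A larger batch size reduces the number of updates per epoch, so it usually needs to be paired with more epochs or a different learning rate to reach the same quality.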

#### Train the Model

In [None]:
model_path = "./grammar_corrector"

The `Seq2SeqTrainingArguments` and `Trainer` classes control model training. Adjust these parameters to achieve the required performance:

In [None]:
# Define training arguments - adjust these to achieve BLEU >= 0.9
training_args = Seq2SeqTrainingArguments(
    output_dir=model_path,
    report_to=[],  # Disable logging to WandB
    learning_rate=5e-5,  # SOLUTION: 3e-5
    per_device_train_batch_size=8,  # SOLUTION: 32
    per_device_eval_batch_size=8,  # SOLUTION: 32
    weight_decay=0.0,  # SOLUTION: 0.01
    save_total_limit=5,
    num_train_epochs=1,  # SOLUTION: 10
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),  # mixed precision requires a GPU
)

In [None]:
# Test assertions
# Verify training_args is configured
assert "training_args" in dir(), "training_args should be defined"
assert training_args.num_train_epochs >= 1, "Should train for at least 1 epoch"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify configuration is reasonable
# END HIDDEN TESTS

In [None]:
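# Seq2SeqTrainer honors predict_with_generate, so compute_metrics receives
# generated token IDs rather than raw logits
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)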

In [None]:
# Train the model
trainer.train()

In [None]:
# Save the fine-tuned model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

#### Evaluate Performance After Fine-Tuning

Test the fine-tuned model on some example sentences:

In [None]:
def correct_sentence(sentence):
    # Add task prefix
    input_text = "grammar: " + sentence

    input_ids = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding="max_length",
    ).to(device)

    generated_ids = model.generate(
        input_ids,
        max_length=128,
        num_beams=5,
        early_stopping=True,
        repetition_penalty=2.5,
    )

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

Optional: Load a previously saved fine-tuned model:

In [None]:
# Load your fine-tuned model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)
model.to(device);

In [None]:
test_sentences = [
    "I eated the purple berries.",
    "He go to school every day.",
    "Thank you for picking me as your designer. I'd appreciate it.",
    "The dog were meowing as a cat",
    "She don't like to play football.",
    "He is more smarter than his brother.",
    "I seen the movie yesterday.",
    "They was going to the store.",
    "I am not speak  English well",
    "The mentioned changes have done.",
    "I'd be more than happy to work with you in another project.",
]

In [None]:
for sentence in test_sentences:
    corrected = correct_sentence(sentence)
    print(f"Original: {sentence}")
    print(f"Corrected: {corrected}\n")

Evaluate on the test set. The goal is a BLEU score of 0.9 or higher:

In [None]:
predicted_dataset = test_dataset.map(generate_prediction)
predictions = predicted_dataset["prediction"]
references = test_dataset["output"]
bleu_metric = evaluate.load("bleu")

In [None]:
bleu_result = bleu_metric.compute(predictions=predictions, references=[[ref] for ref in references])
print(f"BLEU score on the test set (after fine-tuning): {bleu_result['bleu']}")

Examine some predictions from the test set:

In [None]:
sample_indices = torch.randperm(len(predicted_dataset))[:5].tolist()
for idx in sample_indices:
    print(f"Input: {predicted_dataset[idx]['input']}")
    print(f"Prediction: {predicted_dataset[idx]['prediction']}")
    if "output" in predicted_dataset[idx]:
        print(f"Reference: {predicted_dataset[idx]['output']}")
    print("-" * 50)
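
When a prediction differs from its reference, a word-level diff can make the discrepancy easier to spot than reading the two sentences side by side. The sketch below uses Python's standard `difflib` module on a made-up prediction/reference pair; the `word_diff` helper is hypothetical.

```python
import difflib


def word_diff(prediction, reference):
    """List the word-level edits needed to turn the prediction into the reference."""
    pred_words, ref_words = prediction.split(), reference.split()
    matcher = difflib.SequenceMatcher(a=pred_words, b=ref_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, " ".join(pred_words[i1:i2]), " ".join(ref_words[j1:j2])))
    return edits


print(word_diff("They was going to the store.", "They were going to the store."))
```

Each tuple holds the edit type, the words in the prediction, and the words in the reference that should replace them.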

---

**Problem 3: Design Your Own Fine-Tuning Task**

Design and implement your own fine-tuning task. Use the code from Problem 2 as a starting point.

**Requirements:**
1. Choose a task different from grammar correction and your few-shot task
2. Create training, validation, and test datasets in the same format as Problem 2
3. Demonstrate performance improvement after fine-tuning using an appropriate metric

**Grading criteria:**
- **Creativity**: How interesting is your chosen task? Bonus points for creative tasks!
- **Demonstrated improvement**: Does the model perform better after fine-tuning?

**Data generation options:**
- Generate examples by hand
- Generate programmatically (include your generation code)
- Use an LLM to help generate examples
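
As one illustration of the programmatic option, the sketch below generates input/output pairs for a hypothetical date-reformatting task and splits them into training, validation, and test sets. The task, the `date: ` prefix, the split proportions, and all function names are made up for the example; they are not requirements of the assignment.

```python
import random

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def make_example(rng):
    """Create one record in the same input/output format as Problem 2."""
    year, month, day = rng.randint(1990, 2030), rng.randint(1, 12), rng.randint(1, 28)
    return {
        "input": f"date: {year:04d}-{month:02d}-{day:02d}",
        "output": f"{MONTHS[month - 1]} {day}, {year}",
    }


def make_splits(num_examples, seed=0, val_frac=0.1, test_frac=0.1):
    """Generate unique examples and split them without overlap."""
    rng = random.Random(seed)
    unique = {}
    while len(unique) < num_examples:
        example = make_example(rng)
        unique[example["input"]] = example  # deduplicate on the input text
    examples = list(unique.values())
    n_val, n_test = int(num_examples * val_frac), int(num_examples * test_frac)
    return {
        "validation": examples[:n_val],
        "test": examples[n_val : n_val + n_test],
        "train": examples[n_val + n_test :],
    }


splits = make_splits(200)
print({name: len(rows) for name, rows in splits.items()})
print(splits["train"][0])
```

Deduplicating before splitting keeps test inputs out of the training set, so the before/after comparison measures generalization rather than memorization.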

**In your solution, include:**
- A description of your task
- How you generated the data (include code if applicable)
- Model performance before and after fine-tuning
- Analysis of results

In [None]:
# BEGIN SOLUTION
# Students design their own fine-tuning task
# Example implementation here
# END SOLUTION

In [None]:
# Test assertions
# Verify that fine-tuning task was implemented
# Note: This problem is open-ended, so we check for reasonable implementation
assert True, "Fine-tuning task should be implemented"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify implementation has required components
assert True, "Verify implementation"
# END HIDDEN TESTS