# Problem Set 3: Policy Text Classification w/ Open Source LLMs

**Names**: [Your names here]  
**Team**: [Team name]  

## Introduction

In this assignment, you will:
1. Load and explore a climate policy text dataset
2. Test zero-shot classification with prompt engineering
3. Evaluate few-shot learning with examples
4. Fine-tune using LoRA (Low-Rank Adaptation)
5. Analyze errors and reflect on model performance

**Important**: For the scope of this problem set, it is acceptable if prompt engineering and few-shot learning do not drastically improve performance; your reflection on *why* matters more than achieving high scores.

**Tip**: consider saving checkpoints of fine-tuned models (in task 4), as well as raw outputs into directories (for all tasks), to avoid having to rerun compute-expensive workflows repeatedly. This is generally good practice!

## Setup and Installation

In [None]:
# Install required libraries
#!pip install datasets transformers torch peft accelerate evaluate scikit-learn numpy==1.26.4 matplotlib seaborn



In [9]:
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)
from peft import LoraConfig, get_peft_model, TaskType
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    jaccard_score,
    classification_report,
    confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

# Check for an available device (Apple MPS first, then CUDA, then CPU)
if torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    
print(f"Using device: {device}")

Using device: cpu


## Configuration: Loading Your Dataset \& Model

Run the following code to load your dataset.

In [10]:
# Climate Policy Radar's National Climate Targets
DATASET_NAME = "ClimatePolicyRadar/national-climate-targets"
MODEL_NAME = "gpt2"
TARGET_MODULES = ['c_attn', 'c_proj']
IS_MULTILABEL = True
NUM_LABELS = 3
LABEL_NAMES = ['Net Zero', 'Reduction', 'Other']

print(f"Dataset: {DATASET_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Task type: {'Multi-label' if IS_MULTILABEL else 'Single-label'}")

Dataset: ClimatePolicyRadar/national-climate-targets
Model: gpt2
Task type: Multi-label


---

# Task 1: Data Loading and Exploration (10 points)

**Goal**: Load your chosen dataset, understand its structure, and visualise label distributions.

**TODO**:
1. Load the dataset from Hugging Face.
2. Understand dataset structure and sizes.
3. Analyse label distribution.
4. Show sample texts from each possible combination of labels.
5. Plot a histogram of the distribution of text lengths.
6. Create train/val/test splits

*Hint: you will need to apply a custom function that converts the annotation columns to a label list.*

The point of this data wrangling is to understand how the dataset may pose challenges to our modelling. There's no need to write anything, but these exercises should hopefully help you when considering how best to leverage small open source LLMs for NLP policy analysis (and perhaps why they face challenges \& limitations).


In [None]:
# TODO here
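
One way to approach the hint about annotation columns is sketched below. It is a minimal illustration, assuming each row carries one binary annotation column per label. The column names in `ANNOTATION_COLUMNS` are assumptions and should be checked against `dataset.column_names`; the helper names and the toy rows are made up for this example.

```python
from collections import Counter

LABEL_NAMES = ['Net Zero', 'Reduction', 'Other']
# Assumed column names (one binary column per label, in LABEL_NAMES order).
# Verify these against the real dataset before relying on them.
ANNOTATION_COLUMNS = ['annotation_NZT', 'annotation_Reduction', 'annotation_Other']


def row_to_labels(row):
    """Convert one row's annotation columns to a multi-hot list of floats."""
    return [float(row[col]) for col in ANNOTATION_COLUMNS]


def label_combination(labels):
    """Human-readable name for a multi-hot vector, e.g. 'Net Zero + Reduction'."""
    names = [name for name, flag in zip(LABEL_NAMES, labels) if flag == 1.0]
    return ' + '.join(names) if names else 'No target'


# Toy rows standing in for dataset records (invented for illustration).
toy_rows = [
    {'text': 'a', 'annotation_NZT': 1, 'annotation_Reduction': 1, 'annotation_Other': 0},
    {'text': 'b', 'annotation_NZT': 0, 'annotation_Reduction': 0, 'annotation_Other': 0},
    {'text': 'c', 'annotation_NZT': 0, 'annotation_Reduction': 1, 'annotation_Other': 0},
    {'text': 'd', 'annotation_NZT': 1, 'annotation_Reduction': 1, 'annotation_Other': 0},
]

# Count how often each label combination occurs.
combo_counts = Counter(label_combination(row_to_labels(r)) for r in toy_rows)
print(combo_counts)
```

With a Hugging Face dataset, the same function could be applied through `dataset.map(lambda row: {'labels': row_to_labels(row)})`, and the combination counts give a quick view of class imbalance before splitting.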

---

# Task 2: Zero-Shot Evaluation (15 points)

**Goal**: Test GPT-2 using only some basic prompt engineering strategies.

**TODO**:
1. Tokenise the text \& load model as text generator.
2. Create 3+ programmatic prompt templates (direct, instructional, definition-based, keyword checklist, discriminative vs generative, etc.).
3. Implement parsing function to extract predictions from generated text.
4. Evaluate each prompt on the test set (can sample 50-100 for speed).
5. Compare results and identify best prompt.
6. **Written reflection**: Did prompt engineering help? Why or why not? (150-300 words)

In [None]:
# TODO: Load tokenizer and create text generation pipeline

tokenizer = None  # Load tokenizer: using HF's AutoTokenizer.from_pretrained
generator = None  # Create pipeline: using HF's pipeline function

In [None]:
# Create at least 3 different prompts (see the TODO list above; four example slots are provided)
# If stuck, there's plenty of online documentation on prompt design for API calls by OpenAI, HuggingFace, etc.

PROMPTS = {
    # these are just examples, feel free to modify as you see fit
    'direct': "TODO: Your prompt here with {text} placeholder",
    'instructional': "TODO: Your prompt here with {text} placeholder",
    'definition': "TODO: Your prompt here with {text} placeholder",
    'structured': "TODO: Your prompt here with {text} placeholder",
    # others...
}

print(f"\nCreated {len(PROMPTS)} ZERO-SHOT prompt templates")
print("\nPrompt strategies tested:")
for i, name in enumerate(PROMPTS.keys(), 1):
    print(f"  {i:2d}. {name}")

In [None]:
# TODO: Implement parsing function
# Extract predicted label(s) from model's generated text
# depending on your prompt designs, you may need to adjust parsing logic

def parse_output(generated_text):
    """
    Parse model output to extract prediction.
    """
    # TODO: Implement parsing logic
    pass
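
As a starting point for the parsing step, here is a hedged sketch of a keyword-based parser that maps generated text to a multi-hot vector. The keyword lists and the function name `parse_output_sketch` are hypothetical and would need tuning to the wording your prompts actually elicit. It also strips the prompt, since the text-generation pipeline returns the prompt plus the completion by default.

```python
LABEL_NAMES = ['Net Zero', 'Reduction', 'Other']

# Hypothetical keyword lists; adjust to match your prompt designs.
LABEL_KEYWORDS = {
    'Net Zero': ['net zero', 'net-zero', 'carbon neutral'],
    'Reduction': ['reduction', 'reduce'],
    'Other': ['other'],
}


def parse_output_sketch(generated_text, prompt=''):
    """Return a multi-hot list over LABEL_NAMES from generated text."""
    # Drop the echoed prompt so keywords in the prompt are not counted.
    if prompt and generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    completion = generated_text.lower()
    return [
        int(any(keyword in completion for keyword in LABEL_KEYWORDS[name]))
        for name in LABEL_NAMES
    ]


example = parse_output_sketch(
    "Text: x\nLabel: Net zero and reduction targets",
    prompt="Text: x\nLabel:",
)
print(example)
```

Substring matching is crude: a word such as "another" would trigger the `Other` keyword, so a stricter design (for example, asking the model for a fixed answer format and matching whole words) may behave better. Passing `return_full_text=False` to the pipeline call is an alternative to stripping the prompt by hand.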

In [None]:
# TODO: Evaluate zero-shot with each prompt
# recommended to sample 50-100 examples from test set for speed
# For each prompt, get predictions and calculate metrics (accuracy, F1 score, etc.)

zero_shot_results = {}
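
Because zero-shot and few-shot results are only comparable on the same examples, it can help to fix the evaluation subset once. The sketch below draws a reproducible sample of indices; the function name and the toy test-set size of 500 are assumptions for illustration.

```python
import random

SEED = 42


def sample_eval_indices(n_total, n_sample=100, seed=SEED):
    """Pick a reproducible, sorted subset of test-set indices."""
    rng = random.Random(seed)  # local RNG, independent of global random state
    n_sample = min(n_sample, n_total)  # never ask for more than exists
    return sorted(rng.sample(range(n_total), n_sample))


# Toy size; in the notebook this would be len(test_dataset).
eval_indices = sample_eval_indices(n_total=500, n_sample=100)
print(len(eval_indices), eval_indices[:5])
```

With a Hugging Face dataset, `test_dataset.select(eval_indices)` would then return the same subset in every task.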


### Reflection: Zero-Shot Prompt Engineering

**TODO**: Answer the following questions:
- Did prompt engineering improve performance compared to the direct prompt?
- If yes, which prompt design choices helped most?
- If no, why might prompting struggle on this task? Consider: model size, task complexity, text length, context windows

[Write your reflection here] (150-300 words)

# Task 3: Few-Shot Evaluation (10 points)

**Goal**: Test if providing examples in the prompt improves performance.

**TODO**:
1. Select 2-5 training examples covering different labels
2. Create few-shot prompt with short examples (consider context window constraints for small models!)
3. Evaluate on test set
4. Compare with zero-shot
5. **Written reflection**: Did few-shot help? Why or why not? (150-300 words)

In [None]:
# TODO: Select few-shot examples
# Programmatically or manually select 2-5 examples from the training set representing different labels

few_shot_examples = []  # List of (text_snippet, label) tuples

In [None]:
# TODO: Create few-shot prompt template
# Include the examples first, then the query text. If you're stuck, there is plenty of
# online documentation on few-shot prompt design.

def create_few_shot_prompt(test_text):
    """Create prompt with few-shot examples."""
    # TODO: Build prompt with examples + test text
    pass
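
A minimal sketch of one possible few-shot layout follows, assuming examples are stored as `(text, label_names)` pairs. The helper names, the `Text:`/`Labels:` format, and the 40-word cap are illustrative choices, not a prescribed design. Truncation matters because GPT-2's context window is 1024 tokens, and word counts are only a rough proxy for token counts.

```python
def truncate_words(text, max_words):
    """Keep the first max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return ' '.join(words)
    return ' '.join(words[:max_words]) + ' ...'


def build_few_shot_prompt(test_text, examples, max_words=40):
    """Format (text, label_names) examples followed by the query text."""
    blocks = []
    for text, label_names in examples:
        labels = ', '.join(label_names) if label_names else 'None'
        blocks.append(f"Text: {truncate_words(text, max_words)}\nLabels: {labels}")
    # The query ends with an open 'Labels:' slot for the model to complete.
    blocks.append(f"Text: {truncate_words(test_text, max_words)}\nLabels:")
    return '\n\n'.join(blocks)


# Toy examples, invented for illustration.
toy_examples = [
    ("We will reach net zero emissions by 2050.", ['Net Zero']),
    ("The ministry will publish an annual report.", []),
]
prompt = build_few_shot_prompt("Emissions will be cut by 40% by 2030.", toy_examples)
print(prompt)
```

Ending the prompt on the same `Labels:` cue used in the examples makes it easier for a parser to find where the model's answer starts.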

In [None]:
# TODO: Evaluate few-shot performance
# Be sure to use the same sampled test examples as in the zero-shot evaluation

few_shot_predictions = []
few_shot_true_labels = []

# TODO: Generate predictions with few-shot prompt

# TODO: Calculate metrics

In [None]:
# TODO: Compare zero-shot vs few-shot

### Reflection: Few-Shot Learning

**TODO**: Answer the following questions:
- Did few-shot learning improve over zero-shot?
- If no (or if it hurt performance), what might explain this? Consider: context window limits, example selection, model capabilities
- What challenges arise when using few-shot learning with long texts?
- When might few-shot be more effective?

[Write your reflection here] (150-300 words)

# Task 4: LoRA Fine-Tuning (15 points)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that keeps the original pretrained model frozen and injects small trainable rank-decomposed matrices into its weight layers. Instead of updating all model parameters, it learns low-rank updates that approximate the weight changes needed for a new task. This drastically reduces memory and compute requirements while maintaining performance close to full fine-tuning.
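
To make the parameter saving concrete, the sketch below counts trainable parameters for a single GPT-2 (small) `c_attn` layer, which maps 768 input features to 2304 output features, and builds the low-rank update $\Delta W = \frac{\alpha}{r} B A$ on a tiny example. The helper names and the toy matrices are made up for illustration; plain Python lists are used so the arithmetic is easy to follow.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(A, B, alpha, r):
    """Low-rank weight update (alpha / r) * B @ A, with shape (d_out x d_in)."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


# GPT-2 (small) c_attn: 768 -> 2304 features (query, key, value stacked).
full_params = 768 * 2304
lora_params = lora_param_count(768, 2304, r=8)
print(f"full: {full_params:,}  LoRA: {lora_params:,}  ratio: {full_params // lora_params}x")

# Tiny example with d_in=3, d_out=2, r=1.
A = [[1.0, 2.0, 3.0]]
B_init = [[0.0], [0.0]]        # B starts at zero, so the update starts at zero
B_trained = [[0.5], [-1.0]]    # made-up values standing in for trained weights
delta_at_start = lora_delta(A, B_init, alpha=2, r=1)
delta_after = lora_delta(A, B_trained, alpha=2, r=1)
```

With rank 8, the adapter for this layer has 24,576 trainable weights against 1,769,472 in the frozen matrix, a 72-fold reduction. Because `B` is initialised to zero, the adapted model starts out identical to the pretrained one.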

Now we're getting a bit more hands-on with our model: we can't just plug our tokenizer and GPT model into an HF pipeline.

**TODO**:
1. Prepare tokenised datasets \& load model for classification
2. Apply LoRA configuration
3. Train for 3-5+ epochs (or as many as you deem necessary, balancing loss reduction against time and compute constraints)
4. Plot learning curves
5. Evaluate on test set


In [None]:
# TODO: Prepare datasets for fine-tuning
# Tokenize texts and format labels using the tokenizer from above

def tokenize_function(examples):
    """Tokenize texts and prepare labels."""
    # TODO: Tokenize with padding and truncation
    # TODO: Add labels in correct format (for multi-label, float multi-hot vectors)
    pass

# TODO: Tokenize all splits

In [None]:
# TODO: Load model for sequence classification

model = None  # Load AutoModelForSequenceClassification
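# Hint: you'll need problem_type="multi_label_classification" and the number of labels.
# GPT-2 has no padding token by default, so set one on the tokenizer (e.g. the EOS token)
# and set model.config.pad_token_id to match.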

In [None]:
# TODO: Apply LoRA configuration

lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=TARGET_MODULES,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

In [None]:
# TODO: Define training arguments

training_args = TrainingArguments(
    # TODO: Add arguments
)

In [None]:
# TODO: Define compute_metrics function

def compute_metrics(eval_pred):
    """Calculate metrics for evaluation."""
    
    # Hint: for multi-label, use sigmoid + threshold
    # Hint: for multi-label problems, consider using the following sklearn metrics:
    # - accuracy_score()
    # - f1_score() with average='macro' or 'weighted'
    # - hamming_loss()
    # - jaccard_score() with average='samples'
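
To show what these metrics measure, here is a plain-Python sketch of the sigmoid-plus-threshold step and three multi-label scores on toy logits. In the notebook the sklearn functions listed above would normally be used instead; the helper names, the 0.5 threshold, and scoring an empty-versus-empty sample as 0.0 in the Jaccard average are choices made for this illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_multi_hot(logits, threshold=0.5):
    """Apply sigmoid to each logit and threshold into 0/1 predictions."""
    return [[int(sigmoid(v) >= threshold) for v in row] for row in logits]


def subset_accuracy(y_true, y_pred):
    """Share of samples whose whole label vector is exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true, y_pred):
    """Share of individual label slots that are wrong."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def samples_jaccard(y_true, y_pred):
    """Mean per-sample intersection over union of the label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(a and b for a, b in zip(t, p))
        union = sum(a or b for a, b in zip(t, p))
        scores.append(inter / union if union else 0.0)  # assumption: empty vs empty scores 0
    return sum(scores) / len(scores)


# Toy logits and labels, invented for illustration.
toy_logits = [[2.0, -1.0, -3.0], [-0.5, 1.5, 0.2], [-2.0, -2.0, -2.0]]
toy_true = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
toy_pred = logits_to_multi_hot(toy_logits)
print(toy_pred, subset_accuracy(toy_true, toy_pred), hamming(toy_true, toy_pred))
```

Note how the third sample (no labels, none predicted) counts as correct for subset accuracy but contributes 0.0 to the Jaccard average here. Passages with no target therefore pull the Jaccard score down even when the model is right, which is worth keeping in mind when comparing metrics.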


In [None]:
# TODO: Create Trainer and train

trainer = Trainer(
)

In [None]:
# TODO: Plot learning curves
# Extract training history and plot loss/F1 over epochs
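
After training, the `Trainer` keeps its logs in `trainer.state.log_history`, a list of dictionaries in which training entries carry a `loss` key and evaluation entries carry `eval_`-prefixed keys. The sketch below separates the two on a toy history. The helper names and all numbers are made up, and the `eval_f1` key assumes that `compute_metrics` returns a metric named `f1`.

```python
def split_log_history(log_history):
    """Separate training-loss entries from evaluation entries."""
    train_points, eval_points = [], []
    for entry in log_history:
        if 'eval_loss' in entry:
            eval_points.append(entry)
        elif 'loss' in entry:
            train_points.append(entry)
        # Other entries (e.g. the final training summary) are skipped.
    return train_points, eval_points


def series(points, key):
    """(epoch, value) pairs for entries that contain key."""
    return [(p['epoch'], p[key]) for p in points if key in p]


# Toy history shaped like trainer.state.log_history (values are invented).
toy_history = [
    {'loss': 0.70, 'epoch': 0.5, 'step': 50},
    {'loss': 0.55, 'epoch': 1.0, 'step': 100},
    {'eval_loss': 0.52, 'eval_f1': 0.41, 'epoch': 1.0, 'step': 100},
    {'loss': 0.45, 'epoch': 2.0, 'step': 200},
    {'eval_loss': 0.47, 'eval_f1': 0.55, 'epoch': 2.0, 'step': 200},
    {'train_runtime': 12.3, 'train_loss': 0.56, 'epoch': 2.0, 'step': 200},
]

train_points, eval_points = split_log_history(toy_history)
train_loss = series(train_points, 'loss')
eval_f1 = series(eval_points, 'eval_f1')
print(train_loss, eval_f1)
```

Each series can then be unpacked for plotting, for example `epochs, values = zip(*train_loss)` followed by `plt.plot(epochs, values)`.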

In [None]:
# TODO: Evaluate on test set

# Task 5: Evaluation and Analysis (10 points)

**Goal**: Analyse model performance, identify error patterns, and reflect on the full pipeline.

**TODO**:
1. Examine/plot model performance (accuracy, F1 score, Jaccard score) of the zero-shot, few-shot, and LoRA fine-tuned models.
2. For the best-performing model, programmatically analyse error cases from the test set. Most of this is pre-implemented for you below:
    - identify misclassified examples (mostly pre-implemented)
    - programmatically define whether they are complete misses, partial errors, false positives, or false negatives (mostly pre-implemented)
    - display/print/plot a count \& percentage of total for each error type
3. Write a reflection on the full pipeline (150-300 words)

In [None]:
# TODO: compare between zero-shot, few-shot, and fine-tuned results

In [None]:
# TODO: Identify and analyse errors for best performing model

# Find misclassified examples
errors = []
for i, (true, pred) in enumerate(zip(true_labels, predictions)):
    if not np.array_equal(true, pred):
        true_labels_list = [LABEL_NAMES[j] for j, val in enumerate(true) if val == 1]
        pred_labels_list = [LABEL_NAMES[j] for j, val in enumerate(pred) if val == 1]

        errors.append({
            'index': i,
            'text': test_dataset[i]['text'],
            'true_labels': true_labels_list if true_labels_list else ['None'], # human readable
            'pred_labels': pred_labels_list if pred_labels_list else ['None'],
            'true_array': true, # machine readable
            'pred_array': pred
        })

print(f"\nTotal errors: {len(errors)} / {len(test_dataset)} ({len(errors)/len(test_dataset)*100:.1f}%)")

In [None]:
# TODO: Create error taxonomy for best performing model

# Categorize errors
error_categories = {
    'False Negative (missed label)': 0,
    'False Positive (extra label)': 0,
    'Complete miss': 0,
    'Partial (mixed FP/FN)': 0
}

for error in errors:
    true_set = set(error['true_labels'])
    pred_set = set(error['pred_labels'])
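    # TODO categorisation logic: replace each `...` with a condition on true_set / pred_set
    # and each `pass` with an increment of the matching error_categories entry
    if ...:
        pass
    elif ...:
        pass
    elif ...:
        pass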

# Create taxonomy table
taxonomy_df = pd.DataFrame({
    'Error Category': list(error_categories.keys()),
    'Count': list(error_categories.values()),
    'Percentage': [v / len(errors) * 100 if errors else 0.0 for v in error_categories.values()]  # guard against zero errors
})

print("\n" + taxonomy_df.to_string(index=False))

# TODO plot error distribution by category
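
One possible convention for the four categories is sketched below using set comparisons; other definitions are equally defensible, so treat the ordering of the checks as an assumption. The function name is hypothetical. It discards the `'None'` placeholder that the error-collection cell uses for empty label lists, so that an empty prediction is treated as an empty set.

```python
def categorise_error(true_labels, pred_labels):
    """Assign one misclassified example to an error category."""
    # 'None' is only a display placeholder for "no labels", not a real label.
    true_set = set(true_labels) - {'None'}
    pred_set = set(pred_labels) - {'None'}
    if true_set == pred_set:
        raise ValueError('not an error: label sets are identical')
    if true_set and pred_set and not (true_set & pred_set):
        return 'Complete miss'                   # both non-empty, nothing in common
    if pred_set < true_set:
        return 'False Negative (missed label)'   # only omissions
    if true_set < pred_set:
        return 'False Positive (extra label)'    # only additions
    return 'Partial (mixed FP/FN)'               # some right, some missed, some extra


# Toy cases, invented for illustration.
print(categorise_error(['Net Zero', 'Reduction'], ['Net Zero']))
print(categorise_error(['None'], ['Other']))
print(categorise_error(['Net Zero'], ['Reduction']))
print(categorise_error(['Net Zero', 'Reduction'], ['Net Zero', 'Other']))
```

Under this convention, predicting nothing for a labelled passage counts as a false negative rather than a complete miss; if you prefer the other reading, swap the order of the checks and say so in your reflection.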

### Overall Reflection

**TODO**: possible questions you could consider:

1. What made this classification task difficult?
2. Where did your model struggle most? Which classes/labels were hardest? Why?
3. How did performance change from zero-shot → few-shot → fine-tuned?
   Was the progression what you expected?
4. Why did (or didn't) prompt engineering and few-shot learning help?
5. What common mistakes did your model make? Can you identify patterns? What could you try next to improve performance?
6. What did you learn about using small LLMs for policy analysis?
   When are they sufficient vs. when do you need larger models or domain-specific training?

[Write your reflection here] (150-300 words)