# Problem Set 3: Policy Text Classification with Open-Source LLMs

**Names**: [Your names here]  
**Team**: [Team name]  
**Dataset Choice**: [Option A or Option B]

## Introduction

In this assignment, you will:
1. Load and explore a climate policy text dataset
2. Test zero-shot classification with prompt engineering
3. Evaluate few-shot learning with examples
4. Fine-tune using LoRA (Low-Rank Adaptation)
5. Analyze errors and reflect on model performance

**Important**: For the scope of this pset, it's acceptable if prompt engineering and few-shot learning don't drastically improve performance; your reflection on *why* matters more than achieving high scores.

**Tip**: Save checkpoints of your fine-tuned models (Task 4) and the raw model outputs (all tasks) to disk, so that you do not have to rerun compute-expensive workflows repeatedly. This is generally good practice!
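
One possible way to follow the tip for raw outputs is sketched below, using only the standard library. The helper names `save_outputs` and `load_outputs`, the `outputs` directory, and the record fields are illustrative assumptions, not part of the assignment.

```python
import json
from pathlib import Path

def save_outputs(records, run_name, out_dir="outputs"):
    """Write one JSON object per line so a run can be reloaded without regenerating."""
    path = Path(out_dir) / f"{run_name}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return path

def load_outputs(path):
    """Read the records back in the order they were written."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For fine-tuned models, `Trainer.save_model(output_dir)` writes the current model weights to a directory of your choice.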

## Setup and Installation

In [None]:
# Install required libraries
# !pip install datasets transformers torch peft accelerate evaluate scikit-learn

In [None]:
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)
from peft import LoraConfig, get_peft_model, TaskType
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    jaccard_score,
    classification_report,
    confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Configuration: Choose Your Dataset

Uncomment ONE of the two options below based on your choice.

In [None]:
# TODO: Choose your dataset by uncommenting ONE option

# # OPTION A: National Climate Targets
# DATASET_NAME = "ClimatePolicyRadar/national-climate-targets"
# MODEL_NAME = "gpt2"
# TARGET_MODULES = ['c_attn', 'c_proj']
# IS_MULTILABEL = True
# NUM_LABELS = 3
# LABEL_NAMES = ['Net Zero', 'Reduction', 'Other']

# # OPTION B: TCFD Corporate Disclosure
# DATASET_NAME = "climatebert/tcfd_recommendations"
# MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# TARGET_MODULES = ['q_proj', 'v_proj', 'k_proj', 'o_proj']
# IS_MULTILABEL = False
# NUM_LABELS = 5
# LABEL_NAMES = None  # Will be loaded from dataset

print(f"Dataset: {DATASET_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Task type: {'Multi-label' if IS_MULTILABEL else 'Single-label'}")

# Task 1: Data Loading and Exploration (10 points)

**Goal**: Load your chosen dataset, understand its structure, and visualize label distributions.

**TODO**:
1. Load the dataset from Hugging Face
2. Understand dataset structure and sizes
3. Analyse label distribution
4. Show 2-3 example texts with annotations
5. Create train/val/test splits

NB: for option A you will need to apply a custom function that converts the annotation columns to a label list.
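
A minimal sketch of such a conversion function follows. The column names in `ANNOTATION_COLUMNS` are an assumption; check `dataset.column_names` and adjust them to whatever the loaded dataset actually contains. Labels are emitted as floats because multi-label classification heads are typically trained with a binary cross-entropy loss that expects float targets.

```python
# Assumed column names, in the same order as LABEL_NAMES; verify against the dataset.
ANNOTATION_COLUMNS = ["annotation_NZT", "annotation_Reduction", "annotation_Other"]

def add_label_list(example, columns=ANNOTATION_COLUMNS):
    """Add a multi-hot 'labels' list with one float per annotation column."""
    example["labels"] = [float(bool(example[col])) for col in columns]
    return example

row = {
    "text": "We will reach net zero by 2050.",
    "annotation_NZT": 1,
    "annotation_Reduction": 0,
    "annotation_Other": 0,
}
print(add_label_list(row)["labels"])  # [1.0, 0.0, 0.0]
```

With a Hugging Face dataset, a function of this shape can be applied to every row via `dataset.map(add_label_list)`.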

In [None]:
# TODO here

# Task 2: Zero-Shot Evaluation (15 points)

**Goal**: Test your model without training using different prompt strategies.

**TODO**:
1. Load model as text generator
2. Create 3+ prompt templates (direct, instructional, definition-based, etc.)
3. Implement parsing function to extract predictions from generated text
4. Evaluate each prompt on test set (sample 50-100 for speed)
5. Compare results and identify best prompt
6. **Written reflection**: Did prompt engineering help? Why or why not?
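
Step 3 can be approached in many ways. The sketch below is one simple option: it looks for whole-word, case-insensitive mentions of each label name. The function names and the Option B label list are placeholders (load the real names from the dataset). Note that the text-generation pipeline returns the prompt together with the completion unless `return_full_text=False` is passed, so parse only the newly generated part.

```python
import re

def find_label_positions(generated_text, labels):
    """Map label index -> position of its first whole-word, case-insensitive match."""
    text = generated_text.lower()
    positions = {}
    for i, name in enumerate(labels):
        match = re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
        if match:
            positions[i] = match.start()
    return positions

def parse_multilabel(generated_text, labels):
    """Option A style: 0/1 flag per label, 1 if the label name is mentioned."""
    found = find_label_positions(generated_text, labels)
    return [int(i in found) for i in range(len(labels))]

def parse_single_label(generated_text, labels, default=-1):
    """Option B style: index of the label mentioned first, or `default` if none."""
    found = find_label_positions(generated_text, labels)
    return min(found, key=found.get) if found else default

option_a = ["Net Zero", "Reduction", "Other"]
print(parse_multilabel("Labels: net zero and reduction.", option_a))  # [1, 1, 0]

placeholder_b = ["governance", "strategy", "risk", "metrics", "none"]  # hypothetical names
print(parse_single_label("Answer: Strategy (not risk)", placeholder_b))  # 1
```

Returning a sentinel such as `-1` for unparseable outputs lets you count parsing failures separately from wrong predictions, which is useful material for the reflection.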

In [None]:
# TODO: Load tokenizer and create text generation pipeline

tokenizer = None  # Load tokenizer
generator = None  # Create pipeline

In [None]:
# TODO: Create at least 3 different prompts (four example slots are provided below)
# For inspiration, see the prompt-design documentation published by OpenAI, Hugging Face, etc.

PROMPTS = {
    # these are just examples, feel free to modify as you see fit
    'direct': "TODO: Your prompt here with {text} placeholder",
    'instructional': "TODO: Your prompt",
    'definition': "TODO: Your prompt",
    'structured': "TODO: Your prompt",
    # others...
}

In [None]:
# TODO: Implement parsing function
# Extract predicted label(s) from model's generated text
# depending on your prompt designs, you may need to adjust parsing logic

def parse_output(generated_text):
    """
    Parse model output to extract prediction.

    For Option A: Return list [0/1, 0/1, 0/1] for [Net Zero, Reduction, Other]
    For Option B: Return integer 0-4 for class index
    """
    # TODO: Implement parsing logic
    pass

In [None]:
# TODO: Evaluate zero-shot with each prompt
# Sample 100 test examples for speed
# For each prompt, get predictions and calculate metrics

zero_shot_results = {}

# TODO: Loop through prompts
# TODO: For each example, generate prediction
# TODO: Calculate accuracy and F1 score

In [None]:
# TODO: Compare prompt performance
# Create table or visualisation showing which prompt worked best

In [None]:
# TODO: Print / plot best prompt type and its performance

### Reflection: Zero-Shot Prompt Engineering

**TODO**: Answer the following questions:
- Did prompt engineering improve performance compared to the direct prompt?
- If yes, which prompt design choices helped most?
- If no, why might prompting struggle on this task? Consider: model size, task complexity, text length, context windows

[Write your reflection here]

# Task 3: Few-Shot Evaluation (10 points)

**Goal**: Test if providing examples in the prompt improves performance.

**TODO**:
1. Select 2-5 training examples covering different labels
2. Create few-shot prompt with short examples (consider context window constraints for small models!)
3. Evaluate on test set
4. Compare with zero-shot
5. **Written reflection**: Did few-shot help? Why or why not?
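
Steps 1 and 2 might be sketched as follows. The prompt wording, the helper names, and the 40-word cap are assumptions for illustration; word count is only a rough proxy for token count, so check the real length with your tokenizer.

```python
def truncate_words(text, max_words):
    """Keep at most `max_words` whitespace-separated words, marking any cut with ' ...'."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."

def build_few_shot_prompt(examples, query_text, max_words=40):
    """Build a prompt from (text, label_string) examples followed by the query text."""
    parts = ["Classify each text with one of the labels shown in the examples.", ""]
    for text, label in examples:
        parts += [f"Text: {truncate_words(text, max_words)}", f"Label: {label}", ""]
    parts += [f"Text: {truncate_words(query_text, max_words)}", "Label:"]
    return "\n".join(parts)

examples = [
    ("We commit to net zero emissions by 2050.", "Net Zero"),
    ("Emissions will be cut by 40% relative to 1990 levels.", "Reduction"),
]
print(build_few_shot_prompt(examples, "The country aims for carbon neutrality."))
```

Ending the prompt with `Label:` nudges the model to continue with a label name, which keeps the parsing step simple.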

In [None]:
# TODO: Select few-shot examples
# Programmatically or manually select 2-5 examples from the training set that represent different labels
# IMPORTANT: consider length constraints of model context window

few_shot_examples = []  # List of (text_snippet, label) tuples

In [None]:

# TODO: Create few-shot prompt template
# Include the examples first, then the query text
# As with zero-shot prompts, plenty of few-shot prompt-design documentation is available online

def create_few_shot_prompt(test_text):
    """Create prompt with few-shot examples."""
    # TODO: Build prompt with examples + test text
    pass

In [None]:
# TODO: Evaluate few-shot performance
# Use same 100 test examples as zero-shot

few_shot_predictions = []
few_shot_true_labels = []

# TODO: Generate predictions with few-shot prompt

# TODO: Calculate metrics

In [None]:
# TODO: Compare zero-shot vs few-shot
# Show side-by-side comparison of best zero-shot vs few-shot

### Reflection: Few-Shot Learning

**TODO**: Answer the following questions:
- Did few-shot learning improve over zero-shot?
- If no (or if it hurt performance), what might explain this? Consider: context window limits, example selection, model capabilities
- What challenges arise when using few-shot learning with long texts?
- When might few-shot be more effective?

[Write your reflection here]

# Task 4: LoRA Fine-Tuning (15 points)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that keeps the original pretrained model frozen and injects small trainable rank-decomposed matrices into its weight layers. Instead of updating all model parameters, it learns low-rank updates that approximate the weight changes needed for a new task. This drastically reduces memory and compute requirements while maintaining performance close to full fine-tuning.
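
To make the savings concrete: for a frozen weight matrix W of shape (d_out, d_in), LoRA learns B (d_out x r) and A (r x d_in), and the adapted layer computes W x + (lora_alpha / r) B A x. The sketch below counts parameters for a single layer. The dimensions are illustrative (roughly those of one attention projection in a small GPT-2-style model, biases ignored), and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (parameters in the frozen weight, trainable LoRA parameters) for one layer."""
    full = d_in * d_out        # frozen weight W: d_out x d_in
    lora = r * (d_in + d_out)  # A: r x d_in, plus B: d_out x r
    return full, lora

r, lora_alpha = 8, 32  # the values used in the LoraConfig cell of this notebook
full, lora = lora_param_counts(d_in=768, d_out=2304, r=r)
print(full, lora, f"{lora / full:.2%}")  # 1769472 24576 1.39%
print(lora_alpha / r)                    # 4.0, the scale applied to the low-rank update
```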

**TODO**:
1. Prepare tokenized datasets
2. Load model for classification
3. Apply LoRA configuration
4. Train for 3-5+ epochs (or as many as you deem necessary, balancing loss reduction against time and compute constraints)
5. Plot learning curves
6. Evaluate on test set


In [None]:
# TODO: Prepare datasets for fine-tuning
# Tokenise texts and format labels

def tokenize_function(examples):
    """Tokenize texts and prepare labels."""
    # TODO: Tokenize with padding and truncation
    # TODO: Add labels in correct format
    pass

# TODO: Tokenize all splits

In [None]:
# TODO: Load model for sequence classification

model = None  # Load AutoModelForSequenceClassification

In [None]:
# TODO: Apply LoRA configuration

lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=TARGET_MODULES,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

# TODO: Apply LoRA to model
# TODO: Print trainable parameters

In [None]:
# TODO: Define training arguments

training_args = TrainingArguments(
    # TODO: Add arguments
)

In [None]:
# TODO: Define compute_metrics function

def compute_metrics(eval_pred):
    """Calculate metrics for evaluation."""
    logits, labels = eval_pred

    # TODO: Get predictions from logits
    # For multi-label: Use sigmoid + threshold
    # For single-label: Use argmax

    # TODO: Calculate accuracy, F1, etc.

    return {
        'accuracy': 0.0,  # Replace with actual
        'f1': 0.0
    }
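
The two prediction rules mentioned in the comments above can be illustrated in plain Python. This is a sketch only: the 0.5 threshold is an assumption you may want to tune, and in the notebook you would normally use the vectorised NumPy equivalents on the `logits` array.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict_multilabel(logit_rows, threshold=0.5):
    """Multi-label: each label is decided independently by thresholding its sigmoid."""
    return [[int(sigmoid(z) >= threshold) for z in row] for row in logit_rows]

def predict_single_label(logit_rows):
    """Single-label: pick the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logit_rows]

logits = [[2.0, -1.0, 0.3], [-0.5, -2.0, -0.1]]
print(predict_multilabel(logits))    # [[1, 0, 1], [0, 0, 0]]
print(predict_single_label(logits))  # [0, 2]
```

Note that the multi-label rule can return an all-zero row, whereas argmax always commits to exactly one class.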

In [None]:
# TODO: Create Trainer and train

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# TODO: Train model
# train_result = trainer.train()

In [None]:
# TODO: Plot learning curves
# Extract training history and plot loss/F1 over epochs

In [None]:
# TODO: Evaluate on test set
# Report accuracy and F1 for both options
# Additionally report Hamming loss and Jaccard score for multi-label (Option A)
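
For intuition about the two multi-label metrics, here is a hand-rolled sketch on 0/1 indicator rows. The function names are hypothetical, and scoring an empty-versus-empty row as 0.0 is a convention chosen here. In the notebook you would normally call the imported `hamming_loss` and `jaccard_score` instead.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total

def jaccard_samples(y_true, y_pred):
    """Mean over examples of |true AND pred| / |true OR pred|."""
    scores = []
    for tr, pr in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(tr, pr) if t and p)
        union = sum(1 for t, p in zip(tr, pr) if t or p)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores)

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(round(hamming_loss_manual(y_true, y_pred), 3))  # 0.333
print(jaccard_samples(y_true, y_pred))                # 0.5
```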

# Task 5: Evaluation and Analysis (10 points)

**Goal**: Analyse model performance, identify error patterns, and reflect on the full pipeline.

**TODO**:
1. Create confusion matrix (Option B) or per-label breakdown (Option A)
2. Analyze 5-10 error cases from test set
3. Create error taxonomy table
4. Written reflection

In [None]:
# TODO: Get predictions on test set

predictions = None  # Get final predictions
true_labels = None  # Get true labels

In [None]:
# TODO: Create confusion matrix or per-label metrics

# Option B: Confusion matrix
# Option A: Per-label precision/recall/F1 table
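
A per-label breakdown for Option A could be computed along these lines. `per_label_prf` is a hypothetical helper written in plain Python for clarity; `classification_report` from scikit-learn reports comparable numbers.

```python
def per_label_prf(y_true, y_pred, label_names):
    """Return one dict per label with precision, recall, F1 and support."""
    rows = []
    for j, name in enumerate(label_names):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({"label": name, "precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn})
    return rows

y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
for row in per_label_prf(y_true, y_pred, ["Net Zero", "Reduction", "Other"]):
    print(row)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(...)` to get a table.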

In [None]:
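# TODO: Identify and analyse errors
# Find 5-10 misclassified examples

# NOTE: the loop below assumes multi-hot label arrays (Option A).
# For Option B, compare integer class indices and map them to label names directly.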
errors = []
for i, (true, pred) in enumerate(zip(true_labels, predictions)):
    if not np.array_equal(true, pred):
        true_labels_list = [LABEL_NAMES[j] for j, val in enumerate(true) if val == 1]
        pred_labels_list = [LABEL_NAMES[j] for j, val in enumerate(pred) if val == 1]

        errors.append({
            'index': i,
            'text': test_dataset[i]['text'],
            'true_labels': true_labels_list if true_labels_list else ['None'],
            'pred_labels': pred_labels_list if pred_labels_list else ['None'],
            'true_array': true,
            'pred_array': pred
        })

print(f"\nTotal errors: {len(errors)} / {len(test_dataset)} ({len(errors)/len(test_dataset)*100:.1f}%)")

# TODO: For each, print text and consider what went wrong


In [None]:
# TODO: Create error taxonomy
# Categorize errors by type (e.g., ambiguous language, rare class, etc.)

# Categorize errors
error_categories = {
    # TODO complete custom categories
}

for error in errors:
    true_set = set(error['true_labels'])
    pred_set = set(error['pred_labels'])
    # TODO: categorisation logic. Compare true_set and pred_set in an
    # if / elif chain and increment the matching count in error_categories.

# Create taxonomy table
taxonomy_df = pd.DataFrame({
    'Error Category': list(error_categories.keys()),
    'Count': list(error_categories.values()),
    'Percentage': [v / max(len(errors), 1) * 100 for v in error_categories.values()]  # avoids division by zero when there are no errors
})

print("\n" + taxonomy_df.to_string(index=False))

# TODO plot error distribution by category

### Comprehensive Reflection

**TODO**: Address the following questions in your reflection:

1. What made this classification task difficult?

2. Where did your model struggle most? Which classes/labels were hardest? Why?

3. How did performance change from zero-shot → few-shot → fine-tuned?
   Was the progression what you expected?

4. Why did (or didn't) prompt engineering and few-shot learning help?

5. What common mistakes did your model make? Can you identify patterns? What could you try next to improve performance? Consider:

6. What did you learn about using small LLMs for policy analysis?
   When are they sufficient vs. when do you need larger models or domain-specific training?

[Write your reflection here (you may also do this programmatically if you wish)]