# Sequence scorer fine-tuning with pairwise comparison annotations

This notebook follows the approach to fine-tuning an item-level text scoring model on pairwise comparison annotations presented in

> Licht, Hauke, Rupak Sarkar, Patrick Y. Wu, et al. 2025. “Measuring Scalar Constructs in Social Science with LLMs.” arXiv:2509.03116. Preprint, arXiv, September 3. https://doi.org/10.48550/arXiv.2509.03116.


<a target="_blank" href="https://colab.research.google.com/github/haukelicht/advanced_text_analysis/blob/main/notebooks/encoder_finetuning/finetune_sequence_scorer_pairwise.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 


## Background

### Pairwise comparison

**goal:** use pairwise data to estimate items' scores 

methodological motivation:

- many concepts are latent and conceptually continuous
- text items' locations on this scale are hard to observe directly
- human annotators are bad at directly scoring items on such scales (inconsistent over time, different scales, different anchors)
- comparison captures relative differences between pairs of items
- the underlying scale can be estimated from pairwise comparison data (e.g., with a Bradley-Terry model)
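
As a minimal sketch of the last point, the snippet below estimates Bradley-Terry strengths from a handful of made-up comparisons using the standard minorization-maximization (MM) updates. The function name `fit_bradley_terry` and the toy data are illustrative assumptions and not part of this notebook's code base; in practice one would likely rely on a dedicated package.

```python
def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) tuples via MM updates.

    Hypothetical helper for illustration only. Items that never win end up
    with strength 0, so real data may need regularization.
    """
    items = sorted({item for pair in comparisons for item in pair})
    wins = {item: 0 for item in items}
    n_pair = {}  # number of comparisons per unordered pair of items
    for winner, loser in comparisons:
        wins[winner] += 1
        key = frozenset((winner, loser))
        n_pair[key] = n_pair.get(key, 0) + 1

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # sum over opponents j that item i was actually compared with
            denom = sum(
                n_pair[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in n_pair
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # rescale so that strengths sum to the number of items
        total = sum(updated.values())
        updated = {i: v * len(items) / total for i, v in updated.items()}
        change = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if change < tol:
            break
    return strength


# toy data: A usually beats B and C, B usually beats C
toy_comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
strengths = fit_bradley_terry(toy_comparisons)
print(strengths)
```

The estimated strengths recover the ordering implied by the win counts (A above B above C), even though no item was ever scored directly.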


Characteristics:

- _text pair-level_ classification: assign pairs of documents (e.g., sentences) to categories
- single-label classification: assign each pair to one of two categories ("first" or "second" text)

### methods papers and research applications

- Licht, Hauke, Rupak Sarkar, Patrick Y. Wu, et al. 2025. “Measuring Scalar Constructs in Social Science with LLMs.” arXiv:2509.03116. Preprint, arXiv, September 3. https://doi.org/10.48550/arXiv.2509.03116.
- Wu P, Nagler J, Tucker JA, Messing S. Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models. 2023. working paper
- Benoit K, Munger K, Spirling A. Measuring and Explaining Political Sophistication through Textual Complexity. 2019. *American Journal of Political Science*; 63(2): 491–508.
- Carlson D, Montgomery JM. A Pairwise Comparison Framework for Fast, Flexible, and Reliable Human Coding of Political Texts. 2017. *American Political Science Review*; 111(4): 835–843.

### pairwise comparison implementation

#### fine-tuning

given 

- $n$ training examples ($x_i$, $x_j$, $y_{ij}$) where 
	- $x_i$ is item i's text, 
	- $x_j$ is item j's text, and
	- $y_{ij}$ is "1" if the first item wins and "2" if the second item wins
- a (pre-trained) model that 
	- intakes text representations ($x_k$)
	- outputs score $\hat{s}_k$ that estimates item $k$'s strength
- loss function: $-\log(\sigma(\hat{s}_w - \hat{s}_l))$ ("reward loss"), where $\hat{s}_w$ and $\hat{s}_l$ are the predicted scores of the "winner" and "loser" items; the loss shrinks as the winner's predicted score exceeds the loser's by a wider margin
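
To make the loss concrete, here is a small standalone sketch that evaluates $-\log(\sigma(s_w - s_l))$ for three made-up score pairs. The function name `reward_loss` and the example scores are assumptions for illustration; the training code in this notebook relies on the loss computed inside TRL's `RewardTrainer` instead.

```python
import math


def reward_loss(score_winner: float, score_loser: float) -> float:
    """Return -log(sigmoid(score_winner - score_loser)) for a single pair."""
    diff = score_winner - score_loser
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


loss_clear_win = reward_loss(2.0, 0.0)    # winner scored well above loser
loss_toss_up = reward_loss(0.0, 0.0)      # equal scores, equals log(2)
loss_wrong_order = reward_loss(0.0, 2.0)  # loser scored above winner

print(loss_clear_win, loss_toss_up, loss_wrong_order)
```

The loss is smallest when the winner is scored well above the loser, equals $\log 2$ when both scores coincide, and grows roughly linearly when the model orders the pair the wrong way round.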


![Illustration of fine-tuning scoring model from pairwise comparisons data](../.assets/task_types-pairwise_comparison.svg)


#### prompting

given 

- $n$ input pairs ($x_i$, $x_j$) where 
	- $x_i$ is item i's text and
	- $x_j$ is item j's text
- a pre-trained generative LLM
- prompt 
    - explaining comparison task
    - presenting texts 1 and 2
    - giving choice options "1" and "2"

produces pairwise comparison annotations for item score scaling (e.g., with Bradley-Terry model)
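
A prompt along these lines could be assembled as follows. This is only a sketch: the template wording, the helper names `build_comparison_prompt` and `parse_choice`, and the example texts are all hypothetical, and the call to an actual LLM is deliberately left out.

```python
import re

# Hypothetical template: explains the task, presents both texts, lists the options
PROMPT_TEMPLATE = (
    "You will see two texts. Decide which one expresses more {construct}.\n\n"
    "Text 1: {text1}\n"
    "Text 2: {text2}\n\n"
    'Answer with "1" or "2" only.'
)


def build_comparison_prompt(text1: str, text2: str, construct: str) -> str:
    """Fill the template with one pair of texts and the construct to compare on."""
    return PROMPT_TEMPLATE.format(construct=construct, text1=text1, text2=text2)


def parse_choice(response: str):
    """Map a model response to 1 or 2; return None if it is neither."""
    match = re.fullmatch(r"\s*([12])\s*", response)
    return int(match.group(1)) if match else None


prompt = build_comparison_prompt(
    text1="Made-up example text A.",
    text2="Made-up example text B.",
    construct="fear of immigration",
)
print(prompt)
```

Parsing the response strictly (and returning `None` otherwise) keeps malformed answers out of the comparison data that is later passed to the scaling model.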


## Setup

#### Colab

In [None]:
# check if on Colab
COLAB = True
try:
  from google import colab
except ImportError:
  COLAB = False

if COLAB:
    # shallow clone of current state of main branch 
    !git clone --branch main --single-branch --depth 1 --filter=blob:none https://github.com/haukelicht/advanced_text_analysis.git
    
    # make repo root findable for python
    import sys
    sys.path.append("/content/advanced_text_analysis/")

    !pip install -q -U accelerate bitsandbytes trl

#### Load required packages

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd

from src.utils.io import read_jsonlines
from src.finetuning import unpair_data

from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    pipeline
)
from transformers.pipelines.pt_utils import KeyDataset

from trl import (
    ModelConfig,
    RewardTrainer,
    RewardConfig, # similar to transformers' TrainingArguments
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

from tqdm.auto import tqdm

In [None]:
MODEL_NAME = 'answerdotai/ModernBERT-base'

In [None]:
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")

## Load and prepare the data

In [None]:
data_path = base_path / "data" / "labeled" / "carlson_pairwise_2017"
fp = data_path /  "carlson_pairwise_2017-immigration_fear.jsonl"
if not fp.exists():
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/carlson_pairwise_2017/carlson_pairwise_2017-immigration_fear.jsonl"
    df = pd.read_json(url, lines=True)
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_json(fp, lines=True, orient='records', force_ascii=False)

In [None]:
data = read_jsonlines(fp)

In [None]:
# keep only relevant fields
fields = ['pair_id', 'id1', 'id2', 'text1', 'text2', 'label']
data = [{field: d[field] for field in fields} for d in data]

In [None]:
# note: a single example has the texts that were compared, their IDs, and the label
# label = 2 if text2>text1, 1 if text1>text2, 0 otherwise ("tie")
data[0]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
if not tokenizer.pad_token:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
max_tokens = max(tokenizer([d['text'] for d in unpair_data(data)], return_length=True, truncation=False)['length'])
max_tokens

In [None]:
from src.finetuning import split_data
data_splits = split_data(data, test_size=0.2, dev_size=0.2, seed=42, return_dict=True)

In [None]:
datasets = DatasetDict({
    s: Dataset.from_list(data)
    for s, data in data_splits.items()
})

In [None]:
assert max_tokens <= tokenizer.model_max_length, f"Error: The max_seq_length passed ({max_tokens}) is larger than the maximum length for the model ({tokenizer.model_max_length})."
max_seq_length = min(max_tokens, tokenizer.model_max_length)

In [None]:
def paired_preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
        "is_tie": [],
    }
    for text1, text2, label in zip(examples["text1"], examples["text2"], examples["label"]):
        _tokenize = lambda x: tokenizer(x, max_length=max_seq_length, truncation=True)
        is_tie = label == 0
        # NOTE: ties (label 0) are stored with text1 in the "chosen" slot and flagged via `is_tie`
        if label == 1 or label == 0:
            tokenized_chosen, tokenized_rejected = _tokenize(text1), _tokenize(text2)
        elif label == 2:
            tokenized_chosen, tokenized_rejected = _tokenize(text2), _tokenize(text1)
        else:
            raise ValueError("Label must be `1` or `2` to indicate index of chosen item, `0` for ties.")

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])
        new_examples["is_tie"].append(is_tie)
    return new_examples

In [None]:
datasets = datasets.map(paired_preprocess_function, batched=True, num_proc=4)
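
The label-to-slot mapping performed by `paired_preprocess_function` can be illustrated without a tokenizer. The sketch below uses a hypothetical helper `order_pair` and made-up texts; it mirrors only the branching on `label`, not the tokenization.

```python
def order_pair(text1: str, text2: str, label: int):
    """Return (chosen, rejected, is_tie) following the label convention above."""
    if label in (0, 1):
        # label 1: text1 wins; label 0: tie, text1 is kept in the "chosen" slot
        return text1, text2, label == 0
    if label == 2:
        # label 2: text2 wins, so the order is swapped
        return text2, text1, False
    raise ValueError("Label must be 1 or 2 for the chosen item, or 0 for ties.")


first_wins = order_pair("text A", "text B", label=1)
second_wins = order_pair("text A", "text B", label=2)
tie = order_pair("text A", "text B", label=0)

print(first_wins, second_wins, tie, sep="\n")
```

Note that tied pairs still occupy the chosen/rejected slots in the order text1, text2. Whether the trainer makes any use of the `is_tie` column depends on the TRL version, so it may be worth checking this (or filtering ties beforehand) if the data contains many of them.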

In [None]:
datasets.num_rows

## Set up the reward modeling

In [None]:
model_path = base_path / "models" / "carlson_pairwise_2017-immigration_fear-scorer"

reward_config = RewardConfig(
    output_dir=model_path,
    max_length=max_seq_length,
    
    do_train=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=5e-05,
    optim='adamw_torch',
    
    do_eval=True,
    eval_strategy='epoch',
    per_device_eval_batch_size=16,
    
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    save_strategy='epoch',
    save_total_limit=2,

    seed=42,
    data_seed=42,
)

In [None]:
model_config = ModelConfig(MODEL_NAME, lora_task_type='SEQ_CLS')

In [None]:
quantization_config = get_quantization_config(model_config)
model_kwargs = dict(
    revision=model_config.model_revision,
    trust_remote_code=model_config.trust_remote_code,
    device_map=get_kbit_device_map() if quantization_config is not None else "auto",
    quantization_config=quantization_config,
)

In [None]:
import warnings
from sklearn.metrics import accuracy_score, f1_score
from typing import Dict

def compute_metrics(eval_pred) -> Dict[str, float]:
    # taken from: https://github.com/huggingface/trl/blob/dcee683d968444179f57bffa5a49a7ec13f57654/trl/trainer/utils.py#L634
    predictions, labels = eval_pred
    # Here, predictions is rewards_chosen and rewards_rejected.
    # We want to see how much of the time rewards_chosen > rewards_rejected.
    if np.array(predictions[:, 0] == predictions[:, 1], dtype=float).sum() > 0:
        warnings.warn(
            f"There are {np.array(predictions[:, 0] == predictions[:, 1]).sum()} out of {len(predictions[:, 0])} instances where the predictions for both options are equal. As a consequence the accuracy can be misleading."
        )
    predictions = np.argmax(predictions, axis=1)

    scores = {
        "accuracy": accuracy_score(y_true=labels, y_pred=predictions),
        # NOTE: the "chosen" text (i.e. the one selected in pairwise comparison) is always put first, so we need to look at label 0
        "f1": f1_score(y_true=labels, y_pred=predictions, average=None, labels=[0])[0],
    }
    
    return scores
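
To see what these two metrics measure, consider a toy batch of four (chosen, rejected) reward pairs. The sketch below recomputes accuracy and the class-0 F1 score in plain Python, under the assumption (taken from the comment in the function above) that the true label is always 0 because the chosen text comes first. The reward values are made up.

```python
# (reward_chosen, reward_rejected) for four hypothetical pairs
reward_pairs = [(1.2, 0.3), (0.1, 0.8), (2.0, -1.0), (0.5, 0.4)]

# argmax over the two rewards: 0 if the chosen text is scored at least as high
predicted = [0 if chosen >= rejected else 1 for chosen, rejected in reward_pairs]
labels = [0] * len(reward_pairs)  # assumption: chosen text always sits at index 0

accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# F1 for class 0 from true positives, false positives, and false negatives
tp = sum(p == 0 and y == 0 for p, y in zip(predicted, labels))
fp = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
fn = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

Under this assumption there are no false positives for class 0, so F1 reduces to `2 * accuracy / (1 + accuracy)` and ranks checkpoints in the same order as accuracy does.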

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_config.model_name_or_path, num_labels=1, **model_kwargs)
model.config.pad_token_id = tokenizer.pad_token_id

trainer = RewardTrainer(
    model=model,
    processing_class=tokenizer,
    args=reward_config,
    train_dataset=datasets['train'],
    eval_dataset=datasets['dev'],
    compute_metrics=compute_metrics,
    peft_config=get_peft_config(model_config),
)

## Finetune

In [None]:
train_result = trainer.train(ignore_keys_for_eval=["hidden_states"])

In [None]:
trainer.evaluate(datasets['test'], metric_key_prefix='test')

### item-level scoring

Remember that the underlying model is a sequence classification model with a single output neuron (i.e., a scoring/regression model).

In [None]:
print("model type:", type(trainer.model).__name__)
print("number of output neurons:", trainer.model.config.num_labels)

So we can pass it individual texts to score them with the fine-tuned model.

For this, we first need to unnest/unpack/unpair the text pair-level data into individual text items:

In [None]:
# unpack the pair-level data to item-level data
items_dataset = unpair_data(datasets['test'])

# look at the first item
print(items_dataset[0])

# make it a Dataset instance
items_dataset = Dataset.from_list(items_dataset)
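
For intuition, unpairing can be pictured as follows. This is a hypothetical re-implementation (`unpair_pairs`), not the repository's `unpair_data`: it assumes the pair-level fields `id1`, `id2`, `text1`, and `text2` used earlier in this notebook and keeps each item once, whereas the actual helper may handle duplicates or extra fields differently.

```python
def unpair_pairs(pairs):
    """Flatten pair-level records into unique item-level records (illustrative)."""
    texts_by_id = {}
    for pair in pairs:
        for slot in ("1", "2"):
            # keep the first text seen for each item ID
            texts_by_id.setdefault(pair["id" + slot], pair["text" + slot])
    return [{"id": item_id, "text": text} for item_id, text in texts_by_id.items()]


toy_pairs = [
    {"id1": "a", "id2": "b", "text1": "text of a", "text2": "text of b", "label": 1},
    {"id1": "b", "id2": "c", "text1": "text of b", "text2": "text of c", "label": 2},
]
toy_items = unpair_pairs(toy_pairs)
print(toy_items)
```

Item "b" appears in both pairs but only once in the item-level output, which is what we want when each text should receive a single score.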

In [None]:
# NOTE: for convenience we use Hugging Face's pipeline class for inference with our scoring/regression model
scorer = pipeline(task="text-classification", model=trainer.model, tokenizer=tokenizer, batch_size=32)

In [None]:
# example
scorer(items_dataset['text'][0])

In [None]:
# for batch inference (see https://huggingface.co/docs/transformers/pipeline_tutorial#batch-inference)
kd = KeyDataset(items_dataset, "text")

pred_scores = np.array([p['score'] for p in tqdm(scorer(kd, batch_size=32), total=len(items_dataset))])
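
One caveat when reading `pred_scores`: according to the `transformers` documentation, the text-classification pipeline applies a sigmoid by default when the model has a single output, unless the model's `problem_type` is set to regression or `function_to_apply="none"` is passed. If that default applies here, the scores lie between 0 and 1 rather than on the raw score scale. Because the sigmoid is strictly increasing, the ranking of items is unaffected, as the following sketch with made-up raw scores illustrates.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


raw_scores = [-1.5, 0.2, 3.0, -0.4]           # hypothetical raw model outputs
squashed = [sigmoid(s) for s in raw_scores]   # what a sigmoid-applying pipeline returns

# indexes sorted from lowest to highest score, before and after the sigmoid
order_raw = sorted(range(len(raw_scores)), key=raw_scores.__getitem__)
order_squashed = sorted(range(len(squashed)), key=squashed.__getitem__)

print(order_raw, order_squashed)
```

For ranking items, as done below, either scale works; for analyses that depend on score differences, the raw outputs are probably the more natural choice.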

In [None]:
# get indexes of 5 highest scoring items
idxs = np.argsort(pred_scores)[-5:]
print(*items_dataset.select(idxs)['text'], sep='\n')

In [None]:
# get indexes of 5 lowest scoring items
idxs = np.argsort(pred_scores)[:5]
print(*items_dataset.select(idxs)['text'], sep='\n')