# Fine-tune token classifier for social group mention detection and extraction

In this notebook, we use annotations from 

> Licht, Hauke, and Ronja Sczepanski. 2025. “Detecting Group Mentions in Political Rhetoric A Supervised Learning Approach.” British Journal of Political Science 55: e119. https://doi.org/10.1017/S0007123424000954.

to fine-tune a classifier that identifies and extracts phrases in texts that refer to social groups.

<a target="_blank" href="https://colab.research.google.com/github/haukelicht/advanced_text_analysis/blob/main/notebooks/encoder_finetuning/finetune_token_classifier.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

## Background

span extraction:

- **task**: extract spans of words from a text that are mentions/indicators of a target concept
- **approaches**
    - supervised learning: token classification
    - prompting: prompt an LLM to respond with structured output in the form of a list of strings (verbatim extractions from the input text; see Kasner et al., [2025](https://doi.org/10.48550/arXiv.2504.08697))
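
With the prompting approach, the verbatim extractions still have to be mapped back to character positions in the input text. The following is a minimal sketch of that step, assuming the extractions arrive as a plain list of strings. The helper `locate_extractions` is hypothetical and is not taken from Kasner et al.

```python
from typing import List, Tuple

def locate_extractions(text: str, extractions: List[str]) -> List[Tuple[int, int, str]]:
    """Map verbatim extractions to (start, end, phrase) character spans (end exclusive)."""
    spans = []
    cursor = 0  # search left to right so repeated phrases map to successive occurrences
    for phrase in extractions:
        start = text.find(phrase, cursor)
        if start == -1:
            # extractions may be listed out of order: retry from the beginning
            start = text.find(phrase)
        if start == -1:
            # not a verbatim substring of the input: drop it
            continue
        end = start + len(phrase)
        spans.append((start, end, phrase))
        cursor = end
    return spans

text = "Labour fights for hard-working people."
spans = locate_extractions(text, ["Labour", "hard-working people", "the elite"])
print(spans)
```

Extractions that do not occur verbatim in the text (here, "the elite") are silently dropped; in practice one may prefer to log them.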

### examples

- **named entities** (place, organization, person)

    > "This year, **COMPTEXT** was in **Vienna**."

- **group mention** (social group, political group, etc.)

    > "**Labour** fights for **hard-working people**."

- **policy pledge**

    > "We will **lower taxes by 50%**."
    
- **valence attack** (i.e., criticism of political opponent's character/ability/credibility)

    <!-- > "The Prime Minister **has not been honest with us**. -->

    > "The government has **betrayed the people** over and over again."


### token classification

![Illustration of token classification](../.assets/task_types-token_classification.svg){ height=50% }


- _token_-level classification: assign a category to each token in a document (e.g., a sentence)
- single-label classification: assign each token to one and only one category
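
As a toy illustration of these two properties, the sketch below pairs each token of the example sentence from above with exactly one category and then merges runs of adjacent tokens with the same category into mentions. The tokenization and the category name `group` are hand-written assumptions for illustration.

```python
from itertools import groupby

# one category per token (single-label, token-level)
tokens = ["Labour", "fights", "for", "hard-working", "people", "."]
labels = ["group", "O", "O", "group", "group", "O"]
assert len(tokens) == len(labels)

pairs = list(zip(tokens, labels))

# merge runs of consecutive tokens that share a non-"O" category into mentions
mentions = [
    " ".join(tok for tok, _ in run)
    for label, run in groupby(pairs, key=lambda pair: pair[1])
    if label != "O"
]
print(mentions)  # ['Labour', 'hard-working people']
```

Note that this simple labeling cannot separate two mentions that are directly adjacent, which is what the B-/I- prefixes of the IOB2 scheme used below are for.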

## methods papers and research applications

- Licht, Hauke, and Ronja Sczepanski. 2025. “Detecting Group Mentions in Political Rhetoric A Supervised Learning Approach.” British Journal of Political Science 55: e119. https://doi.org/10.1017/S0007123424000954.
- Kasner, Zdeněk, Vilém Zouhar, Patrícia Schmidtová, et al. 2025. “Large Language Models as Span Annotators.” arXiv:2504.08697. Version 1. Preprint, arXiv, April 11. https://doi.org/10.48550/arXiv.2504.08697.
- Klamm C, Rehbein I, Ponzetto SP. Our kind of people? Detecting populist references in political debates. 2023. *Findings of the Association for Computational Linguistics: EACL 2023*. 1227–1243. doi:[10.18653/v1/2023.findings-eacl.91](https://doi.org/10.18653/v1/2023.findings-eacl.91)
- Skorupa Parolin E, Hosseini MS, Hu Y, Khan L, Brandt PT, Osorio J, D'Orazio V. Multi-CoPED: A Multilingual Multi-Task Approach for Coding Political Event Data on Conflict and Mediation Domain. 2022. *Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES '22)*. 700–711. doi:[10.1145/3514094.3534178](https://doi.org/10.1145/3514094.3534178)


## Setup

### Setup Colab (if using Colab)

In [None]:
# check if on Colab
COLAB = True
try:
  from google import colab
except ImportError:
  COLAB = False

In [None]:
# install soft-seqeval (latest version)
!pip install -q --upgrade --force-reinstall --no-deps git+https://github.com/haukelicht/soft-seqeval.git@main

### Load required libraries

In [None]:
from pathlib import Path
import shutil

import numpy as np
import pandas as pd

import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    set_seed,
)

from soft_seqeval.metrics import compute_sequence_metrics

In [None]:
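from nltk.data import find as nltk_find
from nltk import download as nltk_download

# download NLTK's sentence tokenizer resources if they are not available locally
nltk_res = ['punkt', 'punkt_tab']
for res in nltk_res:
    try:
        nltk_find(f'tokenizers/{res}')  # NLTK resource names use forward slashes
    except LookupError:
        nltk_download(res)  # download only the missing resource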

In [None]:
import json
from pathlib import Path

from soft_seqeval.classes import LabeledSequence, Entities, Entity
from collections import OrderedDict

from typing import List, Dict, Any, Mapping, Union

def read_jsonlines_corpus(
    file: str, 
    id_field: str='id', 
    text_field: str='text', 
    annotations_field: str='label', 
    remove_unsure: bool=True,
    lang: str='english'
) -> List[LabeledSequence]:
    """Read a jsonlines corpus and return a list of LabeledSequence objects.
    Args:
        file (str): Path to the jsonlines file.
        id_field (str): Name of the field containing the document ID.
        text_field (str): Name of the field containing the document text.
        annotations_field (str): Name of the field containing the annotations.
        remove_unsure (bool): Whether to remove annotations whose type ends with 'unsure'.
        lang (str): Language of the documents.
    
    Returns:
        List[LabeledSequence]: One LabeledSequence object per document, in file order.
    """
    with open(file, 'r') as f:
        data = [] 
        for line in f:
            try: 
                line = json.loads(line)
                data.append(line)
            except json.JSONDecodeError:
                # skip lines that are not valid JSON
                pass
    
    documents = [
        LabeledSequence(
            text=doc[text_field],
            entities=Entities([
                Entity(*lab) 
                for lab in doc[annotations_field] 
                if (not lab[2].lower().endswith('unsure') if remove_unsure else True)
            ]),
            id = doc[id_field],
            lang=lang
        )
        for doc in data
    ]
    
    return documents


In [None]:
MODEL_NAME = "answerdotai/ModernBERT-base"

In [None]:
device = 'cuda:0' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device

In [None]:
SEED = 42

In [None]:
set_seed(SEED)

In [None]:
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")

## Load and prepare the data

In [None]:
data_path = base_path / "data" / "labeled" / "licht_detecting_2025"

In [None]:
fp = data_path / "licht_detecting_2025-uk_manifestos.jsonl"
if not fp.exists():
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/licht_detecting_2025/licht_detecting_2025-uk_manifestos.jsonl"
    df = pd.read_json(url, lines=True)
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_json(fp, lines=True, orient='records', force_ascii=False)

In [None]:
corpus = read_jsonlines_corpus(fp)

In [None]:
corpus[1]

In [None]:
from src.finetuning import split_data

data_splits = split_data(corpus, dev_size=0.1, test_size=0.15, seed=SEED, return_dict=True)

In [None]:
# from the annotations, get all entity "types" and construct a label2id mapping
#  where the labels follow the IOB2 scheme for each entity type
# NOTE: we sort the types so that the label IDs are identical across runs (set order is not stable)
types = sorted(set(ent.type for dataset in data_splits.values() for doc in dataset for ent in doc.entities))
scheme = ['O'] + ['I-'+t for t in types] + ['B-'+t for t in types]
label2id = {l: i for i, l in enumerate(scheme)}
id2label = {i: l for i, l in enumerate(scheme)}
NUM_LABELS = len(label2id)

label2id
# NOTE: the span-level annotations will be converted to token-level annotations using the IOB2 scheme.
#       This means that 
#        - a word that is not part of any entity will be labeled as "O",
#        - a word at the beginning of a span will be labeled as "B-<entity_type>", and 
#        - a word inside a span will be labeled as "I-<entity_type>"
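
The IOB2 conversion described in the note above can be sketched as follows. This is a simplified stand-in, not the implementation in `soft_seqeval`: the helper `spans_to_iob2` is hypothetical, it uses a basic regular expression instead of a proper word tokenizer, and it assumes spans are `(start, end, type)` tuples with exclusive character end offsets.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, type), end exclusive

def spans_to_iob2(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Split text into words and assign one IOB2 tag per word."""
    words, tags = [], []
    # words (incl. hyphens/apostrophes) or single punctuation characters
    for m in re.finditer(r"[\w'-]+|[^\w\s]", text):
        tag = "O"
        for start, end, etype in spans:
            if m.start() < end and m.end() > start:  # word overlaps the span
                # the first overlapping word gets "B-", all later ones "I-"
                prefix = "B" if m.start() <= start else "I"
                tag = f"{prefix}-{etype}"
                break
        words.append(m.group())
        tags.append(tag)
    return words, tags

text = "Labour fights for hard-working people."
spans = [(0, 6, "social group"), (18, 37, "social group")]
words, tags = spans_to_iob2(text, spans)
for w, t in zip(words, tags):
    print(repr(w), "==>", t)
```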

In [None]:
# NOTE: here we use the `to_labeled_tokens` method of the LabeledSequence instances to convert 
#       the span-level annotations to token-level annotations.
#       This method returns a LabeledTokens instance with the token-level annotations, 
#       which we then convert into a dictionary with fields 'tokens' and 'labels'.
data_splits = {
    s: [doc.to_labeled_tokens(label2id).to_dict() for doc in dataset]
    for s, dataset in data_splits.items()
}

In [None]:
# show an example: each token with its IOB2 label
i = 2
for t, l in zip(data_splits['train'][i]['tokens'], data_splits['train'][i]['labels']):
    print(repr(t), '==>', id2label[l])

In [None]:
from src.finetuning import create_token_classification_dataset
from datasets import DatasetDict

# use the custom function imported from `src.finetuning` to convert the corpus to a datasets.Dataset instance (used by transformers' Trainer below)
datasets = DatasetDict({
    s: create_token_classification_dataset(dataset)
    for s, dataset in data_splits.items()
})

In [None]:
datasets.num_rows

In [None]:
datasets['train'][0]

In [None]:
from src.finetuning import preprocess_token_classification_dataset

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True, add_prefix_space=True)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

# apply the custom function imported from `src.finetuning` to tokenize the texts and set subword tokens' labels to -100
# this is necessary because the tokenizer may split a word into multiple subwords
datasets = datasets.map(lambda example: preprocess_token_classification_dataset(example, tokenizer=tokenizer), batched=True)
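
To make the -100 convention concrete, here is a minimal sketch of one common way to align word-level labels with subword tokens: only the first subword of each word keeps the word's label, while special tokens and continuation subwords get -100, the default `ignore_index` of PyTorch's cross-entropy loss. The helper `align_labels` is hypothetical; `preprocess_token_classification_dataset` may differ in detail. The `word_ids` list mimics what a fast tokenizer's `word_ids()` returns and is hand-written here.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Assign each word's label to its first subword and IGNORE_INDEX elsewhere."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)           # special token (e.g., [CLS], [SEP])
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)           # continuation subword
        previous = word_id
    return aligned

# hand-written example: word 3 is split into three subwords
word_ids = [None, 0, 1, 2, 3, 3, 3, 4, None]
word_labels = [2, 0, 0, 2, 1]  # label IDs, one per word
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 2, 0, 0, 2, -100, -100, 1, -100]
```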

In [None]:
# show an example: each subword token with its label (-100 = ignored)
example = datasets['train'][2]
for t, l in zip(example['input_ids'], example['labels']):
    if t == tokenizer.pad_token_id:
        break
    print(l, '\t', repr(tokenizer.decode(t)))

In [None]:
# NOTE: after tokenization, the texts are represented by their token IDs (`input_ids`),
#        so we can remove the word-level tokens from the dataset (no need to load them onto the GPU)
datasets = datasets.remove_columns(['tokens']) 

## Prepare the model fine-tuning

In [None]:
from transformers import AutoConfig
# NOTE: the `model_init` function is used by the Trainer to initialize the model
#   and is called each time before training starts.
#  So we define it here to load the model from the Huggingface model hub
#   and set the number of labels to the number of unique labels in the dataset
#   and the label2id and id2label mappings
def model_init():
    config = AutoConfig.from_pretrained(MODEL_NAME)
    config.num_labels = NUM_LABELS
    config.label2id = label2id
    config.id2label = id2label
    return AutoModelForTokenClassification.from_pretrained(MODEL_NAME, config=config, device_map='auto')

In [None]:
# NOTE: below we define a custom function (`compute_metrics`) that computes the fine-tuned 
#       model's performance from its predictions for the dev or test set examples

# example: comparing the gold labels with themselves yields perfect scores
y_true = datasets['test']['labels'][:25]
y_pred = datasets['test']['labels'][:25]

compute_sequence_metrics(y_true, y_pred, id2label, flatten_output=True)

In [None]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    # convert predictions and labels to list of lists of ints
    predictions = predictions.astype(int).tolist()
    labels = labels.astype(int).tolist()
    return compute_sequence_metrics(y_true=labels, y_pred=predictions, id2label=id2label, flatten_output=True)
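
For intuition about span-level evaluation, the sketch below computes precision, recall, and F1 with exact span matching on IOB2 tag sequences. It is a simplified illustration and not the implementation of `compute_sequence_metrics`; the helpers `iob2_to_spans` and `span_f1` and the tag `B-g`/`I-g` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def iob2_to_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Extract (start, end, type) spans (end exclusive) from an IOB2 tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def span_f1(y_true: List[List[str]], y_pred: List[List[str]]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 with exact span matching."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_spans, pred_spans = iob2_to_spans(true_tags), iob2_to_spans(pred_tags)
        tp += len(true_spans & pred_spans)  # spans with identical boundaries and type
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

y_true = [["B-g", "O", "O", "B-g", "I-g", "O"]]
y_pred = [["B-g", "O", "O", "B-g", "O", "O"]]  # second span is cut short
scores = span_f1(y_true, y_pred)
print(scores)
```

Under exact matching, the truncated second span counts as both a false positive and a false negative, which is the kind of near miss that softer matching schemes are designed to credit partially.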

In [None]:
# NOTE: the metric used for early stopping and for selecting the best checkpoint
#       must be among the keys returned by our `compute_metrics` function defined above.
#       So let's check this with a toy example (label IDs for the first entity type).
METRIC = "seqeval-macro_f1"  # must match `metric_for_best_model` in the training arguments

t = types[0]
ex = [label2id['O'], label2id[f'B-{t}'], label2id[f'I-{t}'], label2id['O']]
scores = compute_sequence_metrics([ex], [ex], id2label, flatten_output=True)
if METRIC not in scores:
    raise ValueError(f"Invalid metric: {METRIC}, valid metrics are: {', '.join(scores.keys())}")

### Define the training arguments

In [None]:
model_path = base_path / "models" / "licht_detecting_2025-group_mention_detector"

In [None]:
out_dir = model_path
checkpoints_dir = out_dir / 'checkpoints'
logs_dir = out_dir / 'logs'

training_args = TrainingArguments(
    
    # hyperparameters
    num_train_epochs=10,
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=32,
    weight_decay=0.3,
    optim='adamw_torch',
    
    # when to evaluate
    eval_strategy='epoch',
    # how to select "best" model
    do_eval='dev' in datasets,
    metric_for_best_model="seqeval-macro_f1",
    load_best_model_at_end=True,
    # when to save
    save_strategy='epoch',
    save_total_limit=2 if 'dev' in datasets else None, # don't save all model checkpoints
    # where to store results
    output_dir=checkpoints_dir,
    overwrite_output_dir=True,
    
    # logging
    logging_dir=logs_dir,
    logging_strategy='epoch',
    
    # reproducibility
    seed=SEED,
    data_seed=SEED,
    full_determinism=True
)


# build callbacks
callbacks = []
if 'dev' in datasets:
    callbacks.append(EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.03))
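
The patience and threshold settings can be read roughly as follows: an evaluation only counts as an improvement if the dev metric beats the best value so far by more than the threshold, and training stops after `patience` consecutive evaluations without such an improvement. The sketch below is a simplified illustration of this rule under that assumption, not the `transformers` implementation; `early_stopping_epoch` is a hypothetical helper and the scores are made up.

```python
from typing import List, Optional

def early_stopping_epoch(scores: List[float], patience: int = 3, threshold: float = 0.03) -> Optional[int]:
    """Return the (1-based) epoch at which training would stop, or None if it never stops."""
    best = None
    counter = 0  # consecutive evaluations without a sufficiently large improvement
    for epoch, score in enumerate(scores, start=1):
        if best is None or (score > best and score - best > threshold):
            counter = 0
        else:
            counter += 1
        if best is None or score > best:
            best = score  # track the best value seen so far
        if counter >= patience:
            return epoch
    return None

# made-up dev scores: gains of 0.01 per epoch are below the 0.03 threshold
stop_at = early_stopping_epoch([0.50, 0.60, 0.61, 0.62, 0.63, 0.70])
print(stop_at)  # 5
```

With a threshold of 0.03 on an F1 scale, slow but steady improvements are treated as stagnation, so a smaller threshold may be preferable if training should continue through small gains.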


### Create the trainer

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=datasets['train'],
    eval_dataset=datasets['dev'] if 'dev' in datasets else None,
    processing_class=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=callbacks
)

## Train

In [None]:
print('Training ...')
train_hist = trainer.train()

### Evaluate

In [None]:
# apply the best model loaded after finishing training to the test set
print('Evaluating ...')
test_res = trainer.evaluate(datasets['test'], metric_key_prefix='test')

In [None]:
# reshape the results into a table that is easier to read
out = pd.DataFrame(test_res, index=['value']).T
out = out.reset_index().rename(columns={'index': 'cat'})
out[['set', 'scheme', 'metric', 'misc']] = out.cat.str.split('_', expand=True)
out = out[out.misc.isnull()]
out = out[out.metric.notnull()]
out[['scheme', 'type']] = out.scheme.str.split('-', expand=True)
out = out.drop(columns=['set', 'cat', 'misc'])
out = out[['scheme', 'type', 'metric', 'value']]
out = out.pivot(index=['type', 'scheme', ], columns='metric', values='value')
keys = [
    (typ, scheme)
    for typ in types
    for scheme in ['seqeval', 'softseqeval', 'wordlevel', 'doclevel']
]
out.loc[keys, :]

### Inference

In [None]:
from datasets import Dataset

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

from tqdm import tqdm

We'll use the `pipeline` API from `transformers` for inference (i.e., predicting spans in unlabeled data).

Specifically, we use the **NER** (named entity recognition) task and pass the fine-tuned model from the trainer.

In [None]:
extractor = pipeline(task='ner', model=trainer.model, tokenizer=tokenizer, batch_size=32, aggregation_strategy='simple')

In [None]:
fields = ['id', 'text']
df = pd.DataFrame([{f: doc[f] for f in fields} for doc in data_splits['test']])

In [None]:
# apply the extractor to the dataset
pred_ents = [p for p in extractor(df['text'].tolist())]

In [None]:
pred_ents[0]

`pred_ents` contains one list per text in `df['text']`, and each of these lists holds one dictionary per predicted entity.

Each of these dictionaries has the following fields:

- start: character start index of the entity in the text
- end: character end index of the entity in the text
- score: confidence score of the prediction
- word: the text of the entity
- entity_group: the entity type (e.g., 'social group')
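
As a small illustration of how these fields fit together, the sketch below uses the character offsets to slice the mentions out of the input text. The prediction dictionaries are hand-written for illustration (including the scores) and are not actual model output.

```python
text = "Labour fights for hard-working people."

# hand-written stand-in for one element of `pred_ents`
pred = [
    {"entity_group": "social group", "score": 0.99, "word": "Labour", "start": 0, "end": 6},
    {"entity_group": "social group", "score": 0.97, "word": " hard-working people", "start": 18, "end": 37},
]

# slicing with start/end recovers the mention exactly as it appears in the text,
# whereas `word` may contain tokenizer artifacts such as a leading space
mentions = [
    {"type": ent["entity_group"], "mention": text[ent["start"]:ent["end"]], "score": ent["score"]}
    for ent in pred
]
for m in mentions:
    print(m)
```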


Let's convert these annotations into one `Entities` instance and create a new `LabeledSequence` instance from this information for each text:

In [None]:
from soft_seqeval.classes import Entity, Entities
from soft_seqeval.classes import LabeledSequence
from copy import deepcopy

def pipeline_output_to_entities(pred) -> Entities:
    """Take output from the NER pipeline and convert to Entities instance"""
    ents = []
    for ent in pred:
        ent = deepcopy(ent)
        # the tokenizer may attach whitespace to the entity text: adjust the offsets accordingly
        if ent['word'].startswith(' '):
            ent['start'] += 1
        if ent['word'].endswith(' '):
            ent['end'] -= 1
        ents.append(Entity(ent['start'], ent['end'], ent['entity_group']))
    return Entities(ents)

# iterate over the documents and predicted annotations to create a list of LabeledSequence instances
preds = [
    LabeledSequence(text=doc['text'], entities=pipeline_output_to_entities(pred), id=doc['id'], lang='english')
    for (_, doc), pred in zip(df.iterrows(), pred_ents)
]

In [None]:
# look at first 10 examples
preds[:10]


## Finally

#### Delete intermediate checkpoints and log files

In [None]:
# finally: clean up
if checkpoints_dir.exists():
    shutil.rmtree(checkpoints_dir)
if logs_dir.exists():
    shutil.rmtree(logs_dir)

#### Save the best model (if desired)

In [None]:
trainer.save_model(out_dir)
tokenizer.save_pretrained(out_dir)

#### Free the GPU and remove large objects

In [None]:
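import gc
# move the model off the GPU, drop references to large objects, and release cached GPU memory
trainer.model.to('cpu')
del extractor, trainer, tokenizer, data_splits, datasets
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()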