# placeholder {.hidden}

# Application

In [None]:
#| eval: false
import os
from pathlib import Path
import gc

import pandas as pd
import numpy as np
base_path = Path.cwd().parent.parent

import sys
sys.path.append(str(base_path / 'paper'))

from reporting import *

from IPython.display import HTML
latex_table = lambda x, *args, **kwargs: HTML(x.to_html(**kwargs))

# Define base directory (paper/secs -> go up two levels to project root)
data_path = base_path / 'data'
labeled_path = data_path / 'labeled'
intermediate_path = data_path / 'intermediate'
manifestos_path = data_path / 'manifestos'
annotations_path = data_path / 'annotations' # / 'group_mention_categorization'
results_path = base_path / 'results'

In [None]:
#| cache: true
fp = manifestos_path / 'all_manifesto_sentences.tsv'
sentences_df = pd.read_csv(fp, sep='\t')

In [None]:
sentences_df.sentence_id.nunique()

In [None]:
#| cache: true
fp = labeled_path / 'labeled_mentions_with__party_metadata.pkl'
df = pd.read_pickle(fp)

In [None]:
df.country_iso3c.nunique()
df.year.describe().loc[["min", "max"]]
df.manifesto_id.nunique()

To demonstrate the scalability of our approach, we apply automated text analysis methods to, first, extract social group mentions from party manifestos, and second, classify these mentions according to the attribute categories they feature.

## Case selection and data

We focus our application on the social group mentions of populist radical-right (PRR) and Green parties and have compiled a corpus of these parties' manifestos in 36 Western countries between 1966 and 2021.
The contrasting ideological profiles of the PRR and Green party families should provide for a wide range of group references and therefore a broad coverage of attribute categories. <!-- Further, this focus complements existing work on mainstream parties. -->
However, to allow for comparisons to mainstream parties, we have also included the party manifestos of mainstream center-left/social democratic and center-right/conservative parties in four selected countries (Germany, Sweden, United Kingdom, and the United States).
@fig-cases_overview shows an overview of the cases included in our application.

We have obtained the texts of party manifestos from secondary sources[^fn:manifesto_sources] and filled gaps through original data collection whenever possible.
We created sentence-level data from raw texts through automatic sentence segmentation[^fn:sentence_segmentation] and machine-translated all sentences into English using the open-weights M2M model [@fan_beyond_2020]. <!--by Helsinki NLP [@tiedemann_opus-mt_2020; tiedemann_democratizing_2024].-->
In total, we processed 495 party manifestos into a corpus comprising 436,984 sentences.

[^fn:manifesto_sources]: The _Manifesto Project Dataset_ [@lehmann_manifesto_2023] and PoliDoc [@benoit_challenges_2009].
[^fn:sentence_segmentation]: Using the `stanza` library [@qi_stanza_2020].

## Measurement

We proceeded in two steps to quantify the attributes used in social group mentions in this corpus.
We first identified social group mentions in manifesto sentences, applying the methods described in @licht_detecting_2025.
We then labeled the extracted social group mentions by classifying the attributes they contain with a custom multi-label classification approach.
In each step, we collected manual annotations for a sample of texts to produce labeled data for (few-shot) supervised machine learning.

### Social group mention detection

Studying what types of attributes political parties emphasize when they mention social groups in their manifestos presupposes data that records the verbatim social group mentions contained in each manifesto sentence (if any) [@licht_detecting_2025; @horne_using_2025].
Given the broad coverage of our case selection, we could neither rely on full manual annotation nor on existing labeled data or pre-trained classifiers.

Accordingly, we produced the labeled data necessary for our application following the approach proposed by @licht_detecting_2025.
Their approach combines sequence annotation (i.e., marking relevant phrases in sentences) and supervised token classification to extract any verbatim words or phrases used in a sentence to refer to social groups.

<!-- ### Annotation, classifier training, and evaluatio -->

<!-- TODO: discuss whether to mention social group / organizational group distinction  -->

In particular, we proceeded in three annotation rounds, building on an active learning-like logic <!-- @miller_active_2020: https://doi.org/10.1017/pan.2020.4 --> with the goal of maximizing the reliability and generalization of the supervised classifier trained on the collected annotations.
In the first round, we applied informativeness-based sampling [@kaufman_selecting_2024] to maximally diversify the selection of examples distributed for annotation, selecting 4,454 sentences for annotation (stratifying by manifesto, i.e., party and election). <!-- see ./code/sampling/sample_group_mention_annotation_batch_01.ipynb -->
To manually annotate social group mentions in this sample, we recruited a research assistant (RA) who already had experience with this annotation task from prior projects and had proved very reliable.[^fn:mention_detection_ra]
To prepare the second annotation round, we used the RA's annotations from round one to train a preliminary token classifier, applied it to the remaining sentences in our corpus, and computed classification uncertainty for each sentence based on the predicted probabilities of the preliminary classifier.
We then sampled 2,472 sentences (again, stratifying by manifesto) with high prediction uncertainty for annotation for the second round. <!-- see ./code/mention-detection/sample_most_uncertain.ipynb -->
Importantly, this sampling strategy allowed us to progressively focus the human annotation effort on difficult cases. <!-- [cf. @licht_measuring_2025]. -->
We repeated this process one more time -- annotation, classifier training, uncertainty-based sampling -- to sample another 987 sentences for annotation in a third round. <!-- see ./code/mention-detection/sample_most_uncertain.ipynb -->
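
A minimal sketch of the uncertainty-based sampling step, under stated assumptions: sentence-level uncertainty is taken to be the mean entropy of the token-level predicted label distributions, and the same number of sentences is drawn from every manifesto. The text above does not specify either choice, and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def sentence_uncertainty(token_probs):
    """Mean Shannon entropy (in nats) of the per-token label distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies)


def sample_most_uncertain(sentences, n_per_manifesto):
    """Pick the n most uncertain sentences within each manifesto (stratum)."""
    by_manifesto = defaultdict(list)
    for s in sentences:
        by_manifesto[s["manifesto_id"]].append(s)
    selected = []
    for manifesto_id in sorted(by_manifesto):
        ranked = sorted(
            by_manifesto[manifesto_id],
            key=lambda s: sentence_uncertainty(s["token_probs"]),
            reverse=True,
        )
        selected.extend(ranked[:n_per_manifesto])
    return [s["sentence_id"] for s in selected]


# Toy input: each token has a distribution over (outside, inside a mention).
toy_sentences = [
    {"sentence_id": "a1", "manifesto_id": "A", "token_probs": [[0.99, 0.01], [0.98, 0.02]]},
    {"sentence_id": "a2", "manifesto_id": "A", "token_probs": [[0.55, 0.45], [0.60, 0.40]]},
    {"sentence_id": "b1", "manifesto_id": "B", "token_probs": [[0.90, 0.10], [0.50, 0.50]]},
    {"sentence_id": "b2", "manifesto_id": "B", "token_probs": [[0.97, 0.03], [0.95, 0.05]]},
]
picked = sample_most_uncertain(toy_sentences, n_per_manifesto=1)
```

In this toy example, the sentence with token probabilities closest to 0.5 is selected within each manifesto, which is how the sampling concentrates annotation effort on difficult cases.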


[^fn:mention_detection_ra]:
    In addition to being qualified for the task through prior experience, the RA received detailed coding instructions explaining the concept of social group mentions with definitions and examples, and we performed two rounds of training with the RA using 231 and 238 sentences, respectively, sampled from our target corpus.
    This allowed us to assess the coder's ability to identify social group mentions in the target data and to provide them with feedback.
    <!--
    see 
    - data/annotations/group_mention_detection/group-menion-coder-training/annotations/sample_round1_emarie.jsonl (N=231)
    - data/annotations/group_mention_detection/group-menion-coder-training/annotations/sample_round1_emarie.jsonl (N=238)
    -->

We combined all sentences annotated in these three rounds, set aside 15% of sentences for evaluation, and trained a token classifier using the same model architecture and training procedure as described in @licht_detecting_2025.
The final classifier trained on our data achieved a span-level F1[^fn:seqeval] of 0.927 and a sentence-level F1 of 0.981 in held-out sentences. <!--, at least matching the performance reported by @licht_detecting_2025 on a similar task and dataset.-->

[^fn:seqeval]: The span-level F1 is 0.831 when only considering exact span matches [as per the strict `seqeval` metric, cf. @licht_detecting_2025].
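
To make the difference between these metrics concrete, the sketch below computes a strict, exact-match span F1 and a sentence-level F1 on toy data. It is an illustration under assumptions, not the evaluation code used here: spans are represented as hypothetical `(start, end)` token offsets, and a sentence counts as positive at the sentence level if it contains at least one mention. The more lenient span-level metric reported in the main text is not reproduced.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def strict_span_f1(gold, pred):
    """Exact-match span F1: a predicted span counts only if both boundaries match."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        g, p = set(gold_spans), set(pred_spans)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    return f1(tp, fp, fn)


def sentence_f1(gold, pred):
    """Sentence-level F1: does the sentence contain at least one mention?"""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        has_gold, has_pred = bool(gold_spans), bool(pred_spans)
        tp += has_gold and has_pred
        fp += has_pred and not has_gold
        fn += has_gold and not has_pred
    return f1(tp, fp, fn)


# Three toy sentences; the second prediction misses one span boundary by a token.
gold_spans = [[(0, 2)], [(3, 5), (7, 8)], []]
pred_spans = [[(0, 2)], [(3, 4), (7, 8)], []]
strict = strict_span_f1(gold_spans, pred_spans)
sentence_level = sentence_f1(gold_spans, pred_spans)
```

A single boundary error lowers the strict span score while leaving the sentence-level score untouched, which is consistent with the ordering of the scores reported above.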

<!-- We applied the social group mention classifier described in the previous section to all sentences in our corpus to predict which social group mentions each sentence contain (if any). -->
<!-- We then sampled group mentions from sentences with at least one predicted social group mention for annotation of the attributes contained in these mentions. -->


### Attribute classification

To classify the attributes contained in the social group mentions identified in the previous step, we collected multilabel annotations for a sample of mentions extracted by our supervised mention detection classifier.

#### Annotation

We recruited two RAs to annotate the attributes contained in the social group mentions identified in the previous step.
Our coding instrument showed one mention at a time in its respective sentence context, first asked whether the group mention qualifies as a "universal" group reference, and, if not, prompted the annotator to select all vertical and horizontal attributes contained in the focal group mention.[^fn:coding_instrument]
While a mention's classification as "universal" ruled out economic and non-economic attribute classifications, non-universal mentions could be labeled with one or more of the available attribute categories, resulting in multi-label annotations [@erlich_multi-label_2022].

[^fn:coding_instrument]: Please refer to @sec-coding_instrument for a detailed description of our coding instrument.

<!-- `TODO: add a couple of meta sentences that explain the trade-off between intercoder reliability and choosing difficult examples, and the impact of strategic vs. random sampling on label class balance and how our three rounds of annotation reflect this; this will make it all appear less ad hoc; also, it mirrors what we did for group mention annotation.` -->
Based on prior empirical work on parties' group focus [e.g., @thau_how_2019; @huber_beyond_2021; @dolinsky_who_2025], we expected that some social attribute categories are much less prevalent in social group mentions than others.
We therefore again proceeded in several annotation rounds to hedge against the risk of label class imbalance in our attribute annotations and to diversify our sample of annotated examples.[^fn:label_class_imbalance]
We describe these steps in detail in the Supporting Materials and highlight the main points here.
Our first annotation rounds focused on sampling diverse examples, tackling difficult cases, and balancing attributes' prevalence in the annotated data.
Inter-coder agreement (ICA) was overall very high in these rounds and for most attribute categories, except in those rounds that focused on difficult examples and low-prevalence attribute categories, for which estimating ICA is difficult (see @tbl-ica_overall and @tbl-ica_cats).
To further improve the quality of our annotations, the author team arbitrated cases with disagreeing annotations in each round.
Further, they implemented a conceptually-driven annotation consolidation round and reviewed cases where an ensemble of classifiers tended to disagree with the then-current annotations.
This process resulted in consolidated multi-label attribute annotations for 600 mention-in-context examples.

[^fn:label_class_imbalance]:
    Label class imbalance is a common problem in social science classification tasks that can harm classifiers' predictive performance.
    Iteratively sampling texts can be an effective strategy to mitigate this harm. <!-- [cf. @licht_measuring_2025; @erhardt_, @miller_active_2020].-->


#### Automation

The collected annotations only covered a small fraction of the 209,351 (predicted) social group mentions in our corpus, however.
Classifying which attributes are contained in each of these mentions thus required automating this classification task.
This proved practically challenging.
The high per-example effort required for full multi-label annotation during human annotation stressed our fixed annotation budget, so we could annotate only 600 mention-in-context examples.
<!-- During annotation, it requires annotators to make multiple binary judgments for each example. -->
<!-- This raises the per-example annotation effort and thus creates a trade-off between full multilabel annotation and the total number of examples that can be annotated given a fixed annotation budget. -->
Yet, supervised machine learning for multi-label classification from few examples with many label classes and label class imbalance is a difficult problem [cf. @erlich_multi-label_2022].

<!-- In a next step, we used the attribute annotations to scale this measurement step to all of the 209,351 (predicted) mention-in-context examples in our corpus. -->
<!-- While conceptually appealing to capture attribute intersectionality in group mentions, automating multilabel classification . -->
<!-- This is mainly because multilabel annotation requires that annotators make multiple binary judgments for each example which raises the per-example effort and thus limits the total number of examples that can be annotated given a fixed annotation budget. -->

The current computational text analysis literature offers two approaches to address this problem: few-shot finetuning with "small" transformer encoder-only models and few-shot in-context learning [@brown_language_2020] with large decoder-only language models (LLMs).
We opted for the first option because it is more compute-efficient and has been shown to be effective [cf. @tunstall_efficient_2022; @laurer_less_2024; @burnham_political_2025], and leave the second option for future work.

Specifically, we opted for the few-shot sentence transformer finetuning (SetFit) approach proposed by @tunstall_efficient_2022.
In this framework, the labeled examples in the training set are first used to construct pairs for contrastive embedding model finetuning.
In our application, this effectively specializes a general-purpose sentence embedding model to represent social group mentions.
In the second step, the finetuned embeddings serve as input to a classification head and the embedding model plus the classification head are trained end-to-end for multilabel classification of the training examples.[^fn:SetFit_advantage]
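
The pair construction in the first step can be illustrated with a small sketch. It is a simplified stand-in rather than the SetFit library's implementation: it assumes that two multi-label examples form a similar (positive) pair when they share at least one attribute label and a dissimilar (negative) pair otherwise. The mention texts and label names are toy values chosen for illustration.

```python
from itertools import combinations


def build_contrastive_pairs(examples):
    """Return (text_a, text_b, target) triples; target is 1.0 for similar pairs."""
    pairs = []
    for (text_a, labels_a), (text_b, labels_b) in combinations(examples, 2):
        # Assumption: sharing at least one label makes a pair "similar".
        shares_label = bool(set(labels_a) & set(labels_b))
        pairs.append((text_a, text_b, 1.0 if shares_label else 0.0))
    return pairs


# Hypothetical labeled mentions (labels are illustrative, not our categories).
toy_examples = [
    ("low-income families", {"income", "family"}),
    ("high earners", {"income"}),
    ("young people", {"age"}),
]
pairs = build_contrastive_pairs(toy_examples)
```

Because the number of pairs grows quadratically with the number of labeled examples, even a few hundred annotations yield many training signals for the embedding model, which is part of what makes this strategy attractive in few-shot settings.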

[^fn:SetFit_advantage]:
    @tunstall_efficient_2022 demonstrate that this two-pronged finetuning strategy enables impressively effective few-shot learning.
    A further advantage of SetFit is that it is very efficient at prediction time, in contrast to natural language inference (NLI), which is commonly applied for few-shot supervised classification in political science [@laurer_less_2024; @burnham_political_2025]. <!--[^fn:versus_nli]-->
    <!-- [^fn:versus_nli]: NLI relies on sentence pair classification. Each label class is verabalized as a hypothesis. To predict the labels of an example using NLI, its text is combine with each of the hypotheses for binary classifiction (true/false). With 18 label classes (or 33 in Horne et al.), this creates a large computation overhead because 18 inferences are needed to fully label a given input example. Our approach, with two classifiers -- one for economic, one for non-economic attributes -- requires only two prediction passes per example, which results in an about 9-fold speed up (_ceteris paribus_). -->

In [None]:
splits_path = annotations_path / 'group_mention_categorization' / 'splits' / 'model_selection'

tmp = pd.concat({
    split_file.parent.name: pd.read_pickle(split_file)
    for split_file in splits_path.glob("**/train.pkl")
}).reset_index(level=0, names=['fold'])
tmp.rename(columns={'economic__education_level': 'economic__education'}, inplace=True)
train_set_prevalences = tmp[label_cols].mean().astype(float).rename(index=attribute_category_names_map)
train_set_prevalences = train_set_prevalences.to_frame('prevalence').reset_index(names=['category'])

In [None]:
# model = "nomic-ai--modernbert-embed-base"
model = "sentence-transformers--all-mpnet-base-v2"
tasks = [
    "economic_attributes_classification",
    "noneconomic_attributes_classification"
]

discard_these = ["samples avg"] # "macro avg", 
metrics = ['f1-score', 'precision', 'recall']

idx_map = {
    'micro avg': 'micro average', 
    'weighted avg': 'weighted average',
    'macro avg': 'macro average', 
    **attribute_category_names_map
}

eval_res = {}
eval_res_tabs = {}

for task in tasks:
#task = tasks[0]
    results_dir = results_path / 'classifiers' / task  / "hp_search" / "setfit" / model / 'mention_text'
    res = pd.concat({
        fp.parts[-5:-1]: pd.read_json(fp).T.reset_index(level=0, names="what") 
        for fp in results_dir.glob("**/eval_results.json")
    })
    res.reset_index(level=[0,1,2,3], names=["method", "model_name", "strategy", "fold"], inplace=True)
    res = res.query("what not in @discard_these").copy()
    res["what"] = res["what"].replace({'economic__education_level': 'economic__education'})
    res["category"] = res["what"].replace(idx_map)
    eval_res[task] = res

    res_tab = res.groupby("category", observed=True)[metrics].mean()
    res_tab.columns = res_tab.columns.str.title()
    res_tab = res_tab.reset_index().merge(train_set_prevalences, on="category", how="left").round(3)
    res_tab['category'] = pd.Categorical(res_tab['category'], categories=list(idx_map.values()), ordered=True)
    res_tab = res_tab.sort_values('category', ascending=True)
    res_tab['category'] = res_tab['category'].astype(str).replace({v: rf"\quad \textit{{{v}}}" for v in attribute_category_names_map.values()})

    eval_res_tabs[task] = res_tab

In [None]:
#| label: tbl-attribute_classifier_eval_res
#| output: true
#| tbl-cap: "Evaluation results of group mention attribute classifiers. Values report the mean precision, recall, and F1-score averaged across five folds for each attribute category, along with the prevalence of each category in the training set."

# eval_res_tab = pd.concat({
#     fr"\textbf{{{d.replace('_attributes_classification', '').replace('non', 'non-')} attributes}}": 
#     tab.replace(np.nan, '').set_index('category')
#     for d, tab in eval_res_tabs.items()
# })
# eval_res_tab.index.names = [None] * len(eval_res_tab.index.names)
tab = pd.concat([
    pd.DataFrame({"category": r"\textbf{economic attributes}"}, index=[0]),
    eval_res_tabs[tasks[0]],
    pd.DataFrame({"category": r"\textbf{non-economic attributes}"}, index=[0]),
    eval_res_tabs[tasks[1]],
])
tab.set_index('category', inplace=True)
tab = tab.replace(np.nan, '')
tab.index.name = None
latex_table(tab, escape=False, index=True, multicolumn_cmidrules=[(10, 0, 4)])

We adopted the SetFit approach to train two separate multi-label classifiers, one for economic attributes and one for non-economic attributes. <!--, using the same training examples for both classifiers.-->
Since our application has not been examined in prior work, we have thoroughly evaluated several implementation choices such as the base embedding model, input formatting strategy, and hyperparameter choices, which we detail in the Supporting Materials.
@tbl-attribute_classifier_eval_res summarizes the performance of these classifiers averaged across five held-out test folds.[^fn:attribute_classifier_eval_res]
The economic attribute classifier achieves a macro-averaged (micro) F1 of 0.8 (0.785); the non-economic attribute classifier a macro-averaged (micro) F1 of 0.804 (0.829).
And while classification reliability varies across attribute categories, it is overall very good, with most categories achieving an F1 of 0.75 or higher.
We consider these strong results given the low number of few-shot training examples and strong label class imbalance [see the prevalence column, @erlich_multi-label_2022].
In particular, the classifiers prove about as reliable as trained human annotators (see @tbl-ica_cats) in classifying economic attributes and even more reliable in classifying non-economic attributes.
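
For readers less familiar with the two averages, the following sketch shows how macro- and micro-averaged F1 differ for multi-label predictions under class imbalance. The label names and data are toy values; the reported scores were not computed with this code.

```python
def binary_counts(y_true, y_pred):
    """True positives, false positives, and false negatives for one label."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """y_true and y_pred map each label to a list of 0/1 indicators."""
    counts = {lab: binary_counts(y_true[lab], y_pred[lab]) for lab in y_true}
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts across labels, so frequent labels dominate.
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1_from_counts(*pooled)
    return macro, micro


toy_true = {"frequent": [1, 1, 1, 1, 0, 0], "rare": [1, 0, 0, 0, 0, 0]}
toy_pred = {"frequent": [1, 1, 1, 1, 0, 0], "rare": [0, 1, 0, 0, 0, 0]}
macro, micro = macro_micro_f1(toy_true, toy_pred)
```

In this toy case the rare label is always misclassified, which halves the macro average but lowers the micro average only slightly. Reporting both therefore guards against good performance on frequent categories masking poor performance on rare ones.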

[^fn:attribute_classifier_eval_res]: These are the results of models trained on the folds' respective training examples using "optimal" hyperparameters identified through stochastic hyperparameter grid search. Specifically, at the end of each hyperparameter sweep on a fold's training examples, we selected the hyperparameters that yielded the best macro-averaged F1 on the fold's validation examples, finetuned the model with these hyperparameters, and evaluated it on the corresponding fold's test examples.

# APPENDIX

# Dataset

In [None]:
#| label: fig-cases_overview
#| cache: true
#| output: true
#| fig-cap: "Overview of cases included in our corpus. Each square represents a party in a given year, colored by its party family."

tmp = df[['country_iso3c', 'year', 'party_name', 'party_family']].drop_duplicates()
tmp.party_family = tmp.party_family.map(family_map)

countries = tmp['country_iso3c'].unique().tolist()

heights = tmp.groupby('country_iso3c').aggregate({'party_name': 'nunique'})['party_name']

n_col = 1
fig, axes = plt.subplots(
    len(countries)//n_col, n_col, 
    figsize=(5, heights.sum()/4), 
    height_ratios=heights.to_list(),
    sharex=True, 
    gridspec_kw={'hspace': 0.4}
    )

axes = axes.flatten()

scatter_kwargs = dict(
    x='year', 
    y='party_name', 
    hue='party_family', 
    marker='s', 
    palette=all_fam_palette,
    s=100, 
    legend=False
)
for i, (ctr, subdf) in enumerate(tmp.groupby("country_iso3c")):
    ax = axes[i]
    sns.scatterplot(data=subdf, ax=ax, **scatter_kwargs)
    # ax.set_ylim(1.15, -0.15)
    ax.set_ylabel(None)
    # add country label as y-axis label (second axis) on right hand side
    ax.xaxis.grid(False)
    for spine in ax.spines.values():
        spine.set_edgecolor(None)
        spine.set_visible(False)
    ax_right = ax.twinx()
    ax_right.set_ylabel(ctr, fontweight='bold', rotation=0, labelpad=15)
    ax_right.set_yticks([])
    # ax_right.spines['right'].set_visible(False)
    # increase y-limits so that squares are fully visible
    ax.set_ylim(-0.5, len(subdf['party_name'].unique()) - 0.5)
    # make plot background light gray
    ax.set_facecolor('#f0f0f0')

In [None]:
len(econ_attrs), len(nonecon_attrs)

\clearpage

# Attribute classification

## Coding instrument {#sec-coding_instrument}

Our coding instructions introduce the vertical/horizontal distinction, each of the attribute categories, as well as our "universal" group mention category, explain the available coding choices, and give examples.
Our coding instrument, implemented as a Qualtrics survey, presents annotators independently with one social group mention at a time (i.e., per page) in its respective sentence context, marking the mention in bold.

Below the display of the sentence, the annotator is first asked to indicate whether the highlighted mention is a "universal" mention per our definition.
If the annotator chose "Yes" for this coding dimension, our coding instructions asked them to proceed with the next instance on the next page.

If not, the annotator proceeds with coding the attribute categories for the vertical and horizontal attribute dimensions in turn.
For each of the dimensions, we tasked the annotator to indicate which of the respective attribute categories was contained in the highlighted social group mention, displaying the attribute categories below each other[^1] in a multiple-choice grid with the answer options "Yes" or "Unsure" horizontally aligned.[^2]
This procedure results in 17 annotations[^3] per mention-in-sentence-context instance and annotator.
<!-- TODO: consider how to handle later omission of _class membership_ and _ecology of group_ categories !? -->

[^1]: We kept the categories' order fixed across examples to ease the cognitive load at annotation time.
[^2]: We omitted options for "No" because not choosing "Yes" or "Unsure" for a given attribute category implied this coding choice.
[^3]: 1$\times$ universal + 5$\times$ vertical attributes + 11$\times$ horizontal attributes


## Annotation

In a first round of annotations, we sampled 300 mention-in-sentence-context examples from all sentences with at least one predicted social group mention, again using an informativeness-based sampling strategy.
@tbl-ica_overall reports the micro inter-coder agreement (ICA) by attribute dimension from this round and indicates that our coders produced overall reliable annotations.
To resolve examples with disagreeing annotations, the author team reviewed any mention-in-sentence-context example with at least one disagreeing annotation to determine its final labels.<!--[^5]-->
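
As a reference for how the agreement estimates can be read, the sketch below computes Krippendorff's $\alpha$ for the simplest case that matches our setup: two coders, nominal (here binary) labels, and no missing values. It is an illustrative implementation with toy data; the estimates reported in the tables may have been computed with a different implementation.

```python
from collections import Counter


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing values."""
    n_units = len(coder_a)
    n_values = 2 * n_units  # every unit contributes two coded values
    # Observed disagreement: share of units on which the coders differ.
    disagreements = sum(a != b for a, b in zip(coder_a, coder_b))
    observed = disagreements / n_units
    # Expected disagreement: chance that two values drawn without replacement
    # from the pooled codes differ.
    totals = Counter(coder_a) + Counter(coder_b)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n_values * (n_values - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1 - observed / expected


# Toy binary codes for ten units.
toy_coder_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
toy_coder_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal(toy_coder_a, toy_coder_b)
```

The expected disagreement shrinks as a category becomes rarer, so a handful of disagreements on a low-prevalence category can pull $\alpha$ down sharply. This is why estimates for rare attribute categories should be interpreted with caution.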


<!-- [^5]: This lead to a few revisions of attribute classifications the annotators agreed on in annotations where RAs unanimous coding disagreed with experts' judgments.
    We resolved them manually and updated the final annotations accordingly. -->

<!-- TODO: mention when we introduced the shared values/mentalities class -->

However, as expected, the prevalence of attribute categories in the labels collected in this first round turned out to be very imbalanced and we observed variation in ICA estimates across categories that could only partially be explained by low prevalence (see @tbl-ica_cats).
We therefore dedicated a second annotation round to collecting more annotations for difficult examples.
To this end, we fine-tuned a multilabel classifier to predict mention-in-sentence-context examples' binary labels on the universal, vertical, and horizontal indicators based on the consolidated annotations collected in the first round.<!--[^6]-->
We applied this classifier to the mention-in-sentence-context instances not yet distributed for attribute annotation to obtain predicted probabilities.
We then computed classification uncertainty at the example level<!--[^7]--> and selected the 150 most uncertain examples into a second annotation batch.
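
A minimal sketch of this selection step, assuming that an example's uncertainty is measured as the smallest distance between any of its three predicted probabilities and the 0.5 classification threshold. The function names and toy probabilities are hypothetical.

```python
def example_uncertainty(probs, threshold=0.5):
    """Smallest distance between any indicator's probability and the threshold.

    Smaller values mean that at least one indicator is close to a coin flip.
    """
    return min(abs(p - threshold) for p in probs)


def most_uncertain(predictions, k):
    """Return the ids of the k examples closest to the threshold."""
    ranked = sorted(
        predictions,
        key=lambda ex_id: example_uncertainty(predictions[ex_id]),
    )
    return ranked[:k]


# Toy predicted probabilities for (universal, vertical, horizontal).
toy_predictions = {
    "m1": (0.02, 0.97, 0.10),
    "m2": (0.48, 0.90, 0.05),
    "m3": (0.10, 0.35, 0.80),
    "m4": (0.01, 0.03, 0.99),
}
batch = most_uncertain(toy_predictions, k=2)
```

With `k=150`, the same ranking would yield a batch of the size used in the second round.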

<!-- [^6]: See section XX for the details of our few-shot fine-tuning strategy.
    The resulting model showed already strong classification performance for this high-level task considering how few examples we distributed for annotation in the first round.
-->

<!-- [^7]: Computing classification uncertainty as minimal closeness to one of the three 0.5 classification thresholds. -->

@tbl-ica_overall and @tbl-ica_cats show that, due to our focus on difficult examples, ICA was generally lower than in the first round.
We therefore used zero-shot in-context learning [@brown_language_2020] to generate large language model (LLM) annotations for the examples in the second-round batch.<!--[^8]-->
We then presented our annotators with the instances where the LLM disagreed with their judgment and tasked them to (independently) judge which annotation they viewed as more valid, while blinding them to the source of the respective annotations.
The author team arbitrated the cases in which our annotators' independent judgments disagreed.
All instances from this second round were then consolidated into final labels and added to those of the first round.
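
The blinded comparison can be sketched as follows. This is a hypothetical illustration of the logic, not the tooling used in this round: for each example on which a human and the LLM annotation differ, the two label sets are shown in random order, and the mapping from option to source is stored separately from what the annotator sees.

```python
import random


def build_blinded_items(human, llm, seed=0):
    """Create review items for disagreements, hiding each annotation's source."""
    rng = random.Random(seed)
    items = []
    for ex_id in sorted(human):
        if human[ex_id] == llm[ex_id]:
            continue  # agreement: nothing to review
        options = [("human", human[ex_id]), ("llm", llm[ex_id])]
        rng.shuffle(options)  # randomize which source appears as option A
        items.append({
            "example_id": ex_id,
            "option_a": options[0][1],
            "option_b": options[1][1],
            # Kept out of the annotator's view; used only to decode choices.
            "key": {"option_a": options[0][0], "option_b": options[1][0]},
        })
    return items


# Toy annotations: label sets per example (label names are illustrative).
toy_human = {"m1": ("income",), "m2": ("age",)}
toy_llm = {"m1": ("income",), "m2": ("age", "income")}
items = build_blinded_items(toy_human, toy_llm)
```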

In [None]:
## report reliability by round

annotation_rounds = {
    '1': 'social-group-mention-categorization-coder-training',
    '2': 'social-group-mention-categorization-round02',
    '3': 'social-group-mention-categorization-round03',
}

folder = annotations_path / 'group_mention_categorization'
ica_raw = pd.concat({
    r: pd.read_pickle(folder / f / 'parsed' / 'ica_estimates.pkl') 
    for r, f in annotation_rounds.items()
})
ica_raw.reset_index(level=0, names=['round'], inplace=True)
ica_raw.loc[ica_raw.label.isna(), 'label'] = ica_raw.loc[ica_raw.label.isna(), 'q_category']
ica_raw.loc[ica_raw.q_id=='universal_attributes', 'label'] = 'overall'

In [None]:
ica = ica_raw.loc[ica_raw.q_id.str.endswith('_attributes'), ['round', 'q_id', 'label', 'prop_yes', 'krippendorff_alpha']]
ica = ica[~ica.label.isna()]

cats_map = {i.split('__')[1]: nm for i, nm in attribute_category_names_map.items()}
ica['label'] = ica.label.replace({"education level": "education"})
ica['category'] = ica.label.replace(cats_map)

In [None]:
#| label: tbl-ica_overall
#| output: true
#| tbl-cap: "Inter-coder agreement estimates for attribute classifications computed at the level of attribute dimensions: universal, economic, and non-economic. Estimates are based on Krippendorff's α and the prevalence of 'yes' annotations (prevalence) across all annotated examples in each annotation round."
ica_overall = ica.loc[ica.label == 'overall', ['round', 'q_id', 'prop_yes', 'krippendorff_alpha']]
ica_overall = ica_overall.rename(columns={'prop_yes': 'prevalence', 'krippendorff_alpha': r"Krippendorff's $\alpha$"})
ica_overall['q_id'] = ica_overall.q_id.str.removesuffix('_attributes')
ica_overall['q_id'] = pd.Categorical(ica_overall['q_id'], categories=['universal', 'economic', 'non-economic'], ordered=True)
ica_overall = ica_overall.pivot_table(index='q_id', columns=['round'], observed=True).round(3)
ica_overall.index.name = None
ica_overall.columns.names = [None, 'annotation round']
latex_table(ica_overall, index=True)

In [None]:
#| label: tbl-ica_cats
#| output: true
#| tbl-cap: "Inter-coder agreement estimates for attribute classifications computed at the level of economic and non-economic attribute categories. Estimates are based on Krippendorff's α and values in parentheses indicate the prevalence of 'yes' annotations (prevalence) across all annotated examples in each annotation round."
ica_cats = ica.loc[ica.label != 'overall', ['round', 'q_id', 'category', 'prop_yes', 'krippendorff_alpha']]
ica_cats['category'] = pd.Categorical(ica_cats['category'], categories=list(cats_map.values()), ordered=True)
ica_cats['value'] = ica_cats.apply(lambda r: rf"{r['krippendorff_alpha']:+0.3f} ({r['prop_yes']*100:0.1f}\%)", axis=1)
ica_cats['q_id'] = ica_cats.q_id.str.removesuffix('_attributes')
ica_cats = ica_cats.pivot_table(index=['q_id', 'category'], columns=['round'], values=['value'], aggfunc='first', observed=False).round(3).fillna('').sort_index()
ica_cats.columns = pd.Index(ica_cats.columns.get_level_values(1), name='annotation round')
ica_cats.index.names = ['dimension', 'category']
latex_table(ica_cats, index=True, resize=True)

<!-- [^8]: See SM XX for the prompts, model, and hyper-parameters we used. -->
In a third round of annotation, we then addressed the problem of label class imbalance that clearly showed in the pooled set of multi-labeled mention-in-sentence-context instances from rounds one and two.
In particular, we focused on over-sampling likely examples of so far under-represented attribute categories in the vertical and horizontal attribute dimensions.
To identify likely examples for the given label categories, we used the attribute category definitions to define queries and used a pre-trained sentence embedding model to rank so far unannotated mention-in-sentence-context instances based on their cosine similarities to each of these queries.
To oversample likely examples of so far underrepresented categories, we defined quotas for each attribute category in inverse proportion to the categories' prevalence in the labeled instances from rounds one and two, using an annotation budget of 200 examples for round 3.
We then chose as many of the so far unannotated mention-in-sentence-context instances as the quota prescribed, in descending order of their embeddings' similarities to the embedding of the respective attribute category definition query.
In total, this resulted in a sample of 180 mention-in-sentence-context examples for annotation in round 3.[^10]

[^10]: Deviations from the annotation budget of 200 cases are explained by rounding in the computation of quotas and some mention-in-sentence-context instances that were ranked high for multiple attribute queries.
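
The quota-based selection can be sketched as follows. The sketch assumes that quotas are obtained by rounding each category's inverse-prevalence share of the budget and that candidates are ranked by cosine similarity to a category query embedding. The two-dimensional vectors, category names, and function names are toy values for illustration only.

```python
import math


def inverse_prevalence_quotas(prevalence, budget):
    """Allocate the budget in inverse proportion to each category's prevalence."""
    weights = {cat: 1.0 / p for cat, p in prevalence.items()}
    total = sum(weights.values())
    return {cat: round(budget * w / total) for cat, w in weights.items()}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_by_quota(query_vecs, candidate_vecs, quotas):
    """Take each category's top-ranked candidates; duplicates collapse into one."""
    selected = set()
    for cat, quota in quotas.items():
        ranked = sorted(
            candidate_vecs,
            key=lambda cid: cosine(query_vecs[cat], candidate_vecs[cid]),
            reverse=True,
        )
        selected.update(ranked[:quota])
    return selected


toy_prevalence = {"rare": 0.1, "common": 0.2}
toy_queries = {"rare": (1.0, 0.0), "common": (1.0, 1.0)}
toy_candidates = {"c1": (1.0, 0.9), "c2": (1.0, 0.1), "c3": (0.0, 1.0)}

quotas = inverse_prevalence_quotas(toy_prevalence, budget=3)
selected = select_by_quota(toy_queries, toy_candidates, quotas)
```

In the toy example, candidate `c1` ranks high for both queries, so only two distinct examples are selected from a budget of three. The same mechanism, together with rounding, can leave the final sample below the nominal budget.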

@tbl-ica_overall shows that the attribute annotations of examples in this third round were overall highly reliable, which is likely explained by focusing on identifying *likely* examples of each attribute category in this final round (in contrast to difficult instances in round 2).
Further, the consolidated labels from round 3 showed that our over-sampling strategy was effective as it helped to reduce the label class imbalance across the set of vertical and horizontal attribute categories.

A last round of annotation focused on conceptual consolidation and was performed by the authors.
Specifically, we manually reviewed any annotated examples that were labeled as combining certain attribute categories to determine whether the combination of these categories was conceptually valid and consistent with our attribute definitions.

<!-- 
TODOs


- describe concrete review startegies 
    1. concept-driven review
    2. word-wise review

- mention classifier ensemble-based review !!!

note: see round5/ in code/group_mention_categorization/ 

-->


## SetFit finetuning

We have first carefully split the 600 annotated examples into five training, validation and test folds to minimize leakage between the training and evaluation sets.
This is generally a crucial step in any supervised machine learning application, but it is particularly important in our application for several reasons.
First, simple random sampling does not account for the facts that 
(a) mentions are embedded in sentences so that the same mention can appear in multiple sentences and 
(b) some mentions are near duplicates of each other and having these in different splits can lead to overestimation of predictive performance.
Second, in the context of few-shot learning, it is particularly important to minimize leakage between the training and evaluation sets because the small number of training examples makes it more likely that a given example in the evaluation set is very similar to one in the training set, which can lead to overestimation of predictive performance.
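
One simple way to guard against the first source of leakage is to assign all examples that share the same mention text to the same fold. The sketch below illustrates this idea under assumptions: mentions are grouped by a crude normalization (lowercasing and whitespace collapsing), and groups are placed greedily into the currently smallest fold. It does not handle near duplicates that differ in wording, which would require an additional similarity criterion, and it is not necessarily the procedure used to create our splits.

```python
from collections import defaultdict


def normalize(mention):
    """Crude normalization so that trivial variants of a mention share one key."""
    return " ".join(mention.lower().split())


def grouped_folds(examples, n_folds):
    """Assign whole groups of identical (normalized) mentions to a single fold."""
    groups = defaultdict(list)
    for ex_id, mention in examples:
        groups[normalize(mention)].append(ex_id)
    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, always into the currently smallest fold.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[key])
    return folds


toy_mentions = [
    (1, "young people"), (2, "Young  people"), (3, "pensioners"),
    (4, "small businesses"), (5, "pensioners"), (6, "farmers"),
]
folds = grouped_folds(toy_mentions, n_folds=2)
```

Both occurrences of "pensioners" and both spellings of "young people" end up in the same fold, so no mention text is shared between the sets used for training and evaluation.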

We have then used the training and validation splits to examine the average performance of different base embedding models[^fn:embedding_model_selection] and input formatting strategies[^fn:inpiut_formatting_strategies] and to select the best-performing model and formatting strategy for each classifier.

Finally, we have trained the two classifiers on the full training set using the selected embedding model and input formatting strategy and evaluated their performance on the held-out test set.



In [None]:
fp = annotations_path / 'group_mention_categorization' / 'final_annotations.tsv'
annotations = pd.read_csv(fp, sep='\t')
annotations.mention_id.nunique()

In [None]:
# annotations[annotations.mention.str.contains("lite")].pivot_table(index=['mention_id', 'mention'], columns='attribute_combination', values='label', aggfunc='first').T