# Single-label classification annotation aggregation with the Dawid-Skene model

source: https://michaelpjcamilleri.wordpress.com/2020/06/22/reaching-a-consensus-in-crowdsourced-data-using-the-dawid-skene-model/

<center>

![Plate diagram of the Dawid-Skene per-annotator model for annotation aggregation](https://tlk.s3.yandex.net/crowd-kit/docs/ds_llm.png){width="30%"} <!--source: https://crowd-kit.readthedocs.io/en/latest/classification/#crowdkit.aggregation.classification.DawidSkene-->

</center>

For other classification annotation aggregation algorithms, see the [`crowdkit` package](https://crowd-kit.readthedocs.io/en/latest/classification/#crowdkit.aggregation.classification).

## Setup

#### Colab

In [1]:
# check if on colab
COLAB = True
try:
    import google.colab
except ImportError:
    COLAB = False

if COLAB:
    # shallow clone of current state of main branch 
    !git clone --branch main --single-branch --depth 1 --filter=blob:none https://github.com/haukelicht/advanced_text_analysis.git
    # make repo root findable for python
    import sys
    sys.path.append("/content/advanced_text_analysis/")
    
    # install required packages
    !pip install krippendorff==0.8.1

#### Required libraries

In [2]:
from pathlib import Path
import pandas as pd
from src.annotation.dawidskene import DawidSkeneModel
from scipy.stats import entropy

#### Data paths 

In [3]:
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")
data_path = base_path / "data" / "labeled" / "fornaciari_we_2021"

### Read the sentence-level classification annotations

In [4]:
annotations_path = data_path / "annotations" / "classification" / "llms"

# list all annotation files
#  (each records the annotations of one annotator, here an LLM)
fps = list(annotations_path.glob('*.csv'))

# read the annotations into a long-format DataFrame
annotations = pd.concat({fp.stem: pd.read_csv(fp) for fp in fps}, ignore_index=False).reset_index(level=0, names=['annotator'])

# list unique annotators
annotations.annotator.unique().tolist()

['gpt-oss-120b',
 'DeepSeek-V3-0324',
 'Llama-4-Maverick-17B-128E-Instruct',
 'Qwen3-235B-A22B-Instruct-2507']

## Fit the model

In [5]:
classes = annotations.label.unique().tolist()
n_classes = len(classes)
model = DawidSkeneModel(n_classes, max_iter=500, tolerance=10e-100)

In [6]:
posterior_labels = model.fit_transform(annotations, items_col='text_id', annotators_col='annotator', annotations_col='label')
posterior_labels.reset_index(names='text_id', inplace=True)

0	-116.94591332709433
50	-72.86266319359012
100	-72.86266319359012
150	-72.86266319359012
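
For intuition about what the fitting step does, here is a minimal NumPy sketch of the expectation-maximization (EM) procedure commonly used for the Dawid-Skene model. It is not the code in `src.annotation.dawidskene`, whose implementation is not shown in this notebook. The function name `dawid_skene_em`, the toy `labels` array, and the smoothing constants are assumptions made purely for illustration.

```python
import numpy as np

def dawid_skene_em(labels, n_classes, max_iter=100, tol=1e-6):
    """Illustrative EM for Dawid-Skene.

    labels: int array of shape (n_items, n_annotators); -1 marks a missing annotation.
    """
    n_items, n_annotators = labels.shape
    # one-hot encode annotations: (items, annotators, classes); missing entries stay all-zero
    onehot = np.zeros((n_items, n_annotators, n_classes))
    for c in range(n_classes):
        onehot[:, :, c] = labels == c

    # initialise the item posteriors with per-item vote shares (a soft majority vote)
    post = onehot.sum(axis=1) + 1e-9
    post /= post.sum(axis=1, keepdims=True)

    prev_ll = -np.inf
    for _ in range(max_iter):
        # M-step: class priors and per-annotator reliability matrices
        prior = post.mean(axis=0) + 1e-12
        prior /= prior.sum()
        # theta[j, k, l] = P(annotator j assigns label l | true class k)
        theta = np.einsum("ik,ijl->jkl", post, onehot) + 1e-9
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: log joint probability of each candidate class with the observed labels
        log_joint = np.log(prior) + np.einsum("ijl,jkl->ik", onehot, np.log(theta))
        m = log_joint.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
        post = np.exp(log_joint - log_norm)

        # stop once the log-likelihood no longer improves
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return post, theta, prior

# Toy data (hypothetical): annotators 0 and 1 always agree, annotator 2 deviates twice
labels = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
post, theta, prior = dawid_skene_em(labels, n_classes=2)
print(post.argmax(axis=1).tolist())
print(theta.round(3))
```

In this toy run, the annotator who deviates from the other two should end up with smaller diagonal entries in its reliability matrix, which is why its votes count for less in the posterior.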


### Estimated annotator "abilities"

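In the Dawid-Skene per-annotator model, each annotator is parameterized by an `n_classes` &times; `n_classes` "reliability" matrix $\theta$ (read "theta"), whose rows correspond to an item's 'true' class and whose columns correspond to the label the annotator assigns.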
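
Written out, an entry of this matrix is a conditional probability. The symbols $z_i$ (the latent 'true' class of item $i$) and $y_{ij}$ (the label annotator $j$ assigns to item $i$) are notation assumed here for illustration; they do not appear in the code:

$$
\theta_{j,k,l} = P(y_{ij} = l \mid z_i = k), \qquad \sum_{l} \theta_{j,k,l} = 1 \quad \text{for every annotator } j \text{ and class } k .
$$

Each row of an annotator's matrix is therefore a probability distribution over the labels that annotator may assign to an item of class $k$.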


In our application, we have two label classes that are indexed as follows:

In [7]:
dict(enumerate(model.classes_))

{0: 'No Pledge', 1: 'Pledge'}

Given that we have four annotators, the ability parameters have the following shape:

In [8]:
model.fitted_.theta.shape

(4, 2, 2)

Let's look at the first annotator's estimated reliability parameters.

In [9]:
print('annotator:', model.annotators_[0])
print(*model.fitted_.theta[0].round(3).tolist(), sep='\n')

annotator: gpt-oss-120b
[0.969, 0.031]
[0.194, 0.806]


The diagonal indicates an annotator's estimated ability to correctly label an instance of the given class:

- for the "No Pledge" class, the probability that the annotator labels an instance as "No Pledge" given that its 'true' (estimated) label is "No Pledge" is 0.969 (element (0, 0))
- for the "Pledge" class, the probability that the annotator labels an instance as "Pledge" given that its 'true' (estimated) label is "Pledge" is 0.806 (element (1, 1))

The off-diagonal entries are the corresponding error rates. For example, this annotator labels a 'true' "Pledge" instance as "No Pledge" with probability 0.194 (element (1, 0)).
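
As a small sketch of how to read such a matrix programmatically, the snippet below hard-codes the values printed above for `gpt-oss-120b` (the variable names are illustrative, not part of the model's API) and extracts the per-class accuracies and error rates:

```python
import numpy as np

# Values copied from the printed reliability matrix of gpt-oss-120b above
theta_0 = np.array([[0.969, 0.031],
                    [0.194, 0.806]])
class_names = ["No Pledge", "Pledge"]

# Each row is a probability distribution over the labels the annotator may assign
assert np.allclose(theta_0.sum(axis=1), 1.0)

# Diagonal: P(assigned label == true class | true class)
per_class_accuracy = dict(zip(class_names, np.diag(theta_0).tolist()))

# Off-diagonal: error rates
p_missed_pledge = float(theta_0[1, 0])  # true "Pledge" labeled as "No Pledge"
p_false_pledge = float(theta_0[0, 1])   # true "No Pledge" labeled as "Pledge"

print(per_class_accuracy, p_missed_pledge, p_false_pledge)
```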

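A summary of these diagonal entries gives the annotator's overall "ability". Note that the reported values are not the simple mean of the diagonal: for `gpt-oss-120b`, (0.969 + 0.806) / 2 ≈ 0.888, whereas the reported reliability is 0.916. This suggests that the per-class accuracies are weighted, presumably by the estimated class prevalence: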

In [10]:
pd.Series(model.fitted_.worker_reliabilities).sort_values(ascending=False)

Qwen3-235B-A22B-Instruct-2507         0.958662
Llama-4-Maverick-17B-128E-Instruct    0.918119
gpt-oss-120b                          0.916399
DeepSeek-V3-0324                      0.898662
dtype: float64

### Estimated label class probabilities

The Dawid-Skene model estimates the probability that an item belongs to each class based on the annotators' annotations of it and accounting for their estimated reliabilities.
Let's look at these estimates for a sample of items:

In [11]:
posterior_labels[classes].sample(5, random_state=42).round(3)

    No Pledge  Pledge
13      0.000   1.000
39      0.946   0.054
30      0.999   0.001
45      0.000   1.000
17      0.999   0.001


As you can see, the posterior (i.e., estimated) label class probabilities vary between items.
For example,

- The first item (index 13) is estimated to belong to the "Pledge" class with a probability of ~1.0.
- The second item (index 39) is estimated to belong to the "No Pledge" class with a probability of ~0.946.
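
For reference, posterior probabilities of this kind follow from Bayes' rule under the Dawid-Skene assumption that annotators label independently given an item's true class. The notation is assumed here for illustration: $\pi_k$ is the prior probability of class $k$, $\theta_{j,k,l}$ is the probability that annotator $j$ assigns label $l$ to an item of true class $k$, $y_{ij}$ is the label annotator $j$ assigned to item $i$, and $z_i$ is the item's latent true class:

$$
P(z_i = k \mid y_{i1}, \dots, y_{iJ}) = \frac{\pi_k \prod_{j=1}^{J} \theta_{j,k,y_{ij}}}{\sum_{k'} \pi_{k'} \prod_{j=1}^{J} \theta_{j,k',y_{ij}}}
$$

An item's posterior is close to 0 or 1 when reliable annotators agree on its label, and moves toward intermediate values when they disagree.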

We can quantify the level of uncertainty with the entropy measure:

In [12]:
posterior_labels["label_uncertainty"] = entropy(posterior_labels[classes].values, axis=1)

In [13]:
posterior_labels["label_uncertainty"].round(3).value_counts().sort_index()

label_uncertainty
0.000    15
0.005    31
0.209     2
0.238     1
0.368     1
Name: count, dtype: int64

This shows that there are only four items for which the model has notable posterior label uncertainty (entropy above 0.2; for two classes, the maximum is ln 2 ≈ 0.693).
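
To make the scale of these values concrete, here is a plain-Python sketch of the Shannon entropy in nats, which is what `scipy.stats.entropy` computes by default (natural logarithm). The helper name `shannon_entropy` is illustrative:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute 0 by convention."""
    return sum(-p * math.log(p) for p in probs if p > 0)

certain = shannon_entropy([1.0, 0.0])      # no uncertainty at all
leaning = shannon_entropy([0.946, 0.054])  # rounded probabilities of the second sampled item
coin_flip = shannon_entropy([0.5, 0.5])    # maximal uncertainty for two classes (ln 2)

print(round(certain, 3), round(leaning, 3), round(coin_flip, 3))
```

Because the input probabilities for the second item are rounded to three decimals here, the result (about 0.21) can differ slightly in the last digit from the value computed on the unrounded posteriors.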

## Evaluation against the "true" labels

For the example used in this notebook, we have "true" sentence-level labels that were assigned by the annotators of the original study.

We can use them to assess how well our model-induced posterior label estimates align with these annotations.

In [14]:
fp = data_path / "annotation_set_01.csv"
df = pd.read_csv(fp)

id2label = {0: 'No Pledge', 1: 'Pledge'}
df['label'] = df['label'].map(id2label)

In [15]:
tmp = df[['text_id', 'label']].merge(posterior_labels, on='text_id')

In [16]:
tmp.value_counts(['label', 'posterior_label']).unstack().fillna(0).astype(int)

posterior_label  No Pledge  Pledge
label
No Pledge               25       0
Pledge                   9      16

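The posterior labels agree with the reference labels for all 25 "No Pledge" sentences, but for only 16 of the 25 "Pledge" sentences. Let's quantify this alignment: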

In [17]:
from sklearn.metrics import classification_report
print(classification_report(tmp['label'], tmp['posterior_label'], zero_division=0))

              precision    recall  f1-score   support

   No Pledge       0.74      1.00      0.85        25
      Pledge       1.00      0.64      0.78        25

    accuracy                           0.82        50
   macro avg       0.87      0.82      0.81        50
weighted avg       0.87      0.82      0.81        50
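
The figures in this report can be reproduced by hand from the cross-tabulation of reference and posterior labels. The following plain-Python sketch does so, with the counts copied from that table; the helper `prf` and the variable names are illustrative:

```python
# Counts copied from the cross-tabulation above
# keys: (reference label, posterior label)
confusion = {
    ("No Pledge", "No Pledge"): 25, ("No Pledge", "Pledge"): 0,
    ("Pledge", "No Pledge"): 9, ("Pledge", "Pledge"): 16,
}
class_names = ["No Pledge", "Pledge"]

def prf(cls):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    tp = confusion[(cls, cls)]
    predicted = sum(confusion[(ref, cls)] for ref in class_names)  # column total
    actual = sum(confusion[(cls, pred)] for pred in class_names)   # row total
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {cls: tuple(round(v, 2) for v in prf(cls)) for cls in class_names}
accuracy = sum(confusion[(c, c)] for c in class_names) / sum(confusion.values())

print(metrics)
print(round(accuracy, 2))
```

The lower recall for "Pledge" (16 / 25 = 0.64) reflects the nine reference "Pledge" sentences that the aggregated labels place in "No Pledge", which in turn lowers the precision for "No Pledge" (25 / 34 ≈ 0.74).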

