In [1]:
import random
from collections import defaultdict

# Measuring Annotator Agreement

This notebook is a companion to a [blog post](https://pwsiegel.github.io/ds/iaa/) that I wrote about the problem of measuring inter-annotator agreement in machine learning experiments.

In short, it is standard practice to use human annotators to produce gold standard training or validation data for machine learning algorithms, but one runs into trouble in annotation experiments where humans disagree frequently.
High levels of disagreement often indicate that the annotation task is vague or subjective, which in turn suggests that the function being measured is not sufficiently well-defined for a statistical model to be successful.

So when conducting annotation experiments it is important to measure the rate of agreement in the results.
But it is not trivial to design a metric that can be compared across experiments with different numbers of annotators and different numbers of categories, and the standard approach (the so-called $\kappa$ statistics) is unsatisfactory.

In my post I argued that it is better to compute agreement scores for each category in the experiment separately and then aggregate those scores if necessary.
I proposed an algorithm for carrying out this computation, and the purpose of this repository is to implement that algorithm.

## Synthetic Annotator Data

To begin, let us simulate some annotation experiments.
The first will have high agreement across the board, the second will have low agreement across the board, and the third will have low agreement in one category but high agreement in the others.
In each experiment three annotators $A$, $B$, and $C$ will apply category labels $0$, $1$, and $2$ to 1000 data points.

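- In the first experiment all three annotators will apply the label $i \bmod 3$ to the data point indexed $i$ with probability $.95$, and each of the other two labels with probability $.025$.
- In the second experiment annotators $A$ and $B$ will behave as in the first, while annotator $C$ will apply the label $(i+1) \bmod 3$ to the data point indexed $i$ with probability $.95$.
- In the third experiment annotators $A$ and $B$ will again behave as in the first; annotator $C$ will do the same whenever $A$ applies the label $0$, and will otherwise always apply the label $1$.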

In [2]:
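def model(index, value):
    # `value` is a uniform draw from [0, 1): return the intended label
    # index % 3 with probability .95, else one of the other two labels.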
    if value <= .95:
        return index % 3
    elif value <= .975:
        return (index + 1) % 3
    else:
        return (index + 2) % 3

In [3]:
high = [
    [model(i, random.random()), model(i, random.random()), model(i, random.random())]
    for i in range(1000)
]

In [4]:
low = [
    [model(i, random.random()), model(i, random.random()), model(i+1, random.random())]
    for i in range(1000)
]

In [5]:
mixed = []

for i in range(1000):
    A = model(i, random.random())
    B = model(i, random.random())
    # C follows the same noise model when A chose 0; otherwise C always answers 1.
    C = model(i, random.random()) if A == 0 else 1
    mixed.append([A, B, C])
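
As a rough check that the simulated data behave as described, the sketch below regenerates the three experiments and reports the fraction of data points on which all three annotators chose the same label. The `simulate` wrapper, the `unanimous_rate` helper, and the fixed seed are assumptions introduced here for illustration; the cells above do not seed the generator, so their results vary from run to run. `model` is repeated so that the snippet runs on its own.

```python
import random

def model(index, value):
    # Same noise model as above: the intended label with probability .95,
    # each of the other two labels with probability .025.
    if value <= .95:
        return index % 3
    elif value <= .975:
        return (index + 1) % 3
    else:
        return (index + 2) % 3

def simulate(kind, n=1000, seed=0):
    # Hypothetical wrapper: rebuilds one experiment with a private, seeded RNG.
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        a = model(i, rng.random())
        b = model(i, rng.random())
        if kind == "high":
            c = model(i, rng.random())
        elif kind == "low":
            c = model(i + 1, rng.random())
        elif kind == "mixed":
            c = model(i, rng.random()) if a == 0 else 1
        else:
            raise ValueError(kind)
        rows.append([a, b, c])
    return rows

def unanimous_rate(data):
    # Fraction of rows in which every annotator chose the same label.
    return sum(len(set(row)) == 1 for row in data) / len(data)

for kind in ("high", "low", "mixed"):
    print(kind, unanimous_rate(simulate(kind)))
```

This all-or-nothing rate is only a coarse summary: it cannot say which category drives the disagreement, which is what the per-category scores in the next section address.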

## Agreement by Category

In [6]:
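def agreement_rate_by_category(data):
    """Return {category: agreeing pairs / potential pairs}, summed over all rows."""
    agreement = defaultdict(int)
    potential = defaultdict(int)

    for row in data:
        for cat in set(row):
            count = row.count(cat)
            # Pairs of annotators who both chose `cat` for this row.
            agreement[cat] += choose2(count)
            # Pairs in which at least one annotator chose `cat`.
            potential[cat] += count * len(row) - choose2(count + 1)

    return {cat: agreement[cat] / potential[cat] for cat in agreement}

def choose2(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n*(n-1)/2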
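
The `potential` update may look opaque at first glance. One way to read it (an interpretation of the code, not a statement taken from the post): for a row with $n$ annotators of whom $c$ chose a given category, it counts the pairs of annotators in which at least one member chose that category, since $\binom{n}{2} - \binom{n-c}{2} = cn - \binom{c+1}{2}$. The sketch below checks that identity by brute force for small $n$; `potential_closed_form` and `potential_brute_force` are hypothetical helper names introduced only for this check.

```python
from itertools import combinations

def choose2(n):
    return n * (n - 1) // 2

def potential_closed_form(n, c):
    # The expression used in agreement_rate_by_category, with n = len(row).
    return c * n - choose2(c + 1)

def potential_brute_force(n, c):
    # Annotators 0..c-1 chose the category; count the pairs touching any of them.
    return sum(1 for i, j in combinations(range(n), 2) if i < c or j < c)

for n in range(1, 7):
    for c in range(0, n + 1):
        assert potential_closed_form(n, c) == potential_brute_force(n, c)

print(potential_closed_form(3, 2))  # 3: every pair involves one of the two choosers
```

Under this reading, a row labeled `[0, 0, 1]` contributes 1 agreeing pair out of 3 potential pairs to category 0, and 0 out of 2 to category 1.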

In [7]:
agreement_rate_by_category(high)

{0: 0.8295142071494043, 1: 0.8324225865209471, 2: 0.8359447004608295}

In [8]:
agreement_rate_by_category(low)

{0: 0.19442761962447003, 1: 0.19830713422007254, 2: 0.2042377869334903}

In [9]:
agreement_rate_by_category(mixed)

{0: 0.8260869565217391, 1: 0.555045871559633, 2: 0.2853080568720379}
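
The per-category scores can be collapsed into a single number when one is needed. The sketch below takes an unweighted mean across categories; `macro_average` is a hypothetical helper, and the unweighted mean is an assumption made here for illustration, not necessarily the aggregation proposed in the blog post.

```python
def macro_average(rates):
    # Unweighted mean of the per-category agreement rates.
    if not rates:
        raise ValueError("no categories to aggregate")
    return sum(rates.values()) / len(rates)

# Per-category rates copied from the output for `mixed` above.
mixed_rates = {0: 0.8260869565217391, 1: 0.555045871559633, 2: 0.2853080568720379}
print(round(macro_average(mixed_rates), 3))
```

A single aggregate hides exactly the per-category contrast that the `mixed` experiment is meant to expose, so it is best reported alongside the individual scores.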