## Annotation metrics

This tutorial shows how to use the `metrics` module of the `disagree` annotations library. The module includes a wide variety of metrics commonly used to evaluate annotator (dis)agreement, as well as minimal visualisation capabilities. It has two classes: `Metrics` and `Krippendorff`. I kept `Krippendorff` separate because it runs a number of costly functions on initialisation.

In [1]:
from disagree import metrics
import pandas as pd

First we will create a dummy data set of labels. The library currently supports labels that are ascending integers starting at zero, as well as `None` for missing labels. So if your possible labels are `["cat", "dog", "giraffe", None]`, you will want to convert them to `[0, 1, 2, None]`.
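
The conversion described above can be sketched in plain Python. The raw labels and the helper name `encode` are made up for illustration and are not part of the library; the only assumption is that each category gets a fixed integer code and `None` is passed through unchanged.

```python
# Hypothetical raw labels; None marks an instance the annotator did not label.
raw_labels = ["cat", "dog", None, "giraffe", "cat"]

# Fix the category order explicitly so the integer codes are reproducible.
categories = ["cat", "dog", "giraffe"]
label_to_int = {name: code for code, name in enumerate(categories)}


def encode(label):
    """Map a string label to its integer code, leaving None untouched."""
    return None if label is None else label_to_int[label]


encoded = [encode(label) for label in raw_labels]
print(encoded)  # [0, 1, None, 2, 0]
```

Listing the categories by hand, rather than deriving them from the data, keeps the codes stable across annotators even when one annotator never used a particular label.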

The data set in this tutorial will have 15 instances of data, annotated by 3 annotators. The possible labels will be `[0, 1, 2, 3, None]`:

In [2]:
test_annotations = {"a": [None, None, None, None, None, 1, 3, 0, 1, 0, 0, 2, 2, None, 2],
                    "b": [0, None, 1, 0, 2, 2, 3, 2, None, None, None, None, None, None, None],
                    "c": [None, None, 1, 0, 2, 3, 3, None, 1, 0, 0, 2, 2, None, 3]}
df = pd.DataFrame(test_annotations)
labels = [0, 1, 2, 3] # Note that you don't need to specify the presence of None labels

We will now explore the metrics available in the `Metrics` class. There are two types: those that evaluate agreement among more than two annotators, and those that evaluate agreement between two annotators. We will start with the former, which currently consists only of the popular Fleiss' kappa.

In [3]:
mets = metrics.Metrics(df, labels)

In [4]:
fleiss = mets.fleiss_kappa()
print("Fleiss kappa: {:.2f}".format(fleiss))

Fleiss kappa: -0.29
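
For reference, the textbook definition of Fleiss' kappa is given below for $N$ instances, each rated by the same number $n$ of annotators over $k$ categories, where $n_{ij}$ is the number of annotators who assigned category $j$ to instance $i$. This standard formula assumes no missing labels. How the library accommodates `None` values is not covered here, so read it as background rather than as a description of the implementation.

```latex
\begin{aligned}
p_j &= \frac{1}{N n} \sum_{i=1}^{N} n_{ij}, \qquad
P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right), \\
\bar{P} &= \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad
\bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.
\end{aligned}
```

A negative value, like the one printed above, means the observed agreement $\bar{P}$ is lower than the agreement $\bar{P}_e$ expected by chance.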


There are five metrics of the latter type: joint probability, Cohen's kappa, Pearson correlation, Spearman correlation, and Kendall's tau correlation. The last three return a tuple of the correlation coefficient and the p-value.

Consider how often annotators "b" and "c" agree:

In [5]:
cohens = mets.cohens_kappa(ann1="b", ann2="c")

In [6]:
joint = mets.joint_probability(ann1="b", ann2="c")

In [7]:
pearson = mets.correlation(ann1="b", ann2="c", measure="pearson")
spearman = mets.correlation(ann1="b", ann2="c", measure="spearman")
kendall = mets.correlation(ann1="b", ann2="c", measure="kendall")

In [8]:
print("Cohen's kappa: {:.2f}".format(cohens))
print("Joint probability: {:.2f}".format(joint))
print("Pearson's correlation: " + str(pearson))
print("Spearman's correlation: " + str(spearman))
print("Kendall's correlation: " + str(kendall))

Cohen's kappa: 0.79
Joint probability: 0.80
Pearson's correlation: (0.9417419115948373, 0.01673155107662241)
Spearman's correlation: (0.9210526315789475, 0.026310519685577894)
Kendall's correlation: (0.8888888888888888, 0.037356472445581754)
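
To make the first two metrics concrete, here is a minimal sketch that computes the joint probability and the textbook Cohen's kappa by hand for annotators "b" and "c". It assumes that instances where either annotator gave `None` are simply dropped; that is my assumption for the sketch, not a statement about the library's internals.

```python
from collections import Counter

# Annotators "b" and "c" from the dummy data set above.
b = [0, None, 1, 0, 2, 2, 3, 2, None, None, None, None, None, None, None]
c = [None, None, 1, 0, 2, 3, 3, None, 1, 0, 0, 2, 2, None, 3]

# Keep only the instances that both annotators labelled.
pairs = [(x, y) for x, y in zip(b, c) if x is not None and y is not None]
n = len(pairs)

# Joint probability: the fraction of shared instances with identical labels.
p_observed = sum(x == y for x, y in pairs) / n

# Chance agreement, from each annotator's own label distribution
# over the shared instances.
counts_b = Counter(x for x, _ in pairs)
counts_c = Counter(y for _, y in pairs)
p_expected = sum(counts_b[k] * counts_c[k] for k in counts_b) / n ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print("Joint probability: {:.2f}".format(p_observed))  # 0.80
print("Textbook Cohen's kappa: {:.2f}".format(kappa))  # 0.74
```

The joint probability matches the 0.80 printed above. The hand-computed kappa comes out at about 0.74 rather than 0.79, so the library evidently treats missing labels differently when estimating chance agreement. Keep that in mind when comparing its output with other tools.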


For the metrics that compare two annotators, you can view the metric for every pair of annotators as a matrix by using the `metric_matrix` method. The only required argument is the metric method itself, for example `mets.cohens_kappa`.

In [9]:
mets.metric_matrix(mets.cohens_kappa)

array([[1.   , 0.33 , 0.732],
       [0.33 , 1.   , 0.795],
       [0.732, 0.795, 1.   ]])
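
The idea behind such a matrix can be sketched in a few lines of plain Python. The helpers `joint_probability` and `pairwise_matrix` below are hypothetical names written for this illustration, not the library's implementation, and they again assume that instances with a `None` from either annotator are skipped.

```python
annotations = {
    "a": [None, None, None, None, None, 1, 3, 0, 1, 0, 0, 2, 2, None, 2],
    "b": [0, None, 1, 0, 2, 2, 3, 2, None, None, None, None, None, None, None],
    "c": [None, None, 1, 0, 2, 3, 3, None, 1, 0, 0, 2, 2, None, 3],
}


def joint_probability(x, y):
    """Fraction of jointly labelled instances on which x and y agree.

    Raises ZeroDivisionError if the two annotators share no instances.
    """
    pairs = [(p, q) for p, q in zip(x, y) if p is not None and q is not None]
    return sum(p == q for p, q in pairs) / len(pairs)


def pairwise_matrix(data, metric):
    """Apply a two-annotator metric to every ordered pair of annotators."""
    names = list(data)
    return [[metric(data[row], data[col]) for col in names] for row in names]


matrix = pairwise_matrix(annotations, joint_probability)
for name, row in zip(annotations, matrix):
    print(name, ["{:.2f}".format(value) for value in row])
```

Because the metric is passed in as a function, the same loop works for any symmetric two-annotator measure.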

### Krippendorff's alpha

Krippendorff's alpha follows similar logic. Initialisation uses the `tqdm` library to display a progress bar, because the computation has non-linear time complexity and can take a long time for projects with a very large number of annotators. (As an example, 20,000 instances of data and 5 annotators take about 10 seconds.)

In [10]:
kripp = metrics.Krippendorff(df, labels)

100%|██████████| 4/4 [00:00<00:00, 681.31it/s]


There are several ways to calculate Krippendorff's alpha, depending on the level of measurement of the labelled data. This is specified with the `data_type` argument shown below. You can use nominal, ordinal, interval, or ratio.

In [11]:
alpha = kripp.alpha(data_type="nominal")

In [12]:
print("Krippendorff's alpha: {:.2f}".format(alpha))

Krippendorff's alpha: 0.65
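
As a cross-check, the nominal variant can be computed by hand from a coincidence matrix. The function `nominal_alpha` below is a hypothetical helper written for this tutorial following the standard definition, not the library's code. It skips instances with fewer than two labels and weights each pair of labels within an instance by `1 / (m - 1)`, where `m` is the number of labels that instance received.

```python
from collections import Counter
from itertools import permutations

annotations = {
    "a": [None, None, None, None, None, 1, 3, 0, 1, 0, 0, 2, 2, None, 2],
    "b": [0, None, 1, 0, 2, 2, 3, 2, None, None, None, None, None, None, None],
    "c": [None, None, 1, 0, 2, 3, 3, None, 1, 0, 0, 2, 2, None, 3],
}


def nominal_alpha(data):
    """Krippendorff's alpha for nominal labels, with None as missing."""
    # Coincidence matrix: weighted counts of ordered label pairs per instance.
    coincidences = Counter()
    for unit in zip(*data.values()):
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a single label cannot be paired with anything
        for v1, v2 in permutations(values, 2):
            coincidences[(v1, v2)] += 1 / (m - 1)

    # Marginal totals per label, and the total number of pairable labels.
    totals = Counter()
    for (v1, _), weight in coincidences.items():
        totals[v1] += weight
    n = sum(totals.values())

    # Nominal distance: 1 when the two labels differ, 0 otherwise.
    observed = sum(w for (v1, v2), w in coincidences.items() if v1 != v2) / n
    expected = sum(
        totals[v1] * totals[v2] for v1 in totals for v2 in totals if v1 != v2
    ) / (n * (n - 1))
    # Note: expected is 0 (division by zero) if only one label ever occurs.
    return 1 - observed / expected


print("Nominal alpha by hand: {:.2f}".format(nominal_alpha(annotations)))  # 0.65
```

On the dummy data set this gives 0.65 to two decimal places, in line with the value printed above.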
