## Annotation metrics

In this tutorial, I will show you how to use the `metrics` module of the `disagree` library. The module includes a wide variety of metrics commonly used to evaluate annotator (dis)agreement, as well as minimal visualisation capabilities. It has two classes: `Metrics` and `Krippendorff`. I kept `Krippendorff` separate because it runs a number of costly functions on initialisation.

In [17]:
import disagree
import pandas as pd

I will use string labels in this tutorial, but int and float labels are also possible, and the library can handle missing values (i.e. `nan` and `None`).

The data set in this tutorial has 15 instances of data, annotated by 3 annotators. The possible labels are `["cat", "dog", "cow", "ant", None]`, where `None` marks an instance the annotator did not label:

In [18]:
test_annotations = {"a": [None, None, None, None, None, "dog", "ant", "cat", "dog", "cat", "cat", "cow", "cow", None, "cow"],
                    "b": ["cat", None, "dog", "cat", "cow", "cow", "ant", "cow", None, None, None, None, None, None, None],
                    "c": [None, None, "dog", "cat", "cow", "ant", "ant", None, "dog", "cat", "cat", "cow", "cow", None, "ant"]}
df = pd.DataFrame(test_annotations)
print(df)

       a     b     c
0   None   cat  None
1   None  None  None
2   None   dog   dog
3   None   cat   cat
4   None   cow   cow
5    dog   cow   ant
6    ant   ant   ant
7    cat   cow  None
8    dog  None   dog
9    cat  None   cat
10   cat  None   cat
11   cow  None   cow
12   cow  None   cow
13  None  None  None
14   cow  None   ant


First we will explore the metrics available in the `Metrics` class. There are two types: those that evaluate agreement among more than two annotators, and those that compare just two annotators. We will start with the former, which here is just the popular Fleiss' kappa.

In [19]:
from disagree import metrics

In [20]:
mets = metrics.Metrics(df) 

In [21]:
fleiss = mets.fleiss_kappa()
print("Fleiss kappa: {:.2f}".format(fleiss))

Fleiss kappa: 0.45


Here we imported the `metrics` module from disagree, and used the `Metrics` class to set things up. All statistics (besides Krippendorff's alpha) are then called from the `Metrics` class as above.
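For reference, the textbook Fleiss' kappa for complete data can be sketched in plain Python. This is not the `disagree` implementation: the helper name `fleiss_kappa_complete` and the toy ratings are hypothetical, and the sketch assumes every item is labelled by the same number of raters with no missing values, so it is not expected to reproduce the 0.45 above.

```python
from collections import Counter


def fleiss_kappa_complete(ratings):
    """Textbook Fleiss' kappa; `ratings` is a list of per-item label lists.

    Assumes every item has the same number of raters (>= 2) and no missing
    labels. Undefined (division by zero) if all ratings share one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of ordered rater pairs that agree on this item.
        agreeing = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    # Chance agreement from the overall share of each category.
    expected = sum((t / total) ** 2 for t in category_totals.values())
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 4 items, 3 raters each, no missing labels.
toy = [
    ["cat", "cat", "cat"],
    ["dog", "dog", "cat"],
    ["cow", "cow", "cow"],
    ["dog", "cat", "cow"],
]
toy_kappa = fleiss_kappa_complete(toy)
print("Toy Fleiss kappa: {:.2f}".format(toy_kappa))  # Toy Fleiss kappa: 0.36
```

How the library treats the missing annotations in `df` before computing its own value is not shown in this tutorial, so treat the sketch only as a reminder of what the statistic measures.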

There are 5 simple metrics for comparing two annotators: joint probability, Cohen's kappa, Pearson correlation, Spearman correlation, and Kendall's tau correlation. The last three return a tuple of the correlation coefficient and the p-value.

Consider an evaluation of how often annotators "b" and "c" agree:

In [22]:
cohens = mets.cohens_kappa(ann1="b", ann2="c") 

In [23]:
joint = mets.joint_probability(ann1="b", ann2="c")

In [24]:
pearson = mets.correlation(ann1="b", ann2="c", measure="pearson")
spearman = mets.correlation(ann1="b", ann2="c", measure="spearman")
kendall = mets.correlation(ann1="b", ann2="c", measure="kendall")

In [25]:
print("Cohen's kappa: {:.2f}".format(cohens))
print("Joint probability: {:.2f}".format(joint))
print("Pearson's correlation: " + str(pearson))
print("Spearman's correlation: " + str(spearman))
print("Kendall's correlation: " + str(kendall))

Cohen's kappa: 0.74
Joint probability: 0.80
Pearson's correlation: (0.7399400733959436, 0.15283808566654913)
Spearman's correlation: (0.7105263157894738, 0.1786189192398147)
Kendall's correlation: (0.6666666666666666, 0.11843292891667201)
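To make these numbers less of a black box, here is a minimal plain-Python sketch of three of them. It rests on two assumptions that are not taken from the library's source: items where either annotator has no label are dropped, and for the correlation the string labels are turned into integer codes in alphabetical order. The helper names are hypothetical, and only `None` (not `nan`) is treated as missing. Under these assumptions the sketch happens to reproduce the joint probability, Cohen's kappa, and Pearson coefficient printed above for "b" and "c".

```python
import math
from collections import Counter

annotations = {
    "b": ["cat", None, "dog", "cat", "cow", "cow", "ant", "cow"] + [None] * 7,
    "c": [None, None, "dog", "cat", "cow", "ant", "ant", None,
          "dog", "cat", "cat", "cow", "cow", None, "ant"],
}


def paired(x, y):
    """Keep only the items that both annotators labelled."""
    return [(p, q) for p, q in zip(x, y) if p is not None and q is not None]


def joint_probability(x, y):
    """Share of jointly labelled items on which the two labels match."""
    pairs = paired(x, y)
    return sum(p == q for p, q in pairs) / len(pairs)


def cohens_kappa(x, y):
    """Agreement corrected for the agreement expected by chance."""
    pairs = paired(x, y)
    n = len(pairs)
    observed = sum(p == q for p, q in pairs) / n
    left = Counter(p for p, _ in pairs)
    right = Counter(q for _, q in pairs)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(left[label] * right[label] for label in left) / (n * n)
    return (observed - expected) / (1 - expected)


def pearson_on_codes(x, y):
    """Pearson's r after mapping labels to integer codes (an assumption)."""
    pairs = paired(x, y)
    labels = sorted({value for pair in pairs for value in pair})
    codes = {label: i for i, label in enumerate(labels)}
    xs = [codes[p] for p, _ in pairs]
    ys = [codes[q] for _, q in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


b, c = annotations["b"], annotations["c"]
joint_bc = joint_probability(b, c)
kappa_bc = cohens_kappa(b, c)
pearson_bc = pearson_on_codes(b, c)
print("Joint probability: {:.2f}".format(joint_bc))  # 0.80
print("Cohen's kappa: {:.2f}".format(kappa_bc))      # 0.74
print("Pearson's r: {:.2f}".format(pearson_bc))      # 0.74
```

Only five items carry a label from both "b" and "c", which is why the p-values reported above are large despite the fairly high coefficients.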


For these two-annotator metrics, you can compute the metric for every pair of annotators and view the results as a matrix by using the `metric_matrix` method. The only required argument is the metric method itself (here `mets.cohens_kappa`), passed without calling it.

In [26]:
mets.metric_matrix(mets.cohens_kappa)

array([[1.   , 0.25 , 0.673],
       [0.25 , 1.   , 0.737],
       [0.673, 0.737, 1.   ]])
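The idea of passing a metric function and applying it to every pair of annotators can be sketched as follows. `pairwise_matrix` and `agreement` are hypothetical helpers written for this illustration, not the library's `metric_matrix`; the sketch assumes the metric is symmetric, equals 1 when an annotator is compared with themselves, and that `None` marks a missing label. It uses plain observed agreement rather than Cohen's kappa to keep the example short.

```python
from itertools import combinations

annotations = {
    "a": [None] * 5 + ["dog", "ant", "cat", "dog", "cat",
                       "cat", "cow", "cow", None, "cow"],
    "b": ["cat", None, "dog", "cat", "cow", "cow", "ant", "cow"] + [None] * 7,
    "c": [None, None, "dog", "cat", "cow", "ant", "ant", None,
          "dog", "cat", "cat", "cow", "cow", None, "ant"],
}


def agreement(x, y):
    """Share of jointly labelled items on which two annotators agree."""
    pairs = [(p, q) for p, q in zip(x, y) if p is not None and q is not None]
    return sum(p == q for p, q in pairs) / len(pairs)


def pairwise_matrix(data, metric):
    """Apply a symmetric two-annotator `metric` to every pair of columns."""
    names = list(data)
    # Diagonal is fixed at 1.0: an annotator always agrees with themselves.
    matrix = [[1.0] * len(names) for _ in names]
    for i, j in combinations(range(len(names)), 2):
        value = metric(data[names[i]], data[names[j]])
        matrix[i][j] = matrix[j][i] = value
    return matrix


matrix = pairwise_matrix(annotations, agreement)
for row in matrix:
    print([round(value, 3) for value in row])
# [1.0, 0.333, 0.75]
# [0.333, 1.0, 0.8]
# [0.75, 0.8, 1.0]
```

Rows and columns follow the order of the annotators ("a", "b", "c"), which is also how the Cohen's kappa matrix above reads: 0.737 is the "b" versus "c" value computed earlier.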

### Krippendorff's alpha

Krippendorff's alpha follows a similar logic. The class also has an option to use the `tqdm` library to display a progress bar, because the computation has non-linear time complexity and can take a long time for projects with a very large number of annotators. (As an example, for 20,000 instances of data and 5 annotators, it takes about 10 seconds.)

In [27]:
kripp = metrics.Krippendorff(df)

There are a number of different ways to calculate Krippendorff's alpha, depending on the type of data that has been labelled. This is specified using the `data_type` argument seen below. The supported types are nominal, ordinal, interval, and ratio.

In [28]:
alpha = kripp.alpha(data_type="nominal")

In [29]:
print("Krippendorff's alpha: {:.2f}".format(alpha))

Krippendorff's alpha: 0.65
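For readers who want to see what the nominal variant computes, below is a minimal sketch of the standard coincidence-matrix formulation in plain Python. It is not the library's code: `krippendorff_alpha_nominal` is a hypothetical helper, only `None` is treated as missing, and none of the initialisation work or progress-bar handling of the `Krippendorff` class is modelled. On the tutorial data it gives the same value, 0.65 to two decimals.

```python
from collections import Counter
from itertools import permutations

annotations = {
    "a": [None] * 5 + ["dog", "ant", "cat", "dog", "cat",
                       "cat", "cow", "cow", None, "cow"],
    "b": ["cat", None, "dog", "cat", "cow", "cow", "ant", "cow"] + [None] * 7,
    "c": [None, None, "dog", "cat", "cow", "ant", "ant", None,
          "dog", "cat", "cat", "cow", "cow", None, "ant"],
}


def krippendorff_alpha_nominal(columns):
    """Nominal alpha; `columns` maps annotator -> labels, None = missing."""
    coincidences = Counter()
    for unit in zip(*columns.values()):
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # fewer than two labels: nothing to pair up
        # Every ordered pair of labels within a unit counts 1 / (m - 1).
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    label_totals = Counter()
    for (c, _), weight in coincidences.items():
        label_totals[c] += weight
    n = sum(label_totals.values())

    # Observed disagreement: weight of mismatching pairs.
    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    # Disagreement expected by chance from the label frequencies.
    squares = sum(t * t for t in label_totals.values())
    expected = (n * n - squares) / (n * (n - 1))
    return 1 - observed / expected


sketch_alpha = krippendorff_alpha_nominal(annotations)
print("Sketch alpha: {:.2f}".format(sketch_alpha))  # Sketch alpha: 0.65
```

The nominal case only asks whether two labels differ; the ordinal, interval, and ratio variants replace that 0/1 difference with a distance between labels.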
