# Agreement Study

We randomly selected three medium-length cases (70-150 sentences) for double annotation to assess inter-annotator agreement under the revised annotation schema and guidelines. Michael and Nathan annotated the cases for this study.

This notebook covers:

- Some high-level stats about the annotations and disagreements
- Inter-annotator agreement (IAA) metrics, including F1 and Gamma
- Qualitative analysis of the disagreements

Cases for agreement study:

- 12625853_mixed_alito
- 12628561_ootc_sotomayor
- 12625931_dissent_thomas

## Setup


In [1]:
# Allows for seamless use of updated src

%load_ext autoreload
%autoreload 2

# Switch to top of curiam directory for easier paths
%cd ../../..


/home/mkranzlein/michael/dev/curiam


In [2]:
import os

from curiam import agreement

from curiam.inception import tsv_processing

from sklearn.metrics import precision_recall_fscore_support

## Summary Stats


In [3]:
agreement_folder = "data/full_scale/agreement_study"

# Each annotator's data is a list of opinions; each opinion is a list of sentences,
# and each sentence is a list of tokens.
# E.g., opinions[0][0][0] is the 0th token of the 0th sentence of the 0th opinion in the agreement study.
# sorted() keeps both annotators' opinions in the same order (assuming matching filenames);
# os.listdir() returns entries in arbitrary order.
opinions = [tsv_processing.process_opinion_file(f"{agreement_folder}/michael/{filename}")
            for filename in sorted(os.listdir(f"{agreement_folder}/michael"))
            if filename.endswith(".tsv")]

opinions_n = [tsv_processing.process_opinion_file(f"{agreement_folder}/nathan/{filename}")
              for filename in sorted(os.listdir(f"{agreement_folder}/nathan"))
              if filename.endswith(".tsv")]

# Set 4th column of each token to Nathan's annotation
# Each token now has the format: [sentence_num, tok_str, michael_annotation, nathan_annotation]
for i, opinion in enumerate(opinions):
    for j, sentence in enumerate(opinion):
        for k, token in enumerate(sentence):
            token.append(opinions_n[i][j][k][2])
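The merge above assumes that both annotators' exports contain the same opinions, sentences, and tokens in the same order. A minimal sketch of a check that could be run before merging is shown below. The helper name `find_misaligned_tokens` is hypothetical (it is not part of `curiam`), and it assumes column 1 of each token holds the token string, as described in the comment above.

```python
def find_misaligned_tokens(opinions_a, opinions_b):
    """Return (opinion, sentence, token) index triples where two exports differ.

    A None in the triple means the mismatch is at a coarser level, e.g. two
    sentences with different token counts.
    """
    problems = []
    if len(opinions_a) != len(opinions_b):
        problems.append((None, None, None))
    for i, (opinion_a, opinion_b) in enumerate(zip(opinions_a, opinions_b)):
        if len(opinion_a) != len(opinion_b):
            problems.append((i, None, None))
            continue
        for j, (sentence_a, sentence_b) in enumerate(zip(opinion_a, opinion_b)):
            if len(sentence_a) != len(sentence_b):
                problems.append((i, j, None))
                continue
            for k, (token_a, token_b) in enumerate(zip(sentence_a, sentence_b)):
                # Column 1 is assumed to hold the token string in both exports.
                if token_a[1] != token_b[1]:
                    problems.append((i, j, k))
    return problems
```

Calling `find_misaligned_tokens(opinions, opinions_n)` before the merge and confirming that it returns an empty list would catch a mismatched file order or a tokenization difference between the two exports.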


### How many sentences?


In [4]:
sum([len(opinion) for opinion in opinions])

332


### How many tokens?


In [5]:
token_total = sum([len(sentence) for opinion in opinions for sentence in opinion])
print(token_total)

9109


### How many tokens received at least one label?

In [6]:
def get_token_coverage(sentence, annotation_column):
    """Count the tokens in a sentence that carry at least one label ("_" means unlabeled)."""
    return sum(token[annotation_column] != "_" for token in sentence)

coverage_m = sum([get_token_coverage(sentence, 2) for opinion in opinions for sentence in opinion])
coverage_n = sum([get_token_coverage(sentence, 3) for opinion in opinions for sentence in opinion])

print("Tokens with at least one label:")
print(f"Michael: {coverage_m} ({coverage_m/token_total*100:.2f}%)")
print(f"Nathan: {coverage_n} ({coverage_n/token_total*100:.2f}%)")

Tokens with at least one label:
Michael: 4616 (50.68%)
Nathan: 4275 (46.93%)



### How many spans did each annotator annotate?
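One possible way to answer this is to tally the spans returned for each sentence, grouped by category. The sketch below is an assumption about how that could look, not the notebook's own analysis: `count_spans` is a hypothetical helper, and it takes the span extractor as a parameter so that it stays self-contained. It assumes the extractor is called as `get_annotations(sentence, annotation_column=...)` and yields items whose first element is the category, which matches how `tsv_processing.get_annotations` is used in the Gamma section.

```python
from collections import Counter


def count_spans(opinions, get_annotations, annotation_column):
    """Count annotated spans per category for one annotator.

    get_annotations is any callable that returns the spans of one sentence;
    each span's first element is assumed to be its category.
    """
    counts = Counter()
    for opinion in opinions:
        for sentence in opinion:
            for annotation in get_annotations(sentence, annotation_column=annotation_column):
                counts[annotation[0]] += 1
    return counts
```

In this notebook the call would look like `count_spans(opinions, tsv_processing.get_annotations, 2)` for Michael and the same with column `3` for Nathan; `sum(counts.values())` then gives the total number of spans.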



## Agreement

### Agreement Overall



#### Gamma


In [7]:
from pygamma_agreement import Continuum
from pyannote.core import Segment
from pygamma_agreement import CombinedCategoricalDissimilarity

In [8]:
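def get_opinion_gamma(opinion, excluded_categories=()):
    """Compute Gamma between the two annotators for one opinion.

    Spans whose category is in excluded_categories are left out of the continuum.
    """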
    continuum = Continuum()
    offset = 0
    for sentence in opinion:
        annotations_m = tsv_processing.get_annotations(sentence, annotation_column=2)
        annotations_n = tsv_processing.get_annotations(sentence, annotation_column=3)
        for annotation in annotations_m:
            category, start, end = annotation[0], annotation[1], annotation[2]
            if category in excluded_categories:
                continue
            continuum.add("m", Segment(start+offset, end+offset+1), category)
        for annotation in annotations_n:
            category, start, end = annotation[0], annotation[1], annotation[2]
            if category in excluded_categories:
                continue
            continuum.add("n", Segment(start+offset, end+offset+1), category)
        offset += len(sentence)
    dissim = CombinedCategoricalDissimilarity(alpha=1, beta=1)
    gamma_results = continuum.compute_gamma(dissim)
    return gamma_results.gamma
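`get_opinion_gamma` places every span on a single opinion-level timeline: sentence-local token indices are shifted by the number of tokens in all preceding sentences, and 1 is added to the end index to form the segment's right boundary. The pure-Python sketch below isolates that index arithmetic. The helper name `to_opinion_segments` is hypothetical, and it assumes that span end indices are inclusive, which the `+1` in the cell above suggests but the source does not state.

```python
def to_opinion_segments(sentence_lengths, sentence_spans):
    """Convert sentence-local (category, start, end) spans to opinion-level segments.

    sentence_lengths[i] is the token count of sentence i, and sentence_spans[i]
    lists that sentence's spans with an inclusive end index (an assumption).
    Returned segments are (category, start, end) with an exclusive end.
    """
    segments = []
    offset = 0
    for length, spans in zip(sentence_lengths, sentence_spans):
        for category, start, end in spans:
            segments.append((category, start + offset, end + offset + 1))
        # Later sentences start after all tokens seen so far.
        offset += length
    return segments


print(to_opinion_segments(
    [5, 4],
    [[("Focal Term", 1, 2)], [("Definition", 0, 3)]],
))
# [('Focal Term', 1, 3), ('Definition', 5, 9)]
```

Using one shared offset per opinion keeps spans from different sentences from overlapping on the timeline.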

In [9]:
for opinion in opinions:
    print(get_opinion_gamma(opinion))

0.8076161318013454
0.8247066494288938
0.8648076435148637


In [10]:
for opinion in opinions:
    print(get_opinion_gamma(opinion, excluded_categories=["Direct Quote", "Legal Source"]))

0.6944353963751995
0.7471369718588388
0.7429975028577418


In [11]:
for opinion in opinions:
    print(get_opinion_gamma(opinion, excluded_categories=["Direct Quote", "Legal Source", "Metalinguistic Cue"]))

0.6583071730224289
0.6778096615916295
0.5092599582426991



#### P, R, F1



In [12]:
categories = ['Appeal to Meaning',
              'Definition',
              'Direct Quote',
              'Example Use',
              'Focal Term',
              'Language Source',
              'Legal Source',
              'Metalinguistic Cue',
              'Named Interpretive Rule']

In [13]:
def get_token_categories(label):
    token_categories = []
    indexed_annotations = label.split("|")
    for indexed_annotation in indexed_annotations:
        colon_index = indexed_annotation.index(":")
        category = indexed_annotation[:colon_index]
        token_categories.append(category)
    return token_categories
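As a quick illustration of the label format this function expects, the sketch below restates the same parsing compactly and runs it on a made-up label. The format `Category:index` joined by `|` is inferred from the parsing logic above, and the index values `3` and `7` are invented for the example.

```python
def get_token_categories(label):
    """Return the category names in a label such as "Focal Term:3|Definition:7"."""
    # Each "|"-separated part is "<category>:<span index>"; keep the text before the first colon.
    return [part[:part.index(":")] for part in label.split("|")]


print(get_token_categories("Focal Term:3|Definition:7"))
# ['Focal Term', 'Definition']
```

Because `index(":")` raises `ValueError` when no colon is present, the unlabeled marker `"_"` has to be filtered out before calling the function, as the later cells do.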

In [14]:
label_set = set()
for opinion in opinions:
    for sentence in opinion:
        for token in sentence:
            if token[2] != "_":
                label_set.update(set(get_token_categories(token[2])))

In [15]:
def print_opinion_p_r_f1(opinion):
    """Print token-level precision, recall, and F1 per category for one opinion.

    Michael's labels are passed as y_true and Nathan's as y_pred.
    """
    for category in categories:
        token_labels_m = []
        token_labels_n = []
        for sentence in opinion:
            # Create binary labels per token: 1 if the annotator applied this category, else 0
            for token in sentence:
                label_m = token[2]
                label_n = token[3]
                token_labels_m.append(int(label_m != "_" and category in get_token_categories(label_m)))
                token_labels_n.append(int(label_n != "_" and category in get_token_categories(label_n)))
        # Skip categories that either annotator never used in this opinion
        if sum(token_labels_m) > 0 and sum(token_labels_n) > 0:
            # average="macro" averages over both the positive (1) and negative (0) class
            p, r, f1, _ = precision_recall_fscore_support(token_labels_m, token_labels_n, average="macro")
            print(f"{p:.4f}\t {r:.4f}\t {f1:.4f}\t {category}")

        

In [16]:
print_opinion_p_r_f1(opinions[1])

0.6098	 0.5864	 0.5956	 Appeal to Meaning
0.9126	 0.9313	 0.9217	 Definition
0.9949	 0.9743	 0.9842	 Direct Quote
0.9251	 0.7841	 0.8377	 Example Use
0.9313	 0.8496	 0.8860	 Focal Term
0.9996	 0.9756	 0.9873	 Language Source
0.9965	 0.9982	 0.9973	 Legal Source
0.9198	 0.8624	 0.8888	 Metalinguistic Cue
0.5708	 0.6976	 0.6037	 Named Interpretive Rule
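The scores above are computed with `average="macro"`, which for binary token labels is the unweighted mean over the positive class (token has the category) and the negative class (token lacks it). Since most tokens lack any given category, the negative class is comparatively easy to agree on and can raise the average. The pure-Python sketch below contrasts the two views on invented toy labels; the helper names `prf` and `macro_f1` are hypothetical, and the numbers are not from the agreement study.

```python
def prf(reference, predicted, positive):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(reference, predicted):
    """Unweighted mean of the per-class F1 scores for binary labels 0 and 1."""
    return (prf(reference, predicted, 0)[2] + prf(reference, predicted, 1)[2]) / 2


# Toy data: 10 tokens, 2 carry the category according to each annotator, 1 in common.
reference = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(f"positive-class F1: {prf(reference, predicted, 1)[2]:.4f}")  # 0.5000
print(f"macro F1:          {macro_f1(reference, predicted):.4f}")   # 0.6875
```

Reporting the positive-class score alongside the macro score would make it easier to see how much of the agreement for a rare category comes from tokens that neither annotator labeled.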




### Agreement By Category

## Qualitative Analysis