# Agreement Study

We randomly selected three cases of medium length (70-150 sentences) for double annotation to assess inter-annotator agreement under the revised annotation schema and guidelines. Michael and Nathan annotated the cases for this study.

This notebook covers:

- High-level stats about the annotations and disagreements
- Inter-annotator agreement (IAA) metrics, including F1 and Gamma
- Qualitative analysis of the disagreements

Cases for agreement study:

- 12625853_mixed_alito
- 12628561_ootc_sotomayor
- 12625931_dissent_thomas

## Setup


In [1]:
# Allows for seamless use of updated src

%load_ext autoreload
%autoreload 2

# Switch to top of curiam directory for easier paths
%cd ../../..


/home/mkranzlein/michael/dev/curiam


In [2]:
import os

from curiam import agreement

from curiam.inception import tsv_processing

from sklearn.metrics import precision_recall_fscore_support

## Summary Stats


In [3]:
agreement_folder = "data/full_scale/agreement_study"

# Each variable is a list of opinions; each opinion is a list of sentences; each sentence is a list of tokens,
# e.g., opinions[0][0][0] is the 0-th token of the 0-th sentence of the 0-th opinion in the agreement study.
# NOTE: os.listdir() order is not guaranteed; pairing opinions[i] with opinions_n[i] below
# assumes both folders list the same files in the same order.
opinions = [tsv_processing.process_opinion_file(f"{agreement_folder}/michael/{filename}")
            for filename in os.listdir(f"{agreement_folder}/michael")
            if filename.endswith(".tsv")]

opinions_n = [tsv_processing.process_opinion_file(f"{agreement_folder}/nathan/{filename}")
              for filename in os.listdir(f"{agreement_folder}/nathan")
              if filename.endswith(".tsv")]

# Append Nathan's annotation to each token as a 4th column
# Each token now has the format: [sentence_num, tok_str, michael_annotation, nathan_annotation]
for i, opinion in enumerate(opinions):
    for j, sentence in enumerate(opinion):
        for k, token in enumerate(sentence):
            nathan_label_dict = opinions_n[i][j][k][2]
            token.append(nathan_label_dict)


### How many sentences?


In [4]:
sum([len(opinion) for opinion in opinions])

332


### How many tokens?


In [5]:
token_total = sum([len(sentence) for opinion in opinions for sentence in opinion])
print(token_total)

9109


### How many tokens received at least one label?

In [6]:
def get_token_coverage(sentence, annotation_column):
    return sum([1 if len(token[annotation_column]["categories"]) > 0
                else 0 for token in sentence])

coverage_m = sum([get_token_coverage(sentence, 2) for opinion in opinions for sentence in opinion])
coverage_n = sum([get_token_coverage(sentence, 3) for opinion in opinions for sentence in opinion])

print("Tokens with at least one label:")
print(f"Michael: {coverage_m} ({coverage_m/token_total*100:.2f}%)")
print(f"Nathan: {coverage_n} ({coverage_n/token_total*100:.2f}%)")

Tokens with at least one label:
Michael: 4616 (50.68%)
Nathan: 4275 (46.93%)



### How many spans did each annotator annotate?
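
A minimal sketch of how span counts could be tallied, assuming each sentence's annotations are available as `(category, start, end)` tuples, which is how the Gamma section unpacks the output of `tsv_processing.get_sentence_annotations`. The helper `count_spans` and the toy data are hypothetical and are not part of `curiam`; no counts from the actual study are shown here.

```python
from collections import Counter


def count_spans(opinion_annotations):
    """Count annotated spans per category for one annotator.

    `opinion_annotations` is a list of sentences, where each sentence is a
    list of (category, start, end) tuples (assumed format).
    """
    counts = Counter()
    for sentence_annotations in opinion_annotations:
        for category, _start, _end in sentence_annotations:
            counts[category] += 1
    return counts


# Toy data for illustration only (not from the agreement study)
toy_m = [[("Focal Term", 0, 1), ("Definition", 3, 7)], [("Direct Quote", 2, 5)]]
toy_n = [[("Focal Term", 0, 1)], [("Direct Quote", 2, 5), ("Legal Source", 6, 9)]]

spans_m = count_spans(toy_m)
spans_n = count_spans(toy_n)
print(f"Michael: {sum(spans_m.values())} spans, Nathan: {sum(spans_n.values())} spans")
```

In the notebook, the per-sentence lists would presumably come from calling `tsv_processing.get_sentence_annotations(sentence, annotation_column=2)` for Michael and `annotation_column=3` for Nathan.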



## Agreement

### Agreement Overall



#### Gamma


In [7]:
from pygamma_agreement import Continuum
from pyannote.core import Segment
from pygamma_agreement import CombinedCategoricalDissimilarity

In [8]:
def get_opinion_gamma(opinion, excluded_categories=()):
    """Compute gamma for one opinion, skipping spans whose category is in excluded_categories."""
    continuum = Continuum()
    offset = 0
    for sentence in opinion:
        annotations_m = tsv_processing.get_sentence_annotations(sentence, annotation_column=2)
        annotations_n = tsv_processing.get_sentence_annotations(sentence, annotation_column=3)
        for annotation in annotations_m:
            category, start, end = annotation[0], annotation[1], annotation[2]
            if category in excluded_categories:
                continue
            continuum.add("m", Segment(start+offset, end+offset+1), category)
        for annotation in annotations_n:
            category, start, end = annotation[0], annotation[1], annotation[2]
            if category in excluded_categories:
                continue
            continuum.add("n", Segment(start+offset, end+offset+1), category)
        offset += len(sentence)
    dissim = CombinedCategoricalDissimilarity(alpha=1, beta=1)
    gamma_results = continuum.compute_gamma(dissim)
    return gamma_results.gamma

In [42]:
opinion_gammas = []
for opinion in opinions:
    opinion_gammas.append(round(get_opinion_gamma(opinion), 3))

opinion_gammas

[0.808, 0.824, 0.864]

Gamma is meant to be calculated at the document level, which is what we did above. To get a single gamma figure for the corpus,
we take a token-weighted average: each opinion's gamma is weighted by that opinion's share of the total token count.
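
Written out, the token-weighted average is the following, where the symbols are our own notation: $\gamma_i$ is the gamma score of opinion $i$, $n_i$ is its token count, and $N$ is the total token count across the opinions.

$$
\bar{\gamma} = \sum_{i} \frac{n_i}{N}\,\gamma_i, \qquad N = \sum_{i} n_i
$$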

In [36]:
opinion_token_counts = [sum([len(sentence) for sentence in opinion]) for opinion in opinions]


In [46]:
tokens_total = sum(opinion_token_counts)
weighted_gammas = []
for gamma, token_count in zip(opinion_gammas, opinion_token_counts):
    weighted_gammas.append(gamma * (token_count / tokens_total))
weighted_average_gamma = sum(weighted_gammas)
print(weighted_average_gamma)

0.8291017674827095


In [10]:
for opinion in opinions:
    print(get_opinion_gamma(opinion, excluded_categories=["Direct Quote", "Legal Source"]))

0.6964367061547254
0.7448524042406597
0.7420227226965087


In [11]:
for opinion in opinions:
    print(get_opinion_gamma(opinion, excluded_categories=["Direct Quote", "Legal Source", "Metalinguistic Cue"]))

0.6536598816466577
0.6848057469361514
0.500687653017608



#### P, R, F1



In [12]:
categories = [
    "Appeal to Meaning",
    "Definition",
    "Direct Quote",
    "Example Use",
    "Focal Term",
    "Language Source",
    "Legal Source",
    "Metalinguistic Cue",
    "Named Interpretive Rule"
    ]

categories_to_abbreviations = {
    "Appeal to Meaning": "ATM",
    "Definition": "D",
    "Direct Quote": "DQ",
    "Example Use": "EU",
    "Focal Term": "FT",
    "Language Source": "LaS",
    "Legal Source": "LeS",
    "Metalinguistic Cue": "MC",
    "Named Interpretive Rule": "NIR"
}

abbreviations_to_categories = {
    "ATM": "Appeal to Meaning",
    "D": "Definition",
    "DQ": "Direct Quote",
    "EU": "Example Use",
    "FT": "Focal Term",
    "LaS": "Language Source",
    "LeS": "Legal Source",
    "MC": "Metalinguistic Cue",
    "NIR": "Named Interpretive Rule",
}

# TODO: confirm category order in sec 3 matches this
# Categories ordered how they're presented in the paper
ordered_categories = ["FT", "D", "MC", "DQ", "LaS", "LeS", "NIR", "EU", "ATM"]


In [33]:
def print_opinion_p_r_f1(opinion):
    """Print token-level P, R, and F1 per category as LaTeX table rows.

    Michael's labels are passed as y_true and Nathan's as y_pred.
    """
    for abbreviated_category in ordered_categories:
        category = abbreviations_to_categories[abbreviated_category]
        token_labels_m = []
        token_labels_n = []
        for sentence in opinion:
            # Create binary (category present / absent) labels for each token for Michael and Nathan
            for token in sentence:
                label_dict_m = token[2]
                label_dict_n = token[3]
                token_labels_m.append(int(category in label_dict_m["categories"]))
                token_labels_n.append(int(category in label_dict_n["categories"]))
        # Skip categories that either annotator never used in this opinion
        if sum(token_labels_m) > 0 and sum(token_labels_n) > 0:
            # average="macro" takes the unweighted mean of the scores for the present (1) and absent (0) classes
            p, r, f1, _ = precision_recall_fscore_support(token_labels_m, token_labels_n, average="macro")
            print(f"{category} & {p:0.3f} & {r:0.3f} & {f1:0.3f}\\\\")

print_opinion_p_r_f1(opinions[1])

Focal Term & 0.931 & 0.850 & 0.886\\
Definition & 0.913 & 0.931 & 0.922\\
Metalinguistic Cue & 0.920 & 0.862 & 0.889\\
Direct Quote & 0.995 & 0.974 & 0.984\\
Language Source & 1.000 & 0.976 & 0.987\\
Legal Source & 0.997 & 0.998 & 0.997\\
Named Interpretive Rule & 0.571 & 0.698 & 0.604\\
Example Use & 0.925 & 0.784 & 0.838\\
Appeal to Meaning & 0.610 & 0.586 & 0.596\\
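
Because the call above uses `average="macro"` on binary labels, each reported score is the unweighted mean of the score for the "category present" class and the score for the "category absent" class. The sketch below reproduces that computation in plain Python on toy labels so the averaging is explicit. The helpers `binary_prf` and `macro_prf` and the toy labels are hypothetical and are meant only to illustrate the arithmetic, not to replace scikit-learn.

```python
def binary_prf(y_true, y_pred, positive):
    """Precision, recall, and F1 treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true, y_pred):
    """Unweighted mean of the per-class scores for classes 0 and 1."""
    per_class = [binary_prf(y_true, y_pred, c) for c in (0, 1)]
    return tuple(sum(metric) / len(metric) for metric in zip(*per_class))


# Toy token labels for one category (1 = token has the category)
labels_m = [1, 1, 0, 0, 0, 1, 0, 0]  # reference annotator
labels_n = [1, 0, 0, 0, 0, 1, 1, 0]  # second annotator

positive_only = binary_prf(labels_m, labels_n, positive=1)
macro = macro_prf(labels_m, labels_n)
print("positive class only:", [round(x, 3) for x in positive_only])
print("macro over 0 and 1: ", [round(x, 3) for x in macro])
```

On these toy labels the macro scores come out higher than the positive-class scores, since the more frequent "absent" class is easier to agree on. That is worth keeping in mind when reading rows for rare categories.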


### Span-level Exact Match
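
One way span-level exact match could be computed is sketched below: a span counts as matched only if both annotators marked the same category with identical start and end offsets. The function `span_exact_match` and the toy spans are hypothetical. Spans are assumed to be `(sentence_index, category, start, end)` tuples so that spans in different sentences never collide, and the first annotator is treated as the reference, mirroring the argument order used for the token-level scores.

```python
def span_exact_match(spans_ref, spans_other):
    """Exact-match precision, recall, and F1 between two collections of spans.

    Each span is a hashable tuple, e.g. (sentence_index, category, start, end).
    `spans_ref` is treated as the reference annotator.
    """
    set_ref, set_other = set(spans_ref), set(spans_other)
    matched = len(set_ref & set_other)
    precision = matched / len(set_other) if set_other else 0.0
    recall = matched / len(set_ref) if set_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy spans for illustration only (not from the agreement study)
toy_spans_m = [(0, "Focal Term", 0, 1), (0, "Definition", 3, 7), (1, "Direct Quote", 10, 15)]
toy_spans_n = [(0, "Focal Term", 0, 1), (0, "Definition", 3, 6),  # boundary differs by one token
               (1, "Direct Quote", 10, 15), (1, "Legal Source", 20, 22)]

p, r, f1 = span_exact_match(toy_spans_m, toy_spans_n)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Exact match is strict by design: the `Definition` span that differs by a single token counts as a full miss for both annotators, which is one reason to read it alongside Gamma and the token-level scores.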



### Agreement By Category

## Qualitative Analysis