# Agreement Study

We randomly selected 3 cases of medium length (70-150 sentences) for double annotation to assess inter-annotator agreement under the revised annotation schema and guidelines. Annotation for this study was conducted by Michael and Nathan.

This notebook covers:

- Some high-level stats about the annotations and disagreements
- Inter-annotator agreement (IAA) metrics, including F1 and Gamma
- Qualitative analysis of the disagreements

Cases for agreement study:

- 12625853_mixed_alito
- 12628561_ootc_sotomayor
- 12625931_dissent_thomas

## Setup


In [1]:
# Allows for seamless use of updated src

%load_ext autoreload
%autoreload 2

# Switch to top of curiam directory for easier paths
%cd ../..


/home/mkranzlein/michael/dev/curiam


In [2]:
from pathlib import Path

from curiam import categories
from curiam.preprocessing import inception_tsv

from pyannote.core import Segment
from pygamma_agreement import Continuum
from pygamma_agreement import CombinedCategoricalDissimilarity
from sklearn.metrics import precision_recall_fscore_support


## Summary Stats


In [3]:
agreement_path = Path("data/full_scale/agreement_study")

# These are lists of opinions, which are lists of sentences, which are lists of tokens.
# E.g., opinions[0][0][0] is the 0-th token of the 0-th sentence of the 0-th opinion in the agreement study.
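# glob() does not guarantee an order, so sort both listings to keep the two
# annotators' files aligned (assumes matching filenames in both directories).
opinions = [inception_tsv.process_opinion_file(opinion_path) for opinion_path
            in sorted(agreement_path.joinpath("michael").glob("*.tsv"))]

opinions_n = [inception_tsv.process_opinion_file(opinion_path) for opinion_path
              in sorted(agreement_path.joinpath("nathan").glob("*.tsv"))]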

# Set 4th column of each token to Nathan's annotation
# Each token now has the format: [sentence_num, tok_str, michael_annotation, nathan_annotation]
for i, opinion in enumerate(opinions):
    for j, sentence in enumerate(opinion):
        for k, token in enumerate(sentence):
            nathan_label_dict = opinions_n[i][j][k][2]
            token.append(nathan_label_dict)
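
The merge above relies on both annotators' files yielding the same sentences and tokens in the same order. Below is a minimal sketch of the same merge with an explicit alignment check, run on made-up toy tokens in the `[sentence_num, tok_str, label_dict]` layout. The helper name `merge_annotations` and the toy data are hypothetical and not part of `curiam`.

```python
def merge_annotations(opinions_a, opinions_b):
    """Append annotator B's label dict to each of annotator A's tokens.

    Assumes each token is [sentence_num, tok_str, label_dict]. Raises if the
    two annotators' opinions, sentences, or token strings do not line up.
    """
    if len(opinions_a) != len(opinions_b):
        raise ValueError("Different number of opinions")
    for opinion_a, opinion_b in zip(opinions_a, opinions_b):
        if len(opinion_a) != len(opinion_b):
            raise ValueError("Different number of sentences")
        for sentence_a, sentence_b in zip(opinion_a, opinion_b):
            if [t[1] for t in sentence_a] != [t[1] for t in sentence_b]:
                raise ValueError("Token mismatch between annotators")
            for token_a, token_b in zip(sentence_a, sentence_b):
                token_a.append(token_b[2])
    return opinions_a


# Toy data: one opinion containing one sentence of two tokens.
toy_a = [[[[0, "the", {"categories": []}],
           [0, "statute", {"categories": ["Legal Source"]}]]]]
toy_b = [[[[0, "the", {"categories": []}],
           [0, "statute", {"categories": []}]]]]
merged = merge_annotations(toy_a, toy_b)
print(merged[0][0][1])
```

Failing loudly on a mismatch is a design choice here: a silent misalignment would shift every downstream agreement number.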


### How many sentences?


In [4]:
sum([len(opinion) for opinion in opinions])

332


### How many tokens?


In [5]:
token_total = sum([len(sentence) for opinion in opinions for sentence in opinion])
print(token_total)

9109


### How many tokens received at least one label?

In [6]:
def get_token_coverage(sentence, annotation_column):
    """Count the tokens in a sentence that have at least one category label."""
    return sum(1 for token in sentence
               if len(token[annotation_column]["categories"]) > 0)

coverage_m = sum([get_token_coverage(sentence, 2) for opinion in opinions for sentence in opinion])
coverage_n = sum([get_token_coverage(sentence, 3) for opinion in opinions for sentence in opinion])

print("Tokens with at least one label:")
print(f"Michael: {coverage_m} ({coverage_m/token_total*100:.2f}%)")
print(f"Nathan: {coverage_n} ({coverage_n/token_total*100:.2f}%)")

Tokens with at least one label:
Michael: 4616 (50.68%)
Nathan: 4275 (46.93%)



### How many spans did each annotator annotate?



## Agreement

### Agreement Overall



#### Gamma


In [7]:
def get_opinion_gamma(opinion, excluded_categories=()):
    """Compute gamma between Michael (column 2) and Nathan (column 3) for one opinion."""
    continuum = Continuum()
    offset = 0
    for sentence in opinion:
        annotations_m = inception_tsv.get_sentence_annotations(sentence, annotation_column=2)
        annotations_n = inception_tsv.get_sentence_annotations(sentence, annotation_column=3)
        for annotation in annotations_m:
            category, start, end = annotation[0], annotation[1], annotation[2]
            if category in excluded_categories:
                continue
            continuum.add("m", Segment(start+offset, end+offset+1), category)
        for annotation in annotations_n:
            category, start, end = annotation[0], annotation[1], annotation[2]
            if category in excluded_categories:
                continue
            continuum.add("n", Segment(start+offset, end+offset+1), category)
        offset += len(sentence)
    dissim = CombinedCategoricalDissimilarity(alpha=1, beta=1)
    # .005 is a pretty intense precision value (default is .02)
    # Lower is more precise, but more compute-intensive
    gamma_results = continuum.compute_gamma(dissim, precision_level=.005)
    return gamma_results.gamma
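
In `get_opinion_gamma`, annotation indices appear to be sentence-local with an inclusive end token: `offset` shifts them to opinion-level positions, and the `+1` makes the end exclusive. Below is a small sketch of that mapping in plain Python, assuming annotations are `(category, start, end)` tuples as unpacked above. The function name `to_document_spans` and the toy input are hypothetical.

```python
def to_document_spans(sentence_lengths, sentence_annotations):
    """Map sentence-local (category, start, end) spans to opinion-level spans.

    `end` is assumed to be an inclusive token index, so the returned spans are
    half-open [start, end) intervals, mirroring Segment(start, end + 1).
    """
    spans = []
    offset = 0
    for length, annotations in zip(sentence_lengths, sentence_annotations):
        for category, start, end in annotations:
            spans.append((category, start + offset, end + offset + 1))
        offset += length  # The next sentence starts after this one's tokens
    return spans


# Two toy sentences of 5 and 4 tokens.
spans = to_document_spans(
    [5, 4],
    [[("Direct Quote", 1, 3)], [("Legal Source", 0, 0)]],
)
print(spans)  # [('Direct Quote', 1, 4), ('Legal Source', 5, 6)]
```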

In [8]:
opinion_gammas = []
for opinion in opinions:
    opinion_gammas.append(round(get_opinion_gamma(opinion), 3))

print("Gamma for each opinion in agreement study: ", opinion_gammas)

Gamma for each opinion in agreement study:  [0.808, 0.825, 0.866]


Gamma is best calculated at the document level, which is what we did above. To get an overall gamma measurement for the agreement study, we take a token-weighted average: each opinion's gamma is weighted by that opinion's share of the total token count.
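
Written out, with $\gamma_i$ the gamma score for opinion $i$ and $n_i$ its token count (notation introduced here for illustration only), the token-weighted average is:

$$
\bar{\gamma} = \sum_{i} \gamma_i \cdot \frac{n_i}{\sum_{j} n_j}
$$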

In [9]:
opinion_token_counts = [sum([len(sentence) for sentence in opinion]) for opinion in opinions]

tokens_total = sum(opinion_token_counts)
weighted_gammas = []
for gamma, token_count in zip(opinion_gammas, opinion_token_counts):
    weighted_gammas.append(gamma * (token_count / tokens_total))
weighted_average_gamma = sum(weighted_gammas)
print(f"Token-weighted gamma average for whole agreement study: {weighted_average_gamma:.3f}")

Token-weighted gamma average for whole agreement study: 0.830


In [10]:
opinion_gammas_no_dq_or_les = []
for opinion in opinions:
    opinion_gammas_no_dq_or_les.append(round(get_opinion_gamma(opinion, excluded_categories=["Direct Quote", "Legal Source"]), 3))

print("Gamma for each opinion, excluding Direct Quote and Legal Source: ", opinion_gammas_no_dq_or_les)

Gamma for each opinion, excluding Direct Quote and Legal Source:  [0.697, 0.747, 0.743]


In [11]:
opinion_token_counts = [sum([len(sentence) for sentence in opinion]) for opinion in opinions]

tokens_total = sum(opinion_token_counts)
weighted_gammas = []
for gamma, token_count in zip(opinion_gammas_no_dq_or_les, opinion_token_counts):
    weighted_gammas.append(gamma * (token_count / tokens_total))
weighted_average_gamma = sum(weighted_gammas)
print(f"Token-weighted gamma average, excluding Direct Quote and Legal Source: {weighted_average_gamma:.3f}")

Token-weighted gamma average, excluding Direct Quote and Legal Source: 0.724



#### P, R, F1



In [12]:
table_output = ""
for category in categories.ORDERED_CATEGORIES:
    token_labels_m = []
    token_labels_n = []
    for opinion in opinions:
        for sentence in opinion:
            # Create single-class labels for each token for Michael and Nathan
            for token in sentence:
                label_dict_m = token[2]
                label_dict_n = token[3]
                if category in label_dict_m["categories"]:
                    token_labels_m.append(1)
                else:
                    token_labels_m.append(0)
                if category in label_dict_n["categories"]:
                    token_labels_n.append(1)
                else:
                    token_labels_n.append(0)
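    if sum(token_labels_m) > 0 and sum(token_labels_n) > 0:
        # Michael's labels are passed as y_true and Nathan's as y_pred.
        # average="macro" takes the unweighted mean over both classes (0 and 1).
        p, r, f1, _ = precision_recall_fscore_support(token_labels_m, token_labels_n, average="macro")
        table_output += f"{category} & {p:0.3f} & {r:0.3f} & {f1:0.3f}\\\\\n"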
output_path = Path("results/tables/agreement_study_p_r_f1.txt")
with output_path.open("w", encoding="utf-8") as f:
    f.write(table_output)
print(f"Table saved to {output_path.as_posix()}")

Table saved to results/tables/agreement_study_p_r_f1.txt


### Span-level Exact Match

For each of Michael's annotations, did Nathan have an identical span?

If we treat Michael's annotations as gold, `precision = matches / nathan_count` and `recall = matches / michael_count`.
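
Below is a minimal sketch of this computation on toy data, assuming each span is a `(category, start, end)` tuple with opinion-level token positions. The function name `span_exact_match` and the example spans are hypothetical and are not results from the study. Spans are compared as sets, so duplicate identical spans from one annotator would be counted once.

```python
def span_exact_match(gold_spans, predicted_spans):
    """Return (matches, precision, recall, f1) for exact span matches.

    Spans are hashable tuples such as (category, start, end). A match requires
    an identical category and identical boundaries.
    """
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    matches = len(gold & predicted)
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return matches, precision, recall, f1


# Toy spans: Michael treated as gold, Nathan as predicted.
michael_spans = [("Direct Quote", 1, 4), ("Legal Source", 5, 6), ("Legal Source", 10, 12)]
nathan_spans = [("Direct Quote", 1, 4), ("Legal Source", 5, 7)]
matches, p, r, f1 = span_exact_match(michael_spans, nathan_spans)
print(matches, round(p, 3), round(r, 3), round(f1, 3))  # 1 0.5 0.333 0.4
```

When pooling spans across opinions, adding the opinion index to each tuple would keep spans from different opinions from matching by coincidence.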




### Agreement By Category

## Qualitative Analysis