# Agreement Study

We randomly selected three cases of medium length (70-150 sentences) for double annotation to assess inter-annotator agreement under the revised annotation schema and guidelines. Annotation for this study was conducted by Michael and Nathan.

This notebook covers:

- High-level stats about the annotations and disagreements
- Inter-annotator agreement (IAA) metrics, including F1 and Gamma
- Qualitative analysis of the disagreements

Cases for agreement study:

- 12625853_mixed_alito
- 12628561_ootc_sotomayor
- 12625931_dissent_thomas

## Setup


In [1]:
%cd -q ../..

In [2]:
from pathlib import Path

from pyannote.core import Segment
from pygamma_agreement import CombinedCategoricalDissimilarity, Continuum
from sklearn.metrics import precision_recall_fscore_support

from curiam import categories
from curiam.document import Sentence
from curiam.preprocessing import inception_tsv

## Summary Stats


In [3]:
agreement_path = Path("data/main/agreement_study")

# Each of these is a list of opinions; each opinion is a list of sentences; each sentence is a list of tokens.
# e.g. opinions_m[0][0][0] is the 0-th token of the 0-th sentence of the 0-th opinion in the agreement study.
# The paths are sorted so that both annotators' opinions are loaded in the same order;
# glob() makes no ordering guarantee, and later cells pair the two lists with zip().
opinions_m = [inception_tsv.process_opinion_file(opinion_path, opinion_path.name)
              for opinion_path in sorted(agreement_path.joinpath("michael").glob("*.tsv"))]

opinions_n = [inception_tsv.process_opinion_file(opinion_path, opinion_path.name)
              for opinion_path in sorted(agreement_path.joinpath("nathan").glob("*.tsv"))]


### How many sentences?


In [4]:
sum([len(opinion) for opinion in opinions_m])

332


### How many tokens?


In [5]:
token_total = sum([len(sentence) for opinion in opinions_m for sentence in opinion])
print(token_total)

9109


### How many tokens received at least one label?

In [6]:
def get_token_coverage(sentence: Sentence):
    return sum([1 if token.annotations is not None
                else 0 for token in sentence])

coverage_m = sum([get_token_coverage(sentence) for opinion in opinions_m for sentence in opinion])
coverage_n = sum([get_token_coverage(sentence) for opinion in opinions_n for sentence in opinion])

print("Tokens with at least one label:")
print(f"Michael: {coverage_m} ({coverage_m/token_total*100:.2f}%)")
print(f"Nathan: {coverage_n} ({coverage_n/token_total*100:.2f}%)")

Tokens with at least one label:
Michael: 4616 (50.68%)
Nathan: 4275 (46.93%)



### How many spans did each annotator annotate?
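
One way to answer this is to count annotation spans per category for each annotator. The block below is a minimal sketch under stated assumptions, not the notebook's own code: `Annotation` is a hypothetical stand-in that mirrors the `Annotation(category=..., start=..., end=...)` objects returned by `sentence.get_annotations()` in the Gamma section, `count_spans` is an illustrative helper, and the toy data is made up.

```python
from collections import Counter
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for the span objects returned by get_annotations()."""
    category: str
    start: int
    end: int


def count_spans(opinions):
    """Count spans per category.

    `opinions` is a list of opinions, each a list of sentences,
    each sentence given here as a list of Annotation spans.
    """
    counts = Counter()
    for opinion in opinions:
        for sentence_spans in opinion:
            for span in sentence_spans:
                counts[span.category] += 1
    return counts


# Made-up toy data: one opinion with two sentences
toy_opinions = [[
    [Annotation("Legal Source", 0, 1)],
    [Annotation("Direct Quote", 2, 5), Annotation("Legal Source", 6, 7)],
]]
toy_counts = count_spans(toy_opinions)
print(sum(toy_counts.values()), dict(toy_counts))
```

In the notebook itself, the per-sentence span lists would presumably come from `sentence.get_annotations()` for each sentence in `opinions_m` and `opinions_n`.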



## Agreement

### Agreement Overall



#### Gamma


In [7]:
opinions_m[0].sentences[4].get_annotations()

[Annotation(category='Legal Source', start=0, end=1)]

In [8]:
opinions_m[0].sentences[4]

Sentence(id=4, tokens=[Token(id=0, text='Ibid', annotations=[TokenAnnotation(id=0, category='Legal Source')]), Token(id=1, text='.', annotations=[TokenAnnotation(id=0, category='Legal Source')])])

In [9]:
def get_opinion_gamma(opinion_m, opinion_n, excluded_categories=()):
    continuum = Continuum()
    # Running token offset that turns sentence-relative indices into opinion-level positions
    offset = 0
    for sentence_m, sentence_n in zip(opinion_m, opinion_n):
        for annotator, sentence in (("m", sentence_m), ("n", sentence_n)):
            for annotation in sentence.get_annotations():
                if annotation.category in excluded_categories:
                    continue
                # Annotation.end appears to be inclusive (see the 'Ibid .' example above),
                # so add 1 to get the exclusive end of the segment
                continuum.add(annotator,
                              Segment(annotation.start + offset, annotation.end + offset + 1),
                              annotation.category)
        offset += len(sentence_m)
    dissim = CombinedCategoricalDissimilarity(alpha=1, beta=1)
    # .005 is a pretty intense precision value (default is .02)
    # Lower is more precise, but more compute-intensive
    gamma_results = continuum.compute_gamma(dissim, precision_level=.005)
    return gamma_results.gamma

In [10]:
opinion_gammas = []
for opinion_m, opinion_n in zip(opinions_m, opinions_n):
    opinion_gammas.append(round(get_opinion_gamma(opinion_m, opinion_n), 5))

print("Gamma for each opinion in agreement study: ", opinion_gammas)

Gamma for each opinion in agreement study:  [0.80818, 0.82429, 0.86519]


Gamma is meant to be calculated at the document level, which is what we did above. To get a single gamma value for the whole agreement study, we take a token-weighted average: each opinion's gamma is weighted by that opinion's share of the total token count.

In [11]:
opinion_token_counts = [sum([len(sentence) for sentence in opinion]) for opinion in opinions_m]

tokens_total = sum(opinion_token_counts)
weighted_gammas = []
for gamma, token_count in zip(opinion_gammas, opinion_token_counts):
    # Weight each opinion's gamma by its share of all tokens
    weighted_gammas.append(gamma * (token_count / tokens_total))
weighted_average_gamma = sum(weighted_gammas)
print(f"Token-weighted gamma average for whole agreement study: {weighted_average_gamma:.3f}")

Token-weighted gamma average for whole agreement study: 0.830

The next two cells repeat both calculations with the `Direct Quote` and `Legal Source` categories excluded.


In [12]:
gammas_no_dq_or_les = []
for opinion_m, opinion_n in zip(opinions_m, opinions_n):
    gammas_no_dq_or_les.append(round(get_opinion_gamma(opinion_m, opinion_n,
                                                       excluded_categories=["Direct Quote", "Legal Source"]), 3))

print("Gamma for each opinion in agreement study: ", gammas_no_dq_or_les)

Gamma for each opinion in agreement study:  [0.695, 0.745, 0.742]


In [13]:
opinion_token_counts = [sum([len(sentence) for sentence in opinion]) for opinion in opinions_m]

tokens_total = sum(opinion_token_counts)
weighted_gammas = []
for gamma, token_count in zip(gammas_no_dq_or_les, opinion_token_counts):
    # Weight each opinion's gamma by its share of all tokens
    weighted_gammas.append(gamma * (token_count / tokens_total))
weighted_average_gamma = sum(weighted_gammas)
print(f"Token-weighted gamma average for whole agreement study: {weighted_average_gamma:.3f}")

Token-weighted gamma average for whole agreement study: 0.723



#### P, R, F1



In [14]:
def get_token_labels(opinions, category):
    # Single-class labels: 1 if the token carries this category, else 0
    return [1 if category in token.get_categories() else 0
            for opinion in opinions
            for sentence in opinion
            for token in sentence]

table_output = ""
for category in categories.ORDERED_CATEGORIES:
    token_labels_m = get_token_labels(opinions_m, category)
    token_labels_n = get_token_labels(opinions_n, category)
    # Skip categories that either annotator never used
    if sum(token_labels_m) > 0 and sum(token_labels_n) > 0:
        # Michael's labels are passed as y_true and Nathan's as y_pred.
        # average="macro" takes the unweighted mean over the 0 and 1 classes.
        p, r, f1, _ = precision_recall_fscore_support(token_labels_m, token_labels_n, average="macro")
        # One LaTeX table row per category
        table_output += f"{category} & {p:0.3f} & {r:0.3f} & {f1:0.3f}\\\\\n"
output_path = Path("results/tables/agreement_study_p_r_f1.txt")
# Create the output directory if it does not exist yet
output_path.parent.mkdir(parents=True, exist_ok=True)
with output_path.open("w", encoding="utf-8") as f:
    f.write(table_output)
print(f"Table saved to {output_path.as_posix()}")

Table saved to results/tables/agreement_study_p_r_f1.txt
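
Because the cell above passes `average="macro"` on binary token labels, each reported score is the unweighted mean of the score for the "has this category" class (1) and the "does not have it" class (0). The pure-Python sketch below illustrates that averaging on made-up labels. `per_class_prf` and `macro_prf` are hypothetical helper names, and the code is meant to mirror what scikit-learn computes in this setting rather than replace it.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    pred_count = sum(1 for p in y_pred if p == label)
    true_count = sum(1 for t in y_true if t == label)
    precision = tp / pred_count if pred_count else 0.0
    recall = tp / true_count if true_count else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_prf(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class precision, recall and F1."""
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    return tuple(sum(score[i] for score in scores) / len(scores) for i in range(3))


# Made-up token labels for a single category (not data from the study)
toy_labels_m = [1, 1, 0, 0, 0, 0, 0, 0]
toy_labels_n = [1, 0, 0, 0, 0, 0, 0, 0]

positive_only = per_class_prf(toy_labels_m, toy_labels_n, label=1)
macro = macro_prf(toy_labels_m, toy_labels_n)
print("positive class only:", [round(x, 3) for x in positive_only])
print("macro over 0 and 1: ", [round(x, 3) for x in macro])
```

When a category covers only a small share of tokens, the 0 class is usually easy to agree on, so macro-averaged scores tend to sit above the positive-class scores, as the toy example shows. This is worth keeping in mind when comparing categories in the saved table.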


### Span-level Exact Match

For each of Michael's annotations, did Nathan have an identical span?

If we treat Michael's annotations as gold, `precision = matches / nathan_count` and `recall = matches / michael_count`.
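
The definitions above can be sketched as follows. This is an illustration under assumptions, not the notebook's implementation: `Span` is a hypothetical record, a match is assumed to require the same sentence, category, start and end, and the toy spans are made up.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical span record; sentence_id keeps spans from different sentences apart."""
    sentence_id: int
    category: str
    start: int
    end: int


def exact_match_scores(spans_m, spans_n):
    """Span-level exact match, treating Michael's spans as gold."""
    gold = set(spans_m)
    predicted = set(spans_n)
    matches = len(gold & predicted)
    precision = matches / len(predicted) if predicted else 0.0  # matches / nathan_count
    recall = matches / len(gold) if gold else 0.0               # matches / michael_count
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up toy spans: one identical span, one boundary disagreement, one span Nathan lacks
toy_spans_m = [
    Span(0, "Legal Source", 0, 1),
    Span(1, "Direct Quote", 3, 7),
    Span(2, "Legal Source", 4, 4),
]
toy_spans_n = [
    Span(0, "Legal Source", 0, 1),
    Span(1, "Direct Quote", 3, 6),
]
em_precision, em_recall, em_f1 = exact_match_scores(toy_spans_m, toy_spans_n)
print(f"P={em_precision:.3f} R={em_recall:.3f} F1={em_f1:.3f}")
```

Exact match is strict: the `Direct Quote` spans above differ by a single token at the right boundary and still count as a miss for both annotators.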




### Agreement By Category

## Qualitative Analysis