# Annotation Artifacts

- 📺 **Video:** [https://youtu.be/RXYaMZcDIWU](https://youtu.be/RXYaMZcDIWU)

## Overview
- Identify spurious correlations introduced by dataset annotation shortcuts.
- Guard against models exploiting artifacts instead of the intended signal.

## Key ideas
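- **Annotator bias:** crowdworkers fall back on heuristics when writing examples, which leaves lexical cues that correlate with the label.
- **Artifact detection:** compare word frequencies conditioned on the label (for example, with PMI) to find tokens that predict the label on their own.
- **Adversarial filtering:** remove or rebalance examples with strong artifacts.
- **Evaluation:** compare model performance on artifact-heavy vs. artifact-free subsets.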
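
The filtering and evaluation ideas above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's method: the cue set `ARTIFACT_TOKENS`, the helper names, the toy `shortcut_model`, and the four test sentences are all hypothetical, and token matching is a much simpler stand-in for model-based adversarial filtering.

```python
# Hypothetical set of suspected artifact tokens; in practice this would come
# from a detection step such as ranking tokens by PMI with the label.
ARTIFACT_TOKENS = {"because"}

def has_artifact(text, artifact_tokens=ARTIFACT_TOKENS):
    """Return True if the text contains any suspected artifact token."""
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return bool(tokens & artifact_tokens)

def split_by_artifact(examples):
    """Partition (text, label) pairs into artifact-heavy and artifact-free lists."""
    heavy = [ex for ex in examples if has_artifact(ex[0])]
    free = [ex for ex in examples if not has_artifact(ex[0])]
    return heavy, free

def accuracy(predict, examples):
    """Fraction of examples where predict(text) equals the gold label."""
    if not examples:
        return float("nan")
    return sum(predict(text) == label for text, label in examples) / len(examples)

def shortcut_model(text):
    # Toy classifier that ignores meaning: label 0 if the cue is present, else 1.
    return 0 if has_artifact(text) else 1

# Made-up test set for illustration only.
test_set = [
    ("Because the sample is small, the answer is no.", 0),
    ("Because records are missing, we cannot confirm it.", 0),
    ("The report does not support the claim.", 0),
    ("Independent labs confirm the finding.", 1),
]

heavy, free = split_by_artifact(test_set)
acc_heavy = accuracy(shortcut_model, heavy)
acc_free = accuracy(shortcut_model, free)
print(f"artifact-heavy: n={len(heavy)}, accuracy={acc_heavy:.2f}")
print(f"artifact-free:  n={len(free)}, accuracy={acc_free:.2f}")
```

A large gap between the two accuracies suggests the model leans on the cue rather than the intended signal. The `free` list is also what the simplest form of filtering would keep; rebalancing would instead add or reweight examples so the cue appears with both labels.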

## Demo
Compute pointwise mutual information (PMI) between words and labels on a small synthetic dataset to surface artifacts, following the lecture (https://youtu.be/Qtd4QktaSWk).
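
For reference, the standard definition of PMI between a token $w$ and a label $y$ is shown below. The count-based estimates, where each probability is a fraction of the $N$ examples and a token counts at most once per example, are a reading of the demo code; the symbols $w$, $y$, $N$, and $c(\cdot)$ are notation introduced here.

$$
\mathrm{PMI}(w, y) = \log \frac{p(w, y)}{p(w)\,p(y)},
\qquad
p(w, y) \approx \frac{c(w, y)}{N},\quad
p(w) \approx \frac{c(w)}{N},\quad
p(y) \approx \frac{c(y)}{N}
$$

A token that appears only in examples with label $y$ has $p(w, y) = p(w)$, so its PMI reduces to $-\log p(y)$ no matter how often it occurs. With three of five examples labeled 0, that is $-\log 0.6 \approx 0.51$ using the natural logarithm.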

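```python
from collections import Counter
import math

# Synthetic (text, label) pairs: every label-0 sentence starts with "Because".
examples = [
    ('Because evidence is limited, the answer is no.', 0),
    ('Yes, the claim is supported by multiple studies.', 1),
    ('Because witnesses disagree, we cannot confirm it.', 0),
    ('The results clearly indicate yes.', 1),
    ('Because the document is missing, answer no.', 0)
]

word_counts = Counter()   # number of examples containing each token
label_counts = Counter()  # number of examples per label
joint_counts = Counter()  # number of examples containing a token, per label

for text, label in examples:
    # Whitespace tokenization; punctuation stays attached (e.g. 'no.', 'yes,').
    # Using a set counts each token at most once per example.
    tokens = set(text.lower().split())
    label_counts[label] += 1
    for token in tokens:
        word_counts[token] += 1
        joint_counts[(token, label)] += 1

for token, count in word_counts.items():
    p_token = count / len(examples)
    for label in label_counts:
        joint = joint_counts[(token, label)] / len(examples)
        if joint == 0:
            # log(0) is undefined, so skip pairs that never co-occur.
            continue
        p_label = label_counts[label] / len(examples)
        # PMI = log( p(token, label) / (p(token) * p(label)) ), natural log.
        pmi = math.log(joint / (p_token * p_label))
        print(f"PMI(token='{token}', label={label}) = {pmi:.2f}")
```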


Output (the token order may differ between runs because tokens are collected in a `set`):

```text
PMI(token='no.', label=0) = 0.51
PMI(token='answer', label=0) = 0.51
PMI(token='evidence', label=0) = 0.51
PMI(token='limited,', label=0) = 0.51
PMI(token='is', label=0) = 0.11
PMI(token='is', label=1) = -0.18
PMI(token='the', label=0) = -0.18
PMI(token='the', label=1) = 0.22
PMI(token='because', label=0) = 0.51
PMI(token='multiple', label=1) = 0.92
PMI(token='supported', label=1) = 0.92
PMI(token='yes,', label=1) = 0.92
PMI(token='claim', label=1) = 0.92
PMI(token='by', label=1) = 0.92
PMI(token='studies.', label=1) = 0.92
PMI(token='cannot', label=0) = 0.51
PMI(token='it.', label=0) = 0.51
PMI(token='witnesses', label=0) = 0.51
PMI(token='confirm', label=0) = 0.51
PMI(token='disagree,', label=0) = 0.51
PMI(token='we', label=0) = 0.51
PMI(token='yes.', label=1) = 0.92
PMI(token='indicate', label=1) = 0.92
PMI(token='results', label=1) = 0.92
PMI(token='clearly', label=1) = 0.92
PMI(token='missing,', label=0) = 0.51
PMI(token='document', label=0) = 0.51
```
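
In a dataset this small, any token that occurs in a single example receives the maximum possible PMI for its label, and attached punctuation splits `yes,` from `yes.`. One possible refinement, sketched here as an assumption rather than part of the lecture demo, is to strip punctuation and ignore tokens seen in fewer than `min_count` examples. The names `tokenize`, `pmi_table`, and `min_count` are hypothetical.

```python
from collections import Counter
import math
import re

def tokenize(text):
    """Lowercase the text and return its unique word tokens without punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))

def pmi_table(examples, min_count=2):
    """Return {(token, label): PMI} for tokens found in at least min_count examples."""
    n = len(examples)
    word_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[token] += 1
            joint_counts[(token, label)] += 1

    table = {}
    for (token, label), joint in joint_counts.items():
        if word_counts[token] < min_count:
            continue  # too rare for the estimate to mean much
        p_joint = joint / n
        p_token = word_counts[token] / n
        p_label = label_counts[label] / n
        table[(token, label)] = math.log(p_joint / (p_token * p_label))
    return table

# Same synthetic dataset as the demo.
examples = [
    ('Because evidence is limited, the answer is no.', 0),
    ('Yes, the claim is supported by multiple studies.', 1),
    ('Because witnesses disagree, we cannot confirm it.', 0),
    ('The results clearly indicate yes.', 1),
    ('Because the document is missing, answer no.', 0),
]

table = pmi_table(examples)
# Sort by PMI (descending), then by token and label for a stable order.
for (token, label), value in sorted(table.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"PMI(token='{token}', label={label}) = {value:.2f}")
```

With this tokenization, `yes,` and `yes.` merge into a single token `yes`, and one-off words such as `evidence` no longer appear, which leaves a shorter list where `because`, `answer`, `no`, and `yes` stand out. The threshold of 2 is arbitrary and only suits a toy dataset.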


## Try it
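- Modify the demo, for example by stripping punctuation before counting so that `yes,` and `yes.` count as the same token.
- Add a tiny dataset or a counter-example, such as a label-1 sentence that begins with "Because", and check how the PMI values change.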
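
As one way to try a counter-example, the sketch below adds a single label-1 sentence that starts with "Because" and recomputes the PMI between `because` and label 0. The helper `pmi` and the added sentence are hypothetical, and tokenization is plain whitespace splitting as in the demo.

```python
import math

def pmi(token, label, examples):
    """PMI between the presence of a token and a label, estimated from counts."""
    n = len(examples)
    present = [token in text.lower().split() for text, _ in examples]
    n_token = sum(present)
    n_label = sum(1 for _, y in examples if y == label)
    n_joint = sum(1 for hit, (_, y) in zip(present, examples) if hit and y == label)
    if n_joint == 0:
        return float("-inf")  # the token never occurs with this label
    return math.log((n_joint * n) / (n_token * n_label))

# Same synthetic dataset as the demo.
examples = [
    ('Because evidence is limited, the answer is no.', 0),
    ('Yes, the claim is supported by multiple studies.', 1),
    ('Because witnesses disagree, we cannot confirm it.', 0),
    ('The results clearly indicate yes.', 1),
    ('Because the document is missing, answer no.', 0),
]

# Hypothetical counter-example: the cue word now also appears with label 1.
counter_example = ('Because two trials agree, the answer is yes.', 1)

before = pmi('because', 0, examples)
after = pmi('because', 0, examples + [counter_example])
print(f"PMI(because, 0) before: {before:.2f}")
print(f"PMI(because, 0) after:  {after:.2f}")
```

The value drops from about 0.51 to about 0.41, since `because` is no longer exclusive to label 0. Adding more such sentences would push it further toward zero.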


## References
- [The Mythos of Model Interpretability](https://arxiv.org/pdf/1606.03490.pdf)
- [Deep Unordered Composition Rivals Syntactic Methods for Text Classification](https://www.aclweb.org/anthology/P15-1162/)
- [Analysis Methods in Neural Language Processing: A Survey](https://arxiv.org/pdf/1812.08951.pdf)
- ["Why Should I Trust You?" Explaining the Predictions of Any Classifier](https://arxiv.org/pdf/1602.04938.pdf)
- [Axiomatic Attribution for Deep Networks](https://arxiv.org/pdf/1703.01365.pdf)
- [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/pdf/1905.05950.pdf)
- [What Do You Learn From Context? Probing For Sentence Structure In Contextualized Word Representations](https://arxiv.org/pdf/1905.06316.pdf)
- [Annotation Artifacts in Natural Language Inference Data](https://www.aclweb.org/anthology/N18-2017/)
- [Hypothesis Only Baselines in Natural Language Inference](https://www.aclweb.org/anthology/S18-2023/)
- [Did the Model Understand the Question?](https://www.aclweb.org/anthology/P18-1176/)
- [Swag: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](https://www.aclweb.org/anthology/D18-1009.pdf)
- [Generating Visual Explanations](https://arxiv.org/pdf/1603.08507.pdf)
- [e-SNLI: Natural Language Inference with Natural Language Explanations](https://arxiv.org/abs/1812.01193)
- [Explaining Question Answering Models through Text Generation](https://arxiv.org/pdf/2004.05569.pdf)
- [Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems](https://arxiv.org/abs/1705.04146)
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
- [The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning](https://arxiv.org/abs/2205.03401)
- [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916)
- [Complementary Explanations for Effective In-Context Learning](https://arxiv.org/pdf/2211.13892.pdf)
- [PAL: Program-aided Language Models](https://arxiv.org/abs/2211.10435)
- [Measuring and Narrowing the Compositionality Gap in Language Models](https://arxiv.org/abs/2210.03350)


*Links only; we do not redistribute slides or papers.*