# Reliable Sequence Annotations

Speaker: Jonas Molina Ramirez
    
Date: 14th Nov 2019

# Content

1. Annotation processes
2. Active learning with Bayesian Sequence Combination (SiGu, 2019)
  - Annotator models
  - Bayesian Sequence Combination (BSC)
  - Experiments
3. Iterative reliability testing (Artstein, 2017)
  - Rationale
  - Measures
  - Problems
4. Summary

# Literature

(Artstein, 2017): Artstein. Inter-annotator agreement. http://artstein.org/publications/inter-annotator-preprint.pdf, 2017.
        
(SiGu, 2019): Simpson and Gurevych. A Bayesian Approach for Sequence Tagging with Crowds. In EMNLP 2019.


# 1. Annotation processes

# Goal of annotation process

- Find reliable labels for tokens in documents
- Here: BIO scheme
  - Begin
  - Inside
  - Outside
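
As a rough illustration of the BIO scheme, the sketch below converts token spans into tags. The helper name `spans_to_bio`, the half-open span convention, and the example sentence are assumptions made for this illustration only.

```python
from typing import List, Tuple

def spans_to_bio(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert half-open token spans [start, end) into one BIO tag per token."""
    tags = ["O"] * num_tokens          # every token starts as Outside
    for start, end in spans:
        tags[start] = "B"              # first token of a span: Begin
        for k in range(start + 1, end):
            tags[k] = "I"              # remaining tokens of the span: Inside
    return tags

# Hypothetical example sentence with two entity spans.
tokens = ["Angela", "Merkel", "visited", "Paris", "yesterday"]
bio_tags = spans_to_bio(len(tokens), [(0, 2), (3, 4)])
print(list(zip(tokens, bio_tags)))
```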

<img src="./img/annotation-process.png" width="300">

# Problem with deep learning approaches: (SiGu, 2019)
- Data intensive: tens of thousands (or more) of labelled documents needed for training
- New tasks/new domains: labelling is time-consuming and costly (experts must be paid)

# Solution:
- Crowdsourcing: large number of untrained workers annotate documents

# New problem:
- Multiple sequences of unreliable labels per document
- Solve, e.g., via iterative reliability testing or active learning with BSC


# Two iterative approaches to create reliable annotations

![title](./img/approaches.png)

# 2. Active learning with BSC

Based on (SiGu, 2019)

# Active learning with BSC

![title](./img/active-learning.png)

# Recap: Bayes' theorem

Likelihood function: $P(B \mid A)$

Prior: $P(A)$

Marginal Likelihood: $P(B)$

Posterior: $P(A \mid B)$

\begin{align}
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
\end{align}
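
A small numeric sketch of the theorem in the annotation setting: how likely is a token truly a "B" given that an annotator labelled it "B"? All probabilities below are made-up values chosen for illustration.

```python
def posterior(likelihood: float, prior: float, marginal: float) -> float:
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Hypothetical numbers, not taken from any dataset.
prior_b = 0.10          # P(A): token is truly "B"
lik_b = 0.80            # P(B | A): annotator says "B" when the token is truly "B"
lik_not_b = 0.05        # annotator says "B" although the token is not "B"

# Marginal likelihood P(B): annotator says "B" at all (law of total probability).
marginal = lik_b * prior_b + lik_not_b * (1 - prior_b)

post = posterior(lik_b, prior_b, marginal)
print(round(post, 2))
```

Under these assumed numbers, a single "B" annotation raises the belief in a true "B" from the prior of 0.10 to a posterior of 0.64.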

# Different likelihood functions

### Probabilistic annotator models

Goal: improve performance through probabilistic model of individual noise/bias

Define likelihood function: annotator chooses label $c_\tau$ given true label $t_\tau$
- $P(c_\tau \mid t_\tau)$

Four existing probabilistic annotator models:
- Accuracy model (acc) (not covered)
- Spamming model (spam) (not covered)
- Confusion vector (CV) (not covered)
- Confusion matrix (CM)

New model:
- Sequential confusion matrix (seq)


# Confusion matrix (CM)

Basic idea:
- Extension of confusion vector
- $J^2$ parameters
- Capture spammers
- Capture different error rates and biases per class
- For each combination of chosen label and true label,
  a different penalty/reward can be modelled

\begin{align}
A = p(c_\tau = i \mid t_\tau = j ,\pi) = \pi_{j,i}
\end{align}

Problem:
- Does not model dependencies within a sequence

| chosen $i$ / true $j$ | B | I | O |
|---|---|---|---|
| B | $\pi_{B,B}$  | $\pi_{I,B}$  | $\pi_{O,B}$  |
| I | $\pi_{B,I}$  | $\pi_{I,I}$  | $\pi_{O,I}$  |
| O | $\pi_{B,O}$  | $\pi_{I,O}$  | $\pi_{O,O}$  |
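
A minimal sketch of how such a confusion matrix yields a likelihood for a whole label sequence when tokens are treated independently. The matrix values and the helper `cm_likelihood` are hypothetical; only the indexing $\pi_{j,i}$ (true label $j$, chosen label $i$) follows the formula above.

```python
LABELS = ["B", "I", "O"]

# Hypothetical confusion matrix of one annotator:
# pi[j][i] = p(chosen label i | true label j); each row sums to one.
pi = {
    "B": {"B": 0.70, "I": 0.20, "O": 0.10},
    "I": {"B": 0.10, "I": 0.80, "O": 0.10},
    "O": {"B": 0.05, "I": 0.05, "O": 0.90},
}

def cm_likelihood(chosen, true):
    """Likelihood of the chosen labels given the true labels, token by token."""
    prob = 1.0
    for c, t in zip(chosen, true):
        prob *= pi[t][c]
    return prob

num_params = len(LABELS) ** 2   # J^2 entries in the matrix
lik = cm_likelihood(["B", "O", "O"], ["B", "I", "O"])   # 0.7 * 0.1 * 0.9
print(num_params, lik)
```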

# Sequential confusion matrix (seq)

Basic idea:
- Extension of confusion matrix
- Includes dependency of a label on its predecessor
- $J^3$ parameters
- Disallowed transitions: a priori near zero likelihood

\begin{align}
A = p(c_\tau = i \mid c_{\tau -1} = l, t_\tau = j ,\pi) = \pi_{j,l,i}
\end{align}


Advantages:
- Can model tendency towards overly long sequences
- Can model tendency to split spans


| $c_{\tau-1} = B$ | B | I | O |
|---|---|---|---|
| B | $\pi_{B,B,B}$  | $\pi_{I,B,B}$  | $\pi_{O,B,B}$  |
| I | $\pi_{B,B,I}$  | $\pi_{I,B,I}$  | $\pi_{O,B,I}$  |
| O | $\pi_{B,B,O}$  | $\pi_{I,B,O}$  | $\pi_{O,B,O}$  |

| $c_{\tau-1} = I$ | B | I | O |
|---|---|---|---|
| B | $\pi_{B,I,B}$  | $\pi_{I,I,B}$  | $\color{red}{\pi_{O,I,B}}$  |
| I | $\pi_{B,I,I}$  | $\pi_{I,I,I}$  | $\color{red}{\pi_{O,I,I}}$  |
| O | $\pi_{B,I,O}$  | $\pi_{I,I,O}$  | $\color{red}{\pi_{O,I,O}}$  |

| $c_{\tau-1} = O$ | B | I | O |
|---|---|---|---|
| B | $\pi_{B,O,B}$  | $\pi_{I,O,B}$  | $\pi_{O,O,B}$  |
| I | $\pi_{B,O,I}$  | $\pi_{I,O,I}$  | $\pi_{O,O,I}$  |
| O | $\pi_{B,O,O}$  | $\pi_{I,O,O}$  | $\pi_{O,O,O}$  |
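
The sequential model can be sketched as a lookup in a $J \times J \times J$ table indexed as $\pi_{j,l,i}$. The values below are placeholders, and treating the label before the first token as "O" is an assumption of this sketch, not something stated on the slides.

```python
LABELS = ["B", "I", "O"]

def uniform_seq_matrix():
    """pi[j][l][i] = p(chosen i | previous chosen l, true j), uniform placeholder."""
    return {j: {l: {i: 1.0 / len(LABELS) for i in LABELS} for l in LABELS}
            for j in LABELS}

pi = uniform_seq_matrix()

# Hypothetical annotator who over-extends spans: after choosing "I",
# they often choose "I" again although the true label is already "O".
pi["O"]["I"] = {"B": 0.05, "I": 0.60, "O": 0.35}

def seq_likelihood(chosen, true, start="O"):
    """Likelihood of a chosen sequence; `start` is the assumed label before token 1."""
    prob, prev = 1.0, start
    for c, t in zip(chosen, true):
        prob *= pi[t][prev][c]
        prev = c                  # the next factor depends on this chosen label
    return prob

num_params = len(LABELS) ** 3     # J^3 entries
lik = seq_likelihood(["B", "I", "I"], ["B", "I", "O"])   # (1/3) * (1/3) * 0.6
print(num_params, lik)
```

A plain confusion matrix could not express this behaviour, because the last factor depends on the annotator's previous choice and not only on the true label.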

# Bayesian Sequence Combination

Goal: Aggregate sequential annotations from multiple annotators

Basic idea: 
- Combination of HMM and annotator model function
- Important assumption: annotators (and their errors) are independent

Inference:
- Variational Bayes, a.k.a. variational inference
  - Allows for online learning: each step updates the model
- Conjugate prior for the multinomial distribution: Dirichlet distribution
- Result: posterior distribution over true sequence labels
  - Not directly usable as a final annotation
- Better: most probable sequence of labels (Viterbi algorithm)
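
To make the last step concrete, here is a generic Viterbi sketch over fixed, made-up probabilities. It is not the variational algorithm of (SiGu, 2019); the function signature, the transition values, and the per-token scores in `emit` (which stand in for the combined annotator likelihoods) are assumptions.

```python
import math

LABELS = ["B", "I", "O"]

def viterbi(init, trans, emit):
    """Most probable label sequence.

    init[j]     : p(t_1 = j)
    trans[l][j] : p(t_tau = j | t_{tau-1} = l)
    emit[tau][j]: likelihood of the observations at position tau given t_tau = j
    """
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    score = {j: log(init[j]) + log(emit[0][j]) for j in LABELS}
    backpointers = []
    for e in emit[1:]:
        new_score, ptr = {}, {}
        for j in LABELS:
            best_prev = max(LABELS, key=lambda l: score[l] + log(trans[l][j]))
            new_score[j] = score[best_prev] + log(trans[best_prev][j]) + log(e[j])
            ptr[j] = best_prev
        backpointers.append(ptr)
        score = new_score

    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda j: score[j])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical parameters; a zero probability marks a transition treated as disallowed.
init = {"B": 0.3, "I": 0.0, "O": 0.7}
trans = {
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.1, "I": 0.5, "O": 0.4},
    "O": {"B": 0.3, "I": 0.0, "O": 0.7},
}
emit = [
    {"B": 0.6, "I": 0.3, "O": 0.1},
    {"B": 0.2, "I": 0.5, "O": 0.3},
    {"B": 0.1, "I": 0.2, "O": 0.7},
]
best_path = viterbi(init, trans, emit)
print(best_path)
```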

# Active learning with BSC

![title](./img/active-learning.png)

# Experiments

# Datasets

All include gold labels and crowdsourced labels

- NER: named-entity recognition
  - Very short spans (often single token)
- PICO: medical paper abstracts
  - Identify population enrolled in clinical trial
- ARG: mixture of argumentative and non-argumentative sentences
  - Mark pro/con arguments
  - Long spans

# Evaluation metrics

NER and ARG: exact F1-score
- Only exact span matches are considered correct

PICO: relaxed F1-score
- Count matching fractions of spans when computing precision and recall
- Partial information useful to identify population

Cross Entropy Error (CEE, log-loss):
- Token-level metric
- Penalises over-confident mistakes
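
A minimal sketch of two of these metrics: exact-match span F1 and token-level cross entropy. The helper names and the toy sequences are assumptions; the relaxed F1 used for PICO is not covered here.

```python
import math

def extract_spans(tags):
    """Return the set of half-open (start, end) spans in a BIO tag sequence."""
    spans, start = set(), None
    for k, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        if tag != "I" and start is not None:
            spans.add((start, k))
            start = None
        if tag == "B":
            start = k
    return spans

def exact_f1(gold_tags, pred_tags):
    """F1 where a predicted span counts only if both boundaries match exactly."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def cross_entropy(gold_tags, pred_probs, eps=1e-12):
    """Mean negative log-probability assigned to the gold label of each token."""
    total = sum(math.log(max(p[g], eps)) for g, p in zip(gold_tags, pred_probs))
    return -total / len(gold_tags)

gold = ["B", "I", "O", "B", "O"]
pred = ["B", "I", "O", "B", "I"]        # second span is one token too long
f1 = exact_f1(gold, pred)

# Same wrong decision ("B" instead of "O"), different confidence.
confident_wrong = cross_entropy(["O"], [{"B": 0.98, "I": 0.01, "O": 0.01}])
hedged_wrong = cross_entropy(["O"], [{"B": 0.50, "I": 0.10, "O": 0.40}])
print(f1, confident_wrong, hedged_wrong)
```

The over-long second span earns no credit under exact matching, and the confident mistake receives the larger cross-entropy penalty.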

# Evaluated Methods

- MV: token-level majority voting
- MACE: spam annotator model
- DS: CM model (Dawid-Skene)
- IBCC: Independent Bayesian Classifier Combination
- HMM-crowd: MAP estimation for the CV model and variational inference for the integrated HMM
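
The simplest baseline in this list, token-level majority voting, can be sketched as follows. The alphabetical tie-break and the example annotations are arbitrary assumptions of this sketch.

```python
from collections import Counter

def majority_vote(annotations):
    """annotations: one tag sequence per annotator, all over the same tokens."""
    result = []
    for votes in zip(*annotations):          # votes for one token position
        counts = Counter(votes)
        best = max(counts.values())
        # Assumed tie-break: alphabetically smallest label among the winners.
        result.append(min(label for label, n in counts.items() if n == best))
    return result

# Hypothetical annotations from three crowd workers.
annotator_1 = ["B", "I", "O", "O"]
annotator_2 = ["B", "O", "O", "O"]
annotator_3 = ["O", "I", "I", "O"]
mv_labels = majority_vote([annotator_1, annotator_2, annotator_3])
print(mv_labels)
```

Because each token is voted on independently, nothing prevents the result from containing an "I" directly after an "O"; sequence-aware methods address exactly this.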



# Aggregation results:

Task: 
- Aggregate crowdsourced labels
- Predict true labels

Compare: 
- HMM-crowd and BSC-CV
- DS and IBCC

![title](./img/aggregate-labels.png)

(Image taken from SiGu, 2019)


# Contributions of (SiGu, 2019)
Bayesian Sequence Combination (BSC)
- “fully-Bayesian method for aggregating sequence labels from multiple annotators”

Sequential Confusion Matrix (seq)
- “model sequential dependencies between annotators’ labels”

Theoretical comparison of probabilistic models
- annotator noise and annotator bias

Empirical evaluation
- three sequence labelling tasks


# Take-away: Active learning with BSC

- Different annotator models available
- Sequential/non-sequential models (apply depending on the task)


# 3. Iterative reliability testing

Based on (Artstein, 2017)


# Iterative reliability testing

Goal:
- Consistent and reproducible (i.e. "reliable") annotation process

How to assess the annotation process?
- Measure inter-annotator agreement on the same source data

![title](./img/iterative-reliability-testing.png)


# What's the rationale?

Rationale for measuring agreement
- Agreement among annotators

demonstrates

- Reliable annotation process

necessary but not sufficient for

- Correct annotations

(Taken from Artstein, 2017)


# Iterative reliability testing

![title](./img/iterative-reliability-testing2.png)

# How to measure reliability?

Compute coefficients:
- They measure the reliability of the annotations **over all annotators**
- Measure agreement: Fleiss's $\kappa$
- Measure disagreement: Krippendorff's $\alpha$

Basic idea:
- Measure amount of (dis)agreement above chance

Drawback:
- A single scalar value cannot capture complex patterns in data/process
- Fine-grained analysis necessary


# Four complicating factors:

- diversity in underlying data
- similarities between labels
- differences in the difficulty of individual items
- differences between individual annotators and annotator populations

# Example: Diversity in underlying data

Heterogeneous data may comprise homogeneous subsets

Example:
- No correlation between weight and age of adult elephants
- No correlation between weight and age of adult mice
- Pool elephants and mice together:
  - Strong correlation between weight and age
  - Elephants are both older and heavier than mice

Annotator agreement regarding weight:
- Chance-level agreement when looking only at mice or only at elephants
- Near perfect agreement when looking at both simultaneously
- Annotators cannot differentiate elephants from elephants
- Annotators cannot differentiate mice from mice
- Annotators can differentiate elephants from mice

# Example: Diversity in underlying data

Heterogeneous data may comprise homogeneous subsets

Example:
- Annotate weight of adult **elephants**
  - Does it correlate with their age? NO
- Annotate weight of adult **mice**
  - Does it correlate with their age? NO
- Annotate weight of adult **elephants** and adult **mice**
  - Does it correlate with their age? YES strongly
  - Elephants are both older and heavier than mice

Problem:
- Annotators cannot differentiate elephants from elephants
- Annotators cannot differentiate mice from mice

Look at homogeneous subsets of the data.
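
The pooling effect can be reproduced with a few made-up data points. The ages, weights, and the hand-written `pearson` helper are illustrative assumptions, constructed so that age and weight are uncorrelated within each species.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up (age in years, weight in kg) pairs.
mice = [(1.0, 0.02), (1.0, 0.03), (2.0, 0.02), (2.0, 0.03)]
elephants = [(30.0, 4000.0), (30.0, 5000.0), (40.0, 4000.0), (40.0, 5000.0)]

def age_weight_correlation(animals):
    ages, weights = zip(*animals)
    return pearson(ages, weights)

r_mice = age_weight_correlation(mice)
r_elephants = age_weight_correlation(elephants)
r_pooled = age_weight_correlation(mice + elephants)
print(r_mice, r_elephants, r_pooled)
```

Within each homogeneous subset the correlation is (numerically) zero, while the pooled data show a strong correlation that only reflects the difference between the two species.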

# Take-away: iterative reliability testing

- Be careful when assessing reliability
- Investigate different aspects of your data/your annotation process

# Summary

We talked about:
- 2 approaches to iteratively create reliable annotations
- Active learning with BSC (SiGu, 2019):
  - Annotator models (likelihood functions)
  - Bayesian Sequence Combination (BSC)
  - Results of the aggregation task
- Iterative reliability testing (Artstein, 2017):
  - Agreement coefficients
  - Complications when measuring inter-annotator agreement


# Summary

# Iterative reliability testing

![title](./img/iterative-reliability-testing2.png)

# Summary

# Active learning with BSC

![title](./img/active-learning.png)

# BACKUP Inter-annotator agreement (Artstein, 2017)

# Measure inter-annotator agreement

- compute single coefficient

- raw/observed agreement: 
  - for 2 annotators, count number of identical labels
  - divide by number of all items to be annotated
 

- Problem:
  - accidental agreement
  - Example: gene sequences


### Measure inter-annotator agreement
- Better: measure amount of agreement above chance
- $A_o$: amount of observed inter-annotator agreement
- $A_e$: amount of inter-annotator agreement expected under a random coding model
- $A_o - A_e$: amount of agreement above chance
- $1 - A_e$: maximal attainable agreement above chance

\begin{align}
\kappa, \pi, \ldots = \frac{A_o - A_e}{1 - A_e}
\end{align}
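
As one concrete instance of this formula, here is a sketch of Fleiss's $\kappa$, with $A_o$ computed as the mean proportion of agreeing annotator pairs per item and $A_e$ from the pooled label distribution. The function name, input layout, and toy data are assumptions of this sketch.

```python
from collections import Counter

def fleiss_kappa(items):
    """items[k] holds the labels that the n annotators gave to item k (same n everywhere)."""
    num_items = len(items)
    n = len(items[0])
    counts = [Counter(labels) for labels in items]

    # A_o: per item, the proportion of annotator pairs that agree, averaged over items.
    a_o = sum(
        sum(c * (c - 1) for c in cnt.values()) / (n * (n - 1)) for cnt in counts
    ) / num_items

    # A_e: chance agreement from the label proportions pooled over all annotators.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    a_e = sum((t / (num_items * n)) ** 2 for t in totals.values())

    return (a_o - a_e) / (1 - a_e)     # undefined if a_e == 1 (only one label used)

# Hypothetical data: four items, three annotators each.
items = [["B", "B", "B"], ["O", "O", "O"], ["B", "B", "O"], ["O", "O", "B"]]
kappa = fleiss_kappa(items)
print(kappa)
```

Here $A_o = 2/3$ and $A_e = 1/2$, so $\kappa = 1/3$.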

# Measure inter-annotator agreement

Other approach: measure disagreement

- $D_o = 1 - A_o$
- $D_e = 1 - A_e$

\begin{align}
\alpha = 1 - \frac{D_o}{D_e}
\end{align}
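
With $D_o = 1 - A_o$ and $D_e = 1 - A_e$ as defined above, that is, assuming every disagreement counts equally, the disagreement form reduces to the agreement form. A short derivation using only these symbols:

\begin{align}
\alpha = 1 - \frac{D_o}{D_e}
       = 1 - \frac{1 - A_o}{1 - A_e}
       = \frac{(1 - A_e) - (1 - A_o)}{1 - A_e}
       = \frac{A_o - A_e}{1 - A_e}
\end{align}

The two forms stop coinciding as soon as a distance function weights some disagreements more heavily than others.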

Advantage: disagreement can be expressed in units other than percentages

# Concrete Coefficients: observed (dis)agreement
- Fleiss's $\kappa$
- Krippendorff's $\alpha$

They differ in the way they compute observed agreement/expected random agreement.

- Fleiss's $\kappa$:
  - observed agreement
  - similar to raw agreement
  - all disagreements are treated equally
- Krippendorff's $\alpha$:
  - observed disagreement
  - distance function per pair of labels
  - scales the disagreement 


### Concrete Coefficients: expected agreement
- Fleiss's $\kappa$
- Krippendorff's $\alpha$

They differ in the way they compute observed agreement/expected random agreement.

- Fleiss's $\kappa$:
  - similar to raw agreement
  - all disagreements are treated equally
- Krippendorff's $\alpha$:
  - distance function per pair of labels
  - scales the disagreement

### Problem of single coefficient

- cannot capture complex aspects of the annotation process
- fine-grained analysis necessary

Four complicating factors:
- diversity in underlying data
- similarities between labels
- differences in the difficulty of individual items
- differences between individual annotators and annotator populations

### Diversity in the underlying data
"When studying annotation of heterogeneous data, 
agreement should be calculated and reported for 
the homogeneous subparts of the data, in 
addition to the data as a whole."

(Artstein, 2017)

### Similarities between the labels
“When annotation labels have an internal structure, 
it may be acceptable to calculate agreement on 
different aspects of the same annotation. This is 
justified when the different aspects reflect separate 
and distinct decisions made by the annotators, thus 
reflecting different facets of a complex annotation process.”

(Artstein, 2017)


### Differences in the difficulty of individual items
"To identify the extent of individual item difficulty, it is recommended to conduct a reliability study with multiple annotators."

(Artstein, 2017)


### Differences between annotators/annot. populations

"In a reliability study with more than two annotators, 
differences between the annotators should be investigated 
by calculating agreement among subgroups of annotators."

(Artstein, 2017)

# BACKUP Annotator models

# Accuracy model (acc)

Basic idea:
- For every annotator, there is a single accuracy parameter $\pi$

\begin{align}
A = p(c_\tau = i \mid t_\tau = j) = \begin{cases}
    \pi , & \text{where $i = j$}\\
    \frac{1 - \pi}{J - 1}, & \text{otherwise}
  \end{cases}
\end{align}

Problem: 
- Spammer can select the most common label
- Results in high accuracy
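
A sketch of the accuracy model and of the spammer problem named above. The value of $\pi$ and the skewed toy data are assumptions.

```python
LABELS = ["B", "I", "O"]

def accuracy_model(pi, labels=LABELS):
    """A[j][i] = p(chosen i | true j) with a single accuracy parameter pi."""
    J = len(labels)
    return {j: {i: (pi if i == j else (1 - pi) / (J - 1)) for i in labels}
            for j in labels}

A = accuracy_model(0.9)

# Hypothetical skewed data: nine of ten tokens are truly "O".
true_labels = ["O"] * 9 + ["B"]
spammer_labels = ["O"] * 10        # spammer always picks the most common label

spammer_accuracy = sum(
    c == t for c, t in zip(spammer_labels, true_labels)
) / len(true_labels)
print(A["B"], spammer_accuracy)
```

The spammer reaches an accuracy of 0.9 without looking at the text, so a single accuracy parameter cannot tell them apart from a careful annotator.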

# Spamming model (spam)

Basic idea:
- For every annotator, there is a single accuracy parameter $\pi$
- If annotator is incorrect: assign label according to "spamming distribution"
- $J+1$ parameters

\begin{align}
A = p(c_\tau = i \mid t_\tau = j ,\pi, \xi) = \begin{cases}
    \pi + (1 - \pi)\xi_i, & \text{where $i = j$}\\
    (1 - \pi)\xi_i, & \text{otherwise}
  \end{cases}
\end{align}

Problem:
- no explicit error rate per class
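
A sketch of the spamming model, under the assumption that the spamming distribution $\xi$ is indexed by the chosen label, so that the probabilities over chosen labels sum to one for every true label. The values of $\pi$ and $\xi$ are made up.

```python
def spam_model(pi, xi):
    """A[j][i] = p(chosen i | true j).

    With probability pi the annotator labels correctly; otherwise the label is
    drawn from the spamming distribution xi (assumed to be over chosen labels).
    """
    return {j: {i: (pi if i == j else 0.0) + (1 - pi) * xi[i] for i in xi}
            for j in xi}

# Hypothetical parameters: J + 1 = 4 numbers for J = 3 labels.
xi = {"B": 0.1, "I": 0.1, "O": 0.8}
spam = spam_model(0.6, xi)
print(spam["B"])
```

For true label "B" this gives roughly 0.64 for "B", 0.04 for "I", and 0.32 for "O": the errors follow $\xi$ regardless of the true class, which is why no explicit per-class error rate is available.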

# Confusion vector (CV)

Basic idea:
- separate accuracy per class label
- parameter vector (instead of scalar value)
- $J$ parameters

\begin{align}
A = p(c_\tau = i \mid t_\tau = j ,\pi) = \begin{cases}
    \pi_j, & \text{where $i = j$}\\
    \frac{1 - \pi_j}{J - 1}, & \text{otherwise}
  \end{cases}
\end{align}

Problem:
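- A spammer may consistently choose one particular incorrect label, making it far more likely than the other incorrect labels
- The model cannot capture this pattern: all incorrect labels share the same probability $\frac{1 - \pi_j}{J - 1}$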
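
A sketch of the confusion vector model that makes this limitation visible. The per-class accuracies are made-up values.

```python
def confusion_vector_model(pi):
    """A[j][i] = p(chosen i | true j) with one accuracy pi[j] per true label."""
    labels = list(pi)
    J = len(labels)
    return {j: {i: (pi[j] if i == j else (1 - pi[j]) / (J - 1)) for i in labels}
            for j in labels}

# Hypothetical per-class accuracies (J = 3 parameters).
cv = confusion_vector_model({"B": 0.6, "I": 0.7, "O": 0.95})
print(cv["B"])
```

For true label "B", the two incorrect labels "I" and "O" both receive probability 0.2; a preference for one of them would need the full confusion matrix.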