# Inter-annotator Agreement v2.1

This notebook is a modified version of the [IAA notebook](https://github.com/evalgenchal/20Y-CHEC/blob/main/inter-annotator-agreement/inter-annotator-agreement-v2_single-annotations.ipynb)
used in the [Twenty Years of Confusion in Human Evaluation](https://www.aclweb.org/anthology/2020.inlg-1.23.pdf)
research paper.

The annotation team formalized our annotation guidelines in July 2020.
We then annotated 10 papers from ACL 2020 according to these new guidelines to get an idea of the degree of inter-annotator agreement;
this gives us a sense of how reliable the new guidelines and our annotations are.
The results were disappointing, so we iterated on our guidelines and annotation spreadsheets and annotated another round of 10 papers.
This document reports on our IAA for these 10 papers.

In this notebook, we import data from our spreadsheets and do a bit of preprocessing so that we can calculate IAA easily using `nltk`.

We calculate Krippendorff's alpha using MASI distance and Jaccard distance.
We also include raw pair-wise agreement scores.
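
As a rough illustration of the two set distances, the sketch below re-implements them in plain Python. It follows our understanding of how `nltk` defines `jaccard_distance` and `masi_distance`; the function names `jaccard` and `masi` and the handling of two empty sets are our own assumptions, and the notebook itself uses the `nltk` versions.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard distance: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:  # two empty sets: treated as identical here (our assumption)
        return 0.0
    return 1.0 - len(a & b) / len(union)


def masi(a: frozenset, b: frozenset) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    union = a | b
    if not union:  # same assumption as above
        return 0.0
    overlap = len(a & b)
    if a == b:
        weight = 1.0   # identical sets
    elif overlap == min(len(a), len(b)):
        weight = 0.67  # one set contains the other
    elif overlap > 0:
        weight = 0.33  # partial overlap
    else:
        weight = 0.0   # disjoint sets
    return 1.0 - (overlap / len(union)) * weight


x = frozenset({"fluency", "clarity"})
y = frozenset({"fluency"})
print(jaccard(x, y), masi(x, y))
```

MASI never gives a smaller distance than Jaccard for the same pair of sets, because it multiplies the Jaccard similarity by a weight of at most 1.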

Note that we are not doing any hypothesis testing here, so you will not see any significance scores.
These are strictly descriptive statistics.

## Preliminaries

Our original annotations were collected using Google Sheets, so we used `gspread` to interact with the sheets, `nltk` to calculate $\alpha$, and `pandas` to manage our data.
These spreadsheets are not public, but the data from them is released in the CSV files in this repo.

The code below can be used to analyse either the 5 "expert" annotators or all 9 annotators' data.

In [1]:
import iaa_utilities
import pandas as pd
import re

from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import edit_distance, jaccard_distance, masi_distance

from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"



In [2]:
# Read commonsense IAA CSV sheet:
annotation_df = pd.read_csv("iaa-commonsense_v2.csv")
annotation_df.head(2)

Unnamed: 0,source_spreadsheet,key,annotator,date_annotated,annotation_comments,exclude,time_taken,pub_venue,pub_authors,pub_year,...,op_instrument_size,op_instrument_type,op_data_type,op_form,op_question_prompt_verbatim,op_statistics,criterion_verbatim,criterion_definition_verbatim,criterion_paraphrase,criterion_external
0,1,2020.acl-main.711,anno1,26/01/21,,,,ACL,Chakrabarty et al.,2020,...,,text annotation,text,evaluation through post-editing/annotation,,,,,,Retrieving Sentences Containing\nCommonsense C...
1,1,2020.acl-main.711,anno1,26/01/21,,,,ACL,Chakrabarty et al.,2020,...,,rank ordering,rank order,relative quality estimation,,,,,------ 39h. Goodness of outputs relative to li...,


### Cleaning the dataset

We keep the code that cleans up superficial differences between annotators, which we needed for the first round of IAA:

1. some annotators left whole rows blank; and
2. the annotator's paraphrase of the definition was often left blank, as was the statistics column. Blank entries compare poorly on set-distance metrics, so we replace them with `~*empty*~`.
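
The effect of the placeholder can be sketched without `pandas`. The helper name `clean_cell` is hypothetical and the cell values are made up; the point is only that blank cells all end up with the same non-empty label, so two blank annotations yield identical one-element sets rather than empty ones.

```python
import math

EMPTY = "~*empty*~"


def clean_cell(value):
    """Map missing or whitespace-only cells to a shared placeholder."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return EMPTY
    text = str(value)
    return EMPTY if text.strip() == "" else text


cells = ["Mean", "  ", None, float("nan")]
cleaned = [clean_cell(c) for c in cells]
print(cleaned)
```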

In [4]:
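# Keep only rows in which at least one annotation column is filled in.
has_values = annotation_df.loc[:, 'system_language':'op_statistics'].any(axis=1)
annotation_df = annotation_df[has_values].copy()

# Replace whitespace-only and missing cells with a shared placeholder.
annotation_df.replace(r"^\s*$", "~*empty*~", regex=True, inplace=True)
annotation_df.fillna("~*empty*~", inplace=True)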

for column in iaa_utilities.IAAv2SpreadsheetScheme.OPEN_CLASS_COLUMNS:
    annotation_df[column] = annotation_df[column].astype(str)
    annotation_df[column] = annotation_df[column].str.lower()

For the second round of IAA, in addition to updating the guidelines, we updated the spreadsheet to have dropdown menus for the criteria names and other columns.
For the `criterion_paraphrase` column, this included the enumeration from the guidelines, as well as value-initial hyphens to get some degree of indentation indicative of the overall hierarchy.
We therefore need to normalize the `criterion_paraphrase` column to remove the hyphens and numbers.
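
A plain-Python sketch of this normalization, using the standard `re` module, might look as follows. The function name `normalise_paraphrase` and the input string are made up for illustration; the notebook applies the same steps column-wise with `pandas`.

```python
import re


def normalise_paraphrase(cell: str) -> str:
    """Strip indentation hyphens and guideline numbering from a dropdown value."""
    cell = cell.lower().replace("-", "")       # note: also removes word-internal hyphens
    cell = re.sub(r"[0-9a-z/]+\.", "", cell)   # numbering like "1a."; also hits any word directly followed by "."
    cell = re.sub(r"\s+", " ", cell)           # collapse runs of whitespace
    return cell.replace(";", ",").strip()


result = normalise_paraphrase("---- 1a. Fluency; Clarity")
print(result)
```

The numbering pattern is deliberately loose, so any token that ends in a full stop is removed along with the enumeration.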

In [5]:
annotation_df['criterion_paraphrase'] = (
    annotation_df['criterion_paraphrase']
    .str.lower()
    .str.replace("-", "", regex=False)
    .str.replace(r"[0-9a-z/]+\.", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.replace(";", ",", regex=False)
    .str.strip()
)


In [6]:
annotation_df.head(2)

Unnamed: 0,source_spreadsheet,key,annotator,date_annotated,annotation_comments,exclude,time_taken,pub_venue,pub_authors,pub_year,...,op_instrument_size,op_instrument_type,op_data_type,op_form,op_question_prompt_verbatim,op_statistics,criterion_verbatim,criterion_definition_verbatim,criterion_paraphrase,criterion_external
0,1,2020.acl-main.711,anno1,26/01/21,~*empty*~,~*empty*~,~*empty*~,ACL,Chakrabarty et al.,2020,...,~*empty*~,text annotation,text,evaluation through post-editing/annotation,~*empty*~,~*empty*~,~*empty*~,~*empty*~,~*empty*~,retrieving sentences containing\ncommonsense c...
1,1,2020.acl-main.711,anno1,26/01/21,~*empty*~,~*empty*~,~*empty*~,ACL,Chakrabarty et al.,2020,...,~*empty*~,rank ordering,rank order,relative quality estimation,~*empty*~,~*empty*~,~*empty*~,~*empty*~,goodness of outputs relative to linguistic con...,~*empty*~


We also need to deal with columns where `multiple (please specify):` is a valid value, so that the likelihood of spurious differences is as low as possible.
Ideally we should standardise the order of the values listed, but we did not do that for the initial INLG submission (hence the empty code cell).
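
One way the ordering could be standardised is sketched below. This step was not part of the analysis reported here; the helper name `sort_multiple` and the assumption that values are comma-separated are ours.

```python
def sort_multiple(cell: str, prefix: str = "multiple (please specify):") -> str:
    """Sort the comma-separated values after `prefix` so that listing order no longer matters."""
    if not cell.lower().startswith(prefix):
        return cell
    values = [v.strip() for v in cell[len(prefix):].split(",") if v.strip()]
    return f"{prefix} {', '.join(sorted(values))}"


first = sort_multiple("multiple (please specify): fluency, clarity")
second = sort_multiple("multiple (please specify): clarity, fluency")
print(first == second)
```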

## Extracting the relevant information

Now that we've prepared the primary dataframe, we can easily extract smaller dataframes which facilitate the analysis.
In particular, this function gives us a three-column DF with the `source_spreadsheet`, `key` (= paper identifier), and the target column,
where we have aggregated all labels given in that column for that paper in the spreadsheet `source_spreadsheet` into a set.
(Using a set means that each label will appear only once; using a `frozenset` makes it immutable.)

This code appears in `iaa_utilities.py`:

    import pandas as pd
    from typing import List, Tuple

    def extract_iaa_df_by_column_name(annotation_df: pd.DataFrame, column_name: str) -> pd.DataFrame:
        """Extract a three-column dataframe with `column_name` items grouped by `source_spreadsheet` and `key`."""
        return annotation_df[['source_spreadsheet', 'key', column_name]]\
            .groupby(['source_spreadsheet', 'key'])[column_name]\
            .apply(frozenset).reset_index()

    def extract_records_for_nltk(iaa_df: pd.DataFrame) -> List[Tuple]:
        """The first column in the `to_records()` representation is an index, which we don't need for `nltk`."""
        return [(b, c, d) for _, b, c, d in iaa_df.to_records()]
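
To illustrate what the grouping produces, here is a `pandas`-free sketch on made-up rows. Each result is a (coder, item, labels) triple in which the spreadsheet plays the role of the coder and the paper key is the item. A `frozenset` is also hashable, which matters when labels need to be counted or used as dictionary keys.

```python
from collections import defaultdict

# Toy rows: (source_spreadsheet, key, label). The values are made up for illustration.
rows = [
    (1, "paper-a", "fluency"),
    (1, "paper-a", "clarity"),
    (1, "paper-a", "fluency"),  # duplicate label collapses inside the set
    (2, "paper-a", "fluency"),
]

grouped = defaultdict(set)
for spreadsheet, key, label in rows:
    grouped[(spreadsheet, key)].add(label)

# One (coder, item, labels) triple per spreadsheet and paper.
records = [(s, k, frozenset(labels)) for (s, k), labels in sorted(grouped.items())]
print(records)
```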


In [7]:
extract_iaa_df_by_column_name = iaa_utilities.extract_iaa_df_by_column_name
extract_records_for_nltk = iaa_utilities.extract_records_for_nltk

## Calculating agreement

We use the same setup for calculating Krippendorff's alpha with Jaccard distance and MASI distance for the closed-class columns.
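
For readers who want to see what is being computed, the following is a simplified sketch of Krippendorff's alpha, $\alpha = 1 - D_o / D_e$, over (coder, item, labels) records. It assumes that every coder labels every item exactly once, and the function names are ours; the notebook relies on `nltk`'s `AnnotationTask` rather than on this code.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def alpha(records, distance=jaccard):
    """Krippendorff's alpha for complete data (simplifying assumption, see above)."""
    by_item = {}
    for _coder, item, labels in records:
        by_item.setdefault(item, []).append(labels)

    # Observed disagreement: mean distance between coders on the same item.
    within = [distance(a, b) for labels in by_item.values()
              for a, b in combinations(labels, 2)]
    d_o = sum(within) / len(within)

    # Expected disagreement: mean distance between any two labels in the data.
    everything = [labels for _c, _i, labels in records]
    across = [distance(a, b) for a, b in combinations(everything, 2)]
    d_e = sum(across) / len(across)

    if d_e == 0:
        return 1.0  # only one label value in the data; our choice for this degenerate case
    return 1.0 - d_o / d_e


x, y = frozenset({"x"}), frozenset({"y"})
mixed = [(1, "p1", x), (2, "p1", x), (1, "p2", y), (2, "p2", x)]
perfect = [(1, "p1", x), (2, "p1", x), (1, "p2", y), (2, "p2", y)]
print(alpha(mixed), alpha(perfect))
```

In the `mixed` example the coders disagree on one of two items, which is exactly as much disagreement as chance would predict for these labels, so alpha is 0.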


In [8]:
iaa_by_column = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(annotation_df)
print(iaa_by_column['criterion_paraphrase']['df'].head())

   source_spreadsheet                  key  \
0                   1    2020.acl-main.711   
1                   1   2020.emnlp-main.61   
2                   1  2020.emnlp-main.739   
3                   1       2020.tacl-1.38   
4                   1             D18-1454   

                                criterion_paraphrase  
0  (detectability of controlled feature [property...  
1  (correctness of outputs in their own right (bo...  
2  (effect on reader/listener [effect] (specify):...  
3                                 (clarity, fluency)  
4        (~*empty*~, information content of outputs)  


In [9]:
iaa_utilities.pretty_print_iaa_by_column(iaa_by_column)

column	alpha_jaccard  alpha_masi
system_input	0.52    0.50
external_knowledge	0.15    0.15
system_output	0.09    0.09
system_task	0.38    0.38
knowledge_eval	0.18    0.18
criterion_paraphrase	0.22    0.14
op_form	0.05    0.04
op_data_type	0.25    0.21
op_instrument_type	0.07    0.06


This does reasonable things for our dev data in a strict-agreement mode, but we should also produce a version which relaxes some of the restrictions.
We should also do one for open-class columns with a different distance measure.

## Broad Agreement

We will call the exact-matching (at the string level) version of agreement which we have used so far *narrow* and now define *broad* agreement.
Broad agreement uses the natural hierarchies in the annotation scheme to group elements together which we might want to consider as equivalent.

For example, if two annotators disagree about the output type of a system, with one saying *text: paragraph* and the other saying *text: document*,
we want to penalize this less than if one of them were to say *multi-modal* instead.

### Input/Output Columns

For the `system_input` and `system_output` columns, we will consider the following equivalence classes:

* text = {text: subsentential units of text, text: sentence, text: paragraph, text: document, text: dialogue, text: other (please specify)},
* multiple = {all variations of *multiple (list all)*}, and
* other = {all variations of *other (please specify)*}

with all other allowed labels belonging individually to their own equivalence class containing only one element (e.g. raw data = {*raw data*}).
In the narrow agreement calculations, the elements of an equivalence class are treated as distinct labels under a nominal distance metric (identical strings are at distance 0 and all other pairs are at distance 1).
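
The mapping from a narrow label to its broad class can be sketched as a small function. The name `to_broad_io` is hypothetical; the three patterns mirror the equivalence classes listed above.

```python
import re


def to_broad_io(label: str) -> str:
    """Collapse a `system_input`/`system_output` label to its equivalence class."""
    for pattern, replacement in (
        (r"^other.*", "other"),
        (r"^multiple.*", "multiple"),
        (r"^text:.*", "text"),
    ):
        label = re.sub(pattern, replacement, label, flags=re.IGNORECASE)
    return label


labels = ["text: paragraph", "text: document", "multi-modal"]
broad = [to_broad_io(label) for label in labels]
print(broad)
```

After the mapping, *text: paragraph* and *text: document* are the same label and no longer count as a disagreement, while *multi-modal* still differs from both.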


### Task Column

For the `system_task` column, we use the following equivalence classes:

* multiple = {all variations of *multiple (list all)*} and
* other = {all variations of *other (please specify)*}

with all other allowed labels belonging individually to their own equivalence class containing only one element (e.g. aggregation = {*aggregation*}).
In the narrow agreement calculations, the elements of an equivalence class are treated as distinct labels under a nominal distance metric (identical strings are at distance 0 and all other pairs are at distance 1).

### Paraphrase of Criterion Name Column

For the `criterion_paraphrase` column, we use two sets of equivalence classes:
one for simple string-level differences related to annotator-specified details as in the above cases and
another based on the hierarchy of criteria.

#### String-level equivalence classes

* Detectability of Text Property (specify property here) = {all variations of *Detectability of Text Property*}
* Effect on listener (specify effect here) = {all variations of *Effect on listener*}
* Inferrability of Speaker Stance (specify object of stance here) = {all variations of *Inferrability of Speaker Stance*}
* Inferrability of Speaker Trait (specify trait here) = {all variations of *Inferrability of Speaker Trait*}

#### Hierarchy-based equivalence classes

These can be read in version 2.0 of the annotation guidelines.
We use all the immediate children of `Quality of outputs` as the top level categories and map all of their children to them.
This gives us four equivalence classes:

* `Quality of outputs` (containing only itself),
* `Correctness of outputs`,
* `Goodness of outputs (excluding correctness)`, and
* `Feature-type criteria`
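
A toy version of the hierarchy-based mapping is sketched below. The dictionary is an illustrative subset chosen to be consistent with the outputs shown in this notebook; the full mapping is `iaa_utilities.IAAv2SpreadsheetScheme.HIERARCHY_DICT`, which is not reproduced here, and the names `TOY_HIERARCHY` and `to_top_level` are ours.

```python
import re

# Illustrative subset only; not the real HIERARCHY_DICT.
TOY_HIERARCHY = {
    "Goodness of outputs (excluding correctness)": ["fluency", "user satisfaction"],
    "Feature-type criteria": ["detectability of controlled feature"],
}


def to_top_level(cell: str, hierarchy=TOY_HIERARCHY) -> str:
    """Replace a cell with a top-level category if it mentions one of that category's children."""
    for higher_level, lower_levels in hierarchy.items():
        for lower_level in lower_levels:
            pattern = f".*{re.escape(lower_level)}.*"
            cell = re.sub(pattern, higher_level, cell, flags=re.IGNORECASE)
    return cell


print(to_top_level("fluency"))
print(to_top_level("detectability of controlled feature [property] (specify): sarcasm"))
```

Because the whole cell is replaced as soon as one child matches, any annotator-specified detail after the criterion name is discarded as well.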

### Form of Response Elicitation

* other = {all variations of *other (please specify)*}

### Performing the updates and the calculations

We work with a fresh copy of `annotation_df` so that the original data is still accessible in the notebook.
For each of the columns where *other (please specify)* is a valid annotation, we replace any annotation beginning with "other" with the bare label "other":
we are collapsing the distinctions created by the further specifications.
We do the same thing for annotations of *multiple (list all)*.

In [10]:
broad_anno_df = annotation_df.copy(deep = True)
for column in ("system_input", "system_output", "system_task", "op_form"):
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Oo]ther.*", "other", regex=True)
broad_anno_df['system_input']


0          text: sentence
1          text: sentence
2          text: sentence
3          text: sentence
4          text: sentence
             ...         
65         text: sentence
66         text: sentence
67         text: sentence
68    raw/structured data
69    raw/structured data
Name: system_input, Length: 70, dtype: object

In [11]:
for column in ("system_input", "system_output"):
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Mm]ultiple.*", "multiple", regex=True)
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Tt]ext:.*", "text", regex=True)
broad_anno_df['system_input']

# broad_anno_df['criterion_paraphrase'] = broad_anno_df['criterion_paraphrase'].str.replace("^[Mm]ultiple.*", "multiple")


0                    text
1                    text
2                    text
3                    text
4                    text
             ...         
65                   text
66                   text
67                   text
68    raw/structured data
69    raw/structured data
Name: system_input, Length: 70, dtype: object

When it comes to the `criterion_paraphrase` column, however, we can take either of the two approaches described above.
We will create one copy of the broad annotation dataframe for each of them and then apply our fixes to the copies.

For the version focused only on discrepancies caused by 'please specify' lists,
we can use the same kind of approach we used earlier for 'other' and 'multiple':
look for the keyphrase at the beginning of the cell and remove any other cell contents.

In [12]:
broad_anno_string_df = broad_anno_df.copy(deep = True)
for string_prefix in ("Text Property", "Detectability of controlled feature", "Effect on reader/listener", "Inferrability of speaker/author stance", "Inferrability of speaker/author trait"):
    broad_anno_string_df['criterion_paraphrase'] = broad_anno_string_df['criterion_paraphrase'].str.replace(f"{string_prefix}.*", string_prefix, case=False, regex=True)
broad_anno_string_df['criterion_paraphrase']


0                                             ~*empty*~
1     goodness of outputs relative to linguistic con...
2                   Detectability of controlled feature
3                   Detectability of controlled feature
4                   Detectability of controlled feature
                            ...                        
65                                              fluency
66                                    user satisfaction
67                goodness of outputs relative to input
68                                              fluency
69                       information content of outputs
Name: criterion_paraphrase, Length: 70, dtype: object

For the hierarchical version, we need to define the hierarchy as a replacement dictionary first.
Here we define a dictionary where the key is the 'higher-level' which we will use as a replacement for each of the values associated with it (the 'lower-level').

In [13]:
broad_anno_hierarchical_df = broad_anno_df.copy(deep = True)

hierarchy_dict = iaa_utilities.IAAv2SpreadsheetScheme.HIERARCHY_DICT

for higher_level in hierarchy_dict:
    for lower_level in hierarchy_dict[higher_level]:
        broad_anno_hierarchical_df['criterion_paraphrase'] = broad_anno_hierarchical_df['criterion_paraphrase'].str.replace(f".*{re.escape(lower_level)}.*", higher_level, case=False, regex=True)
broad_anno_hierarchical_df['criterion_paraphrase']


0                                       ~*empty*~
1     Goodness of outputs (excluding correctness)
2                           Feature-type criteria
3                           Feature-type criteria
4                           Feature-type criteria
                         ...                     
65    Goodness of outputs (excluding correctness)
66    Goodness of outputs (excluding correctness)
67    Goodness of outputs (excluding correctness)
68    Goodness of outputs (excluding correctness)
69    Goodness of outputs (excluding correctness)
Name: criterion_paraphrase, Length: 70, dtype: object

We can then repeat our calculations of $\alpha$ using the broad versions of the spreadsheet.

#### For the string-based calculations

In [14]:
broad_string_dict = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(broad_anno_string_df)
iaa_utilities.pretty_print_iaa_by_column(broad_string_dict)

column	alpha_jaccard  alpha_masi
system_input	0.70    0.70
external_knowledge	0.15    0.15
system_output	1.00    1.00
system_task	0.37    0.37
knowledge_eval	0.18    0.18
criterion_paraphrase	0.26    0.18
op_form	0.05    0.04
op_data_type	0.25    0.21
op_instrument_type	0.07    0.06


In [15]:
broad_string_dict['system_output']['df']

Unnamed: 0,source_spreadsheet,key,system_output
0,1,2020.acl-main.711,(text)
1,1,2020.emnlp-main.61,(text)
2,1,2020.emnlp-main.739,(text)
3,1,2020.tacl-1.38,(text)
4,1,D18-1454,(text)
5,1,N19-1126,(text)
6,1,N19-1421,(text)
7,1,P19-1193,(text)
8,1,P19-1488,(text)
9,2,2020.acl-main.711,(text)


In [16]:
at = AnnotationTask(data = extract_records_for_nltk(broad_string_dict['system_output']['df']), distance=jaccard_distance)
at.alpha()

1

#### For the hierarchical calculations

In [17]:
broad_hier_dict = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(broad_anno_hierarchical_df)
iaa_utilities.pretty_print_iaa_by_column(broad_hier_dict)

column	alpha_jaccard  alpha_masi
system_input	0.70    0.70
external_knowledge	0.15    0.15
system_output	1.00    1.00
system_task	0.37    0.37
knowledge_eval	0.18    0.18
criterion_paraphrase	0.39    0.33
op_form	0.05    0.04
op_data_type	0.25    0.21
op_instrument_type	0.07    0.06


## Pairwise interannotator agreement
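
The implementation of `print_absolute_agreement` is not shown in this notebook. As a hedged sketch of what such a table can contain, the code below scores each pair of annotators by their mean Jaccard similarity over papers (our assumption, on made-up data) and then averages each annotator's scores against the others. That second step is how the final column of each table in this section appears to relate to the rest of its row, for example (0.61 + 0.50) / 2 ≈ 0.56.

```python
from itertools import combinations


def jaccard_similarity(a: frozenset, b: frozenset) -> float:
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


# Toy data: annotator -> paper -> set of labels (all values made up).
annotations = {
    1: {"p1": frozenset({"a"}), "p2": frozenset({"b"})},
    2: {"p1": frozenset({"a"}), "p2": frozenset({"b", "c"})},
    3: {"p1": frozenset({"d"}), "p2": frozenset({"b"})},
}

pairwise = {}
for first, second in combinations(sorted(annotations), 2):
    papers = annotations[first].keys() & annotations[second].keys()
    score = sum(jaccard_similarity(annotations[first][p], annotations[second][p])
                for p in papers) / len(papers)
    pairwise[(first, second)] = pairwise[(second, first)] = score

# Mean agreement of each annotator with every other annotator.
row_means = {
    a: sum(pairwise[(a, b)] for b in annotations if b != a) / (len(annotations) - 1)
    for a in annotations
}
print(pairwise[(1, 2)], row_means)
```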

In [18]:


iaa_utilities.IAAv2SpreadsheetScheme.print_absolute_agreement(annotation_df, iaa_by_column)

Interannotator agreement for system_input
 	1	2	3
1	1.00	0.61	0.50		0.56
2	0.61	1.00	0.89		0.75
3	0.50	0.89	1.00		0.69


Interannotator agreement for external_knowledge
 	1	2	3
1	1.00	0.78	0.67		0.72
2	0.78	1.00	0.89		0.83
3	0.67	0.89	1.00		0.78


Interannotator agreement for system_output
 	1	2	3
1	1.00	0.56	0.11		0.33
2	0.56	1.00	0.33		0.44
3	0.11	0.33	1.00		0.22


Interannotator agreement for system_task
 	1	2	3
1	1.00	0.44	0.44		0.44
2	0.44	1.00	0.56		0.50
3	0.44	0.56	1.00		0.50


Interannotator agreement for knowledge_eval
 	1	2	3
1	1.00	0.56	0.56		0.56
2	0.56	1.00	0.56		0.56
3	0.56	0.56	1.00		0.56


Interannotator agreement for criterion_paraphrase
 	1	2	3
1	1.00	0.30	0.23		0.27
2	0.30	1.00	0.30		0.30
3	0.23	0.30	1.00		0.27


Interannotator agreement for op_form
 	1	2	3
1	1.00	0.31	0.11		0.21
2	0.31	1.00	0.22		0.27
3	0.11	0.22	1.00		0.17


Interannotator agreement for op_data_type
 	1	2	3
1	1.00	0.39	0.43		0.41
2	0.39	1.00	0.54		0.46
3	0.43	0.54	1.00		0.48


Interannotator agreem

### Pair-wise agreement on the broad string-based annotations

In [19]:
iaa_utilities.IAAv2SpreadsheetScheme.print_absolute_agreement(broad_anno_string_df)

Interannotator agreement for system_input
 	1	2	3
1	1.00	0.78	0.78		0.78
2	0.78	1.00	1.00		0.89
3	0.78	1.00	1.00		0.89


Interannotator agreement for external_knowledge
 	1	2	3
1	1.00	0.78	0.67		0.72
2	0.78	1.00	0.89		0.83
3	0.67	0.89	1.00		0.78


Interannotator agreement for system_output
 	1	2	3
1	1.00	1.00	1.00		1.00
2	1.00	1.00	1.00		1.00
3	1.00	1.00	1.00		1.00


Interannotator agreement for system_task
 	1	2	3
1	1.00	0.44	0.44		0.44
2	0.44	1.00	0.56		0.50
3	0.44	0.56	1.00		0.50


Interannotator agreement for knowledge_eval
 	1	2	3
1	1.00	0.56	0.56		0.56
2	0.56	1.00	0.56		0.56
3	0.56	0.56	1.00		0.56


Interannotator agreement for criterion_paraphrase
 	1	2	3
1	1.00	0.31	0.24		0.28
2	0.31	1.00	0.39		0.35
3	0.24	0.39	1.00		0.32


Interannotator agreement for op_form
 	1	2	3
1	1.00	0.31	0.11		0.21
2	0.31	1.00	0.22		0.27
3	0.11	0.22	1.00		0.17


Interannotator agreement for op_data_type
 	1	2	3
1	1.00	0.39	0.43		0.41
2	0.39	1.00	0.54		0.46
3	0.43	0.54	1.00		0.48


Interannotator agreem

### Pair-wise agreement on the broad hierarchical annotations

In [20]:
iaa_utilities.IAAv2SpreadsheetScheme.print_absolute_agreement(broad_anno_hierarchical_df)

Interannotator agreement for system_input
 	1	2	3
1	1.00	0.78	0.78		0.78
2	0.78	1.00	1.00		0.89
3	0.78	1.00	1.00		0.89


Interannotator agreement for external_knowledge
 	1	2	3
1	1.00	0.78	0.67		0.72
2	0.78	1.00	0.89		0.83
3	0.67	0.89	1.00		0.78


Interannotator agreement for system_output
 	1	2	3
1	1.00	1.00	1.00		1.00
2	1.00	1.00	1.00		1.00
3	1.00	1.00	1.00		1.00


Interannotator agreement for system_task
 	1	2	3
1	1.00	0.44	0.44		0.44
2	0.44	1.00	0.56		0.50
3	0.44	0.56	1.00		0.50


Interannotator agreement for knowledge_eval
 	1	2	3
1	1.00	0.56	0.56		0.56
2	0.56	1.00	0.56		0.56
3	0.56	0.56	1.00		0.56


Interannotator agreement for criterion_paraphrase
 	1	2	3
1	1.00	0.65	0.52		0.59
2	0.65	1.00	0.48		0.57
3	0.52	0.48	1.00		0.50


Interannotator agreement for op_form
 	1	2	3
1	1.00	0.31	0.11		0.21
2	0.31	1.00	0.22		0.27
3	0.11	0.22	1.00		0.17


Interannotator agreement for op_data_type
 	1	2	3
1	1.00	0.39	0.43		0.41
2	0.39	1.00	0.54		0.46
3	0.43	0.54	1.00		0.48


Interannotator agreem

### Agreement tables for each column

In [21]:
for column in iaa_utilities.IAAv2SpreadsheetScheme.ALL_DATA_COLUMNS:
    display(column)
    display(extract_iaa_df_by_column_name(annotation_df, column).pivot(index="key", columns="source_spreadsheet", values=column))



'system_language'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,(English),(English),(English)
2020.emnlp-main.61,(English),(English),(English)
2020.emnlp-main.739,(English),(English),(~*empty*~)
2020.tacl-1.38,(English),(English),(~*empty*~)
D18-1454,(English),(English),(English)
N19-1126,(English),(Englsih),(English)
N19-1421,(English),(English),(English)
P19-1193,(English),(Chinese),(English)
P19-1488,(English),(English),(English)


'system_input'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,(text: sentence),(text: sentence),(text: sentence)
2020.emnlp-main.61,(visual),(visual),(visual)
2020.emnlp-main.739,(text: multiple sentences),(text: sentence),(text: sentence)
2020.tacl-1.38,(raw/structured data),(raw/structured data),(raw/structured data)
D18-1454,(deep linguistic representation (DLR)),(text: sentence),(text: sentence)
N19-1126,(text: sentence),(text: sentence),(text: dialogue)
N19-1421,(raw/structured data),(text: sentence),(text: sentence)
P19-1193,"(text: sentence, text: subsentential units of ...",(text: subsentential units of text),(text: subsentential units of text)
P19-1488,(text: sentence),(text: sentence),(text: sentence)


'system_output'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,(text: sentence),(text: sentence),(text: variable-length)
2020.emnlp-main.61,(text: multiple sentences),(text: sentence),(text: variable-length)
2020.emnlp-main.739,(text: dialogue),(text: sentence),(text: sentence)
2020.tacl-1.38,(text: multiple sentences),(text: multiple sentences),(text: variable-length)
D18-1454,(text: subsentential units of text),(text: sentence),(text: sentence)
N19-1126,(text: sentence),(text: sentence),(text: dialogue)
N19-1421,(text: other (please specify): multiple-choice...,(text: subsentential units of text),(text: sentence)
P19-1193,(text: multiple sentences),(text: multiple sentences),(text: documents)
P19-1488,(text: sentence),(text: sentence),(text: sentence)


'system_task'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,(feature-controlled generation),(other (please specify): sarcasm generation),(other (please specify): sarcasm generation)
2020.emnlp-main.61,(deep generation (DLR to text)),(other (please specify): video captioning),(end-to-end text generation)
2020.emnlp-main.739,(dialogue turn generation),(dialogue turn generation),(dialogue turn generation)
2020.tacl-1.38,(deep generation (DLR to text)),(data-to-text generation),(end-to-end text generation)
D18-1454,(question answering),(question answering),(question answering)
N19-1126,(deep generation (DLR to text)),(question answering),(dialogue turn generation)
N19-1421,(question answering),(question answering),(question answering)
P19-1193,(summarisation (text-to-text)),(other (please specify): topic generation),(data-to-text generation)
P19-1488,(question answering),(question answering),(question answering)


'op_response_values'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(1-5, ~*empty*~)",(1-5),(1-5)
2020.emnlp-main.61,"(1-5, ~*empty*~)",(1-5),(1-5)
2020.emnlp-main.739,"(r1 is better, both are similar, r1 is worse, ...","(r1 is better, both are similar, r1 is worse, ...","(r1 is better, both are similar, r1 is worse, ..."
2020.tacl-1.38,(1-5),(1-5),(1-5)
D18-1454,(~*empty*~),(~*empty*~),"(0/1, ~*empty*~)"
N19-1126,"(0-3, 0 is a completely incorect sentence and ...","(0-3, 1-4)","(0-3, 1-4)"
N19-1421,(~*empty*~),(~*empty*~),(~*empty*~)
P19-1193,(1-5),(1-5),(1-5)
P19-1488,"(definitely left, rather left, difficult to sa...","(definitely left, rather left, difficult to sa...","(definitely left, rather left, difficult to sa..."


'op_instrument_size'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(5.0, ~*empty*~)",(5.0),(5.0)
2020.emnlp-main.61,"(5.0, 6.0)",(5.0),(5.0)
2020.emnlp-main.739,(3.0),(3.0),(3.0)
2020.tacl-1.38,(5.0),(5.0),(5.0)
D18-1454,(~*empty*~),(~*empty*~),"(2.0, ~*empty*~)"
N19-1126,(4.0),(4.0),(4.0)
N19-1421,(~*empty*~),(~*empty*~),(~*empty*~)
P19-1193,(5.0),(5.0),(5.0)
P19-1488,(5.0),(5.0),(5.0)


'op_instrument_type'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(rank ordering, Likert scale, text annotation)",(Likert scale),(numerical rating scale)
2020.emnlp-main.61,"(Likert scale, output classification)",(Likert scale),(numerical rating scale)
2020.emnlp-main.739,(zero-centered rating scale),"(rank ordering, output classification)",(zero-centered rating scale)
2020.tacl-1.38,(Likert scale),(Likert scale),(numerical rating scale)
D18-1454,(~*empty*~),(unclear),"(output classification, ~*empty*~)"
N19-1126,"(Likert scale, output classification)","(rank ordering, Likert scale)","(numerical rating scale, output classification)"
N19-1421,(~*empty*~),(~*empty*~),(free-text entry)
P19-1193,(unclear),(Likert scale),(numerical rating scale)
P19-1488,(verbal descriptor scale),(rank ordering),(output classification)


'op_data_type'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(text, ordinal, rank order)",(ordinal),(ordinal)
2020.emnlp-main.61,"(categorical, ordinal)",(ordinal),(ordinal)
2020.emnlp-main.739,"(categorical, ratio-scale)","(categorical, rank order)",(categorical)
2020.tacl-1.38,(ordinal),(ordinal),(ordinal)
D18-1454,(~*empty*~),(unclear),"(categorical, ~*empty*~)"
N19-1126,"(categorical, ordinal)","(rank order, ordinal)","(categorical, ordinal)"
N19-1421,(~*empty*~),(~*empty*~),(text)
P19-1193,(unclear),(ordinal),(ordinal)
P19-1488,(ordinal),(rank order),(categorical)


'op_form'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"((dis)agreement with quality statement, relati...",((dis)agreement with quality statement),(direct quality estimation)
2020.emnlp-main.61,"((dis)agreement with quality statement, classi...",((dis)agreement with quality statement),(direct quality estimation)
2020.emnlp-main.739,"(relative quality estimation, classification)","(classification, relative quality estimation)",(relative quality estimation)
2020.tacl-1.38,((dis)agreement with quality statement),((dis)agreement with quality statement),(direct quality estimation)
D18-1454,(relative quality estimation),(unclear),"(relative quality estimation, ~*empty*~)"
N19-1126,"((dis)agreement with quality statement, classi...","(direct quality estimation, relative quality e...",(direct quality estimation)
N19-1421,(unclear),(~*empty*~),(task performance measurements)
P19-1193,(unclear),((dis)agreement with quality statement),(direct quality estimation)
P19-1488,(direct quality estimation),(relative quality estimation),(relative quality estimation)


'op_question_prompt_verbatim'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(how creative are the utterances ?, how funny ...","(""how creative are the utterances ?"", ""how sar...","(how sarcastic are the utterances?, how creati..."
2020.emnlp-main.61,(asked to select relevant commonsense descript...,(~*empty*~),(intention of agent's action: judge them in te...
2020.emnlp-main.739,(which response do you think is more engaging/...,"(""considering english language fluency only, c...",(which response do you think is more engaging/...
2020.tacl-1.38,"(does the text flow in a natural, easy to read...",(not given),"(does the text flow in a natural, easy to read..."
D18-1454,(were the commonsense relations\nprovided by o...,(~*empty*~),(was any external commonsense knowledge necess...
N19-1126,(please select all the options that could be a...,(~*empty*~),(please select all the options that could be a...
N19-1421,(~*empty*~),(~*empty*~),(we sampled 100 random questions and for each ...
P19-1193,(~*empty*~),(~*empty*~),(not given)
P19-1488,( the order in which model a and model b appea...,(~*empty*~),(which list of facts explains the answer to th...


'op_statistics'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(median, p value, ~*empty*~)",(mean),"(p-value, no test mentioned)"
2020.emnlp-main.61,"(standard deviation of the ratings, inter-rate...","(std, iras, smooth iras)",(standard deviation)
2020.emnlp-main.739,"(cohen’s kappa, percentage)",(cohen's kappa),(cohen’s kappa)
2020.tacl-1.38,(pair-wise mannwhitney tests),(mann-whitney),(pair-wise mann-whitney test)
D18-1454,"(percentage, ~*empty*~)",(blank),"(percentage, ~*empty*~)"
N19-1126,(~*empty*~),(mean),"(average, percentage)"
N19-1421,(~*empty*~),(blank),(percentage of accurate responses)
P19-1193,(~*empty*~),(mean),(pearson correlation for inter-annotator agree...
P19-1488,(~*empty*~),(blank),(percentage)


'criterion_verbatim'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,"(creativity, sarcasticness, humour, ~*empty*~,...","(humor, sarcasticness, creativity, grammatical...","(sarcasticness, grammaticality, creativity, hu..."
2020.emnlp-main.61,(relevance),(relevant),"(intention, effect, attribute)"
2020.emnlp-main.739,"(~*empty*~, engagement, relevance, fluency)","(engagement, relevance, fluency)","(engagement, relevance, fluency)"
2020.tacl-1.38,"(adequacy, fluency)","(adequacy, fluency)","(adequacy, fluency)"
D18-1454,"(effectiveness, ~*empty*~)",(~*empty*~),(~*empty*~)
N19-1126,"(usefulness, grammatical correctness, fluency)","(grammatical correctness and fluency, usefulness)","(grammatical correctness and fluency, usefulness)"
N19-1421,(accuracy),(~*empty*~),(accuracy)
P19-1193,"(diversity, novelty, topic-consistency, cohere...","(diversity, novelty, topic-consistency, cohere...","(diversity, novelty, topic-consistency, cohere..."
P19-1488,(~*empty*~),(~*empty*~),(~*empty*~)


'criterion_definition_verbatim'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,(~*empty*~),(~*empty*~),(~*empty*~)
2020.emnlp-main.61,(~*empty*~),(the workers are asked to provide this rating ...,"(the attribute of the agent given the action, ..."
2020.emnlp-main.739,(relevance measures whether the generated outp...,(relevance measures whether the generated outp...,(relevance measures whether the generated outp...
2020.tacl-1.38,(~*empty*~),"(""does the text flow in a natural, easy to rea...","(does the text clearly express the data?, flue..."
D18-1454,(to check the effectiveness of our commonsense...,(~*empty*~),(~*empty*~)
N19-1126,(validate grammatical correctness of different...,(~*empty*~),(~*empty*~)
N19-1421,(~*empty*~),(~*empty*~),(~*empty*~)
P19-1193,(~*empty*~),(~*empty*~),(~*empty*~)
P19-1488,(~*empty*~),(~*empty*~),(annotators are asked to compare both lists of...


'criterion_paraphrase'

source_spreadsheet,1,2,3
key,,,
2020.acl-main.711,(detectability of controlled feature [property...,(text property [complexity/simplicity (both fo...,"(grammaticality, text property [property] (spe..."
2020.emnlp-main.61,(correctness of outputs in their own right (bo...,(correctness of outputs relative to input (con...,(goodness of outputs relative to input)
2020.emnlp-main.739,(effect on reader/listener [effect] (specify):...,"(appropriateness (both form and content), flue...","(user satisfaction, goodness of outputs relati..."
2020.tacl-1.38,"(clarity, fluency)","(appropriateness (content), fluency)","(information content of outputs, fluency)"
D18-1454,"(~*empty*~, information content of outputs)",(~*empty*~),"(quality of outputs, ~*empty*~)"
N19-1126,"(grammaticality, usefulness for task/informati...","(naturalness (form), information content of ou...","(multiple (list all): grammaticality, fluency,..."
N19-1421,(~*empty*~),(~*empty*~),(~*empty*~)
P19-1193,"(nonredundancy (both form and content), cohere...","(nonredundancy (both form and content), cohesi...",(text property [property] (specify): diversity...
P19-1488,(inferrability of speaker/author trait [trait]...,(correctness of outputs in their own right),(blank)
