Gold standard records for Arabic event data

Event Detection

We provide two "event detection" datasets for Arabic-language event coding, one for ASSAULT events and one for PROTEST events. These events are coded using the PLOVER ontology, which is similar to the CAMEO ontology. The files can be used to test an automated coder's ability to recognize these two event types in Arabic text.

protest_gsr.csv and assault_gsr.csv each have the following columns:

  • accept: the number of annotators accepting the event label as true
  • event_type: "ASSAULT" or "PROTEST", depending on the file
  • id: the ID number of the sentence
  • label: one of "yes", "easy no", "difficult no", or "ambiguous", depending on the set of labels provided by annotators. "yes" is a unanimous accept, "easy no" is a unanimous reject, "difficult no" is mostly reject with a dissenting accept, and "ambiguous" marks entries with too few annotations to be sure.
  • reject: the number of annotators rejecting the event label
  • text: the text shown to the annotators, which should also be provided to the event detection system
  • total: the total number of annotations provided for the sentence
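
As a rough illustration of how these columns might be used, the sketch below loads a GSR file into a binary gold standard and scores a detector against it. It is a minimal sketch under several assumptions, not part of the dataset: the function names load_gold and score are hypothetical, "ambiguous" rows are dropped by choice, both "difficult no" and "hard no" are accepted as negatives, and the sample rows are invented placeholders rather than real records.

```python
import csv
import io

# Labels treated as gold negatives. Both spellings of the "mostly reject"
# category are accepted here as a precaution (an assumption of this sketch).
NEGATIVE_LABELS = {"easy no", "difficult no", "hard no"}


def load_gold(csv_file):
    """Map sentence id -> (text, is_event); rows with other labels are skipped."""
    gold = {}
    for row in csv.DictReader(csv_file):
        label = row["label"].strip().lower()
        if label == "yes":
            gold[row["id"]] = (row["text"], True)
        elif label in NEGATIVE_LABELS:
            gold[row["id"]] = (row["text"], False)
        # Anything else (e.g. "ambiguous") is left out of the evaluation.
    return gold


def score(gold, predictions):
    """Return (precision, recall); predictions maps sentence id -> bool."""
    tp = fp = fn = 0
    for sentence_id, (_, is_event) in gold.items():
        predicted = predictions.get(sentence_id, False)
        if predicted and is_event:
            tp += 1
        elif predicted and not is_event:
            fp += 1
        elif is_event:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented placeholder rows that only mirror the documented column names.
SAMPLE = """accept,event_type,id,label,reject,text,total
3,PROTEST,1,yes,0,placeholder sentence A,3
0,PROTEST,2,easy no,3,placeholder sentence B,3
1,PROTEST,3,difficult no,2,placeholder sentence C,3
1,PROTEST,4,ambiguous,0,placeholder sentence D,1
"""

gold = load_gold(io.StringIO(SAMPLE))
print(score(gold, {"1": True, "3": True}))
```

To run this against the real data, pass an open file instead of the in-memory sample, for example open("protest_gsr.csv", encoding="utf-8", newline="").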

Span Recognition and Labeling

Another set of files (assault_spans.json and protest_spans.json) contains the gold standard annotations for the recognized events, consisting of the event verb, the source actor and target actor spans, and the resulting CAMEO actor codes for each.

source_gold, target_gold, and verb_gold report the common text that all coders identified as part of the span.
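
The sketch below shows one way the three gold fields might be pulled out of a spans file. The layout of the JSON is an assumption: the code accepts either a single JSON document or one JSON object per line, and the helper names parse_records and gold_spans are hypothetical. Only the keys source_gold, target_gold, and verb_gold come from the description above; check the actual files before relying on this.

```python
import json

# The three gold fields named in the description above.
GOLD_KEYS = ("source_gold", "target_gold", "verb_gold")


def parse_records(raw):
    """Parse a single JSON document or JSON Lines text into a list of records."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to one JSON object per non-empty line.
        data = [json.loads(line) for line in raw.splitlines() if line.strip()]
    if isinstance(data, dict):
        data = [data]  # a lone object is treated as a single record
    return data


def gold_spans(record):
    """Return only the gold span fields; missing keys come back as None."""
    return {key: record.get(key) for key in GOLD_KEYS}
```

With the real files, the text could be read with open("protest_spans.json", encoding="utf-8").read() and passed to parse_records.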

Petrarch validation format

The two files are also available in XML format, suitable for use in UniversalPetrarch.


