AMALGUM v0.2

AMALGUM is a machine-annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web. Read more here: https://corpling.uis.georgetown.edu/gum/amalgum.html

Download

The latest data without Reddit texts is available under amalgum/, and some additional data beyond the target size of 4M tokens is available under amalgum_extra/. (The amalgum directory contains around 500,000 tokens per genre, while the extra directory contains additional data beyond the genre-balanced corpus.)

You may download the older version 0.1 of the corpus without Reddit texts as a zip. The complete corpus, with Reddit data, is available upon request: please email lg876@georgetown.edu.

Description

AMALGUM (A Machine-Annotated Lookalike of GUM) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.

Genres

Source data was scraped from eight sources containing stylistically distinct text. Each text's source is indicated by a genre slug in its filename:

  • academic – academic writing from MDPI open-access journals
  • bio – biographies from Wikipedia
  • fiction – fiction from Project Gutenberg
  • interview – interviews from Wikinews
  • news – news articles from Wikinews
  • reddit – forum discussions from Reddit
  • voyage – travel guides from Wikivoyage
  • whow – how-to guides from wikiHow
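For example, documents of a single genre can be collected by matching the slug in the filename. This is a sketch, not part of the distribution; the root directory and glob pattern are assumptions about a local checkout:

import conllu  # not needed here, shown later; see the sketch under Annotations
from pathlib import Path

# Collect all CoNLL-U files for the "news" genre by their filename slug.
# The corpus root and directory layout are assumptions; adjust the pattern
# to match where the corpus actually lives on disk.
corpus_root = Path("amalgum")
news_docs = sorted(corpus_root.glob("**/AMALGUM_news_*.conllu"))
print(f"{len(news_docs)} news documents")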

Annotations

AMALGUM contains annotations for the following information:

  • Tokenization
  • UD and Extended PTB part of speech tags
  • Lemmas
  • UD dependency parses
  • Nested (non-)named entities (NNER)
  • Coreference resolution
  • Rhetorical Structure Theory discourse parses (constituent and dependency versions)
  • Date/Time annotations in TEI format

These annotations are distributed across four file formats: GUM-style XML, CoNLL-U, WebAnno TSV, and RS3.

You can see samples of the data for the document AMALGUM_news_khadr in each format: xml, conllu, tsv, rs3
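As a minimal sketch of consuming the dependency layer, the CoNLL-U files can be read with the off-the-shelf conllu Python package; the file path below is a placeholder for any corpus file:

import conllu

# Read one AMALGUM document in CoNLL-U format (placeholder path;
# substitute the real location of the file in your checkout).
with open("AMALGUM_news_khadr.conllu", encoding="utf-8") as f:
    sentences = conllu.parse(f.read())

# Each token carries the layers listed above: lemma, UD (upos) and
# extended PTB (xpos) tags, and its dependency head and relation.
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"],
          token["xpos"], token["head"], token["deprel"])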

Performance

Current scores on the GUM corpus test set per task:

task           metric       performance
tokenizer      F1           99.92
sentencer      Acc / F1     99.85 / 94.35
xpos           Acc          98.16
dependencies   LAS / UAS*   92.16 / 94.25
NNER           Micro F1     70.8
coreference    CoNLL F1     51.4
RST            S / N / R    77.98 / 61.79 / 44.07

* Parsing scores ignore punctuation attachment; punctuation is attached automatically via udapi.
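The footnoted punctuation step can be reproduced with udapi's ud.FixPunct block. A minimal sketch, assuming udapi is installed (pip install udapi); the input file name is a placeholder:

from udapi.core.document import Document
from udapi.block.ud.fixpunct import FixPunct

# Load a parsed CoNLL-U document (placeholder file name), re-attach
# punctuation nodes deterministically, and write the result back out.
doc = Document()
doc.load_conllu("AMALGUM_news_khadr.conllu")
FixPunct().process_document(doc)
doc.store_conllu("AMALGUM_news_khadr.fixed.conllu")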

Further Information

Please see our paper: https://www.aclweb.org/anthology/2020.lrec-1.648

Citation

@inproceedings{gessler-etal-2020-amalgum,
    title = "{AMALGUM} {--} A Free, Balanced, Multilayer {E}nglish Web Corpus",
    author = "Gessler, Luke  and
      Peng, Siyao  and
      Liu, Yang  and
      Zhu, Yilun  and
      Behzad, Shabnam  and
      Zeldes, Amir",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
    pages = "5267--5275",
    abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

License

All annotations under the folders amalgum/ and amalgum_extra/ are available under a Creative Commons Attribution (CC-BY) license, version 4.0. Note that the underlying texts are sourced from the following websites under their own licenses: MDPI, Wikipedia, Wikinews, Wikivoyage, wikiHow, Project Gutenberg, and Reddit.

Development

See DEVELOPMENT.md.