WikiAtomicEdits Dataset

A dataset of atomic Wikipedia edits, each consisting of the insertion or deletion of a contiguous chunk of text in a sentence. The dataset contains ~43 million edits across 8 languages.

http://goo.gl/language/wiki-atomic-edits

Description

An atomic edit is an edit e applied to a natural language expression S, consisting of the insertion, deletion, or substitution of a sub-expression P, such that both the original expression S and the resulting expression e(S) are well-formed semantic constituents (MacCartney, 2009). In this corpus, we release such atomic insertions and deletions made to sentences in Wikipedia. Please click on the numbers below to download the data.

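The number of examples per language, in millions, is as follows:

| Language | Insertions (millions) | Deletions (millions) |
|----------|-----------------------|----------------------|
| German   | 3.3                   | 1.9                  |
| English  | 13.7                  | 9.3                  |
| Spanish  | 1.4                   | 0.9                  |
| French   | 2.0                   | 2.0                  |
| Italian  | 1.0                   | 0.6                  |
| Japanese | 2.2                   | 1.3                  |
| Russian  | 1.4                   | 0.9                  |
| Chinese  | 0.7                   | 0.4                  |
| Total    | 25.7                  | 17.2                 |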

Examples

Example of an insertion:

"She died there after a long illness." + "in 1949" = "She died there in 1949 after a long illness."

Example of a deletion:

"She dreams about entering the Black Lodge and about a ring." - "and about a ring." = "She dreams about entering the Black Lodge."
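The relationship between the three parts of an edit can be sketched in code. The following is a minimal illustration, not part of the released dataset or its tooling: the function names are hypothetical, and it assumes plain whitespace tokenization. It looks for a token position at which inserting the phrase into the original sentence yields the edited sentence, and treats a deletion as the same operation read in reverse.

```python
from typing import Optional


def find_insertion_index(original: str, phrase: str, edited: str) -> Optional[int]:
    """Return the token index at which `phrase` was inserted into `original`
    to produce `edited`, or None if no single contiguous insertion fits.

    Assumption: tokens are separated by whitespace only.
    """
    orig_tokens = original.split()
    phrase_tokens = phrase.split()
    edited_tokens = edited.split()
    # Try every possible insertion point, including the start and the end.
    for i in range(len(orig_tokens) + 1):
        if orig_tokens[:i] + phrase_tokens + orig_tokens[i:] == edited_tokens:
            return i
    return None


def find_deletion_index(original: str, phrase: str, edited: str) -> Optional[int]:
    """A deletion is an insertion read backwards: inserting `phrase`
    into `edited` should give back `original`."""
    return find_insertion_index(edited, phrase, original)
```

For the insertion example above, `find_insertion_index` locates the phrase after the third token. The deletion example as printed does not align under this naive tokenization, because the sentence-final period moves from "ring." to "Lodge."; a punctuation-aware tokenizer would be needed to handle such cases.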

For each language we provide two files, one for insertions and one for deletions. Each file contains the following tab-separated columns:

| Column | Data               |
|--------|--------------------|
| 1      | Original sentence  |
| 2      | Insertion/deletion |
| 3      | Edited sentence    |
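As a rough sketch of how the three-column layout might be read, the snippet below parses tab-separated lines into records. The names `AtomicEdit` and `parse_edits` are hypothetical, and the sketch assumes UTF-8 text with exactly three fields per line. Whether the files carry a header row or are compressed is not stated here, so lines that do not have three fields are simply skipped and any header would need to be handled separately.

```python
from typing import Iterable, Iterator, NamedTuple


class AtomicEdit(NamedTuple):
    original: str  # column 1: original sentence
    phrase: str    # column 2: inserted or deleted phrase
    edited: str    # column 3: edited sentence


def parse_edits(lines: Iterable[str]) -> Iterator[AtomicEdit]:
    """Yield one AtomicEdit per well-formed tab-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            # Skip malformed or empty lines rather than guessing at their content.
            continue
        yield AtomicEdit(*fields)


# Usage (the file name is a placeholder, not a name taken from the dataset):
# with open("insertions.tsv", encoding="utf-8") as f:
#     for edit in parse_edits(f):
#         print(edit.phrase)
```

Splitting on the tab character directly, rather than using a CSV reader with quoting enabled, avoids misreading sentences that contain quotation marks.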

Reference

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{WikiAtomicEdits,
  title = {{WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse}},
  author = {Faruqui, Manaal and Pavlick, Ellie and Tenney, Ian and Das, Dipanjan},
  booktitle = {Proc. of EMNLP},
  year = {2018}
}

License

The WikiAtomicEdits dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided “As Is” without any warranty, express or implied.

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.