Skip to content
This repository has been archived by the owner on May 18, 2023. It is now read-only.

A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.

google-research-datasets/wiki-atomic-edits

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
October 9, 2018 12:03

WikiAtomicEdits Dataset

A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.

http://goo.gl/language/wiki-atomic-edits

Description

An atomic edit is defined as an edit e applied to a natural language expression S as the insertion, deletion, or substitution of a sub-expression P such that both the original expression S and the resulting expression e(S) are well-formed semantic constituents (MacCartney, 2009). In this corpus, we release such atomic insertions and deletions made to sentences in wikipedia. Please click on the numbers below to download the data.

Following are the number of examples by language:

Language Insertions Deletions
German 3.3 1.9
English 13.7 9.3
Spanish 1.4 0.9
French 2.0 2.0
Italian 1.0 0.6
Japanese 2.2 1.3
Russian 1.4 0.9
Chinese 0.7 0.4
Total 25.7 17.2

Examples

Example of an insertion:

"She died there after a long illness." + "in 1949" = "She died there in 1949 after a long illness."

Example of a deletion:

"She dreams about entering the Black Lodge and about a ring." - "and about a ring." = "She dreams about entering the Black Lodge."

For each language we provide two files, one each for insertions and deletions. The files contains the following tab-separated columns:

Column Data
1 Original sentence
2 Insertion/deletion
3 Edited sentence

Reference

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{WikiAtomicEdits,
  title = {{WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse}},
  author = {Faruqui, Manaal and Pavlick, Ellie and Tenney, Ian and Das, Dipanjan},
  booktitle = {Proc. of EMNLP},
  year = {2018}
}

License

Query-wellformedness dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided “As Is” without any warranty, express or implied.

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.

About

A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published