WikiSplit Dataset

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.

http://goo.gl/language/wiki-split

Description

Google's WikiSplit dataset was constructed automatically from the publicly available Wikipedia revision history. Although the dataset contains some inherent noise, it can serve as valuable training data for models that split or merge sentences.

For further details about the construction of the dataset and its use for model training, see the accompanying paper: "Learning to Split and Rephrase From Wikipedia Edit History".

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{BothaEtAl2018,
  title = {{Learning To Split and Rephrase From Wikipedia Edit History}},
  author = {Botha, Jan A and Faruqui, Manaal and Alex, John and Baldridge, Jason and Das, Dipanjan},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages = {to appear},
  note = {arXiv preprint arXiv:1808.09468},
  year = {2018}
}

Examples

  • Due to the hurricane , Lobsterfest has been canceled , making Bob very happy about it and he decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .

    • Due to the hurricane , Lobsterfest has been canceled , making Bob ecstatic .
    • He decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .
  • Her family is rumored to be a large financial clique which controls the underworld of Japan , but rarely people know the unhappiness which she suffered for being born in such a troublesome family .

    • Her family is rumored to be a large financial clique which controls the underworld of Japan .
    • People are unaware of the unhappiness which she suffered for being born in such a troublesome family .

Data format

The dataset is released as text files formatted as tab-separated values (TSV) according to the following schema:

Column   Meaning
1        unsplit single sentence
2        split-up sentences, delimited by the string <::::>

The sentences have already been tokenized on punctuation.

Example data item

Due to the hurricane , Lobsterfest has been canceled , making Bob very happy about it and he decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .	Due to the hurricane , Lobsterfest has been canceled , making Bob ecstatic . <::::> He decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .
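A row can be parsed by splitting on the tab character and then on the <::::> delimiter. The following is a minimal sketch in Python; the helper name parse_line is illustrative, not part of the dataset release:

```python
# Parse one WikiSplit TSV row into the unsplit sentence (column 1)
# and the list of split sentences (column 2, delimited by "<::::>").
SPLIT_DELIM = "<::::>"

def parse_line(line):
    """Return (unsplit_sentence, [split_sentence, ...]) for one TSV row."""
    unsplit, split_field = line.rstrip("\n").split("\t")
    return unsplit, [s.strip() for s in split_field.split(SPLIT_DELIM)]

# The example row from above:
row = (
    "Due to the hurricane , Lobsterfest has been canceled , making Bob very "
    "happy about it and he decides to open Bob 's Burgers for customers who "
    "were planning on going to Lobsterfest .\t"
    "Due to the hurricane , Lobsterfest has been canceled , making Bob "
    "ecstatic . <::::> He decides to open Bob 's Burgers for customers who "
    "were planning on going to Lobsterfest ."
)
unsplit, splits = parse_line(row)
print(len(splits))  # 2
```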

Dataset statistics

Part            Instances   Tokens*      Vocabulary*
train.tsv         989,944   33,084,465       632,588
tune.tsv            5,000      167,456        25,871
validation.tsv      5,000      166,628        25,251
test.tsv            5,000      167,853        25,386

*counted over the unsplit sentences
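Counts like those in the table can be reproduced with a short script. The sketch below assumes simple whitespace tokenization of the unsplit sentences, which may not exactly match the counting used to produce the official numbers:

```python
import io

def corpus_stats(lines):
    """Count instances, whitespace tokens, and vocabulary over column 1."""
    instances, tokens, vocab = 0, 0, set()
    for line in lines:
        unsplit = line.rstrip("\n").split("\t")[0]
        words = unsplit.split()
        instances += 1
        tokens += len(words)
        vocab.update(words)
    return instances, tokens, len(vocab)

# Tiny synthetic sample in the same two-column format:
sample = io.StringIO(
    "A b c .\tA b . <::::> C .\n"
    "A d .\tA . <::::> D .\n"
)
print(corpus_stats(sample))  # (2, 7, 5)
```

For the real files, open each split as text and pass the file handle to the function; note that train.tsv is distributed as a zip archive and must be extracted first.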

Result on WebSplit 1.0 Benchmark

Our paper introducing the WikiSplit dataset applied it to the split-and-rephrase task. The main result is that including WikiSplit during model training leads to improved generalization and dramatically better output on the WebSplit 1.0 test set.

                                                 Corpus BLEU
Source (i.e. just echoing the input sentence)           58.7
Model trained on...
     WebSplit only (Aharoni & Goldberg, 2017)           30.5
     WebSplit + WikiSplit (Botha et al., 2018)*         62.4

See the paper for details.

License

The WikiSplit dataset is a verbatim copy of certain content from the publicly available Wikipedia revision history. The dataset is therefore licensed under CC BY-SA 4.0. Any third party content or data is provided "As Is" without any warranty, express or implied.

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.