MWLS1 (Multi-Word Lexical Simplification dataset 1) is a dataset for lexical simplification (making text easier to understand by replacing some words), in which both the replaced and replacing fragments can consist of multiple words (up to 3). It was gathered by asking Amazon Mechanical Turk crowd workers to provide alternatives for specified fragments that are simpler, yet preserve the overall meaning and fluency of the sentences.
For example, one of the sentences in the dataset (CASE_1433) reads: "For the fortified city is solitary, a habitation deserted and forsaken, like the wilderness." The replaced fragment, highlighted in bold, has been simplified by crowd workers as "home left", "residence abandoned", "home forgotten", "house empty" and "place empty".
The dataset was created as part of the study described in the article Multi-Word Lexical Simplification, presented at the COLING 2020 conference in Barcelona. If you need any more information, consult the paper or contact its authors!
The file MWLS1.tsv contains the 1462 sentences and 7059 simplifications that make up the dataset. It has the following tab-separated columns:
- Id: unique sentence identifier,
- Group: the length of the replaced text (1-3 words) and the source corpus:
  - BIBLIE: World English Bible translation from a parallel corpus,
  - BIOMED: text of biomedical publications gathered in the CRAFT corpus,
  - EUROPARL: English text from the European Parliament proceedings compiled as the Europarl corpus,
- Prefix: the part of the sentence before the replaced fragment,
- Replaced: the replaced fragment,
- Suffix: the part of the sentence after the replaced fragment,
- NoAnswers: number of replacements provided by the workers (1-5),
- Answers: the replacements.
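The column layout above can be read with a few lines of Python. The following is only a minimal sketch, not part of the dataset distribution: the `Entry` class, the `read_entries` function and the `has_header` switch are hypothetical names introduced for illustration. It assumes that the replacements fill all tab-separated fields after NoAnswers, and that Prefix and Suffix carry their own surrounding whitespace, so a sentence is rebuilt by plain concatenation. Check both assumptions against the actual file before relying on them. The sample row is invented and does not come from MWLS1.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Entry:
    """One row of the file (hypothetical helper, not shipped with the dataset)."""
    id: str
    group: str
    prefix: str
    replaced: str
    suffix: str
    no_answers: int
    answers: List[str]

    def sentence(self, replacement: Optional[str] = None) -> str:
        # ASSUMPTION: Prefix and Suffix already contain any needed spaces,
        # so the parts are concatenated without adding whitespace.
        middle = self.replaced if replacement is None else replacement
        return self.prefix + middle + self.suffix


def read_entries(lines: Iterable[str], has_header: bool = False) -> Iterator[Entry]:
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip a header row, if the file has one
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or incomplete lines
        # ASSUMPTION: every field after NoAnswers is one replacement.
        answers = [a for a in row[6:] if a]
        yield Entry(row[0], row[1], row[2], row[3], row[4], int(row[5]), answers)


# Invented sample row in the assumed layout (not taken from the dataset).
sample = "X1\tGRP\tWe will \tcommence\t soon.\t2\tstart\tbegin\n"
for entry in read_entries(io.StringIO(sample)):
    print(entry.sentence())
    for answer in entry.answers:
        print(entry.sentence(answer))

# For the real file, something like the following should work:
# with open("MWLS1.tsv", encoding="utf-8", newline="") as f:
#     entries = list(read_entries(f))
```

If the Answers column turns out to be a single field with its own internal separator, only the line that builds `answers` needs to change.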
The dataset is released under the CC BY-NC-SA 4.0 licence.
Przybyła, P. and Shardlow, M., 2020. Multi-Word Lexical Simplification. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).
@inproceedings{plainifier,
title = "Multi-Word Lexical Simplification",
author = {Przyby{\l}a, Piotr and Shardlow, Matthew},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)},
month = dec,
year = "2020",
address = "Barcelona, Spain",
publisher = {International Committee on Computational Linguistics},
pages = {1435--1446},
url = {https://www.aclweb.org/anthology/2020.coling-main.0}
}