piotrmp/mwls1

MWLS1

MWLS1 (Multi-Word Lexical Simplification dataset 1) is a dataset for lexical simplification (making text easier to understand by replacing some words), in which both the replaced and replacing fragments can consist of multiple words (up to 3). It was gathered by asking Amazon Mechanical Turk crowd workers to provide alternatives for specified fragments that are simpler, yet preserve the overall meaning and fluency of the sentences.

For example, one of the sentences in the dataset (CASE_1433) reads: "For the fortified city is solitary, a habitation deserted and forsaken, like the wilderness." The replaced fragment, "habitation deserted", has been simplified by crowd workers as "home left", "residence abandoned", "home forgotten", "house empty" and "place empty".

The dataset was created as part of the study described in the article Multi-Word Lexical Simplification, presented at the COLING 2020 conference in Barcelona. If you need more information, consult the paper or contact its authors.

Data format

File MWLS1.tsv contains 1462 sentences with 7059 simplifications that make up the dataset. It has the following tab-separated columns:

  • Id: unique sentence identifier,
  • Group indicates the length of the replaced text (1-3 words) and the source corpus:
    • BIBLIE: World English Bible translation from a parallel corpus,
    • BIOMED: Text of biomedical publications gathered in the CRAFT corpus,
    • EUROPARL: English text from the European Parliament proceedings compiled as the Europarl corpus.
  • Prefix: the part of the sentence before the replaced fragment,
  • Replaced: the replaced fragment,
  • Suffix: the part of the sentence after the replaced fragment,
  • NoAnswers: number of replacements provided by the workers (1-5),
  • Answers: the replacements.
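
The column layout above can be read with Python's standard csv module. The following is a minimal sketch, not the authors' loader. It rests on assumptions that this README does not confirm: that the replacements occupy the trailing tab-separated fields of each row, that a header row (if present) starts with Id, and that Prefix, Replaced and Suffix can be rejoined with single spaces. The names read_mwls1 and in_context, the sample row layout and its Group value are hypothetical.

```python
import csv
import io

# The six fixed columns described above; anything after them is treated as answers.
FIXED_COLUMNS = ["Id", "Group", "Prefix", "Replaced", "Suffix", "NoAnswers"]


def read_mwls1(handle):
    """Yield one dict per sentence from a tab-separated MWLS1-style file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and an optional header row (assumed to start with "Id").
        if not row or row[0] == "Id":
            continue
        record = dict(zip(FIXED_COLUMNS, row))
        record["NoAnswers"] = int(record["NoAnswers"])
        # Assumption: each replacement sits in its own trailing field.
        record["Answers"] = [a for a in row[len(FIXED_COLUMNS):] if a]
        yield record


def in_context(record, fragment):
    """Rebuild the sentence with `fragment` in place of the replaced text."""
    parts = (record["Prefix"], fragment, record["Suffix"])
    # Assumption: the three parts are separated by single spaces.
    return " ".join(p.strip() for p in parts if p.strip())


# Hypothetical row built from the CASE_1433 example; the Group value is made up.
SAMPLE = (
    "CASE_1433\tBIBLIE-2\tFor the fortified city is solitary, a\t"
    "habitation deserted\tand forsaken, like the wilderness.\t5\t"
    "home left\tresidence abandoned\thome forgotten\thouse empty\tplace empty\n"
)

records = list(read_mwls1(io.StringIO(SAMPLE)))
for answer in records[0]["Answers"]:
    print(in_context(records[0], answer))
```

To read the real file, pass an open handle instead of the in-memory sample, for example open("MWLS1.tsv", encoding="utf-8", newline=""). If the Answers column turns out to use a different encoding, only the line that builds record["Answers"] needs to change.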

Licence

The dataset is released under the CC BY-NC-SA 4.0 licence.

Citation

Przybyła, P. and Shardlow, M., 2020. Multi-Word Lexical Simplification. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

@inproceedings{plainifier,
    title = "Multi-Word Lexical Simplification",
    author = {Przyby{\l}a, Piotr and Shardlow, Matthew},
    booktitle = {Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)},
    month = dec,
    year = "2020",
    address = "Barcelona, Spain",
    publisher = {International Committee on Computational Linguistics},
    pages = {1435--1446},
    url = {https://www.aclweb.org/anthology/2020.coling-main.0}
}
