Skip to content

Repository for the BrevE and CLaro datasets.

License

CC-BY-4.0, MIT licenses found

Licenses found

CC-BY-4.0
LICENSE
MIT
LICENSE-CODE
Notifications You must be signed in to change notification settings

microsoft/BrevE-CLaro

BrevE and CLaro

Para español, ve aquí.

This is the repository for our paper "BrevE and CLaro: an evaluation of Spanish text simplification".

  • BrevE is a corpus designed for Spanish complex sentence identification (CSI).
  • CLaro is a corpus for plain language identification (PLI).

Both corpora have been annotated by native Spanish speakers. This is important because:

  • Spanish is a language spoken by almost half a billion people!
  • Spanish morphology is quite different than English, and the sentence structure / word order allows for nuances that are not easily translated.

Text simplification (TS) is an important area of NLP, and is designed to make written material more accessible to a variety of users. However, it has been shown that identifying sentences that need simplification on the first place considerably improves the performance of any TS system. Our work is meant to benchmark this.

See documents for documentation about this dataset, and our paper (link TBD) for our findings around CSI and PLI in Spanish.

If you use BrevE or CLaro in your work, please considering citing our paper:

@misc{dewynter2023usercentered,
      title={A User-Centered Evaluation of Spanish Text Simplification}, 
      author={Adrian de Wynter and Anthony Hevia and Si-Qing Chen},
      year={2023},
      eprint={2308.07556},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

As well as the original works for the corpora we used to build BrevE and CLaro:

@inproceedings{oscar1,
  author    = {Julien Abadji and Pedro Javier Ortiz Su\'{a}rez and Laurent Romary and Beno\^{i}t Sagot},
  title     = {Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event)},
  editor    = {Harald L{\"u}ngen and Marc Kupietz and Piotr Bański and Adrien Barbaresi and Simon Clematide and Ines Pisetta},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-10468},
  url       = {https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104688},
  pages     = {1 -- 9},
  year      = {2021},
}

@inproceedings{oscar2,
  author    = {Pedro Javier {Ortiz Su\'{a}rez} and Beno\^it Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Ba\'nski and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L\"{u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f\"{u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
}

@inproceedings{yimam1,
    title = "Multilingual and Cross-Lingual Complex Word Identification",
    author = "Yimam, Seid Muhie  and
      {\v{S}}tajner, Sanja  and
      Riedl, Martin  and
      Biemann, Chris",
    booktitle = "Proceedings of the International Conference Recent Advances in Natural Language Processing, {RANLP} 2017",
    month = sep,
    year = "2017",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd.",
    url = "https://doi.org/10.26615/978-954-452-049-6_104",
    doi = "10.26615/978-954-452-049-6_104",
    pages = "813--822",
}

@inproceedings{yimam2,
    title = "{CWIG}3{G}2 - Complex Word Identification Task across Three Text Genres and Two User Groups",
    author = "Yimam, Seid Muhie  and
      {\v{S}}tajner, Sanja  and
      Riedl, Martin  and
      Biemann, Chris",
    booktitle = "Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = nov,
    year = "2017",
    address = "Taipei, Taiwan",
    publisher = "Asian Federation of Natural Language Processing",
    url = "https://aclanthology.org/I17-2068",
    pages = "401--407",
}

Updates

  • 5th June 2023: BrevE and CLaro are out!

Contributing

See here.

Legal Notices

See here

About

Repository for the BrevE and CLaro datasets.

Topics

Resources

License

CC-BY-4.0, MIT licenses found

Licenses found

CC-BY-4.0
LICENSE
MIT
LICENSE-CODE

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published