Source code for the paper "A Benchmark for Neural Readability Assessment of Texts in Spanish" at the TSAR Workshop (EMNLP 2022)

lmvasque/readability-es-benchmark

Benchmark for Readability Assessment in Spanish

Official repository for the paper "A Benchmark for Neural Readability Assessment of Texts in Spanish" by @lmvasquezr, @pcuenq, @fireblend and @feralvam.

If you have any questions, please don't hesitate to contact us. Feel free to submit any issues or enhancement requests on GitHub.

Datasets

Our datasets combine texts that are freely available with texts covered by a data license agreement. We have published the freely available datasets on HuggingFace to support further readability studies. Please find the links in the table below:

Dataset          Original Readability Level
HablaCultura     CEFR
kwiziq           CEFR
coh-metrix-esp   simple, complex
CAES             CEFR
Simplext*        simple, complex
Newsela*         School Grade Levels (2-12) and Readability Levels (0-4)
OneStopCorpus    basic, intermediate, advanced

*Please request access to the Newsela and Simplext corpora (the latter via Horacio Saggion); we will be happy to share our splits once you have been granted access.
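Once downloaded, a split can be consumed with standard tooling. Below is a minimal sketch using only the Python standard library; the CSV layout, file names, and the `text`/`level` column names are our illustrative assumptions and may differ from the actual files published on HuggingFace:

```python
import csv
import io
from collections import Counter

# Hypothetical split in CSV form; the real releases may use different
# column names and label values (e.g. CEFR levels vs. simple/complex).
toy_split = """text,level
"El gato duerme.",A1
"La economía mostró señales de recuperación.",B2
"El perro corre.",A1
"""

rows = list(csv.DictReader(io.StringIO(toy_split)))
per_level = Counter(row["level"] for row in rows)
print(per_level)  # prints Counter({'A1': 2, 'B2': 1})
```

The same pattern applies to any of the splits above: read the rows once, then group or filter by the readability label.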

Models

We have released all of our pretrained models on HuggingFace:

Model            Granularity   # classes
BERTIN (ES)      paragraphs    2
BERTIN (ES)      paragraphs    3
mBERT (ES)       paragraphs    2
mBERT (ES)       paragraphs    3
mBERT (EN+ES)    paragraphs    3
BERTIN (ES)      sentences     2
BERTIN (ES)      sentences     3
mBERT (ES)       sentences     2
mBERT (ES)       sentences     3
mBERT (EN+ES)    sentences     3

For the zero-shot setting, we used the original BERTIN and mBERT models with no further training. You can also find our TF-IDF + Logistic Regression approach in model_regression.py, which is based on this implementation.
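For readers unfamiliar with that baseline, a minimal self-contained sketch of a TF-IDF + Logistic Regression classifier follows. The toy sentences, labels, and hyperparameters here are ours for illustration only; model_regression.py is the actual script trained on the benchmark splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy 2-class data (simple vs. complex); the real baseline trains on
# the benchmark datasets listed above.
texts = [
    "El gato duerme en la casa.",
    "El perro corre en el parque.",
    "La volatilidad macroeconómica condiciona la política fiscal.",
    "La jurisprudencia constitucional delimita la potestad legislativa.",
]
labels = ["simple", "simple", "complex", "complex"]

# TF-IDF turns each text into a sparse term-weight vector;
# logistic regression then learns a linear decision boundary.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

pred = clf.predict(["El gato corre en el parque."])[0]
```

A pipeline like this is a useful sanity-check baseline: it is fast to train and gives a lower bound that the neural models should beat.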

Reproducibility

We have published all of our datasets and models on HuggingFace. However, as a reference, we have also included our training and data processing scripts in the source folder.

Citation

If you use our results and scripts in your research, please cite our work: "A Benchmark for Neural Readability Assessment of Texts in Spanish" (to be published)

@inproceedings{vasquez-rodriguez-etal-2022-benchmarking,
    title = "A Benchmark for Neural Readability Assessment of Texts in Spanish",
    author = "V{\'a}squez-Rodr{\'\i}guez, Laura  and
      Cuenca-Jim{\'e}nez, Pedro-Manuel and
      Morales-Esquivel, Sergio Esteban and
      Alva-Manchego, Fernando",
    booktitle = "Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), EMNLP 2022",
    month = dec,
    year = "2022",
}

About the available datasets

We have downloaded the datasets below from their original websites to make them available to the community on HuggingFace. If you use this data, please credit the original authors as well as our work.

CAES

We have extracted the CAES corpus from their website. If you use this corpus, please also cite their work as follows:

@article{Parodi2015,
  author = "Giovanni Parodi",
  title = "Corpus de aprendices de español (CAES)",
  journal = "Journal of Spanish Language Teaching",
  volume = "2",
  number = "2",
  pages = "194-200",
  year  = "2015",
  publisher = "Routledge",
  doi = "10.1080/23247797.2015.1084685",
  URL = "https://doi.org/10.1080/23247797.2015.1084685",
  eprint = "https://doi.org/10.1080/23247797.2015.1084685"
}

Coh-Metrix-Esp (Cuentos)

We have made the dataset collected for the Coh-Metrix-Esp paper available on HuggingFace. If you use their data, please cite their work as follows:

@inproceedings{quispesaravia-etal-2016-coh,
    title = "{C}oh-{M}etrix-{E}sp: A Complexity Analysis Tool for Documents Written in {S}panish",
    author = "Quispesaravia, Andre  and
      Perez, Walter  and
      Sobrevilla Cabezudo, Marco  and
      Alva-Manchego, Fernando",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1745",
    pages = "4694--4698",
}

HablaCultura and Kwiziq

For these datasets, please also give credit to the HablaCultura.com and Kwiziq websites.
