Source code for the paper "A Benchmark for Neural Readability Assessment of Texts in Spanish" at the TSAR Workshop (EMNLP 2022)

lmvasque/readability-es-benchmark

Benchmark for Readability Assessment in Spanish

Official repository for the paper "A Benchmark for Neural Readability Assessment of Texts in Spanish" by @lmvasquezr, @pcuenq, @fireblend and @feralvam.

If you have any questions, please don't hesitate to contact us. Feel free to submit any issues or enhancement requests on GitHub.

Datasets

Our datasets combine texts that are freely available with texts covered by a data license agreement. We have published the freely available datasets on HuggingFace to support further readability studies. Please find the links in the table below:

Dataset          Original Readability Level
HablaCultura     CEFR
kwiziq           CEFR
coh-metrix-esp   simple, complex
CAES             CEFR
Simplext*        simple, complex
Newsela*         School Grade Levels (2-12) and Readability Levels (0-4)
OneStopCorpus    basic, intermediate, advanced

*Please request access to the Newsela and Simplext corpora (the latter via Horacio Saggion); we will be happy to share our splits once you have been granted access.
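Once downloaded, a split can be consumed with standard tooling. Below is a minimal sketch using only the Python standard library; the CSV layout, file names, and the `text`/`level` column names are our illustrative assumptions and may differ from the actual files published on HuggingFace:

```python
import csv
import io
from collections import Counter

# Hypothetical split in CSV form; the real releases may use different
# column names and label values (e.g. CEFR levels vs. simple/complex).
toy_split = """text,level
"El gato duerme.",A1
"La economía mostró señales de recuperación.",B2
"El perro corre.",A1
"""

rows = list(csv.DictReader(io.StringIO(toy_split)))
per_level = Counter(row["level"] for row in rows)
print(per_level)  # prints Counter({'A1': 2, 'B2': 1})
```

The same pattern applies to any of the splits above: read the rows once, then group or filter by the readability label.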

Models

We have released all of our pretrained models on HuggingFace:

Model            Granularity   # classes
BERTIN (ES)      paragraphs    2
BERTIN (ES)      paragraphs    3
mBERT (ES)       paragraphs    2
mBERT (ES)       paragraphs    3
mBERT (EN+ES)    paragraphs    3
BERTIN (ES)      sentences     2
BERTIN (ES)      sentences     3
mBERT (ES)       sentences     2
mBERT (ES)       sentences     3
mBERT (EN+ES)    sentences     3

For the zero-shot setting, we used the original BERTIN and mBERT models with no further training. You can also find our TF-IDF + Logistic Regression approach in model_regression.py, which is based on this implementation.
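For readers unfamiliar with that baseline, a minimal self-contained sketch of a TF-IDF + Logistic Regression classifier follows. The toy sentences, labels, and hyperparameters here are ours for illustration only; model_regression.py is the actual script trained on the benchmark splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy 2-class data (simple vs. complex); the real baseline trains on
# the benchmark datasets listed above.
texts = [
    "El gato duerme en la casa.",
    "El perro corre en el parque.",
    "La volatilidad macroeconómica condiciona la política fiscal.",
    "La jurisprudencia constitucional delimita la potestad legislativa.",
]
labels = ["simple", "simple", "complex", "complex"]

# TF-IDF turns each text into a sparse term-weight vector;
# logistic regression then learns a linear decision boundary.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

pred = clf.predict(["El gato corre en el parque."])[0]
```

A pipeline like this is a useful sanity-check baseline: it is fast to train and gives a lower bound that the neural models should beat.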

Reproducibility

We have published all of our datasets and models on HuggingFace. However, as a reference, we have also included our training and data processing scripts in the source folder.

Citation

If you use our results and scripts in your research, please cite our work: "A Benchmark for Neural Readability Assessment of Texts in Spanish" (to be published)

@inproceedings{vasquez-rodriguez-etal-2022-benchmarking,
    title = "A Benchmark for Neural Readability Assessment of Texts in Spanish",
    author = "V{\'a}squez-Rodr{\'\i}guez, Laura  and
      Cuenca-Jim{\'e}nez, Pedro-Manuel and
      Morales-Esquivel, Sergio Esteban and
      Alva-Manchego, Fernando",
    booktitle = "Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), EMNLP 2022",
    month = dec,
    year = "2022",
}

About the available datasets

We have downloaded the datasets below from their original websites to make them available to the community on HuggingFace. If you use this data, please credit the original authors as well as our work.

CAES

We have extracted the CAES corpus from their website. If you use this corpus, please also cite their work as follows:

@article{Parodi2015,
  author = "Giovanni Parodi",
  title = "Corpus de aprendices de español (CAES)",
  journal = "Journal of Spanish Language Teaching",
  volume = "2",
  number = "2",
  pages = "194-200",
  year  = "2015",
  publisher = "Routledge",
  doi = "10.1080/23247797.2015.1084685",
  URL = "https://doi.org/10.1080/23247797.2015.1084685",
  eprint = "https://doi.org/10.1080/23247797.2015.1084685"
}

Coh-Metrix-Esp (Cuentos)

We have made the dataset collected for the Coh-Metrix-Esp paper available on HuggingFace. If you use their data, please cite their work as follows:

@inproceedings{quispesaravia-etal-2016-coh,
    title = "{C}oh-{M}etrix-{E}sp: A Complexity Analysis Tool for Documents Written in {S}panish",
    author = "Quispesaravia, Andre  and
      Perez, Walter  and
      Sobrevilla Cabezudo, Marco  and
      Alva-Manchego, Fernando",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1745",
    pages = "4694--4698",
}

HablaCultura and Kwiziq

For these datasets, please also give credit to the HablaCultura.com and Kwiziq websites.
