Benchmark for Automatic Keyword Extraction in Spanish: Datasets and Methods

Tasks such as document indexing and information retrieval still rely heavily on keywords, even in the era of LLMs. However, work on automatic keyword extraction, and the training sets it requires, remains scarce for languages other than English. To the best of our knowledge, no datasets for keyword extraction in Spanish are publicly available for training or evaluation purposes. Additionally, recent keyword extraction methods that rely on language models are not being adapted to language models for other languages. To mitigate this situation, this work proposes a method to translate into Spanish two of the main gold standard datasets used by the community, while preserving semantics and terms. The main state-of-the-art methods are then evaluated against the new translated datasets. The methods used for the evaluation have been configured or re-implemented for Spanish.

Installation

Requires Python 3.9.

Libraries and versions are listed in requirements.txt.

Also remember to download the spaCy model for Spanish:

```
python -m spacy download es_core_news_sm
```
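
To confirm that the setup described above is in place, a small check can help. The following is a minimal sketch and not part of the repository: `check_environment` is a hypothetical helper name, and it assumes the Spanish model is installed as an importable package named `es_core_news_sm`, which is how `spacy download` normally installs models. It uses only the standard library, so it runs even when spaCy is missing.

```python
import importlib.util
import sys


def check_environment(model_name="es_core_news_sm"):
    """Report whether the interpreter and packages match the installation notes.

    Hypothetical helper, not part of the repository.
    """
    return {
        # The installation notes target Python 3.9.
        "python_3_9": sys.version_info[:2] == (3, 9),
        # find_spec returns None when a top-level package is not installed.
        "spacy_installed": importlib.util.find_spec("spacy") is not None,
        "model_installed": importlib.util.find_spec(model_name) is not None,
    }


if __name__ == "__main__":
    for check, ok in check_environment().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

Each entry is a boolean, so the result can also be used in a script that stops early when a requirement is missing.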

Structure of the repo

The main folder with results is 'datasets'. Inside the directory:

Source: Original datasets
Target: Final translated datasets
doc_translation: Intermediate results, obtained by different translators.
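
The layout above can be inspected with a short script. This is a minimal sketch under stated assumptions: `list_dataset_files` is a hypothetical helper, and the subfolder names `source`, `target` and `doc_translation` are assumed spellings of the entries listed above, so adjust them to match the actual directory names on disk.

```python
from pathlib import Path


def list_dataset_files(root="datasets",
                       subfolders=("source", "target", "doc_translation")):
    """Map each subfolder name to the sorted file names found directly in it.

    Hypothetical helper; the folder names are assumptions, not confirmed paths.
    """
    listing = {}
    for name in subfolders:
        folder = Path(root) / name
        if folder.is_dir():
            listing[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            # A missing folder yields an empty list rather than an error.
            listing[name] = []
    return listing


if __name__ == "__main__":
    for name, files in list_dataset_files().items():
        print(f"{name}: {len(files)} file(s)")
```

Returning an empty list for a missing folder keeps the script usable on a partial checkout, at the cost of not distinguishing an absent folder from an empty one.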

