
How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque.

Data (the BL2MP dataset and pretraining corpora), models and evaluation scripts from our work How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque, accepted at LREC-COLING 2024.

BL2MP (Basque L2 student-based Minimal Pairs):

We introduce BL2MP (Basque L2 student-based Minimal Pairs), designed to assess the grammatical knowledge of language models in Basque, inspired by the BLiMP benchmark. The BL2MP dataset includes examples sourced from the bai&by language academy, derived from essays written by students enrolled there. These instances provide a wealth of authentic and natural grammatical errors, genuine mistakes made by learners, and thus offer a realistic reflection of real-world language errors.

BL2MP is also available on HuggingFace 🤗
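A minimal sketch of loading it with the datasets library, assuming the Hub identifier is orai-nlp/bl2mp (an assumption based on this repository's name):

# Minimal sketch; the Hub identifier "orai-nlp/bl2mp" is an assumption.
from datasets import load_dataset

bl2mp = load_dataset("orai-nlp/bl2mp")
print(bl2mp)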

Pretraining corpora:

We employed three corpora of different sizes (5M, 25M, 125M) in our experiments:

5M, 25M, 125M

We also share the lemmatized counterparts:

5M_lemma, 25M_lemma, 125M_lemma

MLM validation datasets:

MLM_val, MLM_val_lemma

Models

We trained 3 BERT models of different sizes, namely mini, medium and base (with 4, 8 and 12 layers, respectively), with each corpus.

Here we share the best performing checkpoint for each model:

bert_mini_eu_5M, bert_mini_eu_25M, bert_mini_eu_125M

bert_medium_eu_5M, bert_medium_eu_25M, bert_medium_eu_125M

bert_base_eu_5M, bert_base_eu_25M, bert_base_eu_125M

We also share the models trained with the lemmatized version:

bert_medium_eu_5M_lemma, bert_medium_eu_25M_lemma, bert_medium_eu_125M_lemma
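The checkpoints are standard BERT models, so they can be loaded with the transformers library. A minimal sketch, assuming the checkpoints are hosted under the orai-nlp organization on the HuggingFace Hub (the exact model identifier below is an assumption):

# Minimal sketch; the Hub identifier is an assumption.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "orai-nlp/bert_medium_eu_125M"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)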

Evaluation Script usage:

We used minicons, which implements the sentence-scoring method of Salazar et al. (2020), to evaluate our MLMs on BL2MP in a zero-shot setting.

It can be installed with pip:

pip install minicons

Then an MLM is evaluated as follows:

python3 mlm-score.py --input bl2mp.jsonl --output_dir output/ --lm orai-nlp/ElhBERTeu-medium --device cuda:0

There are different versions of the dataset and evaluation script, created for different experiments; all of them use the same call to minicons and differ only in how the input data is read and in the conditions used to filter minimal pairs when computing the final accuracy score. The core scoring call is sketched below.
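As an illustration of that core call, here is a minimal sketch of scoring one minimal pair with minicons' MaskedLMScorer, which implements the pseudo-log-likelihood scoring of Salazar et al. (2020); the example sentences and the accuracy logic are illustrative, not the exact code of mlm-score.py:

# Minimal sketch of pseudo-log-likelihood scoring with minicons (Salazar et al., 2020).
# The example pair is illustrative, not taken from BL2MP.
from minicons import scorer

mlm = scorer.MaskedLMScorer("orai-nlp/ElhBERTeu-medium", "cuda:0")

good = "Gaur etxera joan naiz."  # grammatical sentence
bad = "Gaur etxera joan dut."    # ungrammatical counterpart

# sequence_score returns one score per sentence; summing token
# log-probabilities gives the sentence-level pseudo-log-likelihood.
scores = mlm.sequence_score([good, bad], reduction=lambda x: x.sum(0).item())

# A minimal pair counts as correct if the grammatical sentence scores higher;
# the final accuracy is the fraction of pairs scored correctly.
print(scores, scores[0] > scores[1])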

Authors

Gorka Urbizu [1] [2], Muitze Zulaika [1], Xabier Saralegi [1], Ander Corral [1]

Affiliations:

[1] Orai NLP Technologies

[2] University of the Basque Country

Licensing

Copyright (C) by Orai NLP Technologies.

The corpora, datasets, models and scripts created in this work are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Acknowledgements

If you use these corpora, datasets or models please cite the following paper:

  • G. Urbizu, M. Zulaika, X. Saralegi, A. Corral. How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). May 2024, Torino, Italy.

Contact information

Gorka Urbizu, Muitze Zulaika: {g.urbizu,m.zulaika}@orai.eus
