
How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque.

Data (the BL2MP dataset and pretraining corpora), models and evaluation scripts from our work How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque, accepted at LREC-COLING 2024.

BL2MP (Basque L2 student-based Minimal Pairs):

We introduce BL2MP (Basque L2 student-based Minimal Pairs), designed to assess the grammatical knowledge of language models in Basque, inspired by the BLiMP benchmark. The BL2MP dataset includes examples sourced from the bai&by language academy, derived from essays written by students enrolled there. These instances provide a wealth of authentic and natural grammatical errors, genuine mistakes made by learners, and thus offer a realistic reflection of real-world language errors.

BL2MP is also available on HuggingFace 🤗
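A minimal sketch of loading it with the datasets library, assuming the Hub identifier is orai-nlp/bl2mp (an assumption based on this repository's name):

# Minimal sketch; the Hub identifier "orai-nlp/bl2mp" is an assumption.
from datasets import load_dataset

bl2mp = load_dataset("orai-nlp/bl2mp")
print(bl2mp)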

Pretraining corpora:

We employed three corpora of different sizes (5M, 25M, 125M) in our experiments:

5M, 25M, 125M

We also share the lemmatized counterparts:

5M_lemma, 25M_lemma, 125M_lemma

MLM validation datasets:

MLM_val, MLM_val_lemma

Models

We trained 3 BERT models of different sizes, namely mini, medium and base (with 4, 8 and 12 layers, respectively), with each corpus.

Here we share the best performing checkpoint for each model:

bert_mini_eu_5M, bert_mini_eu_25M, bert_mini_eu_125M

bert_medium_eu_5M, bert_medium_eu_25M, bert_medium_eu_125M

bert_base_eu_5M, bert_base_eu_25M, bert_base_eu_125M

We also share the models trained with the lemmatized version:

bert_medium_eu_5M_lemma, bert_medium_eu_25M_lemma, bert_medium_eu_125M_lemma
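The checkpoints are standard BERT models, so they can be loaded with the transformers library. A minimal sketch, assuming the checkpoints are hosted under the orai-nlp organization on the HuggingFace Hub (the exact model identifier below is an assumption):

# Minimal sketch; the Hub identifier is an assumption.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "orai-nlp/bert_medium_eu_125M"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)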

Evaluation Script usage:

We used minicons, which implements the sentence-scoring method of Salazar et al. (2020), to evaluate our MLMs on BL2MP in a zero-shot setting.

It can be installed with pip:

pip install minicons

Then an MLM is evaluated as follows:

python3 mlm-score.py --input bl2mp.jsonl --output_dir output/ --lm orai-nlp/ElhBERTeu-medium --device cuda:0

There are different versions of the dataset and evaluation script, created for different experiments; all of them use the same call to minicons and differ only in how the input data is read and in the conditions used to filter minimal pairs when computing the final accuracy score. The core scoring call is sketched below.
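As an illustration of that core call, here is a minimal sketch of scoring one minimal pair with minicons' MaskedLMScorer, which implements the pseudo-log-likelihood scoring of Salazar et al. (2020); the example sentences and the accuracy logic are illustrative, not the exact code of mlm-score.py:

# Minimal sketch of pseudo-log-likelihood scoring with minicons (Salazar et al., 2020).
# The example pair is illustrative, not taken from BL2MP.
from minicons import scorer

mlm = scorer.MaskedLMScorer("orai-nlp/ElhBERTeu-medium", "cuda:0")

good = "Gaur etxera joan naiz."  # grammatical sentence
bad = "Gaur etxera joan dut."    # ungrammatical counterpart

# sequence_score returns one score per sentence; summing token
# log-probabilities gives the sentence-level pseudo-log-likelihood.
scores = mlm.sequence_score([good, bad], reduction=lambda x: x.sum(0).item())

# A minimal pair counts as correct if the grammatical sentence scores higher;
# the final accuracy is the fraction of pairs scored correctly.
print(scores, scores[0] > scores[1])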

Authors

Gorka Urbizu [1] [2], Muitze Zulaika [1], Xabier Saralegi [1], Ander Corral [1]

Affiliations:

[1] Orai NLP Technologies

[2] University of the Basque Country

Licensing

Copyright (C) by Orai NLP Technologies.

The corpora, datasets, models and scripts created in this work are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Acknowledgements

If you use these corpora, datasets or models please cite the following paper:

  • G. Urbizu, M. Zulaika, X. Saralegi, A. Corral. How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). May 2024, Torino, Italy.

Contact information

Gorka Urbizu, Muitze Zulaika: {g.urbizu,m.zulaika}@orai.eus
