Trained on 100 million words and still in shape: BERT meets British National Corpus


David Samuel, Andrey Kutuzov, Lilja Øvrelid and Erik Velldal

University of Oslo
Language Technology Group


Paper
HuggingFace models



This is the official repository for our EACL 2023 paper on pre-training language models on a representative 100M-word corpus. We propose a data-efficient LM architecture (LTG-BERT) that outperforms the original BERT model. We believe that modestly sized but representative corpora of this type have great potential as language modeling benchmarks.
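
As a minimal usage sketch (not one of the repository scripts), the pretrained models can be loaded through the HuggingFace transformers API. The checkpoint identifier below is a placeholder assumption; substitute the actual name from the HuggingFace models link above. Passing trust_remote_code=True is expected to be needed because the LTG-BERT architecture is provided by the custom wrapper (modeling_ltgbert.py) rather than by the core library.

# A hedged sketch, not an official example from this repository.
# The checkpoint name is a placeholder; see the "HuggingFace models" link
# for the actual identifiers.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "ltg/ltg-bert-bnc"  # placeholder identifier (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code=True lets transformers load the custom LTG-BERT
# architecture shipped with the checkpoint (cf. modeling_ltgbert.py).
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Quick sanity check: fill in a masked token.
text = f"The capital of Norway is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_position = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode(predicted_id))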



Content of this repository

  • ./modeling_ltgbert.py: HuggingFace-compatible wrapper for LTG-BERT
  • ./preprocessing/: Scripts for processing the XML format of BNC and for processing the evaluation datasets
  • ./training/: Scripts for training LTG-BERT on processed BNC
  • ./evaluation/: Scripts for evaluating LTG-BERT on (Super)GLUE, edge probing and BLiMP


Please cite the following publication:

@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    pages = "1954--1974",
    abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
}