BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [paper link]

Welcome to this repository where you'll find all you need to evaluate your language model at:

the lexical level using a spot-the-word task (available in audio or phonetic form; see Table 1)
the syntactic level using a grammatical acceptability judgment task (available in audio, phonetic or orthographic form; see Table 2)

Getting started

You'll probably want to start here:

Examples of stimuli

You can listen to example stimuli on this web page.

Word	Pseudo-word	Word	Pseudo-word
hello	lello pello sero dello sello	cookie	kootie koonie roodie rootie boonie

Table 1: Minimal pairs of real and pseudo-words used in the spot-the-word lexical task.

Phenomenon	Sentence example
Adjective-noun order	✓ The good mom. ✗ The mom good.
Noun-verb order	✓ The dragon says. ✗ The says dragon.
Anaphor-gender agreement	✓ The dad cuts himself. ✗ The dad cuts herself.
Anaphor-number agreement	✓ The boys told themselves. ✗ The boys told himself.
Determiner-noun agreement	✓ Each good sister. ✗ Many good sister.
Noun-verb agreement	✓ The prince needs the princess. ✗ The prince need the princess.

Table 2: Minimal pairs of grammatical (✓) and ungrammatical (✗) sentences used in the syntactic task.
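
Both tasks rely on minimal pairs: a model succeeds on a pair when it assigns a higher score to the real word (or grammatical sentence) than to its pseudo-word (or ungrammatical) counterpart. The snippet below is a minimal sketch of that comparison, not the scoring code shipped in this repository. The function name `minimal_pair_accuracy`, the `score` callable, and the toy scores are all hypothetical and only stand in for whatever score (e.g. a log-probability) your own model produces.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (correct, incorrect) pairs where the correct item scores strictly higher.

    `score` is an assumption: any function mapping a stimulus to a number
    where higher means "more probable according to the model".
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    wins = sum(1 for good, bad in pairs if score(good) > score(bad))
    return wins / len(pairs)


# Hypothetical scores standing in for a real model's outputs.
toy_scores = {
    "hello": -2.0, "lello": -5.0, "pello": -1.5,
    "The good mom.": -4.0, "The mom good.": -7.5,
}

# Pairs borrowed from Table 1 (lexical) and Table 2 (syntactic).
lexical_pairs = [("hello", "lello"), ("hello", "pello")]
syntactic_pairs = [("The good mom.", "The mom good.")]

print(minimal_pair_accuracy(lexical_pairs, toy_scores.__getitem__))    # 0.5
print(minimal_pair_accuracy(syntactic_pairs, toy_scores.__getitem__))  # 1.0
```

With a real model, you would replace `toy_scores.__getitem__` with a function that scores each stimulus. How scores are aggregated across voices or pseudo-words in the actual benchmark may differ from this plain average.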

Reproduce the BabySLM benchmark

If you want to go further:

How to cite?

@inproceedings{lavechin2023baby,
  title={BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models},
  author={Lavechin, Marvin and Sy, Yaya and Titeux, Hadrien and Bland{\'o}n, Mar{\'\i}a Andrea Cruz and R{\"a}s{\"a}nen, Okko and Bredin, Herv{\'e} and Dupoux, Emmanuel and Cristia, Alejandrina},
  booktitle={Interspeech},
  year={2023}
}

Additionally, if you use BabyBERTa, please cite:

@inproceedings{huebner2021babyberta,
  title={BabyBERTa: Learning more grammar with small-scale child-directed language},
  author={Huebner, Philip A and Sulem, Elior and Fisher, Cynthia and Roth, Dan},
  booktitle={Proceedings of the 25th conference on computational natural language learning},
  pages={624--646},
  year={2021}
}

If you use the Providence corpus, please cite:

@inproceedings{borschinger2013joint,
  title={A joint model of word segmentation and phonological variation for English word-final /t/-deletion},
  author={B{\"o}rschinger, Benjamin and Johnson, Mark and Demuth, Katherine},
  booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={1508--1516},
  year={2013}
}

If you use the LibriVox corpus, please cite:

@article{kearns2014librivox,
  title={Librivox: Free public domain audiobooks},
  author={Kearns, Jodi},
  journal={Reference Reviews},
  volume={28},
  number={1},
  pages={7--8},
  year={2014},
  publisher={Emerald Group Publishing Limited}
}
