MarvinLvn/BabySLM

BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [paper link]

Welcome to this repository where you'll find all you need to evaluate your language model at:

  1. the lexical level using a spot-the-word task (available in audio or phonetic form; see Table 1)
  2. the syntactic level using a grammatical acceptability judgment task (available in audio, phonetic or orthographic form; see Table 2)

Getting started

You'll probably want to start here:

Examples of stimuli

Stimuli examples can be listened to on this web page.

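| Word  | Pseudo-word | Word   | Pseudo-word |
|-------|-------------|--------|-------------|
| hello | lello       | cookie | kootie      |
|       | pello       |        | koonie      |
|       | sero        |        | roodie      |
|       | dello       |        | rootie      |
|       | sello       |        | boonie      |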

Table 1: Minimal pairs of real and pseudo-words used in the spot-the-word lexical task.
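
As a rough illustration of how a spot-the-word task like the one in Table 1 can be scored, the sketch below counts how often a model prefers the real word over each of its pseudo-words. It assumes a pairwise accuracy metric and a `score` function that returns a higher value for more probable items; both are assumptions for illustration, and `spot_the_word_accuracy` and `toy_score` are hypothetical names that are not part of this repository.

```python
from typing import Callable, Dict, List


def spot_the_word_accuracy(
    pairs: Dict[str, List[str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (word, pseudo-word) pairs where the real word scores strictly higher."""
    correct = 0
    total = 0
    for word, pseudo_words in pairs.items():
        for pseudo in pseudo_words:
            correct += int(score(word) > score(pseudo))
            total += 1
    return correct / total


# Minimal pairs copied from Table 1.
TABLE_1 = {
    "hello": ["lello", "pello", "sero", "dello", "sello"],
    "cookie": ["kootie", "koonie", "roodie", "rootie", "boonie"],
}

# Hypothetical stand-in for a model's log-probability (not a real model).
TOY_SCORES = {"hello": -2.0, "cookie": -3.0}


def toy_score(item: str) -> float:
    # Unknown items (here, every pseudo-word) get a lower default score.
    return TOY_SCORES.get(item, -5.0)


accuracy = spot_the_word_accuracy(TABLE_1, toy_score)
print(f"Spot-the-word accuracy: {accuracy:.2f}")
```

In practice, `toy_score` would be replaced by whatever probability or pseudo-probability your language model assigns to the audio or phonetic form of each item.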

| Phenomenon                | Sentence example                 |
|---------------------------|----------------------------------|
| Adjective-noun order      | ✓ The good mom.                  |
|                           | ✗ The mom good.                  |
| Noun-verb order           | ✓ The dragon says.               |
|                           | ✗ The says dragon.               |
| Anaphor-gender agreement  | ✓ The dad cuts himself.          |
|                           | ✗ The dad cuts herself.          |
| Anaphor-number agreement  | ✓ The boys told themselves.      |
|                           | ✗ The boys told himself.         |
| Determiner-noun agreement | ✓ Each good sister.              |
|                           | ✗ Many good sister.              |
| Noun-verb agreement       | ✓ The prince needs the princess. |
|                           | ✗ The prince need the princess.  |

Table 2: Minimal pairs of grammatical (✓) and ungrammatical (✗) sentences used in the syntactic task.
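
To make the structure of the syntactic task concrete, here is a minimal sketch that reports accuracy per phenomenon over the pairs in Table 2. It assumes a pair counts as correct when the grammatical sentence receives a strictly higher score than the ungrammatical one. The names `accuracy_by_phenomenon` and `toy_score`, and the bigram-counting scorer itself, are hypothetical placeholders and do not come from this repository.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (phenomenon, grammatical sentence, ungrammatical sentence)
Pair = Tuple[str, str, str]


def accuracy_by_phenomenon(
    pairs: List[Pair],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Per-phenomenon fraction of pairs where the grammatical sentence scores higher."""
    hits: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for phenomenon, good, bad in pairs:
        hits[phenomenon] += int(score(good) > score(bad))
        counts[phenomenon] += 1
    return {p: hits[p] / counts[p] for p in counts}


# Minimal pairs copied from Table 2.
TABLE_2: List[Pair] = [
    ("Adjective-noun order", "The good mom.", "The mom good."),
    ("Noun-verb order", "The dragon says.", "The says dragon."),
    ("Anaphor-gender agreement", "The dad cuts himself.", "The dad cuts herself."),
    ("Anaphor-number agreement", "The boys told themselves.", "The boys told himself."),
    ("Determiner-noun agreement", "Each good sister.", "Many good sister."),
    ("Noun-verb agreement", "The prince needs the princess.", "The prince need the princess."),
]

# Hypothetical toy "model": a handful of attested word bigrams.
# ("each", "good") is deliberately missing, so one phenomenon fails.
ATTESTED_BIGRAMS = {
    ("good", "mom"),
    ("dragon", "says"),
    ("cuts", "himself"),
    ("told", "themselves"),
    ("prince", "needs"),
}


def toy_score(sentence: str) -> float:
    # Score = number of adjacent word pairs found in ATTESTED_BIGRAMS.
    tokens = sentence.lower().rstrip(".").split()
    return float(sum((a, b) in ATTESTED_BIGRAMS for a, b in zip(tokens, tokens[1:])))


results = accuracy_by_phenomenon(TABLE_2, toy_score)
for phenomenon, acc in results.items():
    print(f"{phenomenon}: {acc:.2f}")
```

Reporting one number per phenomenon, rather than a single average, makes it easier to see which grammatical contrasts a model has and has not picked up.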

Reproduce the BabySLM benchmark

If you want to go further:

How to cite?

@inproceedings{lavechin2023baby,
  title={BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models},
  author={Lavechin, Marvin and Sy, Yaya and Titeux, Hadrien and Cruz Bland{\'o}n, Mar{\'\i}a Andrea and R{\"a}s{\"a}nen, Okko and Bredin, Herv{\'e} and Dupoux, Emmanuel and Cristia, Alejandrina},
  booktitle={Interspeech},
  year={2023}
}

Additionally, if you use BabyBERTa, please cite:

@inproceedings{huebner2021babyberta,
  title={BabyBERTa: Learning more grammar with small-scale child-directed language},
  author={Huebner, Philip A and Sulem, Elior and Fisher, Cynthia and Roth, Dan},
  booktitle={Proceedings of the 25th Conference on Computational Natural Language Learning},
  pages={624--646},
  year={2021}
}

If you use the Providence corpus, please cite:

@inproceedings{borschinger2013joint,
  title={A joint model of word segmentation and phonological variation for English word-final /t/-deletion},
  author={B{\"o}rschinger, Benjamin and Johnson, Mark and Demuth, Katherine},
  booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={1508--1516},
  year={2013}
}

If you use the LibriVox corpus, please cite:

@article{kearns2014librivox,
  title={Librivox: Free public domain audiobooks},
  author={Kearns, Jodi},
  journal={Reference Reviews},
  volume={28},
  number={1},
  pages={7--8},
  year={2014},
  publisher={Emerald Group Publishing Limited}
}

About

Behavioral probing of language acquisition models at the lexical and syntactic level
