guillaume-wisniewski/olac_grabber

This script downloads a corpus from the Pangloss collection and, soon, from any OLAC repository.

For the moment, you must first download the Pangloss metadata using the script that can be found here.

You can then download the data for a subset of languages using the following command:

python olac_grabber.py  --metadata "metadata_pangloss.xml" --languages "Lazé" --exceptspeakers "Anonyme"
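To make the effect of the two filters concrete, here is a minimal sketch of the selection they appear to describe. It is not the code of olac_grabber.py: the record layout (a dict with `language` and `speakers` keys), the function name `select_records`, and the rule that a recording is dropped as soon as one of its speakers is excluded are all assumptions made for illustration.

```python
def select_records(records, languages, except_speakers=()):
    """Keep records in the requested languages, minus those with an excluded speaker.

    `records` is a hypothetical, simplified view of the metadata:
    each record is a dict with a "language" string and a "speakers" list.
    """
    wanted = set(languages)
    excluded = set(except_speakers)
    return [
        record
        for record in records
        # Keep the record only if its language was requested...
        if record["language"] in wanted
        # ...and none of its speakers is in the excluded set (assumed rule).
        and not excluded.intersection(record["speakers"])
    ]
```

Under these assumptions, `select_records(records, ["Lazé"], ["Anonyme"])` would mirror the command above.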

The script was tested on the Lazé language with the following commands:

* no speaker excluded: python olac_grabber.py --metadata "/home/mfily/Documents/diagnoSTIC_XP/03_make_corpus/metadata_pangloss.xml" --languages "Lazé"

* with excluded speakers: python olac_grabber.py --metadata "/home/mfily/Documents/diagnoSTIC_XP/03_make_corpus/metadata_pangloss.xml" --languages "Lazé" --exceptspeakers "Anonyme"

The difference between the two runs can be seen in the files downloaded_data_lazé_no_exception.csv and downloaded_data_lazé_with_exception.csv.
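One way to inspect that difference is to list the rows that appear in the first CSV file but not in the second. The sketch below uses only the standard `csv` module and makes no assumption about the column names; the helper name `rows_only_in_first` is hypothetical and is not part of this repository.

```python
import csv


def rows_only_in_first(first_path, second_path):
    """Return the rows of the first CSV file that are absent from the second."""
    # Load every row of the second file as a tuple so rows can be stored in a set.
    with open(second_path, newline="", encoding="utf-8") as handle:
        second_rows = {tuple(row) for row in csv.reader(handle)}
    # Keep the rows of the first file that the second file does not contain.
    with open(first_path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle) if tuple(row) not in second_rows]
```

Calling `rows_only_in_first("downloaded_data_lazé_no_exception.csv", "downloaded_data_lazé_with_exception.csv")` should then return the rows that the speaker exclusion removed, provided both files share the same column layout.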

This script was developed as part of the DiagnoSTIC project.

This work was partly funded by Agence de l’Innovation de Défense (grant 2022 65 0079).
