This script downloads a corpus from the Pangloss Collection and, soon, from any OLAC repository.

For the moment, you must first download the Pangloss metadata using the script that can be found here.

You can then download data from a subset of languages using the following command:

python olac_grabber.py --metadata "metadata_pangloss.xml" --languages "Lazé" --exceptspeakers "Anonyme"
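To make the effect of the two filters concrete, here is a minimal sketch of the selection the command describes. It is not the code of olac_grabber.py: the function name `select_records`, the record layout (dicts with `id`, `language` and `speakers` keys) and the sample records are all hypothetical, and it assumes that a record is dropped as soon as one of its speakers is in the excluded list.

```python
def select_records(records, languages, except_speakers=()):
    """Keep records in the requested languages, skipping excluded speakers.

    Hypothetical helper: the record layout is an assumption made for
    illustration, not the structure used by olac_grabber.py.
    """
    wanted = set(languages)
    excluded = set(except_speakers)
    selected = []
    for record in records:
        if record["language"] not in wanted:
            continue
        # Assumption: a single excluded speaker is enough to drop the record.
        if excluded.intersection(record["speakers"]):
            continue
        selected.append(record)
    return selected


# Made-up records, for illustration only.
records = [
    {"id": "rec1", "language": "Lazé", "speakers": ["Speaker A"]},
    {"id": "rec2", "language": "Lazé", "speakers": ["Anonyme"]},
    {"id": "rec3", "language": "OtherLanguage", "speakers": ["Speaker B"]},
]

kept = select_records(records, ["Lazé"], ["Anonyme"])
print([record["id"] for record in kept])  # prints ['rec1']
```

Leaving `except_speakers` empty keeps every record in the requested languages, which mirrors running the command without --exceptspeakers.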

The script was tested on the Lazé language, using the following commands:

* No speaker excluded: python olac_grabber.py --metadata "/home/mfily/Documents/diagnoSTIC_XP/03_make_corpus/metadata_pangloss.xml" --languages "Lazé"

* With excluded speakers: python olac_grabber.py --metadata "/home/mfily/Documents/diagnoSTIC_XP/03_make_corpus/metadata_pangloss.xml" --languages "Lazé" --exceptspeakers "Anonyme"

The difference can be seen in the files downloaded_data_lazé_no_exception.csv and downloaded_data_lazé_with_exception.csv.
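One way to inspect that difference is to list the rows that appear in the first file but not in the second. The sketch below is an illustration only: the helper name `rows_only_in_first` is hypothetical, and because the column layout of the CSV files is not documented here, it compares whole rows rather than specific columns.

```python
import csv


def rows_only_in_first(first_csv, second_csv):
    """Return the rows present in first_csv but absent from second_csv."""
    with open(second_csv, newline="", encoding="utf-8") as handle:
        # Rows are stored as tuples so they can be placed in a set.
        second_rows = {tuple(row) for row in csv.reader(handle)}
    with open(first_csv, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle) if tuple(row) not in second_rows]
```

Calling `rows_only_in_first("downloaded_data_lazé_no_exception.csv", "downloaded_data_lazé_with_exception.csv")` would then return the rows that no longer appear once speakers are excluded.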

This script was developed as part of the DiagnoSTIC project.

This work was partly funded by Agence de l’Innovation de Défense (grant 2022 65 0079).
