mingjund/mapudungun-corpus

mapudungun-corpus

This repository contains the cleaned version of the Mapudungun dataset collected for the AVENUE project by CMU, the Chilean Ministry of Education, and the Instituto de Estudios Indígenas at Universidad de La Frontera.

You can download the raw audio data for all files from here.

The TRANSCRIPTION and TRANSLATION directories contain the original transcriptions and translations. The transcription-clean and translation-clean directories contain cleaned versions with the additional annotations removed, for use in speech recognition, synthesis, and machine translation experiments. The scripts that produce these clean versions are in the data-cleaning directory.
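
As a rough illustration of how the two cleaned directories might be used together for machine translation experiments, the sketch below pairs files from transcription-clean and translation-clean. It assumes that corresponding files share the same relative path and base name in both directories, which this README does not confirm. The helper name `pair_clean_files` is hypothetical and is not part of the data-cleaning scripts.

```python
from pathlib import Path


def pair_clean_files(repo_root):
    """Pair files from transcription-clean/ and translation-clean/.

    Assumption (not confirmed by the repository): a transcription and its
    translation share the same relative path and base name.
    """
    root = Path(repo_root)

    def index(dirname):
        base = root / dirname
        if not base.is_dir():
            return {}
        # Map "relative/path/without_extension" -> full path on disk.
        return {
            p.relative_to(base).with_suffix("").as_posix(): p
            for p in base.rglob("*")
            if p.is_file()
        }

    transcriptions = index("transcription-clean")
    translations = index("translation-clean")

    # Keep only the keys present in both directories.
    shared = sorted(transcriptions.keys() & translations.keys())
    return [(transcriptions[key], translations[key]) for key in shared]


if __name__ == "__main__":
    # "." assumes the script is run from the repository root.
    for transcription, translation in pair_clean_files("."):
        print(transcription, translation)
```

Files that exist in only one of the two directories are skipped without warning, so it is worth comparing the number of pairs against the directory listings before relying on the result.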

The training, dev, and test dataset splits for our baseline experiments are listed under dataset_splits.

Baseline Results

Citation

If you use the original raw data, please use the following citation:

@dataset{mapudungun,
	title={Mapudungun Speech Corpus},
	author={Luis Caniupil and Flor Caniupil and Héctor Painequeo and Rosendo Huisca and Hugo Carrasco and Rodolfo M Vega and Lori Levin and Jaime Carbonell}
}

If you use the cleaned dataset or if you compare to our baseline results, please use the following citation:

@misc{duan2019mapudungun,
	author={Mingjun Duan and Carlos Fasola and Sai Krishna Rallabandi and Rodolfo M. Vega and Antonios Anastasopoulos and Lori Levin and Alan W Black},
	title={A Resource for Computational Experiments on Mapudungun},
	note={preprint},
	year={2019}
}
