More languages #20

doublex · 2020-12-09T14:26:52Z

Are there any plans to train more languages, e.g. adding this dataset:
https://www.50languages.com/

If it helps, I can provide the MP3 files as ZIP archives.

matiaslindgren · 2020-12-12T14:28:41Z

It would be nice to have some pre-trained models trained on different datasets at some point but I don't think it's going to happen any time soon. If your 50-language dataset is available under a free, public license, I can try to train some models at some point when I have time. The examples might be a useful starting point for training custom models on non-public datasets.

doublex · 2020-12-12T14:38:12Z

A pre-trained model would be great.
The "50 languages" dataset is Creative Commons licensed, so attribution would be required:
https://www.50languages.com/?user_lang=EN
If it helps I can provide a tarball containing the samples.

matiaslindgren · 2020-12-12T15:56:05Z

Does this page contain all the samples? If so, I'm unsure whether the amount of data is enough. Deep-learning-based language identification models usually need at least 5 hours (preferably 20 hours) of speech per language before they become useful. Each language also needs at least 10 different speakers (preferably several hundred).
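As a rough illustration of those thresholds, here is a minimal sketch that checks which languages in a dataset clear both lower bounds. It assumes you already have a list of `(language, speaker_id, duration_seconds)` tuples; the function names and the input layout are hypothetical and not part of this project.

```python
from collections import defaultdict

# Lower bounds taken from the comment above; treat them as rules of thumb.
MIN_HOURS = 5.0
MIN_SPEAKERS = 10


def summarize(utterances):
    """Return {language: (total_hours, speaker_count)}.

    `utterances` is an iterable of (language, speaker_id, duration_seconds)
    tuples. This layout is an assumption made for the example.
    """
    seconds = defaultdict(float)
    speakers = defaultdict(set)
    for lang, spk, dur in utterances:
        seconds[lang] += dur
        speakers[lang].add(spk)
    return {lang: (seconds[lang] / 3600.0, len(speakers[lang])) for lang in seconds}


def usable_languages(utterances, min_hours=MIN_HOURS, min_speakers=MIN_SPEAKERS):
    """List the languages that meet both the hours and the speaker threshold."""
    stats = summarize(utterances)
    return sorted(
        lang
        for lang, (hours, n_speakers) in stats.items()
        if hours >= min_hours and n_speakers >= min_speakers
    )
```

A language with plenty of audio but only one or two speakers is filtered out here, because both conditions have to hold.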

doublex · 2020-12-12T17:05:50Z

You have to open each language page separately.
If it helps, I have downloaded and sorted all ZIP files (3.5 GB):
http://doppelbauer.name/50languages.tar

The proper license:
http://creativecommons.org/licenses/by-nc-nd/3.0/us/

matiaslindgren · 2020-12-12T18:01:51Z

Thanks for preparing the ZIP files. I downloaded the tar file and listened to some of the Finnish and Swedish samples. Unfortunately, there are too few speakers (only 1 or 2) to train a model on this dataset. Last year, I tried training a model on e-books, but there were simply not enough speakers to ensure the model learned the languages. Instead, the model learned only the speakers (or maybe the microphones). I'm quite sure this would also happen with this dataset, even with several hours of data per language.
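One common way to detect this kind of speaker overfitting is to evaluate on speakers that never appear in the training data. The following is a minimal sketch of such a speaker-disjoint split, assuming a list of `(speaker_id, path)` pairs for a single language; the function name and the input layout are hypothetical and not taken from this project.

```python
import random


def speaker_disjoint_split(utterances, test_fraction=0.2, seed=0):
    """Split utterances so that no speaker occurs in both train and test.

    `utterances` is a list of (speaker_id, path) pairs (an assumed layout).
    At least one speaker is always held out for testing.
    """
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test
```

With only 1 or 2 speakers per language, holding out a speaker leaves little or nothing to train on, which is another way of seeing why such a dataset is hard to use for training.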

In any case, this dataset covers a very large number of languages and it could work nicely as an evaluation set for testing trained models on new data. I don't currently have any pre-trained models to test and I'm not sure when I'll have a good one (I'm doing this in my free time). I'm currently working with the Common Voice datasets, and if I happen to get something useful working I can let you know by commenting here.

doublex · 2020-12-12T18:18:06Z

You are right. Common Voice has about 50 languages.
Thanks for this project.

nshmyrev · 2020-12-13T15:02:58Z

There is http://bark.phon.ioc.ee/voxlingua107/ if you like

VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours. The average amount of data per language is 62 hours. However, the real amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.

doublex · 2020-12-13T15:23:14Z

@nshmyrev
Good catch.
