Creation of code to load LibriVox and format for the python_speech_features package #46
Comments
LibriVox doesn't have properly aligned transcriptions. Is figuring out a solution for that within the scope of this issue?
An alternative would be to use an existing corpus extracted from LibriVox, such as LibriSpeech: http://www.openslr.org/12/
Have you looked at the TED code in issue 2?
Yep. I started writing a bunch of code for downloading and formatting the LibriVox data directly from the Internet Archive, but after reading the LibriSpeech paper I learned that proper alignment and segmentation are a very large effort, and we should probably just use that corpus directly, so I'm gonna do that.
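For anyone picking this up, here is a minimal sketch of reading LibriSpeech transcripts. It assumes the layout of the extracted archives, where each chapter directory holds a `*.trans.txt` file with one `<utterance-id> <TRANSCRIPT>` pair per line and a matching `<utterance-id>.flac` next to it. The function names `parse_transcripts` and `iter_utterances` are hypothetical and are not taken from this repository.

```python
from pathlib import Path


def parse_transcripts(text):
    """Parse the contents of one *.trans.txt file into {utterance_id: transcript}."""
    transcripts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first space separates the utterance id from the transcript.
        utt_id, _, transcript = line.partition(" ")
        transcripts[utt_id] = transcript
    return transcripts


def iter_utterances(root):
    """Yield (flac_path, transcript) for every utterance found under root.

    Assumes each audio file sits beside its chapter's *.trans.txt file.
    """
    for trans_file in sorted(Path(root).rglob("*.trans.txt")):
        entries = parse_transcripts(trans_file.read_text(encoding="utf-8"))
        for utt_id, transcript in entries.items():
            yield trans_file.parent / f"{utt_id}.flac", transcript
```

The resulting pairs could then be decoded and handed to whatever feature extraction the training code expects.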
Before you go off on a wild goose chase, please define what you mean by "proper alignment".
Also, did you read and understand the Deep Speech paper? The Deep Speech paper and our code under master use the CTC algorithm, which does not require "alignment" in the sense used for HMM STT engines.
Using LibriSpeech directly is fine; it's actually what I expected from the start. But do not spend time trying to "align" the corpus in the sense used for HMM STT engines. CTC does not require such "alignment".
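To illustrate the point about CTC, here is a small sketch of the collapsing rule it relies on: consecutive repeated symbols are merged and blanks are then dropped, so many different frame-level paths map to the same transcript and no per-frame alignment has to be supplied with the training data. The blank symbol `_` and the function name `ctc_collapse` are assumptions made for this illustration, not names from the Deep Speech code.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this illustration


def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


# Two different frame-level paths that collapse to the same transcript.
print(ctc_collapse("hh_e_ll_llo"))
print(ctc_collapse("_h_ee_l_l_oo_"))
```

The CTC loss sums over all paths that collapse to the target transcript, which is why only the utterance-level text is needed.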
Not as well as I thought I had, evidently! Either that or I'm just abusing the jargon. I was under the impression that the transcriptions need to have a minimal resemblance to the audio, which the raw LibriVox data, by default, doesn't have. That's as far as my definition of "alignment" went: skipping the initial audio disclaimers, skipping the license header on the Project Gutenberg files, etc. In any case, we've ended up on the same page, albeit in my case that included a few bumps along the way :P
The code for the TED corpus is in the fork issue2. One should take this code as a starting point.