Creation of code to load LibriVox and format for the python_speech_features package #46

Closed
kdavis-mozilla opened this issue Oct 5, 2016 · 9 comments

@kdavis-mozilla
Contributor

The code for the TED corpus is in the fork issue2. One should take this code as a starting point.

@reuben
Contributor

reuben commented Oct 6, 2016

LibriVox doesn't have properly aligned transcriptions. Is figuring out a solution for that within the scope of this issue?

@reuben
Contributor

reuben commented Oct 6, 2016

Another alternative would be using existing corpuses (corpi?) extracted from LibriVox like LibriSpeech: http://www.openslr.org/12/

@kdavis-mozilla
Contributor Author

Have you looked at the TED code in issue 2?

@reuben
Contributor

reuben commented Oct 7, 2016

Yep. I started writing a bunch of code for downloading and formatting the LibriVox data directly from the Internet Archive, but after reading the LibriSpeech paper I learned that proper alignment and segmentation are a very large effort. We should probably just use that corpus directly, so I'm gonna do that.

@kdavis-mozilla
Contributor Author

Before you go off on a wild goose chase, please define what you mean by "proper alignment".

@kdavis-mozilla
Contributor Author

Also did you read and understand the Deep Speech paper?

The Deep Speech paper and our code under master use the CTC algorithm, which does not require "alignment" in the sense used for HMM STT engines.

@kdavis-mozilla
Contributor Author

Using LibriSpeech directly is fine; it's actually what I expected from the start. But do not spend time trying to "align" the corpus in the sense used for HMM STT engines. CTC does not require such "alignment".
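
To illustrate why CTC needs no frame-level alignment, here is a minimal sketch of the CTC collapsing rule (merge repeated symbols, then drop blanks). It is not taken from the Deep Speech code; the blank symbol `_` and the helper names `ctc_collapse` and `paths_for` are assumptions made for this example. Many different per-frame label sequences collapse to the same transcript, and CTC sums over all of them, so the training data only needs the transcript, not the timing of each character.

```python
from itertools import product

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(path):
    """Apply the CTC many-to-one mapping: merge repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def paths_for(target, num_frames, alphabet):
    """Brute-force every frame-level path that collapses to `target`."""
    symbols = sorted(set(alphabet) | {BLANK})
    return [
        "".join(p)
        for p in product(symbols, repeat=num_frames)
        if ctc_collapse(p) == target
    ]


# Every 3-frame labelling that CTC treats as the transcript "hi".
paths = paths_for("hi", 3, "hi")
print(paths)
```

The brute-force enumeration is only practical for toy sizes; real implementations compute the same sum with dynamic programming.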

@reuben
Contributor

reuben commented Oct 7, 2016

> Also did you read and understand the Deep Speech paper?

Not as well as I thought I had, evidently! Either that or I'm just abusing the jargon.

I was under the impression that the transcriptions need to have a minimal resemblance to the audio, which the raw LibriVox data, by default, doesn't have. That's as far as my definition of "alignment" went: skipping the initial audio disclaimers, skipping the license header on the Project Gutenberg files, etc.
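
As a rough sketch of the text-side cleanup described above, the following strips everything outside the start and end markers of a Project Gutenberg file. The marker patterns are an assumption (the exact wording varies between files), and `strip_gutenberg_boilerplate` is a hypothetical helper, not something from this repository or from LibriSpeech.

```python
import re

# Assumed marker shapes, e.g. "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***".
# Real files vary in wording, so treat these patterns as a starting point.
START_RE = re.compile(r"^\*\*\*[ \t]*START OF .*?\*\*\*[ \t]*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*[ \t]*END OF .*?\*\*\*[ \t]*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()


sample = (
    "The Project Gutenberg EBook of Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

This only covers the text headers; skipping the spoken LibriVox disclaimer in the audio would need a separate step.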

In any case, we've ended up on the same page, albeit in my case that included a few bumps along the way :P

@lock

lock bot commented Jan 4, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 4, 2019