Long-form synthesis #9

Open
fakerybakery opened this issue Apr 10, 2024 · 3 comments

Comments

@fakerybakery

Hi,
Congrats on the release! Is long-form synthesis planned?
Thank you!

@sanchit-gandhi
Collaborator

Currently we train on audio clips of at most 30 seconds. With @ylacombe we're looking at increasing the context length to support longer audio. ALiBi embeddings (or a variant thereof) look promising for this: https://arxiv.org/abs/2108.12409
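For readers unfamiliar with the linked paper, here is a minimal sketch of the bias it describes: instead of positional embeddings, each attention head adds a penalty to its attention scores that grows linearly with the query-key distance, which is what lets a model extrapolate beyond its training length. The function names `alibi_slopes` and `alibi_bias` are made up for illustration, the sketch assumes a power-of-two head count and causal attention, and it is not taken from this project's code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumption: num_heads is a power of two (the simple case in the paper).
    """
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[head][query][key], to be added to the attention scores.

    Past keys get -slope * distance; future keys get -inf (causal mask).
    """
    neg_inf = float("-inf")
    return [
        [
            [-m * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


if __name__ == "__main__":
    bias = alibi_bias(num_heads=8, seq_len=4)
    # Head 0 has the steepest slope (0.5): the last query sees keys 0..3
    # penalised by their distance from it.
    print(bias[0][3])
```

Because the bias depends only on the distance `i - j`, the same function can be called with a larger `seq_len` at inference time than was used in training.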

As future work, it would be amazing if you could feed an entire chapter of an audiobook to the model and have it learn the prosody and intonation directly from training examples (with no guidance from the text prompt).

@fakerybakery
Author

That would be nice. I was wondering if it would be possible to use chunking, and have previous chunks as context, to make the speech sound natural with different speakers. (This would be nice for audiobooks with multiple characters.)
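The chunking idea could be sketched on the text side roughly as follows: split the input at sentence boundaries into chunks under a size limit, then pair each chunk with the chunk(s) that precede it. The helper names, the character-based limit, and the `context_chunks` parameter are all assumptions for illustration; how a model would actually consume the context (as a text prefix, as previously generated audio, or otherwise) is left open and is not something this thread specifies.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunks_with_context(text, max_chars=200, context_chunks=1):
    """Return (context, chunk) pairs; context is the preceding chunk(s)."""
    chunks = split_into_chunks(text, max_chars)
    return [
        (" ".join(chunks[max(0, i - context_chunks):i]), chunk)
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    story = '"Who is there?" asked Ann. Nobody answered. She opened the door.'
    for context, chunk in chunks_with_context(story, max_chars=30):
        print(repr(context), "->", repr(chunk))
```

The first chunk always has an empty context, so a real pipeline would need some other way to fix the voice for the opening segment.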

@lmxue

lmxue commented May 2, 2024

> Currently we train on audio clips of at most 30 seconds. With @ylacombe we're looking at increasing the context length to support longer audio. ALiBi embeddings (or a variant thereof) look promising for this: https://arxiv.org/abs/2108.12409
>
> As future work, it would be amazing if you could feed an entire chapter of an audiobook to the model and have it learn the prosody and intonation directly from training examples (with no guidance from the text prompt).

Are there any updates about long-form speech synthesis? I'm looking forward to the results.
Also, the future work you mentioned sounds well suited to the audiobook scenario, but I'm curious what the voice would be like. Would it be a pre-defined voice?
