Using SpeechT5 Large for TTS #51

Open
imranmaj opened this issue May 1, 2023 · 0 comments
Comments

imranmaj commented May 1, 2023

Hello, thank you so much for providing these models and code along with all the documentation. The HuggingFace integration is very helpful for people like me whose specialty is not ML :) I tried out the TTS model available on HuggingFace and the results are very good, but I'm curious what the results would be like with the larger SpeechT5 model.
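
For context, here is a minimal sketch of how the Hub checkpoint can be tried, following the microsoft/speecht5_tts model card (the input sentence, the speaker x-vector index, and the output filename are just placeholders):

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Load the fine-tuned TTS checkpoint and the HiFi-GAN vocoder from the Hub.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Speaker embeddings (x-vectors) from the CMU ARCTIC set hosted on the Hub;
# index 7306 is just one example speaker.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Tokenize the input text, synthesize a mel spectrogram, and vocode to a waveform.
inputs = processor(text="Hello, this is a test of SpeechT5.", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# SpeechT5 generates audio at 16 kHz.
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```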

My goal is to prepare the SpeechT5 Large model (60k hrs Libri-Light + LibriSpeech LM Dataset) for TTS in the same way that the smaller model on HuggingFace is tuned for TTS. I'm a little confused, though, about how the training was done for the smaller model to prepare it for TTS. I looked at the manifest and it says: "speecht5_tts.pt are reimplemented Text-to-Speech fine-tuning on the released manifest but with a smaller batch size or max updates (Ensure the manifest is ok)." Does this mean that the SpeechT5 TTS model was completely retrained from scratch with a different batch size/max updates, or was it fine-tuned from the SpeechT5 Base model (960 hrs LibriSpeech + LibriSpeech LM Dataset)?

The manifest also says: "This manifest is an attempt to recreate the Text-to-Speech recipe used for training SpeechT5. This manifest was constructed using LibriTTS clean datasets, including train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation." Does this mean that it was trained from scratch using 100 + 360 = 460 hours of LibriTTS data, or was it fine-tuned on those 460 hours of data?

Thank you!
