About TTS resume #94

Closed
arieszhang1994 opened this issue Jan 8, 2024 · 5 comments

Hi, I found that the TTS resume code is at
https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L140
and
https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L302

However, _accelerator_prepare is only called afterwards, at
https://github.com/open-mmlab/Amphion/blob/main/models/tts/base/tts_trainer.py#L145

So when resume_type == "resume", the self._check_resume function does not seem to work.

Is there something I missed?
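
The ordering concern can be illustrated with a toy sketch. It assumes, as one plausible reading of the report, that the checkpoint loader only restores objects that a prepare step has already registered. `ToyAccelerator` is a hypothetical stand-in written for this illustration; it is not Amphion's trainer and not a real library API.

```python
class ToyAccelerator:
    """Hypothetical stand-in: load_state only touches objects registered by prepare()."""

    def __init__(self):
        self._registered = []

    def prepare(self, obj):
        # Register the object so that a later load_state can find it.
        self._registered.append(obj)
        return obj

    def load_state(self, checkpoint):
        # Restore saved values into every registered object, in order.
        for obj, saved in zip(self._registered, checkpoint):
            obj.update(saved)
        return len(self._registered)  # how many objects were restored


checkpoint = [{"step": 1000}]

# Case 1: resume first, prepare afterwards. Nothing is registered yet,
# so the load restores nothing and the model keeps its fresh state.
acc_early = ToyAccelerator()
model_early = {"step": 0}
restored_early = acc_early.load_state(checkpoint)
model_early = acc_early.prepare(model_early)

# Case 2: prepare first, resume afterwards. The model is registered,
# so the saved state reaches it.
acc_late = ToyAccelerator()
model_late = acc_late.prepare({"step": 0})
restored_late = acc_late.load_state(checkpoint)

print(restored_early, model_early["step"])
print(restored_late, model_late["step"])
```

Under that assumption, the resume check would need to run after the prepare step to have any effect.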

arieszhang1994 (Author) commented Jan 12, 2024

On a separate issue, I am confused by phon_id_collator.get_phone_id_sequence.
When I run the inference process with
sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
    --config ckpts/tts/vits_ljspeech/args.json \
    --infer_expt_dir ckpts/tts/vits_ljspeech/ \
    --infer_output_dir ckpts/tts/vits_ljspeech/result \
    --infer_mode "single" \
    --infer_text "This is a clip of generated speech with the given text from a TTS model."

in https://github.com/open-mmlab/Amphion/blob/main/models/tts/vits/vits_inference.py#L116

the text is 'This is a clip of generated speech with the given text from a TTS model.'
the phone_seq is '['DH', 'IH0', 'S', 'IH0', 'Z', 'AH0', 'K', 'L', 'IH1', 'P', 'AH0', 'V', 'JH', 'EH1', 'N', 'ER0', 'EY2', 'T', 'AH0', 'D', 'S', 'P', 'IY1', 'CH', 'W', 'IH0', 'DH', 'DH', 'AH0', 'G', 'IH1', 'V', 'AH0', 'N', 'T', 'EH1', 'K', 'S', 'T', 'F', 'ER0', 'M', 'AH0', 'T', 'IY1', 'EH1', 'N', 'IY1', 'S', 'M', 'AA1', 'D', 'AH0', 'L']'
however, the phone_id_seq is [41, 45, 11, 46, 45, 63, 42, 55, 52, 11, 56, 11, 46, 45, 63, 42, 55, 52, 11, 63, 11, 38, 45, 63, 42, 55, 52, 11, 48, 11, 49, 11, 46, 45, 52, 51, 42, 11, 53, 11, 38, 45, 63, 42, 55, 52, 11, 59, 11, 47, 45, 11, 42, 45, 52, 51, 42, 11, 51, 11, 42, 55, 63, 42, 55, 52, 11, 42, 62, 57, 60, 52, 11, 57, 11, 38, 45, 63, 42, 55, 52, 11, 41, 11, 56, 11, 53, 11, 46, 62, 52, 51, 42, 11, 40, 45, 11, 60, 11, 46, 45, 63, 42, 55, 52, 11, 41, 45, 11, 41, 45, 11, 38, 45, 63, 42, 55, 52, 11, 44, 11, 46, 45, 52, 51, 42, 11, 59, 11, 38, 45, 63, 42, 55, 52, 11, 51, 11, 57, 11, 42, 45, 52, 51, 42, 11, 48, 11, 56, 11, 57, 11, 43, 11, 42, 55, 63, 42, 55, 52, 11, 50, 11, 38, 45, 63, 42, 55, 52, 11, 57, 11, 46, 62, 52, 51, 42, 11, 42, 45, 52, 51, 42, 11, 51, 11, 46, 62, 52, 51, 42, 11, 56, 11, 50, 11, 38, 38, 52, 51, 42, 11, 41, 11, 38, 45, 63, 42, 55, 52, 11, 49]

when I run
text.sequence_to_text(phone_id_seq)
the result is
dh ihzero s ihzero z ahzero k l ihone p ahzero v jh ehone n erzero eytwo t ahzero d s p iyone ch w ihzero dh dh ahzero g ihone v ahzero n t ehone k s t f erzero m ahzero t iyone ehone n iyone s m aaone d ahzero l

Does Amphion do this on purpose?
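
The reported IDs are consistent with the phone strings being cleaned as ordinary English text and then encoded one character at a time. The sketch below reproduces the first few IDs under two assumptions that are not confirmed by the thread: a Tacotron-2-style symbol table (pad, hyphen, punctuation, then letters) and a simplified cleaner that lowercases the text and spells out digits. All names in it are hypothetical.

```python
import re

# Assumed symbol table layout (an assumption that happens to match the IDs above).
_pad = "_"
_special = "-"
_punctuation = "!'(),.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
SYMBOLS = [_pad] + list(_special) + list(_punctuation) + list(_letters)
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}
ID_TO_SYMBOL = {i: s for s, i in SYMBOL_TO_ID.items()}

# ARPAbet stress markers only use the digits 0, 1 and 2.
_DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two"}


def clean_like_english_text(text):
    """Simplified stand-in for an English text cleaner: lowercase, spell out digits."""
    text = text.lower()
    return re.sub(r"[0-9]", lambda m: _DIGIT_WORDS.get(m.group(0), m.group(0)), text)


def encode_as_characters(phone_seq):
    """Join phones with spaces, clean them as text, and map each character to an ID."""
    cleaned = clean_like_english_text(" ".join(phone_seq))
    return [SYMBOL_TO_ID[ch] for ch in cleaned]


def decode(ids):
    """Map IDs back to characters."""
    return "".join(ID_TO_SYMBOL[i] for i in ids)


ids = encode_as_characters(["DH", "IH0", "S"])
print(ids)          # [41, 45, 11, 46, 45, 63, 42, 55, 52, 11, 56]
print(decode(ids))  # dh ihzero s
```

With this table, three phones become eleven IDs, which matches the start of the phone_id_seq quoted above and explains why it is so much longer than phone_seq.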

lmxue (Collaborator) commented Jan 16, 2024

Thanks for your feedback. Please check PR #108.

arieszhang1994 (Author) commented Jan 16, 2024

Thank you!
Besides, could you check the second issue I mentioned? I tried adding
phones = " ".join(phone_seq)
phones = "{" + phones + "}"
phone_seq = phones.split(" ")
after this line:

phones_seq = phones.split(" ")

and retrained a new VITS model.

I also made the same change in the inference code. However, the retrained model fails to synthesize intelligible English, although the loss seems normal (it dropped to 37). The generated demo sounds as if the phone embeddings haven't been trained.
It's strange: I have been debugging for several days and still can't find the reason.
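
One detail of the snippet above that may be worth checking, shown here as plain Python behavior rather than a diagnosis: wrapping the joined string in braces and then splitting on spaces leaves the braces attached to the first and last tokens. How the downstream code treats those tokens is not shown in this thread.

```python
phone_seq = ["DH", "IH0", "S"]

# Same three steps as in the comment above, applied to a short sequence.
phones = " ".join(phone_seq)
phones = "{" + phones + "}"
tokens = phones.split(" ")

print(tokens)  # ['{DH', 'IH0', 'S}']
```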

lmxue added a commit that referenced this issue Jan 17, 2024
* Fix bug for VITS resuming training. Related issue #94

lmxue (Collaborator) commented Feb 15, 2024

When cfg.preprocess.phone_extractor == "lexicon", we convert text to a phone sequence based on the dictionary defined in https://raw.githubusercontent.com/open-mmlab/Amphion/main/text/lexicon/librispeech-lexicon.txt.
For the conversion from phone sequence to phone ID sequence, we currently use the phoneme set from https://github.com/HarryHe11/vc-dev/blob/main/text/symbols.py. However, it should use the phoneme set from librispeech-lexicon.txt. I'll refactor this part. Thanks for your feedback.
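
A minimal sketch of what a phone-level mapping derived from a lexicon could look like, so that each phone gets exactly one ID. This is not the planned refactor. It assumes lexicon lines of the form `WORD PH1 PH2 ...`, uses a few made-up sample lines instead of the real file, and all function names are hypothetical.

```python
def build_phone_vocab(lexicon_lines):
    """Collect every phone that appears after the word column and assign stable IDs."""
    phones = set()
    for line in lexicon_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phones.update(parts[1:])
    # ID 0 is reserved for padding; remaining phones are sorted for reproducibility.
    return {p: i for i, p in enumerate(["<pad>"] + sorted(phones))}


def encode_phones(phone_seq, phone_to_id):
    """Map each phone to a single ID (raises KeyError on unknown phones)."""
    return [phone_to_id[p] for p in phone_seq]


# Made-up sample lines in the assumed format, for illustration only.
SAMPLE_LEXICON = [
    "THIS DH IH1 S",
    "THIS DH IH0 S",
    "IS IH1 Z",
    "IS IH0 Z",
    "A AH0",
]

vocab = build_phone_vocab(SAMPLE_LEXICON)
phone_ids = encode_phones(["DH", "IH0", "S", "IH0", "Z", "AH0"], vocab)
print(phone_ids)  # [2, 3, 5, 3, 6, 1]
```

With a mapping like this, the ID sequence has the same length as the phone sequence, and stress digits stay part of the phone symbol instead of being spelled out.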

HarryHe11 (Collaborator) commented

Hi @arieszhang1994, if you have any further questions about the TTS resume, feel free to reopen this issue. We are glad to follow up!
