time dimension doesn't match #11

Open
MingjieChen opened this issue Jan 24, 2022 · 24 comments

@MingjieChen

```
Training:   0%|          | 0/200000 [00:00<?, ?it/s]
Epoch 1:   0%|          | 0/454 [00:00<?, ?it/s]
Prepare training ...
Number of StyleSpeech Parameters: 28197333
Removing weight norm...
Traceback (most recent call last):
  File "train.py", line 224, in <module>
    main(args, configs)
  File "train.py", line 98, in main
    output = (None, None, model((batch[2:-5])))
  File "/share/mini1/sw/std/python/anaconda3-2019.07/v3.7/envs/StyleSpeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/share/mini1/sw/std/python/anaconda3-2019.07/v3.7/envs/StyleSpeech/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/share/mini1/sw/std/python/anaconda3-2019.07/v3.7/envs/StyleSpeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/share/mini1/res/t/vc/studio/timap-en/libritts/StyleSpeech/model/StyleSpeech.py", line 144, in forward
    d_control,
  File "/share/mini1/res/t/vc/studio/timap-en/libritts/StyleSpeech/model/StyleSpeech.py", line 88, in G
    d_control,
  File "/share/mini1/sw/std/python/anaconda3-2019.07/v3.7/envs/StyleSpeech/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/share/mini1/res/t/vc/studio/timap-en/libritts/StyleSpeech/model/modules.py", line 417, in forward
    x = x + pitch_embedding
RuntimeError: The size of tensor a (132) must match the size of tensor b (130) at non-singleton dimension 1
Training:   0%|          | 1/200000 [00:02<166:02:12, 2.99s/it]
```

I think it might be because of the MFA setup I used.
As mentioned in https://montreal-forced-aligner.readthedocs.io/en/latest/getting_started.html, I installed MFA through conda.

Then I ran

```
mfa align raw_data/LibriTTS lexicon/librispeech-lexicon.txt english preprocessed_data/LibriTTS
```

instead of the way you showed. I can't find a way to run it the way you showed, because I installed MFA through conda.
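
For what it's worth, a minimal sanity check of the conda MFA output (my own sketch, not from the repo): it assumes the `tgt` package and the output directory from the command above, and just confirms each utterance got a TextGrid with a "phones" tier.

```python
# Hedged sketch: verify `mfa align` produced a "phones" tier per utterance.
import os
import tgt  # pip install tgt

tg_dir = "preprocessed_data/LibriTTS"  # output dir from the command above
for root, _, files in os.walk(tg_dir):
    for name in files:
        if not name.endswith(".TextGrid"):
            continue
        tg = tgt.io.read_textgrid(os.path.join(root, name))
        phones = tg.get_tier_by_name("phones")
        print(name, len(phones.intervals))  # aligned phone count per utterance
```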

@MingjieChen
Author

Can you give some details about how you installed MFA? MFA installed via `conda install mfa -c conda-forge` doesn't support the `mfa_align` command line.

@mmgn123

mmgn123 commented Feb 9, 2022

I am facing the same problem here:

```
    result = self.forward(*input, **kwargs)
  File "~/StyleSpeech/model/modules.py", line 420, in forward
    x = x + pitch_embedding
RuntimeError: The size of tensor a (124) must match the size of tensor b (184) at non-singleton dimension 1
```

Any solution?

@keonlee9420
Owner

Thanks @MingjieChen and @mmgn123 for your reports. The MFA part of the README definitely needs updating since MFA itself has been updated recently, but even so, you should have no problem as long as you got the TextGrid files from your conda version of MFA and then preprocessed the dataset with them.

So is this the process you took?
conda MFA installation -> align the dataset (and get TextGrid files) -> preprocess with the TextGrids -> run train.py

If so, the issue you mention should not occur. If not, please let me know what your process was.

@mmgn123

mmgn123 commented Feb 9, 2022

Exactly, I installed MFA and then got the TextGrid files.
After running the preprocessing with the TextGrids, I got the energy, pitch, mel, and duration folders as well as the train.txt and val.txt files (but no train_filtered.txt file).
But after that, when running train.py, I get this error.
When I print the shapes of x and pitch_embedding, I get [16, 125, 256] and [16, 198, 256] respectively.

@Yaccoub

Yaccoub commented Feb 9, 2022

I'm also still facing the same tensor mismatch problem. Thanks in advance for your help.

@keonlee9420
Owner

@mmgn123 @Yaccoub what dataset are you using?

@mmgn123

mmgn123 commented Feb 9, 2022

I am actually using the mls_german dataset, which can be found here: http://www.openslr.org/94/.
I brought it into the same format as LibriTTS train-clean-100, then ran prepare_align.py; it worked and I got the raw_data.
After that I ran

```
mfa train raw_data/mls/ german-lexicon.txt german_acoustic_model.zip
```

to get german_acoustic_model.zip, then

```
mfa align raw_data/mls/ german-lexicon.txt german_acoustic_model.zip preprocessed_data/mls
```

which produced the TextGrid files.

@keonlee9420
Owner

Thanks for the info. Can you print out the shapes of duration, pitch, energy, and mel just before this line while running preprocessor.py? If you set "phoneme_level" for both pitch and energy in preprocess.yaml, the lengths of duration, pitch, and energy should be the same.
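
Something like this sketch would do (the argument names are mine, not necessarily the preprocessor's locals):

```python
import numpy as np

def report_lengths(duration, pitch, energy, mel):
    # Call this just before the linked line in preprocessor.py.
    print("duration:", np.shape(duration), "sum:", int(np.sum(duration)))
    print("pitch:   ", np.shape(pitch))
    print("energy:  ", np.shape(energy))
    print("mel:     ", np.shape(mel))
```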

@mmgn123

mmgn123 commented Feb 9, 2022

That's correct, the lengths of duration, pitch, and energy are the same: 122 in my case.

@keonlee9420
Owner

And is the sum of the durations equal to the length of the mel?

@mmgn123

mmgn123 commented Feb 9, 2022

Exactly, the sum of the durations is 987 and the shape of the mel is (80, 987).
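
A sketch of how one could verify this across the whole dataset (hedged: the `fname.replace(...)` trick assumes FastSpeech2-style `{speaker}-{feature}-{utterance}.npy` naming):

```python
import os
import numpy as np

base = "preprocessed_data/mls"
for fname in os.listdir(os.path.join(base, "duration")):
    duration = np.load(os.path.join(base, "duration", fname))
    pitch = np.load(os.path.join(base, "pitch", fname.replace("duration", "pitch")))
    energy = np.load(os.path.join(base, "energy", fname.replace("duration", "energy")))
    mel = np.load(os.path.join(base, "mel", fname.replace("duration", "mel")))
    assert len(pitch) == len(duration) == len(energy), fname
    assert duration.sum() in mel.shape, fname  # total frames match one mel axis
```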

@keonlee9420
Owner

Also, does the phone sequence here have the same length as duration, pitch, and energy? If so, I think there was no issue with MFA during preprocessing.

@mmgn123

mmgn123 commented Feb 9, 2022

Yes, the lengths are the same!

@keonlee9420
Owner

Thanks for checking! OK, then we can confirm that the data is processed correctly. Now, we can think of these:

  1. During data loading, can you check that every element of the input comes from the same filename (such as in here) and that the elements have the same length as each other (see the sketch below)? A length mismatch such as 124 vs. 184 in your log can come from the elements being loaded from different source files. But in @MingjieChen's case, where the two tensors are 130 and 132, the discrepancy can come from a version mismatch in the pitch extractor (the padding rule might differ and hence output a slightly different length); in that case, upgrading/downgrading the module could resolve the issue.
  2. Did you change some parts of the model architecture? The line that raised the issue was 420 in your log, but it's actually line 422 as here, so I guess there are some modifications on the model-side code.
  3. Other than that, I cannot think of any other reason without seeing your code, sorry ;(
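
For point 1, this sketch shows the kind of check I mean (all names are illustrative, not taken from the repo's dataset.py):

```python
import numpy as np

def check_sample(basename, phone, duration, pitch, energy, mel):
    # Assert per-utterance consistency before batching, e.g. from __getitem__.
    n = len(phone)
    assert len(duration) == n, f"{basename}: duration {len(duration)} != phone {n}"
    assert len(pitch) == n, f"{basename}: pitch {len(pitch)} != phone {n}"
    assert len(energy) == n, f"{basename}: energy {len(energy)} != phone {n}"
    assert int(np.sum(duration)) in mel.shape, f"{basename}: mel frames != sum(duration)"
```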

@mmgn123

mmgn123 commented Feb 9, 2022

1- Here, duration, pitch, and energy do have the same length, except for quary_duration, which has a different length.
2- Those are two print lines I added.
Do you mean upgrading/downgrading MFA?

@keonlee9420
Owner

1- And does phone here also have the same length as pitch (and duration and energy)?
2- I see, so no modification at all.
The MFA version might be an issue, but if it already matches, then you can ignore it.

@mmgn123

mmgn123 commented Feb 9, 2022

phone here has a different length

@keonlee9420
Owner

Gotcha. I should have mentioned this first: you have to modify /text for cases like yours where the target language is not English. In the current code, the output of the text_to_sequence function differs from the MFA output based on german-lexicon.txt. To resolve this, you have to make the outputs of the two match. This is also important at inference time, where we will use the same function in /text.
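
A hedged way to verify the match for one utterance (assumes the `tgt` package; the TextGrid path is illustrative, and text_to_sequence is the function from /text):

```python
import tgt
from text import text_to_sequence

tg = tgt.io.read_textgrid("preprocessed_data/mls/TextGrid/speaker/utt.TextGrid")
mfa_phones = [iv.text for iv in tg.get_tier_by_name("phones").intervals]
seq = text_to_sequence("{" + " ".join(mfa_phones) + "}", [])
# These two lengths must be equal; if seq is shorter, some phones from your
# lexicon are missing from the symbol set and get dropped silently.
print(len(mfa_phones), len(seq))
```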

@mmgn123

mmgn123 commented Feb 9, 2022

I checked the output of the text_to_sequence function and found that some parts of the sentence were not converted to phonemes correctly, as in this example:

```
wie schon die während der letzten krise mehrfach vorgekommenen versuche bedrängter italischer parteichefs daselbst sich festzusetzen hinreichend bewiesen
{V IIH SH OOH N D IIH V EHH RR AX N T D EH EX L EH TS T AX N K RR IIH Z AX M EEH EX F AH X spn F EH EX Z UUH X AX spn spn spn D AAH Z EH L P S T Z IH CC spn HH IH N RR AY CC AX N T B AX V IIH Z AX N}
[143, 132, 119, 90, 143, 119, 133, 90, 92, 117, 92, 133, 119, 116, 146, 118, 104, 72, 358, 104, 92, 146, 358, 358, 358, 90, 146, 92, 117, 129, 131, 133, 146, 107, 358, 106, 107, 119, 84, 119, 133, 88, 143, 146, 119]
```

Could this be the reason for the tensor mismatch problem?
Thank you very much for your help!

@keonlee9420
Owner

Exactly. The missing phonemes must also be missing here, which is the part you must modify for your language. Again, you need to make sure that the output of the text_to_sequence function always matches the TextGrid's phoneme sequence (the MFA lexicon).
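
As a sketch, the modification would look roughly like this (hedged: the file layout follows the FastSpeech2-style /text package, and the German entries below are read off your alignment above, so they are not exhaustive):

```python
# text/cmudict.py: the phone inventory must cover every symbol the MFA
# lexicon emits; unknown phones are silently dropped by text_to_sequence.
valid_symbols = [
    # ... keep the existing English (ARPAbet) entries ...
    "AA", "AE", "AH", "AO", "AW", "AY",
    # German additions (illustrative, NOT exhaustive):
    "IIH", "OOH", "EEH", "EHH", "AAH", "UUH", "RR", "CC", "AX", "EX", "TS",
]
# text/symbols.py then picks these up as "@"-prefixed phone symbols:
#   _arpabet = ["@" + s for s in cmudict.valid_symbols]
```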

@mmgn123

mmgn123 commented Feb 9, 2022

That's it! I didn't change the valid_symbols set!
Thank you very much for your timely replies, help, and support!

@keonlee9420
Owner

Great! @MingjieChen @Yaccoub, I hope this can help you too.

@keonlee9420
Owner

FYI, I updated the MFA description in README.md.

@MingjieChen
Author

Hello, I am using LibriTTS, but I am not sure whether my error is also caused by the missing-phoneme problem. I will take a look.
