The quality of longer wavs generated by hn-sinc-nsf worsens over time. #2
Comments
Hi, thanks for reporting this issue. Since the pitch sounds OK to me, the problem is likely in the filter part. The BLSTM layer is not time-length agnostic: it is not guaranteed to produce the same output when the input data is embedded in a longer input sequence. Furthermore, the training data in 00_demo.sh is mainly CMU-arctic, which contains short utterances (shorter than 1 minute). I would suggest removing the BLSTM layer in the condition module, or replacing it with one conv layer, for example.
There is no reason to stick to BLSTM. I will do a quick experiment to test it. Apologies if this caused trouble in your application.
I looked into the code, and I think I found the issue(s).

1. Numerical problem with torch.cumsum
```python
# To prevent torch.cumsum numerical overflow,
# it is necessary to add -1 whenever \sum_k=1^n rad_value_k > 1.
# Buffer tmp_over_one_idx indicates the time step to add -1.
# This will not change F0 of sine because (x-1) * 2*pi = x * 2*pi
tmp_over_one = torch.cumsum(rad_values, 1) % 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] -
                    tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
sines = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1)
                  * 2 * np.pi)
```
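Why subtracting 1 at the wrap points is safe can be checked numerically: the accumulated shift is always an integer number of periods, so the sine is unchanged, while the accumulated phase stays bounded. Below is a minimal NumPy sketch, simplified to a 1-D sequence; the constant increment 0.01 is a made-up example value, not data from the repo:

```python
import numpy as np

# rad_values are per-step phase increments (f0 / sampling_rate);
# a constant 0.01 models a steady tone for illustration.
rad_values = np.full(1000, 0.01)

# Naive phase accumulates without bound: for long signals this loses
# float precision and the generated sine drifts.
naive_phase = np.cumsum(rad_values)

# Wrapped phase: subtract 1 at every step where the fractional cumsum
# wraps past 1, so the accumulated value stays small.
tmp_over_one = np.cumsum(rad_values) % 1
tmp_over_one_idx = (tmp_over_one[1:] - tmp_over_one[:-1]) < 0
cumsum_shift = np.zeros_like(rad_values)
cumsum_shift[1:] = tmp_over_one_idx * -1.0
wrapped_phase = np.cumsum(rad_values + cumsum_shift)

# The shift is an integer count of periods, so the sines are identical,
# but wrapped_phase never grows beyond ~1 while naive_phase reaches 10.
sines_naive = np.sin(2 * np.pi * naive_phase)
sines_wrapped = np.sin(2 * np.pi * wrapped_phase)
```

In float32 (which the model uses), the naive cumsum of a long utterance reaches values where the spacing between representable floats is no longer small relative to one period, which is exactly the degradation heard over time.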
The same modification has been applied to hn-nsf/model.py in the branch https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/newfunctions.
cyc-noise-NSF does not need this update because it already uses a similar method when aligning the generated pulse train.
project-NN-Pytorch-scripts/project/hn-sinc-nsf-9/model.py
Lines 500 to 512 in 09ef1ba
```python
tmp_cumsum = torch.cumsum(rad_values, dim=1)
# different batches need to be processed differently
for idx in range(f0_values.shape[0]):
    temp_sum = tmp_cumsum[idx, u_loc[idx, :, 0], :]
    temp_sum[1:, :] = temp_sum[1:, :] - temp_sum[0:-1, :]
    # stores the accumulation of i.phase within
    # each voiced segment
    tmp_cumsum[idx, :, :] = 0
    tmp_cumsum[idx, u_loc[idx, :, 0], :] = temp_sum
# rad_values - tmp_cumsum: remove the accumulation of i.phase
# within the previous voiced segment
i_phase = torch.cumsum(rad_values - tmp_cumsum, dim=1)
```
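The per-segment reset trick can be illustrated with a simplified 1-D NumPy sketch. The increments and segment-start indices below are made up for illustration; the real code operates on batched 3-D tensors with u_loc derived from the voicing decision:

```python
import numpy as np

# Made-up per-frame phase increments and voiced-segment start frames.
rad = np.full(10, 0.25)
starts = np.array([0, 4, 7])

# Phase accumulated up to each segment start, converted to per-segment
# increments, then placed back at the start frames as a correction term.
tmp_cumsum = np.cumsum(rad)
seg = tmp_cumsum[starts].copy()
seg[1:] = seg[1:] - seg[:-1]
correction = np.zeros_like(rad)
correction[starts] = seg

# cumsum of (rad - correction) restarts from zero at every segment start,
# so phase never accumulates across voiced segments.
i_phase = np.cumsum(rad - correction)
print(i_phase)  # [0. 0.25 0.5 0.75 0. 0.25 0.5 0. 0.25 0.5]
```

Because the accumulation is discarded at each segment boundary, the cumsum never grows with utterance length, which is why this code path did not need the overflow fix.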
Note
There is no need to re-train the model if the training data sequences were truncated into short segments during training (see truncate_seq in config.py). Therefore, please just:
- update model.py
- run generation again
I attached two samples from the pre-trained hn-sinc-nsf-9. The file long-4-bad.wav was generated before revising model.py; long-4.wav after the modification.
https://www.dropbox.com/sh/gf3zp00qvdp3row/AACrDZEiUlFQzmBF0fFCEfGRa/temp/issue-long-wav-202008?dl=0&subfolder_nav_tracking=1
2. BLSTM in condition module
This is not critical for CMU-arctic: when multiple utterances are concatenated, the silence between them effectively tells the BLSTM to reset. You can fix issue 1 and use the trained NSF model to generate long utterances.
Why
BLSTM is not time-length agnostic: training data are short trimmed segments, while generation data can be very long utterances.
How to
Replace the BLSTM in the condition module with CNN layers. I added hn-sinc-nsf-10 as an example.
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/blob/newfunctions/project/hn-sinc-nsf-10/model.py
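The motivation can be sketched with a hypothetical minimal condition module (not the actual hn-sinc-nsf-10 code; layer count, kernel width, and activation below are made-up choices). A conv stack has a fixed receptive field, so a frame's output depends only on a bounded local context, and embedding a sequence inside a longer one leaves its interior outputs unchanged:

```python
import torch
import torch.nn as nn

class ConvCondition(nn.Module):
    """Hypothetical BLSTM replacement: stacked 1-D convolutions."""
    def __init__(self, in_dim, out_dim, kernel_size=3, num_layers=2):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(dim, out_dim, kernel_size,
                                 padding=kernel_size // 2),
                       nn.Tanh()]
            dim = out_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, time, in_dim) -> (batch, time, out_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

torch.manual_seed(0)
cond = ConvCondition(80, 64)
short = torch.randn(1, 100, 80)
# the same 100 frames embedded at the start of a longer sequence
long_seq = torch.cat([short, torch.randn(1, 400, 80)], dim=1)
with torch.no_grad():
    out_short = cond(short)
    out_long = cond(long_seq)[:, :100, :]
# apart from edge effects at the end of the short sequence (receptive
# field radius is 2 frames here), the outputs are identical
```

A BLSTM run on the same two inputs would generally not satisfy this property, since its hidden state at each frame depends on the entire sequence in both directions.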
Note
No need to change 00_demo.sh and the model.py for CMU-arctic.
3. Concatenation of acoustic features
This is the least important one. If the data is well prepared, there is no need to worry about this issue.
Why
When experimenting on long utterances, we may concatenate the Mel-spectrograms or F0 of multiple utterances into a single file.
However, the Mel-spectrogram and F0 of the same utterance may have different lengths depending on the extraction tools. If we directly concatenate the features, the length mismatch accumulates, and the Mel-spectrogram and F0 will no longer be well aligned:
```
Mel_1.shape: [0, 100]    F0_1.shape: [0, 99]
Mel_2.shape: [0, 110]    F0_2.shape: [0, 108]
cat(Mel_1, Mel_2).shape: [0, 210]
cat(F0_1, F0_2).shape:   [0, 207]
```
How to
Do manual trimming before concatenation:
```
Mel_1 <- Mel_1[0, 0:99]
Mel_2 <- Mel_2[0, 0:108]
(Mel_1 + Mel_2).shape: [0, 207]
(F0_1 + F0_2).shape:   [0, 207]
```
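In code, the trimming amounts to cutting each utterance's feature pair to its common length before concatenating. A minimal NumPy sketch, with time on axis 0, the frame counts from the example above, and a hypothetical 80-bin mel dimension:

```python
import numpy as np

# Per-utterance features with slightly mismatched frame counts
# (80 mel bins is an assumption for illustration).
mels = [np.zeros((100, 80)), np.zeros((110, 80))]
f0s = [np.zeros(99), np.zeros(108)]

# Trim each mel/F0 pair to its common length, then concatenate:
# the mismatch no longer accumulates across utterances.
trimmed = [(m[:min(len(m), len(f))], f[:min(len(m), len(f))])
           for m, f in zip(mels, f0s)]
mel_cat = np.concatenate([m for m, _ in trimmed], axis=0)
f0_cat = np.concatenate([f for _, f in trimmed], axis=0)
print(mel_cat.shape[0], f0_cat.shape[0])  # 207 207
```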
Note
This is not a problem for a single utterance, because the default data IO will trim the longer feature sequence:
Mel_1 <- Mel_1[0, 0:99]
4. Summary
The first issue is the most important. Please try the updated model.py.
For the default 00_demo.sh on CMU-arctic, it is OK to use the pre-trained model to generate very long utterances. For models trained on other corpora, the BLSTM may become an issue.
Depending on the way the input features for long utterances are prepared, the third issue with feature concatenation may also matter.
Thank you very much for your rapid response. I changed SineGen._f02sine() as you did, and I confirmed the problem is solved. I have tested hn-sinc-nsf only, but even for utterances as long as 3 minutes, everything seems to be fine with the fix for issue 1 alone. I attached two links to examples about 3 minutes long: [1] the upper link was generated before the fix of model.py, and [2] the lower link after the fix.
Because the problem is solved for longer utterances, I have not tried replacing the BLSTM. I also prepared the acoustic features by extracting them from the concatenated wav, not by concatenating separately extracted features, so I believe my data is free from the accumulated length mismatch. I am very grateful for your speedy and correct solution. Thank you very much.
Great. Yes, the first issue is the most important.
Now pushed to master. 6ff3a78
Hello. I trained the hn-sinc-nsf model by running project/hn-sinc-nsf-9/00_demo.sh, and each generated WAV listed in test_list in config.py sounds good. But when I synthesize a longer WAV, the quality of the generated sound worsens over time.
For example, [1] I prepared a longer WAV file with a length of 1:52 by concatenating all WAVs listed in test_list three times in random order. I extracted its acoustic features and synthesized a WAV file with the trained hn-sinc-nsf model using these features.
From 1:41 to 1:47, this big WAV contains the data of slt_arctic_b0474 and slt_arctic_b0476, but their sound quality is inferior to that of the [2][3] separately generated WAVs.
This phenomenon occurs with the default 00_demo.sh settings (acoustic features: mel-spectrogram, F0), with other acoustic features (mel-generalized cepstrum, band aperiodicity, F0), and with another data-set (the NIT-SONG070 singing voice data-set provided on the HTS webpage).
Could anyone please advise me how to avoid this trouble?