
The quality of longer wavs generated by hn-sinc-nsf worsens over time. #2

Closed

taroushirani opened this issue Jul 31, 2020 · 5 comments

taroushirani commented Jul 31, 2020

Hello. I trained the hn-sinc-nsf model by running project/hn-sinc-nsf-9/00_demo.sh, and each generated WAV listed in test_list in config.py sounds good. But when I synthesize a longer WAV, the quality of the generated sound worsens over time.

For example, I prepared a longer WAV file with a length of 1:52 by concatenating all the WAVs listed in test_list three times in random order [1]. I extracted its acoustic features and synthesized a WAV file with the trained hn-sinc-nsf model using these features.

From 1:41 to 1:47, this long WAV contains the data of slt_arctic_b0474 and slt_arctic_b0476, but their sound quality is inferior to that of the separately generated WAVs [2][3].

This phenomenon occurs with the default 00_demo.sh settings (acoustic features: mel-spectrogram, F0), with other acoustic features (mel-generalized cepstrum, band aperiodicity, F0), and with another data set (the NIT-SONG070 singing-voice data set provided on the HTS webpage).

Could anyone please advise me on how to avoid this trouble?

  1. https://drive.google.com/file/d/1konFc3QtgTNUhCUOgGULDRJjfM44Zkvl/view?usp=sharing
  2. https://drive.google.com/file/d/1dYDvsoGKZgmEl7BNlkX1HRuQzEL5Vf0m/view?usp=sharing
  3. https://drive.google.com/file/d/1ctLLzYoOn1t5lGRwtu79RIYAE3s3s3xJ/view?usp=sharing
TonyWangX (Member) commented Aug 1, 2020

Hi, thanks for reporting this issue.
Here are my thoughts before I check the code:

Since the pitch sounds OK to me, the problem is probably in the filter part.
Since the convolution and linear layers are time-length-agnostic, the first suspect that comes to my mind is the BLSTM layer in the condition module.

The BLSTM layer is not time-length-agnostic: it is not guaranteed to produce the same output when the input data is embedded in a longer input sequence. Furthermore, the training data in 00_demo.sh is mainly CMU-arctic, which contains only short utterances (all shorter than 1 minute).
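
(For illustration, a minimal sketch that is not from this repository: it uses a plain torch.nn.LSTM as a stand-in for BLSTMLayer and made-up shapes, and shows that a bidirectional layer changes its output for the same frames once they are embedded in a longer sequence, while a convolution does not, away from the boundary.)

```python
import torch

torch.manual_seed(0)
blstm = torch.nn.LSTM(10, 8, bidirectional=True, batch_first=True)
conv = torch.nn.Conv1d(10, 8, kernel_size=3, padding=1)

short = torch.randn(1, 100, 10)            # one short utterance
longer = torch.cat([short, short], dim=1)  # the same frames inside a longer sequence

out_short, _ = blstm(short)
out_long, _ = blstm(longer)
# False: the backward direction also sees the appended future frames
print(torch.allclose(out_short, out_long[:, :100, :]))

conv_short = conv(short.transpose(1, 2))
conv_long = conv(longer.transpose(1, 2))
# True: each conv output depends only on a fixed local window
# (the last frame is excluded because its padding differs)
print(torch.allclose(conv_short[..., :99], conv_long[..., :99]))
```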

I would suggest removing the BLSTM layer in the condition module, or replacing it with one conv layer, for example:

```python
# before: BLSTM layer in the condition module
self.l_blstm = BLSTMLayer(input_dim, self.blstm_s)
# after: one conv layer instead
self.l_conv1d_pre = Conv1dKeepLength(input_dim, self.blstm_s,
                                     dilation_s=1, kernel_s=self.cnn_kernel_s)

# before
tmp = self.l_upsamp(self.l_conv1d(self.l_blstm(feature)))
# after
tmp = self.l_upsamp(self.l_conv1d(self.l_conv1d_pre(feature)))
```

There is no reason to stick to the BLSTM; I used it mainly to stay consistent with the older models in my experiments.

I will do a quick experiment to test it.

Apologies if this has caused trouble for your application.

TonyWangX (Member) commented Aug 2, 2020

I looked into the code, and I think I have found the issue(s).

1. Numerical problem with torch.cumsum() in the sine generator

This is the main issue.

Why

When generating the sine waveform, the phase is accumulated by torch.cumsum(f0/sampling_rate * 2 * np.pi). When the utterance is too long, for example 1 minute, the accumulated value becomes very large, and torch.cumsum() becomes numerically inaccurate at that magnitude. The generated sine turns out distorted, especially for the segments near the end of long utterances.
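
(A rough illustration of the scale of the problem; this snippet is not from the repository and assumes a 24 kHz sampling rate and a constant 200 Hz F0, comparing a float32 cumulative phase against a float64 reference.)

```python
import torch

fs, f0 = 24000, 200.0
n = fs * 60                                  # one minute of samples
rad = torch.full((n,), f0 / fs)              # per-sample phase increment, float32
phase32 = torch.cumsum(rad, dim=0)
phase64 = torch.cumsum(rad.double(), dim=0)  # high-precision reference
# After one minute the accumulated phase reaches ~12000 cycles; near that
# magnitude the float32 grid spacing is ~1e-3 cycles, so rounding errors
# accumulate and the sine phase drifts.
print((phase32.double() - phase64).abs().max())
```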

How to

The solution is to do phase wrapping over f0/sampling_rate. Because sin((f0/sampling_rate - K) * 2 * np.pi) = sin((f0/sampling_rate) * 2 * np.pi - 2 * K * np.pi) = sin((f0/sampling_rate) * 2 * np.pi) for any integer K, we can wrap the accumulated value of f0/sampling_rate to keep it within (0, 1).

In my CUDA implementation of NSF, the cumsum is a for loop, and I embedded a similar wrapping function there:
https://github.com/nii-yamagishilab/project-CURRENNT-public/blob/69ce115688b2686810a43cbb27d1daf699e2de15/CURRENNT_codes/currennt_lib/src/layers/SignalGenLayer.cu#L252

In the PyTorch implementation, there is no for loop. Therefore, the fix is to add -1.0 to f0/sampling_rate at the time steps where the accumulated phase crosses 1.0.

The code is here:

```python
# To prevent numerical inaccuracy in torch.cumsum at large values,
# add -1 to rad_values whenever \sum_{k=1}^{n} rad_value_k crosses 1.
# The buffer tmp_over_one_idx marks the time steps where -1 is added.
# This does not change the F0 of the sine, because
# sin((x - 1) * 2 * pi) = sin(x * 2 * pi).
tmp_over_one = torch.cumsum(rad_values, 1) % 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] -
                    tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
sines = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1)
                  * 2 * np.pi)
```
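
(A quick sanity check of the trick above, with a toy rad_values tensor of assumed shape [batch, time, 1]: the shifted cumulative sum stays bounded near 1.0 instead of growing with the utterance length.)

```python
import torch

rad_values = torch.rand(1, 100000, 1) * 0.02   # toy f0/sampling_rate values
tmp_over_one = torch.cumsum(rad_values, 1) % 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0

print(torch.cumsum(rad_values + cumsum_shift, dim=1).max())  # stays near 1.0
print(torch.cumsum(rad_values, dim=1).max())                 # grows to ~1000
```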

The same modification has been applied to hn-nsf/model.py in the branch https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/newfunctions.

cyc-noise-NSF doesn't need to be updated because it already uses a similar method when aligning the generated pulse train:

```python
tmp_cumsum = torch.cumsum(rad_values, dim=1)
# each item in the batch needs to be processed separately
for idx in range(f0_values.shape[0]):
    temp_sum = tmp_cumsum[idx, u_loc[idx, :, 0], :]
    temp_sum[1:, :] = temp_sum[1:, :] - temp_sum[0:-1, :]
    # temp_sum stores the accumulated instantaneous phase
    # within each voiced segment
    tmp_cumsum[idx, :, :] = 0
    tmp_cumsum[idx, u_loc[idx, :, 0], :] = temp_sum
# rad_values - tmp_cumsum: remove the instantaneous phase accumulated
# within the previous voiced segment
i_phase = torch.cumsum(rad_values - tmp_cumsum, dim=1)
```

Note

There is no need to re-train the model if the training data sequences were truncated into short segments during training (see truncate_seq in config.py). Therefore, please just:

  1. update model.py
  2. do generation

I attached two samples generated by the pre-trained hn-sinc-nsf-9: long-4-bad.wav was generated before revising model.py, and long-4.wav after the modification.
https://www.dropbox.com/sh/gf3zp00qvdp3row/AACrDZEiUlFQzmBF0fFCEfGRa/temp/issue-long-wav-202008?dl=0&subfolder_nav_tracking=1

2. BLSTM in the condition module

This is not really important for CMU-arctic. When concatenating multiple utterances, the silence between utterances tells the BLSTM to reset, so you can fix issue 1 and use the trained NSF to generate long utterances.

Why

The BLSTM is not time-length-agnostic: the training data are short trimmed segments, while the generation data can be very long utterances.

How to

Replace the BLSTM in the condition module with CNN layers. I added hn-sinc-nsf-10 as one example:
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/blob/newfunctions/project/hn-sinc-nsf-10/model.py

Note

There is no need to change 00_demo.sh or model.py for CMU-arctic.

3. Concatenation of acoustic features

This is the least important one. If the data is well prepared, there is no need to worry about this issue.

Why

During experiments on long utterances, we may concatenate the Mel-spectrograms or F0s of multiple utterances into a single file.
However, the Mel-spectrogram and F0 extracted from the same utterance may have different lengths, depending on the extraction tools. If we directly concatenate the features, the length mismatch will accumulate, and the Mel-spectrogram and F0 will not be well aligned:

```
Mel_1.shape: [0, 100]
F0_1.shape:  [0, 99]
Mel_2.shape: [0, 110]
F0_2.shape:  [0, 108]
cat(Mel_1, Mel_2).shape: [0, 210]
cat(F0_1, F0_2).shape:   [0, 207]
```

How to

Do manual trimming before concatenation:

```
Mel_1 <- Mel_1[0, 0:99]
Mel_2 <- Mel_2[0, 0:108]
cat(Mel_1, Mel_2).shape: [0, 207]
cat(F0_1, F0_2).shape:   [0, 207]
```
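
(A minimal numpy sketch of this trimming; trim_and_concat is a hypothetical helper, and the arrays are assumed to be shaped [frames, dims].)

```python
import numpy as np

def trim_and_concat(mels, f0s):
    """Trim each utterance's mel/F0 to their common length, then concatenate."""
    mel_out, f0_out = [], []
    for mel, f0 in zip(mels, f0s):
        n = min(mel.shape[0], f0.shape[0])  # common frame count per utterance
        mel_out.append(mel[:n])
        f0_out.append(f0[:n])
    return np.concatenate(mel_out), np.concatenate(f0_out)

mels = [np.zeros((100, 80)), np.zeros((110, 80))]
f0s = [np.zeros((99, 1)), np.zeros((108, 1))]
mel, f0 = trim_and_concat(mels, f0s)
print(mel.shape, f0.shape)  # (207, 80) (207, 1)
```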

Note

This is not a problem for a single utterance because the default data IO will trim the longer feature sequence:

```
Mel_1 <- Mel_1[0, 0:99]
```

4. Summary

The first issue is the most important. Please try the updated model.py.

For the default 00_demo.sh on CMU-arctic, it is OK to use the pre-trained model to generate very long utterances. For models trained on other corpora, the BLSTM may become an issue.

Depending on how the input features for long utterances are prepared, the third issue with feature concatenation may also matter.

taroushirani (Author) commented Aug 2, 2020

Thank you very much for your rapid response. I changed SineGen._f02sine() as you did and confirmed that the problem is solved. I have tested hn-sinc-nsf only, but even when the utterance is as long as 3 minutes or so, everything seems fine with only the fix for issue 1.

I attached two links to examples whose length is about 3 minutes: the upper link [1] was generated before the fix of model.py, and the lower link [2] after the fix.

  1. https://soundcloud.com/user-883019797/nnsvs_nit_song070_01_svs_nsf_mimirobo_from_scratch_1
  2. https://soundcloud.com/user-883019797/mimirobo

Because the problem is solved for longer utterances, I have not tried replacing the BLSTM. I also prepared the acoustic features by extracting them from the concatenated WAV rather than concatenating separately extracted features, so I believe my data is free from the accumulated length mismatch.

I am very grateful for your speedy and correct solution. Thank you very much.

TonyWangX (Member) commented

Great. Yes, the first issue is the most important.
I will close the issue and merge the fix into the master branch.

TonyWangX (Member) commented Aug 9, 2020

Now pushed to master.
It should have no impact on model training because most of the data sequences in one batch are not as long as 1 minute.

6ff3a78
(Please ignore the incorrect "dilated BLSTM" comment.)
