
The quality of longer wavs generated by hn-sinc-nsf worsens over time. #2

Closed

taroushirani opened this issue Jul 31, 2020 · 5 comments

taroushirani commented Jul 31, 2020

Hello. I trained the hn-sinc-nsf model by running project/hn-sinc-nsf-9/00_demo.sh, and each generated WAV listed in test_list in config.py sounds good. But when I synthesize a longer WAV, the quality of the generated sound worsens over time.

For example, I prepared a longer WAV file with a length of 1:52 by concatenating all the WAVs listed in test_list three times in random order [1]. I extracted its acoustic features and synthesized a WAV file with the trained hn-sinc-nsf model using these features.

From 1:41 to 1:47, this long WAV contains the data of slt_arctic_b0474 and slt_arctic_b0476, but their sound quality is inferior to that of the separately generated WAVs [2][3].

This phenomenon occurs with the default 00_demo.sh settings (acoustic features: mel-spectrogram, F0), with other acoustic features (mel-generalized cepstrum, band aperiodicity, F0), and with another data set (the NIT-SONG070 singing-voice data set provided on the HTS webpage).

Could anyone please advise me on how to avoid this trouble?

  1. https://drive.google.com/file/d/1konFc3QtgTNUhCUOgGULDRJjfM44Zkvl/view?usp=sharing
  2. https://drive.google.com/file/d/1dYDvsoGKZgmEl7BNlkX1HRuQzEL5Vf0m/view?usp=sharing
  3. https://drive.google.com/file/d/1ctLLzYoOn1t5lGRwtu79RIYAE3s3s3xJ/view?usp=sharing
TonyWangX (Member) commented Aug 1, 2020

Hi, thanks for reporting this issue.
Here are my thoughts before I check the code:

Since the pitch sounds OK to me, the problem is probably in the filter part.
Since the convolution and linear layers are time-length-agnostic, the first suspect that comes to my mind is the BLSTM layer in the condition module.

The BLSTM layer is not time-length-agnostic: it is not guaranteed to produce the same output when the input data is embedded in a longer input sequence. Furthermore, the training data in 00_demo.sh is mainly CMU-arctic, which contains only short utterances (all shorter than 1 minute).
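
(For illustration, a minimal sketch that is not from this repository: it uses a plain torch.nn.LSTM as a stand-in for BLSTMLayer and made-up shapes, and shows that a bidirectional layer changes its output for the same frames once they are embedded in a longer sequence, while a convolution does not, away from the boundary.)

```python
import torch

torch.manual_seed(0)
blstm = torch.nn.LSTM(10, 8, bidirectional=True, batch_first=True)
conv = torch.nn.Conv1d(10, 8, kernel_size=3, padding=1)

short = torch.randn(1, 100, 10)            # one short utterance
longer = torch.cat([short, short], dim=1)  # the same frames inside a longer sequence

out_short, _ = blstm(short)
out_long, _ = blstm(longer)
# False: the backward direction also sees the appended future frames
print(torch.allclose(out_short, out_long[:, :100, :]))

conv_short = conv(short.transpose(1, 2))
conv_long = conv(longer.transpose(1, 2))
# True: each conv output depends only on a fixed local window
# (the last frame is excluded because its padding differs)
print(torch.allclose(conv_short[..., :99], conv_long[..., :99]))
```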

I would suggest removing the BLSTM layer in the condition module, or replacing it with one conv layer, for example:

```python
# before: BLSTM layer in the condition module
self.l_blstm = BLSTMLayer(input_dim, self.blstm_s)
# after: one conv layer instead
self.l_conv1d_pre = Conv1dKeepLength(input_dim, self.blstm_s,
                                     dilation_s=1, kernel_s=self.cnn_kernel_s)

# before
tmp = self.l_upsamp(self.l_conv1d(self.l_blstm(feature)))
# after
tmp = self.l_upsamp(self.l_conv1d(self.l_conv1d_pre(feature)))
```

There is no reason to stick to the BLSTM; I used it mainly to stay consistent with the older models in my experiments.

I will do a quick experiment to test it.

Apologies if this has caused trouble for your application.

TonyWangX (Member) commented Aug 2, 2020

I looked into the code, and I think I have found the issue(s).

1. Numerical problem with torch.cumsum() in the sine generator

This is the main issue.

Why

When generating the sine waveform, the phase is accumulated by torch.cumsum(f0/sampling_rate * 2 * np.pi). When the utterance is too long, for example 1 minute, the accumulated value becomes very large, and torch.cumsum() becomes numerically inaccurate at that magnitude. The generated sine turns out distorted, especially for the segments near the end of long utterances.
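
(A rough illustration of the scale of the problem; this snippet is not from the repository and assumes a 24 kHz sampling rate and a constant 200 Hz F0, comparing a float32 cumulative phase against a float64 reference.)

```python
import torch

fs, f0 = 24000, 200.0
n = fs * 60                                  # one minute of samples
rad = torch.full((n,), f0 / fs)              # per-sample phase increment, float32
phase32 = torch.cumsum(rad, dim=0)
phase64 = torch.cumsum(rad.double(), dim=0)  # high-precision reference
# After one minute the accumulated phase reaches ~12000 cycles; near that
# magnitude the float32 grid spacing is ~1e-3 cycles, so rounding errors
# accumulate and the sine phase drifts.
print((phase32.double() - phase64).abs().max())
```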

How to

The solution is to do phase wrapping over f0/sampling_rate. Because sin((f0/sampling_rate - K) * 2 * np.pi) = sin((f0/sampling_rate) * 2 * np.pi - 2 * K * np.pi) = sin((f0/sampling_rate) * 2 * np.pi) for any integer K, we can wrap the accumulated value of f0/sampling_rate to keep it within (0, 1).

In my CUDA implementation of NSF, the cumsum is a for loop, and I embedded a similar wrapping function there:
https://github.com/nii-yamagishilab/project-CURRENNT-public/blob/69ce115688b2686810a43cbb27d1daf699e2de15/CURRENNT_codes/currennt_lib/src/layers/SignalGenLayer.cu#L252

In the PyTorch implementation, there is no for loop. Therefore, the fix is to add -1.0 to f0/sampling_rate at the time steps where the accumulated phase crosses 1.0.

The code is here:

```python
# To prevent numerical inaccuracy in torch.cumsum at large values,
# add -1 to rad_values whenever \sum_{k=1}^{n} rad_value_k crosses 1.
# The buffer tmp_over_one_idx marks the time steps where -1 is added.
# This does not change the F0 of the sine, because
# sin((x - 1) * 2 * pi) = sin(x * 2 * pi).
tmp_over_one = torch.cumsum(rad_values, 1) % 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] -
                    tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
sines = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1)
                  * 2 * np.pi)
```
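
(A quick sanity check of the trick above, with a toy rad_values tensor of assumed shape [batch, time, 1]: the shifted cumulative sum stays bounded near 1.0 instead of growing with the utterance length.)

```python
import torch

rad_values = torch.rand(1, 100000, 1) * 0.02   # toy f0/sampling_rate values
tmp_over_one = torch.cumsum(rad_values, 1) % 1
tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
cumsum_shift = torch.zeros_like(rad_values)
cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0

print(torch.cumsum(rad_values + cumsum_shift, dim=1).max())  # stays near 1.0
print(torch.cumsum(rad_values, dim=1).max())                 # grows to ~1000
```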

The same modification has been applied to hn-nsf/model.py in the branch https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/newfunctions.

cyc-noise-NSF doesn't need to be updated because it already uses a similar method when aligning the generated pulse train:

```python
tmp_cumsum = torch.cumsum(rad_values, dim=1)
# each item in the batch needs to be processed separately
for idx in range(f0_values.shape[0]):
    temp_sum = tmp_cumsum[idx, u_loc[idx, :, 0], :]
    temp_sum[1:, :] = temp_sum[1:, :] - temp_sum[0:-1, :]
    # temp_sum stores the accumulated instantaneous phase
    # within each voiced segment
    tmp_cumsum[idx, :, :] = 0
    tmp_cumsum[idx, u_loc[idx, :, 0], :] = temp_sum
# rad_values - tmp_cumsum: remove the instantaneous phase accumulated
# within the previous voiced segment
i_phase = torch.cumsum(rad_values - tmp_cumsum, dim=1)
```

Note

There is no need to re-train the model if the training data sequences were truncated into short segments during training (see truncate_seq in config.py). Therefore, please just:

  1. update model.py
  2. do generation

I attached two samples generated by the pre-trained hn-sinc-nsf-9: long-4-bad.wav was generated before revising model.py, and long-4.wav after the modification.
https://www.dropbox.com/sh/gf3zp00qvdp3row/AACrDZEiUlFQzmBF0fFCEfGRa/temp/issue-long-wav-202008?dl=0&subfolder_nav_tracking=1

2. BLSTM in the condition module

This is not really important for CMU-arctic. When concatenating multiple utterances, the silence between utterances tells the BLSTM to reset, so you can fix issue 1 and use the trained NSF to generate long utterances.

Why

The BLSTM is not time-length-agnostic: the training data are short trimmed segments, while the generation data can be very long utterances.

How to

Replace the BLSTM in the condition module with CNN layers. I added hn-sinc-nsf-10 as one example:
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/blob/newfunctions/project/hn-sinc-nsf-10/model.py

Note

There is no need to change 00_demo.sh or model.py for CMU-arctic.

3. Concatenation of acoustic features

This is the least important one. If the data is well prepared, there is no need to worry about this issue.

Why

During experiments on long utterances, we may concatenate the Mel-spectrograms or F0s of multiple utterances into a single file.
However, the Mel-spectrogram and F0 extracted from the same utterance may have different lengths, depending on the extraction tools. If we directly concatenate the features, the length mismatch will accumulate, and the Mel-spectrogram and F0 will not be well aligned:

```
Mel_1.shape: [0, 100]
F0_1.shape:  [0, 99]
Mel_2.shape: [0, 110]
F0_2.shape:  [0, 108]
cat(Mel_1, Mel_2).shape: [0, 210]
cat(F0_1, F0_2).shape:   [0, 207]
```

How to

Do manual trimming before concatenation:

```
Mel_1 <- Mel_1[0, 0:99]
Mel_2 <- Mel_2[0, 0:108]
cat(Mel_1, Mel_2).shape: [0, 207]
cat(F0_1, F0_2).shape:   [0, 207]
```
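
(A minimal numpy sketch of this trimming; trim_and_concat is a hypothetical helper, and the arrays are assumed to be shaped [frames, dims].)

```python
import numpy as np

def trim_and_concat(mels, f0s):
    """Trim each utterance's mel/F0 to their common length, then concatenate."""
    mel_out, f0_out = [], []
    for mel, f0 in zip(mels, f0s):
        n = min(mel.shape[0], f0.shape[0])  # common frame count per utterance
        mel_out.append(mel[:n])
        f0_out.append(f0[:n])
    return np.concatenate(mel_out), np.concatenate(f0_out)

mels = [np.zeros((100, 80)), np.zeros((110, 80))]
f0s = [np.zeros((99, 1)), np.zeros((108, 1))]
mel, f0 = trim_and_concat(mels, f0s)
print(mel.shape, f0.shape)  # (207, 80) (207, 1)
```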

Note

This is not a problem for a single utterance because the default data IO will trim the longer feature sequence:

```
Mel_1 <- Mel_1[0, 0:99]
```

4. Summary

The first issue is the most important. Please try the updated model.py.

For the default 00_demo.sh on CMU-arctic, it is OK to use the pre-trained model to generate very long utterances. For models trained on other corpora, the BLSTM may become an issue.

Depending on how the input features for long utterances are prepared, the third issue with feature concatenation may also matter.

taroushirani (Author) commented Aug 2, 2020

Thank you very much for your rapid response. I changed SineGen._f02sine() as you did and confirmed that the problem is solved. I have tested hn-sinc-nsf only, but even when the utterance is as long as 3 minutes or so, everything seems fine with only the fix for issue 1.

I attached two links to examples whose length is about 3 minutes: the upper link [1] was generated before the fix of model.py, and the lower link [2] after the fix.

  1. https://soundcloud.com/user-883019797/nnsvs_nit_song070_01_svs_nsf_mimirobo_from_scratch_1
  2. https://soundcloud.com/user-883019797/mimirobo

Because the problem is solved for longer utterances, I have not tried replacing the BLSTM. I also prepared the acoustic features by extracting them from the concatenated WAV rather than concatenating separately extracted features, so I believe my data is free from the accumulated length mismatch.

I am very grateful for your speedy and correct solution. Thank you very much.

TonyWangX (Member) commented

Great. Yes, the first issue is the most important.
I will close the issue and merge the fix into the master branch.

TonyWangX (Member) commented Aug 9, 2020

Now pushed to master.
It should have no impact on model training because most of the data sequences in one batch are not as long as 1 minute.

6ff3a78
(Please ignore the incorrect "dilated BLSTM" comment.)
