How to generate with frame shift 12.5ms #4

Chunhui-Lu · 2020-12-14T06:10:56Z

Hi, I am running an experiment (cyc-noise-nsf-4) with a 12.5 ms frame shift and a 50 ms frame length (to match the Tacotron configuration).
I only changed input_reso = [200, 200] in config.py, along with the corresponding arguments used to extract the mel spectrogram and F0.
However, the F0 of the synthesized audio looks discontinuous.
Can you help me?
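
For readers reproducing this setup, the arithmetic behind the setting can be sketched as follows. This is a minimal illustration, not code from the repository. It assumes a 16 kHz sampling rate (the rate at which 12.5 ms equals 200 samples) and that input_reso holds one frame shift, in waveform samples, per input feature (mel and F0). The helper name frame_params is hypothetical.

```python
def frame_params(sample_rate_hz, shift_ms, length_ms):
    """Convert frame shift and frame length from ms to waveform samples."""
    shift = sample_rate_hz * shift_ms / 1000.0
    length = sample_rate_hz * length_ms / 1000.0
    # Reject settings that do not map to a whole number of samples.
    if shift != int(shift) or length != int(length):
        raise ValueError("frame shift/length is not a whole number of samples")
    return int(shift), int(length)


# Assumed 16 kHz audio; 12.5 ms shift and 50 ms length as in the post above.
hop, win = frame_params(16000, 12.5, 50.0)
print(hop, win)  # 200 800

# Assumption: one entry per input feature (mel, F0), both at the same shift.
input_reso = [hop, hop]
```

The same shift (200 samples) and length (800 samples) would need to be used when extracting both the mel spectrogram and F0, so that every input feature has the same number of frames per utterance.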

TonyWangX · 2020-12-14T08:46:11Z

Thanks for trying the code.

Regarding "audio looks discontinuous", I am sorry that I cannot tell much from the spectrogram you showed. Could you please provide more information:

What kind of discontinuity did you perceive? Is it a voicing error within a sound (i.e., voiced sound -> unvoiced sound), or does it occur during the transition from one sound to another?
How frequently does it happen? Only in this utterance, or in every generated utterance?
If it is allowed, please attach a few audio samples here. Alternatively, you can send the audio files to me by email.
If it is allowed, you may also send me the input feature files and the trained model (trained_network.pt, config.py, model.py).

With the audio samples, I can probably identify the issue. With the input features and the trained model, I may be able to give a better answer.

Thanks in advance.

TonyWangX · 2020-12-17T09:09:31Z

@Chunhui-Lu

Chunhui-Lu · 2020-12-18T02:48:21Z

Hi, thanks for your reply.
I found that the pitch sequences I extracted were discontinuous.
When I extracted the pitch sequences again using pyworld, everything was OK.

This is the pitch sequence of a song segment extracted with amfm_decompy.pYAAPT. You can see that it is discontinuous:
219.178085
219.178085
219.178085
216.216217
213.333328
210.526321
207.792206
207.792206
207.792206
207.792206
207.792206
207.792206
205.128204
205.128204
205.128204
205.128204
205.128204
205.128204
207.792206
207.792206
207.792206
207.792206
207.792206
207.792206
207.792206
207.792206
207.792206
210.526321
210.526321
210.526321
213.333328
213.333328
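
One possible reading of the listing above, not confirmed anywhere in the thread: the values move in steps and hold for several frames, and each distinct value is very close to 16000 divided by an integer. The sketch below checks this under the assumption of a 16 kHz sampling rate, which the thread does not state. If that assumption holds, the staircase would be consistent with F0 values limited to whole-sample pitch periods. The name ASSUMED_SR is introduced only for illustration.

```python
# Distinct F0 values (Hz) taken from the listing above.
f0_hz = [219.178085, 216.216217, 213.333328, 210.526321, 207.792206, 205.128204]

ASSUMED_SR = 16000  # assumption: the sampling rate is not given in the thread


def implied_period(f0, sr=ASSUMED_SR):
    """Pitch period in samples implied by an F0 value at sampling rate sr."""
    return sr / f0


periods = [implied_period(f) for f in f0_hz]
for f, p in zip(f0_hz, periods):
    # Show how close each implied period is to a whole number of samples.
    print(f"{f:.6f} Hz -> {p:.4f} samples (nearest integer: {round(p)})")
```

Running the same check on the pyworld output would show whether its F0 values fall between these steps.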

TonyWangX · 2020-12-18T08:00:46Z

@Chunhui-Lu Thanks for the reply. It is good to know that.

Yes, no F0 extractor is guaranteed to work in all cases.
I am afraid that I cannot help solve it : )

I remember that I used to run multiple F0 extractors and take a vote among them ...
I don't have a more informative suggestion on this.
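
The voting scheme is not described in the thread, so the following is only one way it might look: a per-frame median over the extractors that consider the frame voiced. It assumes every track uses the same frame shift and the same number of frames, and that unvoiced frames are marked with 0. The function name vote_f0 and the min_voiced parameter are hypothetical.

```python
from statistics import median


def vote_f0(tracks, min_voiced=2):
    """Merge several F0 tracks frame by frame.

    A frame is kept as voiced only if at least min_voiced extractors report
    F0 > 0; its value is then the median of those voiced estimates.
    """
    n_frames = len(tracks[0])
    if any(len(t) != n_frames for t in tracks):
        raise ValueError("all F0 tracks must have the same number of frames")
    merged = []
    for frame in zip(*tracks):
        voiced = [v for v in frame if v > 0]
        merged.append(median(voiced) if len(voiced) >= min_voiced else 0.0)
    return merged


# Toy tracks from three hypothetical extractors (Hz, 0 = unvoiced).
a = [0.0, 210.0, 212.0, 105.0, 0.0]   # contains an octave error at frame 3
b = [0.0, 211.0, 213.0, 214.0, 215.0]
c = [0.0, 209.0, 0.0, 215.0, 0.0]
print(vote_f0([a, b, c]))  # [0.0, 210.0, 212.5, 214.0, 0.0]
```

The median discards a single outlier such as the halved value in track a, and the min_voiced threshold turns the voicing decision into a majority vote.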

kikirizki · 2021-06-07T01:26:25Z

> Hi, I am running an experiment (cyc-noise-nsf-4) with a 12.5 ms frame shift and a 50 ms frame length (to match the Tacotron configuration).
> I only changed input_reso = [200, 200] in config.py, along with the corresponding arguments used to extract the mel spectrogram and F0.
> However, the F0 of the synthesized audio looks discontinuous.
> Can you help me?

Hi, did you retrain cyc-noise-nsf-4 with a 12.5 ms frame shift? If so, would you mind sharing the trained model? I am trying to test tacotron2 + cyc-noise-nsf-4.

TonyWangX closed this as completed Dec 26, 2020
