
Generator exploded after ~138K iters. #61

Closed
erogol opened this issue Jan 30, 2020 · 58 comments

@erogol
Contributor

erogol commented Jan 30, 2020

I observed an interesting behaviour after ~138K iters where the discriminator dominated the training and the generator's losses exploded on both train and validation. Do you have any idea why, and how to prevent it?

I am training on LJSpeech and I basically use the same learning schedule you released with the v2 config for LJSpeech (train the generator alone until 100K iters, then enable the discriminator).

Here is the tensorboard screenshot.


@kan-bayashi
Owner

Hi @erogol.
I also ran into this problem and am trying to solve it now (#54). You can see our discussion in #27 (comment).
The things I have tried so far:

  • Use a larger batch size (suggested by @Rayhane-mamah).
    But batch sizes of 10 and 20 also hit this problem.
  • Use a halved LR for the discriminator.
    This can prevent the explosion, but the discriminator still gradually becomes stronger and the resulting quality is not so good.

@erogol
Contributor Author

erogol commented Jan 31, 2020

A short literature review gave me these options, if they make any sense:

  • Label smoothing at the discriminator.
  • A lower LR for the generator.
  • Adding noise to the discriminator input.
  • A larger LR for D and a smaller LR for G
    • e.g., 0.0004 and 0.0001

Have you tried anything similar to these, so I can try the rest?
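
For reference, a minimal PyTorch sketch of the last two ideas (a larger LR for D than for G, plus one-sided label smoothing on the real targets). The tiny models are placeholders, not this repo's networks:

```python
import torch
import torch.nn.functional as F

# Placeholder modules just to make the sketch self-contained;
# substitute the actual generator / discriminator here.
generator = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.Tanh())
discriminator = torch.nn.Sequential(torch.nn.Linear(256, 1))

# Two-time-scale learning rates: larger for D, smaller for G.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4)

def discriminator_loss(real_out, fake_out, smooth=0.9):
    # One-sided label smoothing: real targets are 0.9 instead of 1.0.
    real_loss = F.mse_loss(real_out, smooth * torch.ones_like(real_out))
    fake_loss = F.mse_loss(fake_out, torch.zeros_like(fake_out))
    return real_loss + fake_loss
```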

@dathudeptrai
Contributor

dathudeptrai commented Feb 4, 2020

@erogol @kan-bayashi hi, using transposed convolution rather than upsampling in the generator will help (use a large enough kernel_size and stride). I combined MelGAN and Parallel WaveGAN as in discussion #46, and the results seem very good to me.
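
A minimal sketch of such a transposed-convolution upsampling block in PyTorch (an illustration only, not the exact layer from #46; the kernel size is chosen as a multiple of the stride to reduce checkerboard artifacts):

```python
import torch
import torch.nn as nn

class TransposedUpsample(nn.Module):
    """Upsample (batch, channels, time) features by `stride` with ConvTranspose1d."""

    def __init__(self, channels, stride):
        super().__init__()
        # kernel_size = 2 * stride keeps the kernel an integer multiple
        # of the stride, which helps avoid checkerboard artifacts.
        self.conv = nn.ConvTranspose1d(
            channels, channels,
            kernel_size=2 * stride,
            stride=stride,
            padding=stride // 2,
        )

    def forward(self, x):
        return self.conv(x)

x = torch.randn(1, 256, 100)                 # mel-rate features
print(TransposedUpsample(256, 4)(x).shape)   # torch.Size([1, 256, 400])
```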

@erogol
Contributor Author

erogol commented Feb 5, 2020

@dathudeptrai did you merge the models only architecturally, or the training methods as well?

@dathudeptrai
Contributor

@erogol both; I use the training method from this repo (v2, as you are using).

@kan-bayashi
Owner

That is interesting.
I will add a MelGAN generator this weekend.

@dathudeptrai
Contributor

dathudeptrai commented Feb 5, 2020

Keep in mind that for every GAN architecture the authors tuned the parameters and the architecture carefully (so the generator doesn't overpower the discriminator and vice versa). In your config you don't use the original discriminator (you use the residual discriminator instead), which makes the discriminator stronger, so you need to modify the generator too (increase the number of layers, kernel size, ...) :D.

@kan-bayashi
Owner

I added the MelGAN generator in #62.
Training is ongoing :D

@dathudeptrai
Contributor

@kan-bayashi wow, very fast :)). Training should be about 5x faster than Parallel WaveGAN :D, let's see.

@kan-bayashi
Owner

kan-bayashi commented Feb 8, 2020

I added initial MelGAN results in #65 (based on the v1 config).
The results seem reasonable and there is room for improvement if we continue to train for more iterations (training is ongoing).
I'm also trying a v2-based MelGAN (ongoing). Please look forward to seeing the results :D

@dathudeptrai
Contributor

dathudeptrai commented Feb 8, 2020

@kan-bayashi let's see. I'm training MelGAN (almost the same as you are doing; I use the multi-scale discriminator loss that MelGAN introduces, plus some modifications) with 8-bit fake-quantization training. I also converted it successfully to TFLite; the inference time on a single mobile CPU thread is around 2x faster than real time.
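
For the TFLite step, a minimal conversion sketch with the standard TF 2.x converter API (the Keras model below is only a stand-in for a TensorFlow port of the generator, not @dathudeptrai's actual pipeline):

```python
import tensorflow as tf

# Stand-in Keras model; replace with the ported MelGAN generator.
melgan_tf = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 80)),
    tf.keras.layers.Conv1D(256, 7, padding="same", activation="tanh"),
    tf.keras.layers.Conv1D(1, 7, padding="same"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(melgan_tf)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("melgan.tflite", "wb") as f:
    f.write(tflite_model)
```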

@kan-bayashi
Owner

kan-bayashi commented Feb 8, 2020

@dathudeptrai
You mean this feature?
https://pytorch.org/docs/stable/quantization.html#torch.quantization.FakeQuantize
That is amazing! Real-time generation on-device has come true :D

@dathudeptrai
Contributor

dathudeptrai commented Feb 8, 2020

@kan-bayashi No, I use https://github.com/Xilinx/brevitas (you need to expand the input to 4D for training; this framework only supports Conv2d :D, but it's enough :D). I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D. It seems all we need to do now is enhance the quality :)); speed isn't a problem anymore :D
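
To illustrate the 4D-expansion trick (making 1D audio features look like images so a Conv2d-only quantization framework can handle them), a minimal sketch; plain nn.Conv2d is used here instead of a brevitas quantized layer, just to show the reshaping:

```python
import torch
import torch.nn as nn

class Conv1dAsConv2d(nn.Module):
    """Run a 1D convolution through Conv2d by adding a dummy height dimension."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Swap in a quantized Conv2d (e.g. from brevitas) here.
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(1, kernel_size),
                              padding=(0, kernel_size // 2))

    def forward(self, x):      # x: (B, C, T)
        x = x.unsqueeze(2)     # -> (B, C, 1, T), i.e. a 4D "image"
        x = self.conv(x)       # convolve over the time axis only
        return x.squeeze(2)    # -> (B, C_out, T)

x = torch.randn(1, 80, 200)
print(Conv1dAsConv2d(80, 256, 7)(x).shape)  # torch.Size([1, 256, 200])
```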

@erogol
Contributor Author

erogol commented Feb 10, 2020

I can tell that, even though there is a jump after 100K, voice fidelity still improves as training continues. It is interesting to see.

@dathudeptrai is there an official way to convert models to TF, or is it just loading the weights into the same architecture?

@dathudeptrai
Contributor

dathudeptrai commented Feb 10, 2020

@erogol, ONNX is one way you can try, but in my experience loading the weights into the same architecture is best :)). I can implement both the torch and TF versions very fast :D and the conversion process is just one for-loop :D. In addition, once you are in TF you can easily convert to TFLite and TensorRT to optimize for inference. I don't really like using an intermediate representation like ONNX because of the limited operator support :D
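
A minimal sketch of that kind of one-loop weight copy, assuming the PyTorch and Keras models were built with layers in the same order (layout transposes depend on the actual layer types, so treat this as an outline rather than a drop-in converter):

```python
import numpy as np

def copy_torch_to_tf(torch_model, tf_model):
    """Copy weights from a PyTorch model into a Keras model with the same architecture."""
    tf_weights = []
    for p in torch_model.parameters():
        w = p.detach().cpu().numpy()
        if w.ndim == 3:                    # Conv1d kernel: (out, in, k) -> (k, in, out)
            w = np.transpose(w, (2, 1, 0))
        elif w.ndim == 2:                  # Linear weight: (out, in) -> (in, out)
            w = w.T
        tf_weights.append(w)
    tf_model.set_weights(tf_weights)       # assumes matching order and shapes
```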

@erogol
Contributor Author

erogol commented Feb 14, 2020

I successfully trained MelGAN (generator) in my fork. MelGAN requires the v1 (smaller) discriminator. I guess the larger D is too strong for MelGAN to learn against.

You can check the model with TTS here:
https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

There is still slight background noise, but it comes from the TTS model. Somehow, GAN models are not as robust to the spectrogram representation as WaveRNN. Maybe it is a good idea to inject some random noise during training.

@dathudeptrai
Contributor

@erogol please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

@kan-bayashi
Owner

kan-bayashi commented Feb 14, 2020

Currently, I am trying to train the MelGAN generator with

  • residual discriminator (v2-based)
  • MelGAN discriminator

The training curve is not stable, but it sounds more natural than v1.
Once I finish training, I will upload the config and results.

@kan-bayashi
Owner

kan-bayashi commented Feb 14, 2020

please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

That's nice.
I will try the following config.

generator_type: "MelGANGenerator" # Generator type.
generator_params:
    in_channels: 80                  # Number of input channels.
    out_channels: 1                  # Number of output channels.
    kernel_size: 7                   # Kernel size of initial and final conv layers.
    channels: 1024                   # Initial number of channels for conv layers.
    upsample_scales: [4, 4, 4, 2, 2] # List of Upsampling scales.
    stack_kernel_size: 3             # Kernel size of dilated conv layers in residual stack.
    stacks: 4                        # Number of stacks in a single residual stack module.
    use_weight_norm: True            # Whether to use weight normalization.
    use_causal_conv: False           # Whether to use causal convolution.

@dathudeptrai
Contributor

dathudeptrai commented Feb 14, 2020

I think the MelGAN D is stronger than the v2 D :))) based on my knowledge :))), and its discriminator seems very similar to the GAN-TTS discriminator (Google ICLR 2020 paper): https://arxiv.org/abs/1909.11646

@erogol
Contributor Author

erogol commented Feb 14, 2020

@erogol please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

It wouldn't help, since my hop_length is 275 and those ratios multiply to 256. Now I am training a new pair of models with a hop_length of 256; there I can try the upsampling params.
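
As a quick sanity check before training (a minimal sketch, not from the repo), the product of the upsample scales has to match hop_length:

```python
from functools import reduce
from operator import mul

def check_scales(upsample_scales, hop_length):
    product = reduce(mul, upsample_scales, 1)
    assert product == hop_length, (
        f"upsample scales multiply to {product}, but hop_length is {hop_length}")

check_scales([4, 4, 4, 2, 2], 256)    # OK
# check_scales([4, 4, 4, 2, 2], 275)  # fails: 256 != 275
```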

@kan-bayashi
Owner

kan-bayashi commented Feb 19, 2020

I added several new results.

According to a few samples, it seems that melgan.v3 >> melgan_large.v1 >= melgan.v1 in terms of naturalness.

But the training curve of v3 is not so stable, i.e., both the fake and real losses gradually decrease.
And the feature-matching loss kept increasing...
Also, the STFT-based losses are much higher than v1, yet the naturalness is higher.
This is interesting, but it is difficult to estimate quality from the loss values :(

@dathudeptrai
Contributor

dathudeptrai commented Feb 19, 2020

@kan-bayashi same observation as mine :))). The STFT, spectral, and feature-matching losses keep increasing over training :D, and it needs much longer training; I trained around 3M steps. I'm also thinking about the discriminator-overpowering problem: it may look like a problem in the loss curves, but in terms of naturalness it isn't :v. It seems we need another metric to measure sound quality (a MOS score). I'm reading about the Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance, which the authors found to be well correlated with MOS (see the GAN-TTS paper: https://arxiv.org/pdf/1909.11646.pdf) :D

@kan-bayashi
Owner

kan-bayashi commented Feb 21, 2020

How about the noise? In my experiments, MelGAN D has no such noise as PWG.

You mean the noise observed in the melgan.v1 model, i.e., MelGAN G trained with PWG D?
If we train with MelGAN D (melgan.v3), such noise disappears.

@dathudeptrai
Contributor

@kan-bayashi okay :D. Just curious, what is your next plan :))? Which version will you train next, v2? :D

@kan-bayashi
Owner

Currently, I'm training MelGAN G + MelGAN D based on your idea (#61 (comment)), and I'm curious about the results with PWG G + MelGAN D. I want to try it.
I've trained several models based on v2, i.e., ResidualParallelWaveGANDiscriminator, but the results are not so good, so v2 is not a high priority.

@kan-bayashi
Owner

kan-bayashi commented Feb 28, 2020

I compared melgan.v3.long samples with samples trained w/o the STFT loss after introducing the discriminator @ 4M iters. melgan.v3.long is clearly better, while the training curve of the discriminator loss is almost the same.

Red: w/o STFT-loss after introducing the discriminator
Blue+Orange: melgan.v3.long

The STFT loss here is not used for backpropagation, only for monitoring.

So, I conclude MelGAN G + MelGAN D + STFT-loss can improve the quality, or at least the convergence speed.

@dathudeptrai
Contributor

okay :D, how about PWG G + MelGAN D + STFT loss?

@kan-bayashi
Owner

okay :D, how about PWG G + MelGAN D + STFT loss?

Ongoing. The quality @ 1.1M iters is so-so.

@nartes

nartes commented Feb 28, 2020

You can check the model with TTS here:
https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

waveform.flatten() is required; otherwise you get a 3D tensor of shape 1 x 1 x sample_length.
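
For example (a minimal sketch; the sample rate and the use of soundfile as the writer are just placeholders):

```python
import numpy as np
import soundfile as sf

# Stand-in for the vocoder output, which has shape (1, 1, sample_length).
waveform = np.zeros((1, 1, 22050), dtype=np.float32)

# Flatten to 1-D before writing; otherwise the extra dimensions are
# misinterpreted as channels or rejected by the writer.
sf.write("out.wav", waveform.flatten(), 22050)
```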


@nartes

nartes commented Feb 28, 2020

First verse from "Milky Chance - The Game": vocoder 400K EN, LJSpeech, ParallelWaveGAN, + spleeter, + manual lyrics alignment in Audacity.

@kan-bayashi
Owner

kan-bayashi commented Feb 28, 2020

@nartes Thank you for sharing the interesting sample!
Did you remove the vocals using spleeter and then mix the generated voice with the separated instrumental track?

@nartes

nartes commented Feb 28, 2020

Yeah. Also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences,
and later I needed to stretch the audio by 20% with a default algorithm in Audacity
to match the original verse by Milky Chance.

@kan-bayashi
Owner

Yeah. Also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences,
and later I needed to stretch the audio by 20% with a default algorithm in Audacity
to match the original verse by Milky Chance.

Interesting! Great work:)

@nartes

nartes commented Mar 8, 2020

  1. What happens if the 100K restriction for the discriminator is replaced with, say, 0K? I mean not pretraining the model before applying the adversary.
  2. What options are available to reduce model capacity so that a higher batch size is supported, like 32 or 64, at least with a 16K sampling rate?
  3. What are the GPU memory requirements at the moment?
     I use about 12GiB of GPU RAM with vanilla WaveNet, a 16K sampling rate, and batch size 6. That is about 4-6M weight parameters.
  4. Is it correct that without the discriminator, the pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples before the 130K checkpoint.

@kan-bayashi
Owner

What happens if the 100K restriction for the discriminator is replaced with, say, 0K? I mean not pretraining the model before applying the adversary.

I tried it. The quality is OK, but it contains a strange noise.
#27 (comment)

What options are available to reduce model capacity so that a higher batch size is supported, like 32 or 64, at least with a 16K sampling rate?

You can try MelGAN or reduce batch_max_steps.

What are the GPU memory requirements at the moment?
I use about 12GiB of GPU RAM with vanilla WaveNet, a 16K sampling rate, and batch size 6. That is about 4-6M weight parameters.

You mean for training? Please check the header comments in each config.
https://github.com/kan-bayashi/ParallelWaveGAN#results
There is information there about the required GPU.

Is it correct that without the discriminator, the pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples before the 130K checkpoint.

If we use only the STFT loss, the sound will be robotic.
Adversarial training is needed to improve the perceptual quality.
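
For context, a single-resolution sketch of the kind of STFT loss being discussed (spectral convergence plus log-magnitude terms). The repo's actual loss is multi-resolution, so this is only an illustration:

```python
import torch
import torch.nn.functional as F

def stft_loss(pred, target, fft_size=1024, hop_size=256, win_size=1024):
    """Single-resolution STFT loss: spectral convergence + log STFT magnitude."""
    window = torch.hann_window(win_size, device=pred.device)

    def magnitude(x):
        spec = torch.stft(x, fft_size, hop_size, win_size,
                          window=window, return_complex=True)
        return spec.abs().clamp(min=1e-7)

    pred_mag, target_mag = magnitude(pred), magnitude(target)
    sc_loss = torch.norm(target_mag - pred_mag, p="fro") / torch.norm(target_mag, p="fro")
    mag_loss = F.l1_loss(torch.log(pred_mag), torch.log(target_mag))
    return sc_loss + mag_loss

pred, target = torch.randn(2, 16000), torch.randn(2, 16000)
print(stft_loss(pred, target))
```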

@nartes

nartes commented Mar 17, 2020

Q: Is there some place where the melspectrogram is being transformed? I couldn't find it in the repo source code.

  1. The MelGAN vocoder is trained to decode the melspectrogram.

  2. The input features are generated with

    def logmelfilterbank(audio,

  3. Yet comparing the FastSpeech melspectrogram synthesizer's output with the output of logmelfilterbank, I get different magnitude ranges and different synthesized audio:

  • [-4.9; 0.34] (logmelfilterbank)
  • [-1.6; 2.35] (FastSpeech)

  4. Below are examples of the magnitude distribution using the public Google Colab notebook linked in the repo.
     I replaced the synthesized melspectrogram c with mel_spec2.
     The output is audible but noisy, and it gets better if the spectrogram is shifted, e.g., mel_spec2 + 2.0.
     I used the 1000K MelGAN in the notebook.
     With the 400K MelGAN and without shifting the melspectrogram, the output is too quiet
     and has a constant artificial-frequency noise throughout.

@kan-bayashi
Owner

The melspectrogram is normalized to be mean=0, variance=1 using the statistics of training data.

parallel-wavegan-normalize \
--config "${conf}" \
--stats "${dumpdir}/${train_set}/stats.${stats_ext}" \
--rootdir "${dumpdir}/${name}/raw/dump.JOB" \
--dumpdir "${dumpdir}/${name}/norm/dump.JOB" \
--verbose "${verbose}"

Yet comparing the FastSpeech melspectrogram synthesizer's output with the output of logmelfilterbank, I get different magnitude ranges and different synthesized audio:

This is because I normalize to mean=0, variance=1.
Explicit maximum and minimum values are not defined.

With the 400K MelGAN and without shifting the melspectrogram, the output is too quiet
and has a constant artificial-frequency noise throughout.

I'm not sure what you are doing, but my pretrained models assume that the inputs are normalized using the statistics of training data.
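
A minimal sketch of applying that normalization before feeding a pretrained model, assuming you have loaded the per-bin mean and scale from the training-data statistics file (the variable names here are placeholders):

```python
import numpy as np

def normalize_mel(mel, mean, scale):
    """Normalize a (frames, n_mels) log-mel spectrogram with training-data stats."""
    return (mel - mean) / scale

mel = np.random.randn(100, 80).astype(np.float32)   # e.g. FastSpeech output
mean = np.zeros(80, dtype=np.float32)                # placeholder statistics
scale = np.ones(80, dtype=np.float32)
mel_norm = normalize_mel(mel, mean, scale)
```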

@nartes

nartes commented Mar 17, 2020

Indeed, forgot about it, thanks!

@John-K92

@kan-bayashi No, I use https://github.com/Xilinx/brevitas (you need to expand the input to 4D for training; this framework only supports Conv2d :D, but it's enough :D). I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D. It seems all we need to do now is enhance the quality :)); speed isn't a problem anymore :D

It's been a great honor to follow your and other researchers' work on this repository. :)
I am interested in a lightweight version of the voice-synthesis model, ideally on a mobile platform.
It seems that you have successfully converted the model to a mobile version.
Could you kindly share the model, or at least tell us about its performance?
Thanks in advance.

@kan-bayashi
Owner

@John-K92 Please check the demo HP and the results section in the README.
https://kan-bayashi.github.io/ParallelWaveGAN
https://github.com/kan-bayashi/ParallelWaveGAN#results
You can access the samples and model files there.

MelGAN is very light while keeping comparable performance.
If you want to use it on mobile, the conversion notebook from PyTorch to TensorFlow will help you.

@dathudeptrai
Contributor

@John-K92 the model I'm using on the mobile device performs on par with the original MelGAN (it's still a float model, so the performance is the same).

@John-K92

@John-K92 the model I'm using on the mobile device performs on par with the original MelGAN (it's still a float model, so the performance is the same).

Thank you for sharing your experience. Then, may I ask your opinion on text-to-mel models, such as FastSpeech, Tacotron, etc., and how they could be deployed on mobile platforms along with MelGAN?
As mentioned by @kan-bayashi and in your work, the mel-to-voice model seems light enough for mobile devices. However, would both models together be light enough on mobile for end-to-end TTS? Or would you recommend a different structure or process?

@dathudeptrai
Contributor

dathudeptrai commented Mar 25, 2020

@John-K92 I just want to say that mel-to-voice is light enough, but it is still slower than my text-to-mel model :D. My text-to-mel model is fully convolutional; deploying an RNN on a mobile device is very hard, so Tacotron should be avoided :D

@John-K92

@John-K92 Please check the demo HP and the results section in the README.
https://kan-bayashi.github.io/ParallelWaveGAN
https://github.com/kan-bayashi/ParallelWaveGAN#results
You can access the samples and model files there.

MelGAN is very light while keeping comparable performance.
If you want to use it on mobile, the conversion notebook from PyTorch to TensorFlow will help you.

I am currently trying to convert the model to TensorFlow and then to a mobile version (TensorRT or TensorFlow Lite):
https://colab.research.google.com/github/kan-bayashi/ParallelWaveGAN/blob/master/notebooks/convert_melgan_from_pytorch_to_tensorflow.ipynb#scrollTo=XOK6AuWW9R8N&line=3&uniqifier=1
But there seems to be an error in the code you've provided. Did you use a specific TensorFlow version?

@dathudeptrai
Contributor

What error?

@John-K92

What error?

When I run the code, the line audio = TFMelGANGenerator(**config["generator_params"])(inputs) raises an InaccessibleTensorError:

InaccessibleTensorError: The tensor 'Tensor("conv2d_346/dilation_rate:0", shape=(2,), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=call, id=140430667155440); accessed from: FuncGraph(name=call, id=140430364789896).

Do you not hit this error on your side?
