
Generator exploded after ~138K iters. #61

Closed
erogol opened this issue Jan 30, 2020 · 58 comments

@erogol
Contributor

erogol commented Jan 30, 2020

I observed an interesting behaviour after ~138K iters where the discriminator dominated the training and the generator's losses exploded on both train and validation. Do you have any idea why, and how to prevent it?

I am training on LJSpeech and I basically use the same learning schedule you released with the v2 config for LJSpeech (train the generator alone until 100K iters, then enable the discriminator).

Here is the tensorboard screenshot.


@kan-bayashi
Owner

Hi @erogol.
I also ran into this problem and am trying to solve it now (#54). You can see our discussion in #27 (comment).
The things I have tried so far:

  • Use a larger batch size (suggested by @Rayhane-mamah).
    But batch sizes of 10 and 20 also hit this problem.
  • Use a halved LR for the discriminator.
    This can prevent the explosion, but the discriminator still gradually becomes stronger and the resulting quality is not so good.

@erogol
Contributor Author

erogol commented Jan 31, 2020

A short literature review gave me these options, if they make any sense:

  • Label smoothing at the discriminator.
  • A lower LR for the generator.
  • Adding noise to the discriminator input.
  • A larger LR for D and a smaller LR for G
    • e.g., 0.0004 and 0.0001

Have you tried anything similar to these, so I can try the rest?
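
For reference, a minimal PyTorch sketch of the last two ideas (a larger LR for D than for G, plus one-sided label smoothing on the real targets). The tiny models are placeholders, not this repo's networks:

```python
import torch
import torch.nn.functional as F

# Placeholder modules just to make the sketch self-contained;
# substitute the actual generator / discriminator here.
generator = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.Tanh())
discriminator = torch.nn.Sequential(torch.nn.Linear(256, 1))

# Two-time-scale learning rates: larger for D, smaller for G.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4)

def discriminator_loss(real_out, fake_out, smooth=0.9):
    # One-sided label smoothing: real targets are 0.9 instead of 1.0.
    real_loss = F.mse_loss(real_out, smooth * torch.ones_like(real_out))
    fake_loss = F.mse_loss(fake_out, torch.zeros_like(fake_out))
    return real_loss + fake_loss
```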

@dathudeptrai
Contributor

dathudeptrai commented Feb 4, 2020

@erogol @kan-bayashi hi, using transposed convolution rather than upsampling in the generator will help (use a large enough kernel_size and stride). I combined MelGAN and Parallel WaveGAN as in discussion #46, and the results seem very good to me.
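
A minimal sketch of such a transposed-convolution upsampling block in PyTorch (an illustration only, not the exact layer from #46; the kernel size is chosen as a multiple of the stride to reduce checkerboard artifacts):

```python
import torch
import torch.nn as nn

class TransposedUpsample(nn.Module):
    """Upsample (batch, channels, time) features by `stride` with ConvTranspose1d."""

    def __init__(self, channels, stride):
        super().__init__()
        # kernel_size = 2 * stride keeps the kernel an integer multiple
        # of the stride, which helps avoid checkerboard artifacts.
        self.conv = nn.ConvTranspose1d(
            channels, channels,
            kernel_size=2 * stride,
            stride=stride,
            padding=stride // 2,
        )

    def forward(self, x):
        return self.conv(x)

x = torch.randn(1, 256, 100)                 # mel-rate features
print(TransposedUpsample(256, 4)(x).shape)   # torch.Size([1, 256, 400])
```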

@erogol
Contributor Author

erogol commented Feb 5, 2020

@dathudeptrai did you merge the models only architecturally, or the training methods as well?

@dathudeptrai
Contributor

@erogol both; I use the training method from this repo (v2, as you are using).

@kan-bayashi
Owner

That is interesting.
I will add a MelGAN generator this weekend.

@dathudeptrai
Contributor

dathudeptrai commented Feb 5, 2020

Keep in mind that for every GAN architecture the authors tuned the parameters and the architecture carefully (so the generator doesn't overpower the discriminator and vice versa). In your config you don't use the original discriminator (you use the residual discriminator instead), which makes the discriminator stronger, so you need to modify the generator too (increase the number of layers, kernel size, ...) :D.

@kan-bayashi
Owner

I added the MelGAN generator in #62.
Training is ongoing :D

@dathudeptrai
Contributor

@kan-bayashi wow, very fast :)). Training should be about 5x faster than Parallel WaveGAN :D, let's see.

@kan-bayashi
Owner

kan-bayashi commented Feb 8, 2020

I added initial MelGAN results in #65 (based on the v1 config).
The results seem reasonable and there is room for improvement if we continue to train for more iterations (training is ongoing).
I'm also trying a v2-based MelGAN (ongoing). Please look forward to seeing the results :D

@dathudeptrai
Contributor

dathudeptrai commented Feb 8, 2020

@kan-bayashi let's see. I'm training MelGAN (almost the same as you are doing; I use the multi-scale discriminator loss that MelGAN introduces, plus some modifications) with 8-bit fake-quantization training. I also converted it successfully to TFLite; the inference time on a single mobile CPU thread is around 2x faster than real time.
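
For the TFLite step, a minimal conversion sketch with the standard TF 2.x converter API (the Keras model below is only a stand-in for a TensorFlow port of the generator, not @dathudeptrai's actual pipeline):

```python
import tensorflow as tf

# Stand-in Keras model; replace with the ported MelGAN generator.
melgan_tf = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 80)),
    tf.keras.layers.Conv1D(256, 7, padding="same", activation="tanh"),
    tf.keras.layers.Conv1D(1, 7, padding="same"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(melgan_tf)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("melgan.tflite", "wb") as f:
    f.write(tflite_model)
```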

@kan-bayashi
Owner

kan-bayashi commented Feb 8, 2020

@dathudeptrai
You mean this feature?
https://pytorch.org/docs/stable/quantization.html#torch.quantization.FakeQuantize
That is amazing! Real-time generation on-device has come true :D

@dathudeptrai
Contributor

dathudeptrai commented Feb 8, 2020

@kan-bayashi No, I use https://github.com/Xilinx/brevitas (you need to expand the input to 4D for training; this framework only supports Conv2d :D, but it's enough :D). I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D. It seems all we need to do now is enhance the quality :)); speed isn't a problem anymore :D
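
To illustrate the 4D-expansion trick (making 1D audio features look like images so a Conv2d-only quantization framework can handle them), a minimal sketch; plain nn.Conv2d is used here instead of a brevitas quantized layer, just to show the reshaping:

```python
import torch
import torch.nn as nn

class Conv1dAsConv2d(nn.Module):
    """Run a 1D convolution through Conv2d by adding a dummy height dimension."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Swap in a quantized Conv2d (e.g. from brevitas) here.
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(1, kernel_size),
                              padding=(0, kernel_size // 2))

    def forward(self, x):      # x: (B, C, T)
        x = x.unsqueeze(2)     # -> (B, C, 1, T), i.e. a 4D "image"
        x = self.conv(x)       # convolve over the time axis only
        return x.squeeze(2)    # -> (B, C_out, T)

x = torch.randn(1, 80, 200)
print(Conv1dAsConv2d(80, 256, 7)(x).shape)  # torch.Size([1, 256, 200])
```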

@erogol
Contributor Author

erogol commented Feb 10, 2020

I can tell that, even though there is a jump after 100K, voice fidelity still improves as training continues. It is interesting to see.

@dathudeptrai is there an official way to convert models to TF, or is it just loading the weights into the same architecture?

@dathudeptrai
Contributor

dathudeptrai commented Feb 10, 2020

@erogol, ONNX is one way you can try, but in my experience loading the weights into the same architecture is best :)). I can implement both the torch and TF versions very fast :D and the conversion process is just one for-loop :D. In addition, once you are in TF you can easily convert to TFLite and TensorRT to optimize for inference. I don't really like using an intermediate representation like ONNX because of the limited operator support :D
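
A minimal sketch of that kind of one-loop weight copy, assuming the PyTorch and Keras models were built with layers in the same order (layout transposes depend on the actual layer types, so treat this as an outline rather than a drop-in converter):

```python
import numpy as np

def copy_torch_to_tf(torch_model, tf_model):
    """Copy weights from a PyTorch model into a Keras model with the same architecture."""
    tf_weights = []
    for p in torch_model.parameters():
        w = p.detach().cpu().numpy()
        if w.ndim == 3:                    # Conv1d kernel: (out, in, k) -> (k, in, out)
            w = np.transpose(w, (2, 1, 0))
        elif w.ndim == 2:                  # Linear weight: (out, in) -> (in, out)
            w = w.T
        tf_weights.append(w)
    tf_model.set_weights(tf_weights)       # assumes matching order and shapes
```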

@erogol
Contributor Author

erogol commented Feb 14, 2020

I successfully trained MelGAN (generator) in my fork. MelGAN requires the v1 (smaller) discriminator. I guess the larger D is too strong for MelGAN to learn against.

You can check the model with TTS here:
https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

There is still slight background noise, but it comes from the TTS model. Somehow, GAN models are not as robust to the spectrogram representation as WaveRNN. Maybe it is a good idea to inject some random noise during training.

@dathudeptrai
Contributor

@erogol please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

@kan-bayashi
Owner

kan-bayashi commented Feb 14, 2020

Currently, I am trying to train the MelGAN generator with

  • residual discriminator (v2-based)
  • MelGAN discriminator

The training curve is not stable, but it sounds more natural than v1.
Once I finish training, I will upload the config and results.

@kan-bayashi
Owner

kan-bayashi commented Feb 14, 2020

please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

That's nice.
I will try the following config.

generator_type: "MelGANGenerator" # Generator type.
generator_params:
    in_channels: 80                  # Number of input channels.
    out_channels: 1                  # Number of output channels.
    kernel_size: 7                   # Kernel size of initial and final conv layers.
    channels: 1024                   # Initial number of channels for conv layers.
    upsample_scales: [4, 4, 4, 2, 2] # List of Upsampling scales.
    stack_kernel_size: 3             # Kernel size of dilated conv layers in residual stack.
    stacks: 4                        # Number of stacks in a single residual stack module.
    use_weight_norm: True            # Whether to use weight normalization.
    use_causal_conv: False           # Whether to use causal convolution.

@dathudeptrai
Contributor

dathudeptrai commented Feb 14, 2020

I think the MelGAN D is stronger than the v2 D :))) based on my knowledge :))), and its discriminator seems very similar to the GAN-TTS discriminator (Google ICLR 2020 paper): https://arxiv.org/abs/1909.11646

@erogol
Contributor Author

erogol commented Feb 14, 2020

@erogol please try ratios = [4,4,4,2,2] and increase n_residual; maybe it can help you eliminate the background noise :D

It wouldn't help, since my hop_length is 275 and those ratios multiply to 256. Now I am training a new pair of models with a hop_length of 256; there I can try the upsampling params.
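
As a quick sanity check before training (a minimal sketch, not from the repo), the product of the upsample scales has to match hop_length:

```python
from functools import reduce
from operator import mul

def check_scales(upsample_scales, hop_length):
    product = reduce(mul, upsample_scales, 1)
    assert product == hop_length, (
        f"upsample scales multiply to {product}, but hop_length is {hop_length}")

check_scales([4, 4, 4, 2, 2], 256)    # OK
# check_scales([4, 4, 4, 2, 2], 275)  # fails: 256 != 275
```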

@kan-bayashi
Owner

kan-bayashi commented Feb 19, 2020

I added several new results.

According to a few samples, it seems that melgan.v3 >> melgan_large.v1 >= melgan.v1 in terms of naturalness.

But the training curve of v3 is not so stable, i.e., both the fake and real losses gradually decrease.
And the feature-matching loss kept increasing...
Also, the STFT-based losses are much higher than v1, yet the naturalness is higher.
This is interesting, but it is difficult to estimate quality from the loss values :(

@dathudeptrai
Contributor

dathudeptrai commented Feb 19, 2020

@kan-bayashi same observation as mine :))). The STFT, spectral, and feature-matching losses keep increasing over training :D, and it needs much longer training; I trained around 3M steps. I'm also thinking about the discriminator-overpowering problem: it may look like a problem in the loss curves, but in terms of naturalness it isn't :v. It seems we need another metric to measure sound quality (a MOS score). I'm reading about the Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance, which the authors found to be well correlated with MOS (see the GAN-TTS paper: https://arxiv.org/pdf/1909.11646.pdf) :D

@kan-bayashi
Owner

kan-bayashi commented Feb 21, 2020

How about the noise? In my experiments, MelGAN D has no such noise as PWG.

You mean the noise observed in the melgan.v1 model, i.e., MelGAN G trained with PWG D?
If we train with MelGAN D (melgan.v3), such noise disappears.

@dathudeptrai
Contributor

@kan-bayashi okay :D. Just curious, what is your next plan :))? Which version will you train next, v2? :D

@kan-bayashi
Owner

Currently, I'm training MelGAN G + MelGAN D based on your idea (#61 (comment)), and I'm curious about the results with PWG G + MelGAN D. I want to try it.
I've trained several models based on v2, i.e., ResidualParallelWaveGANDiscriminator, but the results are not so good, so v2 is not a high priority.

@kan-bayashi
Owner

kan-bayashi commented Feb 28, 2020

I compared melgan.v3.long samples with samples trained w/o the STFT loss after introducing the discriminator @ 4M iters. melgan.v3.long is clearly better, while the training curve of the discriminator loss is almost the same.

Red: w/o STFT-loss after introducing the discriminator
Blue+Orange: melgan.v3.long

The STFT loss here is not used for backpropagation, only for monitoring.

So, I conclude MelGAN G + MelGAN D + STFT-loss can improve the quality, or at least the convergence speed.

@dathudeptrai
Contributor

okay :D, how about PWG G + MelGAN D + STFT loss?

@kan-bayashi
Owner

okay :D, how about PWG G + MelGAN D + STFT loss?

Ongoing. The quality @ 1.1M iters is so-so.

@nartes

nartes commented Feb 28, 2020

You can check the model with TTS here:
https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

waveform.flatten() is required; otherwise you get a 3D tensor of shape 1 x 1 x sample_length.
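
For example (a minimal sketch; the sample rate and the use of soundfile as the writer are just placeholders):

```python
import numpy as np
import soundfile as sf

# Stand-in for the vocoder output, which has shape (1, 1, sample_length).
waveform = np.zeros((1, 1, 22050), dtype=np.float32)

# Flatten to 1-D before writing; otherwise the extra dimensions are
# misinterpreted as channels or rejected by the writer.
sf.write("out.wav", waveform.flatten(), 22050)
```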


@nartes

nartes commented Feb 28, 2020

First verse from "Milky Chance - The Game": vocoder 400K EN, LJSpeech, ParallelWaveGAN, + spleeter, + manual lyrics alignment in Audacity.

@kan-bayashi
Owner

kan-bayashi commented Feb 28, 2020

@nartes Thank you for sharing the interesting sample!
Did you remove the vocals using spleeter and then mix the generated voice with the separated instrumental track?

@nartes

nartes commented Feb 28, 2020

Yeah. Also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences,
and later I needed to stretch the audio by 20% with a default algorithm in Audacity
to match the original verse by Milky Chance.

@kan-bayashi
Owner

Yeah. Also, feeding the lyrics line by line into Tacotron synthesizes distinct sentences,
and later I needed to stretch the audio by 20% with a default algorithm in Audacity
to match the original verse by Milky Chance.

Interesting! Great work:)

@nartes

nartes commented Mar 8, 2020

  1. What happens if the 100K restriction for the discriminator is replaced with, say, 0K? I mean not pretraining the model before applying the adversary.
  2. What options are available to reduce model capacity so that a higher batch size is supported, like 32 or 64, at least with a 16K sampling rate?
  3. What are the GPU memory requirements at the moment?
     I use about 12GiB of GPU RAM with vanilla WaveNet, a 16K sampling rate, and batch size 6. That is about 4-6M weight parameters.
  4. Is it correct that without the discriminator, the pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples before the 130K checkpoint.

@kan-bayashi
Owner

What happens if the 100K restriction for the discriminator is replaced with, say, 0K? I mean not pretraining the model before applying the adversary.

I tried it. The quality is OK, but it contains a strange noise.
#27 (comment)

What options are available to reduce model capacity so that a higher batch size is supported, like 32 or 64, at least with a 16K sampling rate?

You can try MelGAN or reduce batch_max_steps.

What are the GPU memory requirements at the moment?
I use about 12GiB of GPU RAM with vanilla WaveNet, a 16K sampling rate, and batch size 6. That is about 4-6M weight parameters.

You mean for training? Please check the header comments in each config.
https://github.com/kan-bayashi/ParallelWaveGAN#results
There is information there about the required GPU.

Is it correct that without the discriminator, the pure STFT loss produces artifacts in the synthesis? That was my conclusion after listening to intermediate ParallelWaveGAN examples before the 130K checkpoint.

If we use only the STFT loss, the sound will be robotic.
Adversarial training is needed to improve the perceptual quality.
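
For context, a single-resolution sketch of the kind of STFT loss being discussed (spectral convergence plus log-magnitude terms). The repo's actual loss is multi-resolution, so this is only an illustration:

```python
import torch
import torch.nn.functional as F

def stft_loss(pred, target, fft_size=1024, hop_size=256, win_size=1024):
    """Single-resolution STFT loss: spectral convergence + log STFT magnitude."""
    window = torch.hann_window(win_size, device=pred.device)

    def magnitude(x):
        spec = torch.stft(x, fft_size, hop_size, win_size,
                          window=window, return_complex=True)
        return spec.abs().clamp(min=1e-7)

    pred_mag, target_mag = magnitude(pred), magnitude(target)
    sc_loss = torch.norm(target_mag - pred_mag, p="fro") / torch.norm(target_mag, p="fro")
    mag_loss = F.l1_loss(torch.log(pred_mag), torch.log(target_mag))
    return sc_loss + mag_loss

pred, target = torch.randn(2, 16000), torch.randn(2, 16000)
print(stft_loss(pred, target))
```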

@nartes

nartes commented Mar 17, 2020

Q: Is there some place where the melspectrogram is being transformed? I couldn't find it in the repo source code.

  1. The MelGAN vocoder is trained to decode the melspectrogram.

  2. The input features are generated with

    def logmelfilterbank(audio,

  3. Yet comparing the FastSpeech melspectrogram synthesizer's output with the output of logmelfilterbank, I get different magnitude ranges and different synthesized audio:

  • [-4.9; 0.34] (logmelfilterbank)
  • [-1.6; 2.35] (FastSpeech)

  4. Below are examples of the magnitude distribution using the public Google Colab notebook linked in the repo.
     I replaced the synthesized melspectrogram c with mel_spec2.
     The output is audible but noisy, and it gets better if the spectrogram is shifted, e.g., mel_spec2 + 2.0.
     I used the 1000K MelGAN in the notebook.
     With the 400K MelGAN and without shifting the melspectrogram, the output is too quiet
     and has a constant artificial-frequency noise throughout.

@kan-bayashi
Owner

The melspectrogram is normalized to be mean=0, variance=1 using the statistics of training data.

parallel-wavegan-normalize \
--config "${conf}" \
--stats "${dumpdir}/${train_set}/stats.${stats_ext}" \
--rootdir "${dumpdir}/${name}/raw/dump.JOB" \
--dumpdir "${dumpdir}/${name}/norm/dump.JOB" \
--verbose "${verbose}"

Yet comparing the FastSpeech melspectrogram synthesizer's output with the output of logmelfilterbank, I get different magnitude ranges and different synthesized audio:

This is because I normalize to mean=0, variance=1.
Explicit maximum and minimum values are not defined.

With the 400K MelGAN and without shifting the melspectrogram, the output is too quiet
and has a constant artificial-frequency noise throughout.

I'm not sure what you are doing, but my pretrained models assume that the inputs are normalized using the statistics of training data.
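
A minimal sketch of applying that normalization before feeding a pretrained model, assuming you have loaded the per-bin mean and scale from the training-data statistics file (the variable names here are placeholders):

```python
import numpy as np

def normalize_mel(mel, mean, scale):
    """Normalize a (frames, n_mels) log-mel spectrogram with training-data stats."""
    return (mel - mean) / scale

mel = np.random.randn(100, 80).astype(np.float32)   # e.g. FastSpeech output
mean = np.zeros(80, dtype=np.float32)                # placeholder statistics
scale = np.ones(80, dtype=np.float32)
mel_norm = normalize_mel(mel, mean, scale)
```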

@nartes

nartes commented Mar 17, 2020

Indeed, forgot about it, thanks!

@John-K92

@kan-bayashi No, I use https://github.com/Xilinx/brevitas (you need to expand the input to 4D for training; this framework only supports Conv2d :D, but it's enough :D). I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D. It seems all we need to do now is enhance the quality :)); speed isn't a problem anymore :D

It's been a great honor to follow your and other researchers' work on this repository. :)
I am interested in a lightweight version of the voice-synthesis model, ideally on a mobile platform.
It seems that you have successfully converted the model to a mobile version.
Could you kindly share the model, or at least tell us about its performance?
Thanks in advance.

@kan-bayashi
Owner

@John-K92 Please check the demo HP and the results section in the README.
https://kan-bayashi.github.io/ParallelWaveGAN
https://github.com/kan-bayashi/ParallelWaveGAN#results
You can access the samples and model files there.

MelGAN is very light while keeping comparable performance.
If you want to use it on mobile, the conversion notebook from PyTorch to TensorFlow will help you.

@dathudeptrai
Contributor

@John-K92 the model I'm using on the mobile device performs on par with the original MelGAN (it's still a float model, so the performance is the same).

@John-K92

@John-K92 the model I'm using on the mobile device performs on par with the original MelGAN (it's still a float model, so the performance is the same).

Thank you for sharing your experience. Then, may I ask your opinion on text-to-mel models, such as FastSpeech, Tacotron, etc., and how they could be deployed on mobile platforms along with MelGAN?
As mentioned by @kan-bayashi and in your work, the mel-to-voice model seems light enough for mobile devices. However, would both models together be light enough on mobile for end-to-end TTS? Or would you recommend a different structure or process?

@dathudeptrai
Contributor

dathudeptrai commented Mar 25, 2020

@John-K92 I just want to say that mel-to-voice is light enough, but it is still slower than my text-to-mel model :D. My text-to-mel model is fully convolutional; deploying an RNN on a mobile device is very hard, so Tacotron should be avoided :D

@John-K92

@John-K92 Please check the demo HP and the results section in the README.
https://kan-bayashi.github.io/ParallelWaveGAN
https://github.com/kan-bayashi/ParallelWaveGAN#results
You can access the samples and model files there.

MelGAN is very light while keeping comparable performance.
If you want to use it on mobile, the conversion notebook from PyTorch to TensorFlow will help you.

I am currently trying to convert the model to TensorFlow and then to a mobile version (TensorRT or TensorFlow Lite):
https://colab.research.google.com/github/kan-bayashi/ParallelWaveGAN/blob/master/notebooks/convert_melgan_from_pytorch_to_tensorflow.ipynb#scrollTo=XOK6AuWW9R8N&line=3&uniqifier=1
But there seems to be an error in the code you've provided. Did you use a specific TensorFlow version?

@dathudeptrai
Contributor

What error?

@John-K92

What error?

When I run the code, the line audio = TFMelGANGenerator(**config["generator_params"])(inputs) raises an InaccessibleTensorError:

InaccessibleTensorError: The tensor 'Tensor("conv2d_346/dilation_rate:0", shape=(2,), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=call, id=140430667155440); accessed from: FuncGraph(name=call, id=140430364789896).

Do you not hit this error on your side?
