Generator exploded after ~138K iters. #61
Comments
Hi @erogol.
A short literature review gave me these options, if they make any sense.
Have you tried anything similar to these, so I can maybe try the rest?
@erogol @kan-bayashi Hi, using transposed convolution rather than upsampling in the generator will help (use a kernel_size and stride that are large enough). I combined MelGAN and Parallel WaveGAN as in discussion #46, and the results seem very good to me.
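To make the suggestion concrete, here is a minimal sketch (not the repo's code) of how a 1-D transposed convolution upsamples a signal. The function and the Hann kernel are illustrative; the key point is that with stride `s`, kernel size `2*s`, and padding `s//2` (the MelGAN-style parameterization for even `s`), the output length is exactly `len(x) * s`.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride, padding):
    """Naive single-channel 1-D transposed convolution.

    Each input sample scatters a scaled copy of the kernel into the
    output at intervals of `stride`; `padding` then trims both ends,
    mirroring torch.nn.ConvTranspose1d for one in/out channel.
    """
    L, k = len(x), len(kernel)
    out = np.zeros((L - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * kernel
    return out[padding : len(out) - padding] if padding else out

# Upsampling factor s: kernel_size = 2*s with padding = s//2 gives
# exactly len(x) * s output samples (here 10 -> 40).
s = 4
x = np.random.randn(10)
kernel = np.hanning(2 * s)   # smooth interpolation kernel
y = conv_transpose_1d(x, kernel, stride=s, padding=s // 2)
print(len(x), "->", len(y))  # 10 -> 40
```

A kernel wider than the stride (here 2x) is what lets the learned filter interpolate between steps instead of producing blocky, checkerboard-style artifacts.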
@dathudeptrai Are you merging the models only architecturally, or the training methods as well?
@erogol Both; I use the training method from this repo (V2, as you are using).
That is interesting.
Keep in mind that in all GAN architectures the authors have tuned the parameters and architecture carefully (the generator shouldn't be stronger than the discriminator, and vice versa). In your config you don't use the original discriminator (you use the residual discriminator instead), which makes the discriminator stronger, so you need to modify the generator too (increase the number of layers, kernel size, ...) :D.
I added the MelGAN generator in #62.
@kan-bayashi Wow, very fast :)). Training will be 5x faster than Parallel WaveGAN :D, let's see.
I added initial MelGAN results in #65 (v1-config based).
@kan-bayashi Let's see. I'm training MelGAN (almost the same as you are doing; I use the multi-scale discriminator loss MelGAN introduced, plus some modifications) with 8-bit fake-quantization training. I also converted it successfully to TFLite; inference on a single mobile CPU thread is around 2x faster than real time.
@dathudeptrai
@kan-bayashi No, I use https://github.com/Xilinx/brevitas (you need to expand the input to 4D for training); this framework only supports Conv2D :D, but it's enough :D. I also have a larger version of the MelGAN generator, which I convert to TensorFlow and optimize with TensorRT on the server side :D. It seems all we need to do now is improve the quality :)); speed isn't a problem anymore :D.
I can tell that even though there is a jump after 100K, voice fidelity still improves as training goes on. It is interesting to see. @dathudeptrai Is there an official way to convert models to TF, or is it just loading the weights into the same architecture?
@erogol ONNX is one way you can try, but in my experience, loading the weights into the same architecture is best :)). I can implement both the Torch and TF versions very quickly :D, and the conversion is just one for loop :D. In addition, with TF you can easily convert to TFLite and TensorRT to optimize for inference. I don't really like intermediate representations like ONNX because of the limited operator support :D.
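A hypothetical sketch of what that "one for loop" looks like (the layer names and dict layout are invented for illustration): PyTorch `Conv1d` stores weights as `(out_channels, in_channels, kernel_size)` while Keras `Conv1D` expects `(kernel_size, in_channels, out_channels)`, so each conv kernel just needs a transpose on the way over.

```python
import numpy as np

def torch_conv1d_to_tf(weight):
    """Reorder a PyTorch Conv1d weight array for a Keras Conv1D layer."""
    # (out_ch, in_ch, k) -> (k, in_ch, out_ch)
    return np.transpose(weight, (2, 1, 0))

def convert_state_dict(torch_weights):
    """torch_weights: dict of name -> np.ndarray.

    Conv weights are assumed to be the only 3-D arrays; biases and
    other 1-D parameters carry over unchanged.
    """
    return {name: torch_conv1d_to_tf(w) if w.ndim == 3 else w
            for name, w in torch_weights.items()}

# Illustrative shapes only: an 80-mel input conv with 384 filters.
pt = {"conv1.weight": np.zeros((384, 80, 7)), "conv1.bias": np.zeros(384)}
tf_w = convert_state_dict(pt)
print(tf_w["conv1.weight"].shape)  # (7, 80, 384)
```

The same idea extends layer by layer; the only care needed is matching each framework's expected axis order (and, for weight-normalized layers, fusing `weight_g`/`weight_v` first).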
I successfully trained MelGAN (the generator) in my fork. MelGAN requires the v1 (smaller) discriminator; I guess a larger D is too strong for MelGAN to learn against. You can check the model with TTS here: There is still a slight background noise, but that is because of the TTS model. Somehow, GAN models are not as robust as WaveRNN to the spectrogram representation. Maybe it is a good idea to inject some random noise during training.
@erogol Please try ratios = [4, 4, 4, 2, 2] and increase n_residual; maybe it can help you eliminate the background noise :D
Currently, I am trying to train the MelGAN generator with the following config.
The training curve is not stable, but it sounds more natural than v1.
That's nice.

```yaml
generator_type: "MelGANGenerator"     # Generator type.
generator_params:
    in_channels: 80                   # Number of input channels.
    out_channels: 1                   # Number of output channels.
    kernel_size: 7                    # Kernel size of initial and final conv layers.
    channels: 1024                    # Initial number of channels for conv layers.
    upsample_scales: [4, 4, 4, 2, 2]  # List of upsampling scales.
    stack_kernel_size: 3              # Kernel size of dilated conv layers in residual stack.
    stacks: 4                         # Number of stacks in a single residual stack module.
    use_weight_norm: true             # Whether to use weight normalization.
    use_causal_conv: false            # Whether to use causal convolution.
```
I think the MelGAN D is stronger than the v2 D :))) based on my knowledge :))), and its discriminator seems very similar to the GAN-TTS discriminator (Google ICLR 2020 paper): https://arxiv.org/abs/1909.11646
It'd not work, since my hop_length is 275. Now I am training a new pair of models with hop_length 256; there I can try the upsampling params.
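The constraint behind that remark can be stated in a couple of lines: the product of the generator's upsampling scales must equal hop_length, since each mel frame has to expand into exactly hop_length waveform samples. A quick sketch (function name is illustrative):

```python
import math

def check_upsample_scales(upsample_scales, hop_length):
    """Return (fits, total): whether the total upsampling factor
    produced by the generator matches the mel hop size."""
    total = math.prod(upsample_scales)
    return total == hop_length, total

# [4, 4, 4, 2, 2] multiplies out to 256, so it fits hop_length=256 ...
print(check_upsample_scales([4, 4, 4, 2, 2], 256))  # (True, 256)

# ... but not hop_length=275 (= 5 * 5 * 11), which has no factorization
# into the small even scales these generators typically use.
print(check_upsample_scales([4, 4, 4, 2, 2], 275))  # (False, 256)
```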
I added several new results.
According to a few samples, it seems that ... But the training curve of ...
@kan-bayashi Same observation as mine :))). The STFT, spectral, and feature-matching losses keep increasing over training :D, and it needs much longer training; I trained around 3M steps. I'm thinking about the discriminator-overpowering problem: maybe it's a problem for the loss curves, but in terms of naturalness it isn't :v. It seems we need another metric to measure sound quality (MOS score). I'm reading about the Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance, which the authors find to be well correlated with MOS (see the GAN-TTS paper: https://arxiv.org/pdf/1909.11646.pdf) :D
You mean noise observed in
@kan-bayashi Okay :D. Just curious, what is your next plan :))? Which version will you train next, v2? :D
Currently, I'm training MelGAN G + MelGAN D based on your idea (#61 (comment)), and I'm curious about the results with PWG G + MelGAN D. I want to try it.
Okay :D. How about PWG G + MelGAN D + STFT loss? :D
Ongoing. The quality @ 1.1M iters is so-so.
First verse from "Milky Chance - The Game": vocoder 400K EN, LJSpeech, ParallelWaveGAN, + spleeter, + manual lyrics alignment in Audacity
@nartes Thank you for sharing an interesting sample!
Yeah, also feeding lyrics line by line into Tacotron synthesizes distinctive sentences.
Interesting! Great work:) |
I tried. The quality is OK, but it contains a strange noise.
You can try MelGAN or reduce batch_max_steps.
You mean training? Please check the header comments in the config.
If we use only the STFT loss, the sound will be like a robot.
Q: Is there some place where the melspectrogram is being transformed? I didn't
The melspectrogram is normalized to mean=0, variance=1 using the statistics of the training data (see ParallelWaveGAN/egs/ljspeech/voc1/run.sh, lines 97 to 102 in 53ae639).
This is because I use normalization to mean=0, variance=1.
I'm not sure what you are doing, but my pretrained models assume that the inputs are normalized using the statistics of the training data.
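A minimal sketch of that normalization, assuming per-mel-bin statistics computed once over the training set and reused verbatim at inference time (function names here are illustrative, not the repo's API):

```python
import numpy as np

def compute_stats(train_mels):
    """Per-bin mean/std over a list of (frames, n_mels) training arrays."""
    stacked = np.concatenate(train_mels, axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0)

def normalize(mel, mean, std):
    """Apply training-set statistics to any mel, train or inference."""
    return (mel - mean) / std

# Toy "training set": four utterances of 100 frames x 80 mel bins.
train = [np.random.randn(100, 80) * 3.0 + 1.5 for _ in range(4)]
mean, std = compute_stats(train)

# By construction, normalizing the training data itself gives
# mean ~= 0 and std ~= 1 in every bin.
norm = normalize(np.concatenate(train, axis=0), mean, std)
print(np.allclose(norm.mean(axis=0), 0.0, atol=1e-6),
      np.allclose(norm.std(axis=0), 1.0, atol=1e-6))  # True True
```

The point of the thread's exchange is the inference-side contract: a mel fed to the pretrained models must be passed through `normalize` with the *training* statistics, not statistics of the new utterance.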
Indeed, forgot about it, thanks! |
It's been a great honor to follow your and other researchers' work on this repository. :)
@John-K92 Please check the demo HP and the results in the README. MelGAN is very light while keeping comparable performance.
@John-K92 The performance I'm getting on mobile devices is on par with the original MelGAN (I still use the float model, so the performance is the same).
Thank you for sharing your experience. Then, may I ask your opinion on TTS (text-to-mel) models, such as FastSpeech, Tacotron, etc., and how they could be applied on mobile platforms along with MelGAN?
@John-K92 I just want to say that mel-to-voice is light enough, but it is still slower than my text2mel models :D. My text2mel model is fully convolutional; deploying an RNN on a mobile device is very hard, so Tacotron should be ruled out :D
I am currently trying to convert the model to TensorFlow and then to a mobile-oriented version (TensorRT or TensorFlow Lite).
What error?
When I run the code, the line `audio = TFMelGANGenerator(**config["generator_params"])(inputs)` raises an `InaccessibleTensorError`:

InaccessibleTensorError: The tensor 'Tensor("conv2d_346/dilation_rate:0", shape=(2,), dtype=int32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it. Defined in: FuncGraph(name=call, id=140430667155440); accessed from: FuncGraph(name=call, id=140430364789896).

Did you not meet any error on your side?
I observed an interesting behaviour after 138K iters, where the discriminator dominated training and the generator exploded in both train and validation losses. Do you have any idea why, and how to prevent it?
I am training on LJSpeech, and I basically use the same learning schedule you released with the v2 config for LJSpeech (train the generator until 100K iters, then enable the discriminator).
Here is the tensorboard screenshot.
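For clarity, the two-phase schedule described above can be sketched as plain logic: the generator trains alone (on the reconstruction/STFT objective) until a threshold step, after which the discriminator and the adversarial loss are switched on. The function and the key names are illustrative; the actual config knob in the repo may be named differently.

```python
def training_phase(step, discriminator_train_start_steps=100_000):
    """Which updates/losses are active at a given training step."""
    adversarial = step >= discriminator_train_start_steps
    return {
        "update_generator": True,             # G trains from step 0
        "update_discriminator": adversarial,  # D joins after the warm-up
        "use_adversarial_loss": adversarial,  # G adds adv loss once D is on
    }

print(training_phase(50_000)["update_discriminator"])   # False
print(training_phase(138_000)["update_discriminator"])  # True
```

The explosion reported in this issue happens shortly after the second phase begins, which is why the replies above focus on rebalancing G against D rather than on the warm-up phase itself.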