prenet dropout #50

Closed
xinqipony opened this issue Sep 29, 2018 · 32 comments
Labels
experiment experimental things
Projects

Comments

@xinqipony

I was using another repo previously, and now I am switching to Mozilla TTS.

In my experience, the dropout in the decoder prenet is also applied at inference time; without dropout at inference, the quality is bad (Tacotron 2), which is hard to understand.

Do you have a similar experience, and why does this happen?
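For illustration, here is a minimal sketch of what I mean (not code from either repo; layer sizes and names are placeholders): a prenet whose dropout stays active even in eval mode, which is how many Tacotron 2 implementations behave.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlwaysDropoutPrenet(nn.Module):
    """Toy prenet that applies dropout at both training and inference time."""
    def __init__(self, in_dim: int = 80, sizes=(256, 256), p: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip((in_dim,) + tuple(sizes[:-1]), sizes)
        )
        self.p = p

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout on even when the module is in eval mode
            x = F.dropout(F.relu(linear(x)), p=self.p, training=True)
        return x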

@erogol
Contributor

erogol commented Oct 1, 2018

@xinqipony I never tried dropout at inference time. However, I think it shouldn't make a difference. Maybe the way dropout is handled at inference time is somehow wrong in that repo.

@erogol erogol closed this as completed Oct 1, 2018
@erogol erogol reopened this Nov 29, 2018
@erogol
Contributor

erogol commented Nov 29, 2018

This happened as I trained TTS with r=1: attention does not align if I don't run the prenet in train mode. However, I realized that with r=5 the number of zeros in the prenet output is closer between train and eval. With r=1 this gap is larger, and it probably causes the problem. I am now trying to find a solution. One option is to apply batch normalization to keep the scale similar in train and eval, which is likely to reduce the gap.
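As a rough illustration of how this train/eval gap can be measured (not code from the repo; prenet and mel_frame are placeholder names):

import torch

def prenet_gap(prenet: torch.nn.Module, mel_frame: torch.Tensor):
    """Compare zero fraction and max value of prenet outputs in train vs. eval mode."""
    stats = lambda x: ((x == 0).float().mean().item(), x.max().item())
    with torch.no_grad():
        prenet.train()                 # dropout active
        out_train = prenet(mel_frame)
        prenet.eval()                  # dropout disabled
        out_eval = prenet(mel_frame)
    print("train (zero fraction, max):", stats(out_train))
    print("eval  (zero fraction, max):", stats(out_eval))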

@xinqipony
Author

Hi erogol, thanks for your updates. I also tried some solutions to this:
I tried replacing ReLU in the prenet with tanh, and the gap is reduced, but it never reaches the quality of ReLU with dropout at inference time.

@erogol
Contributor

erogol commented Nov 30, 2018

Another option would be using the last 5 frames as input to the prenet, since we know r=5 works just fine.
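A rough sketch of that idea (my own illustration, not the repo's decoder code; n_frames and n_mels are assumptions): keep a rolling queue of the most recent frames and feed their concatenation to the prenet instead of only the last frame.

from collections import deque
import torch

class FrameQueue:
    """Rolling buffer of the last n_frames decoder outputs."""
    def __init__(self, n_frames: int = 5, n_mels: int = 80):
        self.buffer = deque([torch.zeros(1, n_mels) for _ in range(n_frames)],
                            maxlen=n_frames)

    def push(self, frame: torch.Tensor) -> torch.Tensor:
        self.buffer.append(frame)                    # drops the oldest frame
        return torch.cat(list(self.buffer), dim=-1)  # (1, n_frames * n_mels) prenet input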

@xinqipony
Author

But in my CMOS test the voice quality with r=1 is much better than with r>1, so quality may be impacted for r=5.

@erogol
Contributor

erogol commented Dec 2, 2018

I currently have different results: mostly, alignment does not work well for r=1 due to the above problem. And if I run the prenet in train mode at inference, the results are not very reliable, since each run gives a different output due to dropout.

@erogol erogol added this to In Progress in v0.0.1 Dec 14, 2018
@erogol erogol added the experiment experimental things label Dec 14, 2018
@erogol
Contributor

erogol commented Dec 17, 2018

I think the problem here is "dying ReLU".

So far, I have replaced the autoregressive connection with a queue of the last 5 frames instead of feeding only the last frame. That alone did not change anything, but after I also changed the PreNet activation function to RReLU and removed dropout, things got better. However, I don't know yet whether the boost comes from the combination of the queue and RReLU or from RReLU alone. I also increased the learning rate.
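For reference, a rough sketch of a dropout-free RReLU prenet (layer sizes are assumptions, not the exact TTS configuration):

import torch.nn as nn

class RReLUPrenet(nn.Module):
    """Prenet variant with randomized leaky ReLU activations and no dropout."""
    def __init__(self, in_dim: int = 400, sizes=(256, 128)):  # 400 = 5 frames * 80 mels
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.RReLU()]     # no Dropout layer here
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)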

@erogol erogol closed this as completed Dec 17, 2018
v0.0.1 automation moved this from In Progress to Done Dec 17, 2018
@erogol erogol reopened this Dec 17, 2018
v0.0.1 automation moved this from Done to In Progress Dec 17, 2018
@xinqipony
Author

@erogol thanks for the updates. Do you notice any voice quality degradation when using RReLU and removing dropout at inference?

I tried LeakyReLU; things are better, but the voice quality is still not as good as keeping dropout.

@erogol
Contributor

erogol commented Dec 18, 2018

So far it is comparable with the original model, but it needs more training. How many steps does your model take to give good results?

@xinqipony
Author

About 120k. I use r=1; the prenet activation is tanh or LeakyReLU, and both show improvement, but neither is comparable to the ReLU-with-dropout version. I use WaveNet for sample generation, so it is more sensitive to quality differences than Griffin-Lim.

@erogol
Contributor

erogol commented Dec 18, 2018

@xinqipony which repo do you use for this model? And do you have loss values to compare?

@xinqipony
Author

xinqipony commented Dec 18, 2018

We wrote it ourselves, originally from the Tacotron 1 paper, and then changed it to Tacotron 2 in TensorFlow. The input mel is normalized to the (-4, +4) range, and the final MSE loss is around 0.13 (decoder_loss + mel_loss).
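For context, here is a minimal sketch of that kind of normalization (the dB range and clipping behavior are assumptions, not our exact pipeline): a log-mel spectrogram in [min_level_db, 0] dB is scaled into the (-4, +4) range.

import numpy as np

def normalize_mel(mel_db: np.ndarray, min_level_db: float = -100.0,
                  max_norm: float = 4.0) -> np.ndarray:
    """Map mel values from [min_level_db, 0] dB to [-max_norm, max_norm]."""
    scaled = (mel_db - min_level_db) / -min_level_db             # -> [0, 1]
    return np.clip(2.0 * max_norm * scaled - max_norm, -max_norm, max_norm)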

@erogol
Contributor

erogol commented Dec 18, 2018

Thx for the details. I use [0, 1] normalization and L1 loss, but the architecture is very similar to Tacotron. I found Tacotron 2 harder and slower to train due to the model size. Would you agree?

@yweweler
Contributor

Sorry to butt into this so late, but I had to read the model implementation first.
I looked at the current model implementation, and in my opinion it is nearly congruent with the Tacotron 1 architecture. The primary differences are the attention mechanism, the StopNet, and the auto-regressive feeding of the seq2seq target.

I created a quick schematic for the current decoder to make sure we are all on the same page:

[decoder schematic image attached]

As this discussion seems a bit scattered, I would like to gather the information.
To my understanding, the following aspects have been discussed so far for the decoder:

  • Replacing ReLU with tanh in the PreNet:
    • Effect: ?
    • Question: @xinqipony, What is the effect? (improved loss or perceived audio quality)
    • Question: @xinqipony, With or without dropout in the PreNet?
  • Replacing ReLU with LeakyReLU in the PreNet:
    • Effect: ?
    • Question: @xinqipony, What is the effect? (improved loss or perceived audio quality)
    • Question: @xinqipony, With or without dropout in the PreNet?
  • Replacing ReLU with RandomizedLeakyReLU, feeding the last 5 frames from a queue (instead of the last frame) and training without dropout.
    • Effect: Quality improved.
    • Question: @erogol, What is the approximate improvement you have seen on the loss?
    • Question: @erogol, Does this mean producing a seq2seq target with r=1 but remembering and concatenating the last 5 seq2seq targets (hence r=5) to produce inputs for the PreNet?
    • Status: It is not clear if the improvement is caused by using RLReLU alone or if it is due to the combination of RLReLU and the queue.

Another aspect of the discussion that is not clear to me is:

However, I realized that with r=5 the number of zeros in the prenet output is closer between train and eval. With r=1 this gap is larger, and it probably causes the problem.

@erogol what do you refer to when talking about the number of zeros?

@erogol
Contributor

erogol commented Dec 18, 2018

@yweweler Your figure depicts the TTS decoder perfectly. However, Tacotron 2 has some differences, and @xinqipony states his findings on Tacotron 2.

Answer: The best validation mel-spec loss went from 0.04 to 0.028 with all these changes. (Note that I don't know the individual effect of each of these updates.)

  • RReLU on Prenet
  • Remove Prenet Dropout
  • Increase LR 1e-4 -> 1e-3
  • Queue of the last 5 predicted frames for autoregression. And yes, in training this means producing the PreNet input from the last 5 mel-spec target frames.

Answer: The meaning is that, somehow, there is a big difference in the number of zero PreNet outputs between model.train() and model.eval() modes. Also, the maximum PreNet output value is much larger in model.train() than in model.eval() (~4 vs. ~0.9). So there is a scale difference between train and inference mode in the PreNet, and therefore the model does not align in model.eval(). These differences are negligible if r=5.

@yweweler
Contributor

@xinqipony states his findings on Tacotron 2.

Sorry I somehow completely overlooked this.

The best validation mel-spec loss went from 0.04 to 0.028 with all these changes. (Note that I don't know the individual effect of each of these updates.)

Could you provide information on the number of training steps and the batch size you used, so I have a point of reference for my own measurements?

The meaning is that, somehow, there is a big difference in the number of zero PreNet outputs between model.train() and model.eval() modes.

That is fairly interesting. I implemented Tacotron 1 myself and experimented with different alignment mechanisms, but never saw similar behavior.
I will take a closer look at the distributions of the PreNet activations later on.

@erogol
Contributor

erogol commented Dec 18, 2018

@yweweler here is the config file. This weirdness only happens when r=1.

{
    "model_name": "TTS-dist",
    "model_description": "Distributed training.",

    "audio":{
        "audio_processor": "audio",     // to use dictate different audio processors, if available.
        // Audio processing parameters
        "num_mels": 80,         // size of the mel spec frame. 
        "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectogram frame.
        "sample_rate": 22050,   // wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50,  // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-lengh in ms.
        "preemphasis": 0.97,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "min_level_db": -100,   // normalization range
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
        "power": 1.5,           // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
        // Normalization parameters
        "signal_norm": true,    // normalize the spec values in range [0, 1]
        "symmetric_norm": false, // move normalization to range [-1, 1]
        "max_norm": 1,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "mel_fmin": null,         // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": null        // maximum freq level for mel-spec. Tune for dataset!!
    },

    "distributed":{
        "backend": "nccl",
        "url": "file:\/\/\/home/erogol/tmp.txt"
    },

    "embedding_size": 256,    
    "text_cleaner": "english_cleaners",
    "epochs": 1000,
    
    "lr": 0.001,
    "lr_decay": false,
    "warmup_steps": 4000,

    "batch_size": 32,
    "eval_batch_size":32,
    "r": 1,
    "wd": 0.000001,
    "checkpoint": true,
    "save_step": 5000,
    "print_step": 10,

    "run_eval": true,
    "run_test_synthesis": true,
    "data_path": "../../Data/LJSpeech-1.1/",  // can overwritten from command argument
    "meta_file_train": "metadata_train.csv",      // metafile for training dataloader
    "meta_file_val": "metadata_val.csv",    // metafile for validation dataloader
    "data_loader": "TTSDataset",      // dataloader, ["TTSDataset", "TTSDatasetCached", "TTSDatasetMemory"]
    "dataset": "ljspeech",     // one of TTS.dataset.preprocessors, only valid id dataloader == "TTSDataset", rest uses "tts_cache" by default.
    "min_seq_len": 0,
    "output_path": "../keep/",
    "num_loader_workers": 6,
    "num_val_loader_workers": 2
}

@xinqipony
Author

Thx for the details. I use [0, 1] normalization and L1 loss, but the architecture is very similar to Tacotron. I found Tacotron 2 harder and slower to train due to the model size. Would you agree?

yes.

@yweweler
Contributor

Update: not forgotten yet.
Due to the holidays I haven't found the time to work on this.
I will start looking into this in the next few days.

@erogol
Contributor

erogol commented Dec 28, 2018

@yweweler the same here, no worries

@erogol erogol mentioned this issue Jan 17, 2019
@xinqipony
Author

anyone got better updates on this topic?

@erogol
Contributor

erogol commented Feb 12, 2019

@xinqipony I couldn't find anything that works better here for all datasets. I'd like to replace dropout with batchnorm as the next step. I'll share here if it works.

@phypan11

phypan11 commented Feb 14, 2019

@erogol did you apply the batch norm? I couldn't apply batch norm for technical reasons, so I applied instance norm instead, but the resulting audio was horrible, even though the validation loss went significantly lower.

@erogol
Contributor

erogol commented Feb 14, 2019

@phypan11 not yet sorry.

Probably, attention did not align. Did you compare it?

@erogol
Contributor

erogol commented Feb 18, 2019

One solution I found is to train the network with r=5 and then fine-tune it with r=1, with the updates explained in #108.

@candlewill

Here is my solution, very tricky:

# Sample a fresh dropout rate in [0, min(1.5 * drop_rate, 0.5)) at every step,
# instead of using a fixed rate (TF1-style layers inside the prenet loop).
drop_rate = tf.random_uniform(shape=[1], minval=0.0,
                              maxval=min(self.drop_rate * 1.5, 0.5), dtype=tf.float32)[0]
x = tf.layers.dropout(dense, rate=drop_rate, training=self.is_training,
                      name='dropout_{}'.format(i + 1) + self.scope)

@xinqipony
Author

@candlewill do you observe any voice quality loss? I am trying RReLU, and it seems to me there is a quality gap.

@erogol
Contributor

erogol commented Mar 4, 2019

@candlewill cool, you are dropping out the dropout :). What was the intuition to come up with that?

@erogol
Contributor

erogol commented Mar 11, 2019

My final solution is to use Batch Normalization in the Prenet layers and remove Dropout completely. It works like magic, and it also increases the model performance significantly in my case.
I am happy to conclude this long-lasting issue 👍
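For anyone trying to reproduce this, here is a minimal sketch of a dropout-free prenet with BatchNorm (layer sizes are assumptions, not the exact TTS configuration):

import torch.nn as nn

class BNPrenet(nn.Module):
    """Prenet with BatchNorm after each linear layer and no dropout."""
    def __init__(self, in_dim: int = 80, sizes=(256, 256)):
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.BatchNorm1d(size), nn.ReLU()]
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: (batch, in_dim)
        return self.net(x)           # at inference, BatchNorm uses its running stats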

@erogol erogol closed this as completed Mar 11, 2019
v0.0.1 automation moved this from In Progress to Done Mar 11, 2019
@chapter544

chapter544 commented Mar 13, 2019

Hi @erogol,
Does removing dropout completely mean that you also removed the dropouts in the decoder RNN, or just the dropout in the prenet? Also, did you try this BN-replaces-dropout approach with other datasets as well? In my case, prenet dropout works for a few experiments and fails (does not consistently provide good alignments) for some others; I also see this across different datasets. In my last BN-replaces-dropout experiment, it did not provide stable alignments for our private 6-hour dataset. Do you have any insights? Thanks.

@WhiteFu

WhiteFu commented May 24, 2019

Hi @erogol,
What do you mean by removing dropout completely?

@erogol
Contributor

erogol commented May 24, 2019
