prenet dropout #50

Closed
xinqipony opened this issue Sep 29, 2018 · 32 comments
Labels
experiment experimental things
Projects

Comments

@xinqipony

I was using another repo previously, and now I am switching to Mozilla TTS.

In my experience, the dropout in the decoder prenet is also applied at inference time; without dropout at inference, the quality is bad (Tacotron 2), which is hard to understand.

Do you have a similar experience, and why does this happen?
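For illustration, here is a minimal sketch of what I mean (not code from either repo; layer sizes and names are placeholders): a prenet whose dropout stays active even in eval mode, which is how many Tacotron 2 implementations behave.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlwaysDropoutPrenet(nn.Module):
    """Toy prenet that applies dropout at both training and inference time."""
    def __init__(self, in_dim: int = 80, sizes=(256, 256), p: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip((in_dim,) + tuple(sizes[:-1]), sizes)
        )
        self.p = p

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout on even when the module is in eval mode
            x = F.dropout(F.relu(linear(x)), p=self.p, training=True)
        return x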

@erogol
Contributor

erogol commented Oct 1, 2018

@xinqipony I never tried dropout at inference time. However, I think it shouldn't make a difference. Maybe the way dropout is handled at inference time is somehow wrong in that repo.

@erogol erogol closed this as completed Oct 1, 2018
@erogol erogol reopened this Nov 29, 2018
@erogol
Contributor

erogol commented Nov 29, 2018

This happened as I trained TTS with r=1: attention does not align if I don't run the prenet in train mode. However, I realized that with r=5 the number of zeros in the prenet output is closer between train and eval. With r=1 this gap is larger, and it probably causes the problem. I am now trying to find a solution. One option is to apply batch normalization to keep the scale similar in train and eval, which is likely to reduce the gap.
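As a rough illustration of how this train/eval gap can be measured (not code from the repo; prenet and mel_frame are placeholder names):

import torch

def prenet_gap(prenet: torch.nn.Module, mel_frame: torch.Tensor):
    """Compare zero fraction and max value of prenet outputs in train vs. eval mode."""
    stats = lambda x: ((x == 0).float().mean().item(), x.max().item())
    with torch.no_grad():
        prenet.train()                 # dropout active
        out_train = prenet(mel_frame)
        prenet.eval()                  # dropout disabled
        out_eval = prenet(mel_frame)
    print("train (zero fraction, max):", stats(out_train))
    print("eval  (zero fraction, max):", stats(out_eval))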

@xinqipony
Author

Hi erogol, thanks for your updates. I also tried some solutions to this:
I tried replacing ReLU in the prenet with tanh, and the gap is reduced, but it never reaches the quality of ReLU with dropout at inference time.

@erogol
Contributor

erogol commented Nov 30, 2018

Another option would be using the last 5 frames as input to the prenet, since we know r=5 works just fine.
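A rough sketch of that idea (my own illustration, not the repo's decoder code; n_frames and n_mels are assumptions): keep a rolling queue of the most recent frames and feed their concatenation to the prenet instead of only the last frame.

from collections import deque
import torch

class FrameQueue:
    """Rolling buffer of the last n_frames decoder outputs."""
    def __init__(self, n_frames: int = 5, n_mels: int = 80):
        self.buffer = deque([torch.zeros(1, n_mels) for _ in range(n_frames)],
                            maxlen=n_frames)

    def push(self, frame: torch.Tensor) -> torch.Tensor:
        self.buffer.append(frame)                    # drops the oldest frame
        return torch.cat(list(self.buffer), dim=-1)  # (1, n_frames * n_mels) prenet input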

@xinqipony
Author

But in my CMOS test the voice quality with r=1 is much better than with r>1, so quality may be impacted for r=5.

@erogol
Contributor

erogol commented Dec 2, 2018

I currently have different results: mostly, alignment does not work well for r=1 due to the above problem. And if I run the prenet in train mode at inference, the results are not very reliable, since each run gives a different output due to dropout.

@erogol erogol added this to In Progress in v0.0.1 Dec 14, 2018
@erogol erogol added the experiment experimental things label Dec 14, 2018
@erogol
Contributor

erogol commented Dec 17, 2018

I think the problem here is "dying ReLU".

So far, I have replaced the autoregressive connection with a queue of the last 5 frames instead of feeding only the last frame. That alone did not change anything, but after I also changed the PreNet activation function to RReLU and removed dropout, things got better. However, I don't know yet whether the boost comes from the combination of the queue and RReLU or from RReLU alone. I also increased the learning rate.
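For reference, a rough sketch of a dropout-free RReLU prenet (layer sizes are assumptions, not the exact TTS configuration):

import torch.nn as nn

class RReLUPrenet(nn.Module):
    """Prenet variant with randomized leaky ReLU activations and no dropout."""
    def __init__(self, in_dim: int = 400, sizes=(256, 128)):  # 400 = 5 frames * 80 mels
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.RReLU()]     # no Dropout layer here
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)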

@erogol erogol closed this as completed Dec 17, 2018
v0.0.1 automation moved this from In Progress to Done Dec 17, 2018
@erogol erogol reopened this Dec 17, 2018
v0.0.1 automation moved this from Done to In Progress Dec 17, 2018
@xinqipony
Author

@erogol thanks for the updates. Do you notice any voice quality degradation when using RReLU and removing dropout at inference?

I tried LeakyReLU; things are better, but the voice quality is still not as good as keeping dropout.

@erogol
Contributor

erogol commented Dec 18, 2018

So far it is comparable with the original model, but it needs more training. How many steps does your model take to give good results?

@xinqipony
Author

About 120k. I use r=1; the prenet activation is tanh or LeakyReLU, and both show improvement, but neither is comparable to the ReLU-with-dropout version. I use WaveNet for sample generation, so it is more sensitive to quality differences than Griffin-Lim.

@erogol
Contributor

erogol commented Dec 18, 2018

@xinqipony which repo do you use for this model? And do you have loss values to compare?

@xinqipony
Author

xinqipony commented Dec 18, 2018

We wrote it ourselves, originally from the Tacotron 1 paper, and then changed it to Tacotron 2 in TensorFlow. The input mel is normalized to the (-4, +4) range, and the final MSE loss is around 0.13 (decoder_loss + mel_loss).
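For context, here is a minimal sketch of that kind of normalization (the dB range and clipping behavior are assumptions, not our exact pipeline): a log-mel spectrogram in [min_level_db, 0] dB is scaled into the (-4, +4) range.

import numpy as np

def normalize_mel(mel_db: np.ndarray, min_level_db: float = -100.0,
                  max_norm: float = 4.0) -> np.ndarray:
    """Map mel values from [min_level_db, 0] dB to [-max_norm, max_norm]."""
    scaled = (mel_db - min_level_db) / -min_level_db             # -> [0, 1]
    return np.clip(2.0 * max_norm * scaled - max_norm, -max_norm, max_norm)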

@erogol
Contributor

erogol commented Dec 18, 2018

Thx for the details. I use [0, 1] normalization and L1 loss, but the architecture is very similar to Tacotron. I found Tacotron 2 harder and slower to train due to the model size. Would you agree?

@yweweler
Contributor

Sorry to butt into this so late, but I had to read the model implementation first.
I looked at the current model implementation, and in my opinion it is nearly congruent with the Tacotron 1 architecture. The primary differences are the attention mechanism, the StopNet, and the auto-regressive feeding of the seq2seq target.

I created a quick schematic for the current decoder to make sure we are all on the same page:

[decoder schematic image attached]

As this discussion seems a bit scattered, I would like to gather the information.
To my understanding, the following aspects have been discussed so far for the decoder:

  • Replacing ReLU with tanh in the PreNet:
    • Effect: ?
    • Question: @xinqipony, What is the effect? (improved loss or perceived audio quality)
    • Question: @xinqipony, With or without dropout in the PreNet?
  • Replacing ReLU with LeakyReLU in the PreNet:
    • Effect: ?
    • Question: @xinqipony, What is the effect? (improved loss or perceived audio quality)
    • Question: @xinqipony, With or without dropout in the PreNet?
  • Replacing ReLU with RandomizedLeakyReLU, feeding the last 5 frames from a queue (instead of the last frame) and training without dropout.
    • Effect: Quality improved.
    • Question: @erogol, What is the approximate improvement you have seen on the loss?
    • Question: @erogol, Does this mean producing a seq2seq target with r=1 but remembering and concatenating the last 5 seq2seq targets (hence r=5) to produce inputs for the PreNet?
    • Status: It is not clear if the improvement is caused by using RLReLU alone or if it is due to the combination of RLReLU and the queue.

Another aspect of the discussion that is not clear to me is:

However, I realized that with r=5 the number of zeros in the prenet output is closer between train and eval. With r=1 this gap is larger, and it probably causes the problem.

@erogol what do you refer to when talking about the number of zeros?

@erogol
Contributor

erogol commented Dec 18, 2018

@yweweler Your figure depicts the TTS decoder perfectly. However, Tacotron 2 has some differences, and @xinqipony states his findings on Tacotron 2.

Answer: The best validation mel-spec loss went from 0.04 to 0.028 with all these changes. (Note that I don't know the individual effect of each of these updates.)

  • RReLU on Prenet
  • Remove Prenet Dropout
  • Increase LR 1e-4 -> 1e-3
  • Queue of the last 5 predicted frames for autoregression. And yes, in training this means producing the PreNet input from the last 5 mel-spec target frames.

Answer: The meaning is that, somehow, there is a big difference in the number of zero PreNet outputs between model.train() and model.eval() modes. Also, the maximum PreNet output value is much larger in model.train() than in model.eval() (~4 vs. ~0.9). So there is a scale difference between train and inference mode in the PreNet, and therefore the model does not align in model.eval(). These differences are negligible if r=5.

@yweweler
Contributor

@xinqipony states his findings on Tacotron 2.

Sorry I somehow completely overlooked this.

The best validation mel-spec loss went from 0.04 to 0.028 with all these changes. (Note that I don't know the individual effect of each of these updates.)

Could you provide information on the number of training steps and the batch size you used, so I have a point of reference for my own measurements?

The meaning is that, somehow, there is a big difference in the number of zero PreNet outputs between model.train() and model.eval() modes.

That is fairly interesting. I implemented Tacotron 1 myself and experimented with different alignment mechanisms, but never saw similar behavior.
I will take a closer look at the distributions of the PreNet activations later on.

@erogol
Contributor

erogol commented Dec 18, 2018

@yweweler here is the config file. This weirdness only happens when r=1.

{
    "model_name": "TTS-dist",
    "model_description": "Distributed training.",

    "audio":{
        "audio_processor": "audio",     // to use dictate different audio processors, if available.
        // Audio processing parameters
        "num_mels": 80,         // size of the mel spec frame. 
        "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectogram frame.
        "sample_rate": 22050,   // wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50,  // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-lengh in ms.
        "preemphasis": 0.97,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "min_level_db": -100,   // normalization range
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
        "power": 1.5,           // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
        // Normalization parameters
        "signal_norm": true,    // normalize the spec values in range [0, 1]
        "symmetric_norm": false, // move normalization to range [-1, 1]
        "max_norm": 1,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "mel_fmin": null,         // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": null        // maximum freq level for mel-spec. Tune for dataset!!
    },

    "distributed":{
        "backend": "nccl",
        "url": "file:\/\/\/home/erogol/tmp.txt"
    },

    "embedding_size": 256,    
    "text_cleaner": "english_cleaners",
    "epochs": 1000,
    
    "lr": 0.001,
    "lr_decay": false,
    "warmup_steps": 4000,

    "batch_size": 32,
    "eval_batch_size":32,
    "r": 1,
    "wd": 0.000001,
    "checkpoint": true,
    "save_step": 5000,
    "print_step": 10,

    "run_eval": true,
    "run_test_synthesis": true,
    "data_path": "../../Data/LJSpeech-1.1/",  // can overwritten from command argument
    "meta_file_train": "metadata_train.csv",      // metafile for training dataloader
    "meta_file_val": "metadata_val.csv",    // metafile for validation dataloader
    "data_loader": "TTSDataset",      // dataloader, ["TTSDataset", "TTSDatasetCached", "TTSDatasetMemory"]
    "dataset": "ljspeech",     // one of TTS.dataset.preprocessors, only valid id dataloader == "TTSDataset", rest uses "tts_cache" by default.
    "min_seq_len": 0,
    "output_path": "../keep/",
    "num_loader_workers": 6,
    "num_val_loader_workers": 2
}

@xinqipony
Author

Thx for the details. I use [0, 1] normalization and L1 loss, but the architecture is very similar to Tacotron. I found Tacotron 2 harder and slower to train due to the model size. Would you agree?

yes.

@yweweler
Contributor

Update: not forgotten yet.
Due to the holidays I haven't found the time to work on this.
I will start looking into this in the next few days.

@erogol
Contributor

erogol commented Dec 28, 2018

@yweweler the same here, no worries

@erogol erogol mentioned this issue Jan 17, 2019
@xinqipony
Author

anyone got better updates on this topic?

@erogol
Contributor

erogol commented Feb 12, 2019

@xinqipony I couldn't find anything that works better here for all datasets. I'd like to replace dropout with batchnorm as the next step. I'll share here if it works.

@phypan11

phypan11 commented Feb 14, 2019

@erogol did you apply the batch norm? I couldn't apply batch norm for technical reasons, so I applied instance norm instead, but the resulting audio was horrible, even though the validation loss went significantly lower.

@erogol
Contributor

erogol commented Feb 14, 2019

@phypan11 not yet sorry.

Probably, attention did not align. Did you compare it?

@erogol
Contributor

erogol commented Feb 18, 2019

One solution I found is to train the network with r=5 and then fine-tune it with r=1, with the updates explained in #108.

@candlewill

Here is my solution, very tricky:

# Sample a fresh dropout rate in [0, min(1.5 * drop_rate, 0.5)) at every step,
# instead of using a fixed rate (TF1-style layers inside the prenet loop).
drop_rate = tf.random_uniform(shape=[1], minval=0.0,
                              maxval=min(self.drop_rate * 1.5, 0.5), dtype=tf.float32)[0]
x = tf.layers.dropout(dense, rate=drop_rate, training=self.is_training,
                      name='dropout_{}'.format(i + 1) + self.scope)

@xinqipony
Author

@candlewill do you observe any voice quality loss? I am trying RReLU, and it seems to me there is a quality gap.

@erogol
Contributor

erogol commented Mar 4, 2019

@candlewill cool, you are dropping out the dropout :). What was the intuition to come up with that?

@erogol
Contributor

erogol commented Mar 11, 2019

My final solution is to use Batch Normalization in the Prenet layers and remove Dropout completely. It works like magic, and it also increases the model performance significantly in my case.
I am happy to conclude this long-lasting issue 👍
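For anyone trying to reproduce this, here is a minimal sketch of a dropout-free prenet with BatchNorm (layer sizes are assumptions, not the exact TTS configuration):

import torch.nn as nn

class BNPrenet(nn.Module):
    """Prenet with BatchNorm after each linear layer and no dropout."""
    def __init__(self, in_dim: int = 80, sizes=(256, 256)):
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.BatchNorm1d(size), nn.ReLU()]
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: (batch, in_dim)
        return self.net(x)           # at inference, BatchNorm uses its running stats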

@erogol erogol closed this as completed Mar 11, 2019
v0.0.1 automation moved this from In Progress to Done Mar 11, 2019
@chapter544

chapter544 commented Mar 13, 2019

Hi @erogol,
Does removing dropout completely mean that you also removed the dropouts in the decoder RNN, or just the dropout in the prenet? Also, did you try this BN-replaces-dropout approach with other datasets as well? In my case, prenet dropout works for a few experiments and fails (does not consistently provide good alignments) for some others; I also see this across different datasets. In my last BN-replaces-dropout experiment, it did not provide stable alignments for our private 6-hour dataset. Do you have any insights? Thanks.

@WhiteFu

WhiteFu commented May 24, 2019

Hi @erogol,
What do you mean by removing dropout completely?

@erogol
Contributor

erogol commented May 24, 2019
