Universal ParallelWaveGAN #501

Closed

george-roussos opened this issue Aug 18, 2020 · 102 comments
@george-roussos
Contributor

george-roussos commented Aug 18, 2020

Hi, as Eren requested, this is an issue to follow the progress of training a larger PWGAN model for multiple speakers.

@erogol
Contributor

erogol commented Aug 18, 2020

Here is the config I used, for anyone who wants to replicate the progress: https://discourse.mozilla.org/t/training-a-universal-vocoder/65388/14?u=erogol

{
"github_branch":"* generic_vocoder",
"restore_path":"/data2/rw/home/Trainings/LJSpeech/pwgan-generic-August-06-2020_01+07PM-29aec7c/checkpoint_275000.pth.tar",
"github_branch":"* generic_vocoder",
    "run_name": "pwgan-generic",
    "run_description": "parallel-wavegan generic vocoder",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectogram frame.
        "win_length": 1024,      // stft window length in ms.
        "hop_length": 256,       // stft window hop-lengh in ms.
        "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 24000,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "ref_level_db": 0,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for timming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0,     // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0,         // scaler value appplied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "stats_path": "/data/rw/home/Data/LibriTTS/scale_stats.npy"    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    // "distributed":{
    //     "backend": "nccl",
    //     "url": "tcp:\/\/localhost:54321"
    // },

    // MODEL PARAMETERS
    "use_pqmf": true,

    // LOSS PARAMETERS
    "use_stft_loss": true,
    "use_subband_stft_loss": false,  // USE ONLY WITH MULTIBAND MODELS
    "use_mse_gan_loss": true,
    "use_hinge_gan_loss": false,
    "use_feat_match_loss": false,  // use only with melgan discriminators

    // loss weights
    "stft_loss_weight": 0.5,
    "subband_stft_loss_weight": 0.5,
    "mse_G_loss_weight": 2.5,
    "hinge_G_loss_weight": 2.5,
    "feat_match_loss_weight": 25,

    // multiscale stft loss parameters
    "stft_loss_params": {
        "n_ffts": [1024, 2048, 512],
        "hop_lengths": [120, 240, 50],
        "win_lengths": [600, 1200, 240]
    },

    // subband multiscale stft loss parameters
    "subband_stft_loss_params":{
        "n_ffts": [384, 683, 171],
        "hop_lengths": [30, 60, 10],
        "win_lengths": [150, 300, 60]
    },

    "target_loss": "avg_G_loss",  // loss value to pick the best model to save after each epoch

    // DISCRIMINATOR
    "discriminator_model": "parallel_wavegan_discriminator",
    "discriminator_model_params":{
        "num_layers": 10
    },
    "steps_to_start_discriminator": 200000,      // steps required to start GAN trainining.1

    // GENERATOR
    "generator_model": "parallel_wavegan_generator",
    "generator_model_params": {
        "upsample_factors":[4, 4, 4, 4],
        "stacks": 3,
        "num_res_blocks": 30
    },

    // DATASET
    "data_path": "/data/rw/home/Data/LibriTTS/LibriTTS/train-clean-360/",
    "feature_path": null,
    "seq_len": 25600,
    "pad_short": 2000,
    "conv_pad": 0,
    "use_noise_augment": false,
    "use_cache": true,

    "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 12,       // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
    "sampling_rates": [16000, 22050, 24000],  // pick different sampling rate for each iteration to train generic PWGAN model.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
    "test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "epochs": 10000,                // total number of epochs to train.
    "wd": 0.0,                // Weight decay weight.
    "gen_clip_grad": -1,      // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "disc_clip_grad": -1,     // Discriminator gradient clipping threshold.
    "lr_scheduler_gen": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_gen_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_scheduler_disc": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_disc_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_gen": 1e-4,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_disc": 1e-4,

    // TENSORBOARD and LOGGING
    "print_step": 25,       // Number of steps to log traning on console.
    "print_eval": false,     // If True, it prints loss values for each step in eval run.
    "save_step": 25000,      // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 8,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "eval_split_size": 10,

    // PATHS
    "output_path": "/data2/rw/home/Trainings/LJSpeech/"
}
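The "sampling_rates" option above means each iteration picks one rate for the batch. Roughly, the idea is the following (just a sketch; the actual code in the generic_vocoder branch may differ):

    import random
    import librosa

    def pick_rate_and_resample(wav, native_sr, sampling_rates):
        # choose a target rate for this iteration and resample the clip to it
        target_sr = random.choice(sampling_rates)
        if target_sr != native_sr:
            wav = librosa.resample(wav, orig_sr=native_sr, target_sr=target_sr)
        return wav, target_sr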

@george-roussos
Contributor Author

Thanks Eren! Is this the config for the ordinary model or the larger one?

I will give it a go later this week or next week. :)

@erogol
Contributor

erogol commented Aug 18, 2020

No problem. This is the normal model. You need to change layer values in the code or add them to the config file to run a larger model.
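For example, the generator size can be set from "generator_model_params" alone; a hedged sketch of a larger setup (the channel values here are only an illustration; the same values appear in a config later in this thread, and the upstream ParallelWaveGAN defaults are 64/128/64):

    "generator_model_params": {
        "upsample_factors":[4, 4, 4, 4],
        "stacks": 3,
        "num_res_blocks": 30,
        "res_channels": 96,      // default 64
        "gate_channels": 192,    // default 128
        "skip_channels": 96      // default 64
    },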

@lexkoro
Contributor

lexkoro commented Aug 18, 2020

How come you are using a scale_stats file? Is it okay to compute one for the vocoder? Also, I've never managed to train a multi-speaker vocoder with spec_gain: 1, but 20 worked. Is it correlated with the scale_stats file?
PS: I will post my multi-band MelGAN results later this week.

@erogol
Contributor

erogol commented Aug 19, 2020

The scale stats file holds the mean spectrogram frame, so anything about the spectrogram computation affects it.

You can compute it yourself for your dataset with `bin/compute_statistics.py`.
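Conceptually, the script just accumulates per-bin statistics over the dataset's spectrograms, roughly like this (a minimal sketch assuming the AudioProcessor API in this repo; the real script also covers linear specs and disables normalization while collecting stats):

    import glob
    import numpy as np
    from TTS.utils.audio import AudioProcessor

    # "config" is the training config dict; normalization must be off
    # while collecting stats (the real script takes care of this).
    ap = AudioProcessor(**config["audio"])
    frames = []
    for wav_path in glob.glob("/path/to/dataset/**/*.wav", recursive=True):
        wav = ap.load_wav(wav_path)
        frames.append(ap.melspectrogram(wav).T)  # [T, num_mels]
    frames = np.concatenate(frames, axis=0)
    np.save("scale_stats.npy", {"mel_mean": frames.mean(0), "mel_std": frames.std(0)})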

I'll share the model I've trained, which will include the file for LibriTTS.

@erogol
Contributor

erogol commented Aug 20, 2020

I've started a larger model training using https://github.com/erogol/TTS_experiments/tree/generic_vocoder

@george-roussos
Contributor Author

george-roussos commented Aug 20, 2020

I've started a larger model training using https://github.com/erogol/TTS_experiments/tree/generic_vocoder

I will start training the larger model too, because you mentioned that at 500K steps the noise was still there, and I guess it will be the same for me, especially since, with the small model architecture, I have never been able to get rid of it. Is that okay? I will train on LibriTTS; do you want me to resample it to 22050 or keep 24000? Also, is it okay if I train with fft_size = 1100, win_length = 1100 and hop_length = 275 instead? All the TTS models I have been training use these processing values. I tried to finetune using 1024 and 256, but the attention was worse for those models.

@erogol
Contributor

erogol commented Aug 21, 2020

"the same" to exactly which model? Do you mean a background noise or worse than that?

How large is your model?

Do you train with a single sampling rate or multiple sampling-rates?

It is better if you use the default values to be compatible with common models. But it is your call.

@george-roussos
Contributor Author

"the same" to exactly which model? Do you mean a background noise or worse than that?

When I trained I had background noise and then I tried with more speakers and the sound was muffled. I trained using your old ParallelWaveGAN fork.

How large is your model?

I guess it was the default model

Do you train with a single sampling rate or multiple sampling-rates?

It was single, 22050. I took LibriTTS and downsampled it.

It is better if you use the default values to be compatible with common models. But it is your call.

I will train using the default values 🙂 I will train my TTS again.

I can train the smaller model if you think it is a better/smarter use of our time, and you can train the larger one.

@lexkoro
Contributor

lexkoro commented Aug 21, 2020

I've also started a training session with the current generic_vocoder branch.
The only changes made were to:

        "preemphasis": 0.98,   
        "ref_level_db": 20,    
        "mel_fmin": 40.0,       
        "mel_fmax": 7800.0,    

And here are some of my results from the current dev branch: GDrive

@george-roussos
Contributor Author

george-roussos commented Aug 21, 2020

These sound nice. How long did it take to reach 675K on PWGan and what GPU? Are you training using LibriTTS?

@lexkoro
Contributor

lexkoro commented Aug 21, 2020

These sound nice. How long did it take to reach 675K on PWGan and what GPU? Are you training using LibriTTS?

~3 days on a V100, but the HDD and CPU are a bit of a bottleneck on the machine I'm using.
The dataset consists of audio mined from the games Gothic 1–3, so the audio quality is pretty good.

@george-roussos
Contributor Author

These sound nice. How long did it take to reach 675K on PWGan and what GPU? Are you training using LibriTTS?

~3 days on a V100, but the HDD and CPU are a bit of a bottleneck on the machine I'm using.
The dataset consists of audio mined from the games Gothic 1–3, so the audio quality is pretty good.

How many speakers does it have and how many minutes of speech for each?

@lexkoro
Contributor

lexkoro commented Aug 21, 2020

How many speakers does it have and how many minutes of speech for each?

Not really sure about the number of speakers; I would say around ~50. Length of speech varies from a few hours to just a few minutes.

@george-roussos
Contributor Author

How many speakers does it have and how many minutes of speech for each?

Not really sure about the number of speakers; I would say around ~50. Length of speech varies from a few hours to just a few minutes.

How many hours is the whole dataset and how many workers are you using for the data loading? I am using a V100 and 8 workers on a 30-hour dataset, and I am lucky if I get 300K steps in 3 days after the discriminator kicks in, with a batch size of 10 and a seq_len of 27500.

@lexkoro
Contributor

lexkoro commented Sep 2, 2020

How many hours is the whole dataset and how many workers are you using for the data loading? I am using a V100 and 8 workers on a 30-hour dataset, and I am lucky if I get 300K steps in 3 days after the discriminator kicks in, with a batch size of 10 and a seq_len of 27500.

~40 hours of data. I'm using 4 workers, each with OMP_NUM_THREADS=1 as a prefix when training. Using the default config values, batch_size: 6 and seq_len: 25600. But for now I'm giving up on training a PWGAN vocoder; I can't get rid of the crackling and hiss sounds.

The MB-MelGAN vocoder yielded better results for me, even though it sounds a bit more tinny/hollow.

@erogol
Contributor

erogol commented Sep 7, 2020

I am training a larger PWGAN model. After the discriminator kicked in, it raised an error while I was on PTO, and now I have restarted it. However, at first sight, I can tell the results look better even without discriminator training.

@sanjaesc can you share some examples with MBMelGAN? BTW, in the same paper they use a larger MelGAN model with better results, so maybe you can try that one. They only increase the model's receptive field compared to the original MelGAN.

@george-roussos
Contributor Author

I am very sorry for not having time to train universal (and for the off-topic to follow). I have been testing PWGAN with a single speaker, but I have had trouble getting it to converge. Before the discriminator, it produces waveforms that sound acceptable, but when the discriminator kicks in, it sounds more metallic, especially when the speaker glottalizes. I've tried different batch sizes, but that did not really help. I also tried starting discriminator training at 100K steps instead, but I got no noticeable improvement either. Is it worth pretraining the generator for 400K steps instead? I wonder why it does not want to converge; the PWGAN on LJSpeech sounds very good at 675K steps, and my dataset has considerably better audio than LJSpeech. I did not have a lot of time to try the generic_vocoder branch.

This is the config I use, and I will use the same when I train the LibriTTS PWGAN, unless something is wrong with it. I decided to go for 1100 and 275 instead, because in my tests 1024 and 256 give me pronunciation issues with the TTS (all other settings are the same). This config does not have normalization stats, though (because I have not used them with my TTS). Do they help a lot? I did not want to compute them, because I wanted more variety in the speech.

I will switch to universal training this week. 🙂

{
"github_branch":"* dev",
    "run_name": "pwgan",
    "run_description": "parallel-wavegan training",

    // AUDIO PARAMETERS
    "audio":{
        "fft_size": 1100,         // number of stft frequency levels. Size of the linear spectogram frame.
        "win_length": 1100,      // stft window length in ms.
        "hop_length": 275,       // stft window hop-lengh in ms.
        "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.98,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "ref_level_db": 0,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": false,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for timming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 0.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 20.0,         // scaler value appplied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "stats_path": null    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    // "distributed":{
    //     "backend": "nccl",
    //     "url": "tcp:\/\/localhost:54321"
    // },

    // MODEL PARAMETERS
    "use_pqmf": true,

    // LOSS PARAMETERS
    "use_stft_loss": true,
    "use_subband_stft_loss": false,  // USE ONLY WITH MULTIBAND MODELS
    "use_mse_gan_loss": true,
    "use_hinge_gan_loss": false,
    "use_feat_match_loss": false,  // use only with melgan discriminators

    // loss weights
    "stft_loss_weight": 0.5,
    "subband_stft_loss_weight": 0.5,
    "mse_G_loss_weight": 2.5,
    "hinge_G_loss_weight": 2.5,
    "feat_match_loss_weight": 25,

    // multiscale stft loss parameters
    "stft_loss_params": {
        "n_ffts": [1024, 2048, 512],
        "hop_lengths": [120, 240, 50],
        "win_lengths": [600, 1200, 240]
    },

    // subband multiscale stft loss parameters
    "subband_stft_loss_params":{
        "n_ffts": [384, 683, 171],
        "hop_lengths": [30, 60, 10],
        "win_lengths": [150, 300, 60]
    },

    "target_loss": "avg_G_loss",  // loss value to pick the best model to save after each epoch

    // DISCRIMINATOR
    "discriminator_model": "parallel_wavegan_discriminator",
    "discriminator_model_params":{
        "num_layers": 10
    },
    "steps_to_start_discriminator": 200000,      // steps required to start GAN trainining.1

    // GENERATOR
    "generator_model": "parallel_wavegan_generator",
    "generator_model_params": {
        "upsample_factors":[5, 5, 11],
        "stacks": 3,
        "num_res_blocks": 30
    },

    // DATASET
    "data_path": "../../../dataset/wavs/",
    "feature_path": null,
    "seq_len": 27500,
    "pad_short": 2000,
    "conv_pad": 0,
    "use_noise_augment": false,
    "use_cache": true,

    "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 6,       // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
    "test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "epochs": 10000,                // total number of epochs to train.
    "wd": 0.0,                // Weight decay weight.
    "gen_clip_grad": -1,      // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "disc_clip_grad": -1,     // Discriminator gradient clipping threshold.
    "lr_scheduler_gen": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_gen_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_scheduler_disc": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_disc_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr_gen": 1e-4,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_disc": 1e-4,

    // TENSORBOARD and LOGGING
    "print_step": 25,       // Number of steps to log traning on console.
    "print_eval": false,     // If True, it prints loss values for each step in eval run.
    "save_step": 10000,      // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 8,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 8,    // number of evaluation data loader processes.
    "eval_split_size": 10,

    // PATHS
    "output_path": "/home/"
}

@george-roussos
Contributor Author

@erogol are you training LibriTTS using the config in the generic_vocoder branch, or the one above with the extra layers? What batch size do you use? I think I will switch to universal now too.

@lexkoro
Contributor

lexkoro commented Sep 7, 2020

@erogol Here are some samples from the current run.

@george-roussos
Contributor Author

I am training the larger model now too. I am training with 275 and 1100, at 22050 only, in hopes of saving some time, but it is taking so long. I could only fit a batch size of 8. This is the config I use.

{
"github_branch":"* generic_vocoder",
"run_name": "pwgan",
"run_description": "parallel-wavegan training",

// AUDIO PARAMETERS
"audio":{
    "fft_size": 1100,         // number of stft frequency levels. Size of the linear spectogram frame.
    "win_length": 1100,      // stft window length in ms.
    "hop_length": 275,       // stft window hop-lengh in ms.
    "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
    "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

    // Audio processing parameters
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
    "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
    "ref_level_db": 0,     // reference level db, theoretically 20db is the sound of air.

    // Silence trimming
    "do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
    "trim_db": 48,          // threshold for timming silence. Set this according to your dataset.

    // MelSpectrogram parameters
    "num_mels": 80,         // size of the mel spec frame.
    "mel_fmin": 0.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "spec_gain": 20.0,         // scaler value appplied after log transform of spectrogram.

    // Normalization parameters
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
    "min_level_db": -100,   // lower bound for normalization
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "stats_path": null    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},

// DISTRIBUTED TRAINING
//    "distributed":{
//        "backend": "nccl",
//        "url": "tcp:\/\/localhost:54321"
//    },

// MODEL PARAMETERS
"use_pqmf": true,

// LOSS PARAMETERS
"use_stft_loss": true,
"use_subband_stft_loss": false,  // USE ONLY WITH MULTIBAND MODELS
"use_mse_gan_loss": true,
"use_hinge_gan_loss": false,
"use_feat_match_loss": false,  // use only with melgan discriminators

// loss weights
"stft_loss_weight": 0.5,
"subband_stft_loss_weight": 0.5,
"mse_G_loss_weight": 2.5,
"hinge_G_loss_weight": 2.5,
"feat_match_loss_weight": 25,

// multiscale stft loss parameters
"stft_loss_params": {
    "n_ffts": [1024, 2048, 512],
    "hop_lengths": [120, 240, 50],
    "win_lengths": [600, 1200, 240]
},

// subband multiscale stft loss parameters
"subband_stft_loss_params":{
    "n_ffts": [384, 683, 171],
    "hop_lengths": [30, 60, 10],
    "win_lengths": [150, 300, 60]
},

"target_loss": "avg_G_loss",  // loss value to pick the best model to save after each epoch

// DISCRIMINATOR
"discriminator_model": "parallel_wavegan_discriminator",
"discriminator_model_params":{
    "num_layers": 10
},
"steps_to_start_discriminator": 200000,      // steps required to start GAN trainining.1

// GENERATOR
"generator_model": "parallel_wavegan_generator",
"generator_model_params": {
    "upsample_factors":[5, 5, 11],
    "stacks": 3,
    "num_res_blocks": 30,
    "res_channels": 96,
    "gate_channels": 192,
    "skip_channels": 96
},

// DATASET
"data_path": "/home/george/libri/all",
"feature_path": null,
"seq_len": 27500,
"pad_short": 2000,
"conv_pad": 0,
"use_noise_augment": false,
"use_cache": true,

"reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

// TRAINING
"batch_size": 8,       // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
"sampling_rates": [22050],  // pick different sampling rate for each iteration to train generic PWGAN model.

// VALIDATION
"run_eval": true,
"test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

// OPTIMIZER
"epochs": 10000,                // total number of epochs to train.
"wd": 0.0,                // Weight decay weight.
"gen_clip_grad": -1,      // Generator gradient clipping threshold. Apply gradient clipping if > 0
"disc_clip_grad": -1,     // Discriminator gradient clipping threshold.
"lr_scheduler_gen": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_gen_params": {
    "gamma": 0.5,
    "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_scheduler_disc": "MultiStepLR",   // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_disc_params": {
    "gamma": 0.5,
    "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr_gen": 1e-4,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_disc": 1e-4,

// TENSORBOARD and LOGGING
"print_step": 25,       // Number of steps to log traning on console.
"print_eval": false,     // If True, it prints loss values for each step in eval run.
"save_step": 25000,      // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true,     // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

// DATA LOADING
"num_loader_workers": 8,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 8,    // number of evaluation data loader processes.
"eval_split_size": 10,

// PATHS
"output_path": "/home/george/"
}

@erogol erogol added this to To do in v0.0.5 Sep 9, 2020
@erogol erogol moved this from To do to In progress in v0.0.5 Sep 9, 2020
@george-roussos
Contributor Author

@erogol can you give some info on the setup you're training on? Number of GPUs and batch size?

@erogol
Contributor

erogol commented Sep 11, 2020

Batch size 6, 3x 1080 Ti GPUs.

But the model did not work well, even though it converged better.

Now I have started a new training with a constant sampling rate and a 6-block ResNet applied to the input spectrograms before the vocoder, as in the WaveRNN we use.

@george-roussos
Contributor Author

Thanks a lot, it helps to know. Mine was also not good. But I also trained on one GPU and it was extremely slow.

@erogol
Contributor

erogol commented Sep 11, 2020

Maybe we need to add WaveRNN to Mozilla TTS's list of vocoders until we have something comparable. Is there anyone willing to do that?

@george-roussos
Contributor Author

george-roussos commented Sep 15, 2020

@erogol Do you think increasing the resolutions in the STFT loss might help, something like the sketch below? I think (I am not sure) I read a paper where they increase them.
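(The values below are only an illustration of what I mean, adding a fourth, longer-window resolution; they are not from any specific paper.)

    "stft_loss_params": {
        "n_ffts": [1024, 2048, 512, 4096],
        "hop_lengths": [120, 240, 50, 480],
        "win_lengths": [600, 1200, 240, 2400]
    },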

@WeberJulian
Contributor

What's the error when you do:

waveform = vocoder_model.inference(torch.FloatTensor(ap_vocoder._normalize(mel_postnet_spec.T).T).unsqueeze(0))

@george-roussos
Contributor Author

george-roussos commented Sep 21, 2020

What's the error when you do:

waveform = vocoder_model.inference(torch.FloatTensor(ap_vocoder._normalize(mel_postnet_spec.T).T).unsqueeze(0))

If I do mel_postnet_spec = ap._denormalize(mel_postnet_spec), I get RuntimeError: Given groups=1, weight of size [80, 80, 1], expected input[1, 92, 80] to have 80 channels, but got 92 channels instead. If I do mel_postnet_spec = ap._denormalize(mel_postnet_spec.T), I get RuntimeError: [!] Mean-Var stats does not match the given feature dimensions.

@erogol erogol moved this from In progress to Done in v0.0.5 Sep 22, 2020
@erogol erogol moved this from Done to In progress in v0.0.5 Sep 22, 2020
@erogol
Contributor

erogol commented Sep 22, 2020

You can see the model release under wiki/released models.

I'm closing this issue. Thanks for your discussion and help. Feel free to comment here if you encounter a problem.

@erogol erogol closed this as completed Sep 22, 2020
@erogol erogol moved this from In progress to Done in v0.0.5 Sep 22, 2020
@george-roussos
Contributor Author

Thank you! Can't believe we finally have a universal GAN!

For everyone else: to run it, I had to take the fullband MelGAN model from TTS_experiments/vocoder/models and the definition in generic_utils.py (thanks Eren). I wasn't able to get the interpolation to work, though; I am probably doing it wrong. @WeberJulian did you get it to work with 16 kHz or 22 kHz?

@WeberJulian
Contributor

Great, thank you @erogol!
@george-roussos I'm trying this out today, and if I get it to work, I'll probably do a PR :)

@erogol
Contributor

erogol commented Sep 22, 2020

@george-roussos you should check the latest Glow-TTS example, which uses interpolation for the sampling rate.

https://colab.research.google.com/drive/1NC4eQJFvVEqD8L4Rd8CVK25_Z-ypaBHD?usp=sharing

@WeberJulian
Contributor

WeberJulian commented Sep 22, 2020

So I got the generic vocoder model to work with the 21k sr TTS model from this notebook, thanks to interpolation.

But I observed something weird: the voice sounds better if I don't denormalize with the TTS's ap and then normalize with the vocoder's ap. Am I doing something wrong, @erogol? Or is it just a matter of subjective preference?

import torch
from torch.nn.functional import interpolate

mel_postnet_spec = mel_postnet_spec.T
# mel_postnet_spec = ap._denormalize(mel_postnet_spec)
mel_postnet_spec = torch.tensor(mel_postnet_spec)[None, None]  # [1, 1, n_mels, T] so interpolate gets a 4D tensor
mel_postnet_spec = interpolate(mel_postnet_spec, scale_factor=(1, 21050 / 24000)).squeeze().numpy()
# mel_postnet_spec = ap_voco._normalize(mel_postnet_spec)

without norm/denorm: sample
with norm/denorm: sample

@WeberJulian
Contributor

And by the way @george-roussos, I got the same error as you (RuntimeError: Given groups=1, weight of size [80, 80, 1], expected input[1, 92, 80] to have 80 channels, but got 92 channels instead), but it was fixed either by placing the .T as I did in the previous snippet, or by the interpolation; I'm not too sure which. The sketch below shows why the shapes matter.
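For context, the 92-vs-80 error is a transpose problem: the vocoder's input conv has weight [80, 80, 1], so it expects [batch, num_mels=80, T], but it received [1, 92, 80], i.e. a [T=92, num_mels=80] spectrogram with time in the channel slot. A minimal sketch of the intended axis order (the axis convention is the only assumption here):

    import torch

    # mel_postnet_spec: [T, num_mels] as returned by the TTS model
    vocoder_input = torch.FloatTensor(mel_postnet_spec.T).unsqueeze(0)  # [1, num_mels, T]
    waveform = vocoder_model.inference(vocoder_input)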

@erogol
Contributor

erogol commented Sep 22, 2020

So I got the generic vocoder model to work with the 21k sr TTS model from this notebook, thanks to interpolation.

But I observed something weird: the voice sounds better if I don't denormalize with the TTS's ap and then normalize with the vocoder's ap. Am I doing something wrong, @erogol? Or is it just a matter of subjective preference?

import torch
from torch.nn.functional import interpolate

mel_postnet_spec = mel_postnet_spec.T
# mel_postnet_spec = ap._denormalize(mel_postnet_spec)
mel_postnet_spec = torch.tensor(mel_postnet_spec)[None, None]  # [1, 1, n_mels, T] so interpolate gets a 4D tensor
mel_postnet_spec = interpolate(mel_postnet_spec, scale_factor=(1, 21050 / 24000)).squeeze().numpy()
# mel_postnet_spec = ap_voco._normalize(mel_postnet_spec)

without norm/denorm: sample
with norm/denorm: sample

You need to change the display sample rate to have a fairer comparison. That is why your samples sound deeper.

In the Glow-TTS notebook, it sounds better for me with denorm.
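For anyone following along, that means passing the vocoder's output rate to the audio widget, e.g.:

    # Play the waveform at the vocoder's sample rate (24000 for this model);
    # the player's default rate would otherwise shift the pitch.
    from IPython.display import Audio, display
    display(Audio(waveform, rate=24000))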

@WeberJulian
Contributor

WeberJulian commented Sep 22, 2020

You need to change the display sample rate to have a fairer comparison. That is why your samples sound deeper.

You mean the rate parameter of the IPython.display.display function?
Yeah, you're right, I forgot to change it, but it doesn't explain the difference between the two, because both use the 24 kHz vocoder model you released. I just tested it with the right display sr and I got the same difference. But I guess it's just a matter of preference.

@george-roussos
Contributor Author

Thanks guys! I was able to get interpolation to work on a TTS trained with mean-var, but no luck with a TTS without mean-var :( Thanks for the vocoder Eren, I appreciate it a great deal.

@george-roussos
Contributor Author

Hey @erogol did you train the fullband MelGAN on 1 GPU with a batch size of 48? I am trying to train a run without mean-var on a V100 and it is soooooooo slow... Within 9 hours it's done, like, 25K steps. If I reduce the batch size to 16, it goes back to usual GAN training speeds, but I am afraid it might not be enough for learning. What would you suggest?

@WeberJulian
Contributor

@george-roussos have you checked your CPU usage? If you're on the AWS P3 instance, you could be CPU-bottlenecked.

@george-roussos
Contributor Author

I use a Google instance. Have you tried it with 48? Is it supposed to be as fast as 16?

@WeberJulian
Contributor

Ok, no, I haven't tried this specifically, but I remember being heavily CPU-bound on a deep learning task because the V100 instance had only 8 vCPUs. So it's worth checking both CPU and GPU usage (htop and nvidia-smi respectively).

@george-roussos
Contributor Author

That is how many I have! I have 8. Does it need more? Oh man. 😩

@WeberJulian
Contributor

No, not necessarily; it depends on the task. Just check your CPU/GPU usage to know ^^

@george-roussos
Contributor Author

Thanks. I tried with 16 vCPUs instead, but it took the exact same time: 2K steps in one hour with a batch size of 48.

@erogol
Contributor

erogol commented Oct 1, 2020

Hey @erogol did you train the fullband MelGAN on 1 GPU with a batch size of 48? I am trying to train a run without mean-var on a V100 and it is soooooooo slow... Within 9 hours it's done, like, 25K steps. If I reduce the batch size to 16, it goes back to usual GAN training speeds, but I am afraid it might not be enough for learning. What would you suggest?

Yes, 1 GPU and a batch size of 48.

I agree that you might be bottlenecked by the CPU.

You can enable the in-memory feature cache ("use_cache": true) in training to load the data faster, if it is not already enabled.

@george-roussos
Contributor Author

Thanks, well that sucks. How many CPUs do you have? I tried with 16, but it took the exact same time (1 hour for 2000 steps). use_cache is enabled, yeah.

@erogol
Contributor

erogol commented Oct 1, 2020

What are step_time and loader_time in your console logs?

@george-roussos
Contributor Author

It varies. It can either be something like step_time: 0.23 loader_time: 5.4650 or something like step_time: 0.25 loader_time: 0.0044. These numbers are with 8 vCPUs and 1 GPU (V100), and htop shows 100% on all CPUs. I also tried 4 and 8 loader workers in config.json, but it didn't help.

@WeberJulian
Contributor

If CPU usage is 100% on all cores, you're CPU-bottlenecked. Your GPU usage is probably low (you can check by typing watch -n 0.1 nvidia-smi). Increasing the number of workers won't help, sorry.

@george-roussos
Contributor Author

I thought the same (yes, GPU usage is low). But when I tried 16 cores instead, the times were still the same (all 16 CPUs also showed 100% usage).

@lexkoro
Contributor

lexkoro commented Oct 1, 2020

It varies. It can either be something like step_time: 0.23 loader_time: 5.4650 or something like step_time: 0.25 loader_time: 0.0044. These numbers are with 8 vCPUs and 1 GPU (V100), and htop shows 100% on all CPUs. I also tried 4 and 8 loader workers in config.json, but it didn't help.

I ran into the same issue with the CPU being the bottleneck. Using the prefix OMP_NUM_THREADS=1 as described here resolved my issue. I'm using 4 workers at most; most of the time I set them to 2.
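For reference, that means launching training like `OMP_NUM_THREADS=1 python TTS/bin/train_vocoder.py --config_path config.json` (the script path and flags may differ between branches).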

@george-roussos
Contributor Author

Yes! This seems to do the trick (the CPUs are no longer at 100% usage). I tried it so many times, but I guess I wasn't putting it in the right place. Thanks. 🥇
