Skip to content
Go to file

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Adversarial vocoding

Code from our paper Expediting TTS Synthesis with Adversarial Vocoding. Sound examples can be found here.

Adversarial vocoding is a method for transforming perceptually-informed spectrograms (e.g. mel spectrograms) into audio. It can be used in combination with TTS systems which produce spectrograms (e.g. Tacotron 2). We also provide a method for generating spectrograms with a GAN, which can then be vocoded to audio using adversarial vocoding.


To get started training adversarial vocoders, you must first install the lightweight advoc package. This package is a well-tested set of modules which handle audio IO, spectral processing, and heuristic vocoding in both numpy and tensorflow.

To install in a virtual environment, follow these instructions:

virtualenv -p python3 --no-site-packages advoc
cd advoc
git clone
source bin/activate
cd advoc
pip install -e .

To install globally, run sudo pip install -e ..

To run our suite of tests to affirm reproducibility of our feature representations, heuristic inversion code, etc. run:

python test

Our setup script installed tensorflow==1.13.1 which is linked to CUDA 10 by default. If you are using CUDA 9, you may want to instead use tensorflow-gpu==1.12.0:

pip uninstall tensorflow-gpu
pip install tensorflow-gpu==1.12.0

Adversarial Vocoder (AdVoc)

An adversarial vocoder is a magnitude estimation method: it takes mel spectrograms (logarithmic frequency axis) and produces magnitude spectrograms (linear frequency axis). We pair these estimated magnitude spectrograms with phase estimates from LWS to produce audio.


To train the adversarial vocoder model on the LJ Speech Dataset, download the data and extract the .wav files into models/advoc/data/ljspeech/wavs. Split this data into training, validation and test sets using scripts/ as follows:

cd scripts
python \
  --source_dir models/advoc/data/ljspeech/wavs \
  --out_dir models/advoc/data/ljspeech/wavs_split

This script should create train, valid, and test directories in models/advoc/data/ljspeech/wavs_split.

Train the the adversarial vocoder model (AdVoc) on the training set as follows:

cd models/advoc
python train \
  ${WORK_DIR} \
  --data_cfg ../../datacfg/ljspeech.txt \
  --data_dir ./data/ljspeech/wavs_split/train \

To train the smaller version of adversarial vocoder (AdVoc-small) use:

python train \
  ${WORK_DIR}$ \
  --data_cfg ../../datacfg/ljspeech.txt \
  --data_dir ./data/ljspeech/wavs_split/train \
  --model_type small

For custom datasets, see here.

Monitoring and continuous evaluation

Training logs can be visualized and audio samples can be listened to by launching tensorboard in the WORK_DIR as follows:

tensorboard --logdir=${WORK_DIR}$

To back up checkpoints every hour (GAN training may occasionally collapse so it's good to have backups)

python $WORK_DIR$ 60

To evaluate each checkpoint on the validation set, run the following:

python eval \
  ${WORK_DIR}$ \
  --data_cfg ../../datacfg/ljspeech.txt \
  --data_dir ./data/ljspeech/wavs_split/valid \


Extract mel-spectrograms for audio files from the test dataset using scripts/ as follows:

cd scripts
python \
  --wave_dir ../models/advoc/data/ljspeech/wavs_split/test \
  --out_dir ../models/advoc/data/ljspeech/mel_specs/test \

Running this script should save the extracted mel-spectrogram in models/advoc/data/ljspeech/mel_specs/test as .npy files.

The mel-spectrograms can be vocoded either using the pre-trained models provided at the bottom of this page or training the model from scratch using the steps given above. To vocode mel-spectrograms from an AdVoc checkpoint, use scripts/

cd scripts
python \
--spec_dir ../models/advoc/data/ljspeech/mel_specs/test \
--out_dir ../models/advoc/data/ljspeech/vocoded_output/test \
--model_ckpt <PATH TO PRETRAINED CKPT> \

The above command should save the vocoded audio in models/advoc/data/ljspeech/vocoded_output/test.

Mel spectrogram GAN

Adversarial vocoding can be used to factorize audio generation into P(spectrogram) * P(audio | spectrogram). This is useful because it is currently easier to generate spectrograms with GANs than raw audio. In our paper, we show that this factorized strategy can be used to achieve state-of-the-art results on unsupervised generation of small-vocabulary speech.

Training a new model

To train a mel spectrogram GAN on spoken digits, first download the SC09 dataset and unzip to models/melspecgan/data. Then, run the following from this directory:

cd models/melspecgan
python train \
  ${WORK_DIR} \
  --data_cfg ../../datacfg/sc09.txt \
  --data_dir ./data/sc09/train \

Then train an adversarial vocoder on this same dataset

cd models/advoc
python train \
  ${WORK_DIR} \
  --data_cfg ../../datacfg/sc09.txt \
  --data_dir ../melspecgan/data/sc09/train \

To train on a different dataset, see here.


To generate mel spectrograms of spoken digits with MelspecGAN, first download our pretrained checkpoint and extract to scripts/sc09_melspecgan. Then, use scripts/ as follows:

cd scripts
python \
  --out_dir ./melspecgan_gen \
  --ckpt_fp ./sc09_melspecgan/best_score-55089 \
  --n 1000 \

You can also run on a MelspecGAN you trained by changing --ckpt_fp.

Dataset configuration

Both models in this repository (MelspecGAN and Adversarial Vocoder) use configuration files to set data processing properties appropriate for particular datasets. These configuration files affect the loader which streams batches of examples directly from audio files (requires no preprocessing). The loader works by decoding individual audio files and extracting a variable number of "slices" (fixed-length examples) from each.

We provide configuration files for LJSpeech (datacfg/ljspeech.txt) and SC09 (datacfg/sc09.txt), but you may need to create your own if you want to use a different dataset. If you create one, you will need to set the following properties in the config file:

  • sample_rate: The number of audio samples per second.
  • fastwav: Set to 1 to use scipy to load WAV files, 0 to use librosa to load arbitrary audio files. Scipy is faster but only works for standard WAV files (16-bit PCM or 32-bit float), and does not support resampling. Librosa is slower but supports many types of audio files and resampling.
  • normalize: Set to 1 to normalize each audio file, 0 to skip normalization.
  • slice_first_only: Set to 1 to only use the first slice from each audio file, 0 to use as many slices as possible. Enabling this is appropriate for sound effects datasets, disabling is appropriate for datasets of longer audio files.
  • slice_randomize_offset: Set to 1 to randomize the starting position for slicing, 0 to always start at the beginning. Enabling this is more appropriate for datasets of longer audio files.
  • slice_pad_end: Set to 1 to zero-pad the end of each audio file to produce as many slices as possible, 0 to ignore incomplete slices at the end of each audio file.

Pretrained checkpoints


Vocode spectrograms to audio with generative adversarial networks



No releases published


No packages published


You can’t perform that action at this time.