# Tacotron2: WaveNet-based text-to-speech demo

- Tacotron2 (mel-spectrogram prediction part): https://github.com/Rayhane-mamah/Tacotron-2
- WaveNet: https://github.com/r9y9/wavenet_vocoder

This is a proof of concept for Tacotron2 text-to-speech synthesis. The models used here were trained on the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/).

**Notice**: Waveform generation is very slow because it uses naive autoregressive generation. It does not use the parallel generation method described in [Parallel WaveNet](https://arxiv.org/abs/1711.10433).
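
To see why naive autoregressive generation is slow, the toy sketch below produces one sample at a time and feeds each output back in as context. `generate_autoregressive` and `toy_predictor` are hypothetical stand-ins and not part of the WaveNet vocoder's API; the only point is that step t cannot start before step t-1 has finished.

```python
from typing import Callable, List, Sequence


def generate_autoregressive(
    predict_next: Callable[[Sequence[float]], float],
    num_samples: int,
    context_size: int,
    initial: Sequence[float] = (0.0,),
) -> List[float]:
    """Generate samples one at a time (toy illustration, not WaveNet itself).

    Each new sample is computed from the most recent `context_size` samples,
    so the loop runs `num_samples` strictly sequential steps.
    """
    history = list(initial)
    for _ in range(num_samples):
        context = history[-context_size:]
        history.append(predict_next(context))
    # Drop the seed values and return only the generated samples
    return history[len(initial):]


def toy_predictor(context: Sequence[float]) -> float:
    """Hypothetical predictor: a damped average of the context plus an offset."""
    return 0.5 * sum(context) / len(context) + 0.1


samples = generate_autoregressive(toy_predictor, num_samples=5, context_size=2)
print(samples)
```

With a real model, each call to `predict_next` is a full network forward pass, which is why the cost grows linearly with the number of audio samples.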

**Estimated time to complete**: 2 ~ 3 hours.

## Setup

### Install dependencies

In [1]:
import os
from os.path import exists, join, expanduser

os.chdir(expanduser("~"))

wavenet_dir = "wavenet_vocoder"
if not exists(wavenet_dir):
  ! git clone https://github.com/r9y9/$wavenet_dir
    
taco2_dir = "Tacotron-2"
if not exists(taco2_dir):
  ! git clone https://github.com/r9y9/$taco2_dir
  ! cd $taco2_dir && git checkout -B wavenet3 origin/wavenet3

Cloning into 'wavenet_vocoder'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 1086 (delta 11), reused 23 (delta 10), pack-reused 1056
Receiving objects: 100% (1086/1086), 20.14 MiB | 8.47 MiB/s, done.
Resolving deltas: 100% (539/539), done.
Cloning into 'Tacotron-2'...
remote: Enumerating objects: 570, done.
remote: Total 570 (delta 0), reused 0 (delta 0), pack-reused 570
Receiving objects: 100% (570/570), 8.08 MiB | 4.97 MiB/s, done.
Resolving deltas: 100% (352/352), done.
Branch 'wavenet3' set up to track remote branch 'wavenet3' from 'origin'.
Switched to a new branch 'wavenet3'


In [2]:
# Install dependencies
! pip install -q --upgrade "tensorflow<=1.9.0"

os.chdir(join(expanduser("~"), taco2_dir))
! pip install -q -r requirements.txt

os.chdir(join(expanduser("~"), wavenet_dir))
! pip install -q -e '.[train]'

tcmalloc: large alloc 1073750016 bytes == 0x58a84000 @  0x7f43288832a4 0x594e17 0x626104 0x51190a 0x4f5277 0x510c78 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f6070 0x510c78 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f6070 0x4f3338 0x510fb0 0x5119bd 0x4f6070


In [3]:
import torch
import tensorflow
tensorflow.__version__

'1.9.0'

### Download pretrained models

#### Tacotron2 (mel-spectrogram prediction part)

In [4]:
os.chdir(join(expanduser("~"), taco2_dir))
! mkdir -p logs-Tacotron
if not exists("logs-Tacotron/pretrained"):
  ! curl -O -L "https://www.dropbox.com/s/vx7y4qqs732sqgg/pretrained.tar.gz"
  ! tar xzvf pretrained.tar.gz
  ! mv pretrained logs-Tacotron

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1246    0  1246    0     0    641      0 --:--:--  0:00:01 --:--:--     0
100  288M  100  288M    0     0  17.4M      0  0:00:16  0:00:16 --:--:-- 37.8M
pretrained/
pretrained/checkpoint
pretrained/model.ckpt-189500.meta
pretrained/model.ckpt-189500.data-00000-of-00001
pretrained/model.ckpt-189500.index


#### WaveNet

In [5]:
os.chdir(join(expanduser("~"), wavenet_dir))
wn_preset = "20180510_mixture_lj_checkpoint_step000320000_ema.json"
wn_checkpoint_path = "20180510_mixture_lj_checkpoint_step000320000_ema.pth"

if not exists(wn_preset):
  !curl -O -L "https://www.dropbox.com/s/0vsd7973w20eskz/20180510_mixture_lj_checkpoint_step000320000_ema.json"
if not exists(wn_checkpoint_path):
  !curl -O -L "https://www.dropbox.com/s/zdbfprugbagfp2w/20180510_mixture_lj_checkpoint_step000320000_ema.pth"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1488  100  1488    0     0   1251      0  0:00:01  0:00:01 --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1246    0  1246    0     0    750      0 --:--:--  0:00:01 --:--:--  1430
100  282M  100  282M    0     0  25.3M      0  0:00

## Input texts to be synthesized

Choose your favorite sentences :)

In [0]:
os.chdir(join(expanduser("~"), taco2_dir))

In [7]:
%%bash
cat << EOS > text_list.txt
This is really awesome!
This is text-to-speech online demonstration by Tacotron 2 and WaveNet.
Thanks for your patience.
EOS

cat text_list.txt

This is really awesome!
This is text-to-speech online demonstration by Tacotron 2 and WaveNet.
Thanks for your patience.


## Mel-spectrogram prediction by Tacotron2

In [8]:
# Remove old files if exist
! rm -rf tacotron_output
! python synthesize.py --model='Tacotron' --mode='eval' \
  --hparams='symmetric_mels=False,max_abs_value=4.0,power=1.1,outputs_per_step=1' \
  --text_list=./text_list.txt

loaded model at logs-Tacotron/pretrained/model.ckpt-189500
Hyperparameters:
  allow_clipping_in_normalization: True
  attention_dim: 128
  attention_filters: 32
  attention_kernel: (31,)
  cleaners: english_cleaners
  cumulative_weights: True
  decoder_layers: 2
  decoder_lstm_units: 1024
  embedding_dim: 512
  enc_conv_channels: 512
  enc_conv_kernel_size: (5,)
  enc_conv_num_layers: 3
  encoder_lstm_units: 256
  fft_size: 1024
  fmax: 7600
  fmin: 125
  frame_shift_ms: None
  griffin_lim_iters: 60
  hop_size: 256
  impute_finished: False
  input_type: raw
  log_scale_min: -32.23619130191664
  mask_encoder: False
  mask_finished: False
  max_abs_value: 4.0
  max_iters: 2500
  min_level_db: -100
  num_freq: 513
  num_mels: 80
  outputs_per_step: 1
  postnet_channels: 512
  postnet_kernel_size: (5,)
  postnet_num_layers: 5
  power: 1.1
  predict_linear: False
  prenet_layers: [256, 256]
  quantize_channels: 65536
  ref_level_db: 20
  rescale: True
  rescaling_max: 0.999
  sample_rate: 2

## Waveform synthesis by WaveNet

In [0]:
import librosa.display
import IPython
from IPython.display import Audio
import numpy as np
import torch

In [10]:
os.chdir(join(expanduser("~"), wavenet_dir))

# Setup WaveNet vocoder hparams
from hparams import hparams
with open(wn_preset) as f:
    hparams.parse_json(f.read())

# Setup WaveNet vocoder
from train import build_model
from synthesis import wavegen
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = build_model().to(device)

print("Load checkpoint from {}".format(wn_checkpoint_path))
# map_location lets the checkpoint load on CPU-only machines as well
checkpoint = torch.load(wn_checkpoint_path, map_location=device)
model.load_state_dict(checkpoint["state_dict"])

This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'module://ipykernel.pylab.backend_inline' by the following code:
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-121>", line 2, in initialize
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-pac

Load checkpoint from 20180510_mixture_lj_checkpoint_step000320000_ema.pth


In [11]:
from glob import glob
from tqdm import tqdm

with open("../Tacotron-2/tacotron_output/eval/map.txt") as f:
  maps = f.readlines()
# Strip only the trailing newline so a final line without one is not truncated
maps = list(map(lambda x: x.rstrip("\n").split("|"), maps))
# filter out invalid ones
maps = list(filter(lambda x:len(x) == 2, maps))

print("List of texts to be synthesized")
for idx, (text,_) in enumerate(maps):
  print(idx, text)

List of texts to be synthesized
0 This is really awesome!
1 This is text-to-speech online demonstration by Tacotron 2 and WaveNet.
2 Thanks for your patience.
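
The parsing step above can also be written as a small named helper, which is easier to test in isolation. This is a minimal sketch: `parse_map_lines` is a hypothetical name, and the `.npy` paths in the example are made up for illustration. It assumes each valid line has the form `text|mel_path`, as the cell above does.

```python
from typing import Iterable, List, Tuple


def parse_map_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split 'text|mel_path' lines into pairs, skipping malformed lines."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Keep only lines with exactly two fields: the text and the mel path
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


# Example input with made-up paths; the middle line is malformed and dropped
example = [
    "Hello there.|some_dir/mel-0.npy\n",
    "a line without a separator\n",
    "No trailing newline.|some_dir/mel-1.npy",
]
for idx, (text, mel) in enumerate(parse_map_lines(example)):
    print(idx, text, mel)
```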


### Waveform generation

**Note**: This will take hours to finish, depending on the number and length of the texts. Try short sentences first if you would like to hear samples quickly.
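
For a rough estimate of the wait, assume that generation emits one audio sample per step and that each mel frame corresponds to `hop_size` samples (256 in the hyperparameters printed earlier). The helper names below are hypothetical, and the throughput is whatever the tqdm progress bar reports on your hardware. The example numbers (109 frames, about 27.85 steps per second) correspond to the first sentence in this run's log.

```python
def estimate_generation_seconds(
    num_frames: int, hop_size: int, steps_per_second: float
) -> float:
    """Estimate wall-clock seconds, assuming one generated sample per step."""
    if steps_per_second <= 0:
        raise ValueError("steps_per_second must be positive")
    return num_frames * hop_size / steps_per_second


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# 109 mel frames * 256 samples per frame = 27904 sequential steps
eta = estimate_generation_seconds(num_frames=109, hop_size=256, steps_per_second=27.85)
print(format_hms(eta))
```

Because the cost is linear in the number of frames, halving the length of a sentence roughly halves the wait.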

In [12]:
waveforms = []

for idx, (text, mel) in enumerate(maps):
  print("\n", idx, text)
  mel_path = join("../Tacotron-2", mel)
  c = np.load(mel_path)
  if c.shape[1] != hparams.num_mels:
    # np.swapaxes returns a new array, so the result must be assigned
    c = np.swapaxes(c, 0, 1)
  # Range [0, 4] was used for training Tacotron2 but WaveNet vocoder assumes [0, 1]
  c = np.interp(c, (0, 4), (0, 1))
 
  # Generate
  waveform = wavegen(model, c=c, fast=True, tqdm=tqdm)
  
  waveforms.append(waveform)

  # Audio
  IPython.display.display(Audio(waveform, rate=hparams.sample_rate))

  0%|          | 3/27904 [00:00<18:10, 25.58it/s]


 0 This is really awesome!


100%|██████████| 27904/27904 [16:41<00:00, 27.85it/s]


  0%|          | 3/82176 [00:00<1:00:04, 22.80it/s]


 1 This is text-to-speech online demonstration by Tacotron 2 and WaveNet.


  4%|▍         | 3621/82176 [02:11<47:47, 27.40it/s]

KeyboardInterrupt: ignored
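
The shape check and rescaling inside the loop above can be factored into a helper, sketched here under a few assumptions: `prepare_mel` is a hypothetical name, the expected layout is `(frames, num_mels)`, and the input range `[0, max_abs_value]` follows the `max_abs_value=4.0` setting used for Tacotron2 in this notebook. Note that `np.swapaxes` returns a new array rather than modifying its argument, so the result has to be assigned.

```python
import numpy as np


def prepare_mel(c: np.ndarray, num_mels: int, max_abs_value: float = 4.0) -> np.ndarray:
    """Return a (frames, num_mels) array rescaled from [0, max_abs_value] to [0, 1]."""
    if c.ndim != 2:
        raise ValueError("expected a 2-D mel-spectrogram, got shape {}".format(c.shape))
    if c.shape[1] != num_mels:
        # swapaxes is not in-place: keep the returned array
        c = np.swapaxes(c, 0, 1)
    if c.shape[1] != num_mels:
        raise ValueError("no axis of length num_mels={}".format(num_mels))
    # np.interp maps linearly and clips values outside the input range
    return np.interp(c, (0.0, max_abs_value), (0.0, 1.0))


# Example: a transposed (num_mels, frames) input filled with the midpoint value
mel = np.full((80, 5), 2.0)
out = prepare_mel(mel, num_mels=80)
print(out.shape, out.min(), out.max())
```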

## Summary: audio samples

In [0]:
# Generation may have been interrupted, so play only the waveforms that exist
for idx, ((text, _), waveform) in enumerate(zip(maps, waveforms)):
  print(idx, text)
  IPython.display.display(Audio(waveform, rate=hparams.sample_rate))

For more information, please visit https://github.com/r9y9/wavenet_vocoder. More samples can be found at https://r9y9.github.io/wavenet_vocoder/.