[View in Colaboratory](https://colab.research.google.com/github/lserafin/ColabRepo/blob/master/DeepVoice3_multi_speaker_TTS_en_demo_StarTrek.ipynb)

# DeepVoice3: Multi-speaker text-to-speech demo

In this notebook, you can try DeepVoice3-based multi-speaker text-to-speech (English) using a model trained on the [VCTK dataset](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). The notebook is meant to be run on [Google Colab](https://colab.research.google.com), so you don't have to set up anything on your local machine.

**Estimated time to complete**: 5 minutes.

- Code: https://github.com/r9y9/deepvoice3_pytorch
- Audio samples: https://r9y9.github.io/deepvoice3_pytorch/

## Setup

### Install dependencies

In [0]:
import os
from os.path import exists, join, expanduser

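# Clone into the home directory, which is where the next cell changes to
name = "deepvoice3_pytorch"
project_dir = join(expanduser("~"), name)
if not exists(project_dir):
  ! git clone https://github.com/r9y9/$name $project_dir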

In [0]:
# Change working directory to the project dir 
os.chdir(join(expanduser("~"), name))

# Use pytorch v0.3.1
!pip install -q torch==0.3.1

In [0]:
%pylab inline
! pip install -q librosa nltk

import torch
import numpy as np
import librosa
import librosa.display
import IPython
from IPython.display import Audio
# Needed for the English text-processing frontend
import nltk
! python -m nltk.downloader cmudict

### Download a pre-trained model

In [0]:
checkpoint_path = "20171222_deepvoice3_vctk108_checkpoint_step000300000.pth"

In [0]:
if not exists(checkpoint_path):
  !curl -O -L "https://www.dropbox.com/s/uzmtzgcedyu531k/20171222_deepvoice3_vctk108_checkpoint_step000300000.pth"
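
A download from a file-sharing link can silently save an HTML page instead of the binary checkpoint. The sketch below is one way to sanity-check the file before loading it. The helper name `looks_like_checkpoint` and the size threshold are assumptions made for illustration; they are not part of deepvoice3_pytorch.

```python
import os

def looks_like_checkpoint(path, min_bytes=1000000):
    """Heuristic: True if `path` exists, is reasonably large, and is not HTML.

    `min_bytes` is an arbitrary threshold chosen for this sketch.
    """
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) < min_bytes:
        return False
    with open(path, "rb") as f:
        head = f.read(512).lstrip().lower()
    # An HTML error page typically starts with a doctype or an <html> tag
    return not (head.startswith(b"<!doctype") or head.startswith(b"<html"))
```

For example, `looks_like_checkpoint(checkpoint_path)` could be called right after the download cell. A `False` result suggests the download should be retried.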

### Check out the working commit

In [0]:
# Copy preset file (json) from master
# The preset file describes hyper parameters
! git checkout master --quiet
preset = "./presets/deepvoice3_vctk.json"
! cp -v $preset .
preset = "./deepvoice3_vctk.json"

# Then check out the working commit.
# The model was trained a few months ago, so it is not compatible
# with the current master.
! git checkout 0421749 --quiet
! pip install -q -e '.[train]'

## Synthesis

### Set up hyperparameters

In [0]:
import hparams
import json

# Newly added params. Need to inject dummy values
for dummy, v in [("fmin", 0), ("fmax", 0), ("rescaling", False),
                 ("rescaling_max", 0.999), 
                 ("allow_clipping_in_normalization", False)]:
  if hparams.hparams.get(dummy) is None:
    hparams.hparams.add_hparam(dummy, v)
    
# Load parameters from preset
with open(preset) as f:
  hparams.hparams.parse_json(f.read())

# Select the multi-speaker DeepVoice3 model builder
hparams.hparams.builder = "deepvoice3_multispeaker"
  
# Inject frontend text processor
import synthesis
import train
from deepvoice3_pytorch import frontend
synthesis._frontend = getattr(frontend, "en")
train._frontend =  getattr(frontend, "en")

# Aliases
fs = hparams.hparams.sample_rate
hop_length = hparams.hparams.hop_size
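
The cell above follows a two-step pattern: register a placeholder for every key the older code does not know yet, then let the preset override the values. Here is a minimal sketch of the same idea with plain dictionaries. `merge_hparams` is a hypothetical helper written for illustration, and the values in the usage note are made up rather than taken from the preset.

```python
import json

def merge_hparams(current, defaults, preset_json):
    """Return a copy of `current` with missing keys filled from `defaults`,
    then overridden by the values found in the JSON string `preset_json`."""
    merged = dict(current)
    for key, value in defaults.items():
        # Only inject a placeholder when the key is absent (or None)
        if merged.get(key) is None:
            merged[key] = value
    merged.update(json.loads(preset_json))
    return merged
```

With `current = {"sample_rate": 16000}`, `defaults = {"fmin": 0}` and a preset of `'{"fmin": 125}'`, the placeholder `fmin` is added first and then replaced by the preset value, while `sample_rate` stays untouched.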

### Define utility functions

In [0]:
def tts(model, text, p=0, speaker_id=0, fast=True, figures=True):
  from synthesis import tts as _tts
  waveform, alignment, spectrogram, mel = _tts(model, text, p, speaker_id, fast)
  if figures:
    visualize(alignment, spectrogram)
  IPython.display.display(Audio(waveform, rate=fs))
  
def visualize(alignment, spectrogram):
  label_fontsize = 16
  figure(figsize=(16,16))

  subplot(2,1,1)
  imshow(alignment.T, aspect="auto", origin="lower", interpolation=None)
  xlabel("Decoder timestamp", fontsize=label_fontsize)
  ylabel("Encoder timestamp", fontsize=label_fontsize)
  colorbar()

  subplot(2,1,2)
  librosa.display.specshow(spectrogram.T, sr=fs, 
                           hop_length=hop_length, x_axis="time", y_axis="linear")
  xlabel("Time", fontsize=label_fontsize)
  ylabel("Hz", fontsize=label_fontsize)
  tight_layout()
  colorbar()
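
The spectrogram plot gets its time axis from `fs` and `hop_length`: each spectrogram frame advances the signal by `hop_length` samples. The conversion can be sketched as follows. `frames_to_seconds` is a hypothetical helper, and the numbers in the usage note are illustrative, not the values from the preset.

```python
def frames_to_seconds(n_frames, hop_length, sample_rate):
    """Duration in seconds covered by `n_frames` spectrogram frames."""
    return n_frames * hop_length / float(sample_rate)
```

For instance, 250 frames with a hop of 256 samples at 16000 Hz cover `250 * 256 / 16000 = 4.0` seconds. librosa offers `librosa.frames_to_time` for the same conversion.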

### Load the model checkpoint

In [0]:
from train import build_model
from train import restore_parts, load_checkpoint

model = build_model()
model = load_checkpoint(checkpoint_path, model, None, True)

### Generate speech

In [0]:
# Try your favorite sentences :)
# Set `text` to the sentence you want to synthesize.
text = ""
N = 15
print("Synthesizing \"{}\" with {} different speakers".format(text, N))
for speaker_id in range(N):
  print(speaker_id)
  tts(model, text, speaker_id=speaker_id, figures=False)
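
The `tts` helper above only plays the audio in the notebook. To keep a result, the waveform could be written to a WAV file. The sketch below uses only the standard library and assumes the samples are floats in the range [-1, 1]. That range is an assumption about the synthesizer output, and `save_wav` is a hypothetical helper, not part of deepvoice3_pytorch.

```python
import struct
import wave

def save_wav(path, samples, sample_rate):
    """Write float samples (assumed to lie in [-1, 1]) as 16-bit mono PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
```

Using it here would require `tts` to return the waveform, for example `save_wav("speaker_0.wav", waveform, fs)`.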

In [0]:
# For the Star Trek Fans
speaker_id = 13
texts = [
  "Command functions are offline",
  "Already in use",
  "Security clearance required",
  "Access denied",
  "Enter access code",
]
for text in texts:
  tts(model, text, speaker_id=speaker_id, figures=False)
  

For details, please visit https://github.com/r9y9/deepvoice3_pytorch