[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/interspeech2019-tutorial/blob/master/notebooks/interspeech2019_tts/interspeech2019_tts.ipynb)

# [T6] Advanced methods for neural end-to-end speech processing - unification, integration, and implementation -

### Part4 : Building End-to-End TTS System

Speaker: [**Tomoki Hayashi**](https://github.com/kan-bayashi)

Department of informatics, Nagoya University  
Human Dataware Lab. Co., Ltd.


Good afternoon, everyone.  
This is Tomoki Hayashi, a doctoral researcher at Nagoya University.  
From here, I will demonstrate the development of an E2E-TTS system in ESPnet.  
I will use Google Colaboratory for this hands-on.  
So please access the tutorial material page and open the TTS hands-on in your browser.

## Google colaboratory

**OUR HANDS-ON NOTEBOOK URL: https://bit.ly/2kz7wGD**

- Online Jupyter notebook environment
    - Can run Python code
    - Can also run Linux commands with a `!` prefix
    - Can use a single GPU (K80)
- What you need to use
    - Internet connection
    - Google account
    - Chrome browser (recommended)

Most of you probably know Google Colaboratory, so I will explain it only briefly.  
Google Colaboratory is an online Jupyter notebook environment.  
We can run Python code, and also Linux commands by putting an exclamation mark in front of them.  
What you need is an internet connection, a Google account, and the Chrome browser.  
It may work in other browsers, but we have not checked them, so we recommend Chrome.

## Usage of Google colaboratory

<div align=center>
    <img src=figs/colab_usage.png width=80%>
</div>

- Do not close the browser
- Do not let your laptop go to sleep

In [None]:
# example of the commands
print("hello, world.")
!echo "hello, world"

This is how to use Google Colaboratory.  
Basically it is the same as a Jupyter notebook.  
You can run a cell by clicking this button or pressing Ctrl+Enter.  
You can also add a new code or text cell by clicking these buttons.

## TOC

0. Installation
1. Introduction of ESPnet TTS
2. Demonstration of the ESPnet TTS recipe
3. Demonstration of the use of TTS pretrained models
4. Demonstration of the use of ASR pretrained models
5. Conclusion

This is the table of contents of my presentation.

## 0. Installation
 
It takes around 3 minutes. Please keep waiting for a while.


In [None]:
# OS setup
!cat /etc/os-release
!apt-get install -qq bc tree sox

# espnet setup
!git clone --depth 5 https://github.com/espnet/espnet
!pip install -q torch==1.1
!cd espnet; pip install -q -e .

# download pre-compiled warp-ctc and kaldi tools
!espnet/utils/download_from_google_drive.sh \
    "https://drive.google.com/open?id=13Y4tSygc8WtqzvAVGK_vRV9GlV7TRC0w" espnet/tools tar.gz > /dev/null
!cd espnet/tools/warp-ctc/pytorch_binding && \
    pip install -U dist/warpctc_pytorch-0.1.1-cp36-cp36m-linux_x86_64.whl

# make dummy activate
!mkdir -p espnet/tools/venv/bin && touch espnet/tools/venv/bin/activate
!echo "setup done."

First, installation.  
Please run the cell and keep waiting for a while.  

## 1. Introduction of ESPnet TTS

- Follow the [Kaldi](https://github.com/kaldi-asr/kaldi) style recipe
- Support three E2E-TTS models and their variants
- Support four corpora covering English, Japanese, Italian, Spanish, and German
- Support pretrained WaveNet-vocoder (Softmax and MoL version)

Samples are available in https://espnet.github.io/espnet-tts-sample/

During the installation, I will introduce the ESPnet TTS system.  
TTS recipes also follow the Kaldi style, and most of the preprocessing steps are exactly the same as in ASR.  
Currently we support three E2E-TTS models and their variants.  
We support four corpora covering English, Japanese, Italian, Spanish, and German.  
We also support a pretrained WaveNet vocoder, so you can generate higher-quality speech.  

### Supported E2E-TTS models

- [**Tacotron 2**](https://arxiv.org/abs/1712.05884): Standard Tacotron 2
- [**Multi-speaker Tacotron2**](https://arxiv.org/pdf/1806.04558.pdf): Pretrained speaker embedding (X-vector) + Tacotron 2
- [**Transformer**](https://arxiv.org/pdf/1809.08895.pdf): TTS-Transformer
- **Multi-speaker Transformer**: Pretrained speaker embedding (X-vector) + TTS-Transformer
- [**FastSpeech**](https://arxiv.org/pdf/1905.09263.pdf): Feed-forward TTS-Transformer


We support the following E2E-TTS models:  
Tacotron 2, multi-speaker Tacotron 2 using a pretrained speaker embedding, Transformer, multi-speaker Transformer with a pretrained speaker embedding, and FastSpeech, also known as the feed-forward Transformer.

### Other remarkable functions

- [**CBHG** (Convolutional Bank Highway network Gated recurrent unit)](https://arxiv.org/pdf/1703.10135.pdf): Network to convert Mel-filter bank to linear spectrogram
- [**Forward attention**](https://arxiv.org/pdf/1807.06736.pdf): Attention mechanism with causal regularization
- [**Guided attention loss**](https://arxiv.org/pdf/1710.08969.pdf): Loss function to force attention to be diagonal

As other remarkable functions, we support CBHG (convolutional bank, highway network, gated recurrent unit), which converts a mel-filterbank into a linear spectrogram; forward attention, which has a causal regularization; and guided attention loss, which forces the attention to be diagonal.  
During training, we can combine these functions to train E2E-TTS models.
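
To make the idea of "forcing attention to be diagonal" concrete, here is a minimal plain-Python sketch of a guided attention penalty in the spirit of the guided attention paper linked above. The function names, the `sigma` default of 0.2, and the toy matrices are assumptions for illustration; this is not ESPnet's implementation.

```python
import math


def guided_attention_weights(n_text, n_frames, sigma=0.2):
    """Penalty matrix W[t][n]: 0 on the diagonal, approaching 1 far away from it."""
    return [
        [1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * sigma ** 2))
         for n in range(n_text)]
        for t in range(n_frames)
    ]


def guided_attention_loss(att_ws, sigma=0.2):
    """Mean of attention weight times penalty; att_ws is [n_frames][n_text]."""
    n_frames, n_text = len(att_ws), len(att_ws[0])
    penalty = guided_attention_weights(n_text, n_frames, sigma)
    total = sum(att_ws[t][n] * penalty[t][n]
                for t in range(n_frames) for n in range(n_text))
    return total / (n_frames * n_text)


# Toy attention matrices (made-up values): a diagonal one and a flat one.
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
uniform = [[0.25] * 4 for _ in range(4)]
print("diagonal:", guided_attention_loss(diagonal))
print("uniform: ", guided_attention_loss(uniform))
```

A diagonal alignment receives no penalty, while attention mass far from the diagonal increases the loss, which is what pushes the model toward a diagonal alignment.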

### Supported corpora

- [`egs/jsut/tts1`](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): Japanese single female speaker. (48 kHz, ~10 hours)
- [`egs/libritts/tts1`](http://www.openslr.org/60/): English multi speakers (24 kHz, ~500 hours).
- [`egs/ljspeech/tts1`](https://keithito.com/LJ-Speech-Dataset/): English single female speaker (22.05 kHz, ~24 hours).
- [`egs/m_ailabs/tts1`](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/): Various language speakers (16 kHz, 16~48 hours).

Here is the list of supported corpora and recipes:  
JSUT, LibriTTS, LJSpeech, and the M-AILABS speech dataset.  
You can build TTS models for various languages with ESPnet!

## 2. Demonstration of ESPnet-TTS recipes
 
Here we use the recipe `egs/an4/tts1` as an example.  

Unfortunately, the AN4 corpus used in `egs/an4/tts1` is too small to train a good model,
but the flow itself is the same as in the other recipes.

You have probably finished the installation by now, right?  
Let us move on to the demonstration of the ESPnet TTS recipe.  
Here we use the recipe `egs/an4/tts1`, which is a TTS version of `egs/an4/asr1`.  
AN4 is provided by CMU; it is free and suitable for a demonstration.  
Unfortunately, AN4 is too small to train a good TTS model, but the flow itself is the same as in the other recipes.  
So you can understand the basic flow of TTS recipes.

We always organize each recipe placed in `egs/xxx/tts1` in the same manner:

- `run.sh`: Main script of the recipe.
- `cmd.sh`: Command configuration script that controls how each job is run.
- `path.sh`: Path configuration script. Basically, we do not need to touch it.
- `conf/`: Directory containing configuration files.
- `local/`: Directory containing the recipe-specific scripts, e.g. data preparation.
- `steps/` and `utils/`: Directories containing Kaldi tools.

We always organize each recipe placed in `egs/xxx/tts1` in the same manner.  
Let us move to the recipe directory and check the files. Please run the following cell.  
Each recipe contains the main script `run.sh`, the command configuration script `cmd.sh`, and the path configuration script `path.sh`.  
There are also some directories: `conf` contains the configuration files, `local` contains the recipe-specific scripts such as data preparation, and `steps` and `utils` contain Kaldi tools.  

In [None]:
# move to the recipe directory
import os
os.chdir("espnet/egs/an4/tts1")

# check files
!tree -L 1

Main script `run.sh` consists of several stages:

<div align=center>
    <img src=figs/tts_brief_overview.png width=75%>
</div>

- **stage -1**: Download data if the data is available online.
- **stage 0**: Prepare data to make Kaldi-style data directories.
- **stage 1**: Extract feature vector, calculate statistics, and normalize.
- **stage 2**: Prepare a dictionary and make json files for training.
- **stage 3**: Train the E2E-TTS network.
- **stage 4**: Decode mel-spectrogram using the trained network.
- **stage 5**: Generate a waveform using Griffin-Lim.

**Stages -1 to 2 are the same as in the ASR** recipe.

The main script `run.sh` consists of stages -1 to 5, seven stages in total:  
data download, Kaldi-style data preparation, feature extraction, conversion to json format, training, decoding, and synthesis.  
Stages 0 and 1 are performed with Kaldi and librosa, and stages 3 and 4 are performed with PyTorch.  
Stages -1 to 2 are the same as in ASR.  
We have a unified flow for both TTS and ASR.
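
The way `--stage` and `--stop_stage` select which parts of `run.sh` are executed can be sketched in a few lines of Python. This is only an illustration of the selection logic, assuming each stage runs when its number lies between the two options; the `STAGES` table and `stages_to_run` helper are hypothetical names, not part of ESPnet.

```python
# Stage numbers and names as listed for run.sh above.
STAGES = {
    -1: "data download",
    0: "data preparation",
    1: "feature extraction",
    2: "dictionary and json preparation",
    3: "network training",
    4: "network decoding",
    5: "waveform synthesis",
}


def stages_to_run(stage, stop_stage):
    """Return the stage numbers executed for the given --stage / --stop_stage."""
    return [k for k in sorted(STAGES) if stage <= k <= stop_stage]


# `./run.sh --stage -1 --stop_stage -1` would run only the download stage.
for k in stages_to_run(-1, -1):
    print(k, STAGES[k])

# `./run.sh --stage 3 --stop_stage 5` would run training, decoding, and synthesis.
for k in stages_to_run(3, 5):
    print(k, STAGES[k])
```

Running one stage at a time, as this hands-on does, makes it easy to inspect the files each stage produces before moving on.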

### Detailed overview
<div align=center>
    <img src=figs/tts_overview.png width=90%>
</div>

This figure shows the detailed flow of each recipe.  
I will introduce each stage step-by-step.

### Stage -1: Data download

This stage downloads the corpus if it is available online.

<div align=center>
    <img src=figs/tts_stage-1.png width=80%>
</div>

First, stage -1: data download.  
This stage downloads the corpus from the cloud if the data is available online.

In [None]:
# run stage -1 and then stop
!./run.sh --stage -1 --stop_stage -1

Please run the cell. And keep waiting for a while.

The `downloads` directory is created, which contains the downloaded AN4 dataset.

In [None]:
!tree -L 2 downloads

Downloaded files are saved in the `downloads` directory.  
Let me check the files.  
You can see that the corpus is stored in the directory.

### Stage 0: Data preparation

This stage prepares kaldi-style data directories.

<div align=center>
    <img src=figs/tts_stage0.png width=80%>
</div>

Next, stage 0 data preparation.  
This stage prepares the kaldi-style data directories.

In [None]:
# run stage 0 and then stop
!./run.sh --stage 0 --stop_stage 0

Please run the cell.

Two kaldi-style data directories are created:  
- `data/train`: data directory of training set
- `data/test`: data directory of evaluation set  

In [None]:
!tree -L 2 data

Through this stage, 2 kaldi-style data directories are created under the `data` directory.  
Let me check the files.

`wav.scp`: 
- Each line has `<utt_id> <wavfile_path or command pipe>`
- `<utt_id>` must be unique

`text`:
- Each line has `<utt_id> <transcription>`
- Assume that `<transcription>` is cleaned

`utt2spk`:
- Each line has `<utt_id> <speaker_id>`

`spk2utt`:
- Each line has `<speaker_id> <utt_id> ... <utt_id> `
- Can be automatically created from `utt2spk` 

In ESPnet, speaker information is **not used for any processing.**   
Therefore, **`utt2spk` and `spk2utt` can be dummies.**

In [None]:
!head -n 3 data/train/*

Most of you are probably Kaldi specialists, so you know what these files are.  
So let us check the files only briefly. Please run the cell.  
These fundamental files are the same as in Kaldi, but there is one important point.  
That is, in ESPnet, speaker information is not used for any processing.  
Therefore `utt2spk` and `spk2utt` can be dummies.  
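
Since `spk2utt` is just the inverse mapping of `utt2spk`, it can be derived mechanically (Kaldi ships `utils/utt2spk_to_spk2utt.pl` for this). The following is a small Python sketch of the same inversion; the `make_spk2utt` helper and the utterance IDs are hypothetical and only illustrate the file formats described above.

```python
from collections import defaultdict


def make_spk2utt(utt2spk_lines):
    """Invert `<utt_id> <speaker_id>` lines into `<speaker_id> <utt_id> ...` lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        utt_id, spk_id = line.split(maxsplit=1)
        spk2utt[spk_id].append(utt_id)
    # sort speakers and utterances so the output order is deterministic
    return [f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(spk2utt.items())]


# Hypothetical utterance and speaker IDs, only for illustration.
utt2spk = ["spk1-utt1 spk1", "spk2-utt1 spk2", "spk1-utt2 spk1"]
for line in make_spk2utt(utt2spk):
    print(line)
```

For a dummy setup, every utterance could simply be assigned to the same placeholder speaker before running this inversion.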

### Stage 1: Feature extraction

This stage performs feature extraction, statistics calculation and normalization.

<div align=center>
    <img src=figs/tts_stage1.png width=80%>
</div>

Next, stage 1: feature extraction.  
This stage performs feature extraction, calculation of the statistics of the feature vectors, and feature normalization using the calculated statistics.

In [None]:
# hyperparameters related to stage 1
!head -n 28 run.sh | tail -n 8

Let me check the hyperparameters related to stage 1.  
In TTS, we use the mel-filterbank as speech features, and they are extracted with librosa.

In [None]:
# run stage 1 with default settings
!./run.sh --stage 1 --stop_stage 1 --nj 4

Please run the stage 1.

Raw filterbanks are saved in `fbank/` directory with `ark/scp` format. 

- `.ark`: binary file of feature vector
- `.scp`: list of the correspondence between `<utt_id>` and `<path_in_ark>`.  

Since feature extraction can be performed for split small sets in parallel, raw_fbank is split into `raw_fbank_*.{1..4}.{scp,ark}`.

In [None]:
!tree -L 2 fbank

In [None]:
!head -n 3 fbank/raw_fbank_train.1.scp

Through stage 1, the extracted raw filterbanks are saved in the `fbank` directory in ark/scp format.  
Let me check the files.  
`.ark` is a binary file containing the feature vectors, and `.scp` is the list of the correspondence between `utt_id` and `path_in_ark`.  
Since the feature extraction is performed in parallel, the ark/scp files are split into several pieces.
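
To illustrate what one line of a feature `.scp` file holds, here is a small sketch that splits an entry into its parts. It assumes the common Kaldi convention that `<path_in_ark>` has the form `<ark_path>:<byte_offset>`; the `parse_scp_line` helper and the example entry are hypothetical.

```python
def parse_scp_line(line):
    """Split `<utt_id> <ark_path>:<byte_offset>` into its three parts."""
    utt_id, location = line.strip().split(maxsplit=1)
    # split on the last colon so that colons inside the path are preserved
    ark_path, _, offset = location.rpartition(":")
    return utt_id, ark_path, int(offset)


# Hypothetical entry; the real path and offset depend on your environment.
entry = "utt-0001 /content/fbank/raw_fbank_train.1.ark:13"
utt_id, ark_path, offset = parse_scp_line(entry)
print(utt_id, ark_path, offset)
```

The offset is what lets a reader jump directly to one utterance inside a large binary `.ark` file without scanning it from the beginning.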

These files can be loaded in python via a great tool **kaldiio** as follows:

In [None]:
import kaldiio
import matplotlib.pyplot as plt

# load scp file
scp_dict = kaldiio.load_scp("fbank/raw_fbank_train.1.scp")
for key in scp_dict:
    plt.imshow(scp_dict[key].T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break
    
# load ark file
ark_generator = kaldiio.load_ark("fbank/raw_fbank_train.1.ark")
for key, array in ark_generator:
    plt.imshow(array.T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break

These ark/scp format files can be loaded as numpy.array, thanks to the great tool kaldiio created by Mr. Kamo.  
Please run the cell. An scp file can be loaded as a dict-like object, and an ark file can be loaded as a generator.

Some files are added in `data/train`:
- `feats.scp`: concatenated scp file of `fbank/raw_fbank_train.{1..4}.scp`.  
- `utt2num_frames`: Each line has `<utt_id> <number_of_frames>` .

In [None]:
!tree data/train

In [None]:
!head -n 3 data/train/*

Through feature extraction, some files are added in `data` directories.  
Let me check these files.  
feats.scp is the concatenated scp file of scp files in fbank directory.  
utt2num_frames has the information of number of frames of each utterance.

`data/train/` directory is split into two directories:
- `data/train_nodev/`: data directory for training
- `data/train_dev/`: data directory for validation


In [None]:
!tree data/train_*

Then the `data/train` directory is split into two directories,  
`data/train_nodev` and `data/train_dev`. These are for training and validation, respectively.  
Let me check the files. You can see that both directories have the same files.

<div align=center>
    <img src=figs/tts_stage1.png width=80%>
</div>

So far, I have explained this part.  
Next is statistics calculation and normalization.

`cmvn.ark` is saved in `data/train_nodev`, which is the statistics file.  
(cepstral mean variance normalization: `cmvn`)  
This file also can be loaded in python via kaldiio.

In [None]:
!tree data/train_nodev

The statistics file is saved as `cmvn.ark` in the `data/train_nodev` directory.  
Let me check the file.  
This file can also be loaded as numpy.array via kaldiio.
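
The idea behind the statistics file can be sketched as follows: accumulate per-dimension statistics over all training frames, then use the resulting mean and standard deviation to normalize features, and invert the same operation to denormalize them. This sketch uses plain Python lists to stay dependency-free (in practice numpy arrays would be used); the function names and toy values are assumptions and do not reproduce the exact layout of `cmvn.ark`.

```python
import math


def compute_cmvn_stats(utterances):
    """Accumulate per-dimension mean and standard deviation over all frames."""
    dim = len(utterances[0][0])
    count, total, total_sq = 0, [0.0] * dim, [0.0] * dim
    for frames in utterances:
        for frame in frames:
            count += 1
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
    mean = [s / count for s in total]
    # floor the variance to avoid division by zero for constant dimensions
    std = [math.sqrt(max(sq / count - m * m, 1e-20)) for sq, m in zip(total_sq, mean)]
    return mean, std


def normalize(frames, mean, std):
    """Map features to zero mean and unit variance per dimension."""
    return [[(v - m) / s for v, m, s in zip(frame, mean, std)] for frame in frames]


def denormalize(frames, mean, std):
    """Inverse of normalize: map normalized features back to the original scale."""
    return [[v * s + m for v, m, s in zip(frame, mean, std)] for frame in frames]


# Two toy "utterances" of 2-dimensional features (made-up numbers).
utts = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0]]]
mean, std = compute_cmvn_stats(utts)
normed = normalize(utts[0], mean, std)
restored = denormalize(normed, mean, std)
print(mean, std)
print(normed)
print(restored)
```

Note that the statistics are computed on the training set only and then reused for the validation and evaluation sets.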

Normalized features for train, dev, and eval sets are dumped in
- `dump/{train_nodev,train_dev,test}/*.{ark,scp}`.  

These ark and scp files can be loaded in the same way as above.

In [None]:
!tree dump/*

And finally, the normalized features for the training, validation, and evaluation sets are saved in the `dump` directory in ark/scp format.  
Let me check the files.  
We also support online normalization during training, but to reduce CPU computation we basically normalize the features in advance and dump them like this.  
We could also use Kaldi pipe processing for normalization, but in ESPnet we explicitly dump the feature vectors to make it more Python friendly.

### Stage 2: Dictionary and json preparation

This stage creates a character dictionary and integrates the files into a single json file.

<div align=center>
    <img src=figs/tts_stage2.png width=80%>
</div>

Next, stage 2: dictionary and json data preparation.  
This stage creates a character dictionary and integrates the Kaldi-style directories into a single json file.

In [None]:
# run stage 2 and then stop
!./run.sh --stage 2 --stop_stage 2

Please run the stage 2. 

- Dictionary file is created in `data/lang_1char/`.  
- Dictionary file consists of `<token>` `<token index>`.  
    - `<token index>` starts from 1 because 0 is used as padding index.


In [None]:
!tree data/lang_1char

In [None]:
!cat data/lang_1char/train_nodev_units.txt

Through this stage, the dictionary file is created in `data/lang_1char/`.  
Let me check the file.  
The dictionary file consists of tokens and token indices.  
The index starts from 1 because 0 is used as the padding index in TTS.  

Please be careful: in the case of ASR, 0 is used for the blank symbol of CTC.
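
As a rough illustration of how such a character dictionary can be built, the sketch below collects the characters in the transcriptions and numbers them from 1 so that 0 stays free for padding. The `<unk>` and `<space>` symbols and the `build_char_dict` helper are assumptions for this example; check the generated `train_nodev_units.txt` for the actual tokens.

```python
def build_char_dict(transcriptions, unk="<unk>", space="<space>"):
    """Map each character seen in the transcriptions to an index starting at 1."""
    chars = set()
    for text in transcriptions:
        # represent whitespace with an explicit symbol so it survives as a token
        chars.update(space if c == " " else c for c in text)
    tokens = [unk] + sorted(chars)
    # enumerate from 1 so that index 0 stays free for padding
    return {token: idx for idx, token in enumerate(tokens, start=1)}


# Toy transcriptions, only for illustration.
char_dict = build_char_dict(["GO", "YES NO"])
for token, idx in char_dict.items():
    print(token, idx)
```

Keeping index 0 unused means padded positions in a batch can never be confused with a real token.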

Three json files are created for train, dev, and eval sets as 
- `dump/{train_nodev,train_dev,test}/data.json`.

In [None]:
!tree dump -L 2

The json files for the training, validation, and evaluation sets are created under the `dump` directory.  
Let me check the files. You can see the `data.json` in each directory.

Each json file contains all of the information in the data directory.

- `shape`: Shape of the input or output sequence.
- `text`: Original transcription.
- `token`: Token sequence of the transcription.
- `tokenid`: Token id sequence of the transcription, converted with `dict`.

In [None]:
!head -n 27 dump/train_nodev/data.json

Let me check the content of the json file. Please run the cell.  
Each json file contains all of the information in the Kaldi-style data directory,  
for example, shape, text, token, tokenid, spk, and so on.  
Some of you may have noticed that input and output are reversed from the TTS point of view.  
This is because we use exactly the same json file for TTS as for ASR.  
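
The following sketch shows how the fields listed above relate to each other for one utterance: the transcription is split into tokens, the tokens are mapped to ids with the dictionary, and the shapes record the sequence lengths. The key names (`input`, `output`, `name`, `feat`), the toy dictionary, and the feature path are assumptions for illustration; compare with the output of `head` on your own `data.json` for the real layout.

```python
import json

# Toy dictionary; the real one is read from data/lang_1char/.
char_dict = {"<unk>": 1, "<space>": 2, "E": 3, "N": 4, "O": 5, "S": 6, "Y": 7}


def text_to_tokens(text):
    """Split a transcription into character tokens, making spaces explicit."""
    return ["<space>" if c == " " else c for c in text]


def make_entry(text, feat_path, n_frames, n_mels):
    """Build one utterance record with shape, text, token, and tokenid fields."""
    tokens = text_to_tokens(text)
    token_ids = [char_dict.get(t, char_dict["<unk>"]) for t in tokens]
    return {
        # speech features sit under "input" because the layout is shared with ASR
        "input": [{"name": "input1", "feat": feat_path, "shape": [n_frames, n_mels]}],
        "output": [{
            "name": "target1",
            "shape": [len(tokens), len(char_dict)],
            "text": text,
            "token": " ".join(tokens),
            "tokenid": " ".join(str(i) for i in token_ids),
        }],
    }


# Hypothetical feature location and sizes.
entry = make_entry("YES NO", "dump/train_nodev/feats.1.ark:13", 120, 80)
print(json.dumps(entry, indent=4))
```

For TTS training, the roles are simply swapped when the data is loaded: the token ids become the model input and the speech features become the target.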

Sorry to have kept you waiting. Now we are ready to start training!

Now ready to start training!

### Stage 3: Network training

This stage trains E2E-TTS network.

<div align=center>
    <img src=figs/tts_stage3.png width=80%>
</div>

Next, Stage 3 E2E-TTS training.  
This stage trains E2E-TTS network using prepared json files.

Training setting can be specified by `train_config`.

In [None]:
# check hyperparameters in run.sh
!head -n 31 run.sh | tail -n 2

The training configuration is written in a `.yaml` file.  
Let us check the default configuration `conf/train_pytorch_tacotron2.yaml`.

The network configuration can be specified with the `--train_config` option.  
The configuration file is written in yaml format.  
Let me check the default configuration `conf/train_pytorch_tacotron2.yaml`.

In [None]:
!cat conf/train_pytorch_tacotron2.yaml

There are several hyperparameters.  
The default values are based on the official paper.  
But these values are a little bit big for demonstration.  
So let us change the hyperparameters by editing yaml file.

Let's change the hyperparameters.

In [None]:
# load configuration yaml
import yaml
with open("conf/train_pytorch_tacotron2.yaml") as f:
    params = yaml.load(f, Loader=yaml.Loader)

# change hyperparameters by yourself!
params.update({
    "embed-dim": 16,
    "elayers": 1, 
    "eunits": 16,
    "econv-layers": 1,
    "econv-chans": 16,
    "econv-filts": 5,
    "dlayers": 1,
    "dunits": 16,
    "prenet-layers": 1,
    "prenet-units": 16,
    "postnet-layers": 1,
    "postnet-chans": 16,
    "postnet-filts": 5,
    "adim": 16,
    "aconv-chans": 16,
    "aconv-filts": 5,
    "reduction-factor": 5,
    "batch-size": 128,
    "epochs": 5,
    "report-interval-iters": 10,
})

# save
with open("conf/train_pytorch_tacotron2_mini.yaml", "w") as f:
    yaml.dump(params, f, Dumper=yaml.Dumper)

# check modified version
!cat conf/train_pytorch_tacotron2_mini.yaml

Here we show how to edit it in Python, but of course you can edit it with your favorite editor such as vim.  
We save the modified yaml file as `train_pytorch_tacotron2_mini.yaml`.

Also, we provide `transformer` and `fastspeech` configs.  

In [None]:
!cat ../../ljspeech/tts1/conf/tuning/train_pytorch_transformer.v1.yaml

In [None]:
!cat ../../ljspeech/tts1/conf/tuning/train_fastspeech.v2.yaml

We can easily switch the model to be trained by only changing `--train_config`.  
(NOTE: FastSpeech needs a teacher model, pretrained Transformer)

We also provide the Transformer and FastSpeech configs.  
Let me check them.  
But before explaining these configurations, let us run the training using the modified config.  
Please move on to the next cell.

(After return)  
During the training, I will explain the Transformer and FastSpeech configs.  
Please go back to the previous cell.  
These are the Transformer and FastSpeech configs.  
We can easily switch the model to train just by changing `--train_config`.  
Compared to Tacotron 2, Transformer is more difficult to train due to its interesting attention weight behavior and its requirement for a large batch size.  
But it enables the training of FastSpeech, which makes super fast generation possible.  
I will show you the behavior later in the pretrained model demonstration.

Let's train the network.  
You can specify the config file via `--train_config` option.  
It takes several minutes.

In [None]:
# use modified configuration file as train config
!./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_mini.yaml --verbose 1

OK. Let us train the network with the modified configuration!  
You can specify the config file via the `--train_config` option like this.  
It takes several minutes. Please wait a while.

You can see the training log in `exp/train_*/train.log`.

In [None]:
!cat exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/train.log

OK. Maybe the training was finished, right?
Let me check the training log.

The models are saved in `exp/train_*/results/` directory.

- `exp/train_*/results/model.loss.best`: contains only the model parameters.  
- `exp/train_*/results/snapshot.ep.*`: contains the model parameters, optimizer states, and iterator states. 

In [None]:
!tree -L 1 exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results

The models are saved in the `exp/train_*/results` directory.  
`model.*.best` contains only the model parameters, whereas a snapshot contains the model parameters, optimizer states, and iterator states.

`exp/train_*/results/*.png` are the figures of training curve.  
Let us check them.

In [None]:
from IPython.display import Image, display_png
print("all loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/all_loss.png", width=500))
print("l1 loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/l1_loss.png", width=500))
print("mse loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/mse_loss.png", width=500))
print("bce loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/bce_loss.png", width=500))

Also, the training curve figures are saved in results directory.  
Let me check the figures.  
These figures are continuously updated during training, so you can monitor with these figures.

`exp/train_*/results/att_ws/*.png` are the figures of attention weights in each epoch.  
In the case of E2E-TTS, it is very important to check that they are diagonal.

In [None]:
print("Attention weights of initial epoch")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/att_ws/fash-cen1-b.ep.1.png", width=500))

Also, the figures of attention weights for the validation data are saved every epoch in the `results/att_ws` directory.  
Let me check the figure.  
In TTS, monitoring the attention weights is very important for checking that training is going well.  
In this figure, the attention weights are not diagonal, which means the model cannot generate speech properly.  

Example of good diagonal attention weights:
<div align=center>
    <img src=figs/ex_attention_weights.png width=60%>
</div>
We should monitor whether the attention weight becomes like this figure.

Here is an example of good attention weights.  
As you can see, the attention weights are diagonal and causal.  
We should monitor whether the attention weights become like this figure during training.
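
Besides looking at the figures, a rough automatic check is also possible. The sketch below takes, for each output frame, the input position with the largest attention weight and tests whether this path only moves forward in small steps. It is a simple heuristic written for this tutorial, not a function provided by ESPnet; the helper names and the `max_jump` parameter are hypothetical.

```python
def attention_path(att_ws):
    """For each output frame, return the index of the most-attended input token."""
    return [max(range(len(row)), key=row.__getitem__) for row in att_ws]


def looks_monotonic(att_ws, max_jump=1):
    """Heuristic: the attended position never moves backward or skips far ahead."""
    path = attention_path(att_ws)
    return all(0 <= b - a <= max_jump for a, b in zip(path, path[1:]))


# Toy attention matrices of shape [output frames][input tokens] (made-up values).
good = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
bad = [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
print(attention_path(good), looks_monotonic(good))
print(attention_path(bad), looks_monotonic(bad))
```

A check like this cannot replace inspecting the plots, but it can help flag suspicious utterances when there are too many figures to look at by hand.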

Also, we support tensorboard.  
You can see the training log through tensorboard.

In [None]:
# only available in colab
%load_ext tensorboard
%tensorboard --logdir tensorboard/train_nodev_pytorch_train_pytorch_tacotron2_mini/

We also support TensorBoard.  
You can see the training curves and the attention weights explained above through TensorBoard.

### Stage 4: Network decoding

This stage performs decoding with trained model.

<div align=center>
    <img src=figs/tts_stage4.png width=80%>
</div>

Then, next, Stage 4 E2E-TTS decoding.  
This stage performs decoding using trained E2E-TTS model.

Decoding parameters can be specified by `--decode_config`.

In [None]:
!head -n 32 run.sh | tail -n 1

The decoding configuration is written in a `.yaml` file.  
Let us check the default configuration `conf/decode.yaml`.

In [None]:
!cat conf/decode.yaml

Decoding parameters can be specified with `--decode_config`.  
Let me check the default decoding yaml file `conf/decode.yaml`.  
`threshold` is the threshold for stopping feature generation, and the others are for avoiding endless generation.  
In contrast to ASR, the TTS decoding parameters do not need to be tuned carefully.
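
The role of these decoding parameters can be sketched as a stopping rule: generation ends when the predicted stop probability reaches the threshold, with a length cap that prevents endless generation and a lower bound that prevents stopping too early. This is a simplified sketch under assumptions: the function name, the `minlenratio` and `maxlenratio` parameter names, and the default values are illustrative, so check `conf/decode.yaml` for the actual settings.

```python
def num_decoder_steps(stop_probs, input_length, threshold=0.5,
                      minlenratio=0.0, maxlenratio=10.0):
    """Return how many decoder steps run before generation stops."""
    max_len = int(input_length * maxlenratio)  # cap to avoid endless generation
    min_len = int(input_length * minlenratio)  # lower bound on the output length
    for step, prob in enumerate(stop_probs, start=1):
        reached_stop = prob >= threshold or step >= max_len
        if reached_stop and step >= min_len:
            return step
    return len(stop_probs)


# Made-up stop probabilities predicted at each decoder step.
stop_probs = [0.1, 0.2, 0.7, 0.9]
print(num_decoder_steps(stop_probs, input_length=2))
print(num_decoder_steps(stop_probs, input_length=2, maxlenratio=1.0))
print(num_decoder_steps(stop_probs, input_length=2, minlenratio=2.0))
```

Because the stop decision comes from the model itself, these values mostly act as safety limits, which is why they rarely need careful tuning.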

In [None]:
# run stage 4 and then stop
!./run.sh --stage 4 --stop_stage 4 --nj 8 --verbose 1 --train_config conf/train_pytorch_tacotron2_mini.yaml 

Please run the stage 4. It takes several seconds.

(it takes time, be careful.)

Generated features are saved as `ark/scp` format.  
Also figures of attention weights and stop probabilities are saved as `{att_ws/probs}/*.png`.

In [None]:
!tree -L 2 exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/outputs_model.loss.best_decode/

Generated features are saved in ark/scp format.  
Attention weights and stop probabilities are also saved as figures.  
Let me check the files.  
By checking the attention weights, we can tell whether the generation succeeded or suffered from repetition or deletion.  
Again, in TTS, checking the attention weights is very important.

### Stage 5: Waveform synthesis

This stage synthesizes waveform with Griffin-Lim.

<div align=center>
    <img src=figs/tts_stage5.png width=80%>
</div>

This is the final stage, waveform synthesis.  
This stage performs denormalization of the generated features and then synthesizes a waveform with Griffin-Lim.

In [None]:
# run stage 5 and then stop
!./run.sh --stage 5 --stop_stage 5 --nj 8 \
    --griffin_lim_iters 8 \
    --train_config conf/train_pytorch_tacotron2_mini.yaml

Please run stage 5.

Generated wav files are saved in 
- `exp/train_nodev_pytorch_*/outputs_model.loss.best_decode_denorm/{train_dev,test}/wav/`

In [None]:
!tree -L 2 exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/*_denorm

Generated waveforms are saved here.  
Let me check the files.  
You can see the generated wav files.  
Now you have finished building your own E2E-TTS model!  
Unfortunately, this model cannot generate good speech.  
So let us listen to the samples on the demo page.

Now you have finished building your own E2E-TTS model!

But unfortunately, this model cannot generate good speech.  
Let us listen to the samples on the demo page to check the quality.  
https://espnet.github.io/espnet-tts-sample/

## 3. Demonstration of the use of TTS pretrained models

We provide pretrained TTS models and these are easy to use with `espnet/utils/synth_wav.sh`.

In [None]:
# move to the directory
os.chdir("../../librispeech/asr1")
!pwd

OK. Let us move on to the next demonstration,  
the use of TTS pretrained models.  
ESPnet provides some utility tools for using pretrained models.  
Here I will introduce how to use them.  
Please run the cell to move to the directory.

Let us check the usage of `espnet/utils/synth_wav.sh`.  
It automatically downloads the pretrained model, so you do not need to prepare anything.  

In [None]:
!../../../utils/synth_wav.sh --help

To use the TTS pretrained models, we provide `synth_wav.sh`.  
Let me check its usage.  
This script automatically downloads the specified pretrained model and then performs generation.  
So you do not need to prepare anything.

Let us generate your own text with pretrained models!

In [None]:
# generate your sentence!
!rm -rf decode/example
print("Please input your favorite sentence!")
text = input()
text = text.upper()
with open("example.txt", "w") as f:
    f.write(text)

# you can change here to select the pretrained model
# !../../../utils/synth_wav.sh --models ljspeech.fastspeech.v1 example.txt
# !../../../utils/synth_wav.sh --models ljspeech.tacotron2.v3 example.txt
!../../../utils/synth_wav.sh --models ljspeech.transformer.v1 example.txt

# check generated audio
from IPython.display import display, Audio, Image, display_png
display(Audio("decode/example/wav/example.wav"))
!sox decode/example/wav/example.wav -n rate 22050 spectrogram
display_png(Image("spectrogram.png", width=750))

# check attention and probs
if os.path.exists("decode/example/outputs/att_ws/example_att_ws.png"):
    display_png(Image("decode/example/outputs/att_ws/example_att_ws.png", width=1000))
    display_png(Image("decode/example/outputs/probs/example_prob.png", width=500))

Let us synthesize your sentence. Please run the cell.  
First we use the pretrained FastSpeech.  
Let me check the sample.  
Next let me use Tacotron 2.  
You can see the attention weights and stop probabilities.  
The attention weights are diagonal, which is evidence of successful generation.  
Finally, let us use Transformer.  
You can see many attention weights.  
Interestingly, only some of the heads are diagonal in Transformer,  
and the attention is not continuous.

Also you can try the wavenet vocoder, but it takes time to decode.

In [None]:
# generate your sentence!
!rm -rf decode/example_short
print("Please input your favorite sentence!")
text = input()
text = text.upper()
with open("example_short.txt", "w") as f:
    f.write(text)
    
# extend stop_stage
!../../../utils/synth_wav.sh --stop_stage 4 --models ljspeech.tacotron2.v3 example_short.txt

# check generated audio
display(Audio("decode/example_short/wav/example_short.wav"))
display(Audio("decode/example_short/wav_wnv/example_short_gen.wav"))

You can also try the WaveNet vocoder, but it takes time.  
Let us generate a very short sentence.  
You can confirm that the naturalness improves significantly while the pronunciation itself stays the same.

## 4. Demonstration of the use of ASR pretrained models

ESPnet also provides the `espnet/utils/recog_wav.sh` to use pretrained ASR models.  
Let us recognize the generated speech!

In [None]:
!../../../utils/recog_wav.sh --help

ESPnet also provides the utility `recog_wav.sh` for using pretrained ASR models.  
Let us check the usage.  
This script also downloads the pretrained ASR model automatically.  
So you do not need to prepare anything.

In [None]:
# downsample to 16 kHz for ASR model
!sox decode/example/wav/example.wav -b 16 decode/example/wav/example_16k.wav rate 16k

# make decode config
import yaml
with open("conf/decode_sample.yaml", "w") as f:
    yaml.dump({
        "batchsize": 0,
        "beam-size": 5,
        "ctc-weight": 0.4,
        "lm-weight": 0.6,
        "maxlenratio": 0.0,
        "minlenratio": 0.0,
        "penalty": 0.0,
    }, f, Dumper=yaml.Dumper)

# let's recognize generated speech
!../../../utils/recog_wav.sh --models librispeech.transformer.v1 \
    --decode_config conf/decode_sample.yaml \
    decode/example/wav/example_16k.wav

Here, let us recognize the generated speech.  
The LibriSpeech Transformer model is big, so it takes several seconds.  
Yes, the model successfully recognized the generated speech.

## 5. Conclusion

- Can build E2E-TTS models with unified-design recipe
- Can try various models by just changing the yaml file

Through ESPnet, you can build / use E2E-TTS and E2E-ASR in the same manner!

Thank you for your attention!

[*Go to the next notebook from here!*](https://colab.research.google.com/github/espnet/interspeech2019-tutorial/blob/master/notebooks/interspeech2019_asr/interspeech2019_asr.ipynb)