<a href="https://colab.research.google.com/github/hourglasshoro/research-notebook/blob/main/espnet2_tts_interjection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)

# ESPnet2-TTS realtime demonstration

This notebook demonstrates real-time end-to-end text-to-speech (E2E-TTS) using ESPnet2-TTS and ParallelWaveGAN (+ MelGAN).

- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1
- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN

Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi))

## Installation

In [None]:
# NOTE: pip reports incompatibility errors caused by preinstalled libraries; these can be ignored
!pip install -q espnet==0.9.7 parallel_wavegan==0.4.8
!pip install -q espnet_model_zoo

### (Optional)

If you want to try Japanese TTS, please run the following cell to install pyopenjtalk.

In [None]:
!mkdir tools && cd tools && git clone https://github.com/r9y9/hts_engine_API.git
!mkdir -p tools/hts_engine_API/src/build && cd tools/hts_engine_API/src/build && \
    cmake -DCMAKE_INSTALL_PREFIX=../.. .. && make -j && make install
!cd tools && git clone https://github.com/r9y9/open_jtalk.git
!mkdir -p tools/open_jtalk/src/build && cd tools/open_jtalk/src/build && \
    cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON \
        -DHTS_ENGINE_LIB=../../../hts_engine_API/lib \
        -DHTS_ENGINE_INCLUDE_DIR=../../../hts_engine_API/include .. && \
    make install
!cp tools/open_jtalk/src/build/*.so* /usr/lib64-nvidia
!cd tools && git clone https://github.com/r9y9/pyopenjtalk.git
!cd tools/pyopenjtalk && pip install .


## Single speaker model demo

### Model Selection

Select a model by uncommenting its lines in the cell below and commenting out the others.

English, Japanese, and Mandarin are supported.

Depending on the language, you can try Tacotron2, Transformer, FastSpeech, FastSpeech2, and Conformer FastSpeech2 as the text2mel model.  
As the vocoder model, you can use Parallel WaveGAN, Multi-band MelGAN, and (for English) full-band MelGAN.

In [None]:
###################################
#          ENGLISH MODELS         #
###################################
# fs, lang = 22050, "English"
# tag = "kan-bayashi/ljspeech_tacotron2"
# tag = "kan-bayashi/ljspeech_fastspeech"
# tag = "kan-bayashi/ljspeech_fastspeech2"
# tag = "kan-bayashi/ljspeech_conformer_fastspeech2"
# vocoder_tag = "ljspeech_parallel_wavegan.v1"
# vocoder_tag = "ljspeech_full_band_melgan.v2"
# vocoder_tag = "ljspeech_multi_band_melgan.v2"

###################################
#         JAPANESE MODELS         #
###################################
fs, lang = 24000, "Japanese"
tag = "kan-bayashi/jsut_tacotron2"
# tag = "kan-bayashi/jsut_transformer"
# tag = "kan-bayashi/jsut_fastspeech"
# tag = "kan-bayashi/jsut_fastspeech2"
# tag = "kan-bayashi/jsut_conformer_fastspeech2"
# tag = "kan-bayashi/jsut_conformer_fastspeech2_accent"
# tag = "kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause"
vocoder_tag = "jsut_parallel_wavegan.v1"
# vocoder_tag = "jsut_multi_band_melgan.v2"

###################################
#         MANDARIN MODELS         #
###################################
# fs, lang = 24000, "Mandarin"
# tag = "kan-bayashi/csmsc_tacotron2"
# tag = "kan-bayashi/csmsc_transformer"
# tag = "kan-bayashi/csmsc_fastspeech"
# tag = "kan-bayashi/csmsc_fastspeech2"
# tag = "kan-bayashi/csmsc_conformer_fastspeech2"
# vocoder_tag = "csmsc_parallel_wavegan.v1"
# vocoder_tag = "csmsc_multi_band_melgan.v2"
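
The comment toggling above can also be expressed as a lookup table. The following is a minimal sketch, not part of the original notebook: `MODEL_TABLE` and `select_models` are hypothetical names, only a subset of the tags listed above is included, and taking the first entry as the default is an arbitrary choice.

```python
# Hypothetical lookup table built from a subset of the tags in the cell above.
MODEL_TABLE = {
    "English": {
        "fs": 22050,
        "tags": ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech2"],
        "vocoder_tags": ["ljspeech_parallel_wavegan.v1", "ljspeech_multi_band_melgan.v2"],
    },
    "Japanese": {
        "fs": 24000,
        "tags": ["kan-bayashi/jsut_tacotron2", "kan-bayashi/jsut_fastspeech2"],
        "vocoder_tags": ["jsut_parallel_wavegan.v1", "jsut_multi_band_melgan.v2"],
    },
    "Mandarin": {
        "fs": 24000,
        "tags": ["kan-bayashi/csmsc_tacotron2", "kan-bayashi/csmsc_fastspeech2"],
        "vocoder_tags": ["csmsc_parallel_wavegan.v1", "csmsc_multi_band_melgan.v2"],
    },
}


def select_models(lang, tag_index=0, vocoder_index=0):
    """Return (fs, lang, tag, vocoder_tag) for the requested language."""
    entry = MODEL_TABLE[lang]
    return entry["fs"], lang, entry["tags"][tag_index], entry["vocoder_tags"][vocoder_index]


# Same selection as the uncommented lines in the cell above.
fs, lang, tag, vocoder_tag = select_models("Japanese")
print(fs, lang, tag, vocoder_tag)
```

With this layout, switching language or model is a change to the arguments rather than an edit to several commented lines.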

### Model Setup

In [None]:
import time
import torch
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import download_pretrained_model
from parallel_wavegan.utils import load_model
d = ModelDownloader()
text2speech = Text2Speech(
    **d.download_and_unpack(tag),
    device="cuda",
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2
    speed_control_alpha=1.0,
)
text2speech.spc2wav = None  # Disable griffin-lim
# NOTE: The download sometimes fails with "Permission denied". This is a
#   Google Drive limitation. Please retry after several hours.
vocoder = load_model(download_pretrained_model(vocoder_tag)).to("cuda").eval()
vocoder.remove_weight_norm()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
https://zenodo.org/record/3963886/files/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_train.loss.best.zip?download=1: 100%|██████████| 102M/102M [00:13<00:00, 7.92MB/s]
Downloading...
From: https://drive.google.com/uc?id=1qok91A6wuubuz4be-P9R2zKhNmQXG0VQ
To: /root/.cache/parallel_wavegan/jsut_parallel_wavegan.v1.tar.gz
15.5MB [00:00, 74.4MB/s]
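
Since the vocoder download can fail intermittently, as the note in the cell above mentions, one option is to wrap it in a retry loop. This is a generic sketch rather than anything ESPnet or parallel_wavegan provides: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions. A short retry will not help if the Google Drive limit takes hours to reset.

```python
import time


def retry(func, attempts=3, delay_sec=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay_sec` after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            if attempt == attempts:
                raise  # out of attempts: re-raise the last error
            print(f"attempt {attempt} failed ({err}); retrying in {delay_sec}s")
            time.sleep(delay_sec)


# Stand-in for a flaky download: fails twice, then succeeds.
state = {"calls": 0}


def flaky_download():
    state["calls"] += 1
    if state["calls"] < 3:
        raise PermissionError("Permission denied")
    return "model.tar.gz"


print(retry(flaky_download, attempts=3, delay_sec=0))
```

In this notebook the call would look like `retry(lambda: download_pretrained_model(vocoder_tag), delay_sec=60)`.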


### Synthesis

In [None]:
from scipy.io.wavfile import write

In [None]:
# -p creates parent directories and does not fail if the cell is re-run
!mkdir -p /content/data/e /content/data/eto /content/data/ano /content/data/a /content/data/ma

In [None]:
%cd /content/data/

/content/data


In [None]:
# enter the sentence to synthesize
print(f"Input your favorite sentence in {lang}.")
x = input()

# synthesis
with torch.no_grad():
    start = time.time()
    # text2mel: c holds the generated acoustic features (the vocoder input)
    wav, c, *_ = text2speech(x)
    # vocoder: acoustic features -> waveform
    wav = vocoder.inference(c)
# real-time factor: synthesis time divided by the duration of the generated audio
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")

# listen to the generated sample
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=fs))

Input your favorite sentence in Japanese.
あ
RTF = 0.195408
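
The RTF printed above is the real-time factor: the wall-clock synthesis time divided by the duration of the generated audio, so a value below 1 means synthesis runs faster than playback. A small sketch of the same arithmetic follows; `real_time_factor` is a hypothetical helper name, and the numbers in the example are made up for illustration.

```python
def real_time_factor(elapsed_sec, num_samples, fs):
    """Synthesis time divided by the duration of the generated audio."""
    audio_sec = num_samples / fs  # duration of the waveform in seconds
    return elapsed_sec / audio_sec


# Illustrative numbers: 0.5 s spent generating 48000 samples at 24000 Hz (2 s of audio).
print(real_time_factor(0.5, 48000, 24000))  # 0.25
```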


In [None]:
# synthesize 100 samples, cycling through three lengths of the interjection
texts = ['ま', 'まー', 'まーー']

for num in range(100):
    x = texts[num % 3]
    with torch.no_grad():
        wav, c, *_ = text2speech(x)
        wav = vocoder.inference(c)
    # written relative to the current directory (/content/data after the %cd above)
    write('ma' + str(num) + '.wav', fs, wav.view(-1).cpu().numpy())
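
The three inputs in the cell above differ only in the number of trailing prolonged sound marks (ー). The following sketch generates such variants and the matching file names programmatically. `elongated_variants` and `output_names` are hypothetical helpers, and the naming scheme simply mirrors the `'ma' + str(num) + '.wav'` pattern used above.

```python
def elongated_variants(base, max_marks=2):
    """Return base followed by 0..max_marks prolonged sound marks."""
    return [base + "ー" * n for n in range(max_marks + 1)]


def output_names(prefix, texts, count):
    """Pair each output file name with the text cycled from `texts`."""
    return [(f"{prefix}{i}.wav", texts[i % len(texts)]) for i in range(count)]


texts = elongated_variants("ま")
jobs = output_names("ma", texts, 100)
print(texts)     # ['ま', 'まー', 'まーー']
print(jobs[:2])  # [('ma0.wav', 'ま'), ('ma1.wav', 'まー')]
```

Each `(file name, text)` pair can then be passed through the synthesis step shown above.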

In [None]:
!zip -r data.zip /content/data
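
The archive step can also be done from Python with the standard library. This is a sketch under assumptions: `archive_directory` is a hypothetical helper, and the demo runs on a temporary directory with a placeholder file instead of `/content/data`. The archive is written outside the source directory so that it cannot end up inside itself.

```python
import shutil
import tempfile
from pathlib import Path


def archive_directory(src_dir, archive_base):
    """Zip src_dir into archive_base + '.zip' and return the archive path."""
    src = Path(src_dir)
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=str(src.parent), base_dir=src.name
    )


# Demo on a temporary directory standing in for /content/data.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "ma0.wav").write_bytes(b"")  # empty placeholder file
    archive_path = archive_directory(data_dir, Path(tmp) / "data_archive")
    print(Path(archive_path).name)  # data_archive.zip
```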