# Speech Synthesis

* https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

## Tacotron 2

* Generates a mel spectrogram from text

* https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/

<img src="https://pytorch.org/assets/images/tacotron2_diagram.png" alt="Tacotron 2 diagram" width="50%"/>
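
Before Tacotron 2 can generate a mel spectrogram, the input text has to be turned into a sequence of integer symbol IDs. The sketch below only illustrates that idea: `SYMBOLS` and `toy_text_to_sequence` are hypothetical names invented here, and the symbol table does not reproduce the model's actual symbol set or its text cleaners.

```python
# Toy illustration of text -> integer ID encoding (not the real Tacotron 2 table).
SYMBOLS = list("_ abcdefghijklmnopqrstuvwxyz!',.?")  # hypothetical set; "_" stands for padding
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}

def toy_text_to_sequence(text):
    """Lowercase the text and map each known character to its integer ID.

    Characters that are not in the toy symbol table are silently skipped.
    """
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]

print(toy_text_to_sequence("Hi!"))
```

The resulting list of IDs plays the same role as the `sequence` tensor that is passed to the model later in this notebook.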

## WaveGlow

* Generates speech audio from a mel spectrogram

* https://pytorch.org/hub/nvidia_deeplearningexamples_waveglow/

<img src="https://pytorch.org/assets/images/waveglow_diagram.png" alt="WaveGlow diagram" width="50%"/>
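
Since WaveGlow turns each mel frame into a stretch of waveform samples, the number of mel frames determines how long the generated audio is. A rough sketch of that arithmetic follows. The hop length of 256 samples is an assumption (check the model configuration), 22050 Hz matches the sampling rate used later in this notebook, and `mel_frames_to_seconds` is a hypothetical helper.

```python
def mel_frames_to_seconds(n_frames, hop_length=256, sampling_rate=22050):
    """Approximate audio duration for a mel spectrogram with n_frames frames.

    Assumes each frame advances the waveform by hop_length samples.
    """
    n_samples = n_frames * hop_length
    return n_samples / sampling_rate

print(round(mel_frames_to_seconds(200), 2))
```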

In [1]:
!pip install numpy scipy librosa unidecode inflect

Collecting unidecode
  Downloading Unidecode-1.2.0-py2.py3-none-any.whl (241 kB)
     |████████████████████████████████| 241 kB 3.2 MB/s
Installing collected packages: unidecode
Successfully installed unidecode-1.2.0


* Load the Tacotron 2 and WaveGlow models pretrained on the LJ Speech dataset

In [2]:
import torch

tacotron2 = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
waveglow = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')

Downloading: "https://github.com/nvidia/DeepLearningExamples/archive/torchhub.zip" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_fp32/versions/19.09.0/files/nvidia_tacotron2pyt_fp32_20190427


RuntimeError: ignored
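
The following cells move both models to `'cuda'`, which only works when PyTorch can see a GPU. A minimal sketch for checking this up front is shown here. `pick_device` is a hypothetical helper, and falling back to the CPU is an assumption about the desired behavior, not something the original notebook does.

```python
import importlib.util

def pick_device():
    """Return 'cuda' if PyTorch is installed and sees a GPU, otherwise 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The returned string could then be passed to `.to(device)` in place of a hard-coded `'cuda'`.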

In [None]:
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

In [None]:
waveglow = waveglow.remove_weightnorm(waveglow) # WaveGlow is chained after Tacotron 2 as the back end, so strip weight norm for inference
waveglow = waveglow.to('cuda')
waveglow.eval()

## Text To Speech (TTS)

In [3]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')  # renamed to 'seaborn-v0_8-white' in Matplotlib 3.6

def plot_data(data, figsize=(16,4)):
    fig, axes = plt.subplots(1, len(data), figsize=figsize)
    for i in range(len(data)):
        axes[i].imshow(data[i], aspect='auto', origin='lower',
                       interpolation='none', cmap='viridis')

def TTS(text):

    sampling_rate = 22050

    sequence = np.array(tacotron2.text_to_sequence(text, ['english_cleaners']))[None,:]
    sequence = torch.from_numpy(sequence).to(device='cuda', dtype=torch.int64)

    with torch.no_grad():
        mel_outputs, mel_outputs_postnet, _, alignments = tacotron2.infer(sequence)
        audio = waveglow.infer(mel_outputs_postnet)

    mel_output = mel_outputs.data.cpu().numpy()[0]
    mel_output_postnet = mel_outputs_postnet.data.cpu().numpy()[0]
    alignment = alignments.data.cpu().numpy()[0].T
    audio_np = audio[0].data.cpu().numpy()

    return mel_output, mel_output_postnet, alignment, audio_np, sampling_rate

In [None]:
import librosa.display
from IPython.display import Audio

text = 'Hello, how are you?'
mel_output, mel_output_postnet, alignment, audio_np, sampling_rate = TTS(text)

fig = plt.figure(figsize=(14, 4))
# waveplot was removed in librosa 0.10; use librosa.display.waveshow there
librosa.display.waveplot(audio_np, sr=sampling_rate)
plot_data((mel_output, mel_output_postnet, alignment))
Audio(audio_np, rate=sampling_rate)

In [None]:
text = 'What do you think about speech synthesis?'
mel_output, mel_output_postnet, alignment, audio_np, sampling_rate = TTS(text)

fig = plt.figure(figsize=(14, 4))
# waveplot was removed in librosa 0.10; use librosa.display.waveshow there
librosa.display.waveplot(audio_np, sr=sampling_rate)
plot_data((mel_output, mel_output_postnet, alignment))
Audio(audio_np, rate=sampling_rate)
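
To keep the synthesized audio beyond the notebook session, the samples can be written to a WAV file. The sketch below uses only Python's standard `wave` module. `save_wav` is a hypothetical helper, it assumes the samples are floats in the range [-1, 1] (values outside are clipped), and the sine tone is a stand-in for real model output.

```python
import io
import math
import struct
import wave

def save_wav(target, samples, sampling_rate=22050):
    """Write float samples (assumed to lie in [-1, 1]) as 16-bit mono PCM.

    target may be a file path or a writable, seekable binary file object.
    """
    pcm = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))       # clip to the assumed range
        pcm += struct.pack("<h", int(x * 32767))  # little-endian signed 16-bit
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)                # mono
        wav_file.setsampwidth(2)                # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(pcm))

# Stand-in signal: one second of a 440 Hz tone instead of real model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
buffer = io.BytesIO()
save_wav(buffer, tone)
```

With the variables from the cells above, a call such as `save_wav('speech.wav', audio_np, sampling_rate)` would write the result to disk, where `'speech.wav'` is just an example file name.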