<a href="https://colab.research.google.com/github/bshall/Tacotron/blob/main/tacotron-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tacotron (with Dynamic Convolution Attention)

A PyTorch implementation of [Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis](https://arxiv.org/abs/1910.10288). 

Audio samples can be found [here](https://bshall.github.io/Tacotron/).

Demo for https://github.com/bshall/Tacotron

Install the necessary packages:

In [None]:
!pip install -q omegaconf
!pip install -q librosa==0.8.0
!pip install -q univoc
!pip install -q tacotron

In [None]:
import torch
import soundfile as sf
from univoc import Vocoder
from tacotron import load_cmudict, text_to_id, Tacotron
import matplotlib.pyplot as plt
from IPython.display import Audio

Download the pretrained weights for the vocoder and move the model to the GPU:

In [None]:
vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()

Download the pretrained weights for Tacotron and move the model to the GPU:

In [None]:
tacotron = Tacotron.from_pretrained(
    "https://github.com/bshall/Tacotron/releases/download/v0.1/tacotron-ljspeech-yspjx3.pt"
).cuda()

Load the CMU pronunciation dictionary and add an entry for "PyTorch", written in ARPAbet with stress digits on the vowels:

In [None]:
cmudict = load_cmudict()
cmudict["PYTORCH"] = "P AY1 T AO2 R CH"
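The pronunciation string is a space-separated sequence of ARPAbet symbols, where a trailing digit on a vowel marks lexical stress (1 primary, 2 secondary, 0 unstressed). The sketch below shows how such an entry breaks down. `split_arpabet` is a hypothetical helper written for illustration; it is not part of the `tacotron` package.

```python
def split_arpabet(pronunciation):
    """Split an ARPAbet string into (phoneme, stress) pairs.

    Stress is an int (0, 1 or 2) for vowels and None for consonants.
    Hypothetical helper for illustration; not part of the tacotron package.
    """
    pairs = []
    for symbol in pronunciation.split():
        if symbol[-1].isdigit():
            # Vowel: strip the stress digit and keep it as an int.
            pairs.append((symbol[:-1], int(symbol[-1])))
        else:
            # Consonant: carries no stress marker.
            pairs.append((symbol, None))
    return pairs


pytorch = split_arpabet("P AY1 T AO2 R CH")
print(pytorch)
```

Read this way, the entry added above places primary stress on "Py" (AY1) and secondary stress on "Torch" (AO2).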

The text to be synthesized:

In [None]:
text = "A PyTorch implementation of location-relative attention mechanisms for long-form speech synthesis."

Synthesize the audio!

In [None]:
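# Convert the text to ids and add a batch dimension of size one.
x = torch.LongTensor(text_to_id(text, cmudict)).unsqueeze(0).cuda()
with torch.no_grad():
    # mel is the predicted mel spectrogram; alpha holds the attention weights.
    mel, alpha = tacotron.generate(x)
    # The vocoder returns the waveform together with its sample rate.
    wav, sr = vocoder.generate(mel.transpose(1, 2))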

Listen to the result. IPython normalizes the audio by default, so playback is louder than the raw waveform would be:

In [None]:
Audio(wav, rate=sr)
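To see why normalized playback sounds louder, here is a minimal sketch of peak normalization in plain Python. It shows the general idea (rescale so the largest sample reaches full scale) and is not IPython's actual code; `peak_normalize` and the sample values are made up for illustration.

```python
def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0.

    Illustrative only: a sketch of peak normalization, not IPython's code.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence stays silence
    return [s / peak for s in samples]


quiet = [0.0, 0.1, -0.25, 0.2]  # made-up waveform peaking at 0.25
loud = peak_normalize(quiet)
print(max(abs(s) for s in loud))
```

A waveform that peaks well below full scale is boosted by the reciprocal of its peak, here a factor of four. Recent IPython versions also accept `Audio(wav, rate=sr, normalize=False)` to skip this step, provided the samples already lie in [-1, 1].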

Plot the attention matrix:

In [None]:
plt.imshow(alpha.squeeze().cpu().numpy(), vmin=0, vmax=0.8, origin="lower")
plt.xlabel("Decoder steps")
plt.ylabel("Encoder steps")
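A healthy alignment usually shows up in this plot as a roughly diagonal band: as decoding proceeds, attention moves forward through the input. The sketch below checks that property numerically on a toy matrix. It assumes the layout implied by the axis labels (rows are encoder steps, columns are decoder steps); `attended_positions`, `is_monotonic`, and the toy values are hypothetical and only illustrate the idea.

```python
def attended_positions(alpha):
    """For each decoder step (column), return the encoder step (row) with the largest weight."""
    n_enc = len(alpha)
    n_dec = len(alpha[0])
    return [max(range(n_enc), key=lambda i: alpha[i][t]) for t in range(n_dec)]


def is_monotonic(positions):
    """True if the attended encoder position never moves backwards."""
    return all(b >= a for a, b in zip(positions, positions[1:]))


# Toy attention matrix: 3 encoder steps (rows) by 4 decoder steps (columns).
toy_alpha = [
    [0.8, 0.3, 0.0, 0.0],
    [0.2, 0.6, 0.3, 0.1],
    [0.0, 0.1, 0.7, 0.9],
]
positions = attended_positions(toy_alpha)
print(positions, is_monotonic(positions))
```

For the real output, something like `alpha.squeeze().cpu().tolist()` could be passed in place of `toy_alpha`, under the same layout assumption.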