## Tacotron 2 inference code
Edit the variables **checkpoint_path**, **waveglow_path** and **text** to match yours, then run the entire notebook to synthesize audio from the generated mel-spectrogram using WaveGlow and its denoiser.

#### Import libraries and setup matplotlib

In [49]:
import matplotlib
%matplotlib inline
import matplotlib.pylab as plt

import IPython.display as ipd

import sys
sys.path.append('waveglow/')
import numpy as np
import torch

from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT, STFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence
from denoiser import Denoiser

In [50]:
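def plot_data(data, figsize=(16, 4)):
    """Plot each 2-D array in `data` side by side as an image."""
    # squeeze=False keeps `axes` 2-D, so a single-item `data` also works
    fig, axes = plt.subplots(1, len(data), figsize=figsize, squeeze=False)
    for i in range(len(data)):
        axes[0][i].imshow(data[i], aspect='auto', origin='lower',
                          interpolation='none')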

#### Setup hparams

In [51]:
hparams = create_hparams()
hparams.sampling_rate = 22050
hparams.max_decoder_steps = 10000
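
The decoder step cap bounds how long a single synthesized piece can be. As a rough sketch of that arithmetic, assuming each decoder step emits one mel frame and consecutive frames are `hop_length` samples apart: the values `hop_length=256` and `n_frames_per_step=1` below are assumptions, not read from this notebook, and `max_audio_seconds` is a hypothetical helper, so check them against your own `hparams`.

```python
def max_audio_seconds(max_decoder_steps, hop_length, sampling_rate,
                      n_frames_per_step=1):
    """Upper bound on audio length implied by the decoder step cap."""
    # Each decoder step yields n_frames_per_step mel frames, and each
    # frame advances the waveform by hop_length samples.
    samples = max_decoder_steps * n_frames_per_step * hop_length
    return samples / sampling_rate


# hop_length=256 is an assumed value; replace it with hparams.hop_length.
print(round(max_audio_seconds(10000, 256, 22050), 1))  # prints 116.1
```

Under those assumptions, a cap of 10000 steps allows roughly two minutes of audio per piece, which is far more than a single sentence needs.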

#### Load model from checkpoint

In [52]:
checkpoint_path = "/home/makthird/Desktop/jarviz/models/attenborough_checkpoint_824800"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.cuda().eval().half()

#### Load WaveGlow for mel2audio synthesis and denoiser

In [53]:
waveglow_path = '/home/makthird/Desktop/jarviz/models/waveglow_256channels_universal_v5.pt'
waveglow = torch.load(waveglow_path)['model']
waveglow.cuda().eval().half()
for k in waveglow.convinv:
    k.float()
denoiser = Denoiser(waveglow)



#### Prepare text input

In [54]:
text = """
In spirituality we call this the 'spiritual ego', or 'the spiritual ego trap' and it's a nasty little bastard to put it mildly. It creeps up on you in the guise of something good, but turns out not to be under closer inspection.

At first, you're proud of yourself for taking the effort to look after yourself, but after some time you can soak in this pride and it ends up becoming its own thing. You stop meditating and pursuing whatever other practices you have, not because they're good for you. But because they make you feel superior to others, and it's sometimes quite hard to differentiate when you're in the thick of it yourself. You feel good, confident and empowered but is it because you are looking after yourself? Or, is it because you're constantly feeding your ego?

You ask yourself, do I feel confident because I'm detaching from other people's opinions of me, or because I spend so much time doing this that I feel better than everybody else? With a lack of self-awareness, it's very hard to tell the difference. Especially if you don't have any previous experience of looking inward.

Thankfully there are tons of resources out there to combat it, Buddhists have known about it for as long as it's existed. Knowing that it actually exists is a good way of staying away from it, and thankfully, if you're in those sorts of communities anyway, it is well known about.
"""

# Flatten the text onto one line, then split it on periods into
# sentence-sized pieces so that each piece is synthesized separately.

text = text.strip().replace("\n", " ").replace("\r", " ")

text_pieces = [t.strip() for t in text.split(".") if t.strip()]
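
Splitting on `"."` alone drops the periods and leaves question marks inside a piece. A minimal alternative sketch, using only the standard library, splits after `.`, `?` or `!` and keeps the terminating mark. `split_sentences` is a hypothetical helper, not part of the Tacotron 2 code, and it will also split abbreviations such as "e.g. this", so treat it as a starting point.

```python
import re


def split_sentences(text):
    """Split text into sentences, keeping each terminating mark."""
    # Collapse every whitespace run (including newlines) into one space.
    flat = " ".join(text.split())
    # Split on whitespace that directly follows '.', '?' or '!'.
    pieces = re.split(r"(?<=[.?!])\s+", flat)
    # Drop the empty string produced when the input is empty.
    return [p for p in pieces if p]
```

For example, `split_sentences("Hello there.\n\nIs it you? Yes!")` returns `['Hello there.', 'Is it you?', 'Yes!']`, and the result could be assigned to `text_pieces` in place of the period-based split.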

In [69]:
def inference(text, silence_padding=True):
    """
        Take text input to return a numpy array of the audio waveform.
        If silence_padding is set, append one second of silence.
    """

    # Convert text into a batch holding one sequence of symbol IDs
    sequence = np.array(
        text_to_sequence(text, ['english_cleaners'])
    )[None, :]
    sequence = torch.from_numpy(sequence).cuda().long()

    # Run both models without tracking gradients
    with torch.no_grad():
        # Infer mel spectrogram from the input sequence
        mel_outputs, mel_outputs_postnet, _, alignments = model.inference(
            sequence
        )

        # Vocode the mel spectrogram with WaveGlow, then denoise
        audio = waveglow.infer(mel_outputs_postnet, sigma=1)
        audio_denoised = denoiser(audio, strength=0.005)[:, 0]

    # Move to CPU and convert from half precision to float32
    audio_wav_arr = audio_denoised.cpu().numpy()[0].astype(np.float32)

    if silence_padding:
        # One second of silence is sampling_rate zero-valued samples
        silence = np.zeros(hparams.sampling_rate, dtype=np.float32)
        audio_wav_arr = np.concatenate((audio_wav_arr, silence))

    return audio_wav_arr


#### Synthesize audio from text pieces

In [86]:
# Synthesize each piece, then join the fragments into one float32 waveform
speech_fragments = [inference(piece) for piece in text_pieces]
speech_sample = np.concatenate(speech_fragments).astype(np.float32)

In [89]:
ipd.Audio(speech_sample, rate=hparams.sampling_rate)

In [91]:
from scipy.io.wavfile import write

write(
    'sample.wav',
    hparams.sampling_rate,
    speech_sample
)
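
Some players do not handle floating-point WAV files well. If that is a concern, one option is to store the samples as 16-bit PCM instead. The sketch below uses only the standard library `wave` and `struct` modules. `write_pcm16_wav` is a hypothetical helper, and it assumes the samples are floats nominally in the range [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip first so out-of-range values do not wrap around after scaling.
    ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        # "<h" is little-endian signed 16-bit, the byte order WAV uses.
        wav_file.writeframes(struct.pack("<%dh" % len(ints), *ints))
```

It could be called as `write_pcm16_wav('sample_pcm16.wav', speech_sample, hparams.sampling_rate)`. The list comprehension is slow for very long waveforms, so this is meant as an illustration rather than a tuned replacement for `scipy.io.wavfile.write`.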