# Demo zero-shot TTS with YourTTS

This Colab notebook does essentially the same thing as [this demo from Coqui.ai, which you can play with here](https://coqui.ai/). Coqui have developed a model that can capture the "essence" of your voice and then give you a representation of what you might sound like if you could speak French or Brazilian Portuguese. It's a little limited in that respect, but it's interesting to play with nonetheless.

There are quite a few cells to run just to get set up...

## TTS Model setup

### Download and install Coqui TTS


In [2]:
!git clone https://github.com/Edresson/Coqui-TTS -b multilingual-torchaudio-SE TTS
!pip install -q -e TTS/
!pip install -q torchaudio==0.9.0

fatal: destination path 'TTS' already exists and is not an empty directory.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
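
The `fatal: destination path 'TTS' already exists` line above just means the cell was run a second time. If you want to avoid it, the clone could be guarded roughly like this. It's a sketch: `clone_if_missing` is a hypothetical helper, and its `run` parameter exists only so the function can be exercised without touching the network.

```python
import os
import subprocess

def clone_if_missing(repo_url, branch, dest, run=subprocess.run):
    """Clone repo_url into dest unless dest already exists and is non-empty."""
    if os.path.isdir(dest) and os.listdir(dest):
        return False  # nothing to do, the checkout is already there
    run(["git", "clone", repo_url, "-b", branch, dest], check=True)
    return True

# In the notebook this would be called as:
# clone_if_missing("https://github.com/Edresson/Coqui-TTS",
#                  "multilingual-torchaudio-SE", "TTS")
```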


### Download TTS Checkpoint

In [3]:
# TTS checkpoints

# download config  
! gdown --id 1-PfXD66l1ZpsZmJiC-vhL055CDSugLyP
# download language json 
! gdown --id 1_Vb2_XHqcC0OcvRF82F883MTxfTRmerg
# download speakers json
! gdown --id 1SZ9GE0CBM-xGstiXH2-O2QWdmSXsBKdC -O speakers.json
# download checkpoint
! gdown --id 1sgEjHt0lbPSEw9-FSbC_mBoOPwNi87YR -O best_model.pth.tar  

Downloading...
From: https://drive.google.com/uc?id=1-PfXD66l1ZpsZmJiC-vhL055CDSugLyP
To: /content/config.json
100% 12.3k/12.3k [00:00<00:00, 5.25MB/s]
Downloading...
From: https://drive.google.com/uc?id=1_Vb2_XHqcC0OcvRF82F883MTxfTRmerg
To: /content/language_ids.json
100% 47.0/47.0 [00:00<00:00, 39.9kB/s]
Downloading...
From: https://drive.google.com/uc?id=1SZ9GE0CBM-xGstiXH2-O2QWdmSXsBKdC
To: /content/speakers.json
100% 671k/671k [00:00<00:00, 44.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1sgEjHt0lbPSEw9-FSbC_mBoOPwNi87YR
To: /content/best_model.pth.tar
100% 380M/380M [00:26<00:00, 16.9MB/s]


### Imports

In [4]:
import sys
TTS_PATH = "TTS/"

# add libraries into environment
sys.path.append(TTS_PATH) # set this if TTS is not installed globally

import os
import string
import time
import argparse
import json

import numpy as np
import IPython
from IPython.display import Audio


import torch

from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
# both branches of the original try/except imported the same name, so a plain import is equivalent
from TTS.utils.audio import AudioProcessor


from TTS.tts.models import setup_model
from TTS.config import load_config
from TTS.tts.models.vits import *

### Paths definition

In [5]:
OUT_PATH = 'out/'

# create output path
os.makedirs(OUT_PATH, exist_ok=True)

# model vars 
MODEL_PATH = 'best_model.pth.tar'
CONFIG_PATH = 'config.json'
TTS_LANGUAGES = "language_ids.json"
TTS_SPEAKERS = "speakers.json"
USE_CUDA = torch.cuda.is_available()

### Restore model

In [6]:
# load the config
C = load_config(CONFIG_PATH)


# load the audio processor
ap = AudioProcessor(**C.audio)

speaker_embedding = None

C.model_args['d_vector_file'] = TTS_SPEAKERS
C.model_args['use_speaker_encoder_as_loss'] = False

model = setup_model(C)
model.language_manager.set_language_ids_from_file(TTS_LANGUAGES)
# print(model.language_manager.num_languages, model.embedded_language_dim)
# print(model.emb_l)
cp = torch.load(MODEL_PATH, map_location=torch.device('cpu'))
# remove speaker encoder
model_weights = cp['model'].copy()
for key in list(model_weights.keys()):
  if "speaker_encoder" in key:
    del model_weights[key]

model.load_state_dict(model_weights)


model.eval()

if USE_CUDA:
    model = model.cuda()

# vocoder flag; note that the synthesis() call further down passes use_griffin_lim=True directly
use_griffin_lim = False

 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:False
 | > do_amp_to_db_mel:True
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Using model: vits
 > Speaker manager is loaded with 6 speakers: female-en-5, female-en-5
, female-pt-4
, male-en-2, male-en-2
, male-pt-3
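
The key-filtering step in the cell above (dropping the speaker encoder weights before `load_state_dict`) can be tried in isolation on a plain dictionary. This is a minimal sketch: `drop_keys_containing` is a hypothetical helper, and the key names are made up for illustration, not taken from the real checkpoint.

```python
def drop_keys_containing(state_dict, needle):
    """Return a copy of state_dict without the entries whose key contains needle."""
    return {key: value for key, value in state_dict.items() if needle not in key}

# Hypothetical checkpoint keys, for illustration only.
fake_weights = {
    "text_encoder.emb.weight": 1,
    "speaker_encoder.layer1.weight": 2,
    "waveform_decoder.conv_pre.bias": 3,
}
filtered = drop_keys_containing(fake_weights, "speaker_encoder")
print(sorted(filtered))
```

Building a new dictionary rather than deleting in place leaves the loaded checkpoint untouched, which is handy if you want to inspect it afterwards.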



## Speaker encoder setup

### Install helper libraries

In [7]:
! pip install -q pydub ffmpeg-normalize

### Paths definition

In [8]:
CONFIG_SE_PATH = "config_se.json"
CHECKPOINT_SE_PATH = "SE_checkpoint.pth.tar"

# download config 
! gdown --id  19cDrhZZ0PfKf2Zhr_ebB-QASRw844Tn1 -O $CONFIG_SE_PATH
# download checkpoint  
! gdown --id   17JsW6h6TIh7-LkU2EvB_gnNrPcdBxt7X -O $CHECKPOINT_SE_PATH

Downloading...
From: https://drive.google.com/uc?id=19cDrhZZ0PfKf2Zhr_ebB-QASRw844Tn1
To: /content/config_se.json
100% 3.49k/3.49k [00:00<00:00, 7.14MB/s]
Downloading...
From: https://drive.google.com/uc?id=17JsW6h6TIh7-LkU2EvB_gnNrPcdBxt7X
To: /content/SE_checkpoint.pth.tar
100% 44.6M/44.6M [00:00<00:00, 122MB/s] 


### Imports

In [9]:
from TTS.tts.utils.speakers import SpeakerManager
from pydub import AudioSegment
from google.colab import files
import librosa

### Load the Speaker encoder

In [10]:
SE_speaker_manager = SpeakerManager(encoder_model_path=CHECKPOINT_SE_PATH, encoder_config_path=CONFIG_SE_PATH, use_cuda=USE_CUDA)

 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400


### Define helper function

In [11]:
def compute_spec(ref_file):
  y, sr = librosa.load(ref_file, sr=ap.sample_rate)
  spec = ap.spectrogram(y)
  spec = torch.FloatTensor(spec).unsqueeze(0)
  return spec

## TTS

### Upload, normalize and resample your reference wav files



Here you can upload a `.wav` file of your voice. You don't have to use your own voice, although that's the fun bit. If you'd rather not, try finding something else: for example, Wikimedia Commons [has a bunch of clips from famous speeches](https://commons.wikimedia.org/wiki/Category:Audio_files_of_speeches) which can be fun to try. Many are in `.ogg` format, so you'd need to convert them first, for instance with [FFmpeg](https://www.ffmpeg.org/).
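
If you go the FFmpeg route, the conversion could be scripted along these lines. This is a sketch rather than part of the original notebook: `ogg_to_wav_command` is a hypothetical helper, the file names are placeholders, and it assumes the `ffmpeg` binary is installed and on your `PATH`. The 16 kHz default matches the rate the normalization cell below resamples to.

```python
import subprocess

def ogg_to_wav_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that converts src to a mono WAV at sample_rate Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]

cmd = ogg_to_wav_command("speech.ogg", "speech.wav")
print(" ".join(cmd))
# To actually run the conversion (requires ffmpeg to be installed):
# subprocess.run(cmd, check=True)
```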

In [17]:
print("Select speaker reference audio files:")
reference_files = files.upload()
reference_files = list(reference_files.keys())
for sample in reference_files:
    !ffmpeg-normalize $sample -nt rms -t=-27 -o $sample -ar 16000 -f

Select speaker reference audio files:


Saving save_america.wav to save_america.wav


### Compute embedding

In [18]:
reference_emb = SE_speaker_manager.compute_d_vector_from_clip(reference_files)
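
Note that `reference_files` is a list, so you can upload several clips of the same speaker. As far as I can tell, the speaker manager then averages the per-clip embeddings into a single vector; treat that as an assumption and check the source of the fork you installed. The idea, shown on made-up three-dimensional vectors (`mean_embedding` is a hypothetical helper, and real d-vectors are much longer):

```python
def mean_embedding(embeddings):
    """Average a list of equal-length vectors element by element."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]

clip_a = [1.0, 0.0, 2.0]  # invented numbers, not real d-vectors
clip_b = [3.0, 2.0, 0.0]
print(mean_embedding([clip_a, clip_b]))
```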

### Define inference variables

In [19]:
model.length_scale = 1  # scaler for the duration predictor. The larger it is, the slower the speech.
model.inference_noise_scale = 0.3 # defines the noise variance applied to the random z vector at inference.
model.inference_noise_scale_dp = 0.3 # defines the noise variance applied to the duration predictor z vector at inference.
text = "It took me quite a long time to develop a voice and now that I have it I am not going to be silent."
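
To get a feel for `length_scale`, here is a rough sketch of how a duration scaler of this kind typically works in VITS-style models: each predicted per-token duration is multiplied by the scale and rounded up to a whole number of frames. The durations below are invented, both helpers are hypothetical, and the frame-to-seconds conversion assumes the `hop_length` (256) and `sample_rate` (16000) printed by the audio processor above.

```python
import math

def scaled_frames(durations, length_scale):
    """Scale per-token durations (in frames) and round each up to a whole frame."""
    return [math.ceil(d * length_scale) for d in durations]

def frames_to_seconds(n_frames, hop_length=256, sample_rate=16000):
    """Convert a frame count to seconds of audio."""
    return n_frames * hop_length / sample_rate

predicted = [2.0, 3.5, 1.0]  # made-up durations, not model output
for scale in (1.0, 1.5):
    frames = scaled_frames(predicted, scale)
    print(scale, frames, frames_to_seconds(sum(frames)))
```

A larger scale produces more frames per token, which is why the speech comes out slower.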

### Choose language id

This next cell just shows you the language options. The ids are zero-based:

- 0 = English (`en`)
- 1 = French (`fr-fr`)
- 2 = Brazilian Portuguese (`pt-br`)

In [20]:
model.language_manager.language_id_mapping

{'en': 0, 'fr-fr': 1, 'pt-br': 2}

Assign 0, 1 or 2 to `language_id` below:

In [23]:
language_id = 0
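
Rather than remembering the numbers, you could look the id up by language code. A small sketch, using a copy of the mapping printed above; in the notebook itself you would read `model.language_manager.language_id_mapping` instead, and `language_id_for` is a hypothetical helper.

```python
# Copy of the mapping shown in the cell output above.
language_id_mapping = {"en": 0, "fr-fr": 1, "pt-br": 2}

def language_id_for(code, mapping=language_id_mapping):
    """Return the integer id for a language code, with a readable error otherwise."""
    try:
        return mapping[code]
    except KeyError:
        raise ValueError(
            f"unknown language {code!r}; choose from {sorted(mapping)}"
        ) from None

language_id = language_id_for("pt-br")
print(language_id)
```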

### Synthesis

In [24]:
print(" > text: {}".format(text))
wav, alignment, _, _ = synthesis(
                    model,
                    text,
                    C,
                    "cuda" in str(next(model.parameters()).device),
                    ap,
                    speaker_id=None,
                    d_vector=reference_emb,
                    style_wav=None,
                    language_id=language_id,
                    enable_eos_bos_chars=C.enable_eos_bos_chars,
                    use_griffin_lim=True,
                    do_trim_silence=False,
                ).values()
print("Generated Audio")
IPython.display.display(Audio(wav, rate=ap.sample_rate))
file_name = text.replace(" ", "_")
file_name = file_name.translate(str.maketrans('', '', string.punctuation.replace('_', ''))) + '.wav'
out_path = os.path.join(OUT_PATH, file_name)
print(" > Saving output to {}".format(out_path))
ap.save_wav(wav, out_path)

 > text: It took me quite a long time to develop a voice and now that I have it I am not going to be silent.
Generated Audio


 > Saving output to out/It_took_me_quite_a_long_time_to_develop_a_voice_and_now_that_I_have_it_I_am_not_going_to_be_silent.wav
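
The file-naming step above can be pulled out into a small function, which also makes it easy to cap the length for long sentences. This is a sketch: `text_to_filename` and its `max_chars` cap are additions for illustration, not part of the original notebook.

```python
import string

def text_to_filename(text, max_chars=100, extension=".wav"):
    """Turn text into a file name: spaces become underscores, punctuation is dropped."""
    stem = text.replace(" ", "_")
    # remove every punctuation character except the underscore
    stem = stem.translate(str.maketrans("", "", string.punctuation.replace("_", "")))
    # cap the stem so very long sentences do not produce unwieldy file names
    return stem[:max_chars] + extension

print(text_to_filename("Hello, world!"))
```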
