# Coqui Testing

This is a notebook just hacking around with Coqui TTS.
Intention is to try the various models and see what's what.

I found that there wasn't a whole lot of examples out there for using the various models. Maybe this will be helpful to you. Maybe not.

Note that for my own purposes, I just searched for a Jennifer Garner video on youtube and used one that I found as a source voice. In case there are leftover references to Jennifer Garner in this notebook.

All of these examples were run on an Nvidia RTX 4070ti GPU. 

| Model | Processing Time (s) | VRAM Usage | Quality (H/M/L rated by me) | Voice Cloning |
|-------|---------------------|------------|-----------------------------|---------------|
| XttsV2 | 12.4s               |  2.3 GB     |   H                   | Y |
| Speedy Speech |  2s  | 0.7 GB | L | N | 
| Your TTS | 1.2s   | 0.7 GB | M | Y |
| Bark | 105s | 4.4 GB | L | Y?? |

Before using the notebook, create a venv and pip install tts. Also note that, as of this notebook date, coqui tts install doesn't work with python 3.12. So when creating your venv, probably best to use 3.11 or 3.10. Easy enough (ex. python3.11 -m venv .\venv)

Because the default tts from pip won't include any GPU accelerated pytorch on Windows, you will likely need to pip install the cu version of Pytorch. Typically I just Google pytorch cuda install and the top hit is the page that gives you the proper pip install.

In [1]:
# The following script will be used throughout the notebook. It was a stream of thought.

script = ('I had this dream... So strange... Like, I was this little turtle. Crawling... so slowly.'
          'And there was this giant tiger behind me. Just staring at me...'
          'It seemed just so, sad.')

## XTTSv2

The following is just hacking around with XTTSv2 with Coqui. It's a really really amazing TTS model that allows voice cloning and it does it so well.

The following code is ripped right out of the Coqui docs.

On a 4070ti, xttsv2 takes about 28 seconds to load the model and then generate output with the script above.

In [2]:
import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Run TTS
# ❗ Since this model is multi-lingual voice cloning model, we must set the target speaker_wav and language
# Text to speech to a file
tts.tts_to_file(text=script, 
                speaker_wav="samples/jennifergarner.wav", 
                language="en", 
                file_path="output/output_xttsv2_1.wav")

 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.


  from .autonotebook import tqdm as notebook_tqdm


 > Using model: xtts
 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']
 > Processing time: 13.626079559326172
 > Real-time factor: 0.6809951366345015


'output/output_xttsv2_1.wav'

<audio controls>
  <source src="./output/output_xttsv2_1.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

Once the model is loaded, voice generation takes a fraction of the time.
On my 4070ti, the following code takes 12.5s
Interestingly, takes longer with the xtts streaming server.

In [3]:
tts.tts_to_file(text=script, 
                speaker_wav="samples/jennifergarner.wav", 
                language="en", 
                file_path="output/output_xttsv2_2.wav")

 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']
 > Processing time: 12.358268737792969
 > Real-time factor: 0.6098485016031415


'output/output_xttsv2_2.wav'

<audio controls>
  <source src="./output/output_xttsv2_2.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

### What are all the Coqui models anyway?

The following code dumps the models to a json file for viewing. Formats really well in vscode.

In [4]:
import json
models = TTS().list_models()

with open('models.json', 'w') as f:
    json.dump(models.models_dict, f, indent=4)

Let's try other models...

## Speedy Speech

Trying out speedy speech. Although much faster than xttsv2, it doesn't do voice cloning, as far as I can tell.

In [3]:
import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"
print('using device, ' + device)

# Init TTS
tts = TTS("tts_models/en/ljspeech/speedy-speech").to(device)

# Run TTS
# Text to speech to a file
tts.tts_to_file(text=script, 
                speaker_wav="samples/jennifergarner.wav", 
                file_path="output/output_speedy_speech.wav")

using device, cuda
 > tts_models/en/ljspeech/speedy-speech is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: speedy_speech
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sam

Removing weight norm...
 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']
soʊ stɹeɪnd͡ʒ...
 [!] Character '͡' not found in the vocabulary. Discarding it.
 > Processing time: 4.207754373550415
 > Real-time factor: 0.2870840881256085


'output/output_speedy_speech.wav'

<audio controls>
  <source src="./output/output_speedy_speech.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

In [4]:
tts.tts_with_vc_to_file(
    script,
    speaker_wav="samples/jennifergarner.wav",
    file_path="output/output_speedyspeech2.wav"
)

 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']
 > Processing time: 0.40735602378845215
 > Real-time factor: 0.02779283728320514


<audio controls>
  <source src="./output/output_speedyspeech2.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

In [7]:
tts.tts_to_file(text=script, 
                file_path="output/output_speedy_speech3.wav")

 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']


 > Processing time: 0.3594348430633545
 > Real-time factor: 0.024523300316683275


'output/output_speedy_speech3.wav'

## Your TTS

Your TTS does voice cloning but the quality is really not that great. You can kind of hear the Jennifer Garner in these outputs, but just barely.


In [5]:
import torch
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=True).to("cuda")

 > tts_models/multilingual/multi-dataset/your_tts is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-

In [9]:
tts.tts_to_file(script, 
                speaker_wav="samples/example_reference.mp3", 
                language="en", 
                file_path="output/output_yourtts_en.wav")

 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']
 > Processing time: 1.6458244323730469
 > Real-time factor: 0.10130020510697649


'output/output_yourtts_en.wav'

<audio controls>
  <source src="./output/output_yourtts_en.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>


Going to try just sending the english script through to the different languages. Also note that the docs aren't quite correct for specifying the languages.

In [8]:
tts.tts_to_file(script, speaker_wav="samples/jennifergarner.wav", language="fr-fr", file_path="output/output_yourtts_fr-fr.wav")
tts.tts_to_file(script, speaker_wav="samples/jennifergarner.wav", language="pt-br", file_path="output/output_yourtts_pt-br.wav")

 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']


 > Processing time: 1.7092227935791016
 > Real-time factor: 0.1386568340698549
 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']
 > Processing time: 1.4384911060333252
 > Real-time factor: 0.08454252753648693


'output/output_yourtts_pt-br.wav'

<audio controls>
  <source src="./output/output_yourtts_fr-fr.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

<audio controls>
  <source src="./output/output_yourtts_pt-br.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

## Bark

Let's now try Bark :)

(when I first ran this, Coqui tried to download the model files, but a couple of the .pth files got corrupted. May have to manually download the larger files from here : https://huggingface.co/suno/bark/tree/main)

In [4]:
import torch
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/bark", progress_bar=True).to("cuda")

 > tts_models/multilingual/multi-dataset/bark is already downloaded.
 > Using model: bark


In [8]:
tts.tts_to_file(text=script, 
                speaker_wav="samples/jennifergarner.wav", 
                file_path="output/output_bar_1.wav")

 > Text splitted to sentences.
['I had this dream...', 'So strange...', 'Like, I was this little turtle.', 'Crawling... so slowly.', 'And there was this giant tiger behind me.', 'Just staring at me..', '.It seemed just so, sad.']


  0%|          | 0/100 [00:00<?, ?it/s]

  y = torch.nn.functional.scaled_dot_product_attention(q, k, v, dropout_p=self.dropout, is_causal=is_causal)
100%|██████████| 100/100 [00:07<00:00, 14.23it/s] 
100%|██████████| 22/22 [00:17<00:00,  1.24it/s]
100%|██████████| 100/100 [00:03<00:00, 28.11it/s] 
100%|██████████| 13/13 [00:09<00:00,  1.31it/s]
100%|██████████| 100/100 [00:01<00:00, 77.41it/s]
100%|██████████| 5/5 [00:03<00:00,  1.35it/s]
100%|██████████| 100/100 [00:01<00:00, 50.55it/s]
100%|██████████| 7/7 [00:05<00:00,  1.25it/s]
100%|██████████| 100/100 [00:01<00:00, 63.56it/s] 
100%|██████████| 6/6 [00:04<00:00,  1.36it/s]
100%|██████████| 100/100 [00:02<00:00, 48.37it/s]
100%|██████████| 8/8 [00:06<00:00,  1.33it/s]
100%|██████████| 100/100 [00:05<00:00, 17.00it/s]
100%|██████████| 22/22 [00:16<00:00,  1.30it/s]


 > Processing time: 104.59143352508545
 > Real-time factor: 2.7478804561387546


'output/output_bar_1.wav'

<audio controls>
  <source src="./output/output_bar_1.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

## XTTS Manual Inference

In [2]:
import os
import time
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("C:\\Users\\jamie\\AppData\\Local\\tts\\tts_models--multilingual--multi-dataset--xtts_v2\\config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="C:\\Users\\jamie\\AppData\\Local\\tts\\tts_models--multilingual--multi-dataset--xtts_v2\\", use_deepspeed=False)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[".\\samples\\example_reference.mp3"])

print("Inference...")
t0 = time.time()
chunks = model.inference_stream(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding
)

wav_chuncks = []
for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunck: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chuncks.append(chunk)
wav = torch.cat(wav_chuncks, dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)

  from .autonotebook import tqdm as notebook_tqdm


Loading model...
Computing speaker latents...
Inference...




Time to first chunck: 0.9614393711090088
Received chunk 0 of audio length 21248
Received chunk 1 of audio length 22272
Received chunk 2 of audio length 22272
Received chunk 3 of audio length 22272
Received chunk 4 of audio length 22272
Received chunk 5 of audio length 22272
Received chunk 6 of audio length 22272
Received chunk 7 of audio length 1024


In [6]:
import pyaudio

print("Inference...")
t0 = time.time()
chunks = model.inference_stream(
    script,
    "en",
    gpt_cond_latent,
    speaker_embedding
)

wav_chuncks = []
for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunck: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chuncks.append(chunk)
wav = torch.cat(wav_chuncks, dim=0)

torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)

Inference...
Time to first chunck: 0.793259859085083
Received chunk 0 of audio length 21248
Received chunk 1 of audio length 22272
Received chunk 2 of audio length 22272
Received chunk 3 of audio length 22272
Received chunk 4 of audio length 22272
Received chunk 5 of audio length 22272
Received chunk 6 of audio length 22272
Received chunk 7 of audio length 22272
Received chunk 8 of audio length 22272
Received chunk 9 of audio length 22272
Received chunk 10 of audio length 22272
Received chunk 11 of audio length 22272
Received chunk 12 of audio length 22272
Received chunk 13 of audio length 22528
Received chunk 14 of audio length 22272
Received chunk 15 of audio length 22272
Received chunk 16 of audio length 22272
Received chunk 17 of audio length 22272
Received chunk 18 of audio length 4352


: 