# VITS Italian TTS Inference

This notebook demonstrates inference with a finetuned VITS model for Italian.
We'll look at how to:
- Run baseline inference with the finetuned model
- Apply text tweaks (accents, apostrophes, padding)
- Prompt the model with a reference audio clip to mimic its intonation

**Note:** The model checkpoint (`best_model.pth`) and config file (`config.json`) are downloaded from Google Drive in a cell below.

In [1]:
!pip install coqui-tts

Collecting coqui-tts
  Downloading coqui_tts-0.27.0-py3-none-any.whl.metadata (19 kB)
Collecting anyascii>=0.3.0 (from coqui-tts)
  Downloading anyascii-0.3.3-py3-none-any.whl.metadata (1.6 kB)
Collecting coqpit-config<0.3.0,>=0.2.0 (from coqui-tts)
  Downloading coqpit_config-0.2.1-py3-none-any.whl.metadata (11 kB)
Collecting coqui-tts-trainer<0.4.0,>=0.3.0 (from coqui-tts)
  Downloading coqui_tts_trainer-0.3.1-py3-none-any.whl.metadata (8.1 kB)
Collecting encodec>=0.1.1 (from coqui-tts)
  Downloading encodec-0.1.1.tar.gz (3.7 MB)
  Preparing metadata (setup.py) ... done
Collecting gruut>=2.4.0 (from gruut[de,es,fr]>=2.4.0->coqui-tts)
  Downloading gruut-2.4.0.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done

In [None]:
from IPython.display import Audio
    

Uncomment the following lines to try your own examples:

In [30]:
import torch
from TTS.utils.synthesizer import Synthesizer
import soundfile as sf
import os


Using device: cuda
Audio saved to inference_output.wav


In [69]:
!gdown --id 1Uro2gsqQ8SrcWHwx4BK8FCrBu-Oc0rPt --output best_model.pth
# The /file/d/<id> share link serves the Drive viewer page rather than the raw file,
# so the config is fetched with gdown as well
!gdown "https://drive.google.com/uc?id=1gKd85Q7-yBO8A-f-fxWvazoWkDnuChfs" --output config.json

Downloading...
From (original): https://drive.google.com/uc?id=1Uro2gsqQ8SrcWHwx4BK8FCrBu-Oc0rPt
From (redirected): https://drive.google.com/uc?id=1Uro2gsqQ8SrcWHwx4BK8FCrBu-Oc0rPt&confirm=t&uuid=82b1d775-3a8e-4293-b7b5-83ba34ef53c5
To: /content/best_model1.pth
100% 998M/998M [00:09<00:00, 109MB/s]
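
A Google Drive share link can return an HTML page instead of the file itself, so it may be worth confirming that `config.json` actually parses as JSON before loading the model. The helper below is a minimal sketch; `looks_like_json_config` is a hypothetical name, not part of coqui-tts.

```python
import json


def looks_like_json_config(path):
    """Return True if `path` can be read and parses as a JSON object."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing file, unreadable file, or content that is not JSON (e.g. HTML)
        return False
    return isinstance(data, dict)
```

Calling `looks_like_json_config("config.json")` right after the download cell gives a quick signal that the file is usable.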

In [None]:
model_checkpoint_path = "best_model.pth"
config_path = "config.json"

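# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the finetuned VITS checkpoint together with its config
synthesizer = Synthesizer(
    tts_checkpoint=model_checkpoint_path,
    tts_config_path=config_path,
    use_cuda=(device == "cuda"),
)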

In [62]:
wav = synthesizer.tts(text="Mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di Kamchatca, in Russia, provocando un allerta tsunami in tutto il Pacifico. ", language_name='it')
sf.write("0_base_inference_80k_steps.wav", wav, synthesizer.output_sample_rate)
Audio("0_base_inference_80k_steps.wav")

\+ generally good

\- needs more padding at the end of the sentence (EOS)

\- "Kamchatka" is mispronounced

\- word stress is off in places?

---

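Trying to tweak the text prompt:
- stress marks (an apostrophe before the stressed vowel) and doubled letters
- Italian-style respelling of foreign names (Kamchatca -> camciatca)
- lowercase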
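
The lowercasing and respelling steps can be applied programmatically rather than by hand. The sketch below is only an illustration: `normalize_prompt` and the `RESPELLINGS` table are hypothetical names, and the single entry mirrors the respelling used in this notebook.

```python
import re

# Hypothetical respelling table: lowercase word -> spelling that reads better
RESPELLINGS = {"kamchatca": "camciatca"}


def normalize_prompt(text, respellings=RESPELLINGS):
    """Lowercase the prompt and swap in respellings for whole words."""
    lowered = text.lower()

    def swap(match):
        word = match.group(0)
        return respellings.get(word, word)

    # [^\W\d_]+ matches runs of letters, including accented ones such as "ì"
    return re.sub(r"[^\W\d_]+", swap, lowered)
```

Punctuation and spacing are left untouched, so any trailing space used as padding survives.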

In [33]:
wav = synthesizer.tts(text="mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terrem'oto ha scosso la pen'isola di camci'atca, in russia, provocando un all'erta tsun'ami in tutto il paciifico. ", language_name='it')
sf.write("1_accents_apostrophies_double_chars.wav", wav, synthesizer.output_sample_rate)
Audio("1_accents_apostrophies_double_chars.wav")

\+ sounds better

\- the stress on "pacifico" (written "paciifico") is exaggerated
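
The apostrophe convention used above (an apostrophe right before the stressed vowel, as in `terrem'oto`) can also be generated with a small helper. This is a sketch under that assumption; `mark_stress` is a hypothetical name, and it counts vowel letters rather than syllables, so diphthongs need some care.

```python
VOWELS = "aeiouàèéìíòóùú"


def mark_stress(word, vowel_number, marker="'"):
    """Insert `marker` before the n-th vowel letter (1-based) of `word`."""
    count = 0
    for index, char in enumerate(word):
        if char.lower() in VOWELS:
            count += 1
            if count == vowel_number:
                return word[:index] + marker + word[index:]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowel letters")
```

For example, `mark_stress("terremoto", 3)` places the marker before the third vowel letter, matching the spelling used in the prompt above.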

---

Trying a leading comma as helper padding at the beginning of the sentence (BOS), with the stress marks removed:
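
The padding can be added with a small helper so that every prompt gets the same leading comma and trailing space. This is a minimal sketch; `pad_prompt` is a hypothetical name, and the default padding strings simply mirror the ones used in this notebook.

```python
def pad_prompt(text, lead=", ", tail=" "):
    """Add leading and trailing padding to a prompt without duplicating it."""
    body = text.strip()
    lead_mark = lead.strip()
    # Drop an existing leading marker so repeated calls do not stack commas
    if lead_mark and body.startswith(lead_mark):
        body = body[len(lead_mark):].lstrip()
    return f"{lead}{body}{tail}"
```

Because an existing marker is removed first, applying the helper twice gives the same result as applying it once.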

In [41]:
wav = synthesizer.tts(text=", mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di camciatca, in russia, provocando un allerta tsunami in tutto il pacifico. ", language_name='it')
sf.write("2_small_letters_padding_comma.wav", wav, synthesizer.output_sample_rate)
Audio("2_small_letters_padding_comma.wav")

\+ sounds really good

-> next, let's try adding a style prompt (a reference audio sample)

---

Using a real example from a sports news narration:

In [53]:
Audio("upbeat_narration_sports_example.wav")

Using that sample as the reference audio in the basic case, without any text prompt preprocessing:

In [63]:
wav = synthesizer.tts(
    text="Mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di Kamchatca, in Russia, provocando un allerta tsunami in tutto il Pacifico. ",
    language_name='it',
    speaker_wav='upbeat_narration_sports_example.wav',
)
sf.write("3_prompting_with_voice_sample.wav", wav, synthesizer.output_sample_rate)
Audio("3_prompting_with_voice_sample.wav")

---

Tweaking stresses:

In [66]:
wav = synthesizer.tts(
    text="mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terrem'oto ha scosso la pen'isola di camci'atca, in russia, provocando un all'erta tsun'ami in tutto il pacifico. ",
    language_name='it',
    speaker_wav='upbeat_narration_sports_example.wav',
)
sf.write("4_prompting_with_voice_sample_and_stresses.wav", wav, synthesizer.output_sample_rate)
Audio("4_prompting_with_voice_sample_and_stresses.wav")

---

Tweaking stresses and adding a leading comma:

In [68]:
wav = synthesizer.tts(
    text=", mercoledì ventiquattro luglio, alle undici ventiquattro ora loc'ale, un violento terrem'oto ha scosso la peniisola di camci'atca, in russia, provocando un all'erta tsun'ami in tutto il pacifico. ",
    language_name='it',
    speaker_wav='upbeat_narration_sports_example.wav',
)
sf.write("5_voice_sample_and_more_stresses.wav", wav, synthesizer.output_sample_rate)
Audio("5_voice_sample_and_more_stresses.wav")