## TTS Model Exploration – Italian

This section explores several state-of-the-art TTS models for Italian:

- Parler-TTS

- Bark

- XTTS2

The goal is to evaluate their performance, audio quality, expressiveness, and support for voice and style control. The results show why finetuning a model on our dataset would be beneficial.

Starting with Parler-TTS

In [5]:
# Install Parler-TTS
!pip install git+https://github.com/huggingface/parler-tts.git




Collecting git+https://github.com/huggingface/parler-tts.git
  Cloning https://github.com/huggingface/parler-tts.git to /tmp/pip-req-build-ty693jy2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/parler-tts.git /tmp/pip-req-build-ty693jy2
  Resolved https://github.com/huggingface/parler-tts.git to commit d108732cd57788ec86bc857d99a6cabd66663d68
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting descript-audiotools@ git+https://github.com/descriptinc/audiotools (from parler_tts==0.2.2)
  Cloning https://github.com/descriptinc/audiotools to /tmp/pip-install-cdrfaxe0/descript-audiotools_5a06c6a9358340e3b8d10846e015827b
  Running command git clone --filter=blob:none --quiet https://github.com/descriptinc/audiotools /tmp/pip-install-cdrfaxe0/descript-audiotools_5a06c6a9358340e3b8d10846e015827b
  Resolved https://github.com/d

In [2]:
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model_id = "parler-tts/parler-tts-mini-multilingual-v1.1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/3.75G [00:00<?, ?B/s]

  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

  "_name_or_path": "ylacombe/dac_44khz",
  "architectures": [
    "DacModel"
  ],
  "codebook_dim": 8,
  "codebook_loss_weight": 1.0,
  "codebook_size": 1024,
  "commitment_loss_weight": 0.25,
  "decoder_hidden_si

generation_config.json:   0%|          | 0.00/218 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/990 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/10.3M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/552 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

In [3]:
from IPython.display import Audio

In [4]:
# Example Italian text and style description
prompt = "Mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di Kamchatka, in Russia, provocando un’allerta tsunami in tutto il Pacifico; Ore dopo, le prime onde hanno raggiunto le coste delle Hawaii, situate a migliaia di chilometri a est; Ma qui era ancora martedì sera. Un’apparente anomalia temporale che ha una spiegazione affascinante: lo tsunami ha attraversato la linea internazionale del cambiamento di data, viaggiando “indietro nel tempo”"
description = (
    "A clear, expressive male speaker with a slightly warm tone, moderate pace, "
    "very high audio quality, close-mic recording, like a news narrator"
)

# Tokenize inputs
input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()

# Save output
sf.write("parler_it_output.wav", audio, model.config.sampling_rate)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


KeyboardInterrupt: 
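
The warning above asks for an explicit attention mask. A minimal sketch of how the generation step could pass one for both the description and the prompt. It assumes that `model.generate` accepts `attention_mask` and `prompt_attention_mask` keyword arguments, as the Parler-TTS inference guide suggests; `generate_with_masks` is a hypothetical helper name, not part of the library.

```python
def generate_with_masks(model, description_tokenizer, tokenizer,
                        description, prompt, device):
    """Run Parler-TTS generation with explicit attention masks.

    Hypothetical helper: assumes model.generate accepts the
    `attention_mask` and `prompt_attention_mask` keyword arguments.
    """
    # Keep the full tokenizer output so the masks stay available
    desc = description_tokenizer(description, return_tensors="pt").to(device)
    text = tokenizer(prompt, return_tensors="pt").to(device)

    generation = model.generate(
        input_ids=desc.input_ids,
        attention_mask=desc.attention_mask,
        prompt_input_ids=text.input_ids,
        prompt_attention_mask=text.attention_mask,
    )
    # Same post-processing as in the cell above
    return generation.cpu().numpy().squeeze()
```

With the objects loaded earlier, `audio = generate_with_masks(model, description_tokenizer, tokenizer, description, prompt, device)` would replace the tokenization and generation steps of the cell above.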

In [10]:
Audio("parler_it_output.wav")

\+ good quality

\+ multilingual

\+ style prompting

\- very slow inference

\- no voice prompting / voice cloning
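
To compare inference speed across models with a number rather than an impression, one option is the real-time factor: synthesis time divided by the duration of the generated audio. The sketch below is an assumption about how this could be measured; `real_time_factor` and `timed_synthesis` are hypothetical helper names.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sampling_rate):
    """Return synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sampling_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio is empty")
    return synthesis_seconds / audio_seconds


def timed_synthesis(synthesize, *args, **kwargs):
    """Call `synthesize` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = synthesize(*args, **kwargs)
    return result, time.perf_counter() - start
```

A value above 1 means the model is slower than real time. For Parler-TTS, `num_samples` would be `len(audio)` and `sampling_rate` would be `model.config.sampling_rate`.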

---

Let's try Bark now

In [6]:
# install bark (make sure you have torch>=2 for much faster flash-attention)
!pip install git+https://github.com/suno-ai/bark.git

Collecting git+https://github.com/suno-ai/bark.git
  Cloning https://github.com/suno-ai/bark.git to /tmp/pip-req-build-h_fd9d4b
  Running command git clone --filter=blob:none --quiet https://github.com/suno-ai/bark.git /tmp/pip-req-build-h_fd9d4b
  Resolved https://github.com/suno-ai/bark.git to commit f4f32d4cd480dfec1c245d258174bc9bde3c2148
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting boto3 (from suno-bark==0.0.1a0)
  Downloading boto3-1.40.23-py3-none-any.whl.metadata (6.7 kB)
Collecting encodec (from suno-bark==0.0.1a0)
  Downloading encodec-0.1.1.tar.gz (3.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.7/3.7 MB 38.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting funcy (from suno-bark==0.0.1a0)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Colle

In [None]:
from huggingface_hub import login
login(token="")

In [8]:
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

from numpy.core.multiarray import scalar
from torch.serialization import safe_globals

# Allowlist the numpy scalar type for torch.load(weights_only=True),
# the default since PyTorch 2.6, then download and cache the Bark models.
with safe_globals([scalar]):
    preload_models()




text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]

UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray.scalar was not an allowed global by default. Please use `torch.serialization.add_safe_globals([numpy.core.multiarray.scalar])` or the `torch.serialization.safe_globals([numpy.core.multiarray.scalar])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
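
The error message lists two options. Option (1), loading with `weights_only=False`, could be applied to the `torch.load` calls made inside Bark by wrapping the function before `preload_models()` runs. This is a sketch under that assumption, and as the message warns, it is only appropriate for checkpoints from a trusted source. `with_default_kwargs` and `allow_full_unpickling` are hypothetical helper names.

```python
import functools


def with_default_kwargs(fn, **defaults):
    """Wrap `fn` so the given keyword arguments apply unless the caller sets them."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for name, value in defaults.items():
            kwargs.setdefault(name, value)
        return fn(*args, **kwargs)
    return wrapper


def allow_full_unpickling():
    """Make torch.load default to weights_only=False (trusted checkpoints only)."""
    import torch  # imported lazily so the wrapper above has no dependencies
    torch.load = with_default_kwargs(torch.load, weights_only=False)
```

Calling `allow_full_unpickling()` once and then `preload_models()` would let the Bark checkpoints load with the pre-2.6 behaviour; callers that pass `weights_only` explicitly are unaffected.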

In [None]:
text_prompt = "Mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di Kamchatka, in Russia, provocando un’allerta tsunami in tutto il Pacifico; Ore dopo, le prime onde hanno raggiunto le coste delle Hawaii, situate a migliaia di chilometri a est; Ma qui era ancora martedì sera. Un’apparente anomalia temporale che ha una spiegazione affascinante: lo tsunami ha attraversato la linea internazionale del cambiamento di data, viaggiando “indietro nel tempo”"
audio_array = generate_audio(text_prompt)
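
Bark generates only a short clip per call (roughly 13 seconds according to its README), so a paragraph of this length is likely to be cut off. A common approach is to split the text into short chunks and synthesize them one by one. The splitter below is a minimal sketch; the name `split_into_chunks` and the 200-character limit are assumptions, not values taken from Bark.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split text at sentence-like punctuation into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole.
    """
    # Break after ., ;, :, ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.;:!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `generate_audio`, and the resulting arrays joined with `numpy.concatenate` before saving or playing them at `SAMPLE_RATE`.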

In [4]:
Audio("bark_output_esp.wav")

---

Finally, XTTS2, installed through the `coqui-tts` package

In [12]:
!pip install coqui-tts

Collecting coqui-tts
  Downloading coqui_tts-0.27.0-py3-none-any.whl.metadata (19 kB)
Collecting anyascii>=0.3.0 (from coqui-tts)
  Downloading anyascii-0.3.3-py3-none-any.whl.metadata (1.6 kB)
Collecting coqpit-config<0.3.0,>=0.2.0 (from coqui-tts)
  Downloading coqpit_config-0.2.1-py3-none-any.whl.metadata (11 kB)
Collecting coqui-tts-trainer<0.4.0,>=0.3.0 (from coqui-tts)
  Downloading coqui_tts_trainer-0.3.1-py3-none-any.whl.metadata (8.1 kB)
Collecting gruut>=2.4.0 (from gruut[de,es,fr]>=2.4.0->coqui-tts)
  Downloading gruut-2.4.0.tar.gz (85 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.3/85.3 kB 7.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting monotonic-alignment-search>=0.1.0 (from coqui-tts)
  Downloading monotonic_alignment_search-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting num2words>=0.5.14 (from coqui-tts)
  Downloading num2words-0.5.14-p

In [1]:
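import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# List all models available in Coqui TTS
print(TTS().list_models())

# Load XTTS v2 (multilingual, supports voice cloning)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Built-in speaker names that can be used instead of a reference recording
print(tts.speakers)

# Synthesize the Italian text, cloning the voice from the reference recording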
tts.tts_to_file(
  text="Mercoledì 24 luglio, alle 11:24 ora locale, un violento terremoto ha scosso la penisola di Kamchatka, in Russia, provocando un’allerta tsunami in tutto il Pacifico; Ore dopo, le prime onde hanno raggiunto le coste delle Hawaii, situate a migliaia di chilometri a est; Ma qui era ancora martedì sera. Un’apparente anomalia temporale che ha una spiegazione affascinante: lo tsunami ha attraversato la linea internazionale del cambiamento di data, viaggiando “indietro nel tempo”;",
  speaker_wav="alarming.wav",
  language="it",
  file_path="xtts2_output.wav"
)

  re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
  re_skip_default = re.compile("(\r\n|\s)", re.U)
  re_skip = re.compile("([a-zA-Z0-9]+(?:\.\d+)?%?)")


['tts_models/multilingual/multi-dataset/xtts_v2', 'tts_models/multilingual/multi-dataset/xtts_v1.1', 'tts_models/multilingual/multi-dataset/your_tts', 'tts_models/multilingual/multi-dataset/bark', 'tts_models/bg/cv/vits', 'tts_models/cs/cv/vits', 'tts_models/da/cv/vits', 'tts_models/et/cv/vits', 'tts_models/ga/cv/vits', 'tts_models/en/ek1/tacotron2', 'tts_models/en/ljspeech/tacotron2-DDC', 'tts_models/en/ljspeech/tacotron2-DDC_ph', 'tts_models/en/ljspeech/glow-tts', 'tts_models/en/ljspeech/speedy-speech', 'tts_models/en/ljspeech/tacotron2-DCA', 'tts_models/en/ljspeech/vits', 'tts_models/en/ljspeech/vits--neon', 'tts_models/en/ljspeech/fast_pitch', 'tts_models/en/ljspeech/overflow', 'tts_models/en/ljspeech/neural_hmm', 'tts_models/en/vctk/vits', 'tts_models/en/vctk/fast_pitch', 'tts_models/en/sam/tacotron-DDC', 'tts_models/en/blizzard2013/capacitron-t2-c50', 'tts_models/en/blizzard2013/capacitron-t2-c150_v2', 'tts_models/en/multi-dataset/tortoise-v2', 'tts_models/en/jenny/jenny', 'tts_m

100%|█████████▉| 1.87G/1.87G [00:41<00:00, 62.1MiB/s]
100%|██████████| 1.87G/1.87G [00:41<00:00, 45.1MiB/s]
4.37kiB [00:00, 90.1kiB/s]

361kiB [00:00, 8.95MiB/s]
100%|██████████| 32.0/32.0 [00:00<00:00, 322iB/s]
100%|██████████| 7.75M/7.75M [00:18<00:00, 17.1MiB/s]

['Claribel Dervla', 'Daisy Studious', 'Gracie Wise', 'Tammie Ema', 'Alison Dietlinde', 'Ana Florence', 'Annmarie Nele', 'Asya Anara', 'Brenda Stern', 'Gitta Nikolina', 'Henriette Usha', 'Sofia Hellen', 'Tammy Grit', 'Tanja Adelina', 'Vjollca Johnnie', 'Andrew Chipper', 'Badr Odhiambo', 'Dionisio Schuyler', 'Royston Min', 'Viktor Eka', 'Abrahan Mack', 'Adde Michal', 'Baldur Sanjin', 'Craig Gutsy', 'Damien Black', 'Gilberto Mathias', 'Ilkin Urbano', 'Kazuhiko Atallah', 'Ludvig Milivoj', 'Suad Qasim', 'Torcull Diarmuid', 'Viktor Menelaos', 'Zacharie Aimilios', 'Nova Hogarth', 'Maja Ruoho', 'Uta Obando', 'Lidiya Szekeres', 'Chandra MacFarland', 'Szofi Granger', 'Camilla Holmström', 'Lilya Stainthorpe', 'Zofija Kendrick', 'Narelle Moon', 'Barbora MacLean', 'Alexandra Hisakawa', 'Alma María', 'Rosemary Okafor', 'Ige Behringer', 'Filip Traverse', 'Damjan Chapman', 'Wulf Carlevaro', 'Aaron Dreschner', 'Kumar Dahl', 'Eugenio Mataracı', 'Ferran Simen', 'Xavier Hayasaka', 'Luis Moray', 'Marcos Ru

  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


'xtts2_output.wav'

In [2]:
from IPython.display import Audio

In [3]:
Audio("xtts2_output.wav")

\+ great sound quality

\+ great prosody, expressiveness

\- too expressive?

\- problems with pauses and a few stresses?

\- unusual license
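
For voice control, XTTS2 can either clone a voice from a reference recording (`speaker_wav`, as in the cell above) or use one of the built-in names printed by `tts.speakers` (via the `speaker` argument of `tts_to_file`). The sketch below builds the keyword arguments for either case; `xtts_kwargs` is a hypothetical helper name.

```python
def xtts_kwargs(text, file_path, language="it", speaker_wav=None, speaker=None):
    """Build keyword arguments for tts.tts_to_file with exactly one voice source."""
    if (speaker_wav is None) == (speaker is None):
        raise ValueError("pass exactly one of speaker_wav or speaker")

    kwargs = {"text": text, "language": language, "file_path": file_path}
    if speaker_wav is not None:
        kwargs["speaker_wav"] = speaker_wav  # clone the voice from a reference clip
    else:
        kwargs["speaker"] = speaker  # use one of the names in tts.speakers
    return kwargs
```

For example, `tts.tts_to_file(**xtts_kwargs("Buongiorno a tutti.", "xtts2_builtin.wav", speaker="Ana Florence"))` would synthesize with a built-in voice, which makes it easier to compare cloned and stock voices on the same text.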

## Observations and Comparison

| **Model**   | **Pros**                                                     | **Cons**                                                               |
|------------|---------------------------------------------------------------|------------------------------------------------------------------------|
| `Parler-TTS` | Good quality, multilingual, style prompting available       | Slow inference, no voice cloning, limited control over voice           |
| `Bark`      | High quality, expressive audio                               | Inference can be slow, limited control over voice/style                |
| `XTTS2`     | Excellent prosody and expressiveness                         | Minor issues with pauses and stresses, unusual license                 |

---

## Summary

All models produce reasonably good audio and expressiveness, but **none provide enough fine-grained control** over output voice, intonation, or style for our specific use-case.  

This clearly motivates the need to **finetune a model (VITS)** on our curated Italian dataset for better control and adaptation.