# VITS Italian TTS Fine-Tuning

This notebook demonstrates fine-tuning a pretrained Italian VITS model on a custom dataset.

We'll have a look at how to:
- Load a pretrained Italian VITS model
- Update the config for the dataset and training parameters
- Load and preprocess the dataset (wav files + metadata)
- Finetune the model on custom data
- Save checkpoints and the updated config for later inference

**Notes:**
- The dataset metadata file must be pipe-separated, one clip per line, in the format `wav_filename|text|text` (3 columns)
- Make sure the checkpoint (`model.pth`) and config (`config.json`) paths are correct
- Outputs and checkpoints are saved in `output_path`
- This notebook uses the Coqui TTS Trainer (`trainer.py`) for fine-tuning
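
The metadata layout described in the notes can be checked before any training starts. The following is a minimal sketch, not part of Coqui TTS: `check_metadata_lines` is a hypothetical helper, and the sample rows are invented for illustration only.

```python
from typing import List, Tuple


def check_metadata_lines(lines: List[str], n_columns: int = 3) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for every line that breaks the expected layout."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        columns = line.split("|")
        if len(columns) != n_columns:
            problems.append((number, f"expected {n_columns} columns, found {len(columns)}"))
        elif not all(col.strip() for col in columns):
            problems.append((number, "empty column"))
    return problems


# Invented sample rows (hypothetical clip names); the second and third are malformed.
sample = [
    "clip_0001|Ciao, come stai oggi?|Ciao, come stai oggi?",
    "clip_0002|Buongiorno a tutti!",
    "",
]
for number, reason in check_metadata_lines(sample):
    print(f"line {number}: {reason}")
```

In practice the list of lines would come from reading the metadata file, for example `open(path, encoding="utf-8").readlines()`.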


In [1]:
!pip install coqui-tts

Collecting coqui-tts
  Downloading coqui_tts-0.27.0-py3-none-any.whl.metadata (19 kB)
Collecting anyascii>=0.3.0 (from coqui-tts)
  Downloading anyascii-0.3.3-py3-none-any.whl.metadata (1.6 kB)
Collecting coqpit-config<0.3.0,>=0.2.0 (from coqui-tts)
  Downloading coqpit_config-0.2.1-py3-none-any.whl.metadata (11 kB)
Collecting coqui-tts-trainer<0.4.0,>=0.3.0 (from coqui-tts)
  Downloading coqui_tts_trainer-0.3.1-py3-none-any.whl.metadata (8.1 kB)
Collecting encodec>=0.1.1 (from coqui-tts)
  Downloading encodec-0.1.1.tar.gz (3.7 MB)
  Preparing metadata (setup.py) ... done
Collecting gruut>=2.4.0 (from gruut[de,es,fr]>=2.4.0->coqui-tts)
  Downloading gruut-2.4.0.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done

In [2]:
!pip install coqui-tts-trainer



In [3]:
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.glow_tts_config import GlowTTSConfig

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor




In [None]:
!python3.12 -m pip install --upgrade pip

Defaulting to user installation because normal site-packages is not writeable
Collecting pip
  Downloading pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
Successfully installed pip-25.2


In [4]:
!which python

/usr/local/bin/python


In [6]:
!ls "/root/.local/share/tts/tts_models--it--mai_male--vits/"

ls: cannot access '/root/.local/share/tts/tts_models--it--mai_male--vits/': No such file or directory


In [7]:
from TTS.api import TTS

model_name = "tts_models/it/mai_male/vits"
TTS.list_models()

['tts_models/multilingual/multi-dataset/xtts_v2',
 'tts_models/multilingual/multi-dataset/xtts_v1.1',
 'tts_models/multilingual/multi-dataset/your_tts',
 'tts_models/multilingual/multi-dataset/bark',
 'tts_models/bg/cv/vits',
 'tts_models/cs/cv/vits',
 'tts_models/da/cv/vits',
 'tts_models/et/cv/vits',
 'tts_models/ga/cv/vits',
 'tts_models/en/ek1/tacotron2',
 'tts_models/en/ljspeech/tacotron2-DDC',
 'tts_models/en/ljspeech/tacotron2-DDC_ph',
 'tts_models/en/ljspeech/glow-tts',
 'tts_models/en/ljspeech/speedy-speech',
 'tts_models/en/ljspeech/tacotron2-DCA',
 'tts_models/en/ljspeech/vits',
 'tts_models/en/ljspeech/vits--neon',
 'tts_models/en/ljspeech/fast_pitch',
 'tts_models/en/ljspeech/overflow',
 'tts_models/en/ljspeech/neural_hmm',
 'tts_models/en/vctk/vits',
 'tts_models/en/vctk/fast_pitch',
 'tts_models/en/sam/tacotron-DDC',
 'tts_models/en/blizzard2013/capacitron-t2-c50',
 'tts_models/en/blizzard2013/capacitron-t2-c150_v2',
 'tts_models/en/multi-dataset/tortoise-v2',
 'tts_mode

In [8]:
tts = TTS(model_name)

 97%|█████████▋| 142M/146M [00:03<00:00, 38.8MiB/s]

In [9]:
!ls ~/.local/share/tts/tts_models--it--mai_male--vits/

config.json  model.pth


In [16]:
from TTS.api import TTS
from IPython.display import Audio
from TTS.utils.synthesizer import Synthesizer
import torch
import soundfile as sf


In [14]:
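# os.path.join and open() do not expand "~", so resolve the home directory explicitly
model_dir = os.path.expanduser("~/.local/share/tts/tts_models--it--mai_male--vits/")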
pretrained_model_path = os.path.join(model_dir, "model.pth")
pretrained_config_path = os.path.join(model_dir, "config.json")

synthesizer = Synthesizer(
    tts_checkpoint=pretrained_model_path,
    tts_config_path=pretrained_config_path,
    use_cuda=torch.cuda.is_available(),

)

In [17]:
wav = synthesizer.tts(text="Mercoledi ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di Kamchatca, in Russia, provocando un allerta tsunami in tutto il Pacifico. ", language_name='it')
sf.write("00_baseline_pretrained.wav", wav, synthesizer.output_sample_rate)
Audio("00_baseline_pretrained.wav")

\+ it works-ish

\+ it does speak Italian

\+ reasonable pronunciation (except for foreign spellings)

\- robotic

\- problems with intonation, prosody, and pauses

\- does not support Italian accented letters (ì, ò, etc.)
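
To see which characters in a sentence fall outside what the model can represent, the text can be compared against the character set. This is a minimal sketch under assumptions: `unsupported_characters` is a hypothetical helper, and `example_vocab` is an invented character string, not the real vocabulary of the pretrained model (in this notebook that would come from the `characters` entry of the model config).

```python
def unsupported_characters(text: str, vocabulary: str) -> list:
    """Return the sorted, de-duplicated characters of `text` that are not in `vocabulary`."""
    allowed = set(vocabulary)
    return sorted({ch for ch in text if ch not in allowed})


# Invented vocabulary for illustration only; it deliberately lacks accented letters.
example_vocab = "abcdefghijklmnopqrstuvwxyz ,.'!?"

print(unsupported_characters("mercoledì alle undici", example_vocab))  # prints ['ì']
```

Running the same check over every transcript in the dataset gives a quick list of characters that would need to be added to the config or normalized away.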


---

Let's try fine-tuning.
- `datasets` is a list in the config, so the data has to be at the path set in `config["datasets"][0]["path"]`
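
Before launching a long run, it can help to confirm that the dataset folder contains what the loader will look for. The sketch below is an assumption-laden illustration: `missing_entries` is a hypothetical helper, `metadata.csv` is the metadata file name used in this notebook, and the `wavs` subfolder is assumed from the usual LJSpeech layout, so verify it against the formatter in your installed version.

```python
import os
import tempfile


def missing_entries(dataset_dir, required=("metadata.csv", "wavs")):
    """Return the names in `required` that do not exist under `dataset_dir`."""
    # expanduser makes paths such as "~/data" usable, since open() does not expand "~"
    root = os.path.expanduser(dataset_dir)
    return [name for name in required if not os.path.exists(os.path.join(root, name))]


# Demonstration on a throwaway directory instead of the real dataset path.
with tempfile.TemporaryDirectory() as tmp:
    print(missing_entries(tmp))  # both entries are reported as missing
    open(os.path.join(tmp, "metadata.csv"), "w").close()
    os.mkdir(os.path.join(tmp, "wavs"))
    print(missing_entries(tmp))  # nothing is missing any more
```

For the real run, the argument would be the same path that is written into the dataset entry of the config.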

In [None]:
import os
import json
import torch
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.text.tokenizer import TTSTokenizer


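# os.path.join and open() do not expand "~", so resolve the home directory explicitly
model_dir = os.path.expanduser("~/.local/share/tts/tts_models--it--mai_male--vits/")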
pretrained_model_path = os.path.join(model_dir, "model.pth")
pretrained_config_path = os.path.join(model_dir, "config.json")

output_path = "data_from_drive/content/tts_finetuning_output"
os.makedirs(output_path, exist_ok=True)
cache_folder = "data_from_drive/content/tts_cache"
os.makedirs(cache_folder, exist_ok=True)


# passing here the last best checkpoint
# if we're just starting:
checkpoint_path = os.path.expanduser("~/.local/share/tts/tts_models--it--mai_male--vits/model.pth")
# checkpoint_path = "data_from_drive/content/tts_finetuning_output/my_italian_finetuning-September-01-2025_04+20PM-0000000/best_model_5900.pth"


if not os.path.exists(pretrained_config_path):
    raise FileNotFoundError(f"Config not found at {pretrained_config_path}")

with open(pretrained_config_path, "r") as f:
    config_dict = json.load(f)

checkpoint = torch.load(checkpoint_path, map_location="cpu")
if "config" in checkpoint:
    original_config = checkpoint["config"]
elif "config_dict" in checkpoint:
    original_config = checkpoint["config_dict"]
else:
    raise ValueError("No config found in checkpoint")


config_dict.update({
    "output_path": output_path,
    "run_name": "my_italian_finetuning",
    "num_loader_workers": 0,
    "num_eval_loader_workers": 0,
    "epochs": 150,
    "batch_size": 16,
    "eval_batch_size": 8,
    "mixed_precision": False,
    "text_cleaner": "multilingual_cleaners",
    "use_phonemes": False,
    "lr": 0.001,  # generic field; VITS optimizers may read lr_gen / lr_disc instead, so verify this takes effect
    "datasets": [
        {
            "name": "my_dataset",
            "path": "data_from_drive/content/downloads_segmented_by_pauses",
            "meta_file_train": "metadata.csv",
            "formatter": "ljspeech",
            "cache_path": cache_folder,
        }
    ],
    "model_args": {
        "num_speakers": 1,
        "use_speaker_embedding": False,
        "init_discriminator": True
    },
    "test_sentences": [
        "Ciao, come stai oggi?",
        "Buongiorno a tutti!",
        "Mercoledì ventiquattro luglio, alle undici ventiquattro ora locale, un violento terremoto ha scosso la penisola di Kamchatka, in russia, provocando un’allerta tsunami in tutto il pacifico;",
    ]
})


config_dict["characters"]["characters"] = original_config["characters"]["characters"]

# saving updated config
updated_config_path = os.path.join(output_path, "config_finetuning.json")
with open(updated_config_path, "w") as f:
    json.dump(config_dict, f, indent=4)
print(f"Config saved at {updated_config_path}")

# initializing config, audio processor, tokenizer
config = VitsConfig()
config.from_dict(config_dict)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

# loading dataset
train_samples, eval_samples = load_tts_samples(config.datasets[0], eval_split=True)

# initializing model and loading checkpoint
model = Vits(config, ap, tokenizer, speaker_manager=None)
# reuse the checkpoint already loaded above instead of reading the file a second time
model.load_state_dict(checkpoint["model"])

# setting trainer
args = TrainerArgs()
trainer = Trainer(
    args=args,
    config=config,
    output_path=output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

# starting finetuning
trainer.fit()


Config saved at data_from_drive/content/tts_finetuning_output/config_finetuning.json


 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 40
 | > Num. of Torch Threads: 4
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=data_from_drive/content/tts_finetuning_output/my_italian_finetuning-September-01-2025_08+06PM-0000000

 > Model has 83052076 parameters

[4m[1m > EPOCH: 0/149[0m
 --> data_from_drive/content/tts_finetuning_output/my_italian_finetuning-September-01-2025_08+06PM-0000000

[1m > TRAINING (2025-09-01 20:06:31) [0m

[1m   --> TIME: 2025