# **Training your custom model - TTS**

First, prepare your audio dataset by splitting it into multiple audio files. Clips of 5–10 seconds work best.

Second, make sure every file is in .wav format and has a single (mono) channel.
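The two requirements above can be checked before training. Here is a minimal sketch using only the Python standard library; `inspect_wav` is a hypothetical helper name, and it assumes the files are uncompressed PCM .wav files, which is what the `wave` module can read.

```python
import wave


def inspect_wav(path, min_seconds=5.0, max_seconds=10.0):
    """Return (channels, sample_rate, duration_seconds, ok) for a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / float(sample_rate)
    # A clip passes when it is mono and its length is inside the target range.
    ok = channels == 1 and min_seconds <= duration <= max_seconds
    return channels, sample_rate, duration, ok
```

Calling `inspect_wav("MyTTSDataset/wavs/wav1.wav")` on each clip gives a quick list of files that need to be re-split or converted to mono.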

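Third, you can use Google Speech-to-Text to transcribe every file in your dataset and save the results in a transcript file (for example transcript.txt). Each line holds the audio file name without the .wav extension, a `|` separator, and the spoken text. Make sure your transcript file looks like this: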

--------------------------------------------------------------
wav1|I have come into my conclusion that he is evil

wav2|The more we read the more we can gain knowledge

wav3|Good morning

--------------------------------------------------------------
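The transcription loop can be sketched as follows. `transcribe` is a hypothetical callable that takes a .wav path and returns its text; it stands in for whichever speech-to-text client you use and is not a real API. `write_transcript` is likewise an illustrative name.

```python
import os


def write_transcript(wav_dir, transcript_path, transcribe):
    """Write one `<file id>|<text>` line per .wav file found in wav_dir."""
    wav_names = sorted(n for n in os.listdir(wav_dir) if n.lower().endswith(".wav"))
    with open(transcript_path, "w", encoding="utf-8") as out:
        for name in wav_names:
            file_id = os.path.splitext(name)[0]
            text = transcribe(os.path.join(wav_dir, name)).strip()
            # "|" separates the columns, so it must not appear inside the text.
            text = text.replace("|", " ")
            out.write(f"{file_id}|{text}\n")
    return len(wav_names)
```

Automatic transcripts usually contain mistakes, so it is worth reading through the generated file before training.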

Each audio file must have its correct transcript; otherwise the trained model will be of poor quality. Now it is time to arrange your dataset into a folder. Your folder structure should look like this:

MyTTSDataset/

-metadata.csv (your transcript)

-wavs/

-------->wav1.wav

-------->wav2.wav

-------->...
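A quick consistency check for this layout, as a sketch: it assumes the transcript is named `metadata.csv`, the audio lives in `wavs/`, and the first `|`-separated column is the file name without extension. `check_dataset` is a hypothetical helper.

```python
import os


def check_dataset(root):
    """Return (ids_without_wav, wavs_without_transcript) for a dataset folder."""
    with open(os.path.join(root, "metadata.csv"), encoding="utf-8") as f:
        # The first column of every non-empty line is the file id.
        ids = {line.split("|", 1)[0].strip() for line in f if line.strip()}
    wav_dir = os.path.join(root, "wavs")
    wavs = {os.path.splitext(n)[0] for n in os.listdir(wav_dir) if n.lower().endswith(".wav")}
    return sorted(ids - wavs), sorted(wavs - ids)
```

Both returned lists should be empty before you start training.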

# **Installing the dependencies**

In [None]:
! pip install -U pip
! pip install TTS


# **Dataset Preparation**

In [None]:
! git clone https://github.com/coqui-ai/TTS.git

fatal: destination path 'TTS' already exists and is not an empty directory.


In [None]:
import os

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig, CharactersConfig

output_path = "tts_data"
os.makedirs(output_path, exist_ok=True)

In [None]:
# Download and extract LJSpeech dataset.

!wget -O $output_path/LJSpeech-1.1.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
!tar -xf $output_path/LJSpeech-1.1.tar.bz2 -C $output_path

In [None]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "LJSpeech-1.1/")
)

# For your custom data, use your own transcript file and dataset folder instead:
# dataset_config = BaseDatasetConfig(
#     formatter="ljspeech", meta_file_train="/path/to/transcript/file.csv", path=os.path.join(output_path, "path/to/dataset/folder/")
# )

In [None]:
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig

audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)
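To get a feel for the audio settings above, the window and hop sizes can be converted from samples to milliseconds. This is plain arithmetic on the configured values, not something read from the library.

```python
sample_rate = 22050  # samples per second
win_length = 1024    # analysis window, in samples
hop_length = 256     # step between consecutive windows, in samples

window_ms = 1000.0 * win_length / sample_rate
hop_ms = 1000.0 * hop_length / sample_rate
frames_per_second = sample_rate / hop_length

print(f"window: {window_ms:.2f} ms, hop: {hop_ms:.2f} ms, frames/s: {frames_per_second:.2f}")
# window: 46.44 ms, hop: 11.61 ms, frames/s: 86.13
```

If your recordings use a different sample rate, either resample them to 22050 Hz or change `sample_rate` here to match.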

In [None]:
character_config = CharactersConfig(
    characters_class= "TTS.tts.models.vits.VitsCharacters",
    characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890",
    punctuations=" !,.?-",
    pad= "<PAD>",
    eos= "<EOS>",
    bos= "<BOS>",
    blank= "<BLNK>",
)
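When training on raw characters, every symbol in the transcripts should be covered by `characters` or `punctuations`. A small sketch to list the symbols that are not covered; `uncovered_characters` is a hypothetical helper. With `use_phonemes=True` the text is phonemized first, so this check applies to character-based training only.

```python
def uncovered_characters(texts, characters, punctuations):
    """Return the sorted symbols in texts that are in neither character set."""
    allowed = set(characters) | set(punctuations)
    missing = set()
    for text in texts:
        missing.update(ch for ch in text if ch not in allowed)
    return sorted(missing)


characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"
punctuations = " !,.?-"
print(uncovered_characters(["Good morning", "It's $5"], characters, punctuations))
# ['$', "'"]
```

Symbols reported here should either be added to the character set or removed from the transcripts by a text cleaner.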

In [None]:
# Change epochs, batch_size, save_step, eval_split_size, etc. according to your requirements.

config = VitsConfig(
    audio=audio_config,
    run_name="vits_ljspeech_ly",
    batch_size=4,
    eval_batch_size=4,
    # num_loader_workers=8,
    # num_loader_workers=4,
    # num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1,
    save_step=1,
    save_checkpoints=True,
    # save_n_checkpoints=4,
    save_best_after=1,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    eval_split_size=25,
)
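A rough count of training steps per epoch helps when choosing `save_step` and `print_step`. The figure 13080 is the number of training instances reported in the training log further down; counting the last partial batch as a full step is an assumption.

```python
import math

train_instances = 13080  # "Number of instances" in the training log
batch_size = 4

steps_per_epoch = math.ceil(train_instances / batch_size)
print(steps_per_epoch)
# 3270
```

If `save_step` is a step interval, as its name suggests, then `save_step=1` writes a checkpoint at every one of these steps, which is slow and uses a lot of disk space. A larger value is usually more practical for a real run.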


In [None]:
from TTS.utils.audio import AudioProcessor
ap = AudioProcessor.init_from_config(config)

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024


In [None]:
from TTS.tts.utils.text.tokenizer import TTSTokenizer
tokenizer, config = TTSTokenizer.init_from_config(config)

In [None]:
def formatter(root_path, manifest_file, **kwargs):  # pylint: disable=unused-argument
    """Read `<file id>|<text>` or LJSpeech-style `<file id>|<raw text>|<normalized text>` lines."""
    txt_file = os.path.join(root_path, manifest_file)  # path to the transcript file
    items = []
    speaker_name = "my_speaker"
    with open(txt_file, "r", encoding="utf-8") as ttf:
        for line in ttf:
            cols = line.strip().split("|")
            if len(cols) < 2:
                continue  # skip blank or malformed lines
            wav_file = os.path.join(root_path, "wavs", f"{cols[0]}.wav")  # path to the audio file
            # Use the third (normalized) column when it exists, otherwise the second.
            text = cols[2] if len(cols) > 2 else cols[1]
            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name, "root_path": root_path})
    return items

In [None]:
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=25,  # change based on the requirement
    eval_split_size=20,  # change based on the requirement
    formatter=formatter,
)

 | > Found 13100 files in /content/tts_data/LJSpeech-1.1
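The two split arguments interact. The sketch below mirrors the behaviour described in the library's docstring as I understand it: values above 1 are an absolute sample count, and values up to 1 are a fraction capped by `eval_split_max_size`. `resolve_eval_size` is a hypothetical helper, not part of the library, so treat it as an assumption and check the docstring of your installed version.

```python
def resolve_eval_size(num_items, eval_split_size, eval_split_max_size=None):
    """Hypothetical helper: how many samples end up in the evaluation set."""
    if eval_split_size > 1:
        # Values above 1 are treated as an absolute number of samples.
        return int(eval_split_size)
    # Otherwise the value is a fraction of the dataset, optionally capped.
    size = int(num_items * eval_split_size)
    if eval_split_max_size:
        size = min(eval_split_max_size, size)
    return size


num_files = 13100  # "Found 13100 files" in the output above
eval_count = resolve_eval_size(num_files, eval_split_size=20, eval_split_max_size=25)
print(eval_count, num_files - eval_count)
# 20 13080
```

The remaining 13080 samples match the number of training instances reported in the training log.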


In [None]:
from trainer import Trainer, TrainerArgs

# init model
model = Vits(config, ap, tokenizer, speaker_manager=None)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: True
 | > Precision: fp16
 | > Num. of CPUs: 2
 | > Num. of Torch Threads: 1
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=tts_data/vits_ljspeech_ly-June-05-2024_02+22PM-0000000

 > Model has 83059180 parameters


As shown below, the model started training, but because training took too long, I stopped the run. You can train it further if you have more powerful hardware.

In [None]:
trainer.fit()


[4m[1m > EPOCH: 0/1[0m
 --> tts_data/vits_ljspeech_ly-June-05-2024_02+22PM-0000000

[1m > TRAINING (2024-06-05 14:22:41) [0m




> DataLoader initialization
| > Tokenizer:
	| > add_blank: True
	| > use_eos_bos: False
	| > use_phonemes: True
	| > phonemizer:
		| > phoneme language: en
		| > phoneme backend: gruut
| > Number of instances : 13080
 | > Preprocessing samples
 | > Max text length: 188
 | > Min text length: 13
 | > Avg text length: 100.92461773700306
 | 
 | > Max audio length: 222643.0
 | > Min audio length: 24499.0
 | > Avg audio length: 145011.88073394494
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.


Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:873.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
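The audio lengths in the log above are easier to read in seconds. This sketch assumes the logged values are sample counts at the configured 22050 Hz sample rate.

```python
sample_rate = 22050  # from audio_config

# Values copied from the "Max/Min/Avg audio length" lines of the log.
lengths = {"max": 222643.0, "min": 24499.0, "avg": 145011.88073394494}
seconds = {name: samples / sample_rate for name, samples in lengths.items()}

for name, value in seconds.items():
    print(f"{name}: {value:.2f} s")
```

Under that assumption the clips run from about 1.1 to 10.1 seconds, with an average near 6.6 seconds.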


Once the model is trained, you can load your checkpoint and config file to synthesize speech and listen to the output.

In [None]:
!tts --text "Hello, it's great connecting with you" \
      --model_path /path/to/trained/checkpoints/ \
      --config_path /path/to/config/file/ \
      --out_path out.wav

# Play the synthesized audio inside the notebook.
import IPython.display
IPython.display.Audio("out.wav")
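The path placeholders in the command above need to point at files from your own training run. Here is a small sketch for finding the most recent checkpoint, assuming the run folder stores checkpoints with a `.pth` extension (check your output folder, since file names may differ between versions). `latest_checkpoint` is a hypothetical helper.

```python
import glob
import os


def latest_checkpoint(run_dir):
    """Return the most recently modified .pth file in run_dir, or None."""
    paths = glob.glob(os.path.join(run_dir, "*.pth"))
    if not paths:
        return None
    return max(paths, key=os.path.getmtime)
```

For example, `latest_checkpoint("tts_data/vits_ljspeech_ly-June-05-2024_02+22PM-0000000")` would search the run folder shown in the training log.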