**FINE TUNE OR TRAIN A VITS MODEL WITH COQUI TTS FRAMEWORK USING CUSTOM DATASET**

Scripts from https://youtu.be/MU5157dKOHM

(1) RUN CELL BELOW, THEN RESTART RUNTIME

In [None]:
!sudo apt-get install espeak-ng
# core deps
!pip install numpy==1.21.6
!pip install cython==0.29.28
!pip install "scipy>=1.4.0"
!pip install "torch>=1.7"
!pip install torchaudio
!pip install soundfile
!pip install librosa==0.8.0
!pip install numba==0.55.1
!pip install inflect==5.6.0
!pip install tqdm
!pip install anyascii
!pip install pyyaml
!pip install "fsspec>=2021.04.0"
!pip install packaging
# deps for examples
!pip install flask
# deps for inference
!pip install pysbd
# deps for notebooks
!pip install umap-learn==0.5.1
!pip install pandas
# deps for training
!pip install matplotlib
# coqui stack
!pip install trainer==0.0.20
# config management
!pip install "coqpit>=0.0.16"
# chinese g2p deps
!pip install jieba
!pip install pypinyin
# japanese g2p deps
!pip install mecab-python3==1.0.5
!pip install unidic-lite==1.0.8
# gruut+supported langs
!pip install "gruut[de]==2.2.3"
# deps for korean
!pip install jamo
!pip install nltk
!pip install "g2pkk>=0.1.1"
!pip install TTS
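
After the install cell and the runtime restart, it can help to confirm that the packages with minimum versions actually meet them. The following is a minimal sketch using only the standard library; `parse_version`, `meets_minimum`, and `check_minimums` are hypothetical helper names, and the version parsing is deliberately simplistic (it ignores suffixes such as `+cu118`).

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.21.6' into (1, 21, 6); anything after the leading digits of a piece is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, piece by piece."""
    return parse_version(installed) >= parse_version(minimum)


def check_minimums(minimums):
    """Return {package: installed version, or None if missing} for packages below their minimum."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = None
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = installed
    return problems


# Minimums copied from the install cell; an empty dict means nothing was flagged
print(check_minimums({"scipy": "1.4.0", "torch": "1.7", "fsspec": "2021.04.0", "coqpit": "0.0.16"}))
```

A package mapped to `None` is not installed at all; a package mapped to a version string is installed but older than the minimum.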

**(2) Run this cell to connect your Google Drive**

In [None]:
from google.colab import drive

drive.mount("/content/drive")

**(3) Set paths and then run the next cell**

ds_name is the dataset directory (will be created).

output_directory is the training storage directory, a subdirectory of ds_name (will be created).

upload_dir is where your samples are stored (will be created).

MODEL_FILE is the default path to the VITS model downloaded using Coqui (you do not need to change it).
Default model paths: /root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth or 
/root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth

RUN_NAME is a short name describing your training run.

In [None]:
import os

ds_name = "sop" #@param {type:"string"}
output_directory = "traineroutput" #@param {type:"string"}
upload_dir = "upload" #@param {type:"string"}
MODEL_FILE = "/root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth" #@param {type:"string"}
upload_dir = "/content/drive/MyDrive/" + upload_dir
RUN_NAME = "VITS-fi" #@param {type:"string"}


OUT_PATH = "/content/drive/MyDrive/"+ds_name+"/traineroutput/"
# -p creates missing parent directories and does not fail if the directory already exists
!mkdir -p $upload_dir
!mkdir -p /content/drive/MyDrive/$ds_name/wavs/
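
For reference, the same directory layout can be built in plain Python instead of shell commands. This is a sketch only: `dataset_paths` and `ensure_dirs` are hypothetical helpers, and the Drive root is assumed to be `/content/drive/MyDrive` as in the cell above.

```python
from pathlib import Path


def dataset_paths(drive_root, ds_name, output_directory, upload_dir):
    """Build the directories used by this notebook from the form values."""
    root = Path(drive_root)
    dataset = root / ds_name
    return {
        "dataset": dataset,
        "wavs": dataset / "wavs",
        "output": dataset / output_directory,
        "upload": root / upload_dir,
    }


def ensure_dirs(paths):
    """Create every directory; parents are created and existing ones are left alone."""
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)


paths = dataset_paths("/content/drive/MyDrive", "sop", "traineroutput", "upload")
print(paths["output"])
# Call ensure_dirs(paths) once Google Drive is mounted
```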

**(4) Set run type.**

continue resumes an interrupted session.

restore begins a new session from the default model file above (download it from the Coqui Hub using the download cell later on).

restore-ckpt begins a new session from a prior fine-tuned checkpoint. You can set this later on in the training section in Part 2.

newmodel begins a new training session with an empty VITS model.

In [None]:
run_type = "restore-ckpt" #@param ["continue","restore","restore-ckpt","newmodel"]
print(run_type + " run selected")
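
The four run types differ only in where training starts from. The sketch below spells that out as a pure function; `resolve_start_point` is a hypothetical helper, while `continue_path` and `restore_path` are the `TrainerArgs` argument names used in the trainer cell in Part 2.

```python
def resolve_start_point(run_type, output_root, run_folder=None, ckpt_file=None, model_file=None):
    """Return (TrainerArgs argument name, path) for a run type, or (None, None) for a new model."""
    if run_type == "continue":
        # Resume from a run directory; the trainer picks the checkpoint itself
        return "continue_path", f"{output_root}/{run_folder}"
    if run_type == "restore":
        # Start a new run from the downloaded base model file
        return "restore_path", model_file
    if run_type == "restore-ckpt":
        # Start a new run from one specific fine-tuned checkpoint file
        return "restore_path", f"{output_root}/{run_folder}/{ckpt_file}"
    if run_type == "newmodel":
        # Train from an empty model, nothing to load
        return None, None
    raise ValueError(f"unknown run_type: {run_type}")


print(resolve_start_point(
    "restore-ckpt",
    "/content/drive/MyDrive/sop/traineroutput",
    run_folder="sop-March-05-2023_11+22AM-0000000",
    ckpt_file="best_model_7569.pth",
))
```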

# 5 ONLY ONCE FOR EACH DATASET
**Download and Build Rnnoise (https://github.com/xiph/rnnoise) and Requirements**

In [None]:
#@title
!pip install pyloudnorm
!git clone https://github.com/xiph/rnnoise.git
!sudo apt-get install curl autoconf automake libtool python-dev pkg-config sox ffmpeg
%cd /content/rnnoise
!sh autogen.sh
!sh configure
!make clean
!make

**Install Sox, Install OpenAI Whisper STT+Translation (https://github.com/openai/whisper)**

In [None]:
#@title
%cd /content
!sudo apt install sox
!git clone https://github.com/openai/whisper.git
!pip install git+https://github.com/openai/whisper.git 


**Install Coqui TTS** (https://github.com/coqui-ai/TTS), espeak-ng phonemizer (https://github.com/espeak-ng/espeak-ng), download Coqui TTS source and examples from GitHub.

In [None]:
#@title
%cd /content
!sudo apt-get install espeak-ng
!git clone https://github.com/coqui-ai/TTS.git
%cd TTS
!pip install -e .[all,dev,notebooks]
!pip install TTS

**(Optional) List pretrained models available on the Coqui Hub**

In [None]:
#@title
!tts --list_models

**List folders in sample upload directory**

Google Drive can become desynchronized when uploading files with the web interface.  If your sample folder doesn't show up here, wait, or use the desktop application.  If Colab can't see the folder, it can't access the samples.

In [None]:
#@title
%cd $upload_dir
!ls -al

**Set the sample uploads subfolder name to process.**

subfolder is the folder inside upload_dir that holds the samples to process.

newspeakername is the name for the new speaker.  This will be the speaker ID, stored in the model as VCTK_{name}.

You can give both the same name.

In [None]:
subfolder = "finnish2" #@param {type:"string"}
newspeakername = "finnish" #@param {type:"string"}

**Processing Options**

This section converts mp3, ogg, and wav files in upload_dir to 22050 Hz mono wav files, then passes the wav files through rnnoise.

The rnnoise output is then segmented on 0.2 second silences (click "show code" below and change 0.2 in the sox line to a different silence duration if needed).

A 50 Hz highpass and an 8000 Hz lowpass filter are applied, gain is reduced to limit potential clipping, and loudness is normalized to -25 dB LUFS. A -6 dB peak-normalized copy is also computed, but the file that gets written is the loudness-normalized one. Should be fine for general purpose.

Segmented audio is then passed through sox again to force-split any long segments (above 9 seconds) into shorter segments. Files smaller than 35 kB are deleted.

Files are renamed and converted to .flac format for the dataloader.

In [None]:
run_denoise = "True" #@param ["True", "False"]
run_splits = "True" #@param ["True", "False"]
use_audio_filter = "True" #@param ["True", "False"]
normalize_audio = "True" #@param ["True", "False"]
#start_sil_dur = 0.2 #@param {type:"number"}
#end_sil_dur = 0.2 #@param {type:"number"}
#sample_max = 8 #@param {type:"number"}
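
To make the "-6 dB peak" target concrete: peak normalization scales the whole clip so that its largest absolute sample equals the target level, where a level in dB maps to a linear amplitude of 10^(dB/20). The sketch below works on a plain list of floats and is an illustration only; the processing cell uses pyloudnorm on arrays, and `db_to_amplitude` and `peak_normalize` are hypothetical names.

```python
def db_to_amplitude(db):
    """Convert a level in dB (relative to full scale) to a linear amplitude."""
    return 10.0 ** (db / 20.0)


def peak_normalize(samples, target_db=-6.0):
    """Scale samples so the largest absolute value equals the target peak level."""
    if not samples:
        return []
    peak = max(abs(s) for s in samples)
    if peak == 0:
        # Pure silence cannot be scaled to a non-zero peak
        return list(samples)
    gain = db_to_amplitude(target_db) / peak
    return [s * gain for s in samples]


clip = [0.0, 0.25, -0.5, 0.1]
normalized = peak_normalize(clip, -6.0)
print(max(abs(s) for s in normalized))  # about 0.501, i.e. -6 dB relative to full scale
```

Loudness normalization to -25 dB LUFS works on a perceptual loudness measure (BS.1770) rather than on the peak, so the two targets are independent of each other.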


**Run processing**

In [None]:
#@title

from pathlib import Path
import os
import subprocess
import soundfile as sf
import pyloudnorm as pyln
import sys
import glob
%cd $upload_dir
%cd $subfolder
#!ls -al
!rm -rf $upload_dir/$subfolder/22k_1ch
!mkdir $upload_dir/$subfolder/22k_1ch

!find . -name '*.mp3' -exec bash -c 'for f; do ffmpeg -hide_banner -loglevel error -i "$f" -acodec pcm_s16le -ar 22050 -ac 1 22k_1ch/"${f%.mp3}".wav ; done' _ {} +
!find . -name '*.ogg' -exec bash -c 'for f; do ffmpeg -hide_banner -loglevel error -i "$f" -acodec pcm_s16le -ar 22050 -ac 1 22k_1ch/"${f%.ogg}".wav ; done' _ {} +
!find . -name '*.wav' -exec bash -c 'for f; do ffmpeg -hide_banner -loglevel error -i "$f" -acodec pcm_s16le -ar 22050 -ac 1 22k_1ch/"${f%.wav}".wav ; done' _ {} +
!ls -al $upload_dir/$subfolder/22k_1ch
print("Files converted to 22khz 1ch wav")
%cd $upload_dir
%cd $subfolder
!rm temp.raw
!rm rnn.raw
%cd 22k_1ch
!find . -name "*.wav" -type f -size -35k -delete
%cd $upload_dir
#Convert and resample uploaded mp3/wav clips to 1 channel, 22khz

#
if run_denoise=="True":
  print("Running denoise...")
  orig_wavs= upload_dir + '/' + subfolder + "/22k_1ch/"
  print(orig_wavs)

  rnn = "/content/rnnoise/examples/rnnoise_demo"
  paths = glob.glob(os.path.join(orig_wavs, '*.wav'))
  for filepath in paths:
    base = os.path.basename(filepath)
    tp_s = upload_dir + '/' + subfolder + "/22k_1ch/denoise/"
    tf_s = upload_dir + '/' + subfolder + "/22k_1ch/denoise/" + base
    target_path = Path(tp_s)
    target_file = Path(tf_s)
    print("From: " + str(filepath))
    print("To: " + str(target_file))
	

  
  # Stereo to Mono; upsample to 48000Hz
  # added -G to fix gain, -v 0.8
    subprocess.run(["sox", "-G", "-v", "0.8", filepath, "48k.wav", "remix", "-", "rate", "48000"])
    subprocess.run(["sox", "48k.wav", "-c", "1", "-r", "48000", "-b", "16", "-e", "signed-integer", "-t", "raw", "temp.raw"]) # convert wav to raw
    subprocess.run(["/content/rnnoise/examples/rnnoise_demo", "temp.raw", "rnn.raw"]) # apply rnnoise
    subprocess.run(["sox", "-G", "-v", "0.8", "-r", "48k", "-b", "16", "-e", "signed-integer", "rnn.raw", "-t", "wav", "rnn.wav"]) # convert raw back to wav

    subprocess.run(["mkdir", "-p", str(target_path)])
    if use_audio_filter=="True":
      print("Running highpass/lowpass & resample")
      subprocess.run(["sox", "rnn.wav", str(target_file), "remix", "-", "highpass", "50", "lowpass", "8000", "rate", "22050"]) 
      # apply high/low pass filter and change sr to 22050Hz
      data, rate = sf.read(target_file)
    elif use_audio_filter=="False":
      print("Running resample without filter")
      subprocess.run(["sox", "rnn.wav", str(target_file), "remix", "-", "rate", "22050"]) 
      # no filtering in this branch: only downmix and change sr to 22050Hz
      data, rate = sf.read(target_file)
# peak normalize audio to -6 dB
    if normalize_audio=="True":
      print("Output normalized")
      peak_normalized_audio = pyln.normalize.peak(data, -6.0)

# measure the loudness first
      meter = pyln.Meter(rate) # create BS.1770 meter
      loudness = meter.integrated_loudness(data)

# loudness normalize audio to -25 dB LUFS
      loudness_normalized_audio = pyln.normalize.loudness(data, loudness, -25.0)
      sf.write(target_file, data=loudness_normalized_audio, samplerate=22050)
      print("")
    elif normalize_audio=="False":
      print("File written without normalizing")
      sf.write(target_file, data=data, samplerate=22050)
      print("")

  # 48k.wav and rnn.wav are written to the current directory (upload_dir), not to target_path
  !rm -f rnn.wav 48k.wav

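elif run_denoise=="False":
  # orig_wavs is only set in the denoise branch above, so define it here as well
  orig_wavs = upload_dir + '/' + subfolder + "/22k_1ch/"
  paths = glob.glob(os.path.join(orig_wavs, '*.wav'))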
  for filepath in paths:
    print("Skipping denoise...")
    base = os.path.basename(filepath)
    tp_s = upload_dir + '/' + subfolder + "/22k_1ch/denoise/"
    tf_s = upload_dir + '/' + subfolder + "/22k_1ch/denoise/" + base
    target_path = Path(tp_s)
    target_file = Path(tf_s)
    print("From: " + str(filepath))
    print("To: " + str(target_file))
    subprocess.run(["sox", "-G", "-v", "0.8", filepath, "48k.wav", "remix", "-", "rate", "48000"])
    subprocess.run(["sox", "48k.wav", "-c", "1", "-r", "48000", "-b", "16", "-e", "signed-integer", "-t", "raw", "temp.raw"]) # convert wav to raw
    # rnnoise is skipped in this branch, so convert the un-denoised temp.raw back to wav
    subprocess.run(["sox", "-G", "-v", "0.8", "-r", "48k", "-b", "16", "-e", "signed-integer", "temp.raw", "-t", "wav", "rnn.wav"]) # convert raw back to wav
    subprocess.run(["mkdir", "-p", str(target_path)])
    if use_audio_filter=="True":
      print("Running filter...")
      subprocess.run(["sox", "rnn.wav", str(target_file), "remix", "-", "highpass", "50", "lowpass", "8000", "rate", "22050"]) # apply high/low pass filter and change sr to 22050Hz
      data, rate = sf.read(target_file)
    elif use_audio_filter=="False":
      print("Skipping filter...")
      subprocess.run(["sox", "rnn.wav", str(target_file), "remix", "-", "rate", "22050"]) # no filtering: only downmix and change sr to 22050Hz
      data, rate = sf.read(target_file)
          # peak normalize audio to -6 dB
    if normalize_audio=="True":
      print("Output normalized")
      peak_normalized_audio = pyln.normalize.peak(data, -6.0)

# measure the loudness first
      meter = pyln.Meter(rate) # create BS.1770 meter
      loudness = meter.integrated_loudness(data)

# loudness normalize audio to -25 dB LUFS
      loudness_normalized_audio = pyln.normalize.loudness(data, loudness, -25.0)
      sf.write(target_file, data=loudness_normalized_audio, samplerate=22050)
      print("")
    if normalize_audio=="False":
      print("File written without normalizing")
      sf.write(target_file, data=data, samplerate=22050)
      print("")
  # 48k.wav and rnn.wav are written to the current directory (upload_dir), not to target_path
  !rm -f rnn.wav 48k.wav

if run_splits=="False":
  print("Copying files without splitting...")
  %mkdir /content/drive/MyDrive/$ds_name
  %mkdir /content/drive/MyDrive/$ds_name/wav48_silence_trimmed
  %mkdir /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername

  !cp $target_path/*.wav /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername
if run_splits=="True":
  %mkdir /content/drive/MyDrive/$ds_name
  %mkdir /content/drive/MyDrive/$ds_name/wav48_silence_trimmed
  %mkdir /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername  
  print("Splitting output and copying...")
  %cd $target_path
  !rm -rf splits
  !mkdir splits
  !for FILE in *.wav; do sox "$FILE" splits/"$FILE" --show-progress silence 1 0.2 0.1% 1 0.2 0.1% : newfile : restart ; done
#alt split method: force splits of 8 seconds, however this will split words. Comment the above with # and remove the # below to change
#!for FILE in *.wav; do sox "$FILE" splits/"$FILE" --show-progress trim 0 8 : restart ; done
  %cd splits
  !mkdir resplit
  !for FILE in *.wav; do sox "$FILE" resplit/"$FILE" --show-progress trim 0 9 : newfile : restart ; done
  %cd resplit
  !find . -name "*.wav" -type f -size -35k -delete
  #!ls -al
  %cd /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername

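#  !ls -al
if run_splits=="True":
  # Only a split run produces splits/resplit; unsplit files were already copied above
  !cp $target_path/splits/resplit/*.wav /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername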
%cd /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername
!rm *.flac
!find . -name '*.wav' -exec bash -c 'for f; do ffmpeg -i "$f" -c:a flac "${f%.wav}"_mic1.flac ; done' _ {} +
!rm *.wav
!ls /content/drive/MyDrive/$ds_name/wav48_silence_trimmed/$newspeakername
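
The forced re-split and the size filter can be reasoned about with a little arithmetic. The sketch below is an illustration under stated assumptions: `forced_split_points` mimics cutting a clip into consecutive chunks of at most 9 seconds, and `approx_clip_seconds` estimates the duration of a 16-bit mono 22050 Hz wav from its size while ignoring the small wav header. Both names are hypothetical.

```python
def forced_split_points(duration, max_len=9.0):
    """Return (start, end) pairs that cover the clip in chunks of at most max_len seconds."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + max_len, duration)
        segments.append((start, end))
        start = end
    return segments


def approx_clip_seconds(size_bytes, sample_rate=22050, bytes_per_sample=2):
    """Rough duration of a mono PCM wav of the given size, header ignored."""
    return size_bytes / (sample_rate * bytes_per_sample)


print(forced_split_points(20.0))
print(round(approx_clip_seconds(35 * 1024), 2))  # clips under roughly this many seconds are deleted
```

By this estimate the 35 kB threshold removes fragments shorter than about 0.8 seconds, which are usually too short to carry a usable transcript.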

**Run Whisper on generated audio clips.**
Transcripts will be formatted for the VCTK-style dataset and placed in the<br>
->dataset directory<br>
---->txt<br>
-------->newspeakername<br>

# 6 ONLY ONCE FOR EACH DATASET
**Select Whisper STT model. Large-v2 slowest, most accurate, most memory usage. Free Colab users may need to use medium.en.  Select model, run next cell to load it.**

**Run the load cell only once. Reloading Whisper STT models may crash your Colab session.**

-> Changed model to medium, language to finnish

In [None]:
whisper_model = "medium" #@param ["large-v2", "large-v1", "medium", "small.en", "base.en"]


In [None]:
#@title
import whisper
import os, os.path
import glob
import pandas as pd

from pathlib import Path


#model = whisper.load_model("medium.en")
model = whisper.load_model(whisper_model)

List speakers in dataset directory

In [None]:
#@title
!ls /content/drive/MyDrive/$ds_name/wav48_silence_trimmed

New speaker to process for transcription

In [None]:
newspeakername = "finnish" #@param {type:"string"}


**Run this cell to transcribe clips using Whisper.**

To process additional speakers, set a new subfolder/newspeakername above, and remember to re-run that cell before running the one below.

Whisper transcription language

In [None]:
whisper_lang = "finnish" #@param {type:"string"}
whisper_lang = whisper_lang.lower()


In [None]:
# Decoding options passed to model.transcribe(); whisper_lang sets the transcription language
options = dict(language=whisper_lang, beam_size=5, best_of=5)
transcribe_options = dict(task="transcribe", **options)

#@title
wavs = '/content/drive/MyDrive/'+ds_name+'/wav48_silence_trimmed/'+newspeakername


"""
if os.path.exists(modelpath):
    if os.path.isfile(modelpath+"large-v2.pt"):
        print("Saved large-v2 found")
        model = whisper.load_model(download_root=modelpath,name="large-v2")
else:
    print("loading model from Huggingface")
    model = whisper.load_model("large-v2")
"""
paths = glob.glob(os.path.join(wavs, '*.flac'))
print(len(paths))
all_filenames = []
transcript_text = []
# Create <dataset>/txt/<speaker>/ (and parents); no error if it already exists
os.makedirs('/content/drive/MyDrive/'+ds_name+'/txt/'+newspeakername+'/', exist_ok=True)

for filepath in paths:
    base = os.path.basename(filepath)
    all_filenames.append(base)
    # Pass the options so the selected language and beam search settings are actually used
    result = model.transcribe(filepath, **transcribe_options)
    output = result["text"].lstrip()
    output = output.replace("\n","")
    print(output)
    # Drop the extension, then the trailing "_mic1" (5 characters)
    thefile = str(os.path.basename(filepath).lstrip(".")).rsplit(".")[0]
    thefile = thefile[:-5]
    print(thefile)
    outfilepath = '/content/drive/MyDrive/'+ds_name+'/txt/'+newspeakername+'/'+thefile+'.txt'
    with open(outfilepath, 'w', encoding='utf-8') as indfile:
        indfile.write(output)
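
The naming rule used by the transcription cell (clip `name_mic1.flac` gets transcript `txt/<speaker>/name.txt`) can be written as a small function. This is a sketch; `transcript_path` is a hypothetical helper and the paths in the example are made up.

```python
from pathlib import PurePosixPath


def transcript_path(flac_path, dataset_root, speaker, suffix="_mic1"):
    """Map a clip such as 'clip001_mic1.flac' to '<dataset_root>/txt/<speaker>/clip001.txt'."""
    stem = PurePosixPath(flac_path).stem  # file name without the .flac extension
    if stem.endswith(suffix):
        stem = stem[: -len(suffix)]
    return str(PurePosixPath(dataset_root) / "txt" / speaker / (stem + ".txt"))


print(transcript_path(
    "/content/drive/MyDrive/sop/wav48_silence_trimmed/finnish/clip001_mic1.flac",
    "/content/drive/MyDrive/sop",
    "finnish",
))
```

Unlike slicing off a fixed five characters, this only strips `_mic1` when it is actually present, so a file without the suffix keeps its full name.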

Perform the steps above for each new speaker.  You should have a folder arranged like this for your dataset.<br>
datasetfolder<br>
->txt<br>
---->nameone<br>
---->nametwo<br>
->wav48_silence_trimmed<br>
---->nameone<br>
---->nametwo<br>

**Check dataset for empty transcript files. Set dataset name and speaker. Process each speaker individually.  Bad file sets will be moved to a folder called badfiles in your dataset directory.**

In [None]:
ds_name2 = "sop" #@param {type:"string"}
newspeakername2 = "finnish2" #@param {type:"string"}

Scan files, move broken pairs

In [None]:
#@title
import os, os.path
import glob
import shutil

flac = '/content/drive/MyDrive/'+ds_name2+'/wav48_silence_trimmed/'+newspeakername2+'/'
wav = '/content/drive/MyDrive/'+ds_name2+'/wavs/'
txt = '/content/drive/MyDrive/'+ds_name2+'/txt/'+newspeakername2+'/'
outfilepath = '/content/drive/MyDrive/'+ds_name2+'/txt/'+newspeakername2
backup_path = '/content/drive/MyDrive/'+ds_name2+'/badfiles/'
if not os.path.exists(backup_path):
    os.makedirs(backup_path)
searchstr = "(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S"
txtfiles = glob.glob(os.path.join(txt, '*.txt'))
EMPTY_TRANSCRIPTS = []
for txts in txtfiles:
    # A zero-byte transcript means no text was produced for that clip
    if os.path.getsize(txts) == 0:
        print(str(txts))
        basename_without_ext = os.path.splitext(os.path.basename(txts))[0]
        print(basename_without_ext)
        # Files that belong to this clip. The .wav is usually gone already,
        # because the processing cell deletes wavs after converting them to flac.
        candidates = [
            (wav, basename_without_ext+".wav"),
            (txt, basename_without_ext+".txt"),
            (flac, basename_without_ext+"_mic1.flac"),
        ]
        for folder, name in candidates:
            if os.path.exists(folder+name):
                print(name, os.path.getsize(folder+name))
                shutil.move(folder+name, backup_path+name)





# TRAINING

**Part 2 - Training**

(1) Download TTS model

(2) Load Tensorboard and dashboard

(3) Set training variables, load trainer, train

**OPTIONAL, CURRENTLY NOT USED SINCE TRAINING NEW MODEL INSTEAD OF FINE TUNING EXISTING ONE**
**Download VITS model and generate a sample wav file at /content/ljspeech-vits.wav. This file will be deleted when your Colab session is closed.**

In [None]:
#@title
!tts --text "Olen suomalainen malli ja nyt minua jatkokoulutetaan." --model_name "tts_models/fi/css10/vits" --out_path /content/ljspeech-vits.wav
#!tts --text "I am the very model of a modern Major General" --model_name "tts_models/en/vctk/vits" --out_path /content/ljspeech-vits.wav

**Load Tensorboard**

In [None]:
import torch 
%load_ext tensorboard

**Load Dashboard**
May take several minutes to appear from a blank white box.  Ad blockers probably need to whitelist a bunch of Colab stuff or this won't work.

In [None]:
%tensorboard --logdir /content/drive/MyDrive/$ds_name/$output_directory/

**If continuing a run: use the next cell to list all run directories.**

**Copy and paste the run you want to continue, or restore a checkpoint from, into the next box**

In [None]:
#@title
!ls -al /content/drive/MyDrive/$ds_name/traineroutput

**Run folder to continue from or Run folder that contains your restore checkpoint**

In [None]:
run_folder = "sop-March-05-2023_11+22AM-0000000" #@param {type:"string"}


List checkpoints in run folder. The checkpoint only needs to be selected for a restore run.

Continuing a run loads the last best-loss checkpoint on its own, according to the config.json stored in the run directory. A continue run is given a directory; a restore run is given a model file.

In [None]:
#@title
!ls -al /content/drive/MyDrive/$ds_name/traineroutput/$run_folder

**If changing to a different "restore" checkpoint to begin a new training session with a model you are already training, set the checkpoint filename here**

In [None]:
ckpt_file = "best_model_7569.pth" #@param {type:"string"}
print(ckpt_file + " selected for restore run")
if run_type=="continue":
  print("Warning:\n restore checkpoint selected, but run type set to continue.\nTrainer will load best loss from checkpoint directory.\n Are you sure this is what you want to do?\n\nIf not, change the run type below to 'restore'")
elif run_type=="restore-ckpt":
  print("Warning:\n restore checkpoint selected, run type set to restore from selected checkpoint, not default base model.\nIf this is not correct, adjust the run type.")
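
If a run folder holds many checkpoints, the most recent one can be picked by its step number instead of by eye. The sketch below assumes file names of the form `best_model_<step>.pth` and `checkpoint_<step>.pth`; the first matches the example above, the second is an assumption about the trainer's naming. `latest_checkpoint` is a hypothetical helper.

```python
import re

# Assumed naming scheme: <prefix>_<step>.pth
CKPT_PATTERN = re.compile(r"^(best_model|checkpoint)_(\d+)\.pth$")


def latest_checkpoint(filenames, prefix="best_model"):
    """Return the file name with the highest step for the given prefix, or None if there is none."""
    best = None
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match and match.group(1) == prefix:
            step = int(match.group(2))
            if best is None or step > best[0]:
                best = (step, name)
    return best[1] if best else None


listing = ["best_model.pth", "best_model_1200.pth", "best_model_7569.pth",
           "checkpoint_8000.pth", "config.json"]
print(latest_checkpoint(listing))
```

The listing would normally come from `os.listdir` on the run folder; files that do not match the pattern, such as `config.json`, are ignored.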


**Last chance to change run type**

In [None]:
run_type = "restore-ckpt" #@param ["continue","restore","restore-ckpt","newmodel"]
print(run_type + " run selected")

**Run the next cells in order to begin training**

In [None]:
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig, CharactersConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.bin.compute_embeddings import compute_embeddings
from TTS.tts.utils.data import get_length_balancer_weights
from TTS.tts.utils.languages import LanguageManager, get_language_balancer_weights
from TTS.tts.utils.speakers import SpeakerManager, get_speaker_balancer_weights, get_speaker_manager
from TTS.tts.utils.text.characters import Graphemes

In [None]:
  #output_path = os.path.dirname(os.path.abspath(__file__))
output_path = os.path.dirname("/content/drive/MyDrive/"+ds_name+"/traineroutput/")
print(output_path)
SKIP_TRAIN_EPOCH=False
#https://github.com/coqui-ai/TTS/releases/tag/speaker_encoder_model
## Extract speaker embeddings

SPEAKER_ENCODER_CHECKPOINT_PATH = (
    "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar"
)

SPEAKER_ENCODER_CONFIG_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json"
SKIP_TRAIN_EPOCH = False
BATCH_SIZE = 16
SAMPLE_RATE = 22050
MAX_AUDIO_LEN_IN_SECONDS = 10
NUM_RESAMPLE_THREADS = 10


dataset_config = BaseDatasetConfig(
    formatter="vctk", meta_file_train="", phonemizer="espeak", dataset_name="sop", language="fi", path="/content/drive/MyDrive/"+ds_name
)

print(dataset_config)
DATASETS_CONFIG_LIST = [dataset_config]
D_VECTOR_FILES=[]

In [None]:
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)

vitsArgs = VitsArgs(
    use_d_vector_file=True,
    d_vector_file=D_VECTOR_FILES,
    d_vector_dim=512,
    num_layers_text_encoder=6,
    embedded_language_dim=4,
    speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
    use_language_embedding=False,
    use_speaker_embedding=False,
    use_speaker_encoder_as_loss=True,
    use_sdp=True,
)

config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="sop",
    max_audio_len=SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS,
    min_text_len=1,
    min_audio_len=1,
    #max_text_len=325,
    batch_size=12,
    eval_batch_size=4,
    batch_group_size=1,
    num_loader_workers=2,
    num_eval_loader_workers=2,
    eval_split_max_size=256,
    #eval_split_size=0.014084507042253521,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=10000,
    save_step=1000,
    save_checkpoints=True,
    save_best_after=1000,
    save_n_checkpoints=4,
    #use_weighted_sampler=True,
    use_weighted_sampler=False,
    start_by_longest=True,
    weighted_sampler_attrs={"speaker_name": 1.0},
    weighted_sampler_multipliers={"speaker_name": {}},    
    speaker_encoder_loss_alpha=9.0,
    text_cleaner="multilingual_cleaners",
    use_phonemes=True,
    phoneme_language="fi",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    characters=CharactersConfig(
      characters_class="TTS.tts.utils.text.characters.Graphemes",
      vocab_dict=None,
      pad="<PAD>",
      eos="<EOS>",
      bos="<BOS>",
      blank="<BLNK>",
      characters="abcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491",
      punctuations="!'(),-.:;? ",
      #phonemes=None,
      is_unique=True,
      is_sorted=True
    ),
    compute_input_seq_cache=True,
    print_step=50,
    print_eval=True,
    mixed_precision=False,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    test_sentences=[
        [
            "Sateenkaari on spektrin v\u00e4reiss\u00e4 esiintyv\u00e4 ilmakeh\u00e4n optinen ilmi\u00f6. Se syntyy, kun valo taittuu pisaran etupinnasta, heijastuu pisaran takapinnasta ja taittuu j\u00e4lleen pisaran etupinnasta.",
            #"css10",
            None,
            "fi",
        ],
    ]
)
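
The `max_audio_len` value in the config is a sample count, not a duration. The worked arithmetic below uses the values from the cells above (22050 Hz, 10 seconds, hop length 256); `clip_fits` is a hypothetical helper, and the inclusive bounds are an assumption about how the loader filters clips.

```python
SAMPLE_RATE = 22050
MAX_AUDIO_LEN_IN_SECONDS = 10
HOP_LENGTH = 256

# 22050 samples per second for 10 seconds
max_audio_len = SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS

# Rough number of spectrogram frames for the longest allowed clip
approx_frames = max_audio_len // HOP_LENGTH


def clip_fits(num_samples, min_len=1, max_len=max_audio_len):
    """True if a clip of num_samples would be kept (bounds assumed inclusive)."""
    return min_len <= num_samples <= max_len


print(max_audio_len, approx_frames)
print(clip_fits(9 * SAMPLE_RATE), clip_fits(11 * SAMPLE_RATE))
```

Since the processing step force-splits clips at 9 seconds, a 10 second limit should keep every processed clip.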

In [None]:
# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

In [None]:
for dataset_conf in DATASETS_CONFIG_LIST:
    # Check if the embeddings weren't already computed, if not compute it
    print(dataset_conf.path)
    embbase=str(dataset_conf.dataset_name)
    #embeddings_file = MODEL_DIR+"speakers.pth"
    embeddings_file = os.path.join(dataset_conf.path, embbase+"_speakers.pth")
    print(embeddings_file)
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        print(SPEAKER_ENCODER_CHECKPOINT_PATH)
        print(SPEAKER_ENCODER_CONFIG_PATH)
        print(embeddings_file)
        print(dataset_conf.formatter)
        print(dataset_conf.dataset_name)
        print(dataset_conf.path)
        print(dataset_conf.meta_file_train)
        print(dataset_conf.meta_file_val)
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
        )
    D_VECTOR_FILES.append(embeddings_file)

In [None]:
speaker_manager = SpeakerManager(
    d_vectors_file_path=D_VECTOR_FILES,
    encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH)

In [None]:
tokenizer, config = TTSTokenizer.init_from_config(config)


In [None]:
model = Vits(config, ap, tokenizer, speaker_manager)

In [None]:
train_samples, eval_samples = load_tts_samples(
    DATASETS_CONFIG_LIST,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)


**Display complete character set from all datasets:**

In [None]:
#@title
import os
import re

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.bin.compute_embeddings import compute_embeddings
from TTS.tts.utils.data import get_length_balancer_weights
from TTS.tts.utils.languages import LanguageManager, get_language_balancer_weights
from TTS.tts.utils.speakers import SpeakerManager, get_speaker_balancer_weights, get_speaker_manager

from tqdm.contrib.concurrent import process_map

from TTS.config import load_config
from TTS.tts.datasets import load_tts_samples
from TTS.tts.utils.text.phonemizers import espeak_wrapper
from TTS.tts.utils.text.phonemizers import ESpeak
import multiprocessing

def compute_phonemes(item):
    text = item["text"]
    ph = phonemizer.phonemize(text).replace("|", "")
    return set(list(ph))
train_samples, eval_samples = load_tts_samples(
    DATASETS_CONFIG_LIST,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

#from TTS/bin/find_unique_chars.py
items = train_samples + eval_samples

texts = "".join(item["text"] for item in items)
chars = set(texts)
lower_chars = filter(lambda c: c.islower(), chars)
chars_force_lower = [c.lower() for c in chars]
chars_force_lower = set(chars_force_lower)

print(f" > Number of unique characters: {len(chars)}")
print(f" > Unique characters: {''.join(sorted(chars))}")
print(f" > Unique lower characters: {''.join(sorted(lower_chars))}")
print(f" > Unique all forced to lower characters: {''.join(sorted(chars_force_lower))}")
#https://www.geeksforgeeks.org/python-convert-string-to-unicode-characters/
char_str = str(chars)
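
The character listing is most useful when compared against the vocabulary configured in `CharactersConfig`. The sketch below reports characters that occur in the transcripts but not in the vocabulary. `missing_characters` is a hypothetical helper, the sample vocabulary is shortened for illustration, and lowercasing before the comparison is an assumption about what the text cleaner does.

```python
def missing_characters(texts, characters, punctuations, lowercase=True):
    """Return the sorted characters found in texts that the vocabulary does not cover."""
    vocab = set(characters) | set(punctuations)
    joined = "".join(texts)
    if lowercase:
        joined = joined.lower()
    return sorted(set(joined) - vocab)


sample_vocab = "abcdefghijklmnopqrstuvwxyz\u00e4\u00f6"  # a-z plus ä and ö, illustration only
sample_punct = "!'(),-.:;? "
print(missing_characters(["Hyvää päivää!", "Åland 3"], sample_vocab, sample_punct))
```

In the notebook, `texts` would be `[item["text"] for item in items]` and the vocabulary the `characters` and `punctuations` strings from the config. The configured `characters` string appears to skip U+00E5 (`å`), which a check like this would flag if any transcript contains it.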


In [None]:
print("Current reinit_text_encoder value: " + str(config.model_args.reinit_text_encoder))
reinit_te_status = "False" #@param ["False", "True"]
if reinit_te_status=="False":
  print("Text encoder will not be reinitialized")
elif reinit_te_status=="True":
  config.model_args.reinit_text_encoder=True
  print("Model arguments set to reinitialize text encoder")
print("Current reinit_DP value: " + str(config.model_args.reinit_DP))
reinit_DP_status = "False" #@param ["False", "True"]
if reinit_DP_status=="False":
  print("DP will not be reinitialized")
elif reinit_DP_status=="True":
  config.model_args.reinit_DP=True
  print("Model arguments set to reinitialize DP")
print("Current freeze_waveform_decoder value: " + str(config.model_args.freeze_waveform_decoder))
freeze_waveform_decoder_status = "False" #@param ["False", "True"]
if freeze_waveform_decoder_status=="False":
  print("Waveform decoder will NOT be frozen")
  config.model_args.freeze_waveform_decoder=False
elif freeze_waveform_decoder_status=="True":
  config.model_args.freeze_waveform_decoder=True
  print("Waveform decoder FROZEN")
print("Current freeze_flow_decoder value: " + str(config.model_args.freeze_flow_decoder))
freeze_flow_decoder_status = "False" #@param ["False", "True"]
if freeze_flow_decoder_status=="False":
  print("Flow decoder will NOT be frozen")
  config.model_args.freeze_flow_decoder=False
elif freeze_flow_decoder_status=="True":
  config.model_args.freeze_flow_decoder=True
  print("Flow decoder FROZEN")
print("Current freeze_encoder value: " + str(config.model_args.freeze_encoder))
freeze_encoder_status = "False" #@param ["False", "True"]
if freeze_encoder_status=="False":
  print("Text encoder will NOT be frozen")
  config.model_args.freeze_encoder=False
elif freeze_encoder_status=="True":
  config.model_args.freeze_encoder=True
  print("Text encoder FROZEN")
print("Current freeze_DP value: " + str(config.model_args.freeze_DP))
freeze_DP_status = "False" #@param ["False", "True"]
if freeze_DP_status=="False":
  print("Duration predictor will NOT be frozen")
  config.model_args.freeze_DP=False
elif freeze_DP_status=="True":
  config.model_args.freeze_DP=True
  print("Duration predictor FROZEN")        

**Initialize the trainer**

In [None]:
#@title
print(run_type)

# Pick the trainer arguments for the selected run type
if run_type=="continue":
  # Resume an interrupted run from its run directory
  CONTINUE_PATH="/content/drive/MyDrive/"+ds_name+"/traineroutput/"+run_folder
  trainer_args = TrainerArgs(continue_path=CONTINUE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type=="restore":
  # New run starting from the downloaded base model file
  trainer_args = TrainerArgs(restore_path=MODEL_FILE, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type=="restore-ckpt":
  # New run starting from a selected fine-tuned checkpoint
  trainer_args = TrainerArgs(restore_path="/content/drive/MyDrive/"+ds_name+"/traineroutput/"+run_folder+"/"+ckpt_file, skip_train_epoch=SKIP_TRAIN_EPOCH)
elif run_type=="newmodel":
  # New run with an empty model
  trainer_args = TrainerArgs()

trainer = Trainer(
  trainer_args,
  config,
  output_path=OUT_PATH,
  model=model,
  train_samples=train_samples,
  eval_samples=eval_samples,
)

**CHECK THAT TENSORBOARD IS RUNNING ABOVE, THEN RUN THIS TO TRAIN**

In [None]:
trainer.fit()

**SECTION FOR GENERATING SPEECH. NOT CONFIGURED YET**

In [None]:
!tts --model_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/best_model_1003928.pth \
--config_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/config.json \
--list_speaker_idxs \
--text ""

In [None]:
out_wav_file ="/content/drive/MyDrive/me-mmj.wav"

In [None]:
!tts --model_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/best_model_1003928.pth \
--config_path /content/drive/MyDrive/vits-vctk-multi-ds/traineroutput/vits_vctk-February-09-2023_12+12AM-914280a5/config.json \
--speaker_idx VCTK_me \
--text "I am the very model of a modern Major-General,\
 I've information vegetable, animal, and mineral, \
 I know the kings of England, and I quote the fights historical \
 From Marathon to Waterloo, in order categorical; \
 I'm very well acquainted, too, with matters mathematical, \
 I understand equations, both the simple and quadratical, \
  About binomial theorem I'm teeming with a lot o' news, \
  With many cheerful facts about the square of the hypotenuse." \
  --out_path $out_wav_file 

In [None]:
from IPython.display import Audio
from IPython.display import display
wn = Audio(out_wav_file, autoplay=False) ##
display(wn)##