# 6 - Train TTS

Based on [this colab](https://colab.research.google.com/drive/1N_B_38MMRk1BUqwI_C829TyGpNmppUqK?usp=sharing#scrollTo=m_HkOd4jwqIb) and [this other one](https://colab.research.google.com/drive/1q2mhEiclQVyNe20U9fLbzVobfDiLtXSQ?usp=sharing#scrollTo=PS3jyscLSDEc).




## 1 - Setup

In [1]:
%%capture
!pip install pyloudnorm
!git clone https://github.com/xiph/rnnoise.git
!sudo apt-get install curl autoconf automake libtool python-dev pkg-config sox
%cd /content/rnnoise
!sh autogen.sh
!sh configure
!make clean
!make

In [2]:
%%capture
%cd /content
!sudo apt-get install espeak-ng
!git clone https://github.com/coqui-ai/TTS.git
!pip install TTS

In [3]:
from IPython.display import Audio
import librosa
from google.colab import drive
from pathlib import Path
import shutil
import os
import subprocess
import soundfile as sf
import pyloudnorm as pyln
import sys
import glob
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import torch

In [4]:
def display_audio(path):
  x, sr = librosa.load(path)
  display(Audio(x, rate=sr))

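def save_for_interence(input, output=None):
  """Save a copy of a checkpoint without the discriminator weights.

  If `output` is None, the copy is written next to `input`, with
  "_inference" appended to the first part of the file name.
  """
  input = Path(input)
  if output is None:
    name_parts = input.name.split('.')
    name_parts[0] = f"{name_parts[0]}_inference"
    name = '.'.join(name_parts)
    output = input.parent / name
  else:
    output = Path(output)
  # Load the checkpoint and drop the discriminator weights ("disc." prefix)
  model = torch.load(input)
  keys = [k for k in model["model"].keys() if k.startswith('disc.')]
  for k in keys:
    del model["model"][k]

  torch.save(model, output)

  return output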

def read_text(text, 
              model_path,
              config_path="/root/.local/share/tts/tts_models--es--css10--vits/config.json",
              out_path="/content/example.wav"):
  """Read a text using a model"""
  status = subprocess.run(["tts", 
                           "--text", text, 
                           "--model_path", str(model_path),
                           "--config_path", str(config_path), 
                           "--out_path", str(out_path)])
  if status.returncode:
    raise RuntimeError(f"Process finished with error {status}")
  return Path(out_path)

Mount drive

In [5]:
drive.mount('/content/drive/')

Mounted at /content/drive/


## 2 - Choose pretrained model
We will use the model tts_models/es/css10/vits, pretrained on a male voice in Spanish.

The available models can be listed with
```
!tts --list_models
```

**The next cell must be run so that the pretrained model is downloaded to the cache folder.**

The model should be stored at _/root/.local/share/tts/tts_models--es--css10--vits_

In [6]:
!tts --text "Es el vecino el que elige el alcalde y es el alcalde el que quiere que sean los vecinos el alcalde, fin de la cita." --model_name "tts_models/es/css10/vits" --out_path /content/example.wav

 > Downloading model to /root/.local/share/tts/tts_models--es--css10--vits
100% 101M/101M [00:01<00:00, 52.6MiB/s]
 > Model's license - bsd-3-clause
 > Check https://opensource.org/licenses for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-

In [None]:
x, sr = librosa.load("/content/example.wav")
Audio(x, rate=sr)
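As a quick sanity check after the download, the cache folder name can be derived from the model name by replacing `/` with `--`, which matches the path shown above. The sketch below assumes that naming convention and that the folder contains `config.json` and `model_file.pth.tar` (the two files this notebook reads later). The helper names are illustrative and not part of the TTS package.

```python
from pathlib import Path

# Cache root seen in this notebook (Colab, running as root).
CACHE_ROOT = Path("/root/.local/share/tts")

def cached_model_dir(model_name, cache_root=CACHE_ROOT):
    """Map 'tts_models/es/css10/vits' to its cache folder (assumed naming)."""
    return Path(cache_root) / model_name.replace("/", "--")

def missing_model_files(model_dir, expected=("config.json", "model_file.pth.tar")):
    """Return the expected files that are not present in model_dir."""
    return [name for name in expected if not (Path(model_dir) / name).exists()]

model_dir = cached_model_dir("tts_models/es/css10/vits")
print(model_dir)
# Expected to be an empty list once the download cell has finished
print(missing_model_files(model_dir))
```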

## 3.A - Preprocessing and store dataset in drive (skip if done)

Copy the dataset from Drive and unpack it

In [None]:
dataset_path = "/content/drive/MyDrive/Máster/DLASP/Final/dataset_clean.zip"
shutil.copy(dataset_path, "/content/dataset_clean.zip")
dataset_original = Path('/content/dataset/')
dataset_original.mkdir(exist_ok=True)
shutil.unpack_archive("/content/dataset_clean.zip", dataset_original)

Process dataset

In [None]:
dataset_processed = Path("/content/dataset_processed")
dataset_processed.mkdir(exist_ok=True)

rnn = "/content/rnnoise/examples/rnnoise_demo"

#paths = Path(src).glob("**/*.wav")
#paths = Path(orig_wavs).glob("**/*.wav")
paths = dataset_original.glob("*.wav")

for filepath in tqdm(list(paths), leave=False):
  target_filepath= dataset_processed / filepath.name

  subprocess.run(["sox", "-G", "-v", "0.95", filepath, "48k.wav", "remix", "-", "rate", "48000"])
  # convert wav to raw
  subprocess.run(["sox", "48k.wav", "-c", "1", "-r", "48000", "-b", "16", "-e", "signed-integer", "-t", "raw", "temp.raw"])
 
  # apply rnnoise
  subprocess.run([rnn, "temp.raw", "rnn.raw"])

  # convert raw back to wav
  subprocess.run(["sox", "-G", "-v", "0.95", "-r", "48k", "-b", "16", "-e", "signed-integer", "rnn.raw", "-t", "wav", "rnn.wav"])

  # apply high/low pass filter and change sr to 22050Hz
  subprocess.run(["sox", "rnn.wav", str(target_filepath), "remix", "-", "highpass", "100", "lowpass", "7000", "rate", "22050"])
  data, rate = sf.read(target_filepath)

  # peak normalize audio to -1 dB
  # NOTE: this result is not used below; only the loudness-normalized audio is written
  peak_normalized_audio = pyln.normalize.peak(data, -1.0)

  # measure the loudness first
  meter = pyln.Meter(rate) # create BS.1770 meter
  loudness = meter.integrated_loudness(data)

  # loudness normalize audio to -25 dB LUFS
  loudness_normalized_audio = pyln.normalize.loudness(data, loudness, -25.0)
  sf.write(target_filepath, data=loudness_normalized_audio, samplerate=rate)
  

  0%|          | 0/4643 [00:00<?, ?it/s]
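Both normalization steps in the loop above come down to multiplying the signal by a constant gain, where a change of `d` dB corresponds to a factor of `10 ** (d / 20)`. The sketch below illustrates that arithmetic in plain Python. The function names are hypothetical, and the loudness value passed in stands in for what a BS.1770 meter would measure; it is not computed here.

```python
def loudness_gain(measured_lufs, target_lufs=-25.0):
    """Linear gain that moves a signal from measured_lufs to target_lufs."""
    return 10.0 ** ((target_lufs - measured_lufs) / 20.0)

def peak_gain(samples, target_db=-1.0):
    """Linear gain that brings the largest absolute sample to target_db (dBFS)."""
    current_peak = max(abs(s) for s in samples)
    return 10.0 ** (target_db / 20.0) / current_peak

samples = [0.0, 0.25, -0.5, 0.1]
g = peak_gain(samples)
print(max(abs(s * g) for s in samples))  # about 0.891, i.e. -1 dBFS
print(loudness_gain(-19.0))              # about 0.501, i.e. a 6 dB reduction
```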

Compare an original file with its processed version

In [None]:
n = 3
display_audio(dataset_original / f"segment{n}.wav")
display_audio(dataset_processed / f"segment{n}.wav")

Save to drive

In [None]:
shutil.make_archive("dataset_processed", 'zip', dataset_processed)
shutil.copy("dataset_processed.zip", "/content/drive/MyDrive/Máster/DLASP/Final/dataset_processed.zip" )
metadata = pd.read_csv(dataset_original / "metadata.csv")
metadata.to_csv("/content/drive/MyDrive/Máster/DLASP/Final/metadata.csv", index=False)

## 3.B - Load dataset processed from drive

Copy metadata file and zip with wavs

In [7]:
shutil.copy("/content/drive/MyDrive/Máster/DLASP/Final/dataset_processed.zip", 
            "dataset_processed.zip")
metadata = pd.read_csv("/content/drive/MyDrive/Máster/DLASP/Final/metadata.csv")

Uncompress zip

In [8]:
dataset_processed = Path("/content/dataset_processed")
dataset_processed.mkdir(exist_ok=True)
shutil.unpack_archive("dataset_processed.zip", dataset_processed)
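After unpacking, it can be worth confirming that every file listed in the metadata is actually present in the folder. The following is a minimal sketch with a hypothetical helper; it assumes the metadata lists bare file names (the column is called `filename` before the rename done later in this notebook).

```python
import tempfile
from pathlib import Path

def missing_audio_files(filenames, folder):
    """Return the sorted file names that do not exist inside folder."""
    folder = Path(folder)
    return sorted(name for name in filenames if not (folder / name).is_file())

# In the notebook this could be called as:
#   missing_audio_files(metadata.filename, dataset_processed)

# Small self-contained demonstration with a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "segment1.wav").touch()
    print(missing_audio_files(["segment1.wav", "segment2.wav"], tmp))
    # ['segment2.wav']
```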

In [9]:
shutil.copy("/content/drive/MyDrive/Máster/DLASP/Final/best_model_476866.pth",
            "/content/nach_base_model.pth")

'/content/nach_base_model.pth'

In [10]:
shutil.copy("/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth2.pth",
            "/content/nach_base_model_2.pth")

'/content/nach_base_model_2.pth'

## 4 - Format metadata file and test pretrained model


### 4.1 - Format metadata
Clean strange characters and remove sentences that contain numbers (I didn't find a normalizer that replaces numbers with their Spanish text version).

We keep a total of 4596 sentences.

Store the metadata file in *dataset_processed / "metadata.csv"*

In [11]:
metadata = pd.read_csv("/content/drive/MyDrive/Máster/DLASP/Final/metadata.csv")
songs_selected = [84, 85,86,87,88,90,91,94,95]
metadata = metadata[metadata.song.str.split('_').str[0].astype(int).isin(songs_selected)].copy()

In [12]:
#metadata = pd.read_csv("/content/drive/MyDrive/Máster/DLASP/Final/metadata.csv")
# Replace strange characters
replacements = {"'": '', 
                '…': '',
                '%': ' por ciento',
                'î': 'i',
                'ê':'e',
                'è':'e',
                'е':'e',
                'к': 'k',
                'т':'t',
                '-' : ' ',
                '¡' : '',
                '¿' : '',
                #'é': 'e', # For some reason é is not encoded in the model
                'ü': 'u',
                }
for k, v in replacements.items():
  metadata.text = metadata.text.str.replace(k, v, regex=False)

metadata.text = metadata.text.str.strip()

# Convert to lowercase
metadata.text = metadata.text.str.lower()

# Remove sentences with numbers (any sentence containing a digit)
metadata = metadata[~metadata.text.str.contains(r"[0-9]", regex=True)]

# Need to set the same speaker as in the fine-tuned example
metadata["speaker_name"] = "tux"

# Rename columns to match coqui formatter
metadata = metadata.rename(
    columns={"filename":"audio_file"}).drop(
        columns=["index", "song"])
    
metadata = metadata.reset_index(drop=True).copy()

# Store in dataset_processed / metadata.csv
metadata.to_csv(dataset_processed / "metadata.csv", sep='|', index=False)

metadata.head()

| | text | audio_file | speaker_name |
|---|---|---|---|
| 0 | la calle es un zoo ilógico, caótico, tecnológico. | segment479.wav | tux |
| 1 | hermético, netárgico, mágico, paradójico. | segment481.wav | tux |
| 2 | la calle es un zoo ilógico, donde demasiados b... | segment482.wav | tux |
| 3 | ladran y ladran mientras sus egos nunca dicen ... | segment483.wav | tux |
| 4 | y a cada nueva pisada una mirada de amenaza. | segment484.wav | tux |


Check that no strange characters remain

In [13]:
np.unique(list(''.join(list(metadata.text))))

array([' ', ',', '.', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
       'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
       'w', 'x', 'y', 'z', 'á', 'é', 'í', 'ñ', 'ó', 'ú'], dtype='<U1')
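Section 4.1 drops every sentence that contains a digit because no Spanish number normalizer was at hand. As a rough illustration of what such a normalizer could look like, the sketch below spells out integers from 0 to 99 and leaves larger numbers untouched, so the digit filter would still discard those sentences. The helper names are hypothetical, and this is not what was used to build the dataset; packages such as `num2words` may cover the general case.

```python
import re

# Spanish words for 0..29 (these have irregular, single-word forms)
UNDER_30 = [
    "cero", "uno", "dos", "tres", "cuatro", "cinco", "seis", "siete", "ocho",
    "nueve", "diez", "once", "doce", "trece", "catorce", "quince", "dieciséis",
    "diecisiete", "dieciocho", "diecinueve", "veinte", "veintiuno", "veintidós",
    "veintitrés", "veinticuatro", "veinticinco", "veintiséis", "veintisiete",
    "veintiocho", "veintinueve",
]
TENS = {3: "treinta", 4: "cuarenta", 5: "cincuenta", 6: "sesenta",
        7: "setenta", 8: "ochenta", 9: "noventa"}

def number_to_spanish(n):
    """Spell out an integer between 0 and 99 in Spanish."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 30:
        return UNDER_30[n]
    tens, units = divmod(n, 10)
    return TENS[tens] if units == 0 else f"{TENS[tens]} y {UNDER_30[units]}"

def normalize_numbers(text):
    """Replace standalone runs of digits up to 99 with words; leave the rest as is."""
    def repl(match):
        value = int(match.group())
        return number_to_spanish(value) if value <= 99 else match.group()
    return re.sub(r"[0-9]+", repl, text)

print(normalize_numbers("tengo 2 gatos y 45 libros"))
# tengo dos gatos y cuarenta y cinco libros
```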

### 4.2 - Test the cached pretrained model
Check that the cached model was downloaded correctly (to the expected path)

In [None]:
!tts --text "La cerámica de talavera no es cosa menor. Dicho de otro modo, es cosa mayor." --model_path /content/nach_base_model_inference.pth --config_path /root/.local/share/tts/tts_models--es--css10--vits/config.json > /dev/null

display_audio("tts_output.wav")

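Make a Frankenstein model: every weight except the discriminator (`disc.` keys) is overwritten with the original pretrained weights, so only the fine-tuned discriminator is kept.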

In [16]:
model_original = torch.load("/root/.local/share/tts/tts_models--es--css10--vits/model_file.pth.tar")
model_nach = torch.load("/content/nach_base_model_2.pth")

In [18]:

keys = [k for k in model_nach["model"].keys() if not k.startswith('disc.')]
for k in keys:
  model_nach["model"][k] = model_original["model"][k]

torch.save(model_nach, "/content/model_frankestein.pth")
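The key-by-key merge above can be pictured with plain dictionaries standing in for the checkpoint's `"model"` entry. This is a sketch only: the function name and the example keys are made up, and unlike the cell above it returns a new dictionary instead of modifying the fine-tuned one in place.

```python
def frankenstein_state(finetuned, pretrained, keep_prefix="disc."):
    """Keep keys starting with keep_prefix from finetuned; take the rest from pretrained."""
    merged = dict(finetuned)
    for key in finetuned:
        if key.startswith(keep_prefix):
            continue  # discriminator weights stay as fine-tuned
        if key not in pretrained:
            raise KeyError(f"'{key}' is missing from the pretrained weights")
        merged[key] = pretrained[key]
    return merged

# Hypothetical keys and values, for illustration only
finetuned = {"disc.layer.weight": 1.0, "text_encoder.weight": 2.0}
pretrained = {"text_encoder.weight": 20.0}
print(frankenstein_state(finetuned, pretrained))
# {'disc.layer.weight': 1.0, 'text_encoder.weight': 20.0}
```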



## 5.1 - Prepare training

Path where the training will be stored

In [19]:
output_path = Path("/content/output/")
output_path.mkdir(exist_ok=True)

Create training script

In [20]:
training_script_content = r"""
from TTS.config import load_config

from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import CharactersConfig, Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

from TTS.tts.utils.languages import LanguageManager
from TTS.tts.utils.speakers import SpeakerManager
from pathlib import Path

# Folder where the pretrained model is cached
pretrained_folder = Path('/root/.local/share/tts/tts_models--es--css10--vits/')

# Load base configuration
vits_config = load_config(str(pretrained_folder / 'config.json'))
vits_config.output_path="/content/output/"
vits_config.run_name='tts_models--es--nach--vits'
vits_config.model_args.num_speakers = 1
vits_config.lr = 0.00001
#vits_config.lr_gen = vits_config.lr_gen / 100
vits_config.save_all_best = True
# Need discriminator for training
vits_config.init_discriminator = True
vits_config.model_args.init_discriminator = True
vits_config.epochs = 1000
vits_config.save_step = 900

# Colab standard only has 2 threads
vits_config.num_loader_workers = 2
vits_config.num_eval_loader_workers = 2

# Override dataset config
vits_config.datasets = BaseDatasetConfig(
    formatter="coqui", 
    meta_file_train="metadata.csv", 
    path='/content/dataset_processed',
    language="es",
)

# Override test sentences
vits_config.test_sentences=[
    ['Un arcoíris\u200b o arco iris es un fenómeno óptico y meteorológico que '
    'causa la aparición en la atmósfera terrestre de un arco multicolor.',
     'tux', None, 'es']
]

# Override languages id file 
language_ids_file = str(pretrained_folder / 'language_ids.json')
vits_config.language_ids_file = language_ids_file
vits_config.model_args.language_ids_file = language_ids_file
# Audio processor
ap = AudioProcessor.init_from_config(vits_config)

# Load tokenizer
tokenizer, config = TTSTokenizer.init_from_config(vits_config)

# Load the training and eval samples
train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# Load speaker and language managers
speaker_manager = SpeakerManager.init_from_config(vits_config)
language_manager = LanguageManager(language_ids_file)

# Define model
model = Vits(config, ap, tokenizer, 
             speaker_manager=speaker_manager,
             language_manager=language_manager)


trainer_args = TrainerArgs()

trainer = Trainer(
    args=trainer_args,
    config=config,
    output_path='/content/output/',
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

trainer.fit()
"""


In [21]:
training_file = output_path / "training.py"
with open(training_file, 'w') as f:
  f.write(training_script_content)
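Because the training script is kept as a string, a typo in it only surfaces when training is launched. A cheap guard is to parse the string before writing it to disk. The sketch below uses the standard `ast` module with a hypothetical helper name; it checks syntax only and does not verify imports or configuration values.

```python
import ast

def check_script_syntax(source, filename="training.py"):
    """Return None if source parses as Python, otherwise a short error message."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as err:
        return f"{filename}:{err.lineno}: {err.msg}"
    return None

# In the notebook this could be called as:
#   check_script_syntax(training_script_content)
print(check_script_syntax("trainer.fit()"))  # None
print(check_script_syntax("trainer.fit("))   # an error message
```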

## 5.2 - Training

In [22]:
!CUDA_VISIBLE_DEVICES="0" python /content/output/training.py --restore_path /content/model_frankestein.pth

Streaming output truncated to the last 5000 lines.
	| > add_blank: True
	| > use_eos_bos: False
	| > use_phonemes: False
	| > 1 not found characters:
	| > ​
| > Number of instances : 342
 | > Preprocessing samples
 | > Max text length: 95
 | > Min text length: 12
 | > Avg text length: 50.70645161290322
 | 
 | > Max audio length: 132102.0
 | > Min audio length: 33318.0
 | > Avg audio length: 81678.6
 | > Num. instances discarded samples: 32
 | > Batch group size: 0.

 > TRAINING (2022-12-10 12:44:48)

   --> STEP: 5/10 -- GLOBAL_STEP: 479275
     | > loss_disc: 2.60106  (2.59402)
     | > loss_disc_real_0: 0.26934  (0.20979)
     | > loss_disc_real_1: 0.17141  (0.20721)
     | > loss_disc_real_2: 0.20782  (0.22524)
     | > loss_disc_real_3: 0.17720  (0.22626)
     | > loss_disc_real_4: 0.19260  (0.24712)
     | > loss_disc_real_5: 0.24706  (0.23624)
     | > loss_0: 2.60106  (2.59402)
     | > grad_norm_0: 11.32108  (9.16580)
     | > loss_gen: 
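To follow the losses across many steps, the console lines above can be turned into numbers with a small regular expression. The sketch below is an assumption-laden helper, not part of the trainer: it expects lines shaped like `| > name: value  (value)` and treats the value in parentheses as the running average, which is how the log appears to be laid out.

```python
import re

# Matches e.g. "| > loss_disc: 2.60106  (2.59402)"
LOSS_LINE = re.compile(r"\|\s*>\s*(\w+):\s*(-?\d+(?:\.\d+)?)\s*\((-?\d+(?:\.\d+)?)\)")

def parse_losses(log_text):
    """Return {name: (current, running_average)} for every matching log line."""
    return {
        m.group(1): (float(m.group(2)), float(m.group(3)))
        for m in LOSS_LINE.finditer(log_text)
    }

sample = """
     | > loss_disc: 2.60106  (2.59402)
     | > loss_disc_real_0: 0.26934  (0.20979)
     | > grad_norm_0: 11.32108  (9.16580)
     | > loss_gen:
"""
for name, (current, average) in parse_losses(sample).items():
    print(f"{name}: current={current}, average={average}")
```

Lines without a numeric value, such as the truncated `loss_gen` entry, are skipped.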

In [None]:
save_for_interence('/content/output/tts_models--es--nach--vits-December-09-2022_09+54PM-0000000/checkpoint_480600.pth')

PosixPath('/content/output/tts_models--es--nach--vits-December-09-2022_09+54PM-0000000/checkpoint_480600_inference.pth')

In [None]:
shutil.copy("/content/output/tts_models--es--nach--vits-December-09-2022_08+52PM-0000000/best_model_477387.pth",
            "/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth1.pth")
            

'/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth1.pth'

In [None]:
shutil.copy("/content/output/tts_models--es--nach--vits-December-09-2022_09+54PM-0000000/best_model_478429.pth",
            "/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth2.pth")
            

'/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth2.pth'

In [None]:
save_for_interence("/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth2.pth")

PosixPath('/content/drive/MyDrive/Máster/DLASP/Final/nach_smooth2_inference.pth')
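The cells above copy checkpoints by typing their step numbers by hand. A small helper can pick the newest one from a run folder instead. This sketch assumes the file naming seen in this notebook (`best_model_<step>.pth`, `checkpoint_<step>.pth`); the function name is hypothetical.

```python
import re
from pathlib import Path

def newest_checkpoint(folder, prefix="best_model"):
    """Return the '<prefix>_<step>.pth' file with the highest step, or None."""
    pattern = re.compile(rf"{re.escape(prefix)}_(\d+)\.pth$")
    candidates = []
    for path in Path(folder).glob(f"{prefix}_*.pth"):
        match = pattern.match(path.name)
        if match:  # skips files such as '<prefix>_<step>_inference.pth'
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]

# In the notebook this could be called as:
#   newest_checkpoint(training_path)                       # best model
#   newest_checkpoint(training_path, prefix="checkpoint")  # latest checkpoint
```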

Create a backup in Drive (~20 GB)

In [None]:
training_folder = "/content/output/tts_models--es--nach--vits-December-06-2022_03+32PM-0000000"
shutil.make_archive("nach_training_backup", 
                    "zip", training_folder)
shutil.copy("nach_training_backup.zip", 
            "/content/drive/MyDrive/Máster/DLASP/Final/nach_training_backup.zip")

'/content/drive/MyDrive/Máster/DLASP/Final/nach_training_backup.zip'

## Test model

In [25]:
# Select checkpoint
training_path = Path("/content/output/tts_models--es--nach--vits-December-10-2022_12+22PM-0000000/")
model_path = training_path / "checkpoint_479700.pth"
# Prepare model for inference (remove discriminator network)
# Only needs to be done once
model_inference_path = save_for_interence(model_path)

# # Test
# text = ("Mi padre es el sol, mi madre la luna. "
#         "Mi hermano es el viento  y el planeta tierra mi cuna. "
#         "Mis unicos hijos son las frases que me invento, "
#         "y mi mayor regalo es vivir este momento")
wav_path = read_text(text=text, model_path=model_inference_path)

#display_audio(wav_path)
wav_path

PosixPath('/content/example.wav')

In [26]:
shutil.copy('/content/output/tts_models--es--nach--vits-December-10-2022_12+22PM-0000000/best_model_478950.pth',
            '/content/drive/MyDrive/Máster/DLASP/Final/Nach_smooth4.pth')

'/content/drive/MyDrive/Máster/DLASP/Final/Nach_smooth4.pth'

In [23]:
text = """Me llaman vida, porque resurjo en cualquier parte
Me llaman luz, me llaman paz, me llaman arte
Me llaman tiempo porque dicen que todo lo curo
Me llaman muerte, porque allí donde estés, llegaré seguro
Me llaman símbolo, me llaman traición
Aquellos que al ver mi imagen se ahogan en su frustración
Me llaman y no pronuncian ningún nombre
Me llaman semi-Dios, y se olvidan que soy un hombre
Me llaman cambio, precursor, presumido y déspota
Me llaman visionario adelantado a mi época

Me llaman agua, fuego, tierra, me llaman viento
Me llaman tormenta porque en cada aliento, libero lineas de sentimientos
Me llaman estatua, porque disfruto estando solo
Me llaman mar, porque saben que nunca me conocerán del todo
Me llaman lágrima, quizás por las lecciones que enseño

Me llaman fugitivo, porque nunca, nunca, tuve dueño
Me llaman tantas cosas para bien o para mal
Hermosas o venenosas formas de hacerme inmortal
Me aman o me odian, me quieren o me rechazan
Me llaman, para entregarme sus halagos, su amenaza

Me llaman caricia porque mis palabras recorren tu piel
Me llaman pájaro, porque sé volar cuando me entrego al papel
Me llaman infiel, me llaman ingenuo, cobarde, hipócrita y maestro
Me llaman Las Vegas por lo que apuesto
Me llaman wall street por lo que arriesgo
Por mis abrazos me llaman oso, por mi rabia, tigre
Me llaman calle, no por peligroso, sino por impredecible
Me llaman mago, druida, amigo y guía
Me llaman inocencia perdida, por mi sabiduría
Me llaman sonrisa por lo sincero, me llaman fiero y caballero

Porque dejo que las frases siempre pasen primero
Me llaman títere, desviado, payaso
¿Supongo que soy lo que ellos deben ser acaso?
Me llaman genio y demonio, me llaman furia
Me llaman manicomio porque guardo dentro aquello que otros repudian
Me llaman agitador, provocador, polémico

Sin dinero me llaman triste loco, con dinero divertido excéntrico
Me llaman hermético, me llaman virus y germen
Me llaman disparo, quizás porque nunca han podido detenerme
Me llaman pero no me vuelvo, me llaman rata, me llaman enfermo
Me llaman manhattan porque nunca duermo
Me llaman desierto porque parezco eterno
Me llaman tantas cosas para bien o para mal
Hermosas o venenosas formas de hacerme inmortal
Me aman o me odian, me quieren o me rechazan
Me llaman, para entregarme sus halagos, su amenaza"""