# VITS Training

This notebook is a guide to training VITS as part of the TTS pipeline. It contains the following sections:

  1. VITS and NeMo - An introduction to the VITS model
  2. LJSpeech - How to train VITS on LJSpeech

# License

> Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All Rights Reserved.
> 
> Licensed under the Apache License, Version 2.0 (the "License");
> you may not use this file except in compliance with the License.
> You may obtain a copy of the License at
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.

In [None]:
"""
You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL).
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator).
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect.
"""

# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!apt-get install -y sox libsndfile1 ffmpeg
# Quote the version specifier so the shell does not treat ">=3.3.2" as an output redirect
!pip install wget text-unidecode "matplotlib>=3.3.2"
!pip install -Uqq boto3 datasets

## Install NeMo
BRANCH = 'main'
!pip install --upgrade pip
!python -m pip install "git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" 
# !python -m pip install "git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" --target=/kaggle/working/

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

In [None]:
!pip install "nemo_toolkit[tts]"

### TODO: install nemo once
1. install nemo in /kaggle/working
2. copy nemo lib from /kaggle/working/ to kaggle python lib path

# VITS and NeMo

VITS is a neural network that converts text characters into an audio sample. For more details on the model, please refer to NVIDIA's [VITS Model Card](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_lj_vits), or the original [paper](https://arxiv.org/abs/2106.06103).

VITS, like most NeMo models, is defined as a LightningModule, which allows for easy training via PyTorch Lightning. It is parameterized by a configuration, currently defined in a yaml file and loaded using Hydra.

Let's take a look at NeMo's pretrained model and how to use it to generate audio.

In [None]:
# Load the VITSModel
from nemo.collections.tts.models import VitsModel
from nemo.collections.tts.models.base import TextToWaveform

# Let's see what pretrained models are available
print(VitsModel.list_available_models())

In [None]:
# We can load the pre-trained model as follows
model = VitsModel.from_pretrained("tts_en_lj_vits")

In [None]:
# VITS is a TextToWaveform
assert isinstance(model, TextToWaveform)

`TextToWaveform` models in NeMo have two helper functions:
   1. `parse(self, str_input: str)`, which takes an English string and produces a token tensor
   2. `convert_text_to_waveform(self, *, tokens)`, which takes the token tensor and generates an audio sample

Both are used later in this notebook, in the "Evaluating VITS" section.

# Training

Now that we have looked at the VITS model, let's see how to train one.

In [None]:
BRANCH = 'main'

# NeMo's training scripts are stored inside the examples/ folder. Let's grab the vits.py file
# as well as the vits.yaml file
!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/vits.py
!(mkdir -p conf \
  && cd conf \
  && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/vits.yaml \
  && cd ..)

# additional files
!mkdir -p tts_dataset_files && cd tts_dataset_files \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-052722 \
&& cd ..

Let's take a look at the vits.py file

```python
import pytorch_lightning as pl

from nemo.collections.tts.models.vits import VitsModel
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager

# hydra_runner is a thin NeMo wrapper around Hydra
# It looks for a config named vits.yaml inside the conf folder
# Hydra parses the yaml and returns it as an OmegaConf DictConfig
@hydra_runner(config_path="conf", config_name="vits")
def main(cfg):
    # Define the Lightning trainer
    trainer = pl.Trainer(use_distributed_sampler=False, **cfg.trainer)
    # exp_manager is a NeMo construct that helps with logging and checkpointing
    exp_manager(trainer, cfg.get("exp_manager", None))
    # Define the VITS model, this will construct the model
    model = VitsModel(cfg=cfg.model, trainer=trainer)
    # Let's add a few more callbacks
    trainer.callbacks.extend([pl.callbacks.LearningRateMonitor()])
    # Call lightning trainer's fit() to train the model
    trainer.fit(model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
```

Let's take a look at the yaml config

```yaml
name: &name VITS

train_dataset: ???
validation_datasets: ???
sup_data_path: null
sup_data_types: null

phoneme_dict_path: "scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt"
heteronyms_path: "scripts/tts_dataset_files/heteronyms-052722"
```

The first part of the yaml defines the dataset parameters used by VITS. The `model` section then begins with the audio processing parameters; there the sample rate is set to 22050 Hz for LJSpeech.

The yaml contains `train_dataset: ???` and `validation_datasets: ???`. The `???` tells Hydra that these values are mandatory: they must be passed via the command line, or the script will fail.
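
As a rough illustration of what a mandatory `???` value means, the sketch below walks a nested config (a plain dict standing in for the parsed yaml) and lists the keys that are still set to `???`. The helper `find_missing` is hypothetical and is not part of Hydra or OmegaConf, which perform their own check when a missing value is accessed.

```python
# Hypothetical helper for illustration only; not a Hydra or OmegaConf API.
MISSING = "???"

def find_missing(cfg, prefix=""):
    """Return the dotted keys whose value is still the mandatory-value marker."""
    missing = []
    for key, value in cfg.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path
            missing.extend(find_missing(value, prefix=dotted + "."))
        elif value == MISSING:
            missing.append(dotted)
    return missing

cfg = {
    "name": "VITS",
    "train_dataset": "???",
    "validation_datasets": "???",
    "sup_data_path": None,
}
print(find_missing(cfg))  # ['train_dataset', 'validation_datasets']
```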

Further down the yaml, we get to the PyTorch Lightning trainer parameters.

```yaml
trainer:
  num_nodes: 1
  devices: 2
  accelerator: gpu
  strategy: ddp_find_unused_parameters_true
  precision: 32
  max_epochs: -1
  accumulate_grad_batches: 1
  enable_checkpointing: false # Provided by exp_manager
  logger: false # Provided by exp_manager
  log_every_n_steps: 50
  check_val_every_n_epoch: 1
```

These values can be changed either by editing the yaml or through the command line.
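
To give a feel for how dotted command-line overrides such as `trainer.devices=1` map onto the nested config, here is a minimal sketch on a plain dict. The function `apply_overrides` is a hypothetical stand-in, not Hydra's implementation, and its type handling is deliberately simplified.

```python
import copy

def apply_overrides(cfg, overrides):
    """Return a copy of cfg with 'section.key=value' style overrides applied."""
    result = copy.deepcopy(cfg)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node[part]  # KeyError if a section name is misspelled
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        # Command-line values are strings; reuse the type of the existing value
        # when it is an int (a simplification of real yaml-style parsing).
        old_value = node[leaf]
        node[leaf] = int(raw_value) if type(old_value) is int else raw_value
    return result

trainer_cfg = {"trainer": {"devices": 2, "max_epochs": -1, "accelerator": "gpu"}}
updated = apply_overrides(trainer_cfg, ["trainer.devices=1", "trainer.max_epochs=50"])
print(updated["trainer"])  # {'devices': 1, 'max_epochs': 50, 'accelerator': 'gpu'}
```

In this sketch a misspelled section or key raises a `KeyError` instead of silently creating a new entry, so a typo in an override is caught before training starts.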

Let's grab some simple audio data and test VITS.

In [None]:
!wget https://github.com/NVIDIA/NeMo/releases/download/v0.11.0/test_data.tar.gz \
&& mkdir -p tests/data \
&& tar xzf test_data.tar.gz -C tests/data

# Just like ASR, VITS requires .json files to define the training and validation data.
!cat tests/data/asr/an4_val.json

Now that we have some sample data, we can try training VITS!

Note that the sample data is not enough data to fully train a VITS model. The following code uses a toy dataset to illustrate how the pipeline for training would work.

# Text

In [6]:
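def extract_phoneme_from_file(file_path):
    # Each non-empty line has the form: <word> <phoneme> <phoneme> ...
    phoneme_dict = {}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:  # skip blank lines instead of raising IndexError
                phoneme_dict[parts[0]] = parts[1:]

    # Collect the set of all phoneme symbols used in the dictionary
    symbols = {symbol for phonemes in phoneme_dict.values() for symbol in phonemes}

    return phoneme_dict, symbols

def tokenize(phoneme_dict, text):
    # Replace each word by its phonemes; words missing from the dictionary are silently dropped
    return " ".join(
        symbol
        for word in text.split()
        if word in phoneme_dict
        for symbol in phoneme_dict[word]
    )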
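
To make the behaviour of `tokenize` concrete, here is a small self-contained sketch with a made-up two-word dictionary (the entries are illustrative and are not taken from `dic5k.formatted.txt`). It also includes a hypothetical helper, `oov_words`, for listing the words that would be dropped because they are missing from the dictionary.

```python
def tokenize(phoneme_dict, text):
    # Same logic as the cell above, written as a flat generator expression
    return " ".join(
        symbol
        for word in text.split()
        if word in phoneme_dict
        for symbol in phoneme_dict[word]
    )

def oov_words(phoneme_dict, text):
    # Hypothetical helper: words that tokenize() would skip
    return [word for word in text.split() if word not in phoneme_dict]

# Toy dictionary: word -> list of phoneme symbols (illustrative values only)
toy_dict = {
    "hello": ["h", "e", "l", "oo"],
    "world": ["w", "@@", "l", "d"],
}

print(tokenize(toy_dict, "hello world"))      # h e l oo w @@ l d
print(tokenize(toy_dict, "hello big world"))  # h e l oo w @@ l d
print(oov_words(toy_dict, "hello big world")) # ['big']
```

Checking the out-of-vocabulary words before training is worthwhile, since a dropped word leaves the transcript shorter than the audio it is paired with.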


In [7]:
!curl --output dic5k.formatted.txt https://raw.githubusercontent.com/ekapolc/ASR_classproject/master/g2p/dic5k.formatted.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  158k  100  158k    0     0   532k      0 --:--:-- --:--:-- --:--:--  533k


In [8]:
phoneme_dict, symbols = extract_phoneme_from_file("dic5k.formatted.txt")

In [9]:
import os
import wave
from glob import glob
from pydub import AudioSegment

def process_dataset(wav_dir, wrd_ph_dir):
    # Audio and transcript files are paired by sorted order
    wav_ls = sorted(glob(f"{wav_dir}/*"))
    wrd_ph_ls = sorted(glob(f"{wrd_ph_dir}/*"))

    audio_paths = []
    texts = []
    durations = []
    error_file_count = 0
    sample_rate = None
    new_sample_rate = 22050
    out_dir = "TSync2/wav_22050Hz"
    os.makedirs(out_dir, exist_ok=True)

    for wav_fn, wrd_ph_fn in zip(wav_ls, wrd_ph_ls):
        # Skip anything that is not a (.wav, .txt) pair so the three lists stay aligned
        if not (wav_fn.endswith(".wav") and wrd_ph_fn.endswith(".txt")):
            continue
        try:
            # Read the original sample rate and duration from the WAV header
            with wave.open(wav_fn, "rb") as wav_file:
                number_of_frames = wav_file.getnframes()
                sample_rate = wav_file.getframerate()
                duration_seconds = number_of_frames / sample_rate

            # Resample to the rate expected by the model config
            new_wav_fn = os.path.join(out_dir, os.path.basename(wav_fn))
            sound = AudioSegment.from_wav(wav_fn)
            sound = sound.set_frame_rate(new_sample_rate)
            sound.export(new_wav_fn, format="wav")
        except Exception as e:
            print(wav_fn, e)
            error_file_count += 1
            continue

        with open(wrd_ph_fn, "r", encoding="utf-8") as file:
            text = file.readline().strip()

        # Append only after every step succeeded
        audio_paths.append(new_wav_fn)
        texts.append(text)
        durations.append(duration_seconds)

    print(f"Error files count: {error_file_count}")
    return audio_paths, texts, durations, sample_rate

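## Sample rate
The TSync2 audio is sampled at 44100 Hz, while the `tts_en_lj_vits` model works at 22050 Hz. This is why `process_dataset` resamples every file to 22050 Hz.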
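
The relationship between frame count, sample rate, and duration can be checked with the standard-library `wave` module alone. The sketch below builds a silent in-memory WAV as stand-in data (the helper names `make_silent_wav` and `wav_info` are illustrative) and shows that resampling from 44100 Hz to 22050 Hz keeps the duration while halving the number of frames.

```python
import io
import wave

def make_silent_wav(sample_rate, seconds):
    """Build an in-memory mono 16-bit PCM WAV containing silence (test data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()

def wav_info(wav_bytes):
    """Return (sample_rate, n_frames, duration_seconds) for PCM WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        n_frames = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
    return sample_rate, n_frames, n_frames / sample_rate

sample_rate, n_frames, duration = wav_info(make_silent_wav(44100, 2))
print(sample_rate, n_frames, duration)  # 44100 88200 2.0

# Resampling keeps the duration and scales the frame count by new_rate / old_rate
new_sample_rate = 22050
print(int(duration * new_sample_rate))  # 44100
```

Because the duration is unchanged by resampling, the durations computed from the original 44100 Hz files remain valid for the 22050 Hz copies.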

In [10]:
wav_dir, wrd_ph_dir = "/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav", "/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wrd_ph"
audio_paths, texts, durations, sample_rate = process_dataset(wav_dir, wrd_ph_dir)

# audio_paths[:5], texts[:5], durations[:5]
len(audio_paths), len(texts), len(durations), sample_rate

/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_0_1943.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_11_8747.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_12_2502.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_12_8492.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_17_5217.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_18_1457_2.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_18_4241.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_21_6363.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_25_5989.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHAI-TSync2Corpus/TSync2/wav/tsync2_noon_26_5214.wav unknown format: 2
/kaggle/input/tsync2/AIFORTHA

(2668, 2668, 2668, 44100)

In [11]:
audio_paths[:3]

['TSync2/wav_22050Hz/tsync2_noon_0_1228.wav',
 'TSync2/wav_22050Hz/tsync2_noon_0_1866.wav',
 'TSync2/wav_22050Hz/tsync2_noon_0_250.wav']

In [12]:
texts = [text.replace("|", " ") for text in texts]
texts[:3]

['บาง ศพ มี คราบ สี ขาว จับ เขรอะ ขอบ ปาก ',
 'ผล การ สอบ ปาก คำ พนักงาน บริษัท ทอร์เซน พับบลิค เอเยนซี่ ',
 'มอบ วัด พระ บาท น้ำพุ สร้าง เตา เผา เอดส์ ']

In [13]:
tokenized_texts = [tokenize(phoneme_dict, text) for text in texts]
tokenized_texts[:3]

['b aa ng^ s o p^ m ii s ii kh aa w^ c a p^ p aa k^',
 'ph o n^ k aa n^ s @@ p^ p aa k^ kh a m^ ph a n a k^ ng aa n^ b @@ r i s a t^',
 'm @@ p^ w a t^ phr a b aa t^ s aa ng^ t a w^ z ee t^']

In [None]:
import json

def write_manifest(path, audio_paths, texts, durations):
    # NeMo manifests contain one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for audio_path, text, duration in zip(audio_paths, texts, durations):
            data = json.dumps({"audio_filepath": audio_path, "text": text, "duration": duration})
            f.write(data + "\n")

def prepare_dataset(audio_paths, texts, durations, outdir="dataset/"):
    sample_size = 50
    train_size = int(len(audio_paths) * 0.8)
    os.makedirs(outdir, exist_ok=True)

    # File name -> slice of the dataset written to that file
    splits = {
        "dataset.json": slice(None),           # everything
        "3_sample.json": slice(0, 3),          # tiny file for quick inspection
        "sample.json": slice(0, sample_size),  # small subset for smoke tests
        "train.json": slice(0, train_size),    # first 80%
        "val.json": slice(train_size, None),   # remaining 20%
    }
    for name, idx in splits.items():
        write_manifest(os.path.join(outdir, name), audio_paths[idx], texts[idx], durations[idx])

In [None]:
!mkdir -p dataset

prepare_dataset(audio_paths, tokenized_texts, durations)

In [None]:
!cat /kaggle/working/dataset/3_sample.json

In [None]:
from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import IPATokenizer, EnglishPhonemesTokenizer

def config_nemo(symbols):
    # PUNCT_LIST is the module-level tuple defined in the next cell; run that cell before calling this
    EnglishPhonemesTokenizer.PUNCT_LIST = PUNCT_LIST[:-3]  # drop the last three entries
    # EnglishPhonemesTokenizer.VOWELS = ("^", "g")
    EnglishPhonemesTokenizer.VOWELS = ()
    symbols.update(["^", "g"])
    # Sort so that the symbol order does not change between sessions
    EnglishPhonemesTokenizer.CONSONANTS = tuple(sorted(symbols))
    EnglishPhonemesTokenizer.text_preprocessing_func = lambda x: x

def reset_nemo(punct_list, vowels, consonants):
    EnglishPhonemesTokenizer.PUNCT_LIST = punct_list
    EnglishPhonemesTokenizer.VOWELS = vowels  # no trailing comma, which would wrap the tuple in another tuple
    EnglishPhonemesTokenizer.CONSONANTS = consonants
    EnglishPhonemesTokenizer.text_preprocessing_func = None

In [None]:
# from `nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py#EnglishPhonemesTokenizer`
PUNCT_LIST = (  # Derived from LJSpeech and "/" additionally
    ',', '.', '!', '?', '-',
    ':', ';', '/', '"', '(',
    ')', '[', ']', '{', '}',
)
VOWELS = (
    'AA', 'AE', 'AH', 'AO', 'AW',
    'AY', 'EH', 'ER', 'EY', 'IH',
    'IY', 'OW', 'OY', 'UH', 'UW',
)
CONSONANTS = (
    'B', 'CH', 'D', 'DH', 'F', 'G',
    'HH', 'JH', 'K', 'L', 'M', 'N',
    'NG', 'P', 'R', 'S', 'SH', 'T',
    'TH', 'V', 'W', 'Y', 'Z', 'ZH',
)

In [None]:
config_nemo(symbols)
# reset_nemo(PUNCT_LIST, VOWELS, CONSONANTS)

In [None]:
len(EnglishPhonemesTokenizer.PUNCT_LIST) + \
len(EnglishPhonemesTokenizer.VOWELS) + \
len(EnglishPhonemesTokenizer.CONSONANTS)

In [None]:
EnglishPhonemesTokenizer.CONSONANTS

In [39]:
!rm -rf conf
!mkdir conf
# !git clone https://gist.github.com/77c7f024d1193da3bc6a5c330d5b6c3c.git conf/
# IPA
!wget -O conf/vits.yaml https://gist.githubusercontent.com/gri11/77c7f024d1193da3bc6a5c330d5b6c3c/raw/ab97b971ac8af8d76da58842fcae2088d7a30562/vits.yaml

--2024-04-23 12:20:56--  https://gist.githubusercontent.com/gri11/77c7f024d1193da3bc6a5c330d5b6c3c/raw/ab97b971ac8af8d76da58842fcae2088d7a30562/vits.yaml
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5872 (5.7K) [text/plain]
Saving to: 'conf/vits.yaml'


2024-04-23 12:20:57 (49.0 MB/s) - 'conf/vits.yaml' saved [5872/5872]



In [None]:
!cat conf/vits.yaml

In [None]:
!cat /kaggle/working/dataset/sample.json

In [None]:
!cat /kaggle/working/dic5k.formatted.txt | head -5

In [None]:
symbols

In [None]:
# Map every symbol to itself; mode "w" overwrites any file left over from a previous run
with open("phoneme_dict.txt", "w", encoding="utf-8") as f:
    for symbol in sorted(symbols):
        f.write(f"{symbol} {symbol}\n")

In [None]:
!cat phoneme_dict.txt | head -10

In [None]:
!cat tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt | tail -10

In [None]:
!cat tts_dataset_files/heteronyms-052722 | tail -10

In [47]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")
!wandb login $secret_value_0

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [67]:
checkpoint_path = 'vits.ckpt'

In [71]:
# clear past run
!rm -rf /kaggle/working/nemo_experiments/VITS

In [75]:
!(python vits.py \
  model.sample_rate=22050 \
  train_dataset=/kaggle/working/dataset/sample.json \
  validation_datasets=/kaggle/working/dataset/sample.json \
  phoneme_dict_path=/kaggle/working/phoneme_dict.txt \
  heteronyms_path=tts_dataset_files/heteronyms-052722 \
  model.text_normalizer_call_kwargs.punct_pre_process=false \
  model.text_normalizer_call_kwargs.punct_post_process=false \
  model.text_tokenizer.punct=false \
  model.text_tokenizer.apostrophe=false \
  trainer.max_epochs=50 \
  trainer.accelerator='gpu' \
  trainer.strategy='ddp_find_unused_parameters_true' \
  trainer.check_val_every_n_epoch=1 \
  trainer.devices=1 \
  trainer.log_every_n_steps=1 \
  wandb.group='dev' \
 )

#   ++exp_manager.resume_from_checkpoint=$checkpoint_path \

2024-04-23 12:50:15.628490: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-23 12:50:15.628574: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-23 12:50:15.630088: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2024-04-23 12:50:19 exp_manager:396] Experiments will be logged at /kaggle/

In [76]:
!ls /kaggle/working/nemo_experiments/VITS/

2024-04-23_12-47-34  2024-04-23_12-50-19


In [77]:
!ls /kaggle/working/nemo_experiments/VITS/ | tail -1

2024-04-23_12-50-19


In [78]:
!ls /kaggle/working/nemo_experiments/VITS/2024-04-23_12-23-35/checkpoints

ls: cannot access '/kaggle/working/nemo_experiments/VITS/2024-04-23_12-23-35/checkpoints': No such file or directory


In [79]:
!cp \
/kaggle/working/nemo_experiments/VITS/2024-04-23_12-23-35/checkpoints/VITS--loss_gen_all=52.9682-epoch=10-last.ckpt \
vits.ckpt

cp: cannot stat '/kaggle/working/nemo_experiments/VITS/2024-04-23_12-23-35/checkpoints/VITS--loss_gen_all=52.9682-epoch=10-last.ckpt': No such file or directory


# Training Data

In order to train VITS, it is highly recommended to obtain high quality speech data with the following properties:
  - Sampling rate of 22050Hz or higher
  - Speech should contain a variety of speech phonemes
  - Audio split into segments of 1-10 seconds
  - Audio segments should not have silence at the beginning and end
  - Audio segments should not contain long silences inside
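
One way to screen a dataset against the duration guideline above is to filter manifest-style entries before writing them out. The sketch below assumes each entry is a dict with a `duration` field in seconds; `filter_by_duration` and its default bounds are illustrative choices, not a NeMo API.

```python
def filter_by_duration(entries, min_seconds=1.0, max_seconds=10.0):
    """Split entries into (kept, rejected) according to their duration in seconds."""
    kept, rejected = [], []
    for entry in entries:
        if min_seconds <= entry["duration"] <= max_seconds:
            kept.append(entry)
        else:
            rejected.append(entry)
    return kept, rejected

# Made-up entries for illustration
entries = [
    {"audio_filepath": "a.wav", "duration": 0.4},
    {"audio_filepath": "b.wav", "duration": 3.2},
    {"audio_filepath": "c.wav", "duration": 12.5},
]
kept, rejected = filter_by_duration(entries)
print([e["audio_filepath"] for e in kept])      # ['b.wav']
print([e["audio_filepath"] for e in rejected])  # ['a.wav', 'c.wav']
```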

After obtaining the speech data and splitting it into training, validation, and test sets, you need to construct .json files that tell NeMo where to find the audio files.

The .json files should adhere to the format required by the `nemo.collections.tts.data.dataset.TTSDataset` class: one JSON object per line. For example, here is a sample .json file:

```json
{"audio_filepath": "/path/to/audio1.wav", "text": "the transcription", "duration": 0.82}
{"audio_filepath": "/path/to/audio2.wav", "text": "the other transcription", "duration": 2.1}
...
```
Please note that the duration is in seconds.
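
Since the manifest holds one JSON object per line and is not a single JSON array, it is written and read line by line. The following sketch shows a round trip using only the standard library; `dump_manifest` and `load_manifest` are hypothetical helper names, and an in-memory buffer stands in for a real file.

```python
import io
import json

def dump_manifest(stream, records):
    """Write one JSON object per line (no enclosing list, no trailing commas)."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_manifest(stream):
    """Parse a manifest back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]

records = [
    {"audio_filepath": "/path/to/audio1.wav", "text": "the transcription", "duration": 0.82},
    {"audio_filepath": "/path/to/audio2.wav", "text": "the other transcription", "duration": 2.1},
]

buffer = io.StringIO()  # stands in for open("train.json", "w", encoding="utf-8")
dump_manifest(buffer, records)
buffer.seek(0)
loaded = load_manifest(buffer)
print(loaded == records)  # True
```

Passing `ensure_ascii=False` keeps non-ASCII transcripts readable in the file; it is optional, because `json.loads` decodes escaped text to the same strings.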

## Evaluating VITS

Let's evaluate the quality of the VITS model.

VITS is an end-to-end model, so we don't need any additional models to generate audio.

In [16]:
tokenized_texts[:3]

['b aa ng^ s o p^ m ii s ii kh aa w^ c a p^ p aa k^',
 'ph o n^ k aa n^ s @@ p^ p aa k^ kh a m^ ph a n a k^ ng aa n^ b @@ r i s a t^',
 'm @@ p^ w a t^ phr a b aa t^ s aa ng^ t a w^ z ee t^']

In [53]:
from nemo.collections.tts.models import VitsModel

model_path = "/kaggle/working/nemo_experiments/VITS/2024-04-23_12-23-35/checkpoints/VITS.nemo"
model = VitsModel.restore_from(model_path).cuda().eval()

 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
[NeMo W 2024-04-23 12:32:59 experimental:26] `<class 'nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo W 2024-04-23 12:32:59 i18n_ipa:124] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2024-04-23 12:32:59 experimental:26] `<class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo W 2024-04-23 12:32:59 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    d

[NeMo I 2024-04-23 12:32:59 features:289] PADDING: 1
[NeMo I 2024-04-23 12:32:59 features:297] STFT using exact pad
[NeMo I 2024-04-23 12:33:01 save_restore_connector:249] Model VitsModel was successfully restored from /kaggle/working/nemo_experiments/VITS/2024-04-23_12-23-35/checkpoints/VITS.nemo.


In [54]:
import IPython.display as ipd
text_raw = tokenized_texts[0]

target_sr = 22050

tokens = model.parse(text_raw)
audio_pred = model.convert_text_to_waveform(tokens=tokens).cpu().detach().numpy()

print("predicted audio")
ipd.Audio(audio_pred, rate=target_sr)

predicted audio


In [55]:
def inference(text):
    tokens = model.parse(text)
    audio_pred = model.convert_text_to_waveform(tokens=tokens).cpu().detach().numpy()

    return ipd.Audio(audio_pred, rate=target_sr)

In [56]:
inference("hello world")

In [36]:
inference("one two three four five")

# Tutorial

In [None]:
!wget https://multilangaudiosamples.s3.us-east-2.amazonaws.com/LJ023-0089.wav

In [None]:
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt
import IPython.display as ipd
import numpy as np
import torch
import librosa
import soundfile as sf

target_sr = 22050

audio_path = "./LJ023-0089.wav"
text_raw = "That is not only my accusation."


audio_data, orig_sr = sf.read(audio_path)
if orig_sr != target_sr:
    audio_data = librosa.core.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)

# Let's double-check that everything matches up!
print(f"Duration (s): {len(audio_data)/target_sr}")
print("Transcript:", text_raw)
ipd.Audio(audio_data, rate=target_sr)

In [None]:
model = VitsModel.from_pretrained("tts_en_lj_vits").cuda().eval()

In [None]:
text_raw = "one, two, three, four, five, i love you"

tokens = model.parse(text_raw)
audio_pred = model.convert_text_to_waveform(tokens=tokens).cpu().detach().numpy()

print("predicted audio")
ipd.Audio(audio_pred, rate=target_sr)

In [None]:
audio_to_mel = model.audio_to_melspec_processor


len_audio = torch.tensor(len(audio_data)).view(1, -1)

spec_pred, _ = audio_to_mel(torch.tensor(audio_pred).view(1, -1), len_audio)
spec_orig, _ = audio_to_mel(torch.tensor(audio_data).view(1, -1), len_audio)

In [None]:
fig, ax = plt.subplots(1, 2)

ax[0].imshow(spec_orig[0][0].cpu().detach())
ax[1].imshow(spec_pred[0][0].cpu().detach())

ax[0].set_title('Original spectrogram')
ax[1].set_title('Predicted spectrogram')
fig.show()