
p0p4k/vits3_pytorch


Homebrewed VITS-3 with an extra flow to improve the text encoder's projected normalizing flow distribution and the prior loss (WIP 🚧)

Inspired by VITS-2 and Grad-TTS.

Prerequisites

  1. Python >= 3.10
  2. Clone this repository
  3. Install the Python requirements; see requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz (a downsampling sketch is given after the build commands below). Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
  5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK are already provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
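
A quick way to confirm the Cython extension built correctly is to call it directly. A minimal smoke test, assuming this fork keeps upstream VITS's monotonic_align.maximum_path interface (run from the repo root):

import torch
import monotonic_align

# scores shaped (batch, mel_frames, text_tokens); mask marks valid positions
neg_cent = torch.randn(1, 10, 5)
mask = torch.ones(1, 10, 5)
path = monotonic_align.maximum_path(neg_cent, mask)  # hard 0/1 monotonic alignment
print(path.shape)  # torch.Size([1, 10, 5])

For the VCTK downsampling in step 4, any resampler that writes 22050 Hz wavs works. A minimal sketch with torchaudio (the paths and flat output layout are illustrative; adjust to your setup):

import os
import torchaudio

src, dst = "/path/to/VCTK-Corpus/wav48", "/path/to/VCTK-Corpus/downsampled_wavs"
os.makedirs(dst, exist_ok=True)
for root, _, files in os.walk(src):
    for name in files:
        if name.endswith(".wav"):
            wav, sr = torchaudio.load(os.path.join(root, name))
            wav = torchaudio.functional.resample(wav, sr, 22050)  # resample to 22050 Hz
            torchaudio.save(os.path.join(dst, name), wav, 22050)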

How to run (dry-run)

  • Model forward pass
import torch
from models import SynthesizerTrn

net_g = SynthesizerTrn(
    n_vocab=256,
    spec_channels=80, 
    segment_size=8192,
    inter_channels=192,
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    resblock="1", 
    resblock_kernel_sizes=[3, 7, 11],
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    upsample_rates=[8, 8, 2, 2],
    upsample_initial_channel=512,
    upsample_kernel_sizes=[16, 16, 4, 4],
    n_speakers=0,
    gin_channels=0,
    use_sdp=True, 
    use_transformer_flows=True, 
    # (choose from "pre_conv", "fft", "mono_layer_inter_residual", "mono_layer_post_residual")
    transformer_flow_type="fft", 
    use_spk_conditioned_encoder=True, 
    use_noise_scaled_mas=True, 
    use_duration_discriminator=True, 
)

x = torch.LongTensor([[1, 2, 3],[4, 5, 6]]) # token ids
x_lengths = torch.LongTensor([3, 2]) # token lengths
y = torch.randn(2, 80, 100) # mel spectrograms
y_lengths = torch.LongTensor([100, 80]) # mel spectrogram lengths

net_g(
    x=x,
    x_lengths=x_lengths,
    y=y,
    y_lengths=y_lengths,
)

# calculate loss and backpropagate
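
Once the model is trained, synthesis goes through the model's infer method rather than the forward pass. A minimal sketch, assuming this fork keeps the VITS-style infer signature (noise_scale, noise_scale_w, length_scale; check models.py to confirm):

# inference sketch (assumes the upstream VITS `infer` API)
net_g.eval()
with torch.no_grad():
    o, attn, y_mask, _ = net_g.infer(
        x, x_lengths, noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0
    )
audio = o[0, 0].cpu().numpy()  # raw waveform samples at 22050 Hz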

Training Example

# LJ Speech
python train.py -c configs/vits3_ljs_nosdp.json -m ljs_base # without SDP (recommended)
python train.py -c configs/vits3_ljs_base.json -m ljs_base # with SDP

# VCTK
python train_ms.py -c configs/vits3_vctk_base.json -m vctk_base
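
Upstream VITS writes checkpoints and TensorBoard event files to ./logs/<model_name>; assuming this fork keeps that layout, training can be monitored with:

# monitor training progress (log directory layout assumed from upstream VITS)
tensorboard --logdir ./logs/ljs_base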

TODOs, features and notes

  • [ ] Train on LJ Speech and get sample audio
