Add a TTS recipe VITS on LJSpeech dataset #1372

Merged
merged 17 commits into from
Nov 29, 2023
2 changes: 1 addition & 1 deletion .flake8
@@ -15,7 +15,7 @@ per-file-ignores =
egs/librispeech/ASR/zipformer_mmi/*.py: E501, E203
egs/librispeech/ASR/zipformer/*.py: E501, E203
egs/librispeech/ASR/RESULTS.md: E999,

egs/ljspeech/TTS/vits/*.py: E501, E203
# invalid escape sequence (caused by TeX formulas), W605
icefall/utils.py: E501, W605

7 changes: 7 additions & 0 deletions docs/source/recipes/TTS/index.rst
@@ -0,0 +1,7 @@
TTS
======

.. toctree::
   :maxdepth: 2

   ljspeech/vits
113 changes: 113 additions & 0 deletions docs/source/recipes/TTS/ljspeech/vits.rst
@@ -0,0 +1,113 @@
VITS
===============

This tutorial shows you how to train a VITS model
with the `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ dataset.

.. note::

   The VITS paper: `Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech <https://arxiv.org/pdf/2106.06103.pdf>`_


Data preparation
----------------

.. code-block:: bash

  $ cd egs/ljspeech/TTS
  $ ./prepare.sh

To run stage 1 to stage 5, use

.. code-block:: bash

  $ ./prepare.sh --stage 1 --stop_stage 5
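
The ``--stage`` and ``--stop_stage`` options select an inclusive range of stages.
The following minimal Python sketch illustrates that selection logic. It assumes
``prepare.sh`` follows the common stage-gating convention in which stage ``i`` runs
when ``stage <= i <= stop_stage``. The function name ``stages_to_run`` and the set of
available stages are hypothetical and are not taken from the script.

```python
from typing import Iterable, List


def stages_to_run(stage: int, stop_stage: int, available: Iterable[int]) -> List[int]:
    """Return the stages selected by the inclusive range [stage, stop_stage].

    `available` is a hypothetical list of stage numbers defined by the script.
    """
    return [i for i in available if stage <= i <= stop_stage]


if __name__ == "__main__":
    # Assumed for illustration only: the script defines stages 0 through 5.
    print(stages_to_run(1, 5, range(0, 6)))
```

With this convention, passing the same value to both options runs a single stage.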


Build Monotonic Alignment Search
--------------------------------

.. code-block:: bash

  $ cd vits/monotonic_align
  $ python setup.py build_ext --inplace
  $ cd ../../


Training
--------

.. code-block:: bash

  $ export CUDA_VISIBLE_DEVICES="0,1,2,3"
  $ ./vits/train.py \
    --world-size 4 \
    --num-epochs 1000 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir vits/exp \
    --tokens data/tokens.txt \
    --max-duration 500

.. note::

   You can adjust the hyper-parameters to control the size of the VITS model and
   the training configurations. For more details, please run ``./vits/train.py --help``.

.. note::

   The training can take a long time (usually a couple of days).

Training logs, checkpoints and tensorboard logs are saved in ``vits/exp``.


Inference
---------

Inference uses the checkpoints saved during training, so you must run the
training step first. It saves the ground-truth and generated wavs to the directory
``vits/exp/infer/epoch-*/wav``, e.g., ``vits/exp/infer/epoch-1000/wav``.

.. code-block:: bash

  $ export CUDA_VISIBLE_DEVICES="0"
  $ ./vits/infer.py \
    --epoch 1000 \
    --exp-dir vits/exp \
    --tokens data/tokens.txt \
    --max-duration 500

.. note::

   For more details, please run ``./vits/infer.py --help``.
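
To check what inference produced, you can list the files in the output directory.
The sketch below builds the path from the ``vits/exp/infer/epoch-*/wav`` pattern
described above. The helper names are hypothetical, and the ``.wav`` file extension
is an assumption.

```python
from pathlib import Path
from typing import List


def wav_dir_for(exp_dir: str, epoch: int) -> Path:
    """Build <exp_dir>/infer/epoch-<epoch>/wav, following the documented pattern."""
    return Path(exp_dir) / "infer" / f"epoch-{epoch}" / "wav"


def list_generated_wavs(exp_dir: str, epoch: int) -> List[Path]:
    """Return the sorted wav files for one epoch, or [] if the directory is missing."""
    wav_dir = wav_dir_for(exp_dir, epoch)
    if not wav_dir.is_dir():
        return []
    # Assumption: the saved files use the ".wav" extension.
    return sorted(wav_dir.glob("*.wav"))


if __name__ == "__main__":
    for path in list_generated_wavs("vits/exp", 1000):
        print(path)
```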


Export models
-------------

Currently, only ONNX export is supported. It generates two files in the given ``exp-dir``:
``vits-epoch-*.onnx`` and ``vits-epoch-*.int8.onnx``.

.. code-block:: bash

  $ ./vits/export-onnx.py \
    --epoch 1000 \
    --exp-dir vits/exp \
    --tokens data/tokens.txt

You can test the exported ONNX model with:

.. code-block:: bash

  $ ./vits/test_onnx.py \
    --model-filename vits/exp/vits-epoch-1000.onnx \
    --tokens data/tokens.txt
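
Before running the test script, it can help to confirm that both exported files
exist. The following sketch derives the expected paths from the naming pattern
``vits-epoch-*.onnx`` / ``vits-epoch-*.int8.onnx`` stated above. The function name
and the dictionary keys are hypothetical.

```python
from pathlib import Path
from typing import Dict


def expected_onnx_files(exp_dir: str, epoch: int) -> Dict[str, Path]:
    """Map a label to the path each exported model is expected to have."""
    base = Path(exp_dir)
    return {
        "model": base / f"vits-epoch-{epoch}.onnx",
        "int8": base / f"vits-epoch-{epoch}.int8.onnx",
    }


if __name__ == "__main__":
    for label, path in expected_onnx_files("vits/exp", 1000).items():
        status = "found" if path.is_file() else "missing"
        print(f"{label}: {path} ({status})")
```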


Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained models
by visiting the following link:

- `<https://huggingface.co/Zengwei/icefall-tts-ljspeech-vits-2023-11-29>`_
3 changes: 2 additions & 1 deletion docs/source/recipes/index.rst
@@ -2,7 +2,7 @@ Recipes
=======

This page contains various recipes in ``icefall``.
Currently, only speech recognition recipes are provided.
Currently, we provide recipes for speech recognition, language model, and speech synthesis.

We may add recipes for other tasks as well in the future.

@@ -16,3 +16,4 @@ We may add recipes for other tasks as well in the future.
Non-streaming-ASR/index
Streaming-ASR/index
RNN-LM/index
TTS/index
106 changes: 106 additions & 0 deletions egs/ljspeech/TTS/local/compute_spectrogram_ljspeech.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python3
# Copyright 2021-2023 Xiaomi Corp. (authors: Fangjun Kuang,
# Zengwei Yao)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This file computes spectrogram features of the LJSpeech dataset.
It looks for manifests in the directory data/manifests.

The generated spectrogram features are saved in data/spectrogram.
"""

import logging
import os
from pathlib import Path

import torch
from lhotse import (
    CutSet,
    LilcomChunkyWriter,
    Spectrogram,
    SpectrogramConfig,
    load_manifest,
)
from lhotse.audio import RecordingSet
from lhotse.supervision import SupervisionSet

from icefall.utils import get_executor

# Torch's multithreaded behavior needs to be disabled, or
# it wastes a lot of CPU and slows things down.
# Do this outside of main() in case it needs to take effect
# even when we are not invoking main() (e.g., when spawning subprocesses).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)


def compute_spectrogram_ljspeech():
    src_dir = Path("data/manifests")
    output_dir = Path("data/spectrogram")
    num_jobs = min(4, os.cpu_count())

    sampling_rate = 22050
    frame_length = 1024 / sampling_rate  # (in seconds)
    frame_shift = 256 / sampling_rate  # (in seconds)
    use_fft_mag = True

    prefix = "ljspeech"
    suffix = "jsonl.gz"
    partition = "all"

    recordings = load_manifest(
        src_dir / f"{prefix}_recordings_{partition}.{suffix}", RecordingSet
    )
    supervisions = load_manifest(
        src_dir / f"{prefix}_supervisions_{partition}.{suffix}", SupervisionSet
    )

    config = SpectrogramConfig(
        sampling_rate=sampling_rate,
        frame_length=frame_length,
        frame_shift=frame_shift,
        use_fft_mag=use_fft_mag,
    )
    extractor = Spectrogram(config)

    with get_executor() as ex:  # Initialize the executor only once.
        cuts_filename = f"{prefix}_cuts_{partition}.{suffix}"
        if (output_dir / cuts_filename).is_file():
            logging.info(f"{cuts_filename} already exists - skipping.")
            return
        logging.info(f"Processing {partition}")
        cut_set = CutSet.from_manifests(
            recordings=recordings, supervisions=supervisions
        )

        cut_set = cut_set.compute_and_store_features(
            extractor=extractor,
            storage_path=f"{output_dir}/{prefix}_feats_{partition}",
            # when an executor is specified, make more partitions
            num_jobs=num_jobs if ex is None else 80,
            executor=ex,
            storage_type=LilcomChunkyWriter,
        )
        cut_set.to_file(output_dir / cuts_filename)


if __name__ == "__main__":
    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"

    logging.basicConfig(format=formatter, level=logging.INFO)
    compute_spectrogram_ljspeech()
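
The script above expresses the frame length and frame shift in seconds by dividing a
sample count (1024 and 256) by the sampling rate. The sketch below works through that
arithmetic. The names `win_samples`, `hop_samples` and `approx_num_frames` are
hypothetical, and the frame count is only a rough estimate that ignores how the
feature extractor handles utterance edges and padding.

```python
sampling_rate = 22050  # Hz, as in the script above
win_samples = 1024     # frame length in samples
hop_samples = 256      # frame shift in samples

frame_length = win_samples / sampling_rate  # in seconds
frame_shift = hop_samples / sampling_rate   # in seconds


def approx_num_frames(duration_s: float) -> int:
    """Rough frame count: about one frame per hop, ignoring edge handling."""
    return round(duration_s * sampling_rate / hop_samples)


if __name__ == "__main__":
    print(f"frame_length = {frame_length * 1000:.2f} ms")
    print(f"frame_shift  = {frame_shift * 1000:.2f} ms")
    print(f"frames per second ~ {sampling_rate / hop_samples:.1f}")
    # 6.6 s is the mean cut duration reported by display_manifest_statistics.py.
    print(f"frames in a 6.6 s utterance ~ {approx_num_frames(6.6)}")
```

Assuming a 1024-point FFT, each frame would carry 1024 / 2 + 1 = 513 frequency bins.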
73 changes: 73 additions & 0 deletions egs/ljspeech/TTS/local/display_manifest_statistics.py
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# Copyright 2023 Xiaomi Corp. (authors: Zengwei Yao)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file displays duration statistics of utterances in a manifest.
Use the displayed values to choose the minimum and maximum durations
for filtering out short and long utterances during training.

See the function `remove_short_and_long_utt()` in vits/train.py
for usage.
"""


from lhotse import load_manifest_lazy


def main():
    path = "./data/spectrogram/ljspeech_cuts_all.jsonl.gz"
    cuts = load_manifest_lazy(path)
    cuts.describe()


if __name__ == "__main__":
    main()

"""
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count: │ 13100 │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 23:55:18 │
├───────────────────────────┼──────────┤
│ mean │ 6.6 │
├───────────────────────────┼──────────┤
│ std │ 2.2 │
├───────────────────────────┼──────────┤
│ min │ 1.1 │
├───────────────────────────┼──────────┤
│ 25% │ 5.0 │
├───────────────────────────┼──────────┤
│ 50% │ 6.8 │
├───────────────────────────┼──────────┤
│ 75% │ 8.4 │
├───────────────────────────┼──────────┤
│ 99% │ 10.0 │
├───────────────────────────┼──────────┤
│ 99.5% │ 10.1 │
├───────────────────────────┼──────────┤
│ 99.9% │ 10.1 │
├───────────────────────────┼──────────┤
│ max │ 10.1 │
├───────────────────────────┼──────────┤
│ Recordings available: │ 13100 │
├───────────────────────────┼──────────┤
│ Features available: │ 13100 │
├───────────────────────────┼──────────┤
│ Supervisions available: │ 13100 │
╘═══════════════════════════╧══════════╛
"""
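
The statistics above (minimum 1.1 s, maximum 10.1 s) can guide the choice of duration
bounds. The sketch below shows the kind of filter those bounds feed into. It is not
the implementation of `remove_short_and_long_utt()` in vits/train.py. The function
name `keep_by_duration` and the thresholds 1.0 and 20.0 are hypothetical values
chosen for illustration.

```python
from typing import Iterable, List


def keep_by_duration(
    durations: Iterable[float], min_s: float, max_s: float
) -> List[float]:
    """Keep only durations (in seconds) inside the inclusive range [min_s, max_s]."""
    return [d for d in durations if min_s <= d <= max_s]


if __name__ == "__main__":
    # Made-up durations; 0.5 and 25.0 fall outside the hypothetical bounds.
    sample = [0.5, 1.1, 6.6, 10.1, 25.0]
    print(keep_by_duration(sample, min_s=1.0, max_s=20.0))
```

With lhotse, the same predicate could be applied to a `CutSet` through its `filter`
method, using each cut's `duration` attribute.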