# Piper Finetuning

This is a nb to finetune piper voices

# Sources
- https://www.youtube.com/watch?v=b_we_jma220
- https://github.com/rhasspy/piper/blob/master/TRAINING.md


# Training Guide

Check out a [video training guide by Thorsten Müller](https://www.youtube.com/watch?v=b_we_jma220)

For Windows, see [ssamjh's guide using WSL](https://ssamjh.nz/create-custom-piper-tts-voice/)

---

Training a voice for Piper involves 3 main steps:

1. Preparing the dataset
2. Training the voice model
3. Exporting the voice model

Choices must be made at each step, including:

* The model "quality"
    * low = 16,000 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
    * medium = 22,050 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
    * high = 22,050 Hz sample rate, [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45)
* Single or multiple speakers
* Fine-tuning an [existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main) or training from scratch
* Exporting to [onnx](https://github.com/microsoft/onnxruntime/) or PyTorch



## Getting Started

Start by installing system dependencies:

``` sh
sudo apt-get install python3-dev
```

Then create a Python virtual environment:

``` sh
cd piper/src/python
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -e .
```

Run the `build_monotonic_align.sh` script in the `src/python` directory to build the extension.

Ensure you have [espeak-ng](https://github.com/espeak-ng/espeak-ng/) installed (`sudo apt-get install espeak-ng`).


In [1]:
import os

!cd piper/src/python
!python3 -m venv .venv
!source .venv/bin/activate
# !pip3 install --upgrade pip
!pip3 install pip==21.0.1
!pip3 install --upgrade wheel setuptools
if (not os.path.exists('piper')):
    !git clone https://github.com/rhasspy/piper.git
!pip3 install -r piper/src/python/requirements.txt
!pip3 install -e piper/src/python
!pip3 install torchaudio==0.11.0 torchmetrics==0.11.4
# !pip3 install numpy==1.20
!pip3 install --force-reinstall numpy==1.26.4
# https://stackoverflow.com/a/75702229/6559381
!pip3 install --force-reinstall torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/
!pip3 list | grep numpy
!pip3 list | grep piper
!pip3 list | grep torchmetrics

.venv/bin/activate (line 41): Unsupported use of '='. In fish, please use 'set VIRTUAL_ENV "/mnt/projects/piper_training/.venv"'.
from sourcing file .venv/bin/activate
source: Error while reading file “.venv/bin/activate”
Collecting pip==21.0.1
  Using cached pip-21.0.1-py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2
Successfully installed pip-21.0.1
Obtaining file:///mnt/projects/piper_training/piper/src/python
Installing collected packages: piper-train
  Attempting uninstall: piper-train
    Found existing installation: piper-train 1.0.0
    Uninstalling piper-train-1.0.0:
      Successfully uninstalled piper-train-1.0.0
  Running setup.py develop for piper-train
Successfully installed piper-train
Collecting torch==1.11.0
  Using cached torch-1.11.0-cp310-cp310-manylinux1_x86_64.whl (750.6 MB)
Installing collected packages: torch


In [2]:
# # Run the build_monotonic_align.sh script in the src/python directory to build the extension.
# !git clone https://github.com/rhasspy/piper.git
!chmod +x piper/src/python/build_monotonic_align.sh
!./piper/src/python/build_monotonic_align.sh

In [3]:
import os

voice_name = 'jarvis'
training_path = os.path.join(os.getcwd(), 'content/dataset/' + voice_name)
sample_rate = 48000

# Preparing a dataset

The Piper training scripts expect two files that can be generated by `python3 -m piper_train.preprocess`:

* A `config.json` file with the voice settings
    * `audio` (required)
        * `sample_rate` - audio rate in hertz
    * `espeak` (required)
        * `language` - espeak-ng voice or [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
    * `num_symbols` (required)
        * Number of phonemes in the model (typically 256)
    * `num_speakers` (required)
        * Number of speakers in the dataset
    * `phoneme_id_map` (required)
        * Map from a phoneme (UTF-8 codepoint) to a list of ids
        * Id 0 ("_") is padding (pad)
        * Id 1 ("^") is the beginning of an utterance (bos)
        * Id 2 ("$") is the end of an utterance (eos)
        * Id 3 (" ") is a word separator (whitespace)
    * `phoneme_type`
        * "espeak" or "text"
        * "espeak" phonemes use [espeak-ng](https://github.com/rhasspy/espeak-ng)
        * "text" phonemes use a pre-defined [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
    * `speaker_id_map`
        * Map from a speaker name to id
    * `phoneme_map`
        * Map from a phoneme (UTF-8 codepoint) to a list of phonemes
    * `inference`
        * `noise_scale` - noise added to the generator (default: 0.667)
        * `length_scale` - speaking speed (default: 1.0)
        * `noise_w` - phoneme width variation (default: 0.8) 
* A `dataset.jsonl` file with one line per utterance (JSON objects)
    * `phoneme_ids` (required)
        * List of ids for each utterance phoneme (0 <= id < `num_symbols`)
    * `audio_norm_path` (required)
        * Absolute path to [normalized audio](https://github.com/rhasspy/piper/tree/master/src/python/piper_train/norm_audio) file (`.pt`)
    * `audio_spec_path` (required)
        * Absolute path to [audio spectrogram](https://github.com/rhasspy/piper/blob/fda64e7a5104810a24eb102b880fc5c2ac596a38/src/python/piper_train/vits/mel_processing.py#L40) file (`.pt`)
    * `speaker_id` (required for multi-speaker)
        * Id of the utterance's speaker (0 <= id < `num_speakers`)
    * `audio_path`
        * Absolute path to original audio file
    * `text`
        * Original text of utterance before phonemization
    * `phonemes`
        * Phonemes from utterance text before converting to ids
    * `speaker`
        * Name of utterance speaker (from `speaker_id_map`)


### Dataset Format

The pre-processing script expects data to be a directory with:

* `metadata.csv` - CSV file with text, audio filenames, and speaker names
* `wav/` - directory with audio files

The `metadata.csv` file uses `|` as a delimiter, and has 2 or 3 columns depending on if the dataset has a single or multiple speakers.
There is no header row.

For single speaker datasets:

```csv
id|text
```

where `id` is the name of the WAV file in the `wav` directory. For example, an `id` of `1234` means that `wav/1234.wav` should exist. 

For multi-speaker datasets:

```csv
id|speaker|text
```

where `speaker` is the name of the utterance's speaker. Speaker ids will automatically be assigned based on the number of utterances per speaker (speaker id 0 has the most utterances).


### Pre-processing

An example of pre-processing a single speaker dataset:

``` sh
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/dataset_dir/ \
  --output-dir /path/to/training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```

The `--language` argument refers to an [espeak-ng voice](https://github.com/espeak-ng/espeak-ng/) by default, such as `de` for German.

To pre-process a multi-speaker dataset, remove the `--single-speaker` flag and ensure that your dataset has the 3 columns: `id|speaker|text`
Verify the number of speakers in the generated `config.json` file before proceeding.


In [4]:
# Show metadata.csv
import os

metadata_file = os.path.join(training_path, 'metadata.csv')

with open(metadata_file, 'r') as file:
    metadata = file.readlines()

for line in metadata:
    print(line.strip())

1|Maybe you should have chosen a later time because clearly you don't want to get up.
2|Activated!
3|I've noticed we're not connected to a network. For me to operate at full capacity, please activate the Wi-Fi settings on your device.
4|Accessing alarm and interface settings. In this window, you can set up your customized greetings and alarm preferences.
5|Should you need anything, you need only ask.
6|Remember, I am a voice activated system.
7|Your device is not at full power!
8|Though it isn't quite an Arc reactor, this power source should suffice.
9|Calibrating settings, and playback.
10|Your device is running low on power.
11|Your device is now running at dangerously low power levels.
12|We are now running on emergency backup power.
13|Adjusting display.
14|Accessing bonus features.
15|Calibrating settings, and playback.
16|Accessing deleted scenes.
17|playback initiated.
18|Disabling voice commands during playback.
19|Languages.
20|Pausing playback.
21|Popus menu accessed.
22|The 

In [5]:
# Get the .ckpt file for the voice that'll be used to finetune upon.
# alan: https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_GB/alan/medium/epoch%3D6339-step%3D1647790.ckpt?download=true

if not os.path.exists(f'{training_path}/{voice_name}.ckpt'):
    !wget -O {training_path}/{voice_name}.ckpt https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_GB/alan/medium/epoch%3D6339-step%3D1647790.ckpt


In [6]:
!echo {training_path}
!python3 -m piper_train.preprocess \
  --language en-US \
  --input-dir {training_path} \
  --output-dir {training_path}/output \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate {sample_rate} \
  --max-workers 1

/mnt/projects/piper_training/content/dataset/jarvis
INFO:preprocess:Single speaker dataset
INFO:preprocess:Wrote dataset config
INFO:preprocess:Processing 120 utterance(s) with 1 worker(s)


## Training a Model

Once you have a `config.json`, `dataset.jsonl`, and audio files (`.pt`) from pre-processing, you can begin the training process with `python3 -m piper_train`

For most cases, you should fine-tune from [an existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main). The model must have the sample audio quality and sample rate, but does not necessarily need to be in the same language.

It is **highly recommended** to train with the following `Dockerfile`:

``` dockerfile
FROM nvcr.io/nvidia/pytorch:22.03-py3

RUN pip3 install \
    'pytorch-lightning'

ENV NUMBA_CACHE_DIR=.numba_cache
```

As an example, we will fine-tune the [medium quality lessac voice](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac/medium). Download the `.ckpt` file and run the following command in your training environment:

``` sh
python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_checkpoint /path/to/lessac/epoch=2164-step=1355540.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```

Use `--quality high` to train a [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45) (sounds better, but is much slower).

You can adjust the validation split (5% = 0.05) and number of test examples for your specific dataset. For fine-tuning, they are often set to 0 because the target dataset is very small.

Batch size can be tricky to get right. It depends on the size of your GPU's vRAM, the model's quality/size, and the length of the longest sentence in your dataset. The `--max-phoneme-ids <N>` argument to `piper_train` will drop sentences that have more than `N` phoneme ids. In practice, using `--batch-size 32` and `--max-phoneme-ids 400` will work for 24 GB of vRAM (RTX 3090/4090).


### Multi-Speaker Fine-Tuning

If you're training a multi-speaker model, use `--resume_from_single_speaker_checkpoint` instead of `--resume_from_checkpoint`. This will be *much* faster than training your multi-speaker model from scratch.


In [7]:
import torch
isAvailable = torch.cuda.is_available()

if isAvailable:
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    !python3 -m piper_train \
        --dataset-dir {training_path}/output \
        --accelerator 'gpu' \
        --devices 1 \
        --batch-size 32 \
        --validation-split 0.0 \
        --num-test-examples 0 \
        --max_epochs 4000 \
        --resume_from_checkpoint {training_path}/{voice_name}.ckpt \
        --checkpoint-epochs 1 \
        --precision 16
else:
    !python3 -m piper_train \
        --dataset-dir {training_path}/output \
        --batch-size 32 \
        --validation-split 0.0 \
        --num-test-examples 0 \
        --max_epochs 10000 \
        --resume_from_checkpoint {training_path}/{voice_name}.ckpt \
        --checkpoint-epochs 1 \
        --precision 32

  from .autonotebook import tqdm as notebook_tqdm


DEBUG:piper_train:Namespace(dataset_dir='/mnt/projects/piper_training/content/dataset/jarvis/output', checkpoint_epochs=1, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=4000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=50, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=16, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/mnt/projects/piper_training/content/dataset/jar

# Testing

In [9]:
current_training_version = 0
# Test Data
!cat piper/etc/test_sentences/test_en-us.jsonl | \
    python3 -m piper_train.infer \
        --sample-rate {sample_rate} \
        --checkpoint {training_path}/output/lightning_logs/version_{current_training_version}/checkpoints/*.ckpt \
        --output-dir {training_path}/output/test

# Training Data
!cat content/dataset/jarvis/output/dataset.jsonl | \
    python3 -m piper_train.infer \
        --sample-rate {sample_rate} \
        --checkpoint {training_path}/output/lightning_logs/version_{current_training_version}/checkpoints/*.ckpt \
        --output-dir {training_path}/output/test_infer

DEBUG:fsspec.local:open file: /mnt/projects/piper_training/content/dataset/jarvis/output/lightning_logs/version_0/checkpoints/epoch=3999-step=1370220.ckpt
DEBUG:vits.lightning:No dataset to load
Removing weight norm...
DEBUG:piper_train.infer:Real-time factor for 1: 0.10 (infer=0.83 sec, audio=8.01 sec)
DEBUG:piper_train.infer:Real-time factor for 2: 0.09 (infer=0.25 sec, audio=2.69 sec)
DEBUG:piper_train.infer:Real-time factor for 3: 0.12 (infer=0.50 sec, audio=4.23 sec)
DEBUG:piper_train.infer:Real-time factor for 4: 0.08 (infer=0.35 sec, audio=4.28 sec)
DEBUG:piper_train.infer:Real-time factor for 5: 0.08 (infer=0.32 sec, audio=3.85 sec)
DEBUG:piper_train.infer:Real-time factor for 6: 0.08 (infer=0.32 sec, audio=4.10 sec)
DEBUG:piper_train.infer:Real-time factor for 7: 0.09 (infer=0.59 sec, audio=6.89 sec)
DEBUG:fsspec.local:open file: /mnt/projects/piper_training/content/dataset/jarvis/output/lightning_logs/version_0/checkpoints/epoch=3999-step=1370220.ckpt
DEBUG:vits.lightning:No 

# Tensorboard

In [12]:
!tensorboard --logdir {training_path}/output/lightning_logs

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.18.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


## Exporting a Model

When your model is finished training, export it to onnx with:

```sh
python3 -m piper_train.export_onnx \
    /path/to/model.ckpt \
    /path/to/model.onnx
    
cp /path/to/training_dir/config.json \
   /path/to/model.onnx.json
```

The [export script](https://github.com/rhasspy/piper-samples/blob/master/_script/export.sh) does additional optimization of the model with [onnx-simplifier](https://github.com/daquexian/onnx-simplifier).

If the export is successful, you can now use your voice with Piper:

```sh
echo 'This is a test.' | \
  piper -m /path/to/model.onnx --output_file test.wav
```

In [None]:
# !pip3 install --force-reinstall torch==1.13.0 --extra-index-url https://download.pytorch.org/whl/

!python3 -m piper_train.export_onnx \
    {training_path}/output/lightning_logs/version_{current_training_version}/checkpoints/model.ckpt \
    {training_path}/model_exports/{voice_name}-medium.onnx

Removing weight norm...
  t_s == t_t
  pad_length = max(length - (self.window_size + 1), 0)
  slice_start_position = max((self.window_size + 1) - length, 0)
  if pad_length > 0:
  assert (discriminant >= 0).all(), discriminant
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
INFO:piper_train.export_onnx:Exported model to /mnt/projects/piper_training/content/dataset/jarvis/output/jarvis_final-medium.onnx
