## Dataset notes
Training will fail with samples that are too short (<1 second).
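
One way to catch such samples before training is to scan the dataset folder for short clips. The sketch below is only an illustration: `find_short_wavs` and `wav_duration_seconds` are hypothetical helper names, and it assumes the clips are plain PCM `.wav` files that Python's standard `wave` module can open.

```python
import wave
from pathlib import Path

MIN_SECONDS = 1.0  # threshold taken from the note above


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def find_short_wavs(wav_dir, min_seconds=MIN_SECONDS):
    """List .wav files under wav_dir that are shorter than min_seconds."""
    return sorted(
        p for p in Path(wav_dir).rglob("*.wav")
        if wav_duration_seconds(p) < min_seconds
    )
```

Calling `find_short_wavs("/root/StyleTTS2/Twilight0_data")` after extraction would return the paths to drop from the dataset and from the train/val lists.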

## Download necessary files and set base config
Specify the experiment name in `EXPERIMENT_NAME` and download the dataset and OOD text list. This will also configure some default settings in `Configs/config.yml`.

`DATASET_URL` should point to a .zip (hosted on Google Drive or at any direct-download URL) containing:

* train_list.txt
* val_list.txt
* *.wav
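
Before uploading, the archive layout can be checked locally. This is a minimal sketch with a hypothetical helper, `check_dataset_zip`. It assumes the two list files sit at the archive root, which is where the extraction cell in this notebook would need them in order to produce `Data/train_list.txt` and `Data/val_list.txt`.

```python
import zipfile

REQUIRED_LISTS = ("train_list.txt", "val_list.txt")


def check_dataset_zip(zip_path):
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # Both list files are expected at the top level of the archive.
    for required in REQUIRED_LISTS:
        if required not in names:
            problems.append(f"missing {required} at the archive root")
    # At least one audio file should be present.
    if not any(n.lower().endswith(".wav") for n in names):
        problems.append("no .wav files found")
    return problems
```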

`OOD_URL` should point to an OOD text list (the corresponding audio files are not needed; this is just used to test the TTS model on out-of-dataset text)

`PRETRAINED_URL`, if populated, should point to a direct download for a first-stage checkpoint or a pretrained second-stage model (depending on which stage of training you plan to do).

Set `FIRST_STAGE` to True if the checkpoint you are supplying is a first stage checkpoint.

If your home internet bandwidth is limited, hosting the files on Hugging Face is recommended.
If your bandwidth is good, you can instead use `vastai copy` to copy files directly to the desired locations. These are the files you need to provide:
* `/root/StyleTTS2/Data/train_list.txt`
* `/root/StyleTTS2/Data/val_list.txt`
* `/root/StyleTTS2/{EXPERIMENT_NAME}_data/*.wav`
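* `/root/StyleTTS2/Models/{EXPERIMENT_NAME}/epoch_1st_pretrained.pth` or `/root/StyleTTS2/Models/{EXPERIMENT_NAME}/epoch_2nd_pretrained.pth` (if using a pretrained model; these are the names the download cell looks for)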
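
If you copy files manually, a quick check that everything landed in the right place can save a failed run. The helper below, `missing_training_files`, is a hypothetical sketch. It only covers the dataset files, not the optional pretrained checkpoint.

```python
from pathlib import Path


def missing_training_files(basedir, experiment_name):
    """Return the expected dataset paths that are not present under basedir."""
    base = Path(basedir)
    missing = [
        str(p)
        for p in (base / "Data" / "train_list.txt", base / "Data" / "val_list.txt")
        if not p.is_file()
    ]
    # The audio folder must exist and hold at least one .wav file.
    wav_dir = base / f"{experiment_name}_data"
    if not (wav_dir.is_dir() and any(wav_dir.glob("*.wav"))):
        missing.append(str(wav_dir / "*.wav"))
    return missing
```

An empty list from `missing_training_files("/root/StyleTTS2", EXPERIMENT_NAME)` means the dataset files listed above are in place.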

In [4]:
EXPERIMENT_NAME = "Twilight0"
DATASET_URL = "https://huggingface.co/datasets/therealvul/StyleTTS2MLP/resolve/main/StyleTTS2Omnidata.zip?download=true"
OOD_URL = "https://gist.githubusercontent.com/effusiveperiscope/f3a6d48ad3463b63d1cb9a53bfab9dd9/raw/480f168c0f4cdc96c8863ac09c84978b364ee6a1/gistfile1.txt"
PRETRAINED_URL = "https://huggingface.co/therealvul/StyleTTS2/resolve/main/Unfinished/omni_epoch_1st_00012.pth?download=true"
FIRST_STAGE = True


In [18]:
import ffmpeg
import os
import gdown
import urllib.request  # the submodule must be imported explicitly for urlopen
from tqdm import tqdm
from pathlib import Path

# Download and extract dataset and data labels
styletts_basedir = "/root/StyleTTS2" 
dataset_file = os.path.join(styletts_basedir, f"dataset_{EXPERIMENT_NAME}.zip")
dataset_target_dir = os.path.join(styletts_basedir, f"{EXPERIMENT_NAME}_data")
model_dir = os.path.join(styletts_basedir, "Models", EXPERIMENT_NAME)

def download_file(url, dest):
    print(f"Downloading {url} to {dest}")
    with urllib.request.urlopen(url) as r, open(dest, 'wb') as out_file:
        total_size = int(r.info().get('Content-Length', 0))
        block_size = 1024
        with tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
            while True:
                data = r.read(block_size)
                if not data:
                    break
                out_file.write(data)
                pbar.update(len(data))

print("Downloading dataset")
if not os.path.exists(dataset_file):
    if "drive.google.com" in DATASET_URL:
        gdown.download(DATASET_URL, output=dataset_file, quiet=False)
    else:
        download_file(DATASET_URL, dataset_file)
print("Downloaded dataset")

assert(os.path.exists(dataset_file))

if not os.path.exists(dataset_target_dir):
    os.makedirs(dataset_target_dir, exist_ok=True)

import zipfile
with zipfile.ZipFile(dataset_file, 'r') as f:
    files_count = len(f.namelist())
    with tqdm(total=files_count, desc="Extracting data",
        unit="file") as pbar:
        for info in f.namelist():
            if info.endswith('.txt'):
                unzip_file_path = os.path.join(styletts_basedir, 'Data')
                f.extract(info, unzip_file_path) # Note: extract() overwrites an existing file of the same name
            elif info.endswith('.wav'):
                unzip_file_path = dataset_target_dir
                if os.path.exists(os.path.join(unzip_file_path, info)):
                    continue
                f.extract(info, unzip_file_path)
            pbar.update(1)

# Download OOD text
ood_out_path = os.path.join(styletts_basedir, 'Data', 'OOD_texts.txt')
if not os.path.exists(ood_out_path):
    print("Downloading OOD")
    download_file(OOD_URL, ood_out_path)
    print("Downloaded OOD")

# Download pretrained model
if not os.path.exists(model_dir):
    os.makedirs(model_dir, exist_ok=True)

if FIRST_STAGE:
    pretrain_name = 'epoch_1st_pretrained.pth'
else:
    pretrain_name = 'epoch_2nd_pretrained.pth'

if PRETRAINED_URL is not None and not os.path.exists(os.path.join(model_dir, pretrain_name)):
    print("Downloading pretrained model")
    download_file(PRETRAINED_URL, os.path.join(model_dir, pretrain_name))
    print("Downloaded pretrained model")

# Setup base config
import yaml
config_path = os.path.join(styletts_basedir, 'Configs', 'config.yml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)
    config['log_dir'] = f'Models/{EXPERIMENT_NAME}'
    if PRETRAINED_URL is not None:
        config['pretrained_model'] = os.path.join(model_dir, pretrain_name)
    config['data_params']['root_path'] = dataset_target_dir

with open(config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

Downloading dataset
Downloaded dataset


Extracting data:   0%|          | 2/6199 [00:00<05:22, 19.23file/s]


Downloading OOD
Downloading https://gist.githubusercontent.com/effusiveperiscope/f3a6d48ad3463b63d1cb9a53bfab9dd9/raw/480f168c0f4cdc96c8863ac09c84978b364ee6a1/gistfile1.txt to D:/styletts2test\Data\OOD_texts.txt


100%|██████████| 58.4k/58.4k [00:00<00:00, 8.07MB/s]

Downloaded OOD





## First stage (acoustic reconstruction training)
This is only for training a model from scratch; you do not need to run this step if you only plan to finetune the diffusion (second stage) model.

Configure the settings in the cell before running.

* `TOTAL_EPOCHS` - total epochs used for first stage training. The paper used `100` epochs for LJSpeech (single speaker, 24hrs), `50` epochs for VCTK (109 speakers, 44hrs), and `30` epochs for LibriTTS (1151 speakers, 245hrs).
* `TMA_START_POINT` - the epoch at which transferable monotonic alignment (TMA) training begins. Epochs in the TMA training regime are typically 20x longer than the epochs before it, so expect TMA to take up a large portion of training time. [The author used TMA_epoch = 5 for LibriTTS](https://github.com/yl4579/StyleTTS2/issues/18). I found `50` to be a reasonable value for a 5-hour dataset, so this can probably be lowered for larger datasets.
* `PRETRAINED_MODEL` - if specified, loads the model and treats the run as a fresh run on the existing parameters (epoch count and optimizer state are reset). Will be overridden by resume behavior.
* `RESUME` - Whether to treat this run as resuming from an existing checkpoint (meaning that epoch numbers, optimizer states, etc. will not be reset). The default resume implementation will automatically select the most recent checkpoint from the current directory.
* `EPOCHS_SAVE_FREQ` - How often, in epochs, checkpoints are saved. Be mindful of your allocated disk limits. Because TMA epochs take so long to complete, you may want to consider lowering this to 1 during TMA training.
* `STEPS_SAVE_FREQ` - How often, in training steps, checkpoints are saved. Be mindful of your allocated disk limits.
* `BATCH_SIZE` - Higher values will increase training speed but require more VRAM. A batch size of 2 will fit on a 16GB GPU during both training stages. 
* `MULTISPEAKER` - Whether this model is multispeaker.
* `SAVE_MODE` - `ITER` to save most recent models, `VAL_LOSS` to save based on validation loss.
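
To get a feel for how much of first-stage training the TMA regime takes up, the rough 20x figure above can be turned into a back-of-the-envelope estimate. This is only a sketch: `tma_time_share` is a hypothetical helper, and the slowdown factor is the approximate value quoted above rather than a measurement.

```python
TMA_SLOWDOWN = 20  # rough factor quoted above; an estimate, not a measurement


def tma_time_share(total_epochs, tma_start_point, slowdown=TMA_SLOWDOWN):
    """Estimate the fraction of first-stage time spent in TMA epochs."""
    pre_tma_epochs = min(tma_start_point, total_epochs)
    tma_epochs = total_epochs - pre_tma_epochs
    # Cost is measured in units of one pre-TMA epoch.
    total_cost = pre_tma_epochs + slowdown * tma_epochs
    return slowdown * tma_epochs / total_cost if total_cost else 0.0


print(f"{tma_time_share(100, 32):.1%}")  # prints 97.7%
```

With `TOTAL_EPOCHS = 100` and `TMA_START_POINT = 32`, the 68 TMA epochs cost about as much as 1360 pre-TMA epochs, so nearly all of the training time is spent in the TMA regime.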

In [None]:
TOTAL_EPOCHS = 100 
TMA_START_POINT = 32
PRETRAINED_MODEL = ''
RESUME = False
EPOCHS_SAVE_FREQ = 2
STEPS_SAVE_FREQ = 300
BATCH_SIZE = 4
MULTISPEAKER = True
SAVE_MODE = 'ITER'

# Imports come first so that this cell also works when run on its own
import os
import yaml

if 'styletts_basedir' not in locals():
    styletts_basedir = '/root/StyleTTS2'
if 'config_path' not in locals():
    config_path = os.path.join(styletts_basedir, 'Configs', 'config.yml')

os.chdir(styletts_basedir)

with open(config_path) as f:
    config = yaml.safe_load(f)
    config['epochs_1st'] = TOTAL_EPOCHS
    config['loss_params']['TMA_epoch'] = TMA_START_POINT
    config['model_params']['multispeaker'] = MULTISPEAKER
    if (PRETRAINED_MODEL != ''):
        config['pretrained_model'] = PRETRAINED_MODEL
    else:
        config['pretrained_model'] = ""
    config['resume'] = RESUME
    config['save_freq'] = EPOCHS_SAVE_FREQ
    config['saver_freq_steps'] = STEPS_SAVE_FREQ
    config['saver_mode'] = SAVE_MODE
    config['batch_size'] = BATCH_SIZE
    config['second_stage_load_pretrained'] = False

with open(config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)
    print(f"Wrote config options to {config_path}")

!chmod +x ./train_first.sh
!./train_first.sh

## Second stage (diffusion training/finetuning)
[This discussion page](https://github.com/yl4579/StyleTTS2/discussions/81)
contains some useful information on finetuning. Some gotchas:
* It is basically impossible to finetune on a 16GB GPU--you need 24GB minimum. The original models were trained on 4xA100 (i.e. 160 GB VRAM in total).
* Epochs are zero-indexed in the second stage. This includes `DIFF_BEGIN_EPOCH` and `JOINT_BEGIN_EPOCH`.

Configure the settings in the below cell before running.
* `IS_FIRST_STAGE` - **!!! Set this if the checkpoint provided is a first stage checkpoint !!!**.
* `MAX_LEN` - the maximum length of training samples, in frames (.0125 second/frame; 100 frames = 1.25 seconds, 800 frames = 10 seconds). This will affect VRAM usage and the ability of the model to coherently synthesize longer samples. A `MAX_LEN` below 175 is not recommended. Do not reduce `MAX_LEN` below 100.
* `TOTAL_EPOCHS` - total epochs used for second stage training. The paper used 60 epochs on LJSpeech, 40 epochs on VCTK, and 25 epochs on LibriTTS.
* `DIFF_BEGIN_EPOCH` - The epoch at which style diffusion training begins. This can be as low as `1` (i.e. second epoch) for large datasets. If you set this above `TOTAL_EPOCHS`, you can disable style vector diffusion, lowering VRAM usage at the cost of output quality. 
* `JOINT_BEGIN_EPOCH` - The epoch at which SLM adversarial training begins. This cannot be set lower than `DIFF_BEGIN_EPOCH` or an error will occur (SLM adversarial cannot be run without diffusion). SLM adversarial training will increase VRAM usage. A model can be produced without adversarial training, at the expense of output quality.
* `BATCH_SIZE` - Higher values will increase training speed but require more VRAM. **Will break if lowered below (NUM_GPUS) * 2.**
* `EPOCHS_SAVE_FREQ` - How often, in epochs, checkpoints are saved. Be mindful of your allocated disk limits.
* `STEPS_SAVE_FREQ` - How often, in training steps, checkpoints are saved.
* `PRETRAINED_MODEL` - if specified, loads the model and treats as a fresh run on the existing parameters (reset epoch/optimizer). **To make training treat this as a first stage model, the model name must start with `epoch_1st`.** Will be overridden by resume behavior.
* `RESUME` - Whether to treat this run as resuming from an existing checkpoint (meaning that epoch numbers, optimizer states, etc. will not be reset). The default resume implementation will automatically select the most recent checkpoint from the current directory.
* `MULTISPEAKER` - Whether this model is multispeaker.
* `SAVE_MODE` - `ITER` to save most recent models, `VAL_LOSS` to save based on validation loss.
* `SLMADV_MIN_LEN` - Minimum length of SLMadv training samples
* `SLMADV_MAX_LEN` - Maximum length of SLMadv training samples
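
The constraints described in this list can be checked before launching a run. The sketch below uses hypothetical helpers (`frames_to_seconds`, `check_second_stage`) and encodes only the rules stated above. `num_gpus` is an assumed parameter that you would set to match your instance.

```python
SECONDS_PER_FRAME = 0.0125  # from the MAX_LEN description above


def frames_to_seconds(frames):
    """Convert a length in frames to seconds."""
    return frames * SECONDS_PER_FRAME


def check_second_stage(max_len, diff_begin_epoch, joint_begin_epoch,
                       batch_size, num_gpus=1):
    """Return a list of warnings/errors for the given settings (empty if none)."""
    problems = []
    if max_len < 100:
        problems.append("MAX_LEN must not be below 100")
    elif max_len < 175:
        problems.append("MAX_LEN below 175 is not recommended")
    if joint_begin_epoch < diff_begin_epoch:
        problems.append("JOINT_BEGIN_EPOCH must not be lower than DIFF_BEGIN_EPOCH")
    if batch_size < num_gpus * 2:
        problems.append("BATCH_SIZE must be at least NUM_GPUS * 2")
    return problems
```

For example, `frames_to_seconds(400)` gives 5 seconds, and `check_second_stage(400, 30, 50, 32)` returns an empty list.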

In [None]:
IS_FIRST_STAGE = True
MAX_LEN = 400
TOTAL_EPOCHS = 60
DIFF_BEGIN_EPOCH = 30
JOINT_BEGIN_EPOCH = 50
BATCH_SIZE = 32 # 16, 8, 6
EPOCHS_SAVE_FREQ = 1
STEPS_SAVE_FREQ = 150
PRETRAINED_MODEL = 'epoch_1st_pretrained.pth'
RESUME = False
MULTISPEAKER = True
SAVE_MODE = 'ITER'
SLMADV_MIN_LEN = 160
SLMADV_MAX_LEN = MAX_LEN

# Imports come first so that this cell also works when run on its own
import os
import yaml

if 'styletts_basedir' not in locals():
    styletts_basedir = '/root/StyleTTS2'
if 'config_path' not in locals():
    config_path = os.path.join(styletts_basedir, 'Configs', 'config.yml')

os.chdir(styletts_basedir)

with open(config_path) as f:
    config = yaml.safe_load(f)
    config['max_len'] = MAX_LEN
    config['epochs_2nd'] = TOTAL_EPOCHS
    config['loss_params']['diff_epoch'] = DIFF_BEGIN_EPOCH
    config['loss_params']['joint_epoch'] = JOINT_BEGIN_EPOCH
    config['batch_size'] = BATCH_SIZE
    config['second_stage_load_pretrained'] = not IS_FIRST_STAGE
    config['model_params']['multispeaker'] = MULTISPEAKER
    if (PRETRAINED_MODEL != ''):
        config['pretrained_model'] = PRETRAINED_MODEL
    else:
        config['pretrained_model'] = ""
    config['resume'] = RESUME
    config['save_freq'] = EPOCHS_SAVE_FREQ
    config['saver_mode'] = SAVE_MODE
    config['saver_freq_steps'] = STEPS_SAVE_FREQ
    config['slmadv_params']['min_len'] = SLMADV_MIN_LEN
    config['slmadv_params']['max_len'] = SLMADV_MAX_LEN

with open(config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)
    print(f"Wrote config options to {config_path}")

In [None]:
!chmod +x ./train_second.sh
!./train_second.sh

## Install restart script (for vast.ai)
Useful for vast.ai interruptible instances. Takes the current config as configured above and writes it to `Configs/config_resume.yml` with `config['resume']=True`. Note that this means if you decide to update any config settings above mid-training you should re-run the below cell too.

The `onstart.sh` script restarts training as a forked process. The log from a continued training process will not be visible from Jupyter but can be read from train.log (`tail -f train.log`)

In [None]:
from pathlib import Path
import os
import yaml

if 'styletts_basedir' not in locals():
    styletts_basedir = '/root/StyleTTS2'
if 'config_path' not in locals():
    config_path = os.path.join(styletts_basedir, 'Configs', 'config.yml')
config_resume_path = Path(config_path).parent / "config_resume.yml"

with open(config_path) as f:
    config2 = yaml.safe_load(f)
    config2['resume'] = True

with open(config_resume_path, 'w') as f:
    yaml.dump(config2, f, default_flow_style=False)
    print(f"Wrote resume config to {config_resume_path}")

!cp onstart.sh /root/onstart.sh
!chmod a+rwx /root/onstart.sh /root/StyleTTS2/*.sh
print(f"Onstart script installed")