<a href="https://colab.research.google.com/drive/1iqnIq_-7zbwAnu77GuiUUk3YLoJPypTr?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Download a high-quality German TTS dataset (LibriTTS-style)

In [None]:
# HUI-Audio-Corpus-German: A high quality TTS dataset (https://arxiv.org/pdf/2106.06309.pdf)
!wget https://opendata.iisys.de/opendata/Datasets/HUI-Audio-Corpus-German/dataset_clean/others_Clean.zip

--2024-10-27 13:27:21--  https://opendata.iisys.de/opendata/Datasets/HUI-Audio-Corpus-German/dataset_clean/others_Clean.zip
Resolving opendata.iisys.de (opendata.iisys.de)... 194.95.60.121
Connecting to opendata.iisys.de (opendata.iisys.de)|194.95.60.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15937839738 (15G) [application/zip]
Saving to: ‘others_Clean.zip’

others_Clean.zip      0%[                    ] 139.79M  1.12MB/s    eta 3h 50m ^C


### Unzip dataset

In [None]:
!unzip ./others_Clean.zip -d ./others_dataset

Archive:  ./others_Clean.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of ./others_Clean.zip or
        ./others_Clean.zip.zip, and cannot find ./others_Clean.zip.ZIP, period.
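
The `unzip` error above is what an interrupted download typically produces: the `wget` log shows the transfer being cancelled (`^C`) after 139.79M of the 15937839738 bytes. Re-running `wget` with `-c` resumes a partial download. The sketch below checks the archive before unzipping; the helper name `is_complete_zip` is hypothetical and not part of the original notebook.

```python
import os
import zipfile

def is_complete_zip(path, expected_size=None):
    """Return True if `path` looks like a fully downloaded zip archive."""
    if not os.path.isfile(path):
        return False
    # Optional size check against the "Length:" value reported by the server
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    # is_zipfile looks for the end-of-central-directory record that unzip reported missing
    return zipfile.is_zipfile(path)
```

For this dataset the call could look like `is_complete_zip("others_Clean.zip", expected_size=15937839738)`, with the size taken from the `Length:` line of the log.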


### Adjust dataset structure so that all train and val wav files are located in a single folder

In [None]:
from glob import glob

transcribed_audio_samples = []

# Each subfolder holds a metadata.csv and a wavs/ directory
for folder in glob("others_dataset/*"):
    for subfolder in glob(f"{folder}/*"):
        with open(f"{subfolder}/metadata.csv", encoding="utf-8") as metadata_file:
            for line in metadata_file:
                line = line.strip()
                if not line:
                    continue
                # Split on the first '|' only, in case the transcription contains one
                filename, transcription = line.split("|", 1)
                new_transcribed_audio_sample = f"wavs/{filename}.wav|{transcription}"
                # Keep (source wav path, new metadata line) pairs
                transcribed_audio_samples.append((f"{subfolder}/wavs/{filename}.wav", new_transcribed_audio_sample))
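
The loop above relies on every metadata line having the form `filename|transcription`. As a minimal sketch, the parsing step can be isolated in a small helper that splits on the first `|` only and skips blank lines. The name `parse_metadata_line` and the sample line are made up for illustration.

```python
def parse_metadata_line(line):
    """Split 'filename|transcription' on the first '|' only; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    filename, transcription = line.split("|", 1)
    return filename, transcription.strip()

# Hypothetical sample line; the transcription itself contains a '|'
sample = "chapter_01_f000001|Guten Morgen | sagte er."
print(parse_metadata_line(sample))
```

A line without any `|` raises a `ValueError`, which is usually preferable to silently writing a malformed entry into `train.txt` or `val.txt`.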

### Divide dataset into train and val subsets

In [None]:
import random

# Shuffle in place, then use 85% of the samples for training and the rest for validation
random.shuffle(transcribed_audio_samples)
num_train_samples = int(len(transcribed_audio_samples) * 0.85)

train_dataset = transcribed_audio_samples[:num_train_samples]
val_dataset = transcribed_audio_samples[num_train_samples:]
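
The split above changes on every run because the shuffle is unseeded. If a reproducible split is wanted, one option is a seeded `random.Random` instance, sketched below. The function name `split_train_val` and the default seed are assumptions, not part of the original notebook.

```python
import random

def split_train_val(samples, train_fraction=0.85, seed=0):
    """Shuffle a copy of `samples` with a fixed seed and split it into (train, val)."""
    shuffled = list(samples)
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)
    num_train = int(len(shuffled) * train_fraction)
    return shuffled[:num_train], shuffled[num_train:]

train, val = split_train_val(range(100))
print(len(train), len(val))
```

Calling the function twice with the same seed returns the same two lists, so `train.txt` and `val.txt` can be regenerated later without leaking validation samples into training.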

### Copy files to train and val folders

In [None]:
!mkdir -p dataset/wavs

In [None]:
import shutil

train_val_list = [('train', train_dataset), ('val', val_dataset)]

for stage, dataset in train_val_list:
    sample_list = []
    for old_sample, new_sample in dataset:
        # copy audio sample to new location
        shutil.copyfile(old_sample, "./dataset/" + new_sample.split("|")[0])
        sample_list.append(new_sample)

    with open(f"./dataset/{stage}.txt", "w", encoding="utf-8") as sample_file:
        sample_file.write("\n".join(sample_list))
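
Because every clip is copied into the single `dataset/wavs` folder, two source files that share a name in different subfolders would overwrite each other. Whether that happens in this corpus is not established here, so the following is only a sketch of a check that could be run before copying. The helper name `find_duplicate_basenames` and the example paths are hypothetical.

```python
import os
from collections import Counter

def find_duplicate_basenames(paths):
    """Return the file names that occur more than once across `paths`."""
    counts = Counter(os.path.basename(p) for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical source paths for illustration
paths = [
    "spk_a/book/wavs/clip_1.wav",
    "spk_b/book/wavs/clip_1.wav",
    "spk_b/book/wavs/clip_2.wav",
]
print(find_duplicate_basenames(paths))
```

With the notebook's variables, the check could be called as `find_duplicate_basenames(old for old, _ in transcribed_audio_samples)`.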

### Zip altered dataset

In [None]:
!zip -r hui_audio_corpus.zip ./dataset

### Alter fine-tuning code to allow special letters (e.g. ä,ö,ü,ß for German language)
---
##### Source: https://github.com/152334H/DL-Art-School
##### File: codes/models/audio/tts/tacotron2/text/cleaners.py
- Alter method english_cleaners to clean input text for your language (line 83)

##### File: codes/models/audio/tts/tacotron2/text/symbols.py
- Add special characters to \_letters (line 12)

##### Create File: experiments/custom_language_gpt.yml
- Copy file experiments/EXAMPLE_gpt.yml and rename it to custom_language_gpt.yml
- Change experiment name to custom_language_gpt (line 1)
- Change name to train_dataset (line 14)
- Change path to ../../dataset/train.txt (line 18)
- Add attribute *tokenizer_vocab: ../custom_language_tokenizer.json* (line 29)
- Change name to val_dataset (line 31)
- Change path to ../../dataset/val.txt (line 35)
- Add attribute *tokenizer_vocab: ../custom_language_tokenizer.json* (line 46)
- Change value to 5000 (line 128)

##### File: .gitignore
- Add statement *!experiments/custom_language_gpt.yml* (line 169)

##### File: codes/requirements.laxed.txt
- Change transformers to transformers==4.29.2 (line 39)

##### File: codes/utils/util.py
- Change the import of `inf` to *from torch import inf* (line 25)

### Install all required modules and clone repo with altered fine-tuning code

In [None]:
!git clone https://github.com/thisserand/DL-Art-School
%cd DL-Art-School/codes/
!pip install -r requirements.laxed.txt

### Download model weights for the VQ-VAE and Autoregressive Model (GPT-2)

In [None]:
!wget https://huggingface.co/Gatozu35/tortoise-tts/resolve/main/dvae.pth -O ../experiments/dvae.pth
!wget https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth -O ../experiments/autoregressive.pth

### Create text file containing all transcriptions as source to train a tokenizer

In [None]:
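transcriptions = []
dataset_path = "../../dataset"

for stage in ["train", "val"]:
    with open(f"{dataset_path}/{stage}.txt", encoding="utf-8") as f:
        for line in f:
            # Each line is "wavs/<file>.wav|<transcription>"
            transcriptions.append(line.split("|", 1)[1].strip())

# One transcription per line, so the tokenizer cell can filter them individually
with open("transcriptions.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(transcriptions))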
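
Before training the tokenizer, it can help to see which characters in the transcriptions fall outside the set the tokenizer cell accepts, since that cell discards any text containing other characters. The sketch below counts such characters; `ALLOWED` mirrors the character class of `allowed_characters_re` in the tokenizer cell, and the helper name `unexpected_characters` is hypothetical.

```python
from collections import Counter

# Mirrors the character class used by allowed_characters_re in the tokenizer cell
ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöüß!:;\"/, -().'?ʼ")

def unexpected_characters(text, allowed=ALLOWED):
    """Count lowercased, non-whitespace characters of `text` that are not in `allowed`."""
    counts = Counter(ch for ch in text.lower() if not ch.isspace())
    return {ch: n for ch, n in counts.items() if ch not in allowed}

print(unexpected_characters("Das Café kostet 3 €."))
```

Run on the contents of `transcriptions.txt`, the result shows which characters the cleaner would have to handle (for example digits or accented letters) for a transcription to survive the filter.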

### Train Tokenizer

In [None]:
import re
import torch
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from german_transliterate.core import GermanTransliterate


def text_cleaners(text):
    ###########################################
    # ToDo Adjust this code for your language #
    ###########################################
    text = GermanTransliterate().transliterate(text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = text.replace('"', '')
    return text


def remove_extraneous_punctuation(word):
    replacement_punctuation = {
        '{': '(', '}': ')',
        '[': '(', ']': ')',
        '`': '\'', '—': '-',
        'ʼ': '\''
    }
    replace = re.compile("|".join([re.escape(k) for k in sorted(replacement_punctuation, key=len, reverse=True)]), flags=re.DOTALL)
    word = replace.sub(lambda x: replacement_punctuation[x.group(0)], word)
    extraneous = re.compile(r'^[@#%_=\$\^&\*\+\\]$')
    word = extraneous.sub('', word)
    return word

with open('transcriptions.txt', 'r', encoding='utf-8') as at:
    ttsd = at.readlines()
    allowed_characters_re = re.compile(r'^[a-zäöüß!:;"/, \-\(\)\.\'\?ʼ]+$')

    def preprocess_word(word, report=False):
        word = text_cleaners(word)
        word = remove_extraneous_punctuation(word)
        if not bool(allowed_characters_re.match(word)):
            if report and word:
                print(f"REPORTING: '{word}'")
            return ''
        return word

    def batch_iterator(batch_size=1000):
        print("Processing ASR texts.")
        for i in range(0, len(ttsd), batch_size):
            yield [preprocess_word(t, True) for t in ttsd[i:i+batch_size]]

    trainer = BpeTrainer(special_tokens=['[STOP]', '[UNK]', '[SPACE]'], vocab_size=255)
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.train_from_iterator(batch_iterator(), trainer, length=len(ttsd))
    tokenizer.save('../custom_language_tokenizer.json')

In [None]:
from glob import glob

import librosa
import soundfile as sf

def resample_wav_file(input_file, target_sampling_rate=22050):
    # Load audio file
    audio, sampling_rate = librosa.load(input_file, sr=None)

    # Check if the sampling rate matches the target
    if sampling_rate != target_sampling_rate:

        # Resample audio to the target sampling rate
        audio_resampled = librosa.resample(audio, orig_sr=sampling_rate, target_sr=target_sampling_rate)

        # Overwrite the input file with the resampled audio
        sf.write(input_file, audio_resampled, target_sampling_rate)

# Resample all audio samples to 22.05 kHz
dataset_path = "../../dataset/wavs/"
for wav_file in glob(dataset_path + "*.wav"):
    resample_wav_file(wav_file)
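
Loading every file with librosa just to compare sampling rates can be slow on a large corpus. As a sketch, and assuming the clips are plain PCM WAV files that the standard-library `wave` module can read, the rate can be taken from the header alone. The helper name `needs_resampling` is hypothetical.

```python
import wave

def needs_resampling(wav_path, target_sampling_rate=22050):
    """Read only the WAV header and report whether the rate differs from the target."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_sampling_rate
```

Files for which this returns `False` can be skipped before `resample_wav_file` is called.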

### Fine-Tune the Autoregressive Model

In [None]:
!python3 train.py -opt ../experiments/custom_language_gpt.yml

In [None]:
from huggingface_hub import HfApi
api = HfApi()

hf_user_name = ""
repository_name = ""
repo_id = f"{hf_user_name}/{repository_name}"
fine_tuned_model_path = "../experiments/custom_language_gpt/models/2500_gpt.pth"
hf_auth_token = ""


api.create_repo(repo_id=repo_id, token=hf_auth_token, repo_type='model')

model_url = api.upload_file(
    path_or_fileobj=fine_tuned_model_path,
    path_in_repo="custom_language_gpt.pth",
    repo_id=repo_id,
    repo_type="model",
    token=hf_auth_token,
)

print(f"The fine-tuned model was uploaded to: {model_url}")
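
The upload cell hard-codes the checkpoint `2500_gpt.pth`. If training ran for a different number of steps, the file name will differ. The sketch below picks the checkpoint with the highest step count, assuming the `<step>_gpt.pth` naming seen in the path above; the helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path

def latest_checkpoint(models_dir, suffix="_gpt.pth"):
    """Return the '<step>_gpt.pth' file with the highest step, or None if there is none."""
    pattern = re.compile(r"^(\d+)" + re.escape(suffix) + "$")
    best_step, best_path = -1, None
    for path in Path(models_dir).iterdir():
        match = pattern.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

For example, `latest_checkpoint("../experiments/custom_language_gpt/models")` could replace the fixed `fine_tuned_model_path`. The function raises `FileNotFoundError` if the folder does not exist.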

### Clone adjusted inference code for German language

In [None]:
!git clone https://gitlab.com/tisserand13/tortoise-tts
%cd tortoise-tts
!pip install -e .

### Define example text and voice samples to assess generation quality

In [None]:
example_texts = ["Das Trainieren von Sprachmodellen für neue Sprachen funktioniert äußerst gut!",
                 "Äpfel und Birnen sind gesund für den Körper.",
                 "Übermorgen fahren wir in die Berge.",
                 "Das Café um die Ecke serviert köstlichen Kuchen und Kaffee.",
                 "Ein leises Lachen verzaubert die Luft und lässt Herzen höher schlagen.",
                 "Inmitten des Blütenmeers wandern wir durch den zauberhaften Frühlingswald."
                ]

voices = ["tom", "train_atkins", "angie", "daniel", "deniro", "emma"]

### Generate speech

In [None]:
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice
import torchaudio
import IPython

tts = TextToSpeech(kv_cache=True)

for voice, text in zip(voices, example_texts):
    voice_samples, _ = load_voice(voice)
    audio = tts.tts_with_preset(text, voice_samples=voice_samples, preset='fast')
    torchaudio.save(f'{voice}.wav', audio.squeeze(0), 24000)
    IPython.display.display(IPython.display.Audio(f'{voice}.wav'))

In [None]:
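import librosa
import numpy as np

hop_length = 512
db_threshold = -45

def trim_audio(audio):
    """Keep the first contiguous stretch of frames that are louder than db_threshold."""
    generated_audio = audio.squeeze().cpu().numpy()
    audio_rms = librosa.feature.rms(y=generated_audio, frame_length=2048, hop_length=hop_length)[0]
    # Frame loudness in dB relative to the loudest frame
    audio_db = librosa.power_to_db(audio_rms**2, ref=np.max)

    start_index, end_index = None, None
    for index, frame_db in enumerate(audio_db):
        if frame_db >= db_threshold:
            if start_index is None:
                start_index = index
        elif start_index is not None:
            end_index = index - 1
            break

    if start_index is None:
        # No frame exceeds the threshold: return the audio untrimmed
        return audio.squeeze(0)
    # end_index stays None when the audio is loud until the last frame
    end_sample = None if end_index is None else end_index * hop_length
    return audio.squeeze(0)[:, start_index * hop_length:end_sample]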
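
The trimming logic above can be hard to follow with librosa's framing in the way. The following pure-Python sketch shows the same idea on a tiny list of samples: compute the RMS per frame, convert it to dB relative to the loudest frame, and keep the first run of frames above the threshold. It is a simplified illustration with its own framing (no padding or centering), not a reimplementation of `librosa.feature.rms`, and the name `first_loud_region` is hypothetical.

```python
import math

def first_loud_region(samples, frame_length=4, hop_length=2, db_threshold=-45.0):
    """Return (start, end) sample indices of the first run of frames above the threshold.

    Loudness is the frame RMS in dB relative to the loudest frame. Returns None
    if the signal is silent or no frame reaches the threshold.
    """
    starts = range(0, max(len(samples) - frame_length, 0) + 1, hop_length)
    rms = [math.sqrt(sum(x * x for x in samples[s:s + frame_length]) / frame_length)
           for s in starts]
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return None
    # Floor the ratio so silent frames map to a very low dB value instead of log10(0)
    db = [20 * math.log10(max(r / peak, 1e-10)) for r in rms]
    start_frame = next((i for i, d in enumerate(db) if d >= db_threshold), None)
    if start_frame is None:
        return None
    end_frame = start_frame
    while end_frame + 1 < len(db) and db[end_frame + 1] >= db_threshold:
        end_frame += 1
    return start_frame * hop_length, end_frame * hop_length + frame_length

signal = [0.0] * 6 + [0.5] * 6 + [0.0] * 6
print(first_loud_region(signal))
```

Unlike `trim_audio`, this version extends the end index to the end of the last loud frame, so the tail of the speech is not cut by up to one frame.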

### Optimization code

In [None]:
import copy
import os
import torch
import torch.nn as nn
import torch.quantization
import time
import numpy as np
from torch.utils.data import DataLoader
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice
import pandas as pd
from typing import List, Tuple, Dict
import librosa
import pesq
from torch.nn.utils import prune

class ModelOptimizer:
    def __init__(self, model_path: str = None):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.original_model = TextToSpeech(kv_cache=True)
        self.quantized_model = None
        self.pruned_model = None
        self.test_texts = [
            "Das Trainieren von Sprachmodellen für neue Sprachen funktioniert äußerst gut!",
            "Äpfel und Birnen sind gesund für den Körper.",
            "Übermorgen fahren wir in die Berge."
        ]
        self.test_voices = ["tom", "angie", "daniel"]

    def quantize_model(self) -> None:
        """
        Apply post-training dynamic quantization to the model.

        quantize_dynamic expects an nn.Module and returns a quantized copy.
        If TextToSpeech is a plain wrapper class (as in upstream tortoise-tts),
        one of its nn.Module submodules may have to be passed instead.
        """
        # Configure quantization
        self.quantized_model = torch.quantization.quantize_dynamic(
            self.original_model,
            {torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d},
            dtype=torch.qint8
        )

    def prune_model(self, amount: float = 0.3) -> None:
        """
        Apply weight pruning to reduce model size
        Args:
            amount: Amount of weights to prune (0.0 to 1.0)
        """
        # Prune a copy so the original model stays intact as the benchmark reference
        self.pruned_model = copy.deepcopy(self.original_model)
        for name, module in self.pruned_model.named_modules():
            if isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Linear)):
                prune.l1_unstructured(module, name='weight', amount=amount)
                # Make the pruning permanent by removing the mask reparametrization
                prune.remove(module, 'weight')

    def measure_inference_time(self, model, num_runs: int = 5) -> Dict[str, float]:
        """
        Measure inference time statistics
        Args:
            model: Model to evaluate
            num_runs: Number of inference runs to average over
        Returns:
            Dictionary containing timing statistics
        """
        timing_stats = []

        for voice in self.test_voices:
            voice_samples, _ = load_voice(voice)

            for text in self.test_texts:
                for _ in range(num_runs):
                    start_time = time.time()
                    with torch.no_grad():
                        _ = model.tts_with_preset(text, voice_samples=voice_samples, preset='fast')
                    end_time = time.time()
                    timing_stats.append(end_time - start_time)

        return {
            "mean": np.mean(timing_stats),
            "std": np.std(timing_stats),
            "min": np.min(timing_stats),
            "max": np.max(timing_stats)
        }

    def evaluate_audio_quality(self, model, reference_model) -> Dict[str, float]:
        """
        Evaluate audio quality using PESQ score
        Args:
            model: Model to evaluate
            reference_model: Reference model for comparison
        Returns:
            Dictionary containing quality metrics
        """
        pesq_scores = []

        for voice in self.test_voices:
            voice_samples, _ = load_voice(voice)

            for text in self.test_texts:
                # Generate audio with both models
                with torch.no_grad():
                    reference_audio = reference_model.tts_with_preset(
                        text, voice_samples=voice_samples, preset='fast'
                    ).squeeze().numpy()
                    test_audio = model.tts_with_preset(
                        text, voice_samples=voice_samples, preset='fast'
                    ).squeeze().numpy()

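                # Ensure same length for PESQ calculation
                min_len = min(len(reference_audio), len(test_audio))
                # PESQ only accepts 8 kHz or 16 kHz input, so resample the 24 kHz output
                reference_audio = librosa.resample(reference_audio[:min_len], orig_sr=24000, target_sr=16000)
                test_audio = librosa.resample(test_audio[:min_len], orig_sr=24000, target_sr=16000)

                # Calculate wideband PESQ score
                pesq_score = pesq.pesq(16000, reference_audio, test_audio, 'wb')
                pesq_scores.append(pesq_score)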

        return {
            "mean_pesq": np.mean(pesq_scores),
            "std_pesq": np.std(pesq_scores)
        }

    def get_model_size(self, model) -> float:
        """
        Calculate model size in MB
        Args:
            model: Model to measure
        Returns:
            Size in MB
        """
        torch.save(model.state_dict(), "temp_model.pth")
        size_mb = os.path.getsize("temp_model.pth") / (1024 * 1024)
        os.remove("temp_model.pth")
        return size_mb

    def run_benchmarks(self) -> pd.DataFrame:
        """
        Run comprehensive benchmarks on all model variants
        Returns:
            DataFrame containing benchmark results
        """
        results = []

        # Test original model
        orig_time = self.measure_inference_time(self.original_model)
        orig_size = self.get_model_size(self.original_model)
        results.append({
            "model": "original",
            "size_mb": orig_size,
            "inf_time_mean": orig_time["mean"],
            "inf_time_std": orig_time["std"],
            # The reference model has no PESQ score against itself
            "quality_score": float("nan")
        })

        # Test quantized model
        if self.quantized_model is not None:
            quant_time = self.measure_inference_time(self.quantized_model)
            quant_size = self.get_model_size(self.quantized_model)
            quant_quality = self.evaluate_audio_quality(self.quantized_model, self.original_model)
            results.append({
                "model": "quantized",
                "size_mb": quant_size,
                "inf_time_mean": quant_time["mean"],
                "inf_time_std": quant_time["std"],
                "quality_score": quant_quality["mean_pesq"]
            })

        # Test pruned model
        if self.pruned_model is not None:
            pruned_time = self.measure_inference_time(self.pruned_model)
            pruned_size = self.get_model_size(self.pruned_model)
            pruned_quality = self.evaluate_audio_quality(self.pruned_model, self.original_model)
            results.append({
                "model": "pruned",
                "size_mb": pruned_size,
                "inf_time_mean": pruned_time["mean"],
                "inf_time_std": pruned_time["std"],
                "quality_score": pruned_quality["mean_pesq"]
            })

        return pd.DataFrame(results)

    def generate_report(self) -> str:
        """
        Generate a detailed optimization report
        Returns:
            Formatted report string
        """
        results_df = self.run_benchmarks()

        report = "TTS Model Optimization Report\n"
        report += "==========================\n\n"

        # Model Size Comparison
        report += "Model Size Comparison:\n"
        report += f"Original Model: {results_df.loc[results_df['model'] == 'original', 'size_mb'].values[0]:.2f} MB\n"
        if self.quantized_model is not None:
            quant_size = results_df.loc[results_df['model'] == 'quantized', 'size_mb'].values[0]
            reduction = (1 - quant_size / results_df.loc[results_df['model'] == 'original', 'size_mb'].values[0]) * 100
            report += f"Quantized Model: {quant_size:.2f} MB ({reduction:.1f}% reduction)\n"
        if self.pruned_model is not None:
            pruned_size = results_df.loc[results_df['model'] == 'pruned', 'size_mb'].values[0]
            reduction = (1 - pruned_size / results_df.loc[results_df['model'] == 'original', 'size_mb'].values[0]) * 100
            report += f"Pruned Model: {pruned_size:.2f} MB ({reduction:.1f}% reduction)\n"

        # Inference Time Comparison
        report += "\nInference Time Comparison:\n"
        for _, row in results_df.iterrows():
            report += f"{row['model'].capitalize()} Model: {row['inf_time_mean']:.3f}s ± {row['inf_time_std']:.3f}s\n"

        # Quality Comparison
        report += "\nQuality Metrics (PESQ Score):\n"
        for _, row in results_df.iterrows():
            report += f"{row['model'].capitalize()} Model: {row['quality_score']:.3f}\n"

        return report

def main():
    # Initialize optimizer
    optimizer = ModelOptimizer()

    # Apply optimizations
    optimizer.quantize_model()
    optimizer.prune_model(amount=0.3)

    # Generate and print report
    report = optimizer.generate_report()
    print(report)

    # Save report to file
    with open("optimization_report.txt", "w") as f:
        f.write(report)

if __name__ == "__main__":
    main()
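
The timing loop in `measure_inference_time` uses `time.time()` and includes the first, often slower, call in its statistics. As a sketch of an alternative, the helper below uses the monotonic `time.perf_counter()` and discards warm-up runs. The name `time_callable` and its parameters are assumptions; `statistics.pstdev` is used because it matches the population standard deviation that `np.std` computes by default.

```python
import statistics
import time

def time_callable(fn, num_runs=5, warmup_runs=1):
    """Time `fn()` with a monotonic clock, discarding warm-up runs."""
    for _ in range(warmup_runs):
        fn()
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        "std": statistics.pstdev(durations),
        "min": min(durations),
        "max": max(durations),
    }

# Stand-in workload; with the benchmark above, fn would wrap model.tts_with_preset(...)
stats = time_callable(lambda: sum(range(100_000)))
print(stats)
```

When the model runs on a GPU, calling `torch.cuda.synchronize()` inside `fn` after the forward pass helps make sure the measured time covers the finished computation rather than only the kernel launches.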