<a href="https://colab.research.google.com/drive/1sAcXMC2JxdTGkCAWCqSkY8vIP1oYz87B?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install packages and download models

In [35]:
%%shell
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install SoundFile torchaudio munch torch pydub pyyaml librosa nltk matplotlib accelerate transformers phonemizer einops einops-exts tqdm typing-extensions git+https://github.com/resemble-ai/monotonic_align.git
sudo apt-get install espeak-ng
git-lfs clone https://huggingface.co/yl4579/StyleTTS2-LJSpeech
mv StyleTTS2-LJSpeech/Models .

Cloning into 'StyleTTS2'...
remote: Enumerating objects: 372, done.
remote: Total 372 (delta 0), reused 0 (delta 0), pack-reused 372 (from 1)
Receiving objects: 100% (372/372), 133.98 MiB | 25.58 MiB/s, done.
Resolving deltas: 100% (199/199), done.
Updating files: 100% (48/48), done.
Collecting git+https://github.com/resemble-ai/monotonic_align.git
  Cloning https://github.com/resemble-ai/monotonic_align.git to /tmp/pip-req-build-_pi2nr67
  Running command git clone --filter=blob:none --quiet https://github.com/resemble-ai/monotonic_align.git /tmp/pip-req-build-_pi2nr67
  Resolved https://github.com/resemble-ai/monotonic_align.git to commit 78b985be210a03d08bc3acc01c4df0442105366f
  Installing build dependencies ... canceled
ERROR: Operation cancelled by user
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
espeak-ng is already the newest version (1.50+dfsg-10ubuntu0.1).


CalledProcessError: Command 'git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install SoundFile torchaudio munch torch pydub pyyaml librosa nltk matplotlib accelerate transformers phonemizer einops einops-exts tqdm typing-extensions git+https://github.com/resemble-ai/monotonic_align.git
sudo apt-get install espeak-ng
git-lfs clone https://huggingface.co/yl4579/StyleTTS2-LJSpeech
mv StyleTTS2-LJSpeech/Models .
' died with <Signals.SIGTERM: 15>.

### Load models

In [2]:
%cd StyleTTS2

import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

import random
random.seed(0)

import numpy as np
np.random.seed(0)

import nltk
nltk.download('punkt')

# load packages (torch, random and numpy are already imported above)
import time
import yaml
from munch import Munch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize

from models import *
from utils import *
from text_utils import TextCleaner
textclenaer = TextCleaner()

%matplotlib inline

device = 'cuda' if torch.cuda.is_available() else 'cpu'

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask+1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(ref_dicts):
    reference_embeddings = {}
    for key, path in ref_dicts.items():
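        wave, sr = librosa.load(path, sr=24000)  # librosa resamples to 24 kHz on load
        audio, index = librosa.effects.trim(wave, top_db=30)
        if sr != 24000:
            # recent librosa versions require keyword arguments here
            audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)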
        mel_tensor = preprocess(audio).to(device)

        with torch.no_grad():
            ref = model.style_encoder(mel_tensor.unsqueeze(1))
        reference_embeddings[key] = (ref.squeeze(1), audio)

    return reference_embeddings

# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True, words_mismatch='ignore')

with open("Models/LJSpeech/config.yml") as f:
    config = yaml.safe_load(f)

# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)

# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)

model = build_model(recursive_munch(config['model_params']), text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

params_whole = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location='cpu')
params = params_whole['net']

for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except RuntimeError:
            # Key mismatch: checkpoints saved from a wrapped model prefix every
            # key with `module.`, so strip that prefix and retry.
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:] # remove `module.`
                new_state_dict[name] = v
            # load params
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]

from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0), # empirical parameters
    clamp=False
)

def inference(text, noise, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    text = text.replace('"', '')
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)

    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
        text_mask = length_to_mask(input_lengths).to(tokens.device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise,
              embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
              embedding_scale=embedding_scale).squeeze(0)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)
        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_dur[-1] += 5

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                                F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()

def LFinference(text, s_prev, noise, alpha=0.7, diffusion_steps=5, embedding_scale=1):
  text = text.strip()
  text = text.replace('"', '')
  ps = global_phonemizer.phonemize([text])
  ps = word_tokenize(ps[0])
  ps = ' '.join(ps)

  tokens = textclenaer(ps)
  tokens.insert(0, 0)
  tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

  with torch.no_grad():
      input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
      text_mask = length_to_mask(input_lengths).to(tokens.device)

      t_en = model.text_encoder(tokens, input_lengths, text_mask)
      bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
      d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

      s_pred = sampler(noise,
            embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
            embedding_scale=embedding_scale).squeeze(0)

      if s_prev is not None:
          # convex combination of previous and current style
          s_pred = alpha * s_prev + (1 - alpha) * s_pred

      s = s_pred[:, 128:]
      ref = s_pred[:, :128]

      d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

      x, _ = model.predictor.lstm(d)
      duration = model.predictor.duration_proj(x)
      duration = torch.sigmoid(duration).sum(axis=-1)
      pred_dur = torch.round(duration.squeeze()).clamp(min=1)

      pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
      c_frame = 0
      for i in range(pred_aln_trg.size(0)):
          pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
          c_frame += int(pred_dur[i].data)

      # encode prosody
      en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
      F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
      out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                              F0_pred, N_pred, ref.squeeze().unsqueeze(0))

  return out.squeeze().cpu().numpy(), s_pred

/content/StyleTTS2


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


177


  params = torch.load(model_path, map_location='cpu')['model']
  params = torch.load(path, map_location='cpu')['net']
  checkpoint = torch.load(log_dir + "/step_" + str(iters) + ".t7", map_location='cpu')
  WeightNorm.apply(module, name, dim)
  params_whole = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location='cpu')


bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
text_encoder loaded
predictor_encoder loaded
style_encoder loaded
diffusion loaded
text_aligner loaded
pitch_extractor loaded
mpd loaded
msd loaded
wd loaded
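
The least obvious step inside `inference` is the loop that fills `pred_aln_trg`: it turns per-token durations into a 0/1 alignment matrix, and multiplying encoder features by that matrix repeats each token's features for as many frames as its predicted duration. The sketch below reproduces that idea with plain Python lists. The helper names `durations_to_alignment` and `expand_features` are hypothetical and are not part of StyleTTS2.

```python
def durations_to_alignment(durations):
    """Return a (num_tokens x total_frames) 0/1 matrix.

    Row i is 1 exactly on the consecutive frames assigned to token i.
    """
    total_frames = sum(durations)
    alignment = [[0] * total_frames for _ in durations]
    frame = 0
    for i, dur in enumerate(durations):
        for t in range(frame, frame + dur):
            alignment[i][t] = 1
        frame += dur
    return alignment


def expand_features(features, alignment):
    """Compute features @ alignment for a (channels x num_tokens) matrix."""
    num_tokens = len(alignment)
    num_frames = len(alignment[0])
    return [
        [sum(row[i] * alignment[i][t] for i in range(num_tokens))
         for t in range(num_frames)]
        for row in features
    ]


# Three tokens lasting 2, 1 and 3 frames (illustrative values).
durations = [2, 1, 3]
alignment = durations_to_alignment(durations)

# Two feature channels, one value per token (illustrative values).
features = [[10, 20, 30],
            [1, 2, 3]]
expanded = expand_features(features, alignment)

for row in alignment:
    print(row)
print(expanded)
```

Each column of the alignment matrix contains a single 1, so the matrix product simply copies a token's feature vector into every frame that token covers. This corresponds to `t_en @ pred_aln_trg` and `d.transpose(-1, -2) @ pred_aln_trg` in the cell above.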


### Synthesize speech

In [3]:
# @title Input Text { display-mode: "form" }
# synthesize a text
text = "StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models to achieve human-level text-to-speech synthesis." # @param {type:"string"}


#### Basic synthesis (5 diffusion steps)

In [4]:
start = time.time()
noise = torch.randn(1,1,256).to(device)
wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")
import IPython.display as ipd
display(ipd.Audio(wav, rate=24000))



RTF = 0.580594


#### With higher diffusion steps (more diverse)
Because the sampler is ancestral, using more diffusion steps produces more diverse samples, at the cost of slower synthesis.

In [5]:
start = time.time()
noise = torch.randn(1,1,256).to(device)
wav = inference(text, noise, diffusion_steps=10, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")
import IPython.display as ipd
display(ipd.Audio(wav, rate=24000))



RTF = 0.061289
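
The two RTF values above are hard to compare directly: the 5-step cell ran first and reports a much higher RTF than the 10-step cell, which suggests the first timed call also paid one-time start-up costs. A minimal sketch of a fairer measurement is shown below. It runs a few untimed warm-up calls and then averages several timed runs. `real_time_factor` and `fake_synthesize` are hypothetical names introduced for illustration; `fake_synthesize` stands in for a call to `inference` so the sketch runs without a model.

```python
import time


def real_time_factor(synthesize, sample_rate, warmup=1, runs=3):
    """Average RTF of `synthesize` after `warmup` untimed calls.

    `synthesize` is any zero-argument callable returning a sequence of
    audio samples. RTF = processing time / duration of generated audio.
    """
    for _ in range(warmup):
        synthesize()  # untimed: absorbs one-time initialization costs

    rtfs = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize()
        elapsed = time.perf_counter() - start
        rtfs.append(elapsed / (len(samples) / sample_rate))
    return sum(rtfs) / len(rtfs), rtfs


def fake_synthesize():
    """Stand-in for a TTS call: 'generates' one second of silence."""
    time.sleep(0.01)
    return [0.0] * 24000


mean_rtf, per_run = real_time_factor(fake_synthesize, sample_rate=24000)
print(f"mean RTF = {mean_rtf:.4f} over {len(per_run)} runs")
```

In this notebook, the callable could be something like `lambda: inference(text, torch.randn(1, 1, 256).to(device), diffusion_steps=10)`. Passing the function to be timed as an argument also makes it explicit which model is being measured.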


In [6]:
# 1. Add Model Optimization Section
"""### Model Optimization"""
import torch.quantization
from torch.quantization import quantize_dynamic
from torch.nn.utils import prune
import time

class StyleTTS2Optimizer:
    def __init__(self, model):
        self.model = model

    def apply_dynamic_quantization(self):
        """Apply dynamic quantization to compatible components"""
        quantized_components = {}

        for key, component in self.model.items():
            if isinstance(component, torch.nn.Module):
                try:
                    # Skip BERT and text_aligner as they require special handling
                    if key in ['bert', 'text_aligner']:
                        continue

                    print(f"Quantizing {key}...")
                    quantized = torch.quantization.quantize_dynamic(
                        component,
                        {torch.nn.Linear, torch.nn.LSTM},
                        dtype=torch.qint8
                    )
                    quantized_components[key] = quantized

                except Exception as e:
                    print(f"Couldn't quantize {key}: {e}")
                    quantized_components[key] = component

        return quantized_components

    def apply_pruning(self, amount=0.3):
        """Apply weight pruning to reduce model size"""
        pruned_components = {}

        for key, component in self.model.items():
            if isinstance(component, torch.nn.Module):
                try:
                    print(f"Pruning {key}...")
                    # Deep-copy so the original component is left untouched;
                    # type(component)() fails when the constructor needs arguments.
                    import copy
                    module_copy = copy.deepcopy(component)

                    for name, module in module_copy.named_modules():
                        if isinstance(module, (torch.nn.Conv1d, torch.nn.Linear)):
                            prune.l1_unstructured(module, name='weight', amount=amount)

                    pruned_components[key] = module_copy

                except Exception as e:
                    print(f"Couldn't prune {key}: {e}")
                    pruned_components[key] = component

        return pruned_components

# 2. Add Benchmarking Section
"""### Model Benchmarking"""

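def benchmark_model(model_dict, text, steps=5, runs=3):
    """Benchmark synthesis speed.

    NOTE: `model_dict` is not used. `inference()` reads the global `model`,
    so every call measures the same (original) model.
    """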
    times = []

    for _ in range(runs):
        start = time.time()
        noise = torch.randn(1,1,256).to(device)
        wav = inference(text, noise, diffusion_steps=steps)
        end = time.time()
        times.append(end - start)

    avg_time = sum(times) / len(times)
    rtf = avg_time / (len(wav) / 24000)

    return {
        'average_time': avg_time,
        'rtf': rtf,
        'times': times
    }

# 3. Add Quality Analysis Section
"""### Quality Analysis"""

def analyze_audio_quality(wav):
    """Analyze generated audio quality metrics"""
    metrics = {}

    # Basic audio statistics
    metrics['duration'] = len(wav) / 24000
    metrics['mean_amplitude'] = np.mean(np.abs(wav))
    metrics['peak_amplitude'] = np.max(np.abs(wav))

    # Compute spectral features (librosa computes the spectrogram internally)
    metrics['spectral_centroid'] = np.mean(librosa.feature.spectral_centroid(y=wav, sr=24000))
    metrics['spectral_bandwidth'] = np.mean(librosa.feature.spectral_bandwidth(y=wav, sr=24000))

    return metrics

# 4. Add Comparative Analysis Section
"""### Comparative Analysis"""

def compare_synthesis_parameters(text, parameters_list):
    """Compare different synthesis parameters"""
    results = []

    for params in parameters_list:
        start = time.time()
        noise = torch.randn(1,1,256).to(device)
        wav = inference(text, noise, **params)
        generation_time = time.time() - start

        result = {
            'parameters': params,
            'duration': len(wav) / 24000,
            'generation_time': generation_time,
            'rtf': generation_time / (len(wav) / 24000),
            'audio': wav,
            'quality_metrics': analyze_audio_quality(wav)
        }
        results.append(result)

    return results

# 5. Example Usage Cells

"""### Run Optimization and Analysis"""
# Initialize optimizer
optimizer = StyleTTS2Optimizer(model)

# Apply optimizations
print("Applying quantization...")
quantized_model = optimizer.apply_dynamic_quantization()

print("\nApplying pruning...")
pruned_model = optimizer.apply_pruning(amount=0.3)

# Benchmark original vs optimized models
test_text = "This is a test of the optimized model."
print("\nBenchmarking original model...")
original_metrics = benchmark_model(model, test_text)
print(f"Original RTF: {original_metrics['rtf']:.4f}")

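# NOTE: benchmark_model() ignores its first argument, so this run measures the
# same model as above; any RTF difference is run-to-run variation.
print("\nBenchmarking optimized model...")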
optimized_metrics = benchmark_model(quantized_model, test_text)
print(f"Optimized RTF: {optimized_metrics['rtf']:.4f}")

# Compare different synthesis parameters
parameters_to_test = [
    {'diffusion_steps': 5, 'embedding_scale': 1.0},
    {'diffusion_steps': 10, 'embedding_scale': 1.0},
    {'diffusion_steps': 5, 'embedding_scale': 2.0},
    {'diffusion_steps': 10, 'embedding_scale': 2.0}
]

print("\nComparing different synthesis parameters...")
comparison_results = compare_synthesis_parameters(test_text, parameters_to_test)

# Display results
for i, result in enumerate(comparison_results):
    print(f"\nConfiguration {i+1}:")
    print(f"Parameters: {result['parameters']}")
    print(f"RTF: {result['rtf']:.4f}")
    print(f"Duration: {result['duration']:.2f}s")
    print("Quality Metrics:")
    for metric, value in result['quality_metrics'].items():
        print(f"  {metric}: {value:.4f}")
    display(ipd.Audio(result['audio'], rate=24000))

# 6. Add Voice Style Analysis
"""### Voice Style Analysis"""

def analyze_voice_styles(text, num_samples=3):
    """Generate multiple samples with different styles"""
    samples = []

    for i in range(num_samples):
        noise = torch.randn(1,1,256).to(device)
        # Fresh noise gives each sample a different style; embedding_scale stays fixed at 1.5
        wav = inference(text, noise, diffusion_steps=10, embedding_scale=1.5)
        samples.append(wav)

    return samples

# Generate different style samples
print("Generating different voice styles...")
style_samples = analyze_voice_styles(test_text)

for i, sample in enumerate(style_samples):
    print(f"\nStyle variation {i+1}:")
    display(ipd.Audio(sample, rate=24000))

Applying quantization...
Quantizing bert_encoder...
Quantizing predictor...




Couldn't quantize predictor: Could not run 'quantized::make_quantized_cell_params_dynamic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::make_quantized_cell_params_dynamic' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

CPU: reg



Pruning msd...
Pruning wd...
Couldn't prune wd: cannot assign 'torch.FloatTensor' object to parameter 'weight_orig' (torch.nn.Parameter or None required)

Benchmarking original model...




Original RTF: 0.2903

Benchmarking optimized model...




Optimized RTF: 0.2611

Comparing different synthesis parameters...





Configuration 1:
Parameters: {'diffusion_steps': 5, 'embedding_scale': 1.0}
RTF: 0.2045
Duration: 2.92s
Quality Metrics:
  duration: 2.9250
  mean_amplitude: 0.0323
  peak_amplitude: 0.5465
  spectral_centroid: 2707.8134
  spectral_bandwidth: 2017.6301



Configuration 2:
Parameters: {'diffusion_steps': 10, 'embedding_scale': 1.0}
RTF: 0.0782
Duration: 2.92s
Quality Metrics:
  duration: 2.9250
  mean_amplitude: 0.0358
  peak_amplitude: 0.7142
  spectral_centroid: 2731.7918
  spectral_bandwidth: 2089.7334



Configuration 3:
Parameters: {'diffusion_steps': 5, 'embedding_scale': 2.0}
RTF: 0.0592
Duration: 2.85s
Quality Metrics:
  duration: 2.8500
  mean_amplitude: 0.0360
  peak_amplitude: 0.6046
  spectral_centroid: 2766.4100
  spectral_bandwidth: 1960.7610



Configuration 4:
Parameters: {'diffusion_steps': 10, 'embedding_scale': 2.0}
RTF: 0.0891
Duration: 2.75s
Quality Metrics:
  duration: 2.7500
  mean_amplitude: 0.0375
  peak_amplitude: 0.6033
  spectral_centroid: 2823.8452
  spectral_bandwidth: 2089.7633




Generating different voice styles...





Style variation 1:



Style variation 2:



Style variation 3:
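
The variations above come from drawing fresh noise for every sample. `LFinference`, defined in the model-loading cell, goes the other way for long-form text: it blends each newly sampled style vector with the previous one so that consecutive sentences sound consistent. The sketch below isolates that blending rule, `alpha * s_prev + (1 - alpha) * s_pred`, using plain lists. `blend_styles` is a hypothetical helper and the two-dimensional "style vectors" are illustrative stand-ins for the 256-dimensional ones used by the model.

```python
def blend_styles(s_prev, s_new, alpha=0.7):
    """Convex combination of the previous and the newly sampled style.

    With no previous style (first sentence) the new style is used as is,
    which mirrors the `if s_prev is not None` check in LFinference.
    """
    if s_prev is None:
        return list(s_new)
    return [alpha * p + (1 - alpha) * n for p, n in zip(s_prev, s_new)]


# One sampled style per sentence (illustrative values).
sampled = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

s_prev = None
history = []
for s_new in sampled:
    s_prev = blend_styles(s_prev, s_new, alpha=0.7)
    history.append(s_prev)

for i, style in enumerate(history, start=1):
    print(f"sentence {i}: {[round(v, 2) for v in style]}")
```

A larger `alpha` keeps the style closer to the previous sentence, while `alpha=0` reduces to independent sampling for every sentence.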


In [7]:
"""### Model Optimization with CPU Fallback"""
import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic
import copy
import time
import numpy as np

class StyleTTS2Optimizer:
    def __init__(self, model):
        self.model = model
        self.original_device = next(model['text_encoder'].parameters()).device

    def _move_to_cpu(self, module):
        """Safely move a module to CPU"""
        return module.cpu() if isinstance(module, nn.Module) else module

    def _move_back_to_device(self, module):
        """Safely move module back to original device"""
        return module.to(self.original_device) if isinstance(module, nn.Module) else module

    def optimize_model(self, quantize=True, prune=True):
        """Combined optimization with safer implementation"""
        optimized_model = {}

        # Components safe for optimization
        safe_components = [
            'bert_encoder',
            'decoder',
            'style_encoder',
            'diffusion'
        ]

        for key, component in self.model.items():
            if not isinstance(component, nn.Module):
                optimized_model[key] = component
                continue

            if key not in safe_components:
                print(f"Skipping optimization for {key} (not in safe list)")
                optimized_model[key] = component
                continue

            try:
                # Work with a copy on CPU
                component_copy = copy.deepcopy(component)
                component_cpu = self._move_to_cpu(component_copy)

                if quantize:
                    print(f"Attempting quantization for {key}...")
                    try:
                        # Dynamic quantization only swaps layer types it supports
                        # (such as nn.Linear); conv layers are left unchanged.
                        component_cpu = quantize_dynamic(
                            component_cpu,
                            {nn.Linear},
                            dtype=torch.qint8
                        )
                        print(f"Successfully quantized {key}")
                    except Exception as e:
                        print(f"Quantization failed for {key}: {str(e)}")

                if prune:
                    print(f"Attempting pruning for {key}...")
                    try:
                        # Apply lightweight pruning only to specific layers
                        for name, module in component_cpu.named_modules():
                            if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d)):
                                # Use very conservative pruning
                                nn.utils.prune.l1_unstructured(
                                    module,
                                    name='weight',
                                    amount=0.1  # Only prune 10% of weights
                                )
                        print(f"Successfully pruned {key}")
                    except Exception as e:
                        print(f"Pruning failed for {key}: {str(e)}")

                # Move back to original device
                optimized_model[key] = self._move_back_to_device(component_cpu)

            except Exception as e:
                print(f"Optimization failed for {key}: {str(e)}")
                optimized_model[key] = component

        return optimized_model

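def benchmark_inference(model_dict, text, steps=5, runs=3):
    """Benchmark inference with basic error handling.

    NOTE: `model_dict` is not used. `inference()` reads the global `model`,
    so the "original" and "optimized" benchmarks measure the same model.
    """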
    times = []
    wavs = []

    print("\nRunning benchmark...")
    for i in range(runs):
        try:
            start = time.time()
            noise = torch.randn(1,1,256).to(device)
            wav = inference(text, noise, diffusion_steps=steps)
            end = time.time()

            times.append(end - start)
            wavs.append(wav)

            rtf = (end - start) / (len(wav) / 24000)
            print(f"Run {i+1}: RTF = {rtf:.4f}")

        except Exception as e:
            print(f"Error in run {i+1}: {str(e)}")
            continue

    if not times:
        return None

    return {
        'average_time': np.mean(times),
        'rtf': np.mean([(t / (len(w) / 24000)) for t, w in zip(times, wavs)]),
        'std_dev': np.std(times),
        'wavs': wavs
    }

"""### Run Optimized Model"""
# Initialize optimizer
optimizer = StyleTTS2Optimizer(model)

# Run optimization with conservative settings
print("Starting optimization process...")
optimized_model = optimizer.optimize_model(quantize=True, prune=True)

# Benchmark original model
print("\nBenchmarking original model...")
original_metrics = benchmark_inference(model,
    "This is a test of the speech synthesis system.",
    steps=5,
    runs=3
)

# Benchmark optimized model
print("\nBenchmarking optimized model...")
optimized_metrics = benchmark_inference(optimized_model,
    "This is a test of the speech synthesis system.",
    steps=5,
    runs=3
)

if original_metrics and optimized_metrics:
    improvement = ((original_metrics['rtf'] - optimized_metrics['rtf']) /
                  original_metrics['rtf'] * 100)
    print(f"\nOptimization Results:")
    print(f"Original RTF: {original_metrics['rtf']:.4f}")
    print(f"Optimized RTF: {optimized_metrics['rtf']:.4f}")
    print(f"Speed Improvement: {improvement:.1f}%")

    # Compare audio samples
    print("\nOriginal audio sample:")
    display(ipd.Audio(original_metrics['wavs'][0], rate=24000))
    print("\nOptimized audio sample:")
    display(ipd.Audio(optimized_metrics['wavs'][0], rate=24000))

Starting optimization process...
Skipping optimization for bert (not in safe list)
Attempting quantization for bert_encoder...
Successfully quantized bert_encoder
Attempting pruning for bert_encoder...
Successfully pruned bert_encoder
Skipping optimization for predictor (not in safe list)
Attempting quantization for decoder...
Successfully quantized decoder
Attempting pruning for decoder...
Pruning failed for decoder: cannot assign 'torch.cuda.FloatTensor' object to parameter 'weight_orig' (torch.nn.Parameter or None required)
Skipping optimization for text_encoder (not in safe list)
Skipping optimization for predictor_encoder (not in safe list)
Attempting quantization for style_encoder...
Successfully quantized style_encoder
Attempting pruning for style_encoder...
Pruning failed for style_encoder: cannot assign 'torch.FloatTensor' object to parameter 'weight_orig' (torch.nn.Parameter or None required)
Attempting quantization for diffusion...




Successfully quantized diffusion
Attempting pruning for diffusion...
Successfully pruned diffusion
Skipping optimization for text_aligner (not in safe list)
Skipping optimization for pitch_extractor (not in safe list)
Skipping optimization for mpd (not in safe list)
Skipping optimization for msd (not in safe list)
Skipping optimization for wd (not in safe list)

Benchmarking original model...

Running benchmark...




Run 1: RTF = 0.0641
Run 2: RTF = 0.0471




Run 3: RTF = 0.0420

Benchmarking optimized model...

Running benchmark...
Run 1: RTF = 0.0472




Run 2: RTF = 0.0455
Run 3: RTF = 0.0430

Optimization Results:
Original RTF: 0.0511
Optimized RTF: 0.0452
Speed Improvement: 11.4%

Original audio sample:



Optimized audio sample:
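
For readers unfamiliar with what `prune.l1_unstructured(module, name='weight', amount=0.1)` does, the following is a conceptual sketch of unstructured L1 (magnitude) pruning on a flat list of weights: the requested fraction of entries with the smallest absolute value is masked to zero. `l1_prune` is a hypothetical helper written for illustration, not the PyTorch implementation, and the weight values are made up.

```python
def l1_prune(weights, amount):
    """Zero out the `amount` fraction of weights with the smallest magnitude.

    Returns the pruned weights and the 0/1 mask that produced them.
    """
    n_prune = round(amount * len(weights))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.5, -0.05, 0.3, 0.01, -0.8, 0.2, -0.02, 0.9, 0.1, -0.4]
pruned, mask = l1_prune(weights, amount=0.3)

print("mask:  ", mask)
print("pruned:", pruned)
```

PyTorch keeps the original values in a parameter named `weight_orig` and the mask in a buffer named `weight_mask`, recomputing `weight` from the two. That is why the failures logged above mention `weight_orig`. A likely cause, though not verified here, is that the affected layers use weight normalization, where `weight` is a derived tensor rather than an `nn.Parameter`.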


In [18]:
import locale
locale.getdefaultlocale()


('en_US', 'UTF-8')

In [21]:
import locale
# Force locale.getpreferredencoding() to report UTF-8 for the rest of the session.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding