# TTS (Text-to-Speech) with Voice Cloning

This notebook explores XTTS-v2 from Coqui for text-to-speech with voice cloning capabilities.

**Features:**
- Voice cloning from reference audio (no fine-tuning needed!)
- Multilingual support (Spanish focus)
- High-quality speech synthesis
- Performance testing

**Note:** XTTS-v2 performs zero-shot voice cloning from about 6 seconds or more of reference audio, so no fine-tuning is required.

In [None]:
# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("✓ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("✓ Running locally")

if IN_COLAB:
    print("\n" + "="*70)
    print("  GOOGLE COLAB SETUP")
    print("="*70)
    
    # Clone repository
    print("\n[1/3] Cloning repository...")
    !git clone https://github.com/ltruciosr-dev/utec-voice-assistant.git
    
    # Change to repo directory
    import os
    os.chdir('utec-voice-assistant')
    print("✓ Repository cloned")
    
    # Install dependencies
    print("\n[2/3] Installing dependencies...")
    !pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    !pip install -q -r requirements.txt
    print("✓ Dependencies installed")
    
    # Verify GPU
    print("\n[3/3] Verifying GPU access...")
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    
    print("\n" + "="*70)
    print("  SETUP COMPLETE!")
    print("="*70)
else:
    print("Skipping Colab setup (running locally)")

# 🚀 Google Colab Setup

**The cell above performs this setup only when it detects Google Colab. When run locally, it skips the setup automatically.**

## 1. Setup and Imports

In [None]:
import torch
from TTS.api import TTS
import sounddevice as sd
import soundfile as sf
import numpy as np
from pathlib import Path
import time
from IPython.display import Audio, display
import warnings
warnings.filterwarnings('ignore')

# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

## 2. Initialize XTTS-v2

In [None]:
# Initialize TTS
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

print("\nLoading XTTS-v2 model...")
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

print("✓ XTTS-v2 loaded successfully!")

if torch.cuda.is_available():
    print(f"\nGPU Memory Usage: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

## 3. Supported Languages

In [None]:
# List supported languages
supported_languages = [
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", 
    "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko"
]

language_names = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian", "ko": "Korean"
}

print("Supported Languages:")
print("="*50)
for code in supported_languages:
    print(f"  {code:6} - {language_names.get(code, 'Unknown')}")
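
The language codes above are passed to the model as plain strings, so a typo such as `"ES "` or `"zh_CN"` only surfaces at synthesis time. The following is a minimal sketch of a guard that normalizes and validates a code first. `normalize_language` is a hypothetical helper, not part of the TTS library, and the set is simply copied from the list in the cell above.

```python
# Language codes copied from the list in the cell above.
SUPPORTED_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko",
}


def normalize_language(code: str) -> str:
    """Return a lowercase, validated language code, or raise ValueError.

    Hypothetical helper for this notebook; it is not a TTS library function.
    """
    # Accept minor variations such as surrounding spaces, upper case, or "zh_CN".
    normalized = code.strip().lower().replace("_", "-")
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return normalized
```

Calling `normalize_language(lang)` before each synthesis call would turn a misspelled code into an immediate, readable error.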

## 4. Record Reference Audio for Voice Cloning

In [None]:
def record_speaker_reference(duration=10, sample_rate=22050, output_path="speaker_reference.wav"):
    """
    Record speaker reference audio for voice cloning.
    
    Args:
        duration: Recording duration (minimum 6 seconds recommended)
        sample_rate: Sample rate (22050 Hz is standard for XTTS)
        output_path: Where to save the reference audio
    
    Returns:
        Path to saved reference audio
    """
    print(f"\n{'='*70}")
    print("Recording Speaker Reference for Voice Cloning")
    print(f"{'='*70}")
    print(f"\n⏱️  Duration: {duration} seconds")
    print("\n📋 Tips for best results:")
    print("   - Speak clearly and naturally")
    print("   - Use complete sentences")
    print("   - Minimize background noise")
    print("   - Speak in the language you'll use for synthesis")
    print("   - Minimum 6 seconds, 10+ seconds recommended")
    
    input("\nPress Enter when ready to record...")
    
    print(f"\n🎤 Recording for {duration} seconds...")
    print("Speak now!\n")
    
    # Record
    recording = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype='float32'
    )
    sd.wait()
    
    # Save
    sf.write(output_path, recording, sample_rate)
    
    print(f"✓ Reference audio saved to: {output_path}")
    print(f"Duration: {duration}s, Sample rate: {sample_rate}Hz\n")
    
    return output_path

# Uncomment to record your own voice
# speaker_ref = record_speaker_reference(duration=10, output_path="my_voice_reference.wav")
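
Since the docstring recommends at least 6 seconds of reference audio, it can help to check a file before using it for cloning. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV file (other encodings, such as float WAV, make `wave.open` raise an error). `check_reference_wav` and its `min_seconds` parameter are hypothetical names chosen for illustration.

```python
import wave


def check_reference_wav(path, min_seconds=6.0):
    """Inspect a PCM WAV file and report whether it is long enough.

    Hypothetical helper: returns (ok, info), where info holds the measured
    duration, sample rate, channel count, and a list of problems found.
    """
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    problems = []
    if duration < min_seconds:
        problems.append(f"too short: {duration:.1f}s < {min_seconds:.1f}s")

    info = {
        "duration": duration,
        "sample_rate": sample_rate,
        "channels": channels,
        "problems": problems,
    }
    return (not problems), info
```

A typical use would be `ok, info = check_reference_wav("my_voice_reference.wav")`, followed by re-recording if `ok` is `False`.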

## 5. Basic TTS without Voice Cloning

In [None]:
# Prepare the test text (no audio is synthesized in this cell)
print("Preparing basic TTS test text...\n")

spanish_text = "Hola, soy un asistente de voz creado con inteligencia artificial. Puedo ayudarte con diversas tareas."

# Note: this notebook always calls XTTS-v2 with a speaker reference (speaker_wav).
# Calling it with no speaker information at all raises an error,
# so synthesis is deferred to the voice cloning examples below.

print(f"Text: {spanish_text}")
print("\nℹ️  XTTS-v2 needs a speaker reference for synthesis in this notebook.")
print("We'll use voice cloning examples below.")

## 6. Voice Cloning with Reference Audio

In [None]:
def synthesize_with_cloning(text, speaker_wav, output_path, language="es", speed=1.0):
    """
    Synthesize speech with voice cloning.
    
    Args:
        text: Text to synthesize
        speaker_wav: Path to speaker reference audio
        output_path: Where to save synthesized audio
        language: Language code
        speed: Speech speed (1.0 = normal)
    
    Returns:
        Dictionary with synthesis info
    """
    print(f"\n{'='*70}")
    print("Synthesizing with Voice Cloning")
    print(f"{'='*70}")
    print(f"\nText: {text}")
    print(f"Speaker Reference: {speaker_wav}")
    print(f"Language: {language}")
    print(f"Speed: {speed}x\n")
    
    # Clear GPU cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        initial_memory = torch.cuda.memory_allocated() / 1024**3
    
    # Synthesize
    start_time = time.time()
    
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=speaker_wav,
        language=language,
        speed=speed
    )
    
    end_time = time.time()
    synthesis_time = end_time - start_time
    
    # Get audio duration
    audio_data, sample_rate = sf.read(output_path)
    audio_duration = len(audio_data) / sample_rate
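    # RTF = synthesis time / audio duration; report inf (not 0) for empty audio
    # so a failed synthesis is never mistaken for "faster than real-time"
    real_time_factor = synthesis_time / audio_duration if audio_duration > 0 else float("inf")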
    
    # GPU stats
    if torch.cuda.is_available():
        peak_memory = torch.cuda.max_memory_allocated() / 1024**3
        memory_used = peak_memory - initial_memory
    else:
        memory_used = 0
        peak_memory = 0
    
    print(f"\n✓ Audio saved to: {output_path}")
    print(f"\n⏱️  Synthesis Time: {synthesis_time:.2f}s")
    print(f"🎵 Audio Duration: {audio_duration:.2f}s")
    print(f"📊 Real-time Factor: {real_time_factor:.2f}x")
    
    if torch.cuda.is_available():
        print(f"💾 GPU Memory Used: {memory_used:.3f} GB")
    
    return {
        "text": text,
        "output_path": output_path,
        "synthesis_time": synthesis_time,
        "audio_duration": audio_duration,
        "real_time_factor": real_time_factor,
        "memory_used": memory_used
    }

## 7. Example: Voice Cloning Tests

**Note:** You need to provide a speaker reference audio file. You can:
1. Record your own using the function in section 4
2. Use an existing audio file with clear speech (6+ seconds)
3. Download a sample from the internet

In [None]:
# Set your speaker reference path
SPEAKER_REFERENCE = "my_voice_reference.wav"  # Change this to your reference audio

# Verify file exists
if Path(SPEAKER_REFERENCE).exists():
    print(f"✓ Speaker reference found: {SPEAKER_REFERENCE}")
else:
    print(f"⚠️  Speaker reference not found: {SPEAKER_REFERENCE}")
    print("\nPlease either:")
    print("1. Record your voice using the function in section 4")
    print("2. Provide an existing audio file path")
    print("\nFor testing, you can record now:")
    # Uncomment to record
    # SPEAKER_REFERENCE = record_speaker_reference(duration=10)

In [None]:
# Spanish test sentences
spanish_tests = [
    "Hola, ¿cómo estás hoy?",
    "Soy un asistente de voz basado en inteligencia artificial.",
    "La tecnología de síntesis de voz ha avanzado significativamente en los últimos años.",
    "¿En qué puedo ayudarte hoy?",
    "La Universidad de Ingeniería y Tecnología está ubicada en Lima, Perú."
]

results = []

# Test each sentence (only if speaker reference exists)
if Path(SPEAKER_REFERENCE).exists():
    for i, text in enumerate(spanish_tests, 1):
        output_file = f"output_cloned_voice_{i}.wav"
        
        result = synthesize_with_cloning(
            text=text,
            speaker_wav=SPEAKER_REFERENCE,
            output_path=output_file,
            language="es",
            speed=1.0
        )
        
        results.append(result)
        
        # Play audio in notebook
        print("\n🔊 Playing audio...")
        display(Audio(output_file, autoplay=False))
        
        print("-" * 70)
    
    print("\n✓ All voice cloning tests complete!")
else:
    print("\n⚠️  Skipping voice cloning tests - no speaker reference available.")
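
The loop above passes the same reference file for every sentence, and the best-practices list in this notebook recommends caching speaker embeddings. The sketch below shows one generic way to do that: memoize an expensive per-reference computation, keyed by file path and modification time. `SpeakerCache` and `compute_fn` are hypothetical names. With XTTS-v2, `compute_fn` would presumably wrap the model's conditioning-latent extraction, but that wiring is an assumption and is not shown here.

```python
from pathlib import Path


class SpeakerCache:
    """Memoize a per-reference computation, keyed by file path and mtime.

    Hypothetical helper: compute_fn is any callable that takes a path string
    and returns the data to cache (for example, speaker embeddings).
    """

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cache = {}

    def get(self, reference_path):
        path = Path(reference_path).resolve()
        # Including the modification time invalidates the entry
        # when the reference file is re-recorded.
        key = (str(path), path.stat().st_mtime_ns)
        if key not in self._cache:
            self._cache[key] = self._compute_fn(str(path))
        return self._cache[key]
```

Stale entries for overwritten files are never evicted, which is acceptable for a short notebook session but would need attention in a long-running service.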

## 8. Test Different Speaking Speeds

In [None]:
if Path(SPEAKER_REFERENCE).exists():
    test_text = "Esta es una prueba de diferentes velocidades de habla."
    speeds = [0.8, 1.0, 1.2, 1.5]
    
    print("\n" + "="*70)
    print("Testing Different Speaking Speeds")
    print("="*70)
    
    for speed in speeds:
        output_file = f"output_speed_{speed}.wav"
        
        result = synthesize_with_cloning(
            text=test_text,
            speaker_wav=SPEAKER_REFERENCE,
            output_path=output_file,
            language="es",
            speed=speed
        )
        
        print(f"\n🔊 Speed {speed}x:")
        display(Audio(output_file, autoplay=False))
        print("-" * 70)
else:
    print("\n⚠️  Skipping speed tests - no speaker reference available.")

## 9. Multilingual Test (Optional)

In [None]:
# Test the same speaker reference with different languages
if Path(SPEAKER_REFERENCE).exists():
    multilingual_tests = [
        ("Hello, this is a test in English.", "en"),
        ("Hola, esta es una prueba en español.", "es"),
        ("Bonjour, ceci est un test en français.", "fr"),
        ("Ciao, questo è un test in italiano.", "it"),
    ]
    
    print("\n" + "="*70)
    print("Multilingual Voice Cloning Test")
    print("="*70)
    
    for text, lang in multilingual_tests:
        output_file = f"output_multilingual_{lang}.wav"
        
        result = synthesize_with_cloning(
            text=text,
            speaker_wav=SPEAKER_REFERENCE,
            output_path=output_file,
            language=lang,
            speed=1.0
        )
        
        print(f"\n🔊 Playing {language_names.get(lang, lang)}:")
        display(Audio(output_file, autoplay=False))
        print("-" * 70)
else:
    print("\n⚠️  Skipping multilingual tests - no speaker reference available.")

## 10. Performance Summary

In [None]:
if results:
    import pandas as pd
    
    df = pd.DataFrame(results)
    
    print("\n" + "="*70)
    print("TTS PERFORMANCE SUMMARY")
    print("="*70)
    
    print(f"\nAverage Synthesis Time: {df['synthesis_time'].mean():.2f}s")
    print(f"Average Audio Duration: {df['audio_duration'].mean():.2f}s")
    print(f"Average Real-time Factor: {df['real_time_factor'].mean():.2f}x")
    
    if torch.cuda.is_available():
        print(f"Average GPU Memory: {df['memory_used'].mean():.3f} GB")
    
    print("\n" + "="*70)
    print("Interpretation:")
    print("="*70)
    # RTF = synthesis time / audio duration; below 1.0 means faster than playback
    rtf = df['real_time_factor'].mean()
    if rtf < 1.0:
        print(f"✅ Faster than real-time (RTF={rtf:.2f}x)")
        print("   Suitable for real-time voice assistant")
    else:
        print(f"⚠️  Slower than real-time (RTF={rtf:.2f}x)")
        print("   May need optimization for real-time use")
else:
    print("\n⚠️  No results to summarize. Please run voice cloning tests first.")
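
The summary above averages the per-sentence real-time factors, which weights a one-second utterance the same as a ten-second one. An alternative is a pooled RTF: total synthesis time divided by total audio duration. The sketch below computes both from dictionaries shaped like the entries in `results`. `pooled_rtf` and `mean_rtf` are hypothetical helper names, and the sample numbers are made up purely for illustration.

```python
def mean_rtf(results):
    """Average of per-utterance real-time factors (what the summary prints)."""
    return sum(r["real_time_factor"] for r in results) / len(results)


def pooled_rtf(results):
    """Total synthesis time divided by total audio duration."""
    total_synthesis = sum(r["synthesis_time"] for r in results)
    total_audio = sum(r["audio_duration"] for r in results)
    if total_audio <= 0:
        raise ValueError("no audio duration recorded")
    return total_synthesis / total_audio


# Made-up numbers: one short, slow utterance and one long, fast utterance.
sample = [
    {"synthesis_time": 1.0, "audio_duration": 1.0, "real_time_factor": 1.0},
    {"synthesis_time": 2.0, "audio_duration": 8.0, "real_time_factor": 0.25},
]
print(f"Mean RTF:   {mean_rtf(sample):.3f}")    # 0.625
print(f"Pooled RTF: {pooled_rtf(sample):.3f}")  # 0.333
```

The two figures diverge when utterance lengths vary a lot, so reporting both gives a fuller picture than either alone.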

## 11. Voice Cloning Best Practices

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════════╗
║                VOICE CLONING BEST PRACTICES                          ║
╚══════════════════════════════════════════════════════════════════════╝

📋 Reference Audio Requirements:
   • Duration: Minimum 6 seconds, 10-30 seconds recommended
   • Quality: Clear speech, minimal background noise
   • Content: Complete sentences in target language
   • Format: WAV, 22050 Hz sample rate preferred
   • Speaker: Single speaker, consistent volume

🎯 For Best Results:
   • Use high-quality microphone
   • Record in quiet environment
   • Speak naturally and clearly
   • Include emotional variation
   • Match language of synthesis

⚡ Performance Tips:
   • Use GPU for faster synthesis
   • Keep text chunks reasonable (<500 chars)
   • Adjust speed parameter for natural flow
   • Cache speaker embeddings for multiple uses

🚫 Avoid:
   • Background music in reference
   • Multiple speakers in reference
   • Very short reference clips (<6s)
   • Low-quality recordings
   • Excessive silence or noise

💡 XTTS-v2 Advantages:
   • Zero-shot voice cloning (no fine-tuning!)
   • Multilingual support
   • Natural prosody and intonation
   • Cross-lingual voice transfer
   • Fast inference on GPU
""")

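The performance tips above suggest keeping text chunks under roughly 500 characters. The following is a minimal sketch of one way to do that, splitting at sentence-ending punctuation and falling back to whitespace for overlong sentences. `chunk_text` is a hypothetical helper, and the simple regular expression will also split after abbreviations such as "Dr.", so treat it as a starting point rather than a robust sentence splitter.

```python
import re


def chunk_text(text, max_chars=500):
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: breaks at sentence ends where possible and
    hard-splits any single sentence that exceeds the limit.
    """
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A sentence longer than the limit is cut at the last space that fits,
        # or at max_chars if it contains no usable space.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            piece = sentence[:cut].rstrip()
            sentence = sentence[cut:].lstrip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        # Greedily pack whole sentences into the current chunk.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting audio concatenated in order.
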
## 12. Cleanup

In [None]:
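# Free memory
import gc

if 'tts' in globals():
    del tts

# Collect unreachable objects so cached GPU blocks can actually be released
gc.collect()

if torch.cuda.is_available():
    torch.cuda.empty_cache()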
    print(f"\nGPU Memory after cleanup: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

print("\n✓ Cleanup complete")

## Summary

**XTTS-v2 for Voice Assistant:**

✅ **Advantages:**
- Zero-shot voice cloning (no fine-tuning needed!)
- Excellent Spanish support
- Natural-sounding speech
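- Reasonable memory footprint (roughly 2 GB of GPU memory; compare with the usage printed after model loading)
- Fast enough for real-time applications when the measured real-time factor stays below 1.0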

✅ **Recommended Configuration:**
- Use 10-15 second reference audio
- Normal speed (1.0x)
- GPU acceleration for best performance
- Spanish language code: "es"

✅ **Integration:**
- Works well within a 12 GB VRAM constraint
- Suitable for a voice assistant pipeline
- Can clone a voice from a short reference clip (6+ seconds)

**No fine-tuning required** - XTTS-v2 is ready to use out of the box!