# 🎤 XTTS Voice Cloner - Gradio Web Interface

Welcome to the XTTS Voice Cloner! This notebook creates a web interface that allows you to:

- **Upload reference audio** (your voice sample)
- **Enter a text script** (what you want the AI to say)
- **Generate cloned speech** using the XTTS v2 model
- **Play and download** the generated audio

## ⚡ Features
- High-quality voice cloning using XTTS v2
- Support for the 16 languages listed in `get_supported_languages()`
- GPU acceleration (when available)
- User-friendly web interface
- No coding required for end users

## 🚀 Perfect for:
- Content creators
- Voiceovers
- Audiobook narration
- Educational content
- Accessibility tools

---

**Important:** This runs completely in Google Colab - no local installation required!

## 📦 Step 1: Install Required Dependencies

First, we'll install all the necessary packages. This may take a few minutes.

In [1]:
# Install required packages
!pip install torch==2.6.0+cu124 torchaudio==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
!pip install coqui-tts
!pip install gradio numpy scipy librosa soundfile pydub matplotlib transformers

# Verify GPU is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
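
Once the install finishes, it can help to confirm which package versions actually ended up in the environment. The following is a minimal sketch using only the standard library; `report_versions` is a hypothetical helper name, not part of the notebook, and the package names are the ones installed above.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # Not installed in this environment
    return found


if __name__ == "__main__":
    wanted = ["torch", "torchaudio", "coqui-tts", "librosa"]
    for name, ver in report_versions(wanted).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `None` entry suggests the corresponding `pip install` line failed or the runtime was restarted before the check.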


## 🔧 Step 2: Import Libraries and Load the XTTS Model

Now we'll import all necessary libraries and load the XTTS model. The model download happens automatically on first run.

In [2]:
# Import required libraries
import os
import time
import torch
import torchaudio
import gradio as gr
import numpy as np
from pathlib import Path
import tempfile
import warnings
warnings.filterwarnings("ignore")

# print the torchaudio version
print(f"Torchaudio version: {torchaudio.__version__}")

# Note: TTS.api is imported inside XTTSVoiceCloner.__init__, so the heavy import only runs when the model is loaded

print("📚 Libraries imported successfully!")

# Initialize XTTS model with retry logic
class XTTSVoiceCloner:
    def __init__(self, max_retries=3):
        """Initialize XTTS model with robust error handling."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tts = None

        print(f"🔄 Loading XTTS model on {self.device}...")

        from TTS.api import TTS # Import TTS here

        for attempt in range(max_retries):
            try:
                print(f"   Attempt {attempt + 1}/{max_retries}...")
                self.tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(self.device)
                print("✅ XTTS model loaded successfully!")
                break
            except Exception as e:
                print(f"❌ Error loading model (attempt {attempt + 1}): {str(e)}")
                if attempt == max_retries - 1:
                    print("🚨 Failed to load model after all retries!")
                    raise  # Re-raise the last error with its original traceback
                print("⏳ Retrying in 10 seconds...")
                time.sleep(10)

    def get_supported_languages(self):
        """Return list of supported languages."""
        return [
            ("English", "en"), ("Spanish", "es"), ("French", "fr"),
            ("German", "de"), ("Italian", "it"), ("Portuguese", "pt"),
            ("Polish", "pl"), ("Turkish", "tr"), ("Russian", "ru"),
            ("Dutch", "nl"), ("Czech", "cs"), ("Arabic", "ar"),
            ("Chinese", "zh-cn"), ("Japanese", "ja"), ("Hungarian", "hu"), ("Korean", "ko")
        ]

# Initialize the voice cloner
voice_cloner = XTTSVoiceCloner()
print("🎉 Voice cloner ready!")

Torchaudio version: 2.6.0+cu124
📚 Libraries imported successfully!
🔄 Loading XTTS model on cuda...
   Attempt 1/3...
 > You must confirm the following:
 | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"
 | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]
 | | > y



✅ XTTS model loaded successfully!
🎉 Voice cloner ready!
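
The retry loop in `XTTSVoiceCloner.__init__` waits a fixed 10 seconds between attempts. As a sketch of the same idea in reusable form, the helper below retries any callable and doubles the delay after each failure (exponential backoff, which is a swapped-in variation, not what the notebook does). `retry` and its parameters are hypothetical names introduced for illustration.

```python
import time


def retry(operation, max_retries=3, base_delay=10.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_retries:
                raise  # Out of attempts: surface the last error unchanged
            delay = base_delay * 2 ** (attempt - 1)  # base, 2x base, 4x base, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

In this notebook it could wrap the model load, for example `retry(lambda: TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device))`. The `sleep` parameter exists only so the waiting can be replaced in tests.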


## 🎵 Step 3: Define Audio Processing Functions

These functions validate the uploaded file (size and extension), read its duration and sample rate, and check that the clip length is suitable for voice cloning.

In [3]:
# Audio processing functions
def validate_audio_file(audio_file):
    """Validate uploaded audio file."""
    if audio_file is None:
        return False, "❌ No audio file uploaded!"

    # Check file size (max 50MB)
    file_size = os.path.getsize(audio_file) / (1024 * 1024)  # MB
    if file_size > 50:
        return False, f"❌ File too large ({file_size:.1f}MB). Max size: 50MB"

    # Check file extension
    valid_extensions = ['.wav', '.mp3', '.flac', '.m4a', '.ogg']
    file_ext = os.path.splitext(audio_file)[1].lower()
    if file_ext not in valid_extensions:
        return False, f"❌ Unsupported format: {file_ext}. Use: {', '.join(valid_extensions)}"

    return True, "✅ Audio file is valid!"

def get_audio_info(audio_file):
    """Get information about the audio file."""
    try:
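        import librosa
        # mono=False keeps the channel layout so the channel count below is accurate
        y, sr = librosa.load(audio_file, sr=None, mono=False)
        duration = y.shape[-1] / sr  # Samples per channel divided by sample rate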

        return {
            "duration": duration,
            "sample_rate": sr,
            "channels": 1 if len(y.shape) == 1 else y.shape[0],
            "format": os.path.splitext(audio_file)[1][1:].upper()
        }
    except Exception as e:
        return {"error": str(e)}

def process_reference_audio(audio_file):
    """Process and validate reference audio for voice cloning."""
    if not audio_file:
        return None, "❌ Please upload a reference audio file!"

    # Validate file
    is_valid, message = validate_audio_file(audio_file)
    if not is_valid:
        return None, message

    # Get audio info
    info = get_audio_info(audio_file)
    if "error" in info:
        return None, f"❌ Error reading audio: {info['error']}"

    # Check duration (hard limits: 1-60 seconds; 3-30 seconds recommended)
    duration = info["duration"]
    if duration < 1:
        return None, f"⚠️ Audio too short ({duration:.1f}s, minimum 1s). Use 3-30 seconds for best results."
    elif duration > 60:
        return None, f"⚠️ Audio too long ({duration:.1f}s, maximum 60s). Use 3-30 seconds for best results."

    status_msg = f"✅ Audio processed successfully!\n"
    status_msg += f"📊 Duration: {duration:.1f}s | Sample Rate: {info['sample_rate']}Hz | Format: {info['format']}"

    if duration < 3 or duration > 30:
        status_msg += "\n💡 Tip: 3-30 seconds of clear speech works best!"

    return audio_file, status_msg

print("🎵 Audio processing functions defined!")

🎵 Audio processing functions defined!
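
For plain PCM WAV files, the same duration, sample rate, and channel information can be read from the file header without decoding the audio. The sketch below uses Python's built-in `wave` module as a lightweight alternative to `librosa`; it handles WAV only, and `wav_info`, `write_silence`, and `demo` are hypothetical helpers that are not part of the notebook.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Read duration, sample rate, and channel count from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "duration": frames / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


def write_silence(path, seconds=2, rate=16000):
    """Write a mono 16-bit silent WAV, used here only as a test fixture."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


def demo():
    """Create a 2-second fixture in a temp directory and inspect it."""
    path = os.path.join(tempfile.mkdtemp(), "silence.wav")
    write_silence(path, seconds=2, rate=16000)
    return wav_info(path)


if __name__ == "__main__":
    print(demo())
```

A fixture like this is also a convenient way to exercise the duration limits in `process_reference_audio` without recording real speech.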


## 🗣️ Step 4: Create Text-to-Speech Generation Function

This is the core function that performs voice cloning using the reference audio and input text.

In [4]:
# Main voice cloning function for Gradio
def clone_voice(reference_audio, input_text, language, progress=gr.Progress()):
    """
    Main function to clone voice using reference audio and input text.

    Args:
        reference_audio: Uploaded audio file path
        input_text: Text to convert to speech
        language: Selected language code
        progress: Gradio progress tracker

    Returns:
        tuple: (output_audio_path, status_message)
    """

    # Progress tracking
    progress(0.1, desc="🔍 Validating inputs...")

    # Validate inputs
    if not reference_audio:
        return None, "❌ Please upload a reference audio file!"

    if not input_text or len(input_text.strip()) < 5:
        return None, "❌ Please enter text (at least 5 characters)!"

    if len(input_text.strip()) > 1000:
        return None, "❌ Text too long! Please keep it to 1000 characters or fewer."

    progress(0.2, desc="🎵 Processing reference audio...")

    # Process reference audio
    processed_audio, audio_status = process_reference_audio(reference_audio)
    if not processed_audio:
        return None, audio_status

    progress(0.4, desc="🤖 Generating speech...")

    try:
        # Create temporary output file
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
            output_path = tmp_file.name

        # Generate speech using XTTS
        voice_cloner.tts.tts_to_file(
            text=input_text.strip(),
            file_path=output_path,
            speaker_wav=processed_audio,
            language=language
        )

        progress(0.9, desc="✅ Finalizing...")

        # Check if output file was created successfully
        if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
            return None, "❌ Failed to generate audio. Please try again."

        progress(1.0, desc="🎉 Complete!")

        # Success message
        char_count = len(input_text)
        word_count = len(input_text.split())
        success_msg = f"🎉 Voice cloning successful!\n"
        success_msg += f"📝 Generated {word_count} words ({char_count} characters)\n"
        success_msg += f"🎤 Language: {language.upper()}\n"
        success_msg += f"🔊 Audio ready for playback!"

        return output_path, success_msg

    except Exception as e:
        error_msg = f"❌ Error during voice generation: {str(e)}"
        print(f"Voice cloning error: {e}")
        return None, error_msg

def get_example_text():
    """Return example text for demonstration."""
    examples = [
        "Hello! This is a test of the voice cloning system. How do I sound?",
        "Welcome to our AI voice cloning demo. This technology can replicate voices with just a short audio sample.",
        "The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet.",
        "In a world where technology advances rapidly, voice cloning represents a fascinating frontier in artificial intelligence."
    ]
    return examples

print("🗣️ Voice cloning function ready!")

🗣️ Voice cloning function ready!
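
`clone_voice` rejects scripts longer than 1000 characters. One possible way to handle longer scripts is to split them into chunks at sentence boundaries and generate each chunk separately. The sketch below shows only the splitting step; `split_text` is a hypothetical helper, and the sentence-boundary rule (whitespace after `.`, `!`, or `?`) is a simplifying assumption that will not suit every language.

```python
import re


def split_text(text, max_chars=1000):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # Sentence still fits in the current chunk
        else:
            chunks.append(current)  # Close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_text("One. Two! Three? Four.", max_chars=10):
        print(repr(chunk))
```

Each chunk could then be passed to `clone_voice` in turn; joining the resulting audio files is left out of this sketch.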


## 🎨 Step 5: Build Gradio Interface Components

Now we'll create the user-friendly web interface with all the input and output components.

In [15]:
# Create Gradio interface
def create_gradio_interface():
    """Create and configure the Gradio web interface."""

    # Custom CSS for better styling
    custom_css = """
    .gradio-container {
        max-width: 1200px !important;
        margin: auto !important;
    }
    .header {
        text-align: center;
        margin-bottom: 2rem;
    }
    .info-box {
        background: linear-gradient(45deg, #f0f9ff, #e0f2fe);
        padding: 1rem;
        border-radius: 8px;
        border-left: 4px solid #0ea5e9;
        margin: 1rem 0;
    }
    .info-box li {
        color: black !important;
    }
    """

    # Create the interface
    with gr.Blocks(css=custom_css, title="XTTS Voice Cloner", theme=gr.themes.Soft()) as interface:

        # Header
        gr.HTML("""
        <div class="header">
            <h1>🎤 XTTS Voice Cloner</h1>
            <p style="font-size: 1.2em; color: #666;">
                High-quality voice cloning using AI • Upload your voice, enter text, get AI speech!
            </p>
        </div>
        """)

        # Instructions
        gr.HTML("""
        <div class="info-box">
            <h3 style="color: black">📋 How to use:</h3>
            <ol>
                <li><strong style="color: black">Upload Reference Audio:</strong> A clear 3-30 second recording of the target voice</li>
                <li><strong style="color: black">Enter Text:</strong> What you want the AI to say (up to 1000 characters)</li>
                <li><strong style="color: black">Select Language:</strong> Choose the language for speech generation</li>
                <li><strong style="color: black">Generate:</strong> Click the button and wait for the magic! ✨</li>
            </ol>
            <p style="color: black"><strong style="color: black">💡 Tips:</strong> Use high-quality audio with minimal background noise for best results!</p>
        </div>
        """)

        with gr.Row():
            with gr.Column(scale=1):
                # Input section
                gr.HTML("<h3>🎯 Inputs</h3>")

                # File upload
                reference_audio = gr.Audio(
                    label="🎤 Reference Audio (Upload your voice sample)",
                    type="filepath",
                    sources=["upload"],
                    interactive=True
                )

                # Text input
                input_text = gr.Textbox(
                    label="📝 Text to Convert to Speech",
                    placeholder="Enter the text you want the AI to speak...",
                    lines=4,
                    max_lines=8,
                    interactive=True
                )

                # Language selection
                language_choices = voice_cloner.get_supported_languages()
                language = gr.Dropdown(
                    choices=language_choices,
                    value="en",
                    label="🌍 Language",
                    interactive=True
                )

                # Generate button
                generate_btn = gr.Button(
                    "🚀 Generate Cloned Voice",
                    variant="primary",
                    size="lg"
                )

                # Example texts
                gr.HTML("<h4>📚 Example Texts:</h4>")
                example_texts = get_example_text()
                for i, example in enumerate(example_texts):
                    gr.Button(
                        f"Example {i+1}",
                        size="sm"
                    ).click(
                        lambda x=example: x,
                        outputs=input_text
                    )

            with gr.Column(scale=1):
                # Output section
                gr.HTML("<h3>🎧 Results</h3>")

                # Status display
                status_output = gr.Textbox(
                    label="📊 Status",
                    interactive=False,
                    lines=6
                )

                # Audio player
                audio_output = gr.Audio(
                    label="🔊 Generated Speech",
                    interactive=False
                )

                # Download info
                gr.HTML("""
                <div style="background: #f8fafc; padding: 1rem; border-radius: 6px; margin-top: 1rem;">
                    <p style="color: black"><strong style="color: black">💾 Download:</strong> Click the ⋯ menu in the audio player above to download your generated speech!</p>
                </div>
                """)

        # Footer
        gr.HTML("""
        <div style="text-align: center; margin-top: 2rem; padding: 1rem; border-top: 1px solid #e5e7eb;">
            <p style="color: #6b7280;">
                🚀 Powered by <strong>XTTS v2</strong> •
                🔬 Running on <strong>Google Colab</strong> •
                ❤️ Open Source AI
            </p>
        </div>
        """)

        # Connect the generate button
        generate_btn.click(
            fn=clone_voice,
            inputs=[reference_audio, input_text, language],
            outputs=[audio_output, status_output],
            show_progress="full"
        )

    return interface

print("🎨 Gradio interface components ready!")

🎨 Gradio interface components ready!
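
The example buttons are wired up with `lambda x=example: x` rather than `lambda: example`. The default argument matters because Python closures look up loop variables when the function is called, not when it is created. The snippet below is a standalone illustration of that behavior in plain Python, independent of Gradio; the variable names are made up for the example.

```python
examples = ["first", "second", "third"]

# Late binding: each lambda looks up `text` when it is called,
# by which time the loop has finished and `text` is the last item.
late_bound = [lambda: text for text in examples]

# Default argument: the current value of `text` is captured
# at the moment each lambda is created.
early_bound = [lambda x=text: x for text in examples]

print([f() for f in late_bound])   # ['third', 'third', 'third']
print([f() for f in early_bound])  # ['first', 'second', 'third']
```

Without the default argument, every example button would fill the textbox with the last example.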


## 🚀 Step 6: Launch Gradio Web Application

Finally, let's launch the web interface! With `share=True`, Gradio creates a temporary public URL that you can share with others.

In [16]:
# Launch the Gradio interface
print("🚀 Launching XTTS Voice Cloner Web Interface...")
print("=" * 60)

# Create the interface
app = create_gradio_interface()

# Launch with public sharing enabled
try:
    app.launch(
        share=True,          # Creates public URL for sharing
        inbrowser=True,      # Opens in browser automatically
        server_name="0.0.0.0",  # Allow external connections
        server_port=7860,    # Default port
        show_error=True,     # Show detailed errors
        quiet=False          # Show launch info
    )
except Exception as e:
    print(f"❌ Error launching interface: {e}")
    print("🔄 Trying alternative launch...")

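    # Retry without a fixed port (and without inbrowser) so Gradio can pick a free one;
    # port 7860 is often still held by a previous launch when this cell is re-run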
    app.launch(
        share=True,
        server_name="0.0.0.0",
        show_error=True
    )

🚀 Launching XTTS Voice Cloner Web Interface...
❌ Error launching interface: Cannot find empty port in range: 7860-7860. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.
🔄 Trying alternative launch...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d5cae106eea894c7e8.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
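
The output above shows the first launch failing because port 7860 was already taken, most likely by an earlier run of the same cell. Leaving out `server_port`, as the fallback does, is the simplest fix. If an explicit port is needed, one option is to ask the operating system for a free one first. This is a minimal sketch with the standard library; `find_free_port` is a hypothetical helper, not part of Gradio.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # Port 0 means "pick any free port"
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(f"Free port: {find_free_port()}")
```

The result could be passed as `server_port=find_free_port()`. Another process could grab the port between the check and the launch, so treat it as a best-effort choice.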


## 🎉 Success! Your Voice Cloner is Running!

If everything worked correctly, you should see:
1. **Local URL**: `http://localhost:7860` (or another port if 7860 was busy) - For your use; in Colab, Gradio may print only the public URL
2. **Public URL**: `https://xxxxxxx.gradio.live` - **Share this link with others!**

If you publish this notebook in a repository, a short README such as the following is enough:

```markdown
# XTTS Voice Cloner

## Quick Start
1. Open this notebook in Google Colab
2. Run all cells (Runtime → Run all)
3. Use the web interface that opens
4. Share the public URL with others!
```

**Include these files** in your repo:
- This notebook file (`.ipynb`)
- README.md with instructions
- requirements.txt (optional, packages are installed in notebook)

### 🔧 Customization Options:

- **Change supported languages**: Modify the `get_supported_languages()` function
- **Adjust UI theme**: Change `theme=gr.themes.Soft()` to other Gradio themes
- **Add more examples**: Extend the `get_example_text()` function
- **Custom styling**: Modify the `custom_css` variable
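
As an illustration of the first option, the sketch below narrows the dropdown to a subset of languages and looks up display names by code. The list is copied from `get_supported_languages()`; `restrict_languages` and `language_name` are hypothetical helpers introduced here, not part of the notebook.

```python
# Copy of the (display name, code) pairs returned by get_supported_languages()
SUPPORTED_LANGUAGES = [
    ("English", "en"), ("Spanish", "es"), ("French", "fr"),
    ("German", "de"), ("Italian", "it"), ("Portuguese", "pt"),
    ("Polish", "pl"), ("Turkish", "tr"), ("Russian", "ru"),
    ("Dutch", "nl"), ("Czech", "cs"), ("Arabic", "ar"),
    ("Chinese", "zh-cn"), ("Japanese", "ja"), ("Hungarian", "hu"), ("Korean", "ko"),
]


def restrict_languages(codes, languages=SUPPORTED_LANGUAGES):
    """Keep only the requested codes, preserving the original list order."""
    wanted = set(codes)
    return [(name, code) for name, code in languages if code in wanted]


def language_name(code, languages=SUPPORTED_LANGUAGES):
    """Return the display name for a language code, or None if unknown."""
    return {c: n for n, c in languages}.get(code)


if __name__ == "__main__":
    print(restrict_languages(["fr", "en"]))
    print(language_name("zh-cn"))
```

Returning `restrict_languages([...])` from `get_supported_languages()` would shorten the dropdown without touching the rest of the interface.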

### 🚨 Important Notes:

- **Colab sessions are time-limited** - runtimes disconnect when idle and have a maximum lifetime (about 12 hours on the free tier)
- **GPU quota is limited** - use efficiently
- **Files are temporary** - download results before session ends
- **Public URLs stop working** when the Colab session ends (the share link itself expires after 1 week)
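
Because generated audio is written to temporary files, one way to keep results is to copy each output into a folder that outlives the session. The sketch below copies a file under a timestamped name; `keep_output` and `demo` are hypothetical helpers, and the destination is an assumption (for example, a mounted Google Drive folder, which this notebook does not set up).

```python
import shutil
import tempfile
import time
from pathlib import Path


def keep_output(generated_path, dest_dir, prefix="xtts"):
    """Copy a generated file into dest_dir under a timestamped name."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{prefix}-{stamp}{Path(generated_path).suffix}"
    shutil.copy2(generated_path, target)
    return target


def demo():
    """Exercise keep_output with a throwaway file instead of real audio."""
    work = Path(tempfile.mkdtemp())
    source = work / "generated.wav"
    source.write_bytes(b"RIFF-placeholder")  # Stand-in for real WAV data
    saved = keep_output(source, work / "saved")
    return saved.exists(), saved.suffix, saved.read_bytes() == source.read_bytes()


if __name__ == "__main__":
    print(demo())
```

Two files saved within the same second would share a name, so add a counter or random suffix if outputs are generated in quick succession.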

### 🌟 Features Available:

✅ Voice cloning from a short reference sample (WAV, MP3, FLAC, M4A, OGG)  
✅ 16-language support  
✅ Real-time progress tracking  
✅ Audio validation and processing  
✅ Download generated speech  
✅ Mobile-friendly interface  
✅ Public sharing capability  

**Enjoy your AI voice cloning experience!** 🎤✨