# 🎙️ Voice Cloning with Tortoise TTS

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanvolpe/voice2/blob/master/voice_cloning.ipynb)

This notebook allows you to clone voices using Tortoise TTS, supporting multiple languages and quality presets.

## 🎯 Features
- Voice cloning with Tortoise TTS
- Support for 13 languages
- Multiple quality presets
- GPU acceleration
- Easy voice sample management

## 📝 Prerequisites
1. **Google Colab**: Run this notebook in Colab with a GPU runtime
2. **Voice Samples**: Prepare WAV/MP3 files (5-10 seconds each)
3. **Hugging Face Token**: Optional; loaded from Colab secrets under the name `HF_TOKEN`

# Section 1: Setup 🔧

First, ensure you're using a GPU runtime:
1. Click `Runtime` in the menu
2. Select `Change runtime type`
3. Choose `GPU` from the hardware accelerator dropdown
4. Click `Save`

In [None]:
# Check GPU availability
import torch
print("🔍 Checking GPU configuration...")
print(f"GPU Available: {'✅' if torch.cuda.is_available() else '❌'}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("\n⚠️ No GPU detected! Please change runtime to GPU for better performance:")
    print("1. Runtime > Change runtime type")
    print("2. Select GPU from Hardware accelerator dropdown")
    print("3. Click Save and restart runtime")

In [None]:
# Install dependencies with specific versions
print("📦 Installing dependencies...")

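# First uninstall any existing installations.
# torch was already imported by the GPU check above, so if the imports below
# misbehave after the reinstall, restart the runtime and rerun this cell.
!pip uninstall -y transformers torch torchaudio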

# Install specific versions of dependencies
!pip install --quiet torch==2.0.1 torchaudio==2.0.2
!pip install --quiet transformers==4.30.2 numpy==1.24.3 librosa==0.10.1
!pip install --quiet unidecode==1.3.6 inflect==7.0.0 tqdm==4.65.0

# Install Tortoise TTS
!pip install --quiet git+https://github.com/neonbjb/tortoise-tts.git

# Import dependencies
print("\n🔄 Importing dependencies...")
import os
import torch
import torchaudio
from IPython.display import Audio, display
from google.colab import files, userdata

# Verify transformers version
import transformers
print(f"\n📚 Transformers version: {transformers.__version__}")

# Import Tortoise TTS
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices
print("\n✅ Tortoise TTS imported successfully!")

# Set up Hugging Face token
try:
    hf_token = userdata.get('HF_TOKEN')
    if hf_token:
        os.environ['HF_TOKEN'] = hf_token
        from huggingface_hub import login
        login(token=hf_token)
        print("✅ Hugging Face token configured successfully!")
    else:
        print("⚠️ HF_TOKEN not found in Colab secrets")
        print("Some features might be limited. Add HF_TOKEN to your Colab secrets if needed.")
except Exception as e:
    print(f"❌ Error configuring Hugging Face token: {str(e)}")

# Initialize device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\n🖥️ Using device: {device}")

print("\n🚀 Setup complete! Ready to proceed with voice cloning.")

# Section 2: Voice Sample Management 🎤

## Voice Sample Requirements
For best results, your voice samples should:
- Be in WAV or MP3 format
- Last 5-10 seconds each
- Contain clear speech with minimal background noise
- Have consistent voice tone and quality
- Be recorded in a quiet environment
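
The duration guideline can be checked locally before uploading anything. The following is a minimal sketch that uses only Python's standard `wave` module, so it handles uncompressed WAV files and not MP3. The helper names `wav_duration_seconds` and `within_recommended_range` are illustrative and are not part of this notebook.

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)

def within_recommended_range(duration, low=5.0, high=10.0):
    """True if the duration falls inside the 5-10 second guideline."""
    return low <= duration <= high
```

A call such as `within_recommended_range(wav_duration_seconds("my_sample.wav"))` (with a hypothetical file name) gives a quick yes/no answer for each recording.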

## Sample Management
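The `VoiceSampleManager` class below will:
1. Create a directory for your samples
2. Accept only files with a `.wav` or `.mp3` extension
3. Load each file and report its channel count and sample rate
4. Measure each sample's duration and compare it with the 5-10 second guideline
5. Print feedback for each sample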

In [None]:
class VoiceSampleManager:
    def __init__(self, sample_dir='voice_samples'):
        """Initialize the voice sample manager"""
        self.sample_dir = sample_dir
        self.samples = []
        os.makedirs(sample_dir, exist_ok=True)
        print(f"📁 Sample directory ready: {sample_dir}")
        
    def upload_samples(self):
        """Upload and validate voice samples"""
        print("\n📂 Select your voice sample files (WAV or MP3)")
        uploaded = files.upload()
        
        for filename in uploaded.keys():
            if filename.lower().endswith(('.wav', '.mp3')):
                filepath = os.path.join(self.sample_dir, filename)
                os.rename(filename, filepath)
                print(f"✅ Uploaded: {filename}")
                self.samples.append(filepath)
            else:
                print(f"❌ Skipped: {filename} (not a WAV or MP3 file)")
    
    def analyze_samples(self):
        """Analyze uploaded samples for quality and duration"""
        print("\n🔍 Analyzing voice samples...")
        valid_samples = []
        
        for sample in os.listdir(self.sample_dir):
            filepath = os.path.join(self.sample_dir, sample)
            try:
                # Load and analyze audio
                waveform, sample_rate = torchaudio.load(filepath)
                duration = waveform.size(1) / sample_rate
                
                # Print analysis
                print(f"\n📊 Analysis for {sample}:")
                print(f"- Duration: {duration:.1f} seconds")
                print(f"- Channels: {waveform.size(0)}")
                print(f"- Sample Rate: {sample_rate} Hz")
                
                # Add recommendations
                if duration < 5:
                    print("⚠️ Sample is shorter than recommended (5-10 seconds)")
                elif duration > 10:
                    print("⚠️ Sample is longer than recommended (5-10 seconds)")
                else:
                    print("✅ Duration is within recommended range")
                
                valid_samples.append(filepath)
                
            except Exception as e:
                print(f"❌ Error analyzing {sample}: {str(e)}")
        
        return valid_samples

# Create manager and handle samples
voice_manager = VoiceSampleManager()
voice_manager.upload_samples()
valid_samples = voice_manager.analyze_samples()

# Summary
print("\n📝 Summary:")
print(f"- Total samples: {len(os.listdir(voice_manager.sample_dir))}")
print(f"- Valid samples: {len(valid_samples)}")
if not valid_samples:
    print("\n⚠️ No valid samples found. Please upload some voice samples.")

# Section 3: Configuration ⚙️

## Available Settings

### Quality Presets
Each preset balances speed vs quality:
- `ultra_fast`: Fastest generation, lower quality
- `fast`: Quick generation with decent quality
- `standard`: Balanced speed and quality
- `high_quality`: Best quality, slower generation

### Language Support
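The notebook accepts the following language codes for `input_lang` and `output_lang`. The codes are validated but not forwarded to Tortoise TTS, so they do not change the generated audio: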
- English (en), Spanish (es), French (fr)
- German (de), Italian (it), Portuguese (pt)
- Polish (pl), Turkish (tr), Russian (ru)
- Dutch (nl), Czech (cs), Arabic (ar)
- Chinese (Simplified) (zh-cn)

> Note: Higher quality presets require more VRAM and processing time. Start with 'standard' and adjust based on results.

In [None]:
class VoiceGenerator:
    QUALITY_PRESETS = ['ultra_fast', 'fast', 'standard', 'high_quality']
    SUPPORTED_LANGUAGES = ['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru', 'nl', 'cs', 'ar', 'zh-cn']
    
    def __init__(self, device='cuda'):
        """Initialize the voice generator"""
        self.device = device
        self.tts = TextToSpeech(device=device)
        print(f"✅ Voice Generator initialized on {device}")
    
    def validate_settings(self, quality_preset, input_lang, output_lang):
        """Validate generation settings"""
        if quality_preset not in self.QUALITY_PRESETS:
            raise ValueError(f"Quality preset must be one of {self.QUALITY_PRESETS}")
        
        if input_lang not in self.SUPPORTED_LANGUAGES:
            raise ValueError(f"Input language must be one of {self.SUPPORTED_LANGUAGES}")
            
        if output_lang not in self.SUPPORTED_LANGUAGES:
            raise ValueError(f"Output language must be one of {self.SUPPORTED_LANGUAGES}")
    
    def generate_speech(self, text, voice_samples, quality_preset='standard', 
                       input_lang='en', output_lang='en'):
        """Generate speech using provided samples and settings"""
        # Validate settings
        self.validate_settings(quality_preset, input_lang, output_lang)
        
        # Load and process voice samples
        processed_samples = []
        for sample_path in voice_samples:
            try:
                audio = load_audio(sample_path, 22050)
                processed_samples.append(audio)
            except Exception as e:
                print(f"❌ Error processing {sample_path}: {str(e)}")
                continue
        
        if not processed_samples:
            raise ValueError("No valid voice samples available")
        
        # Generate speech
        print(f"\n🎵 Generating speech with {quality_preset} quality...")
        gen_audio = self.tts.tts_with_preset(
            text,
            voice_samples=processed_samples,
            preset=quality_preset,
            k=1,
            use_deterministic_seed=42  # integer random seed, so repeated runs give the same audio
        )
        
        # Save and return audio
        output_path = 'generated_speech.wav'
        torchaudio.save(output_path, gen_audio.squeeze(0).cpu(), 24000)
        print(f"✅ Speech generated and saved as {output_path}")
        
        return Audio(output_path)

# Initialize generator
generator = VoiceGenerator(device=device)

# Print available options
print("\n📋 Available Settings:")
print("\nQuality Presets:")
for preset in generator.QUALITY_PRESETS:
    print(f"- {preset}")

print("\nSupported Languages:")
for lang in generator.SUPPORTED_LANGUAGES:
    print(f"- {lang}")

# Section 4: Generate and Test Speech 🔊

## Generation Parameters
- **Text**: The text you want to convert to speech
- **Quality**: Choose from available presets
- **Languages**: Select input/output languages
- **Voice Samples**: Uses previously uploaded samples

> Tip: Start with a short test phrase to verify quality before generating longer content.
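
Once a short phrase sounds right, longer content is often easier to handle one piece at a time. The following is a minimal sketch of sentence-based chunking in plain Python. The function name `split_into_chunks` and the 200-character default are assumptions chosen for illustration, not Tortoise TTS settings.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `generator.generate_speech` in turn. Note that this notebook's function writes every result to `generated_speech.wav`, so the file needs to be renamed or copied between calls.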

In [None]:
def generate_test_speech(text, quality='standard', input_lang='en', output_lang='en'):
    """Generate a test speech sample"""
    try:
        # Get validated samples
        samples = voice_manager.analyze_samples()
        if not samples:
            print("❌ No valid voice samples found. Please upload samples first.")
            return
        
        # Generate and display speech
        print("\n🎯 Generating speech with the following settings:")
        print(f"- Quality: {quality}")
        print(f"- Input Language: {input_lang}")
        print(f"- Output Language: {output_lang}")
        print(f"- Text: \"{text}\"")
        
        audio = generator.generate_speech(
            text=text,
            voice_samples=samples,
            quality_preset=quality,
            input_lang=input_lang,
            output_lang=output_lang
        )
        
        print("\n🔊 Playing generated audio...")
        display(audio)
        
        print("\n💾 To download the generated audio:")
        print("1. Find 'generated_speech.wav' in the file browser")
        print("2. Right-click and select 'Download'")
        
    except Exception as e:
        print(f"❌ Error generating speech: {str(e)}")

# Test with a short phrase
test_text = "Hello, this is a test of voice cloning using Tortoise TTS."
generate_test_speech(
    text=test_text,
    quality='standard',  # Try: 'ultra_fast', 'fast', 'standard', 'high_quality'
    input_lang='en',
    output_lang='en'
)

# Section 5: Troubleshooting ⚠️

## Common Issues and Solutions

### 1. Runtime Issues
- **GPU Not Available**
  - Check Runtime > Change runtime type
  - Select GPU from the hardware accelerator dropdown
  - Restart runtime after changes

- **Runtime Disconnected**
  - Save your work frequently
  - Rerun setup cells after reconnection
  - Keep Colab tab active

### 2. Voice Sample Issues
- **Invalid Format**
  - Use WAV or MP3 files only
  - Check file extensions
  - Ensure files aren't corrupted

- **Quality Problems**
  - Record in a quiet environment
  - Use a good microphone
  - Keep consistent voice tone
  - Stay within 5-10 seconds per sample
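
A file extension alone does not prove a file is really WAV or MP3. As a rough sketch, the first few bytes of a file can be inspected with the standard library. This is a heuristic only: the MPEG frame-sync test can also match other MPEG audio streams, and the function name `sniff_audio_format` is hypothetical.

```python
def sniff_audio_format(path):
    """Guess 'wav', 'mp3', or None from the first bytes of a file."""
    with open(path, "rb") as f:
        header = f.read(12)
    # WAV: a RIFF container whose form type is WAVE.
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # MP3 that starts with an ID3v2 tag.
    if header[:3] == b"ID3":
        return "mp3"
    # Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of next byte).
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return None
```

If the guessed format disagrees with the extension, or the result is `None`, re-exporting the recording from an audio editor is usually the simplest fix.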

### 3. Generation Issues
- **Out of Memory**
  - Try a lower quality preset
  - Reduce the number of voice samples
  - Clear runtime memory (Runtime > Restart runtime)

- **Poor Quality Output**
  - Use a higher quality preset
  - Provide clearer voice samples
  - Check input/output language settings
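
The "try a lower quality preset" advice for out-of-memory errors can be automated. The sketch below assumes a hypothetical callable `generate_fn(preset)` that wraps a call such as `generator.generate_speech(..., quality_preset=preset)`, and it assumes memory exhaustion surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch commonly reports it.

```python
PRESETS_FASTEST_FIRST = ["ultra_fast", "fast", "standard", "high_quality"]

def generate_with_fallback(generate_fn, preset="standard"):
    """Call generate_fn(preset); on an out-of-memory error, retry one preset lower.

    Returns a (preset_used, result) tuple.
    """
    index = PRESETS_FASTEST_FIRST.index(preset)
    while True:
        current = PRESETS_FASTEST_FIRST[index]
        try:
            return current, generate_fn(current)
        except RuntimeError as err:
            # Re-raise anything that is not a memory error, and give up
            # when there is no lower preset left to try.
            if "out of memory" not in str(err).lower() or index == 0:
                raise
            print(f"Out of memory with '{current}', retrying with a lower preset")
            index -= 1
```

In a real session it may also help to call `torch.cuda.empty_cache()` before each retry; that step is left out here so the sketch runs without a GPU.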

### 4. File Management
- **Lost Files**
  - Download generated audio immediately
  - Keep a backup of your voice samples
  - Save notebook frequently

- **Storage Full**
  - Clear old generated files
  - Remove unused voice samples
  - Reset runtime if needed
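
Because `generate_speech` always writes to `generated_speech.wav`, each run overwrites the previous result. One possible way to address both lost files and a full disk is to copy each result to a timestamped name and prune older copies. This is a sketch only: the function name `archive_output`, the `outputs` directory, and the `keep` limit are all assumptions, not part of the notebook.

```python
import os
import shutil
import time

def archive_output(src="generated_speech.wav", out_dir="outputs", keep=5):
    """Copy src into out_dir under a timestamped name and prune older copies."""
    os.makedirs(out_dir, exist_ok=True)
    base = "speech_" + time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(out_dir, f"{base}.wav")
    # Avoid clobbering a file archived within the same second.
    counter = 1
    while os.path.exists(dest):
        dest = os.path.join(out_dir, f"{base}_{counter}.wav")
        counter += 1
    shutil.copy(src, dest)

    # Sort archived files oldest first (modification time, then name).
    archived = sorted(
        (os.path.join(out_dir, name)
         for name in os.listdir(out_dir) if name.endswith(".wav")),
        key=lambda p: (os.path.getmtime(p), p),
    )
    # Remove everything except the `keep` most recent files.
    for old in archived[:max(len(archived) - keep, 0)]:
        os.remove(old)
    return dest
```

Calling `archive_output()` right after each generation keeps a short history on the Colab disk; the files still disappear when the runtime is reset, so downloading important results remains necessary.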

## Getting Help
- Check [Tortoise TTS Documentation](https://github.com/neonbjb/tortoise-tts)
- Visit [Google Colab FAQs](https://research.google.com/colaboratory/faq.html)
- Review error messages carefully
- Try the suggested solutions above