# 🎧 TheLostChapter - TTS Voice Cloning

Generate audiobook narration with your own cloned voice in **Vietnamese** and **English**.

## Features
- **viXTTS**: Fine-tuned for Vietnamese (best quality)
- **XTTS v2**: Multilingual base model (17 languages; viXTTS adds Vietnamese for a total of 18)
- **Edge TTS**: Quick generation without cloning

## Instructions
1. **Runtime > Change runtime type > T4 GPU** (recommended)
2. Run cells in order
3. Upload your voice sample (6-30 seconds)
4. Enter text and generate!

---

## 1. Install Dependencies

In [None]:
#@title Install TTS and dependencies { display-mode: "form" }
!pip install -q TTS==0.22.0
!pip install -q edge-tts
!pip install -q soundfile
!pip install -q huggingface_hub

print("✅ Dependencies installed!")

## 2. Download viXTTS Model (Vietnamese)

This cell downloads the fine-tuned Vietnamese model (~2GB). You only need to run it once per session; files that already exist are skipped.

In [None]:
#@title Download viXTTS Model { display-mode: "form" }
from huggingface_hub import hf_hub_download
from pathlib import Path
import os

MODEL_DIR = Path("/content/models/vixtts")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

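# Download from capleaf/viXTTS
# Named model_files so it does not shadow the google.colab `files` module used in later cells
model_files = ["config.json", "model.pth", "vocab.json"]

print("Downloading viXTTS model...")
for filename in model_files: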
    if not (MODEL_DIR / filename).exists():
        print(f"  Downloading {filename}...")
        hf_hub_download(
            repo_id="capleaf/viXTTS",
            filename=filename,
            local_dir=str(MODEL_DIR),
            local_dir_use_symlinks=False
        )
    else:
        print(f"  {filename} already exists")

print("\n✅ viXTTS model ready!")

## 3. Upload Your Voice Sample

Upload a **6-30 second** audio clip of your voice:
- Clear speech, no background noise
- WAV or MP3 format
- Natural speaking pace
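
A quick way to confirm that a clip falls inside the 6-30 second window before uploading is to read its header. The sketch below uses only Python's standard `wave` module, so it handles uncompressed WAV files only (not MP3). The helper names `wav_duration_seconds` and `is_usable_sample` are illustrative and are not part of the notebook.

```python
import wave

MIN_SECONDS = 6.0   # lower bound recommended above
MAX_SECONDS = 30.0  # upper bound recommended above

def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def is_usable_sample(path):
    """True if the clip length is within the recommended 6-30 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, `is_usable_sample("my_voice.wav")` (a hypothetical file name) returns `False` for a 3-second clip. It checks length only, not noise or speaking pace.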

In [None]:
#@title Upload Voice Sample { display-mode: "form" }
from google.colab import files
from IPython.display import Audio, display
import shutil

print("Select your voice sample file (WAV or MP3):")
uploaded = files.upload()

if uploaded:
    SPEAKER_WAV = list(uploaded.keys())[0]
    print(f"\n✅ Uploaded: {SPEAKER_WAV}")
    print("\nPreview:")
    display(Audio(SPEAKER_WAV))
else:
    print("❌ No file uploaded")

## 4. Load TTS Model

In [None]:
#@title Load viXTTS Model { display-mode: "form" }
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading viXTTS model...")

config = XttsConfig()
config.load_json(str(MODEL_DIR / "config.json"))

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=str(MODEL_DIR / "model.pth"),
    vocab_path=str(MODEL_DIR / "vocab.json")
)

if torch.cuda.is_available():
    model.cuda()
    print(f"✅ Model loaded on GPU: {torch.cuda.get_device_name()}")
else:
    print("⚠️ Running on CPU (slower)")

# Compute speaker embedding from uploaded sample
print(f"\nProcessing voice sample: {SPEAKER_WAV}")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=SPEAKER_WAV)
print("✅ Voice cloned successfully!")

## 5. Generate Vietnamese Audio 🇻🇳

Enter Vietnamese text (10+ words for best quality).

In [None]:
#@title Generate Vietnamese Audio { display-mode: "form" }
import soundfile as sf
from IPython.display import Audio, display

#@markdown ### Enter Vietnamese text:
text_vi = "Xin chào các bạn, đây là giọng nói của tôi được tạo bằng trí tuệ nhân tạo. Chất lượng âm thanh sẽ tốt hơn với các câu dài hơn mười từ." #@param {type:"string"}

#@markdown ### Settings:
temperature = 0.7 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
output_filename = "output_vi.wav" #@param {type:"string"}

print(f"Generating audio for: {text_vi[:50]}...")

out = model.inference(
    text_vi,
    "vi",
    gpt_cond_latent,
    speaker_embedding,
    temperature=temperature
)

sf.write(output_filename, out["wav"], 24000)
print(f"\n✅ Generated: {output_filename}")
print("\nPlayback:")
display(Audio(output_filename))

# Download button
files.download(output_filename)

## 6. Generate English Audio 🇬🇧

viXTTS also supports English (and 16 other languages).

In [None]:
#@title Generate English Audio { display-mode: "form" }

#@markdown ### Enter English text:
text_en = "Welcome to The Lost Chapter, an interactive audiobook experience. Let me guide you through this amazing story with my voice." #@param {type:"string"}

#@markdown ### Settings:
temperature_en = 0.7 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
output_filename_en = "output_en.wav" #@param {type:"string"}

print(f"Generating audio for: {text_en[:50]}...")

out_en = model.inference(
    text_en,
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=temperature_en
)

sf.write(output_filename_en, out_en["wav"], 24000)
print(f"\n✅ Generated: {output_filename_en}")
print("\nPlayback:")
display(Audio(output_filename_en))

files.download(output_filename_en)

## 7. Batch Generation (Multiple Sentences)

Generate audio for multiple paragraphs and combine them.
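
The arithmetic behind combining paragraphs is simple: at 24,000 samples per second, a 0.5 s pause is 12,000 zero-valued samples. The sketch below illustrates this with plain Python lists standing in for the audio arrays, so it runs without a model. The function names are illustrative, and it places pauses only between segments, which differs slightly from the cell below (that cell also appends a pause after the last paragraph).

```python
SAMPLE_RATE = 24000   # samples per second, the rate this notebook writes its WAV files at
PAUSE_SECONDS = 0.5   # pause length used in the batch cell

def pause_length(seconds=PAUSE_SECONDS, sample_rate=SAMPLE_RATE):
    """Number of zero-valued samples needed for a pause of the given length."""
    return int(sample_rate * seconds)

def join_with_pauses(segments, pause_samples):
    """Concatenate sample lists, inserting silence between (not after) segments."""
    combined = []
    for index, segment in enumerate(segments):
        if index > 0:
            combined.extend([0.0] * pause_samples)
        combined.extend(segment)
    return combined

def duration_seconds(samples, sample_rate=SAMPLE_RATE):
    """Length of the audio in seconds."""
    return len(samples) / sample_rate

# Two stand-in "paragraphs" holding 1.0 s and 2.0 s of audio.
segments = [[0.1] * SAMPLE_RATE, [0.2] * (2 * SAMPLE_RATE)]
combined = join_with_pauses(segments, pause_length())
print(duration_seconds(combined))  # 3.5
```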

In [None]:
#@title Batch Generate from Text File { display-mode: "form" }
import re
import numpy as np

#@markdown ### Enter multiple paragraphs (separated by blank lines):
batch_text = """Chương một: Khởi đầu.

Ngày xưa, ở một vương quốc xa xôi, có một chàng trai trẻ tên là Minh. Minh luôn mơ ước được khám phá thế giới rộng lớn ngoài kia.

Một ngày nọ, Minh quyết định rời khỏi ngôi làng nhỏ của mình để bắt đầu cuộc phiêu lưu mới. Chàng mang theo một chiếc ba lô nhỏ và trái tim đầy hy vọng.""" #@param {type:"string"}

#@markdown ### Language:
batch_lang = "vi" #@param ["vi", "en"]
batch_output = "batch_output.wav" #@param {type:"string"}

# Split into paragraphs
# re.split also handles blank lines that contain spaces or Windows line endings
paragraphs = [p.strip() for p in re.split(r'\n\s*\n', batch_text) if p.strip()]
print(f"Found {len(paragraphs)} paragraphs")

all_audio = []
silence = np.zeros(int(24000 * 0.5))  # 0.5s silence between paragraphs

for i, para in enumerate(paragraphs):
    print(f"\nGenerating paragraph {i+1}/{len(paragraphs)}: {para[:40]}...")
    out = model.inference(
        para,
        batch_lang,
        gpt_cond_latent,
        speaker_embedding,
        temperature=0.7
    )
    all_audio.append(out["wav"])
    all_audio.append(silence)

# Combine all audio
combined = np.concatenate(all_audio)
sf.write(batch_output, combined, 24000)

print(f"\n✅ Combined audio saved: {batch_output}")
print(f"Duration: {len(combined)/24000:.1f} seconds")
display(Audio(batch_output))

files.download(batch_output)

## 8. Edge TTS (No Voice Cloning)

Quick generation using Microsoft's neural voices. No GPU required.

In [None]:
#@title Edge TTS - Vietnamese Voices { display-mode: "form" }
import edge_tts
import asyncio

#@markdown ### Text:
edge_text = "Xin chào, đây là giọng đọc từ Microsoft Edge. Chất lượng khá tốt và không cần GPU." #@param {type:"string"}

#@markdown ### Voice:
edge_voice = "vi-VN-HoaiMyNeural" #@param ["vi-VN-HoaiMyNeural", "vi-VN-NamMinhNeural", "en-US-GuyNeural", "en-US-JennyNeural", "en-GB-RyanNeural"]

edge_output = "edge_output.mp3" #@param {type:"string"}

async def generate_edge():
    communicate = edge_tts.Communicate(edge_text, edge_voice)
    await communicate.save(edge_output)

await generate_edge()

print(f"✅ Generated: {edge_output}")
display(Audio(edge_output))
files.download(edge_output)

## 9. Generate from Chapter JSON

Upload a TheLostChapter chapter JSON file to generate all audio sections.
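
The cell below reads only a handful of keys: `title`, `sections`, and, for each section, `type`, `transcript`, `language` and `src`. The sketch that follows shows a minimal file of that shape and how the sections are selected. The sample values and the `content` key are hypothetical; the real TheLostChapter schema may contain more fields.

```python
import json

# Hypothetical chapter file; only the keys the generation cell reads matter here.
chapter_json = """
{
  "title": "Chapter One",
  "sections": [
    {"type": "text", "content": "Not narrated."},
    {"type": "audio", "transcript": "Welcome to the story.", "language": "en", "src": "intro.wav"},
    {"type": "audio", "transcript": "Xin chào các bạn."},
    {"type": "audio", "transcript": ""}
  ]
}
"""

chapter = json.loads(chapter_json)

# Same filter as the generation cell: audio sections with a non-empty transcript.
audio_sections = [
    (i, s) for i, s in enumerate(chapter.get("sections", []))
    if s.get("type") == "audio" and s.get("transcript")
]

for idx, section in audio_sections:
    lang = section.get("language", "vi")             # defaults to Vietnamese
    name = section.get("src", f"section_{idx}.wav")  # defaults to an index-based name
    print(idx, lang, name)
# prints:
# 1 en intro.wav
# 2 vi section_2.wav
```

Sections without a transcript, or with a `type` other than `audio`, are skipped silently, so an empty result usually means the key names do not match.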

In [None]:
#@title Generate from Chapter JSON { display-mode: "form" }
import json
import os

print("Upload a chapter JSON file:")
chapter_upload = files.upload()

if chapter_upload:
    chapter_file = list(chapter_upload.keys())[0]
    with open(chapter_file, 'r', encoding='utf-8') as f:
        chapter = json.load(f)

    print(f"\nChapter: {chapter.get('title', 'Unknown')}")

    # Find audio sections
    audio_sections = [
        (i, s) for i, s in enumerate(chapter.get('sections', []))
        if s.get('type') == 'audio' and s.get('transcript')
    ]

    print(f"Found {len(audio_sections)} audio sections\n")

    os.makedirs('chapter_audio', exist_ok=True)

    for idx, section in audio_sections:
        transcript = section['transcript']
        lang = section.get('language', 'vi')
        output_name = section.get('src', f'section_{idx}.wav')
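        output_path = f"chapter_audio/{output_name}"
        # 'src' may include subfolders; create them so sf.write does not fail
        os.makedirs(os.path.dirname(output_path), exist_ok=True)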

        print(f"Generating: {output_name}")
        print(f"  Text: {transcript[:60]}...")

        out = model.inference(
            transcript,
            lang,
            gpt_cond_latent,
            speaker_embedding,
            temperature=0.7
        )

        sf.write(output_path, out["wav"], 24000)
        print(f"  ✅ Saved: {output_path}\n")

    # Zip and download
    !zip -r chapter_audio.zip chapter_audio/
    files.download('chapter_audio.zip')
    print("\n✅ All audio files zipped and ready for download!")

---

## Tips

### Vietnamese Quality
- Use sentences with **10+ words** for best results
- Shorter sentences may produce odd trailing sounds
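
One way to act on this tip is to merge short sentences with their neighbours before synthesis. The sketch below is a simple greedy approach, not something the notebook does itself: the function names are illustrative, sentence splitting is naive (punctuation followed by whitespace), and "words" are whitespace-separated tokens, which for Vietnamese means syllables rather than dictionary words.

```python
import re

MIN_WORDS = 10  # threshold suggested in the tip above

def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def merge_short_sentences(sentences, min_words=MIN_WORDS):
    """Greedily join neighbouring sentences until each chunk has min_words words."""
    chunks = []
    buffer = ""
    for sentence in sentences:
        buffer = f"{buffer} {sentence}".strip()
        if len(buffer.split()) >= min_words:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        # Leftover short tail: attach it to the previous chunk if there is one.
        if chunks:
            chunks[-1] = f"{chunks[-1]} {buffer}"
        else:
            chunks.append(buffer)
    return chunks
```

Each returned chunk can then be passed to `model.inference` in place of a single short sentence. Text that is shorter than the threshold overall is returned as one chunk.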

### Voice Sample
- **6-30 seconds** of clear speech
- No background noise or music
- Natural speaking pace

### Temperature
- **0.5-0.7**: More consistent, robotic
- **0.7-0.9**: Natural, expressive
- **0.9-1.0**: More variation, may be unstable

### Supported Languages
Vietnamese (vi), English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi)
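
When language codes come from user input or a chapter file, it can help to validate them against this list before calling the model. The sketch below is one possible helper; the name `normalize_language` and the choice to fall back to `vi` for a missing code are assumptions (the fallback mirrors the default used in the chapter JSON cell).

```python
# Codes taken from the list above.
SUPPORTED_LANGUAGES = {
    "vi": "Vietnamese", "en": "English", "es": "Spanish", "fr": "French",
    "de": "German", "it": "Italian", "pt": "Portuguese", "pl": "Polish",
    "tr": "Turkish", "ru": "Russian", "nl": "Dutch", "cs": "Czech",
    "ar": "Arabic", "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}

def normalize_language(code, default="vi"):
    """Return a supported lowercase code, or raise ValueError for unknown ones."""
    if not code:
        return default
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized
```

For instance, `normalize_language("EN")` returns `"en"`, while an unlisted code such as `"xx"` raises `ValueError` rather than failing later inside the model.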

---

**TheLostChapter** | [GitHub](https://github.com/nmnhut-it/english-learning-app/tree/main/the-lost-chapter)