# 🎧 TheLostChapter - TTS Voice Cloning

Generate audiobook narration in **Vietnamese** and **English**.

## Two Options

| Method | Voice Clone | Quality | Speed | GPU |
|--------|-------------|---------|-------|-----|
| **Edge TTS** | ❌ No | Good | Fast | Not needed |
| **viXTTS** | ✅ Yes | Best | Slower | Recommended |

## Quick Start
- **Just want Vietnamese TTS?** → Skip to **Section A: Edge TTS** (no setup needed)
- **Want to clone your voice?** → Go to **Section B: viXTTS**

---

# 🅰️ Section A: Edge TTS (Easy - No Voice Cloning)

Microsoft's neural voices. High quality, fast, no GPU needed.

**Available Vietnamese voices:**
- `vi-VN-HoaiMyNeural` - Nữ (Female)
- `vi-VN-NamMinhNeural` - Nam (Male)
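
To find voices for another locale, the Edge TTS voice catalog can be filtered by locale prefix. The following is a minimal sketch, assuming each catalog entry is a dict with `ShortName`, `Locale`, and `Gender` keys, which is the shape `edge_tts.list_voices()` returns at the time of writing. `voices_for_locale` is a hypothetical helper, not part of edge-tts, and the sample catalog is hand-written for illustration.

```python
def voices_for_locale(catalog, locale_prefix):
    """Return sorted (ShortName, Gender) pairs whose Locale starts with locale_prefix."""
    prefix = locale_prefix.lower()
    return sorted(
        (voice["ShortName"], voice["Gender"])
        for voice in catalog
        if voice["Locale"].lower().startswith(prefix)
    )

# Hand-written entries standing in for the real catalog (assumption).
sample_catalog = [
    {"ShortName": "vi-VN-HoaiMyNeural", "Locale": "vi-VN", "Gender": "Female"},
    {"ShortName": "vi-VN-NamMinhNeural", "Locale": "vi-VN", "Gender": "Male"},
    {"ShortName": "en-US-GuyNeural", "Locale": "en-US", "Gender": "Male"},
]

vi_voices = voices_for_locale(sample_catalog, "vi")
for name, gender in vi_voices:
    print(f"{name} ({gender})")
```

In the notebook, once Edge TTS is installed, the live catalog could be passed in instead: `voices_for_locale(await edge_tts.list_voices(), "vi")`.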

In [None]:
#@title 1. Install Edge TTS { display-mode: "form" }
!pip install -q edge-tts
print("✅ Edge TTS installed!")

In [None]:
#@title 2. Generate Vietnamese Audio (Edge TTS) { display-mode: "form" }
import edge_tts
import asyncio
from IPython.display import Audio, display
from google.colab import files

#@markdown ### Enter Vietnamese text:
text = "Xin chào các bạn, đây là giọng đọc từ Microsoft Edge. Chất lượng khá tốt và hoàn toàn miễn phí, không cần GPU hay voice sample." #@param {type:"string"}

#@markdown ### Select voice:
voice = "vi-VN-HoaiMyNeural" #@param ["vi-VN-HoaiMyNeural", "vi-VN-NamMinhNeural"]

#@markdown ### Output file name:
output_file = "output_vi.mp3" #@param {type:"string"}

print(f"Generating with voice: {voice}")
print(f"Text: {text[:60]}...\n")

async def generate():
    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(output_file)

await generate()

print(f"✅ Generated: {output_file}")
print("\n🔊 Playback:")
display(Audio(output_file))

print("\n📥 Downloading...")
files.download(output_file)

In [None]:
#@title 3. Generate English Audio (Edge TTS) { display-mode: "form" }

#@markdown ### Enter English text:
text_en = "Welcome to The Lost Chapter, an interactive audiobook experience. This voice is generated using Microsoft Edge neural text to speech." #@param {type:"string"}

#@markdown ### Select voice:
voice_en = "en-US-GuyNeural" #@param ["en-US-GuyNeural", "en-US-JennyNeural", "en-GB-RyanNeural", "en-GB-SoniaNeural", "en-AU-WilliamNeural"]

output_en = "output_en.mp3" #@param {type:"string"}

async def generate_en():
    communicate = edge_tts.Communicate(text_en, voice_en)
    await communicate.save(output_en)

await generate_en()

print(f"✅ Generated: {output_en}")
display(Audio(output_en))
files.download(output_en)

In [None]:
#@title 4. Batch Generate - Multiple Paragraphs (Edge TTS) { display-mode: "form" }
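# Install pydub before importing it
!pip install -q pydub

import os
import edge_tts
from pydub import AudioSegment
from IPython.display import Audio, display
from google.colab import files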

#@markdown ### Enter multiple paragraphs (separated by blank lines):
batch_text = """Chương một: Khởi đầu.

Ngày xưa, ở một vương quốc xa xôi, có một chàng trai trẻ tên là Minh. Minh luôn mơ ước được khám phá thế giới rộng lớn ngoài kia.

Một ngày nọ, Minh quyết định rời khỏi ngôi làng nhỏ của mình để bắt đầu cuộc phiêu lưu mới. Chàng mang theo một chiếc ba lô nhỏ và trái tim đầy hy vọng.""" #@param {type:"string"}

#@markdown ### Settings:
batch_voice = "vi-VN-HoaiMyNeural" #@param ["vi-VN-HoaiMyNeural", "vi-VN-NamMinhNeural", "en-US-GuyNeural", "en-US-JennyNeural"]
batch_output = "batch_output.mp3" #@param {type:"string"}
pause_ms = 800 #@param {type:"integer"}

paragraphs = [p.strip() for p in batch_text.split('\n\n') if p.strip()]
print(f"Found {len(paragraphs)} paragraphs\n")

os.makedirs('temp_audio', exist_ok=True)
audio_segments = []

for i, para in enumerate(paragraphs):
    print(f"[{i+1}/{len(paragraphs)}] {para[:50]}...")
    temp_file = f"temp_audio/para_{i}.mp3"

    # Synthesize this paragraph to its own temporary MP3
    await edge_tts.Communicate(para, batch_voice).save(temp_file)
    audio_segments.append(AudioSegment.from_mp3(temp_file))

# Combine with pauses
silence = AudioSegment.silent(duration=pause_ms)
combined = audio_segments[0]
for seg in audio_segments[1:]:
    combined += silence + seg

combined.export(batch_output, format="mp3")

print(f"\n✅ Combined: {batch_output}")
print(f"Duration: {len(combined)/1000:.1f} seconds")
display(Audio(batch_output))
files.download(batch_output)

# Cleanup
!rm -rf temp_audio

---

# 🅱️ Section B: viXTTS (Voice Cloning)

Clone your own voice or use a generated sample voice. Quality is higher than Edge TTS, and a GPU is strongly recommended: CPU inference works but is slow.

**⚠️ Requirements:**
- Runtime > Change runtime type > **T4 GPU**
- ~2GB download for model

In [None]:
#@title 1. Install viXTTS Dependencies { display-mode: "form" }
import sys

# Check Python version
py_version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python version: {py_version}")

# Install compatible TTS version
if sys.version_info >= (3, 12):
    print("Python 3.12+ detected, using latest TTS...")
    !pip install -q TTS
else:
    print("Installing TTS 0.22.0...")
    !pip install -q TTS==0.22.0

!pip install -q soundfile huggingface_hub edge-tts pydub

print("\n✅ Dependencies installed!")

In [None]:
#@title 2. Download viXTTS Model (~2GB) { display-mode: "form" }
from huggingface_hub import hf_hub_download
from pathlib import Path

MODEL_DIR = Path("/content/models/vixtts")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

model_files = ["config.json", "model.pth", "vocab.json"]

print("Downloading viXTTS model from capleaf/viXTTS...")
for filename in model_files:
    target = MODEL_DIR / filename
    if not target.exists():
        print(f"  📥 {filename}...")
        hf_hub_download(
            repo_id="capleaf/viXTTS",
            filename=filename,
            local_dir=str(MODEL_DIR),
            local_dir_use_symlinks=False
        )
    else:
        print(f"  ✓ {filename} (cached)")

print("\n✅ viXTTS model ready!")

In [None]:
#@title 3. Create Sample Voice (using Edge TTS) { display-mode: "form" }
#@markdown Generate a sample Vietnamese voice using Edge TTS.
#@markdown This will be used as the reference voice for cloning.
#@markdown **Skip this cell if you want to upload your own voice.**

import edge_tts
from pydub import AudioSegment
from pathlib import Path
from IPython.display import Audio, display

#@markdown ### Select voice type:
voice_type = "vietnamese_female" #@param ["vietnamese_female", "vietnamese_male", "english_female", "english_male"]

VOICE_MAP = {
    "vietnamese_female": ("vi-VN-HoaiMyNeural", "Xin chào các bạn, tôi là một trợ lý ảo thông minh. Tôi có thể giúp bạn đọc sách, kể chuyện, và nhiều điều thú vị khác. Hãy cùng khám phá thế giới của những câu chuyện tuyệt vời nhé."),
    "vietnamese_male": ("vi-VN-NamMinhNeural", "Xin chào các bạn, tôi là một trợ lý ảo thông minh. Tôi có thể giúp bạn đọc sách, kể chuyện, và nhiều điều thú vị khác. Hãy cùng khám phá thế giới của những câu chuyện tuyệt vời nhé."),
    "english_female": ("en-US-JennyNeural", "Hello everyone, I am an intelligent virtual assistant. I can help you read books, tell stories, and many other interesting things. Let us explore the world of wonderful stories together."),
    "english_male": ("en-US-GuyNeural", "Hello everyone, I am an intelligent virtual assistant. I can help you read books, tell stories, and many other interesting things. Let us explore the world of wonderful stories together."),
}

Path("samples").mkdir(exist_ok=True)
edge_voice, sample_text = VOICE_MAP[voice_type]

# Generate with Edge TTS (MP3) then convert to WAV
temp_mp3 = f"samples/{voice_type}_temp.mp3"
SPEAKER_WAV = f"samples/{voice_type}.wav"

print(f"Generating {voice_type} sample with Edge TTS...")

async def create_sample():
    communicate = edge_tts.Communicate(sample_text, edge_voice)
    await communicate.save(temp_mp3)

await create_sample()

# Convert MP3 to WAV (required by viXTTS)
audio = AudioSegment.from_mp3(temp_mp3)
audio = audio.set_frame_rate(22050).set_channels(1)  # Mono, 22kHz
audio.export(SPEAKER_WAV, format="wav")

# Clean up temp file
import os
os.remove(temp_mp3)

print(f"✅ Created sample voice: {SPEAKER_WAV}")
print(f"Duration: {len(audio)/1000:.1f} seconds")
print("\n🔊 Preview:")
display(Audio(SPEAKER_WAV))

In [None]:
#@title 4. OR Upload Your Own Voice Sample { display-mode: "form" }
#@markdown Upload 6-30 seconds of clear speech (WAV/MP3).
#@markdown **Skip this if you used the sample voice above.**

from google.colab import files
from IPython.display import Audio, display

print("Select your voice sample file:")
uploaded = files.upload()

if uploaded:
    SPEAKER_WAV = list(uploaded.keys())[0]
    print(f"\n✅ Using: {SPEAKER_WAV}")
    display(Audio(SPEAKER_WAV))
else:
    print("No file uploaded. Using previous sample voice.")

In [None]:
#@title 5. Load Model & Clone Voice { display-mode: "form" }
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading viXTTS model...")

config = XttsConfig()
config.load_json(str(MODEL_DIR / "config.json"))

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=str(MODEL_DIR / "model.pth"),
    vocab_path=str(MODEL_DIR / "vocab.json")
)

if torch.cuda.is_available():
    model.cuda()
    print(f"✅ Model loaded on GPU: {torch.cuda.get_device_name()}")
else:
    print("⚠️ Running on CPU (will be slow)")

# Clone voice
print(f"\n🎤 Cloning voice from: {SPEAKER_WAV}")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=SPEAKER_WAV)
print("✅ Voice cloned successfully!")

In [None]:
#@title 6. Generate Vietnamese Audio (viXTTS) { display-mode: "form" }
import soundfile as sf
from IPython.display import Audio, display
from google.colab import files

#@markdown ### Enter Vietnamese text (10+ words for best quality):
text_vi = "Xin chào các bạn, đây là giọng nói của tôi được tạo bằng trí tuệ nhân tạo. Công nghệ này cho phép clone giọng nói chỉ với một đoạn audio ngắn." #@param {type:"string"}

#@markdown ### Settings:
temperature = 0.7 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
output_file = "vixtts_output_vi.wav" #@param {type:"string"}

print(f"Generating: {text_vi[:50]}...\n")

out = model.inference(
    text_vi,
    "vi",
    gpt_cond_latent,
    speaker_embedding,
    temperature=temperature
)

sf.write(output_file, out["wav"], 24000)

print(f"✅ Generated: {output_file}")
print("\n🔊 Playback:")
display(Audio(output_file))

print("\n📥 Downloading...")
files.download(output_file)

In [None]:
#@title 7. Generate English Audio (viXTTS) { display-mode: "form" }

#@markdown ### Enter English text:
text_en = "Welcome to The Lost Chapter. This is my voice, cloned using artificial intelligence. The technology allows creating natural sounding speech from just a short audio sample." #@param {type:"string"}

#@markdown ### Settings:
temp_en = 0.7 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
output_en = "vixtts_output_en.wav" #@param {type:"string"}

print(f"Generating: {text_en[:50]}...\n")

out_en = model.inference(
    text_en,
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=temp_en
)

sf.write(output_en, out_en["wav"], 24000)

print(f"✅ Generated: {output_en}")
display(Audio(output_en))
files.download(output_en)

In [None]:
#@title 8. Batch Generate - Audiobook Chapter (viXTTS) { display-mode: "form" }
import numpy as np

#@markdown ### Enter multiple paragraphs (separated by blank lines):
batch_text = """Chương một: Hành trình bắt đầu.

Ngày xưa, ở một vương quốc xa xôi, có một chàng trai trẻ tên là Minh. Minh luôn mơ ước được khám phá thế giới rộng lớn bên ngoài ngôi làng nhỏ của mình.

Một ngày nọ, khi mặt trời vừa ló dạng, Minh quyết định lên đường. Chàng mang theo một chiếc ba lô nhỏ chứa đầy hy vọng và những giấc mơ chưa thành hiện thực.

Con đường phía trước dài và đầy thử thách, nhưng Minh không hề sợ hãi. Chàng biết rằng mỗi bước chân đều đưa mình đến gần hơn với số phận của chính mình.""" #@param {type:"string"}

#@markdown ### Settings:
batch_lang = "vi" #@param ["vi", "en"]
batch_output = "chapter_audio.wav" #@param {type:"string"}

paragraphs = [p.strip() for p in batch_text.split('\n\n') if p.strip()]
print(f"📖 Found {len(paragraphs)} paragraphs\n")

all_audio = []
silence = np.zeros(int(24000 * 0.7))  # 0.7s pause

for i, para in enumerate(paragraphs):
    print(f"[{i+1}/{len(paragraphs)}] {para[:45]}...")
    out = model.inference(
        para,
        batch_lang,
        gpt_cond_latent,
        speaker_embedding,
        temperature=0.7
    )
    all_audio.append(out["wav"])
    if i < len(paragraphs) - 1:
        all_audio.append(silence)

combined = np.concatenate(all_audio)
sf.write(batch_output, combined, 24000)

duration = len(combined) / 24000
print(f"\n✅ Generated: {batch_output}")
print(f"⏱️ Duration: {duration:.1f} seconds ({duration/60:.1f} minutes)")
display(Audio(batch_output))
files.download(batch_output)

---

## 📋 Tips & Troubleshooting

### Vietnamese Quality (viXTTS)
- Use sentences with **10+ words** for best results
- Shorter sentences may produce odd trailing sounds
- Temperature 0.6-0.8 works best
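
One way to follow the 10+ word guideline is to merge short sentences with their neighbours before sending text to the model. The sketch below is illustrative only: `merge_short_sentences` is a hypothetical helper, it counts whitespace-separated tokens as words (an assumption that roughly matches Vietnamese syllables), and it splits sentences on `.`, `!`, and `?`.

```python
import re

def merge_short_sentences(text, min_words=10):
    """Split text into sentences, then merge neighbours until every chunk
    has at least min_words tokens. A short tail joins the previous chunk."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current.split()) >= min_words:
            chunks.append(current)
            current = ""
    if current:
        if chunks:
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)  # whole text is shorter than min_words
    return chunks

demo = (
    "Hello there. This sentence is long enough to pass the ten word minimum easily. "
    "Short one. Another long sentence follows here with more than ten words in it. Bye."
)
chunks = merge_short_sentences(demo)
for chunk in chunks:
    print(len(chunk.split()), "words:", chunk)
```

Each resulting chunk could then be passed to `model.inference` in place of a raw paragraph.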

### Voice Sample Requirements
- **6-30 seconds** of clear speech
- No background noise/music
- Natural speaking pace
- WAV format preferred
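
The duration requirement can be checked before cloning. This is a minimal sketch using only the standard-library `wave` module, so it assumes an uncompressed PCM WAV file; MP3 uploads would need converting first. `wav_duration_seconds` and `is_usable_sample` are hypothetical helper names, and the demo file is generated silence rather than real speech.

```python
import os
import tempfile
import wave

def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_usable_sample(path, min_s=6.0, max_s=30.0):
    """True when the clip length falls inside the recommended 6-30 s window."""
    return min_s <= wav_duration_seconds(path) <= max_s

# Demo: write 8 seconds of 16-bit mono silence at 22050 Hz, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(22050)
    wav.writeframes(b"\x00\x00" * 22050 * 8)

print(wav_duration_seconds(demo_path), is_usable_sample(demo_path))
```

The check covers length only; background noise and speaking pace still need a listen.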

### Temperature Settings
| Value | Result |
|-------|--------|
| 0.3-0.5 | Very consistent, robotic |
| 0.6-0.7 | Natural, stable |
| 0.8-0.9 | Expressive, varied |
| 1.0+ | Unstable, experimental |

### Common Issues

**"No GPU available"**
→ Go to Runtime > Change runtime type > T4 GPU

**"CUDA out of memory"**
→ Runtime > Restart runtime, then run again

**Audio sounds robotic**
→ Increase temperature to 0.8
→ Use longer sentences

**Edge TTS not working**
→ Try a different voice
→ Check your internet connection
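
Because Edge TTS calls an online service, transient network errors can sometimes be handled by retrying. Below is a rough sketch of a retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, it catches every `Exception` for simplicity, and the `flaky` coroutine only simulates a failing request.

```python
import asyncio

async def with_retries(make_coro, attempts=3, base_delay=1.0):
    """Await make_coro() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_coro()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            await asyncio.sleep(delay)

# Simulated request that fails twice, then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"

result = asyncio.run(with_retries(flaky, base_delay=0.01))
print(result, "after", calls["n"], "calls")
```

Inside Colab an event loop is already running, so the call would be awaited directly rather than passed to `asyncio.run`, for example `await with_retries(lambda: edge_tts.Communicate(text, voice).save(output_file))`.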

### Supported Languages (viXTTS)
üáªüá≥ Vietnamese, üá∫üá∏ English, üá™üá∏ Spanish, üá´üá∑ French, üá©üá™ German, üáÆüáπ Italian, üáµüáπ Portuguese, üáµüá± Polish, üáπüá∑ Turkish, üá∑üá∫ Russian, üá≥üá± Dutch, üá®üáø Czech, üá∏üá¶ Arabic, üá®üá≥ Chinese, üáØüáµ Japanese, üá≠üá∫ Hungarian, üá∞üá∑ Korean, üáÆüá≥ Hindi

---

**TheLostChapter** | [GitHub](https://github.com/nmnhut-it/english-learning-app/tree/main/the-lost-chapter) | [viXTTS Demo](https://huggingface.co/spaces/thinhlpg/vixtts-demo)