# 🎧 TheLostChapter - Vietnamese Voice Cloning

Generate audiobook narration with **your cloned voice** in Vietnamese and English.

**Model:** [viXTTS](https://huggingface.co/capleaf/viXTTS) - Fine-tuned XTTS v2 for Vietnamese

## ⚠️ Requirements
1. Go to **Runtime → Change runtime type → T4 GPU**
2. Run cells 1-4 in order
3. Then use cells 5-7 to generate audio
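
Before running the cells, it can help to confirm that the GPU runtime is actually attached. The following is a minimal sketch, not part of the original notebook: `gpu_runtime_attached` is a hypothetical helper name, and it assumes that a working `nvidia-smi -L` call is a good enough signal that an NVIDIA GPU is visible.

```python
import subprocess

def gpu_runtime_attached() -> bool:
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # nvidia-smi is missing (CPU runtime) or did not respond in time.
        return False
    return result.returncode == 0 and "GPU" in result.stdout

if gpu_runtime_attached():
    print("GPU runtime detected.")
else:
    print("No GPU detected. Switch the runtime type to T4 GPU before continuing.")
```

Cell 4 performs its own check with `torch.cuda.is_available()`; this sketch only moves the check ahead of the installation and download steps.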

---

In [None]:
#@title 1. Install Dependencies { display-mode: "form" }
#@markdown Installs coqui-tts (Python 3.12+ compatible) and audio libraries

import sys
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")

# Install coqui-tts fork (supports Python 3.12+)
print("\n📦 Installing coqui-tts...")
!pip install -q coqui-tts

# Install audio dependencies
print("📦 Installing audio libraries...")
!pip install -q torchcodec soundfile huggingface_hub pydub

print("\n✅ All dependencies installed!")

In [None]:
#@title 2. Download viXTTS Model (~2GB) { display-mode: "form" }
from huggingface_hub import hf_hub_download
from pathlib import Path

MODEL_DIR = Path("/content/models/vixtts")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

model_files = ["config.json", "model.pth", "vocab.json"]

print("📥 Downloading viXTTS model from capleaf/viXTTS...\n")
for filename in model_files:
    target = MODEL_DIR / filename
    if not target.exists():
        print(f"  Downloading {filename}...")
        hf_hub_download(
            repo_id="capleaf/viXTTS",
            filename=filename,
            local_dir=str(MODEL_DIR),
            local_dir_use_symlinks=False
        )
    else:
        print(f"  ✓ {filename} (cached)")

print("\n✅ viXTTS model ready!")

In [None]:
#@title 3. Upload Your Voice Sample { display-mode: "form" }
#@markdown Upload **6-30 seconds** of clear speech (WAV or MP3).
#@markdown - No background noise or music
#@markdown - Natural speaking pace
#@markdown - Single speaker only

from google.colab import files
from pydub import AudioSegment
from IPython.display import Audio, display
from pathlib import Path
import os

Path("samples").mkdir(exist_ok=True)

print("📁 Select your voice sample file (WAV/MP3):\n")
uploaded = files.upload()

if uploaded:
    uploaded_file = list(uploaded.keys())[0]

    # Convert to WAV if needed (case-insensitive check, so ".MP3" is handled too)
    if uploaded_file.lower().endswith('.mp3'):
        print("\n🔄 Converting MP3 to WAV...")
        audio = AudioSegment.from_mp3(uploaded_file)
        SPEAKER_WAV = "samples/speaker.wav"
        # Downmix to mono at 22.05 kHz for the reference sample
        audio = audio.set_frame_rate(22050).set_channels(1)
        audio.export(SPEAKER_WAV, format="wav")
        os.remove(uploaded_file)
    else:
        # Anything else is assumed to be a WAV file and is moved as-is
        SPEAKER_WAV = f"samples/{uploaded_file}"
        os.rename(uploaded_file, SPEAKER_WAV)

    # Get duration (pydub reports length in milliseconds)
    audio = AudioSegment.from_wav(SPEAKER_WAV)
    duration = len(audio) / 1000

    print(f"\n✅ Voice sample ready: {SPEAKER_WAV}")
    print(f"⏱️ Duration: {duration:.1f} seconds")

    if duration < 6:
        print("\n⚠️ Warning: Sample is short. 6-30 seconds recommended for best quality.")
    elif duration > 30:
        print("\n⚠️ Warning: Sample is long. This may slow down processing.")

    print("\n🔊 Preview:")
    display(Audio(SPEAKER_WAV))
else:
    print("❌ No file uploaded. Please run this cell again.")

In [None]:
#@title 4. Load Model & Clone Voice { display-mode: "form" }
import torch
import re
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from TTS.tts.layers.xtts import tokenizer as xtts_tokenizer

# Patch tokenizer to support Vietnamese
print("🔧 Patching tokenizer for Vietnamese...")

_original_preprocess = xtts_tokenizer.VoiceBpeTokenizer.preprocess_text

def _patched_preprocess(self, txt, lang):
    """Patched to support Vietnamese language."""
    if lang == "vi":
        # Simple text cleaning for Vietnamese (Latin script)
        txt = txt.replace('"', '')
        txt = re.sub(r'\s+', ' ', txt)
        txt = txt.strip()
        return txt
    return _original_preprocess(self, txt, lang)

xtts_tokenizer.VoiceBpeTokenizer.preprocess_text = _patched_preprocess
print("✅ Vietnamese support enabled")

# Load model
print("\n🚀 Loading viXTTS model...")

config = XttsConfig()
config.load_json(str(MODEL_DIR / "config.json"))

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=str(MODEL_DIR / "model.pth"),
    vocab_path=str(MODEL_DIR / "vocab.json")
)

if torch.cuda.is_available():
    model.cuda()
    print(f"✅ Model loaded on GPU: {torch.cuda.get_device_name()}")
else:
    print("⚠️ Running on CPU (will be slow). Enable GPU in Runtime settings.")

# Clone voice
print(f"\n🎤 Cloning voice from: {SPEAKER_WAV}")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=SPEAKER_WAV)
print("\n✅ Voice cloned successfully! Ready to generate audio.")

---
## 🎙️ Generate Audio

Now you can generate audio with your cloned voice!

In [None]:
#@title 5. Generate Vietnamese Audio { display-mode: "form" }
import soundfile as sf
from IPython.display import Audio, display
from google.colab import files

#@markdown ### Enter Vietnamese text:
text_vi = "Xin chào các bạn, đây là giọng nói của tôi được tạo bằng trí tuệ nhân tạo. Công nghệ này cho phép clone giọng nói chỉ với một đoạn audio ngắn." #@param {type:"string"}

#@markdown ### Settings:
temperature = 0.7 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
output_file = "output_vi.wav" #@param {type:"string"}

print(f"📝 Text: {text_vi[:60]}...\n")
print("⏳ Generating...")

out = model.inference(
    text_vi,
    "vi",
    gpt_cond_latent,
    speaker_embedding,
    temperature=temperature
)

sf.write(output_file, out["wav"], 24000)

duration = len(out["wav"]) / 24000
print(f"\n✅ Generated: {output_file} ({duration:.1f}s)")
print("\n🔊 Playback:")
display(Audio(output_file))

print("\n📥 Downloading...")
files.download(output_file)

In [None]:
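#@title 6. Generate English Audio { display-mode: "form" }
# Imports repeated here so this cell also runs when cell 5 was skipped
import soundfile as sf
from IPython.display import Audio, display
from google.colab import files
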
#@markdown ### Enter English text:
text_en = "Welcome to The Lost Chapter. This is my voice, cloned using artificial intelligence. The technology allows creating natural sounding speech from just a short audio sample." #@param {type:"string"}

#@markdown ### Settings:
temperature_en = 0.7 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
output_en = "output_en.wav" #@param {type:"string"}

print(f"📝 Text: {text_en[:60]}...\n")
print("⏳ Generating...")

out_en = model.inference(
    text_en,
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=temperature_en
)

sf.write(output_en, out_en["wav"], 24000)

duration = len(out_en["wav"]) / 24000
print(f"\n✅ Generated: {output_en} ({duration:.1f}s)")
print("\n🔊 Playback:")
display(Audio(output_en))

print("\n📥 Downloading...")
files.download(output_en)

In [None]:
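#@title 7. Batch Generate - Audiobook Chapter { display-mode: "form" }
# Imports repeated here so this cell also runs when cells 5-6 were skipped
import numpy as np
import soundfile as sf
from IPython.display import Audio, display
from google.colab import files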

#@markdown ### Enter multiple paragraphs (separated by blank lines):
batch_text = """Chương một: Hành trình bắt đầu.

Ngày xưa, ở một vương quốc xa xôi, có một chàng trai trẻ tên là Minh. Minh luôn mơ ước được khám phá thế giới rộng lớn bên ngoài ngôi làng nhỏ của mình.

Một ngày nọ, khi mặt trời vừa ló dạng, Minh quyết định lên đường. Chàng mang theo một chiếc ba lô nhỏ chứa đầy hy vọng và những giấc mơ chưa thành hiện thực.

Con đường phía trước dài và đầy thử thách, nhưng Minh không hề sợ hãi. Chàng biết rằng mỗi bước chân đều đưa mình đến gần hơn với số phận của chính mình.""" #@param {type:"string"}

#@markdown ### Settings:
batch_lang = "vi" #@param ["vi", "en"]
batch_output = "chapter_audio.wav" #@param {type:"string"}
pause_seconds = 0.7 #@param {type:"slider", min:0.3, max:2.0, step:0.1}

paragraphs = [p.strip() for p in batch_text.split('\n\n') if p.strip()]
print(f"📖 Found {len(paragraphs)} paragraphs\n")

all_audio = []
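# Pause inserted between paragraphs: zeros at the 24 kHz output rate, stored as float32
silence = np.zeros(int(24000 * pause_seconds), dtype=np.float32)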

for i, para in enumerate(paragraphs):
    print(f"[{i+1}/{len(paragraphs)}] {para[:50]}...")
    out = model.inference(
        para,
        batch_lang,
        gpt_cond_latent,
        speaker_embedding,
        temperature=0.7
    )
    all_audio.append(out["wav"])
    if i < len(paragraphs) - 1:
        all_audio.append(silence)

combined = np.concatenate(all_audio)
sf.write(batch_output, combined, 24000)

duration = len(combined) / 24000
print(f"\n✅ Generated: {batch_output}")
print(f"⏱️ Duration: {duration:.1f}s ({duration/60:.1f} minutes)")
print("\n🔊 Playback:")
display(Audio(batch_output))

print("\n📥 Downloading...")
files.download(batch_output)

---

## 📋 Tips

### Best Practices
- Use sentences with **10+ words** for natural results
- Temperature **0.6-0.8** works best for most cases
- Longer voice samples (15-30s) produce better cloning
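
The 10+ word guideline can be applied automatically by merging short sentences before synthesis. The sketch below is an illustration, not part of the notebook: `merge_short_sentences` and its `min_words` parameter are hypothetical names. It splits on `.`, `!`, and `?`, and it counts whitespace-separated tokens, which for Vietnamese means syllables rather than full words.

```python
import re

def merge_short_sentences(text: str, min_words: int = 10) -> list[str]:
    """Split text into sentences, then merge neighbours until each chunk has at least min_words tokens."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Keep appending sentences until the running chunk is long enough.
        current = f"{current} {sentence}".strip()
        if len(current.split()) >= min_words:
            chunks.append(current)
            current = ""
    if current:
        # A short tail is attached to the previous chunk when one exists.
        if chunks:
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks
```

Each returned chunk could then be passed to `model.inference` in place of a raw paragraph, in the same loop that cell 7 uses.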

### Temperature Guide
| Value | Result |
|-------|--------|
| 0.3-0.5 | Consistent, slightly robotic |
| 0.6-0.7 | Natural, stable (recommended) |
| 0.8-0.9 | Expressive, more variation |
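
To hear how these ranges affect a particular cloned voice, the same sentence can be rendered once per temperature and the results compared. The following is a rough sketch under stated assumptions: `sweep_temperatures` and the `synthesize` callable are hypothetical, and the 0.1-1.0 bounds simply mirror the slider range used in cells 5 and 6.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")

def sweep_temperatures(
    synthesize: Callable[[str, float], T],
    text: str,
    temperatures: Iterable[float] = (0.4, 0.7, 0.9),
) -> Dict[float, T]:
    """Render the same text once per temperature and return the results keyed by temperature."""
    results: Dict[float, T] = {}
    for temperature in temperatures:
        # Reject values outside the slider range used elsewhere in the notebook.
        if not 0.1 <= temperature <= 1.0:
            raise ValueError(f"temperature {temperature} is outside the 0.1-1.0 range")
        results[temperature] = synthesize(text, temperature)
    return results
```

In this notebook, `synthesize` could wrap the existing call, for example `lambda text, temp: model.inference(text, "vi", gpt_cond_latent, speaker_embedding, temperature=temp)["wav"]`, with each result written out via `sf.write(..., 24000)` for listening.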

### Troubleshooting

| Issue | Solution |
|-------|----------|
| "No GPU available" | Runtime → Change runtime type → T4 GPU |
| "CUDA out of memory" | Runtime → Restart runtime, then run again |
| Audio sounds robotic | Increase temperature to 0.8 |
| Odd trailing sounds | Use longer sentences (10+ words) |

### Supported Languages
🇻🇳 Vietnamese, 🇺🇸 English, 🇪🇸 Spanish, 🇫🇷 French, 🇩🇪 German, 🇮🇹 Italian, 🇵🇹 Portuguese, 🇵🇱 Polish, 🇹🇷 Turkish, 🇷🇺 Russian, 🇳🇱 Dutch, 🇨🇿 Czech, 🇸🇦 Arabic, 🇨🇳 Chinese, 🇯🇵 Japanese, 🇭🇺 Hungarian, 🇰🇷 Korean, 🇮🇳 Hindi

---

**TheLostChapter** | [GitHub](https://github.com/nmnhut-it/english-learning-app/tree/main/the-lost-chapter) | [viXTTS Model](https://huggingface.co/capleaf/viXTTS)