
# 🎓 Educative Colab: Lightweight Text → Speech with Piper TTS (Hugging Face Voices)

Welcome! This hands-on notebook is for **beginners** who want to try an **AI model** that turns **text** into **speech**. We’ll use a lightweight engine called **Piper TTS** and small ready-made **voices** from Hugging Face. No GPU required.

## What you'll learn
- What an **AI model** is (at a high level) and how we **use** it (inference).
- How to **install** a minimal toolchain on Colab.
- How to **download** a small pre-trained **voice** (model weights).
- How to **synthesize** audio from text (Text → Speech).
- How to adjust **knobs** like **speed**, **variation**, and **pauses** to hear differences.
- A few **ML concepts** along the way, plus a short **glossary**.

> **Tip:** You don't need to understand everything at once. Run the cells top to bottom, listen to results, and then re-read the explanations.



## 🧠 Big picture (before we code)
**Goal:** Convert your text into a spoken waveform (**.wav** file).

**Pieces involved:**
1. **Model/Engine (Piper TTS):** The program that knows how to pronounce text given a specific **voice**.
2. **Voice file (ONNX model + config):** The learned parameters that capture a **speaker’s voice** and pronunciation style.
3. **Inputs:** Your **text** and a few **settings** (speed, variation, pauses).
4. **Output:** A **WAV** audio file.

```
Your Text  ──>  TTS Model + Voice  ──>  Audio (.wav)
```

We’ll use Piper because it's **lightweight** and **CPU-friendly**, ideal for Colab beginners.


In [None]:

#@title 🔧 Step 1 — Install minimal tools (1–2 minutes)
# This installs:
# - piper-tts: the TTS engine with a command-line interface (CLI)
# - huggingface_hub: lets us download small voice files from Hugging Face
# - soundfile + numpy: to read/write wav files and do small math if needed

!pip -q install --upgrade pip
!pip -q install piper-tts huggingface_hub soundfile numpy

print("✅ Install finished")


In [None]:

#@title 🧩 Step 2 — Imports & quick environment check
import os, shutil, json, subprocess, sys
import numpy as np
import soundfile as sf
from huggingface_hub import hf_hub_download
from IPython.display import Audio, display

# Check that 'piper' CLI exists (should be installed by piper-tts).
piper_cmd = shutil.which("piper")
if piper_cmd is None:
    # Fallback: try calling via module
    piper_cmd = [sys.executable, "-m", "piper"]
else:
    piper_cmd = [piper_cmd]

print("Using Piper command:", piper_cmd)
print("✅ Environment looks good!")



## 🗣️ Step 3 — Pick a **voice**
We’ll start with three small English voices (roughly 63 MB each, 16 kHz audio). You can change voices later.

- `en_US-amy-low` (female, US)
- `en_US-ryan-low` (male, US)
- `en_US-kathleen-low` (female, US)

Each voice consists of **two files**:
- `*.onnx` — the model weights in ONNX format (portable, fast on CPU)
- `*.onnx.json` — config with sample rate and phoneme info

We'll download them from the Hugging Face repository `rhasspy/piper-voices`.
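The repository paths follow the voice name, so they can be derived rather than typed by hand. The sketch below assumes the `<language>_<REGION>-<name>-<quality>` naming pattern seen in the three voices above also holds for whichever voice you pick. `voice_file_paths` is a hypothetical helper, not part of Piper or `huggingface_hub`.

```python
def voice_file_paths(voice_key: str) -> tuple[str, str]:
    """Return repo-relative paths for a voice's .onnx and .onnx.json files."""
    # "en_US-amy-low" -> locale "en_US", name "amy", quality "low"
    locale, rest = voice_key.split("-", 1)
    name, quality = rest.rsplit("-", 1)
    language = locale.split("_", 1)[0]  # "en_US" -> "en"
    subdir = f"{language}/{locale}/{name}/{quality}"
    return f"{subdir}/{voice_key}.onnx", f"{subdir}/{voice_key}.onnx.json"


onnx_file, config_file = voice_file_paths("en_US-amy-low")
print(onnx_file)
print(config_file)
```

If the pattern does not hold for a voice, the download will fail with a "file not found" style error, so browsing the repository remains the reliable way to confirm a path.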


In [None]:

#@title ⬇️ Download a voice from Hugging Face (runs once per voice)
VOICE_REPO = "rhasspy/piper-voices"
VOICE_CHOICES = {
    "en_US-amy-low":       "en/en_US/amy/low",
    "en_US-ryan-low":      "en/en_US/ryan/low",
    "en_US-kathleen-low":  "en/en_US/kathleen/low",
}

voice_key = "en_US-amy-low" #@param ["en_US-amy-low", "en_US-ryan-low", "en_US-kathleen-low"]

def download_voice(voice_key: str):
    subdir = VOICE_CHOICES[voice_key]
    onnx_path = hf_hub_download(VOICE_REPO, filename=f"{subdir}/{voice_key}.onnx")
    cfg_path  = hf_hub_download(VOICE_REPO, filename=f"{subdir}/{voice_key}.onnx.json")
    return onnx_path, cfg_path

onnx_path, cfg_path = download_voice(voice_key)
print("✅ Voice files ready:")
print("ONNX :", onnx_path)
print("Config:", cfg_path)

# Read and show a couple config details (educational)
with open(cfg_path, "r") as f:
    cfg = json.load(f)
# Piper voice configs store the sample rate under the "audio" key.
sr = int(cfg.get("audio", {}).get("sample_rate", 16000))
num_symbols = len(cfg.get("phoneme_id_map", {}))
print(f"Sample rate: {sr} Hz | Phoneme symbols: ~{num_symbols}")



## ▶️ Step 4 — Quick start: Make your **first audio**
Just run this once to hear a sample. We call the Piper CLI and save a WAV file. Then we play it inline.
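The general pattern here (run an external program, capture its output, and check its return code) can be tried without Piper. In this sketch the Python interpreter itself stands in for the TTS program. The `run_tool` helper and the two tiny child scripts are illustrative only.

```python
import subprocess
import sys


def run_tool(cmd: list[str], stdin_text: str) -> str:
    """Run a command, feed it text on stdin, and return its stdout."""
    proc = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    if proc.returncode != 0:
        # A non-zero return code signals failure; stderr usually says why.
        raise RuntimeError(f"Command failed: {proc.stderr.strip()}")
    return proc.stdout


# Stand-in "tool": a one-line Python script that upper-cases its input.
echo_upper = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
print(run_tool(echo_upper, "hello piper"))

# Stand-in for a failing tool: exits with code 1 and a message on stderr.
broken = [sys.executable, "-c", "import sys; sys.exit('no such voice')"]
try:
    run_tool(broken, "")
except RuntimeError as err:
    failure = str(err)
    print(failure)
```

Passing the command as a list (rather than one shell string) means the text and file paths never need shell quoting.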


In [None]:

#@title ▶️ Quick start: synthesize a line of speech
text = "Hello! This is a beginner-friendly text-to-speech demo using Piper." #@param {type:"string"}
out_wav = "hello_demo.wav" #@param {type:"string"}

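# Piper's CLI reads the text to speak from standard input,
# so the text is passed with `input=` rather than as a flag.
cmd = piper_cmd + [
    "--model", onnx_path,
    "--config", cfg_path,
    "--output_file", out_wav,
]

proc = subprocess.run(cmd, input=text, capture_output=True, text=True)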
if proc.returncode != 0:
    print(proc.stderr)
    raise RuntimeError("Synthesis failed. See error above.")

audio, sr = sf.read(out_wav, dtype="float32")
display(Audio(filename=out_wav, rate=sr))
print(f"✅ Saved: {out_wav}  |  Sample rate: {sr} Hz  |  Duration: {len(audio)/sr:.2f} s")



## 🎛️ Step 5 — Understand the **knobs**
These parameters affect how speech sounds. Try small changes and re-run to hear the difference.

- **`length_scale`** — **Speaking speed**. Higher than `1.0` → slower; lower than `1.0` → faster.  
- **`noise_scale`** — **Global variation** (prosody randomness). Lower values → more monotone; higher values → more expressive, but too high can sound unstable.
- **`noise_w`** — **Phoneme duration variation** (how much the length of individual sounds varies, which affects rhythm). Moderate values are usually best.
- **`sentence_silence`** — A **pause** (seconds) automatically added after each sentence.
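As a rough mental model (an assumption for illustration, not Piper's internal computation), speech time scales linearly with `length_scale`, and each sentence adds one pause of `sentence_silence` seconds. The helper `estimate_duration` below is hypothetical.

```python
def estimate_duration(
    base_speech_seconds: float,
    num_sentences: int,
    length_scale: float = 1.0,
    sentence_silence: float = 0.2,
) -> float:
    """Rough estimate: scaled speech time plus one pause per sentence."""
    return base_speech_seconds * length_scale + num_sentences * sentence_silence


# Two sentences that take about 4.0 s of speech at normal speed:
for scale in (0.9, 1.0, 1.2):
    print(scale, round(estimate_duration(4.0, 2, length_scale=scale), 2))
```

Real durations will differ somewhat, because the noise settings also change timing from run to run.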

> Educational idea: Change **one** knob at a time and listen. Note how your brain perceives speed vs. expressiveness.
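One way to keep such experiments organized is to generate the settings up front, changing a single knob per run. The sketch below only builds the list of settings and output file names. The default values mirror the ones used in this notebook, and `one_knob_sweeps` is a hypothetical helper.

```python
DEFAULTS = {
    "length_scale": 1.0,
    "noise_scale": 0.667,
    "noise_w": 0.8,
    "sentence_silence": 0.2,
}


def one_knob_sweeps(trials: dict[str, list[float]]) -> list[tuple[str, dict[str, float]]]:
    """Return (output_name, settings) pairs that vary one knob at a time."""
    runs = []
    for knob, values in trials.items():
        for value in values:
            settings = {**DEFAULTS, knob: value}  # all other knobs stay at their defaults
            runs.append((f"{knob}_{value}.wav", settings))
    return runs


runs = one_knob_sweeps({"length_scale": [0.9, 1.2], "noise_scale": [0.3, 0.8]})
for name, settings in runs:
    print(name, settings)
```

Each `settings` dictionary can then be unpacked as keyword arguments into a synthesis function that accepts these four parameters, such as the `tts_piper` helper defined in the next cell.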


In [None]:

#@title 🛠️ Helper: a small TTS function using Piper
def tts_piper(
    text: str,
    model_path: str,
    config_path: str,
    length_scale: float = 1.0,
    noise_scale: float = 0.667,
    noise_w: float = 0.8,
    sentence_silence: float = 0.2,
    out_wav: str = "tts_out.wav",
) -> str:
    """Synthesize `text` with the Piper CLI and return the path of the WAV file written."""
    # Piper's CLI reads the text to speak from standard input,
    # so the text is passed with `input=` rather than as a flag.
    cmd = piper_cmd + [
        "--model", model_path,
        "--config", config_path,
        "--length_scale", str(length_scale),
        "--noise_scale", str(noise_scale),
        "--noise_w", str(noise_w),
        "--sentence_silence", str(sentence_silence),
        "--output_file", out_wav,
    ]
    proc = subprocess.run(cmd, input=text, capture_output=True, text=True)
    if proc.returncode != 0:
        print(proc.stderr)
        raise RuntimeError("Synthesis failed")
    return out_wav

In [None]:

#@title 🎚️ Try different settings
user_text = "The quick brown fox jumps over the lazy dog. Piper makes lightweight TTS easy." #@param {type:"string"}
length_scale = 1.0     #@param {type:"number"}
noise_scale = 0.667    #@param {type:"number"}
noise_w = 0.8          #@param {type:"number"}
sentence_silence = 0.2 #@param {type:"number"}

out_wav = tts_piper(
    text=user_text,
    model_path=onnx_path,
    config_path=cfg_path,
    length_scale=length_scale,
    noise_scale=noise_scale,
    noise_w=noise_w,
    sentence_silence=sentence_silence,
    out_wav="knobs_demo.wav",
)
audio, sr = sf.read(out_wav, dtype="float32")
display(Audio(filename=out_wav, rate=sr))
print(f"✅ Saved: {out_wav} | length_scale={length_scale} | noise_scale={noise_scale} | noise_w={noise_w} | sentence_silence={sentence_silence}")



## 🔄 Step 6 — Switch voices
Try a different voice to hear how **timbre** and **pronunciation** (the voice's identity) change even when the knobs stay the same.


In [None]:

#@title 🔁 Download & use another voice
new_voice = "en_US-ryan-low" #@param ["en_US-amy-low", "en_US-ryan-low", "en_US-kathleen-low"]
onnx2, cfg2 = download_voice(new_voice)

sample_text = "Comparing voices helps you hear the difference between identity and settings."
out_wav2 = tts_piper(
    text=sample_text,
    model_path=onnx2,
    config_path=cfg2,
    length_scale=1.0,
    noise_scale=0.667,
    noise_w=0.8,
    sentence_silence=0.2,
    out_wav="voice_switch.wav",
)
# No `rate` argument needed: when playing from a file, the WAV header supplies this voice's sample rate.
display(Audio(filename=out_wav2))
print("✅ Switched voice:", new_voice, "| File:", out_wav2)



## 🧪 Step 7 — Mini-exercises
Try these to practice and reflect:

1. **Speed vs. intelligibility:** Try `length_scale = 0.9`, `1.0`, `1.2`. Which sounds most natural?
2. **Expressiveness:** Set `noise_scale = 0.3`, then `0.8`. How do pitch/intonation feel?
3. **Phoneme variation:** Try `noise_w = 0.5` vs `1.0`. Listen to how the length and rhythm of individual consonants and vowels change.
4. **Pacing:** Increase `sentence_silence` to `0.6` for narration or long sentences.
5. **Voice identity:** Generate the same sentence with two voices; which timbre do you prefer and why?
6. **Batch narration (bonus):** Split a long paragraph into sentences; synthesize each, then concatenate the WAVs using `numpy.concatenate` and `soundfile.write`.
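For the bonus exercise, the two moving parts are splitting text into sentences and joining audio. The sketch below shows both under stated assumptions. The regex splitter is deliberately naive: it breaks on `.`, `!`, or `?` followed by whitespace and will mis-split abbreviations. The joining step uses Python's standard-library `wave` module instead of `numpy.concatenate` and `soundfile.write`, purely so it runs without extra packages, and it requires every input WAV to share the same channel count, sample width, and sample rate. `split_sentences` and `concat_wavs` are hypothetical helper names.

```python
import re
import wave


def split_sentences(paragraph: str) -> list[str]:
    """Naively split on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def concat_wavs(paths: list[str], out_path: str) -> int:
    """Append the frames of each WAV in order; return total frames written."""
    if not paths:
        raise ValueError("need at least one input file")
    total = 0
    fmt = None
    with wave.open(out_path, "wb") as out:
        for path in paths:
            with wave.open(path, "rb") as src:
                this_fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if fmt is None:
                    # Copy channels, sample width, and sample rate from the first file.
                    fmt = this_fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif this_fmt != fmt:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(src.readframes(src.getnframes()))
                total += src.getnframes()
    return total


print(split_sentences("Piper is small. It runs on CPU! Ready?"))
```

The numpy and soundfile route has the same shape: read each file into an array, check that the sample rates match, concatenate, and write once.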



## 🛠️ Troubleshooting
- **`piper: not found`** — Re-run the **Install** cell. If it still fails, restart the runtime and try again.
- **No audio / silent output** — Check that the **text** contains no unsupported characters, then retry with the default knobs.
- **Clipping / distortion** — Lower any external gain you’re applying or keep defaults.
- **Slow runtime** — Piper runs on CPU; short texts are recommended for quick tests.
- **Permissions/network** — Voice downloads need internet access from Colab.
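To check whether a file is actually clipping rather than judging by ear, you can measure its peak level. This sketch assumes a 16-bit PCM WAV (an assumption worth verifying for your own files) and uses only the standard library. `peak_level` is a hypothetical helper, and the demo tone exists only to give it something to measure.

```python
import math
import os
import struct
import tempfile
import wave


def peak_level(path: str) -> float:
    """Return the peak absolute sample of a 16-bit PCM WAV, scaled to 0.0-1.0."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    # WAV samples are little-endian signed 16-bit integers.
    peak = max((abs(s) for (s,) in struct.iter_unpack("<h", frames)), default=0)
    return peak / 32768


# Demo: write one second of a 440 Hz tone at half of full scale, then measure it.
demo_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
samples = [int(16384 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000)]
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))

demo_peak = peak_level(demo_path)
print(f"peak = {demo_peak:.2f}")
```

A peak close to 1.0 suggests the signal is hitting the maximum value and may be clipped; a peak near 0.0 points to a silent or nearly silent file.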



## 📖 Glossary (beginner-friendly)
- **Model**: A learned mathematical function. Here, it converts text into audio frames (speech).
- **Pretrained**: The model was trained earlier on lots of data; we are only **using** it (inference), not training.
- **Inference**: Running the model to produce outputs from inputs (e.g., text → speech). No learning happens here.
- **Weights / Checkpoint**: The parameters the model learned during training (stored in files like `.onnx`).
- **ONNX**: An open file format for trained models that lets them run efficiently on many platforms and runtimes.
- **Sample rate (Hz)**: How many audio samples are stored per second (e.g., 16,000). Higher rates can represent higher frequencies, so they capture more detail.
- **Prosody**: Rhythm, stress, and intonation patterns in speech.
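The sample rate ties together three quantities you will meet in this notebook: the number of samples, the duration, and the highest frequency a file can represent (half the sample rate, known as the Nyquist limit). Here is a short worked example, assuming 16-bit mono audio for the size estimate. The 22,050 Hz rate is included only as a common comparison point, and `audio_facts` is a hypothetical helper.

```python
def audio_facts(num_samples: int, sample_rate: int, bytes_per_sample: int = 2) -> dict:
    """Derive duration, Nyquist frequency, and raw data size for mono audio."""
    return {
        "duration_s": num_samples / sample_rate,
        "max_frequency_hz": sample_rate / 2,           # Nyquist limit
        "data_bytes": num_samples * bytes_per_sample,  # excludes the WAV header
    }


for rate in (16000, 22050):
    # Three seconds of audio at each rate
    print(rate, audio_facts(num_samples=3 * rate, sample_rate=rate))
```

The same three seconds need more samples, and therefore more bytes, at the higher rate.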



## ⏭️ Next steps
- Try more voices from the same repository (there are many languages).
- Script a simple **batch** to turn a paragraph into narration.
- Explore **higher-tier** voices (e.g., *medium* or *high*) if you want a higher sample rate; some of them are larger downloads and take longer to synthesize.

## ⚖️ Use responsibly
Generated voices can sound convincing. Always:
- Respect **licenses** for voices and code.
- Disclose synthetic audio when appropriate.
- Avoid imitating real individuals without permission.
