# 🎙️ F5-TTS Worker: Voice Cloning on Colab T4

This notebook uses F5-TTS for high-quality, natural-sounding narration with voice cloning.
Provide a 10-15s reference audio clip and F5-TTS will generate all narration in that voice.

**Advantages over Kokoro:**
- Voice cloning from a short reference clip
- More natural prosody, emphasis, and pacing
- GPU-bound model: a T4 provides a real speedup (real-time factor, RTF, of ~0.3-0.5)

**Setup:** Runtime → Change runtime type → T4 GPU

## 1. Install Dependencies

In [None]:
!pip install -q f5-tts soundfile
# Verify GPU
!nvidia-smi

## 2. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 3. Configure Paths

Job directory structure on Google Drive:
```
My Drive/
  autonomous-recording/
    f5-tts-jobs/          ← separate from Kokoro jobs
      <job-id>/
        request.json      ← local machine writes this
        ref_audio.wav     ← reference voice clip (copied from settings)
        audio/            ← worker writes WAVs here
        done.marker       ← worker writes when complete
    voice-refs/           ← store your reference voice clips here
      teacher-voice.wav
```

In [None]:
import os

DRIVE_BASE = "/content/drive/MyDrive/autonomous-recording/f5-tts-jobs"
VOICE_REFS_DIR = "/content/drive/MyDrive/autonomous-recording/voice-refs"

os.makedirs(DRIVE_BASE, exist_ok=True)
os.makedirs(VOICE_REFS_DIR, exist_ok=True)

print(f"Job directory: {DRIVE_BASE}")
print(f"Voice refs directory: {VOICE_REFS_DIR}")

# List existing voice references
refs = [f for f in os.listdir(VOICE_REFS_DIR) if f.endswith('.wav')] if os.path.exists(VOICE_REFS_DIR) else []
if refs:
    print(f"\nAvailable voice references: {refs}")
else:
    print(f"\n⚠️  No voice references found in {VOICE_REFS_DIR}")
    print("Upload a 10-15s WAV clip of the target voice to that directory.")
    print("The default F5-TTS reference voice will be used as fallback.")
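
The directory layout above implies a small state machine for each job directory: a `request.json` with no marker is pending, `done.marker` means finished, and `error.marker` (written by the job processor in section 7 when a job fails) means failed. A minimal sketch of that classification follows. `job_state` is a hypothetical helper for inspecting jobs by hand; it mirrors the worker's checks but is not part of the worker itself.

```python
import os


def job_state(job_dir: str) -> str:
    """Classify a job directory by the marker files it contains.

    Hypothetical helper for illustration only; the worker does not call it.
    """
    if os.path.exists(os.path.join(job_dir, "done.marker")):
        return "done"
    if os.path.exists(os.path.join(job_dir, "error.marker")):
        return "error"
    if os.path.exists(os.path.join(job_dir, "request.json")):
        return "pending"
    return "empty"
```

Calling `job_state(os.path.join(DRIVE_BASE, name))` for each entry in `DRIVE_BASE` gives a quick overview of what the watcher will pick up next.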

## 4. Verify GPU + PyTorch

In [None]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️  No GPU detected. Check Runtime → Change runtime type → T4 GPU")

## 5. Initialize F5-TTS Model

The first load downloads the model (~1.2 GB). Subsequent runs use the cached copy.

In [None]:
import time
from f5_tts.api import F5TTS

print("Loading F5-TTS model (first run downloads ~1.2 GB)...")
t0 = time.time()
f5tts = F5TTS(model_type="F5-TTS", ckpt_file="", device=None)  # auto-detect GPU
print(f"✓ Model loaded in {time.time() - t0:.1f}s")

# Warm up
print("Warming up GPU...")
t0 = time.time()
_ = f5tts.infer(
    ref_file="",  # uses built-in default reference
    ref_text="",
    gen_text="Hello, this is a warm up sentence for the GPU.",
    seed=42,
)
print(f"✓ Warm-up done in {time.time() - t0:.2f}s")

## 6. Upload a Voice Reference (Optional)

Upload a 10-15 second WAV clip of the voice you want to clone.
Place it in `My Drive/autonomous-recording/voice-refs/`.

**Tips for good reference audio:**
- 10-15 seconds of clear speech, no background noise
- Normal speaking pace (not too fast, not too slow)
- Conversational tone matching your tutorial style
- WAV format, sample rate of 16 kHz or higher
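
The length and sample-rate tips above can be checked mechanically before a clip is used. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV. `check_reference_wav` and its default thresholds are hypothetical, chosen to match the tips rather than any requirement enforced by F5-TTS.

```python
import wave


def check_reference_wav(path: str, min_seconds: float = 10.0,
                        max_seconds: float = 15.0, min_rate: int = 16000) -> list:
    """Return a list of problems with the clip; an empty list means it passes.

    Assumes an uncompressed PCM WAV, which is what the stdlib `wave`
    module can read. Thresholds mirror the tips, not an F5-TTS rule.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        seconds = wf.getnframes() / rate
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if seconds < min_seconds:
        problems.append(f"clip is {seconds:.1f}s, shorter than {min_seconds:.0f}s")
    if seconds > max_seconds:
        problems.append(f"clip is {seconds:.1f}s, longer than {max_seconds:.0f}s")
    return problems
```

Background noise and speaking pace still need a listen; this only covers the properties that can be read from the file header.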

In [None]:
# You can also upload directly from your local machine:
# from google.colab import files
# uploaded = files.upload()  # opens file picker
# for name, data in uploaded.items():
#     dest = os.path.join(VOICE_REFS_DIR, name)
#     with open(dest, 'wb') as f:
#         f.write(data)
#     print(f"Saved {name} to {dest}")

# List available references
refs = [f for f in os.listdir(VOICE_REFS_DIR) if f.endswith('.wav')] if os.path.exists(VOICE_REFS_DIR) else []
print(f"Voice references: {refs if refs else 'none (will use F5-TTS default)'}")

## 7. F5-TTS Job Processor

In [None]:
import json
import soundfile as sf
import tempfile
import numpy as np


def process_f5_tts_job(job_dir: str) -> dict:
    """Process an F5-TTS job from a directory on Google Drive.

    request.json format:
    {
        "ref_audio": "teacher-voice.wav",     # filename in voice-refs/ or path
        "ref_text": "Transcription of the reference audio.",
        "speed": 1.0,
        "seed": 42,
        "nfe_step": 32,
        "steps": [
            {"id": "step-01", "narration": "Text to synthesize..."},
            ...
        ]
    }
    """
    request_path = os.path.join(job_dir, "request.json")
    audio_dir = os.path.join(job_dir, "audio")
    done_marker = os.path.join(job_dir, "done.marker")
    error_marker = os.path.join(job_dir, "error.marker")

    if os.path.exists(done_marker):
        return {"status": "already_done", "job_dir": job_dir}

    if not os.path.exists(request_path):
        return {"status": "no_request", "job_dir": job_dir}

    os.makedirs(audio_dir, exist_ok=True)

    with open(request_path, "r") as f:
        request = json.load(f)

    # Resolve reference audio
    ref_audio_name = request.get("ref_audio", "")
    ref_text = request.get("ref_text", "")
    speed = float(request.get("speed", 1.0))
    seed = request.get("seed", None)
    nfe_step = int(request.get("nfe_step", 32))
    steps = request.get("steps", [])

    # Find reference audio file
    ref_file = ""
    if ref_audio_name:
        # Check job directory first (uploaded with job)
        job_ref = os.path.join(job_dir, ref_audio_name)
        if os.path.exists(job_ref):
            ref_file = job_ref
        else:
            # Check voice-refs directory
            refs_ref = os.path.join(VOICE_REFS_DIR, ref_audio_name)
            if os.path.exists(refs_ref):
                ref_file = refs_ref
            else:
                print(f"⚠️  Reference audio '{ref_audio_name}' not found, using F5-TTS default")

    results = []
    total_duration = 0.0

    print(f"\n{'='*60}")
    print(f"Processing F5-TTS job: {os.path.basename(job_dir)}")
    print(f"Ref audio: {ref_file or '(F5-TTS default)'} | Speed: {speed} | Steps: {len(steps)}")
    print(f"NFE steps: {nfe_step} | Seed: {seed if seed is not None else 'random'}")
    print(f"{'='*60}")

    try:
        for idx, step in enumerate(steps, 1):
            step_id = str(step["id"])
            narration = str(step["narration"]).strip()
            wav_path = os.path.join(audio_dir, f"step-{step_id}.wav")

            # Skip if already generated
            if os.path.exists(wav_path) and os.path.getsize(wav_path) > 0:
                data, sr = sf.read(wav_path)
                duration = len(data) / sr
                print(f"  [{idx}/{len(steps)}] ♻ Reused {os.path.basename(wav_path)} ({duration:.2f}s)")
                results.append({"id": step_id, "duration": duration, "reused": True})
                total_duration += duration
                continue

            t0 = time.time()
            wav, sample_rate, _ = f5tts.infer(
                ref_file=ref_file,
                ref_text=ref_text,
                gen_text=narration,
                nfe_step=nfe_step,
                speed=speed,
                seed=seed,
            )
            elapsed = time.time() - t0
            duration = len(wav) / sample_rate

            # Write atomically
            tmp_fd, tmp_path = tempfile.mkstemp(suffix=".wav", dir=audio_dir)
            os.close(tmp_fd)
            sf.write(tmp_path, wav, sample_rate)
            os.replace(tmp_path, wav_path)

            rtf = elapsed / duration if duration > 0 else 0
            print(f"  [{idx}/{len(steps)}] ✓ {os.path.basename(wav_path)} ({duration:.2f}s audio, {elapsed:.2f}s gen, RTF={rtf:.2f})")
            results.append({"id": step_id, "duration": duration, "gen_time": elapsed})
            total_duration += duration

        # Write completion marker
        completion = {
            "status": "completed",
            "engine": "f5-tts",
            "total_duration": total_duration,
            "steps_generated": len(results),
            "results": results,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
        with open(done_marker, "w") as f:
            json.dump(completion, f, indent=2)

        print(f"\n✓ Job complete: {len(results)} steps, {total_duration:.2f}s total audio")
        return completion

    except Exception as e:
        error_info = {"status": "error", "error": str(e), "step": locals().get("idx", -1)}
        with open(error_marker, "w") as f:
            json.dump(error_info, f, indent=2)
        print(f"\n✗ Job failed: {e}")
        import traceback
        traceback.print_exc()
        return error_info
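
Because the processor reads `request.json` fields directly, a malformed request only surfaces as an exception partway through a job. A minimal pre-flight check is sketched below. `validate_request` is a hypothetical helper, and its rules are assumptions layered on the format in the docstring above: it rejects an empty `steps` list (which the processor would accept) and repeated ids, since each step's output file name is derived from its id.

```python
def validate_request(request: dict) -> list:
    """Return a list of problems found in a parsed request.json.

    Hypothetical helper; an empty list means the request looks usable.
    """
    problems = []
    steps = request.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("'steps' must be a non-empty list")
        steps = []
    seen = set()
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"steps[{i}] must be an object")
            continue
        step_id = step.get("id")
        narration = step.get("narration")
        if step_id is None or str(step_id) == "":
            problems.append(f"steps[{i}] is missing 'id'")
        elif str(step_id) in seen:
            # Repeated ids map to the same WAV path, so later steps
            # would be treated as already generated.
            problems.append(f"steps[{i}] repeats id '{step_id}'")
        else:
            seen.add(str(step_id))
        if not isinstance(narration, str) or not narration.strip():
            problems.append(f"steps[{i}] has empty 'narration'")
    return problems
```

One way to use it would be to call it right after `json.load` and write the problems to `error.marker` instead of starting synthesis.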

## 8. Job Watcher Loop

Polls the Drive job directory for new F5-TTS requests.

**To stop:** Interrupt the cell (⬛ stop button).

In [None]:
import datetime

POLL_INTERVAL = 5  # seconds


def watch_for_jobs():
    """Watch for new F5-TTS jobs."""
    print(f"👀 Watching for F5-TTS jobs in: {DRIVE_BASE}")
    print(f"   Poll interval: {POLL_INTERVAL}s")
    print(f"   Press ⬛ to stop\n")

    processed = set()

    # Skip already-completed jobs
    if os.path.exists(DRIVE_BASE):
        for name in os.listdir(DRIVE_BASE):
            job_dir = os.path.join(DRIVE_BASE, name)
            if os.path.isdir(job_dir):
                done = os.path.join(job_dir, "done.marker")
                error = os.path.join(job_dir, "error.marker")
                if os.path.exists(done) or os.path.exists(error):
                    processed.add(name)

    print(f"   Skipping {len(processed)} already-processed job(s)")

    while True:
        try:
            if not os.path.exists(DRIVE_BASE):
                time.sleep(POLL_INTERVAL)
                continue

            for name in sorted(os.listdir(DRIVE_BASE)):
                if name in processed:
                    continue

                job_dir = os.path.join(DRIVE_BASE, name)
                if not os.path.isdir(job_dir):
                    continue

                request_path = os.path.join(job_dir, "request.json")
                if not os.path.exists(request_path):
                    continue

                now = datetime.datetime.now().strftime("%H:%M:%S")
                print(f"\n[{now}] 📋 New F5-TTS job detected: {name}")

                result = process_f5_tts_job(job_dir)
                processed.add(name)

                now = datetime.datetime.now().strftime("%H:%M:%S")
                print(f"[{now}] ✓ Job {name} → {result.get('status', 'unknown')}")

            time.sleep(POLL_INTERVAL)

        except KeyboardInterrupt:
            print("\n\n🛑 Watcher stopped.")
            break


watch_for_jobs()
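
The watcher is only half of the protocol: the local machine writes `request.json` and then waits for a marker file. A rough sketch of that client side follows. `submit_job` and `wait_for_job` are hypothetical names, and the code assumes the Drive folder is visible locally as an ordinary synced path. The request is written to a temporary file and renamed so a local reader never sees a partial file; whether that stays atomic through Drive sync is not something this sketch can guarantee.

```python
import json
import os
import tempfile
import time


def submit_job(jobs_root: str, job_id: str, request: dict) -> str:
    """Create <jobs_root>/<job_id>/request.json and return the job directory."""
    job_dir = os.path.join(jobs_root, job_id)
    os.makedirs(job_dir, exist_ok=True)
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=job_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(request, f, indent=2)
    os.replace(tmp_path, os.path.join(job_dir, "request.json"))
    return job_dir


def wait_for_job(job_dir: str, timeout: float = 600.0,
                 poll_interval: float = 5.0) -> dict:
    """Poll until done.marker or error.marker appears, then return its JSON."""
    deadline = time.monotonic() + timeout
    while True:
        for marker in ("done.marker", "error.marker"):
            path = os.path.join(job_dir, marker)
            if os.path.exists(path):
                try:
                    with open(path, "r") as f:
                        return json.load(f)
                except json.JSONDecodeError:
                    pass  # marker still being written or synced; retry next poll
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no marker in {job_dir} after {timeout}s")
        time.sleep(poll_interval)
```

The returned dictionary carries the `status` field the worker writes, so the caller can distinguish `"completed"` from `"error"` without checking which marker file was found.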

## 9. Quick Test

Generate a short test clip to hear the voice quality.

In [None]:
# Quick test with default voice
test_text = "Welcome to this tutorial. Today we'll learn about bubble sort, a simple comparison-based sorting algorithm. It is not the fastest, but it is a great starting point."

# To test with YOUR voice, set ref_file and ref_text:
# ref_file = os.path.join(VOICE_REFS_DIR, "teacher-voice.wav")
# ref_text = "The transcription of what is said in the reference audio."
ref_file = ""  # empty = use F5-TTS built-in default
ref_text = ""

t0 = time.time()
wav, sr, _ = f5tts.infer(
    ref_file=ref_file,
    ref_text=ref_text,
    gen_text=test_text,
    seed=42,
)
elapsed = time.time() - t0
duration = len(wav) / sr

print(f"Generated {duration:.2f}s of audio in {elapsed:.2f}s (RTF: {elapsed/duration:.2f})")

sf.write("/tmp/test_f5tts.wav", wav, sr)

from IPython.display import Audio, display
display(Audio(wav, rate=sr))
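
The RTF printed above is generation time divided by audio duration, so values below 1.0 mean synthesis runs faster than playback. The sketch below wraps that ratio and inverts it to estimate how long a batch of narration might take. Both function names are hypothetical, and the estimate assumes RTF stays roughly constant across clip lengths, which may not hold in practice.

```python
def real_time_factor(gen_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return gen_seconds / audio_seconds


def estimate_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Rough synthesis time for a given amount of audio at a given RTF."""
    return audio_seconds * rtf
```

For example, at an RTF of 0.4, one minute of narration would take about 24 seconds to generate.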

## 10. Test with Voice Clone

Test with a reference voice clip from your Drive.

In [None]:
# Uncomment and set your reference audio:
# ref_file = os.path.join(VOICE_REFS_DIR, "teacher-voice.wav")
# ref_text = "The exact words spoken in the reference audio file."
#
# test_text = "Welcome to this tutorial. Today we will learn about bubble sort."
#
# wav, sr, _ = f5tts.infer(
#     ref_file=ref_file,
#     ref_text=ref_text,
#     gen_text=test_text,
#     seed=42,
# )
#
# from IPython.display import Audio, display
# display(Audio(wav, rate=sr))