# Alexandria Audiobook Generator — Google Colab

Run [Alexandria](https://github.com/Finrandojin/alexandria-audiobook) on a free Google Colab GPU. No local installation required.

> **Note:** This Colab notebook is provided for convenience so you can try Alexandria without local installation. It is not the primary focus of the project — for the best experience, install Alexandria locally via [Pinokio](https://pinokio.computer).

**What this notebook does:**
1. Checks your GPU runtime
2. Installs Alexandria and all dependencies
3. (Optional) Mounts Google Drive to cache TTS models across sessions
4. Starts the Alexandria server and creates a public URL via ngrok

**Before you start:**
- Make sure you've selected a **GPU runtime**: Runtime → Change runtime type → T4 GPU
- You need a free [ngrok account](https://dashboard.ngrok.com/signup) for the tunnel
- You need an LLM server for script generation (see Cell 5 for options)

**Colab T4 GPU (15 GB VRAM) recommended settings:**
- Parallel Workers: 5-10
- Max Chars/Batch: 1500-2000
- Compile Codec: enabled (recommended for speed)

---

## 1. Check GPU Runtime

Verify that a GPU is available. If this cell fails, go to **Runtime → Change runtime type → T4 GPU**.

In [None]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError(
        "No GPU detected! Go to Runtime → Change runtime type → T4 GPU, then re-run this cell."
    )

gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU: {gpu_name}")
print(f"VRAM: {vram_gb:.1f} GB")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
print()
if vram_gb < 10:
    print("WARNING: Less than 10 GB VRAM. Batch sizes will be very limited.")
    print("Use Parallel Workers: 2-3 and Max Chars/Batch: 1000 in Setup.")
elif vram_gb < 16:
    print("T4 detected (15 GB). Recommended: Parallel Workers 5-10, Max Chars/Batch 1500-2000.")
else:
    print(f"{vram_gb:.0f} GB VRAM available. You can use higher batch sizes.")

## 2. Install Alexandria

Clones the repository and installs all Python dependencies. This takes 2-3 minutes.

In [None]:
import os

ALEXANDRIA_DIR = "/content/Alexandria"

# Clone the repository
if not os.path.exists(ALEXANDRIA_DIR):
    !git clone https://github.com/Finrandojin/alexandria-audiobook.git {ALEXANDRIA_DIR}
else:
    print(f"Alexandria already cloned at {ALEXANDRIA_DIR}")
    !cd {ALEXANDRIA_DIR} && git pull

# Install dependencies (skip torch — Colab already has it)
!pip install -q -r {ALEXANDRIA_DIR}/app/requirements.txt
!pip install -q qwen-tts==0.1.1
!pip install -q pyngrok

print()
print("Installation complete.")
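After the install cell finishes, a quick check can confirm that the packages it names are visible to the Colab kernel. The sketch below uses the standard-library `importlib.metadata`. The helper name `installed_version` is hypothetical, and the package list simply mirrors the two explicit `pip install` lines in the cell above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the install cell above.
for name in ("qwen-tts", "pyngrok"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If either line reports `NOT INSTALLED`, re-running the install cell is the first thing to try.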

## 3. Mount Google Drive (Optional)

TTS models are **~3.5 GB each** and download on first use. By default, they're lost when the Colab session ends.

Mounting Google Drive caches models persistently so you don't re-download them every session.

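**Skip this cell** if you don't want to use Drive storage (models will re-download each session). If you do use it, run it before Cell 5: the server process inherits `HF_HOME` from the notebook environment at the moment it starts.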

In [None]:
import os
from google.colab import drive

drive.mount("/content/drive")

# Set HuggingFace cache to Google Drive
hf_cache = "/content/drive/MyDrive/.cache/huggingface"
os.makedirs(hf_cache, exist_ok=True)
os.environ["HF_HOME"] = hf_cache

print(f"HuggingFace cache set to: {hf_cache}")
print("Models will persist across Colab sessions.")
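To see how much Drive space the cached models occupy, you can total the file sizes under the cache directory. This is a minimal sketch: the helper `dir_size_gb` is hypothetical, and the path assumes the `hf_cache` location set in the cell above.

```python
import os


def dir_size_gb(path: str) -> float:
    """Sum the sizes of all regular files under `path`, in GB (1e9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so a file reachable through a link is not counted twice.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e9


hf_cache = "/content/drive/MyDrive/.cache/huggingface"  # same path as the cell above
if os.path.isdir(hf_cache):
    print(f"Cache size: {dir_size_gb(hf_cache):.2f} GB")
else:
    print("Cache directory not found; run the Drive cell first.")
```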

## 4. Configure ngrok Tunnel

ngrok creates a public URL so you can access the Alexandria web UI from your browser.

1. Sign up for a free account at [ngrok.com](https://dashboard.ngrok.com/signup)
2. Copy your auth token from [dashboard.ngrok.com/get-started/your-authtoken](https://dashboard.ngrok.com/get-started/your-authtoken)
3. Paste it below and run the cell

In [None]:
NGROK_AUTH_TOKEN = ""  # @param {type:"string"}

if not NGROK_AUTH_TOKEN:
    raise ValueError(
        "Please set your ngrok auth token above.\n"
        "Get one free at: https://dashboard.ngrok.com/get-started/your-authtoken"
    )

from pyngrok import ngrok
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
print("ngrok configured.")
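Pasting the token into the form field leaves it in the notebook if you save or share a copy. One alternative, sketched here, is to fall back to an environment variable when the field is empty. The helper `resolve_ngrok_token` and the default variable name `NGROK_AUTH_TOKEN` are assumptions for illustration, not part of Alexandria or pyngrok.

```python
import os


def resolve_ngrok_token(form_value: str, env_var: str = "NGROK_AUTH_TOKEN") -> str:
    """Prefer the form field; otherwise read the token from the environment."""
    token = form_value.strip() or os.environ.get(env_var, "").strip()
    if not token:
        raise ValueError(
            f"No ngrok token found. Fill in the form field or set {env_var}."
        )
    return token
```

With this helper defined, the configuration line would become `ngrok.set_auth_token(resolve_ngrok_token(NGROK_AUTH_TOKEN))`.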

## 5. Start Alexandria

This cell:
1. Writes a default config (local TTS mode, auto GPU device)
2. Starts the Alexandria server in the background
3. Opens an ngrok tunnel to the web UI

**After running this cell**, click the ngrok URL to open the Alexandria web UI.

### LLM Setup

Alexandria needs an LLM server for script generation. In the web UI **Setup tab**, configure one of:

| Provider | Base URL | API Key |
|----------|----------|--------|
| OpenAI | `https://api.openai.com/v1` | Your OpenAI API key |
| DeepSeek | `https://api.deepseek.com/v1` | Your DeepSeek API key |
| OpenRouter | `https://openrouter.ai/api/v1` | Your OpenRouter API key |
| Ollama (see Cell 6) | `http://localhost:11434/v1` | `local` |

Or any other OpenAI-compatible API.
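
Before entering a provider in the Setup tab, it can help to confirm that the endpoint answers. OpenAI-compatible servers generally expose a model list at `<base URL>/models`, and the sketch below queries it with the standard library. The helper names are hypothetical, and whether a given provider implements this route is an assumption to check against its documentation.

```python
import json
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the provider's model list."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_models(base_url: str, api_key: str, timeout: float = 10.0) -> list:
    """Return model IDs reported by the endpoint (assumes the OpenAI response shape)."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload.get("data", [])]
```

For example, `list_models("http://localhost:11434/v1", "local")` uses the Ollama values from the table. A connection error or an empty list suggests the Base URL or API key needs another look before you generate scripts.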

In [None]:
import json
import os
import subprocess
import time
import requests
from pyngrok import ngrok

ALEXANDRIA_DIR = "/content/Alexandria"
APP_DIR = os.path.join(ALEXANDRIA_DIR, "app")
CONFIG_PATH = os.path.join(APP_DIR, "config.json")

# Write default config if none exists
if not os.path.exists(CONFIG_PATH):
    config = {
        "llm": {
            "base_url": "http://localhost:11434/v1",
            "api_key": "local",
            "model_name": ""
        },
        "tts": {
            "mode": "local",
            "device": "auto",
            "language": "English",
            "parallel_workers": 8,
            "compile_codec": True,
            "sub_batch_enabled": True,
            "sub_batch_min_size": 4,
            "sub_batch_ratio": 5,
            "sub_batch_max_chars": 2000
        }
    }
    with open(CONFIG_PATH, "w") as f:
        json.dump(config, f, indent=2)
    print("Default config written (local TTS, auto GPU).")
else:
    print("Existing config.json found, keeping it.")

# Start the server
print("Starting Alexandria server...")
server_process = subprocess.Popen(
    ["python", "app.py"],
    cwd=APP_DIR,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
    bufsize=1
)

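# Wait for the server to start (up to 30 attempts, 2 seconds apart)
for _ in range(30):
    try:
        r = requests.get("http://127.0.0.1:4200/api/config", timeout=2)
        if r.status_code == 200:
            print("Server is running.")
            break
    except requests.exceptions.RequestException:
        # Server is not accepting connections yet; retry after a short pause
        pass
    time.sleep(2)
else:
    print("WARNING: Server may not have started. Check output below.")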

# Open ngrok tunnel (connect() returns a tunnel object; .public_url holds the address)
public_url = ngrok.connect(4200).public_url
print()
print("=" * 60)
print(f"Alexandria is running at: {public_url}")
print("=" * 60)
print()
print("Open the URL above in your browser.")
print("Configure your LLM in the Setup tab before generating scripts.")
print()
print("First TTS generation will download the model (~3.5 GB).")
print("Check this cell's output for download progress.")

## 6. (Optional) Install Ollama for Local LLM

If you don't have a cloud LLM API, you can run Ollama on Colab for script generation.

**Important — VRAM sharing:** Ollama and TTS both use the T4 GPU. A 7B model uses ~5 GB VRAM, leaving only ~10 GB for TTS. This **will cause out-of-memory crashes** during batch generation, especially with LoRA voices.

**Workflow:** Use Ollama for script generation, then **run Cell 6b to stop Ollama** before starting TTS batch generation. The LLM is only needed during script generation.

**Skip this cell** if you're using a cloud API (OpenAI, DeepSeek, etc.) — that's the easiest option on Colab.

In [None]:
import subprocess
import time

OLLAMA_MODEL = "qwen2.5:7b"  # @param {type:"string"}

# Install zstd (required by Ollama installer)
!apt-get install -y -qq zstd

# Install Ollama
print("Installing Ollama...")
!curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server in background
print("Starting Ollama server...")
ollama_process = subprocess.Popen(
    ["ollama", "serve"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)
time.sleep(3)

# Pull the model
print(f"Pulling {OLLAMA_MODEL} (this may take a few minutes)...")
!ollama pull {OLLAMA_MODEL}

print()
print(f"Ollama is running with {OLLAMA_MODEL}.")
print()
print("In Alexandria Setup tab, configure:")
print("  LLM Base URL: http://localhost:11434/v1")
print("  API Key: local")
print(f"  Model Name: {OLLAMA_MODEL}")
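
To judge how much memory Ollama is holding before you start TTS, you can query the driver with `nvidia-smi`. The sketch below parses its CSV output. The helper names are hypothetical, and the example string in the docstring illustrates the format only; it is not a measurement.

```python
import subprocess


def parse_gpu_memory(csv_line: str) -> tuple:
    """Parse a 'used, total' line in MiB (e.g. '123, 456') into an (int, int) tuple."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def gpu_memory_mib() -> tuple:
    """Ask nvidia-smi for used and total memory of the first GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out.strip().splitlines()[0])
```

Calling `gpu_memory_mib()` once while Ollama is loaded and again after stopping it shows how much memory was released.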

## 6b. Stop Ollama (Free VRAM for TTS)

**Run this cell after script generation is complete**, before starting batch TTS generation. This frees ~5 GB of VRAM that Ollama was using for the LLM.

You do NOT need to run this if you used a cloud LLM API instead of Ollama.

In [None]:
import subprocess
import time
import torch

# Stop the Ollama server to free GPU memory for TTS.
# pkill exits with 0 if it signalled a process and non-zero if nothing matched.
result = subprocess.run(["pkill", "-f", "ollama"], timeout=5)
if result.returncode == 0:
    print("Ollama stopped. GPU memory freed for TTS.")
    time.sleep(2)  # give the process a moment to exit and release VRAM
else:
    print("Ollama was not running.")

# Verify VRAM is freed
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"VRAM: {free_bytes / 1e9:.1f} GB free / {total_bytes / 1e9:.1f} GB total")

## 7. View Server Logs

Run this cell to see real-time server output (model loading, generation progress, errors).

**Interrupt the cell** (stop button) to stop viewing logs. The server keeps running.

In [None]:
# Stream server output (interrupt to stop viewing, server keeps running)
try:
    for line in server_process.stdout:
        print(line, end="")
except KeyboardInterrupt:
    print("\nStopped viewing logs. Server is still running.")
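
Because the server was started with `stdout=subprocess.PIPE`, its output accumulates in an OS pipe until something reads it, and a child process can block on writes once that pipe fills. One way to avoid depending on the log cell, sketched here, is to drain the pipe on a background thread into a bounded buffer. The helper `start_log_drain` is hypothetical, and the demo uses a throwaway child process rather than the Alexandria server.

```python
import collections
import subprocess
import sys
import threading


def start_log_drain(process: subprocess.Popen, max_lines: int = 1000):
    """Read `process.stdout` on a daemon thread, keeping the last `max_lines` lines."""
    buffer = collections.deque(maxlen=max_lines)

    def _drain():
        for line in process.stdout:
            buffer.append(line.rstrip("\n"))

    thread = threading.Thread(target=_drain, daemon=True)
    thread.start()
    return buffer, thread


# Demo with a short-lived child process that prints two lines.
demo = subprocess.Popen(
    [sys.executable, "-c", "print('model loading'); print('ready')"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
)
recent, worker = start_log_drain(demo)
demo.wait()
worker.join(timeout=5)
print(list(recent))
```

If you adopt this pattern for `server_process`, print the buffer instead of iterating `server_process.stdout`, since two readers on one pipe would split the lines between them.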

## 8. Stop Server

Run this cell when you're done to clean up.

In [None]:
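import subprocess
from pyngrok import ngrok

# Kill ngrok tunnels
ngrok.kill()
print("ngrok tunnel closed.")

# Stop the server: ask it to exit first, force-kill if it does not exit in time
try:
    server_process.terminate()
    server_process.wait(timeout=5)
    print("Server stopped.")
except subprocess.TimeoutExpired:
    server_process.kill()
    print("Server killed.")
except NameError:
    print("Server was never started.")

# Stop Ollama if Cell 6 started it
try:
    ollama_process.terminate()
    print("Ollama stopped.")
except NameError:
    pass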