# MVP-Echo STT: Try It on Colab

Run **local speech-to-text** powered by [NVIDIA Parakeet](https://huggingface.co/csukuangfj/sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8) on a free Colab GPU.

**No local GPU required.** This notebook gives you a temporary OpenAI-compatible
`/v1/audio/transcriptions` endpoint backed by a T4 GPU.

| | |
|---|---|
| **Model** | Parakeet TDT 0.6b v2 (INT8), English |
| **Engine** | [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) C++ with CUDA |
| **Speed** | ~20-50x faster than real-time |
| **VRAM** | ~426 MiB (fits easily on T4's 16GB) |
| **API** | OpenAI Whisper-compatible |
| **API Key** | `SK-COLAB-COMMUNITY` |

**How to use:**
1. Make sure the runtime is set to **GPU** (Runtime → Change runtime type → T4 GPU)
2. Run all cells in order (Runtime → Run all)
3. Copy the public URL printed at the end
4. In MVP-Echo toolbar settings, set the endpoint URL and use API key: **`SK-COLAB-COMMUNITY`**

> **Connecting from MVP-Echo toolbar:** Enter the Colab public URL as the server
> endpoint and use `SK-COLAB-COMMUNITY` as the API key. The toolbar requires a key
> to complete the connection test; this well-known community key fulfills that check.
> Any non-empty key will work.
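
The "any non-empty key" rule can be sketched in a few lines. This mirrors the `_check_auth` helper in the server script from step 4; the function name `has_bearer_token` is illustrative only and does not appear in the notebook.

```python
def has_bearer_token(authorization_header: str) -> bool:
    """Return True when the header carries any non-empty Bearer token."""
    prefix = "Bearer "
    if not authorization_header.startswith(prefix):
        return False
    # The token value itself is never compared against a stored key
    return len(authorization_header[len(prefix):].strip()) > 0


print(has_bearer_token("Bearer SK-COLAB-COMMUNITY"))  # True
print(has_bearer_token("Bearer anything-else"))       # True
print(has_bearer_token("Bearer "))                    # False
print(has_bearer_token(""))                           # False
```

Because the value is never compared, the key works as a compatibility shim rather than as access control.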

> This is a **community resource** for developers who don't have a local GPU.
> The session is temporary and will shut down when Colab reclaims the runtime.
> No data is stored. No telemetry. Audio is processed and immediately deleted.

---
## 1. Verify GPU

In [6]:
!nvidia-smi

import torch
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    vram = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    print(f"\nGPU: {gpu_name} ({vram:.1f} GB VRAM)")
    print("CUDA is available, good to go.")
else:
    print("\nNo GPU detected!")
    print("Go to Runtime → Change runtime type → select T4 GPU")

Sun Feb 15 21:17:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P8             13W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

---
## 2. Install sherpa-onnx + dependencies

Downloads pre-built C++ binaries with CUDA 12 + cuDNN 9 support (~234 MB).

In [7]:
import os

SHERPA_VERSION = "v1.12.23"
SHERPA_URL = (
    f"https://github.com/k2-fsa/sherpa-onnx/releases/download/{SHERPA_VERSION}/"
    f"sherpa-onnx-{SHERPA_VERSION}-cuda-12.x-cudnn-9.x-linux-x64-gpu.tar.bz2"
)

if not os.path.exists("/opt/sherpa-onnx/bin/sherpa-onnx-offline-websocket-server"):
    print("Downloading sherpa-onnx pre-built binaries...")
    !curl -fSL -o /tmp/sherpa.tar.bz2 "{SHERPA_URL}"
    !mkdir -p /opt/sherpa-onnx
    !tar xjf /tmp/sherpa.tar.bz2 -C /opt/sherpa-onnx --strip-components=1
    !rm /tmp/sherpa.tar.bz2
    print("sherpa-onnx installed.")
else:
    print("sherpa-onnx already installed.")

# Set PATH and LD_LIBRARY_PATH for this session
os.environ["PATH"] = f"/opt/sherpa-onnx/bin:{os.environ['PATH']}"
os.environ["LD_LIBRARY_PATH"] = f"/opt/sherpa-onnx/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

# Verify
!sherpa-onnx-offline-websocket-server --help 2>&1 | head -3
print("\nsherpa-onnx is ready.")

sherpa-onnx already installed.
/home/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:PrintUsage:414 

Automatic speech recognition with sherpa-onnx using websocket.

sherpa-onnx is ready.


In [8]:
# Install Python dependencies
!pip install -q fastapi "uvicorn[standard]" python-multipart soundfile numpy websockets

---
## 3. Download model

Downloads the Parakeet TDT 0.6b v2 (INT8) model from HuggingFace (~300 MB).
This is NVIDIA's English speech recognition model, quantized for fast inference.

### HuggingFace Token (optional but recommended)

The model is on a **public repo**, so it will download without a token. However,
HuggingFace may rate-limit anonymous downloads. Adding a token avoids this.

**How to get a HuggingFace token:**
1. Create a free account at [huggingface.co](https://huggingface.co/join)
2. Go to [Settings → Access Tokens](https://huggingface.co/settings/tokens)
3. Click **"New token"** → name it anything (e.g. "Colab") → select **"Read"** access
4. Copy the token (starts with `hf_...`)

**Two ways to add it in Colab:**

**Option 1 - Colab Secrets (recommended, more secure):**
1. Click the **🔑 key icon** in the left sidebar
2. Click **"Add new secret"**
3. Name: `HF_TOKEN` → Value: paste your token
4. Toggle **"Notebook access"** ON

**Option 2 - Paste directly in the cell below** (less secure, visible in notebook)
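
When both options are filled in, the cell below prefers the Colab Secret and falls back to the pasted value. A minimal sketch of that precedence, with `resolve_hf_token` as a hypothetical helper name (the notebook inlines this logic instead of using a function):

```python
from typing import Optional


def resolve_hf_token(secret_value: Optional[str], form_value: str) -> Optional[str]:
    """Pick the token to use: Colab Secret first, then the form field, else None."""
    if secret_value:
        return secret_value
    if form_value.strip():
        return form_value.strip()
    return None  # no token: the download proceeds anonymously


print(resolve_hf_token("hf_from_secret", "hf_from_form"))  # hf_from_secret
print(resolve_hf_token(None, "  hf_from_form  "))          # hf_from_form
print(resolve_hf_token(None, ""))                          # None
```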

In [None]:
#@title Download model { display-mode: "form" }

#@markdown **HuggingFace token** (optional, leave blank to download without auth):
HF_TOKEN = ""  #@param {type:"string"}

# Try Colab Secrets first, then fall back to the field above
hf_token = None
try:
    from google.colab import userdata
    hf_token = userdata.get("HF_TOKEN")
    if hf_token:
        print("Using HuggingFace token from Colab Secrets.")
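except Exception:
    # Covers ImportError outside Colab and a missing or inaccessible secret.
    # (Naming userdata.SecretNotFoundError here would raise NameError
    # whenever the import itself failed.)
    pass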

if not hf_token and HF_TOKEN.strip():
    hf_token = HF_TOKEN.strip()
    print("Using HuggingFace token from cell input.")

if not hf_token:
    print("No HuggingFace token provided; downloading anonymously.")
    print("(This works fine, but may be rate-limited on heavy usage days.)\n")

MODEL_ID = "parakeet-tdt-0.6b-v2-int8"
HF_REPO = f"csukuangfj/sherpa-onnx-nemo-{MODEL_ID}"
MODEL_DIR = f"/content/models/sherpa-onnx-nemo-{MODEL_ID}"

if os.path.exists(f"{MODEL_DIR}/tokens.txt"):
    print(f"Model already downloaded: {MODEL_ID}")
else:
    print(f"Downloading {MODEL_ID} from HuggingFace...")
    !pip install -q huggingface-hub
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id=HF_REPO,
        local_dir=MODEL_DIR,
        token=hf_token,
    )
    print(f"Model downloaded to {MODEL_DIR}")

# Verify model files, and only report success when all of them are present
required = ["encoder.int8.onnx", "decoder.int8.onnx", "joiner.int8.onnx", "tokens.txt"]
missing = []
for f in required:
    path = os.path.join(MODEL_DIR, f)
    if os.path.exists(path):
        print(f"  {f}: {os.path.getsize(path) / (1024 * 1024):.1f} MB")
    else:
        missing.append(f)
        print(f"  {f}: MISSING")
if missing:
    print(f"\nModel incomplete, missing: {', '.join(missing)}")
else:
    print(f"\nModel ready: {MODEL_ID}")

---
## 4. Start the STT server

This starts:
1. The sherpa-onnx C++ WebSocket server (loads model into GPU memory)
2. A FastAPI HTTP server with the OpenAI-compatible `/v1/audio/transcriptions` endpoint

In [None]:
# Write the server script to disk
# (If you cloned the repo, you can skip this; the file is in mvp-stt-colab/server.py)

SERVER_SCRIPT = r'''
import asyncio, json, os, signal, struct, subprocess, tempfile, time
import numpy as np
import soundfile as sf
import uvicorn
import websockets
from contextlib import asynccontextmanager
from fastapi import FastAPI, File, Form, Request, UploadFile
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse

MODEL_DIR = os.environ.get("MODEL_DIR")
PROVIDER = os.environ.get("PROVIDER", "cuda")
NUM_THREADS = os.environ.get("NUM_THREADS", "4")
PORT = int(os.environ.get("PORT", "8000"))
WS_PORT = int(os.environ.get("WS_PORT", "7100"))

# Community API key: the toolbar requires a key to connect.
# Any non-empty Bearer token is accepted.
COMMUNITY_API_KEY = "SK-COLAB-COMMUNITY"

_process = None

async def start_sherpa():
    global _process
    cmd = [
        "sherpa-onnx-offline-websocket-server",
        f"--port={WS_PORT}", f"--provider={PROVIDER}",
        f"--encoder={MODEL_DIR}/encoder.int8.onnx",
        f"--decoder={MODEL_DIR}/decoder.int8.onnx",
        f"--joiner={MODEL_DIR}/joiner.int8.onnx",
        f"--tokens={MODEL_DIR}/tokens.txt",
        f"--num-threads={NUM_THREADS}", "--max-utterance-length=600",
    ]
    print(f"[server] Starting sherpa-onnx...")
    _process = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE)
    for _ in range(60):
        if _process.returncode is not None:
            stderr = ""
            if _process.stderr:
                try:
                    raw = await asyncio.wait_for(_process.stderr.read(4096), 1.0)
                    stderr = raw.decode(errors="replace")
                except asyncio.TimeoutError: pass
            raise RuntimeError(f"sherpa-onnx exited: {stderr}")
        try:
            async with websockets.connect(f"ws://localhost:{WS_PORT}",
                                          close_timeout=2, open_timeout=2) as ws:
                await ws.send("Done")
            print(f"[server] sherpa-onnx ready (PID={_process.pid}, provider={PROVIDER})")
            return
        except Exception: await asyncio.sleep(0.5)  # not ready yet; a bare except would also swallow task cancellation
    raise RuntimeError("sherpa-onnx failed to start")

def convert_to_wav(inp, out):
    try:
        r = subprocess.run(["ffmpeg","-y","-loglevel","error","-i",inp,
            "-ar","16000","-ac","1","-sample_fmt","s16","-f","wav",out],
            capture_output=True, timeout=30)
        return r.returncode == 0
    except: return False

async def transcribe_ws(samples, sample_rate):
    async with websockets.connect(f"ws://localhost:{WS_PORT}") as ws:
        header = struct.pack("<ii", sample_rate, samples.size * 4)
        buf = header + samples.tobytes()
        for start in range(0, len(buf), 10240):
            await ws.send(buf[start:start+10240])
        result = await ws.recv()
        await ws.send("Done")
    try: return json.loads(result).get("text", "").strip()
    except: return result.strip() if isinstance(result, str) else result.decode().strip()

def _check_auth(request):
    auth = request.headers.get("authorization", "")
    if not auth.startswith("Bearer "): return False
    return len(auth[7:].strip()) > 0

@asynccontextmanager
async def lifespan(application):
    await start_sherpa()
    print(f"[server] Ready on port {PORT}")
    print(f"[server] Community API key: {COMMUNITY_API_KEY}")
    yield
    if _process and _process.returncode is None:
        _process.terminate()

app = FastAPI(title="MVP-Echo STT", lifespan=lifespan)
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

@app.get("/health")
async def health():
    return {"status": "ok", "model": os.path.basename(MODEL_DIR), "provider": PROVIDER}

@app.get("/v1/models")
async def list_models(request: Request):
    if not _check_auth(request):
        return JSONResponse(status_code=401,
            content={"error": "Invalid or missing API key. Use: SK-COLAB-COMMUNITY"})
    model_id = os.path.basename(MODEL_DIR)
    for prefix in ("sherpa-onnx-nemo-", "sherpa-onnx-"):
        if model_id.startswith(prefix):
            model_id = model_id[len(prefix):]; break
    return JSONResponse({"data": [{"id": model_id, "object": "model",
        "owned_by": "colab", "label": "English", "group": "gpu", "active": True}]})

@app.post("/v1/audio/transcriptions")
async def transcribe(
    file: UploadFile = File(...), model: str = Form(default=""),
    language: str = Form(default="en"), response_format: str = Form(default="verbose_json"),
    temperature: str = Form(default="0"),
):
    t0 = time.time()
    suffix = os.path.splitext(file.filename or "audio.webm")[1] or ".webm"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read()); tmp_path = tmp.name
    wav_path = tmp_path + ".wav"
    try:
        if not convert_to_wav(tmp_path, wav_path):
            return JSONResponse(status_code=400, content={"error": "Audio conversion failed"})
        samples, sr = sf.read(wav_path, dtype="float32")
        if len(samples.shape) > 1: samples = samples[:, 0]
        dur = len(samples) / sr
        text = await transcribe_ws(samples.astype(np.float32), sr)
        elapsed = time.time() - t0
        rtf = elapsed / dur if dur > 0 else 0.0  # avoid ZeroDivisionError on empty audio
        print(f"[server] {dur:.1f}s -> {elapsed:.2f}s (RTF={rtf:.3f}): \"{text[:80]}\"")
        if response_format == "verbose_json":
            return JSONResponse({"text": text, "language": language,
                "duration": round(dur, 2),
                "segments": [{"id":0,"start":0.0,"end":round(dur,2),"text":text}]})
        return JSONResponse({"text": text})
    except Exception as e:
        return JSONResponse(status_code=500, content={"error": str(e)})
    finally:
        for p in [tmp_path, wav_path]:
            try: os.unlink(p)
            except: pass

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT)
'''.strip()

with open('/content/server.py', 'w') as f:
    f.write(SERVER_SCRIPT)
print('Server script written to /content/server.py')

In [11]:
import subprocess as _sp
import time as _time
import os as _os

# Set environment for the server process
server_env = _os.environ.copy()
server_env["MODEL_DIR"] = MODEL_DIR
server_env["PROVIDER"] = "cuda"
server_env["PORT"] = "8000"

# Start the server in the background. Output goes to a log file rather than
# a pipe, so an unread pipe buffer can never stall the server.
SERVER_LOG = "/content/server.log"
server_env["PYTHONUNBUFFERED"] = "1"  # flush log lines as they are printed
server_proc = _sp.Popen(
    ["python3", "/content/server.py"],
    env=server_env,
    stdout=open(SERVER_LOG, "wb"),
    stderr=_sp.STDOUT,
)
print(f"Server starting (PID={server_proc.pid})...")

def _log_tail(limit=4096):
    """Return the last `limit` bytes of the server log as text."""
    with open(SERVER_LOG, "rb") as log:
        return log.read()[-limit:].decode(errors="replace")

# Wait for the server to be ready
import urllib.request
for i in range(120):
    try:
        resp = urllib.request.urlopen("http://localhost:8000/health", timeout=2)
        data = resp.read().decode()
        print(f"\nServer is up! Health: {data}")
        break
    except Exception:
        if server_proc.poll() is not None:
            print(f"Server died! Output:\n{_log_tail()}")
            break
        print(".", end="", flush=True)
        _time.sleep(1)
else:
    print("\nServer failed to start within 120 seconds.")
    print(f"Output: {_log_tail()}")

Server starting (PID=2267)...
.............
Server is up! Health: {"status":"ok","model":"sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8","provider":"cuda"}


---
## 5. Create public URL

Choose **one** of the options below to expose the server with a public URL.

**Option A** (cloudflared) â€” no signup needed, works immediately.

**Option B** (ngrok) â€” requires a free [ngrok.com](https://ngrok.com) account.

In [12]:
#@title Option A: cloudflared tunnel (no signup required)

import subprocess, time, re

# Install cloudflared
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O /usr/local/bin/cloudflared
!chmod +x /usr/local/bin/cloudflared

# Start tunnel
tunnel_proc = subprocess.Popen(
    ["cloudflared", "tunnel", "--url", "http://localhost:8000", "--no-autoupdate"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# Wait for the URL to appear in stderr
public_url = None
deadline = time.time() + 30
buffer = b""

import select
while time.time() < deadline and public_url is None:
    ready, _, _ = select.select([tunnel_proc.stderr], [], [], 1.0)
    if ready:
        chunk = tunnel_proc.stderr.read1(4096) if hasattr(tunnel_proc.stderr, 'read1') else tunnel_proc.stderr.read(4096)
        if chunk:
            buffer += chunk
            text = buffer.decode(errors="replace")
            match = re.search(r'(https://[a-z0-9-]+\.trycloudflare\.com)', text)
            if match:
                public_url = match.group(1)

if public_url:
    print("=" * 60)
    print(f"PUBLIC URL: {public_url}")
    print("=" * 60)
    print()
    print(f"Health check:  {public_url}/health")
    print(f"Transcribe:    POST {public_url}/v1/audio/transcriptions")
    print()
    print("This URL is temporary and will stop when the Colab session ends.")
else:
    print("Failed to get tunnel URL. Check the output above for errors.")
    print("Stderr:", buffer.decode(errors="replace")[:500])

PUBLIC URL: https://tent-enjoy-films-driver.trycloudflare.com

Health check:  https://tent-enjoy-films-driver.trycloudflare.com/health
Transcribe:    POST https://tent-enjoy-films-driver.trycloudflare.com/v1/audio/transcriptions

This URL is temporary and will stop when the Colab session ends.
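
The Option A cell finds the public URL by scanning cloudflared's stderr with a regular expression. A standalone sketch of that step, using the same pattern; `find_tunnel_url` is a hypothetical helper name, and the sample log line is made up for illustration, so real cloudflared output will look different.

```python
import re
from typing import Optional

# Same pattern as the Option A cell
TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(log_text: str) -> Optional[str]:
    """Return the first trycloudflare.com URL in the log text, if any."""
    match = TUNNEL_URL.search(log_text)
    return match.group(0) if match else None


# Illustrative input only, not actual cloudflared log output
sample = "INF |  https://tent-enjoy-films-driver.trycloudflare.com  |"
print(find_tunnel_url(sample))
print(find_tunnel_url("tunnel still starting"))  # None
```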


In [13]:
#@title Option B: ngrok tunnel (requires free account)

# 1. Sign up at https://ngrok.com (free)
# 2. Get your auth token from https://dashboard.ngrok.com/get-started/your-authtoken
# 3. Paste it below:

NGROK_AUTH_TOKEN = ""  #@param {type:"string"}

if not NGROK_AUTH_TOKEN:
    print("Paste your ngrok auth token above and re-run this cell.")
    print("Get one free at: https://dashboard.ngrok.com/get-started/your-authtoken")
else:
    !pip install -q pyngrok
    from pyngrok import ngrok

    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
    tunnel = ngrok.connect(8000, "http")
    public_url = tunnel.public_url

    print("=" * 60)
    print(f"PUBLIC URL: {public_url}")
    print("=" * 60)
    print()
    print(f"Health check:  {public_url}/health")
    print(f"Transcribe:    POST {public_url}/v1/audio/transcriptions")
    print()
    print("This URL is temporary and will stop when the Colab session ends.")

Paste your ngrok auth token above and re-run this cell.
Get one free at: https://dashboard.ngrok.com/get-started/your-authtoken


---
## 6. Test it

### From this notebook:

In [14]:
# Quick health check
!curl -s http://localhost:8000/health | python3 -m json.tool

{
    "status": "ok",
    "model": "sherpa-onnx-nemo-parakeet-tdt-0.6b-v2-int8",
    "provider": "cuda"
}


In [15]:
# Transcribe a short silent clip to verify the pipeline end to end
# (replace it with real audio for real results)

import numpy as np
import soundfile as sf

# Generate 2 seconds of silence at 16 kHz
sr = 16000
duration = 2.0
samples = np.zeros(int(sr * duration), dtype=np.float32)
sf.write("/tmp/test_silence.wav", samples, sr)

print("Transcribing test audio (2s silence)...")
!curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@/tmp/test_silence.wav" \
  -F "response_format=verbose_json" | python3 -m json.tool

print("\nTo test with real audio, upload a file and run:")
print('  !curl -X POST http://localhost:8000/v1/audio/transcriptions -F "file=@/path/to/audio.wav"')

Transcribing test audio (2s silence)...
{
    "text": "",
    "language": "en",
    "duration": 2.0,
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 2.0,
            "text": ""
        }
    ]
}

To test with real audio, upload a file and run:
  !curl -X POST http://localhost:8000/v1/audio/transcriptions -F "file=@/path/to/audio.wav"
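
The `verbose_json` response is plain JSON, so it is easy to post-process. A small sketch that formats each segment as a line, run on the response shown above; `segments_as_lines` is a hypothetical helper name. Note that the server script always returns a single segment spanning the whole clip.

```python
import json
from typing import List


def segments_as_lines(response_body: str) -> List[str]:
    """Format each segment of a verbose_json response as 'start-end: text'."""
    data = json.loads(response_body)
    return [
        f"{seg['start']:.2f}-{seg['end']:.2f}s: {seg['text']!r}"
        for seg in data.get("segments", [])
    ]


# The response from the silence test above
body = (
    '{"text": "", "language": "en", "duration": 2.0, '
    '"segments": [{"id": 0, "start": 0.0, "end": 2.0, "text": ""}]}'
)
for line in segments_as_lines(body):
    print(line)  # 0.00-2.00s: ''
```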


### From your machine (using the public URL):

```bash
# Health check
curl https://YOUR-URL.trycloudflare.com/health

# Transcribe an audio file
curl -X POST https://YOUR-URL.trycloudflare.com/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "response_format=verbose_json"
```

### From MVP-Echo toolbar:

Set the STT endpoint in settings to your public URL:
```
https://YOUR-URL.trycloudflare.com
```

> **About API keys:** The server accepts any non-empty Bearer token; it does not
> compare the value. `GET /v1/models`, which the toolbar's "Test Connection" calls,
> returns 401 without a token, so enter `SK-COLAB-COMMUNITY` (or any non-empty value)
> in the API key field. `POST /v1/audio/transcriptions` does not check the key, which
> is why the curl example above works without an `Authorization` header.

In [None]:
### From your machine (using the public URL):

```bash
# Health check
curl https://YOUR-URL.trycloudflare.com/health

# Transcribe an audio file (include the API key)
curl -X POST https://YOUR-URL.trycloudflare.com/v1/audio/transcriptions \
  -H "Authorization: Bearer SK-COLAB-COMMUNITY" \
  -F "file=@recording.wav" \
  -F "response_format=verbose_json"
```
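
The same request can be built from Python with only the standard library. This is a sketch under assumptions: `build_transcription_request` is a hypothetical helper, the URL is the placeholder used above, and the audio bytes are a stand-in rather than a real WAV file. The function only constructs the request, so nothing is sent until you call `urlopen`.

```python
import urllib.request
import uuid


def build_transcription_request(base_url: str, filename: str, audio: bytes,
                                api_key: str = "SK-COLAB-COMMUNITY") -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain form field, same as -F "response_format=verbose_json"
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="response_format"\r\n\r\nverbose_json\r\n').encode(),
        # File field, same as -F "file=@recording.wav"
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode() + audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes; in practice use open("recording.wav", "rb").read()
req = build_transcription_request("https://YOUR-URL.trycloudflare.com",
                                  "recording.wav", b"placeholder-audio")
print(req.get_method(), req.full_url)
# To send: json.load(urllib.request.urlopen(req, timeout=60))
```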

### From MVP-Echo toolbar:

1. Open Settings
2. Set **Server URL** to your Colab public URL: `https://YOUR-URL.trycloudflare.com`
3. Set **API Key** to: `SK-COLAB-COMMUNITY`
4. Click **Test Connection**; it should succeed
5. Start transcribing

> **About the API key:** The toolbar requires an API key to complete its connection
> test (`GET /v1/models`). The community key `SK-COLAB-COMMUNITY` satisfies this
> check. Any non-empty key will work; this isn't real security, it's just
> compatibility with the toolbar's connection flow.