# Class 2 Demo – Config-backed routing (local + Modal)

**Follow-up to Class 1:** In [RelayServe_Demo.ipynb](../class1_runs/RelayServe_Demo.ipynb) you ran RelayServe against one mock backend, with backends set through environment variables.

Here we use **real backends**: your **local llama.cpp** (port 8081) and a **Modal** deployment. Routing is config-driven: the request's `model` field selects the backend, so one gateway can talk to both.
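
As a rough illustration of routing on the `model` field, the sketch below looks up a backend by name in a config dictionary shaped like the `config.yaml` written in section 2. The function `pick_backend` and the fallback to `default_backend` for unknown model names are assumptions made for illustration, not RelayServe's actual router, and the Modal URL is a placeholder.

```python
# Hypothetical sketch of config-driven routing; RelayServe's real router may differ.
CONFIG = {
    "default_backend": "local",
    "backends": {
        "local": {"type": "local", "url": "http://127.0.0.1:8081"},
        "modal": {"type": "modal", "url": "https://example.modal.run"},  # placeholder URL
    },
}


def pick_backend(config, model):
    """Return (name, url) for the backend named by `model`.

    Assumption: an unknown or missing model falls back to `default_backend`.
    """
    backends = config["backends"]
    name = model if model in backends else config["default_backend"]
    return name, backends[name]["url"]


print(pick_backend(CONFIG, "modal"))          # ('modal', 'https://example.modal.run')
print(pick_backend(CONFIG, "unknown-model"))  # ('local', 'http://127.0.0.1:8081')
```

With this shape, adding a third backend would only mean adding one more entry under `backends`.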

**Prerequisites (before running):**
1. **Local llama.cpp** running on port 8081 (e.g. `scripts/spawn_backends.py` — see [Serve_local_model.md](../class1_runs/Serve_local_model.md)).
2. **Modal** deployed and web URL ready: `cd modal && modal deploy modal_llama_server.py`, then get the `https://...modal.run` URL from the output or run `modal run modal_llama_server.py` once to see it. Paste that URL in section 2.

**Run all cells in order.** Section 1b frees port 8080 so RelayServe can bind; then we write config with your URLs, start RelayServe, and send `model=local` and `model=modal` requests.

## 1. Setup: path and install RelayServe

The setup cell for this section runs right after the port cleanup in section 1b.

## 1b. Free port 8080 (so RelayServe can bind)

Run this once at the start so nothing else is on 8080. We do not touch 8081 (your local llama). If you re-run the notebook without restarting the kernel, run section 9 at the end first, then run from here.

In [1]:
import subprocess

subprocess.run(
    "lsof -i :8080 -t | xargs kill -9 2>/dev/null || true",
    shell=True, capture_output=True, timeout=2
)
print("Port 8080 freed. You can start RelayServe.")

Port 8080 freed. You can start RelayServe.
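
If `lsof` is not available, or you only want to see whether the port is taken before killing anything, a plain socket probe may be enough. This is a minimal sketch; `port_in_use` is a hypothetical helper, not part of RelayServe.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


print("8080 in use:", port_in_use(8080))
print("8081 in use:", port_in_use(8081))
```

The probe only reports whether a listener exists; it does not say which process owns the port.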


In [None]:
import sys
import os

REPO_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
RELAY_SERVE_ROOT = os.path.join(REPO_ROOT, "RelayServe")
if not os.path.isdir(RELAY_SERVE_ROOT):
    RELAY_SERVE_ROOT = os.path.abspath(os.getcwd())

if RELAY_SERVE_ROOT not in sys.path:
    sys.path.insert(0, RELAY_SERVE_ROOT)

if os.path.isfile(os.path.join(RELAY_SERVE_ROOT, "pyproject.toml")):
    get_ipython().system(f'pip install -e "{RELAY_SERVE_ROOT}" -q')
else:
    get_ipython().system('pip install relayserve -q')

print(f"RelayServe root: {RELAY_SERVE_ROOT}")

RelayServe root: /path/to/class1_resources/RelayServe


## 2. Set backend URLs and write config

Set your **local llama.cpp** URL (default port 8081) and your **Modal** web URL. If you haven't deployed Modal yet, run `cd modal && modal deploy modal_llama_server.py`, then run `modal run modal_llama_server.py` once to see the `https://...modal.run` URL. The cell below backs up any existing `config.yaml` in memory and writes a new one that points to these backends; section 8 restores the backup.

In [None]:
LOCAL_LLAMA_URL = "http://127.0.0.1:8081"
MODAL_URL = "https://<your-modal-app>.modal.run"

config_path = os.path.join(RELAY_SERVE_ROOT, "config.yaml")
config_backup = None
if os.path.isfile(config_path):
    with open(config_path) as f:
        config_backup = f.read()

real_config = f"""
default_backend: local

backends:
  local:
    type: local
    url: {LOCAL_LLAMA_URL}
  modal:
    type: modal
    url: {MODAL_URL.rstrip("/")}
"""
with open(config_path, "w") as f:
    f.write(real_config)
print(f"Config written: local={LOCAL_LLAMA_URL}, modal={MODAL_URL}")
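# Warn when MODAL_URL is still a placeholder (e.g. "<your-modal-app>") or is not an https URL.
if "YOUR_WORKSPACE" in MODAL_URL or "<" in MODAL_URL or not MODAL_URL.startswith("https://"):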
    print("  → Update MODAL_URL above with your Modal web URL, then re-run this cell.")

Config written: local=http://127.0.0.1:8081, modal=https://<your-modal-app>.modal.run
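
The backup above lives only in a kernel variable, so it is lost if the kernel restarts before section 8 runs. In a standalone script, one option is to wrap the swap in a context manager so the original file comes back even when an error occurs. The sketch below is an assumption about how that could look; `temporary_config` is a hypothetical helper, and the demo works on a throwaway file in a temporary directory rather than the real `config.yaml`.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_config(path, text):
    """Write `text` to `path`, then restore the previous content on exit."""
    backup = None
    if os.path.isfile(path):
        with open(path) as f:
            backup = f.read()
    with open(path, "w") as f:
        f.write(text)
    try:
        yield path
    finally:
        if backup is not None:
            with open(path, "w") as f:
                f.write(backup)
        else:
            os.remove(path)  # there was no file before, so leave none behind


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.yaml")
    with open(demo_path, "w") as f:
        f.write("default_backend: old\n")
    with temporary_config(demo_path, "default_backend: local\n"):
        with open(demo_path) as f:
            during = f.read()
    with open(demo_path) as f:
        after = f.read()

print("during:", during.strip())
print("after: ", after.strip())
```

A context manager cannot span several notebook cells, which is why the notebook keeps the backup in `config_backup` and restores it in a separate cell.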


## 3. Ensure backends are running

Your **local llama.cpp** (port 8081) and **Modal** must already be running. If not, start local llama with `scripts/spawn_backends.py` and deploy Modal with `scripts/deploy_modal.sh`. The cell below optionally checks that both URLs are reachable.

In [4]:
import urllib.request

def check(url, timeout=3):
    try:
        req = urllib.request.Request(url, method="GET")
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except Exception:
        return False

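local_ok = check(LOCAL_LLAMA_URL.rstrip("/") + "/health") if LOCAL_LLAMA_URL else False
# Skip the Modal check while MODAL_URL is empty, not https, or still a placeholder such as "<your-modal-app>".
modal_url_set = bool(MODAL_URL) and MODAL_URL.startswith("https://") and "YOUR_WORKSPACE" not in MODAL_URL and "<" not in MODAL_URL
modal_ok = check(MODAL_URL.rstrip("/") + "/health") if modal_url_set else False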
print(f"Local ({LOCAL_LLAMA_URL}): {'OK' if local_ok else 'not reachable (is llama running on 8081?)'}")
print(f"Modal ({MODAL_URL[:50]}...): {'OK' if modal_ok else 'not reachable or URL not set'}")
if not local_ok:
    print("  → Start local llama: scripts/spawn_backends.py (see Serve_local_model.md)")

Local (http://127.0.0.1:8081): not reachable (is llama running on 8081?)
Modal (https://<your-modal-app>.modal.r...): not reachable or URL not set
  → Start local llama: scripts/spawn_backends.py (see Serve_local_model.md)
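
A single probe can report "not reachable" for a backend that is still starting up. One way to handle that is to poll a few times before giving up. The sketch below is a minimal, assumed helper (`wait_until` is not part of RelayServe); in the notebook it could wrap the `check` function above, for example `wait_until(lambda: check(LOCAL_LLAMA_URL + "/health"))`.

```python
import time


def wait_until(probe, attempts=5, delay=1.0):
    """Call `probe()` up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as a probe succeeds, False if every attempt fails.
    """
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            time.sleep(delay)
    return False


# Demo with a fake probe that succeeds on its third call.
calls = []

def fake_probe():
    calls.append(1)
    return len(calls) >= 3

print(wait_until(fake_probe, attempts=5, delay=0))  # True
print(len(calls))                                   # 3
```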


## 4. (Real Modal — no mock)

Using your deployed Modal backend. No mock server; ensure Modal is deployed and MODAL_URL is set in section 2.

In [5]:
# Real Modal backend; no mock. Ensure Modal is deployed and MODAL_URL set in section 2.
pass


## 5. Start RelayServe (with config-backed router)

In [6]:
import os
import sys
import threading

REPO_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
RELAY_SERVE_ROOT = os.path.join(REPO_ROOT, "RelayServe")
if not os.path.isdir(RELAY_SERVE_ROOT):
    RELAY_SERVE_ROOT = os.path.abspath(os.getcwd())
if RELAY_SERVE_ROOT not in sys.path:
    sys.path.insert(0, RELAY_SERVE_ROOT)

os.environ["RELAYSERVE_ROOT"] = RELAY_SERVE_ROOT
os.environ["RELAYSERVE_PORT"] = "8080"
os.environ["RELAYSERVE_BACKENDS"] = ""

from relayserve.internal.config.settings import Settings
from relayserve.internal.server.app import build_app
from relayserve.internal.server.http_server import run_server, _make_handler
from http.server import ThreadingHTTPServer

settings = Settings.from_env()
app = build_app(settings)
handler_factory = _make_handler(app)
relay_server = ThreadingHTTPServer(("127.0.0.1", settings.port), handler_factory)
threading.Thread(target=relay_server.serve_forever, daemon=True).start()

print("RelayServe running at http://127.0.0.1:8080 (config: real local + modal)")

RelayServe running at http://127.0.0.1:8080 (config: real local + modal)
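
The cell above runs the gateway on a daemon thread inside the kernel. The same start/stop pattern can be tried in isolation with the standard library alone. This sketch uses a hypothetical `PingHandler` in place of RelayServe's handler and binds to port 0 so the OS picks a free port.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Stand-in handler that answers every GET with 'ok'."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


server = ThreadingHTTPServer(("127.0.0.1", 0), PingHandler)  # port 0: OS picks a free port
port = server.server_address[1]
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# Bypass any system proxy settings for the loopback request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(f"http://127.0.0.1:{port}/", timeout=3) as r:
    reply = r.read().decode("utf-8")

server.shutdown()      # stops serve_forever()
server.server_close()  # releases the listening socket
thread.join(timeout=3)
print(reply)
```

Calling `shutdown()` followed by `server_close()` releases the port without killing the process that hosts the thread.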


## 6. Request with model=local

Run this **after** sections 3 (ensure backends) and 5 (Start RelayServe). If you see "RelayServe not ready", run the Start RelayServe cell again, then retry.

In [7]:
import json
import re
import urllib.request

def _parse_response(body):
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        pass
    plain = re.sub(r"\033\[[0-9;]*m", "", body)
    reply = ""
    backend = ""
    for line in plain.split("\n"):
        if line.startswith("Reply:"):
            reply = line.split("Reply:", 1)[1].strip()
        elif line.startswith("Backend:"):
            backend = line.split("Backend:", 1)[1].strip()
    return {"choices": [{"message": {"content": reply}}], "relay": {"backend": backend}}

last_err = None
out = None
for attempt in range(8):
    try:
        req = urllib.request.Request(
            "http://127.0.0.1:8080/v1/chat/completions",
            data=json.dumps({"model": "local", "messages": [{"role": "user", "content": "Say hello"}], "stream": False}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as r:
            body = r.read().decode("utf-8")
            if not body.strip():
                status = getattr(r, "status", "?")
                last_err = RuntimeError(f"Empty response (HTTP {status})")
                if attempt == 0:
                    print(f"RelayServe returned empty body (HTTP {status}). Retrying...")
                continue
            out = _parse_response(body)
            break
    except Exception as e:
        last_err = e
else:
    msg = "RelayServe not ready. Run section 1b (free port 8080), then sections 3–5 (check backends + start RelayServe), then retry."
    if last_err:
        msg += f" Last error: {last_err}"
    raise RuntimeError(msg)

reply = out.get("choices", [{}])[0].get("message", {}).get("content", "")
relay_meta = out.get("relay", {})
print("model=local:", reply)
print("relay backend:", relay_meta.get("backend"))

model=local: Echo: Say hello
relay backend: metal


## 7. Request with model=modal

In [8]:
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps({"model": "modal", "messages": [{"role": "user", "content": "Explain KV cache"}], "stream": False}).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=10) as r:
    body = r.read().decode("utf-8")
if not body.strip():
    raise RuntimeError(f"RelayServe returned empty body (HTTP {getattr(r, 'status', '?')}). Run sections 1b and 3–5 first, then retry.")
try:
    out = json.loads(body)
except json.JSONDecodeError as e:
    print("Body (first 300 chars):", body[:300] if body else "(empty)")
    raise RuntimeError(f"Invalid JSON from RelayServe: {e}") from e

reply = out.get("choices", [{}])[0].get("message", {}).get("content", "")
relay_meta = out.get("relay", {})
print("model=modal:", reply)
print("relay backend:", relay_meta.get("backend"))

model=modal: Echo: Explain KV cache
relay backend: metal
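
Both request cells read the reply with `out.get("choices", [{}])[0]`, which raises `IndexError` if the gateway returns an empty `choices` list. A small helper can make that extraction tolerant of missing or empty fields. This is a sketch under the assumption that responses follow the shape used above (`choices[0].message.content` plus a `relay.backend` field); `extract_reply` is a hypothetical name.

```python
def extract_reply(out):
    """Return (reply_text, backend_name) from a chat-completions style dict.

    Missing or empty fields yield "" for the text and None for the backend.
    """
    choices = out.get("choices") or [{}]
    message = choices[0].get("message") or {}
    relay = out.get("relay") or {}
    return message.get("content", ""), relay.get("backend")


sample = {"choices": [{"message": {"content": "hi"}}], "relay": {"backend": "local"}}
print(extract_reply(sample))           # ('hi', 'local')
print(extract_reply({"choices": []}))  # ('', None)
```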


## 8. Restore config.yaml

In [9]:
if config_backup is not None:
    with open(config_path, "w") as f:
        f.write(config_backup)
    print("config.yaml restored.")
else:
    print("No backup to restore.")

config.yaml restored.


## 9. Free port 8080

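Run this at the end to free port 8080. RelayServe runs in a thread inside this kernel, so we stop it in-process with `relay_server.shutdown()` instead of killing the process. Any other PID still bound to 8080 is killed; the kernel's own PID is skipped so the kernel is not stopped. Port 8081 (local llama) is left as-is.

In [10]:
import os
import subprocess

# RelayServe runs in a thread of this kernel process, so stop it in-process.
server = globals().get("relay_server")
if server is not None:
    server.shutdown()      # stop serve_forever()
    server.server_close()  # release the listening socket

# Kill any other process still bound to 8080; skip this kernel's own PID.
my_pid = os.getpid()
r = subprocess.run(
    "lsof -i :8080 -t",
    shell=True, capture_output=True, timeout=2, text=True
)
pids = (r.stdout or "").strip().split()
for pid in pids:
    try:
        if int(pid) != my_pid:
            subprocess.run(f"kill -9 {pid}", shell=True, capture_output=True, timeout=1)
    except (ValueError, OSError):
        pass
print("Port 8080 freed (current process skipped).")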

Port 8080 freed (current process skipped).
