# RelayServe Demo – Run and see it work

This notebook starts a **mock backend** and **RelayServe**, then exercises the API: health and model listing, non-streaming chat, request-ID echo, and streaming. No model download or llama.cpp is required.

**Run all cells in order.**

## 1. Setup: install RelayServe and add to path

In [1]:
import sys
import os

# Path to the RelayServe clone (this repo)
RELAY_SERVE_ROOT = os.path.abspath(os.path.join(os.getcwd(), "RelayServe"))
if not os.path.isdir(RELAY_SERVE_ROOT):
    RELAY_SERVE_ROOT = os.path.abspath(os.path.join(os.getcwd(), "class1_resources", "RelayServe"))
if not os.path.isdir(RELAY_SERVE_ROOT):
    RELAY_SERVE_ROOT = os.path.abspath(os.getcwd())

if RELAY_SERVE_ROOT not in sys.path:
    sys.path.insert(0, RELAY_SERVE_ROOT)

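# Install RelayServe in editable mode if running from the repo.
# Call pip through the kernel's own interpreter so the package lands in this environment.
if os.path.isfile(os.path.join(RELAY_SERVE_ROOT, "pyproject.toml")):
    get_ipython().system(f'"{sys.executable}" -m pip install -e "{RELAY_SERVE_ROOT}" -q')
else:
    get_ipython().system(f'"{sys.executable}" -m pip install relayserve -q')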

print("RelayServe root:", RELAY_SERVE_ROOT)

RelayServe root: /path/to/class1_resources/RelayServe


## 2. Start mock backend (fake LLM server)

This cell starts a minimal HTTP server that speaks the OpenAI-style `/v1/chat/completions` protocol, so RelayServe has a backend to talk to.

In [2]:
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

MOCK_BACKEND_PORT = 8091

# Stop the server left over from a previous run of this cell before rebinding the port.
# That server runs inside this kernel process, so killing whatever holds the port
# (for example with lsof and kill -9) would also kill the kernel.
if "mock_server" in globals():
    mock_server.shutdown()
    mock_server.server_close()

class MockBackendHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path.rstrip("/").endswith("/v1/chat/completions"):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length).decode("utf-8") if length else "{}"
            try:
                data = json.loads(body)
            except json.JSONDecodeError:
                data = {}
            stream = data.get("stream", False)
            messages = data.get("messages", [])
            prompt = ""
            for m in messages:
                if m.get("role") == "user":
                    prompt = str(m.get("content", ""))
                    break
            reply = f"Echo from mock backend: {prompt[:50]}..." if len(prompt) > 50 else f"Echo from mock backend: {prompt}"

            if stream:
                self.send_response(200)
                self.send_header("Content-Type", "text/event-stream")
                self.end_headers()
                for word in reply.split():
                    chunk = {
                        "id": "mock-1",
                        "object": "chat.completion.chunk",
                        "model": "mock",
                        "choices": [{"index": 0, "delta": {"content": word + " "}, "finish_reason": None}],
                    }
                    self.wfile.write(f"data: {json.dumps(chunk)}\n\n".encode("utf-8"))
                    self.wfile.flush()
                self.wfile.write(b"data: {\"choices\":[{\"delta\":{},\"finish_reason\":\"stop\"}]}\n\n")
                self.wfile.write(b"data: [DONE]\n\n")
                self.wfile.flush()
            else:
                out = {
                    "id": "mock-1",
                    "object": "chat.completion",
                    "model": "mock",
                    "choices": [
                        {"index": 0, "message": {"role": "assistant", "content": reply}, "finish_reason": "stop"}
                    ],
                    "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
                }
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                body = json.dumps(out).encode("utf-8")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass

mock_server = HTTPServer(("127.0.0.1", MOCK_BACKEND_PORT), MockBackendHandler)
mock_thread = threading.Thread(target=mock_server.serve_forever, daemon=True)
mock_thread.start()
print(f"Mock backend running at http://127.0.0.1:{MOCK_BACKEND_PORT}")

Mock backend running at http://127.0.0.1:8091


## 3. Start RelayServe

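RelayServe reads its backend list and port from the `RELAYSERVE_BACKENDS` and `RELAYSERVE_PORT` environment variables, then runs in a background thread and proxies requests to the mock backend.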

In [3]:
import os
os.environ["RELAYSERVE_BACKENDS"] = f"http://127.0.0.1:{MOCK_BACKEND_PORT}"
os.environ["RELAYSERVE_PORT"] = "8080"

from relayserve.internal.config.settings import Settings
from relayserve.internal.server.app import build_app
# _make_handler is underscore-prefixed (private by convention), so it may change between versions
from relayserve.internal.server.http_server import _make_handler
from http.server import ThreadingHTTPServer

settings = Settings.from_env()
app = build_app(settings)
handler_factory = _make_handler(app)
relay_server = ThreadingHTTPServer(("127.0.0.1", settings.port), handler_factory)
relay_thread = threading.Thread(target=relay_server.serve_forever, daemon=True)
relay_thread.start()

import time
time.sleep(0.5)
print(f"RelayServe running at http://127.0.0.1:{settings.port}")

RelayServe running at http://127.0.0.1:8080
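
Both servers run in daemon threads inside the notebook kernel, so they keep their ports until the kernel exits or they are shut down explicitly. The following is a minimal sketch of a clean shutdown using only the standard library. `stop_server` is a hypothetical helper name, not part of RelayServe, and the throwaway server on an ephemeral port stands in for `mock_server` or `relay_server`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def stop_server(server, thread, timeout=5.0):
    """Stop serve_forever(), wait for the thread, and release the port."""
    server.shutdown()        # asks the serve_forever() loop to exit
    thread.join(timeout)     # wait for the serving thread to finish
    server.server_close()    # close the listening socket
    return not thread.is_alive()


# Demonstration on a throwaway server bound to an ephemeral port (port 0).
demo_server = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)
demo_thread = threading.Thread(target=demo_server.serve_forever, daemon=True)
demo_thread.start()
stopped = stop_server(demo_server, demo_thread)
print("stopped:", stopped)
```

In this notebook the equivalent calls would be `stop_server(relay_server, relay_thread)` followed by `stop_server(mock_server, mock_thread)`.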


## 4. Test: health and models

In [5]:
import urllib.request
import time

def get(url):
    with urllib.request.urlopen(url, timeout=5) as r:
        return r.status, r.read().decode("utf-8")

base = "http://127.0.0.1:8080"
# Wait for RelayServe to be ready (run cell 3 first)
for attempt in range(15):
    try:
        get(base + "/healthz")
    except Exception as e:
        if attempt == 0:
            print("Waiting for RelayServe on 8080...", end="", flush=True)
        print(".", end="", flush=True)
        if attempt >= 14:
            raise RuntimeError(
                "RelayServe not reachable at http://127.0.0.1:8080. "
                "Run the previous cell (## 3. Start RelayServe) first."
            ) from e
        time.sleep(0.5)
    else:
        # Reached only when the health check succeeded
        if attempt > 0:
            print(" OK", flush=True)
        break

for path in ["/healthz", "/v1/models"]:
    status, body = get(base + path)
    print(f"GET {path} -> {status}")
    print(json.dumps(json.loads(body), indent=2))
    print()

GET /healthz -> 200
{
  "status": "ok"
}

GET /v1/models -> 200
{
  "data": [
    {
      "id": "relay-gguf",
      "object": "model"
    }
  ]
}
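
The later cells hard-code `"relay-gguf"` as the model name. As a small sketch, the ID could instead be read from the `/v1/models` response shown above. `model_ids` is a hypothetical helper, and the sample string simply copies the output of this cell.

```python
import json


def model_ids(models_body):
    """Return the model IDs listed in a /v1/models JSON response body."""
    payload = json.loads(models_body)
    return [m["id"] for m in payload.get("data", []) if "id" in m]


# Sample body copied from the GET /v1/models output above.
sample = '{"data": [{"id": "relay-gguf", "object": "model"}]}'
ids = model_ids(sample)
print(ids)
```

With the servers running, `model_ids(get(base + "/v1/models")[1])` would apply the same helper to the live response.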



## 5. Test: non-streaming chat

In [6]:
req = urllib.request.Request(
    base + "/v1/chat/completions",
    data=json.dumps({
        "model": "relay-gguf",
        "messages": [{"role": "user", "content": "Hello, RelayServe!"}],
    }).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=10) as r:
    print("Status:", r.status)
    print("X-Request-ID:", r.headers.get("X-Request-ID"))
    body = r.read().decode("utf-8")
    data = json.loads(body)
    print("Reply:", data["choices"][0]["message"]["content"])
    print("id in body:", data.get("id"))
    print("relay meta:", data.get("relay"))

Status: 200
X-Request-ID: 85c220ed42c440739bc958f1288a441b
Reply: Echo from mock backend: Hello, RelayServe!
id in body: 85c220ed42c440739bc958f1288a441b
relay meta: {'device': 'cpu:arm (8 cores)', 'backend': 'llama.cpp', 'queue_ms': 13.586291985120624, 'ttft_ms': 15.841207990888506, 'batch_size': 1}
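
The `relay` object in the response carries per-request metadata. The sketch below condenses it into one line for logging. It assumes only the field names visible in the output above, and it assumes that `queue_ms` and `ttft_ms` are durations in milliseconds, which the notebook does not state explicitly. `summarize_relay_meta` is a hypothetical helper.

```python
def summarize_relay_meta(data):
    """Format the 'relay' metadata of a chat response as a one-line summary."""
    meta = data.get("relay") or {}

    def ms(key):
        # Assumption: *_ms fields are numeric millisecond values.
        value = meta.get(key)
        return f"{value:.1f} ms" if isinstance(value, (int, float)) else "n/a"

    return (
        f"backend={meta.get('backend', 'n/a')} "
        f"queue={ms('queue_ms')} ttft={ms('ttft_ms')} "
        f"batch={meta.get('batch_size', 'n/a')}"
    )


# Sample values copied from the cell output above.
sample = {
    "relay": {
        "device": "cpu:arm (8 cores)",
        "backend": "llama.cpp",
        "queue_ms": 13.586291985120624,
        "ttft_ms": 15.841207990888506,
        "batch_size": 1,
    }
}
summary = summarize_relay_meta(sample)
print(summary)
```

Missing fields fall back to `n/a`, so the helper also works on responses that carry no `relay` object.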


## 6. Test: request-id (client sends X-Request-ID)

In [None]:
req = urllib.request.Request(
    base + "/v1/chat/completions",
    data=json.dumps({
        "model": "relay-gguf",
        "messages": [{"role": "user", "content": "Hi"}],
    }).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "application/json", "X-Request-ID": "my-id-123"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=10) as r:
    print("X-Request-ID header:", r.headers.get("X-Request-ID"))
    data = json.loads(r.read().decode("utf-8"))
    print("id in body:", data.get("id"))
    assert data.get("id") == "my-id-123", "Request-ID should be echoed"
    print("✓ Request-ID echoed correctly")

X-Request-ID header: my-id-123
id in body: my-id-123
✓ Request-ID echoed correctly
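
The chat cells build nearly identical requests by hand. A possible way to factor that out is sketched below. `build_chat_request` is a hypothetical helper, and its parameters mirror only what the cells in this notebook send (`model`, `messages`, `stream`, and the optional `X-Request-ID` header).

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="relay-gguf",
                       request_id=None, stream=False):
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if stream:
        payload["stream"] = True
    headers = {"Content-Type": "application/json"}
    if request_id is not None:
        headers["X-Request-ID"] = request_id  # the server echoes this value back
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request("http://127.0.0.1:8080", "Hi", request_id="my-id-123")
# urllib normalizes header capitalization, so compare names case-insensitively.
sent_headers = {k.lower(): v for k, v in req.header_items()}
print(req.get_method(), req.full_url, sent_headers["x-request-id"])
```

The resulting object can be passed to `urllib.request.urlopen(req, timeout=10)` exactly as in the cells above.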


## 7. Test: streaming

In [None]:
req = urllib.request.Request(
    base + "/v1/chat/completions",
    data=json.dumps({
        "model": "relay-gguf",
        "messages": [{"role": "user", "content": "Count to three"}],
        "stream": True,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json", "X-Request-ID": "stream-456"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=60) as r:
    print("Content-Type:", r.headers.get("Content-Type"))
    print("X-Request-ID:", r.headers.get("X-Request-ID"))
    print("Stream (first 1500 chars):")
    chunks = []
    while True:
        line = r.readline()
        if not line:
            break
        decoded = line.decode("utf-8")
        chunks.append(decoded)
        if "data: [DONE]" in decoded:
            break
    body = "".join(chunks)
    print(body[:1500])
    if "data: [DONE]" in body:
        print("...\n✓ Stream ends with data: [DONE]")

Content-Type: text/event-stream
X-Request-ID: stream-456
Stream (first 1500 chars):
data: {"id": "stream-456", "object": "chat.completion.chunk", "model": "mock", "choices": [{"index": 0, "delta": {"content": "Echo "}, "finish_reason": null}]}

data: {"id": "stream-456", "object": "chat.completion.chunk", "model": "mock", "choices": [{"index": 0, "delta": {"content": "from "}, "finish_reason": null}]}

data: {"id": "stream-456", "object": "chat.completion.chunk", "model": "mock", "choices": [{"index": 0, "delta": {"content": "mock "}, "finish_reason": null}]}

data: {"id": "stream-456", "object": "chat.completion.chunk", "model": "mock", "choices": [{"index": 0, "delta": {"content": "backend: "}, "finish_reason": null}]}

data: {"id": "stream-456", "object": "chat.completion.chunk", "model": "mock", "choices": [{"index": 0, "delta": {"content": "Count "}, "finish_reason": null}]}

data: {"id": "stream-456", "object": "chat.completion.chunk", "model": "mock", "choices": [{"index": 0, "d
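
The raw stream is a sequence of `data:` lines, each carrying one JSON chunk whose `delta.content` holds a piece of the reply. A minimal sketch of reassembling the text is shown below. `collect_stream_text` is a hypothetical helper, and the sample lines follow the chunk shape printed above.

```python
import json


def collect_stream_text(lines):
    """Join the delta.content pieces of an SSE chat stream into one string."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample_lines = [
    'data: {"id": "stream-456", "choices": [{"index": 0, "delta": {"content": "Echo "}, "finish_reason": null}]}\n',
    "\n",
    'data: {"id": "stream-456", "choices": [{"index": 0, "delta": {"content": "from "}, "finish_reason": null}]}\n',
    "\n",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}\n',
    "\n",
    "data: [DONE]\n",
]
text = collect_stream_text(sample_lines)
print(repr(text))
```

In the streaming cell, `collect_stream_text(chunks)` would apply the same logic to the lines already collected from the response.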

## 8. Done

You’ve seen RelayServe handle:
- **Health & models** – GET endpoints
- **Non-streaming chat** – one JSON response with relay meta and usage
- **Request-ID** – client sends `X-Request-ID`, server echoes it in header and body
- **Streaming** – `stream: true` returns SSE until `data: [DONE]`

**Next (Class 2):** For **config-backed routing** (one gateway, with local and Modal backends selected by `model`), run [RelayServe_Class2_Demo.ipynb](../class2_runs/RelayServe_Class2_Demo.ipynb).

To use a **real local model**, follow **Serve_local_model.md** (llama.cpp + GGUF).