openclaw · scoootscooob · May 6, 2026
@@ -461,6 +461,26 @@ python3 scripts/run_posterior_dynamics_pipeline.py \
 clawbench diagnose profiles/local_ollama_gpt_oss.yaml
 ```
 
+### Running on Kubernetes
+
+See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
+version:
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+export OPENAI_API_KEY="sk-..."       # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
+export CLAWBENCH_MODEL="openai/gpt-5.5"
+# export MLFLOW_NAMESPACE="mlflow"   # MLflow deploys in a separate namespace (default: mlflow)
+
+./scripts/k8s/deploy.sh              # deploys OpenClaw + MLflow + starts eval
+./scripts/k8s/deploy.sh --logs       # follow progress
+./scripts/k8s/deploy.sh --teardown   # delete the eval namespace (OpenClaw + eval); MLflow is kept
+```
+
+API keys are stored in a Kubernetes Secret created by the deploy script.
+MLflow is deployed in its own namespace (default: `mlflow`, configurable via
+`MLFLOW_NAMESPACE`).
+
 ---
 
 ## Partner Trace Spec

@@ -0,0 +1,367 @@
+# Running ClawBench on Kubernetes
+
+ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
+connects to the gateway over loopback (`ws://localhost:18789`), runs the
+19-task eval suite, and optionally logs results to MLflow.
+
+```
+┌─── OpenClaw Pod ─────────────────────────────┐
+│  gateway container  (ws://localhost:18789)   │
+│  clawbench sidecar  ──► gateway via loopback │
+└──────────────────────────────────────────────┘
+         │                          │
+         ▼                          ▼
+   Model provider API         MLflow (optional)
+```
+
+All commands use `scripts/k8s/deploy.sh`. The script has these modes:
+
+| Flag | What it does |
+|------|-------------|
+| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
+| `--openclaw-only` | Deploy OpenClaw gateway only |
+| `--mlflow-only` | Deploy MLflow only |
+| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
+| `--remove-sidecar` | Remove clawbench sidecar |
+| `--logs` | Tail sidecar logs |
+| `--teardown` | Delete eval namespace (keeps MLflow) |
+
+---
+
+## Prerequisites
+
+- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
+- A container image for ClawBench (see [Building images](#building-images))
+- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
+
+For local testing with Kind, see
+[Local testing with Kind](https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind).
+
+---
+
+## Environment variables
+
+Set these **before** running `deploy.sh`.
+
+### Required
+
+| Variable | Purpose |
+|----------|---------|
+| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
+| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) |
+
+### Optional
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
+| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
+| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway |
+| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
+| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
+| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set |
+| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
+| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
+| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
+| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
+| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
+| `GEMINI_API_KEY` | | Added to K8s secret if set |
+| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
+
+### Model routing
+
+The gateway routes by provider prefix:
+
+| Model string | Required variables |
+|-------------|-------------------|
+| `openai/gpt-5.5` | `OPENAI_API_KEY` |
+| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
+| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
+| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
+
+For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
+server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
+prefix for the model name:
+
+```bash
+export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
+export OPENAI_API_KEY="none"  # dummy value if the endpoint doesn't require auth
+export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
+```
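+
+A pre-flight check along these lines can catch a model/key mismatch before
+deploying. This is only a sketch: `required_key_for_model` and
+`check_model_key` are hypothetical helpers, not part of `deploy.sh`, and the
+prefix-to-variable mapping simply mirrors the routing table above.
+
+```bash
+# Hypothetical helpers (not part of deploy.sh).
+# Print the env var that the routing table lists for a model's provider prefix.
+required_key_for_model() {
+  case "${1%%/*}" in            # provider prefix = text before the first "/"
+    openai)     echo OPENAI_API_KEY ;;
+    anthropic)  echo ANTHROPIC_API_KEY ;;
+    openrouter) echo OPENROUTER_API_KEY ;;
+    *)          return 1 ;;     # prefix not covered by the table
+  esac
+}
+
+# Succeed only if the key required by the given model is exported and non-empty.
+check_model_key() {
+  key_name=$(required_key_for_model "$1") || {
+    echo "unknown provider prefix in '$1'" >&2
+    return 1
+  }
+  if [ -z "$(printenv "$key_name")" ]; then
+    echo "$key_name must be exported for model '$1'" >&2
+    return 1
+  fi
+}
+
+# Example:
+# check_model_key "${CLAWBENCH_MODEL:-openai/gpt-5.5}" && ./scripts/k8s/deploy.sh
+```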
+
+---
+
+## Full deploy (quick start)
+
+Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+
+# Export API keys before running. The script stores them in a K8s Secret
+# ("clawbench-secrets") that the gateway and sidecar containers read.
+export OPENAI_API_KEY="sk-..."
+
+# Model to evaluate (default: openai/gpt-5.5)
+# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
+
+./scripts/k8s/deploy.sh
+```
+
+Verify:
+
+```bash
+# Should show 2/2 containers (gateway + clawbench)
+kubectl get pods -n clawbench-eval
+
+# Follow eval progress
+./scripts/k8s/deploy.sh --logs
+```
+
+When the eval finishes, copy results and clean up:
+
+```bash
+# Copy results from the sidecar
+POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
+kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
+
+# Remove the sidecar (keeps OpenClaw + MLflow running)
+./scripts/k8s/deploy.sh --remove-sidecar
+
+# Or delete the whole eval namespace (MLflow keeps running)
+./scripts/k8s/deploy.sh --teardown
+```
+
+---
+
+## Existing cluster + existing MLflow
+
+If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
+you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
+required.
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+
+# API keys — export before running deploy.sh. The script creates a
+# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
+# At least one provider key is required.
+export OPENAI_API_KEY="sk-..."
+# export ANTHROPIC_API_KEY="sk-ant-..."
+# export OPENROUTER_API_KEY="sk-or-..."
+# export GEMINI_API_KEY="..."
+
+# Model to evaluate (default: openai/gpt-5.5). The provider prefix must match
+# an exported key; for example, anthropic/... models need ANTHROPIC_API_KEY.
+export CLAWBENCH_MODEL="openai/gpt-5.5"
+
+# If attaching to an existing OpenClaw gateway, this must match that gateway.
+# If deploy.sh creates OpenClaw, it generates this token for you.
+# export OPENCLAW_GATEWAY_TOKEN="..."
+
+# Point to your existing MLflow
+export MLFLOW_TRACKING_URI="https://mlflow.example.com"
+export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5"  # or use MLFLOW_EXPERIMENT_ID=42
+
+# Deploy OpenClaw gateway into your cluster
+./scripts/k8s/deploy.sh --openclaw-only
+```
+
+Verify OpenClaw is running:
+
+```bash
+kubectl get pods -n clawbench-eval
+# Expect: openclaw-xxxx  1/1  Running
+```
+
+Then start the eval:
+
+```bash
+./scripts/k8s/deploy.sh --add-sidecar
+./scripts/k8s/deploy.sh --logs
+```
+
+Because `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow
+deployment and patches the experiment name/ID into the clawbench ConfigMap.
+When the eval completes, `scripts/log_to_mlflow.py` logs results to your
+MLflow instance under that experiment.
+
+`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
+`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
+
+---
+
+## Step-by-step deploy
+
+Use this when you want to deploy components individually or bring your own
+OpenClaw/MLflow.
+
+### Step 1: Deploy OpenClaw gateway
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+export OPENAI_API_KEY="sk-..."
+./scripts/k8s/deploy.sh --openclaw-only
+```
+
+Verify:
+
+```bash
+kubectl get pods -n clawbench-eval
+# Expect: openclaw-xxxx  1/1  Running
+```
+
+This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
+auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
+token and creates the `clawbench-secrets` Secret automatically.
+
+**Skip this step** if you already have an OpenClaw deployment. Your existing
+gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
+
+```json
+{
+  "browser": {
+    "enabled": true,
+    "headless": true,
+    "noSandbox": true,
+    "ssrfPolicy": {
+      "allowedHostnames": ["localhost", "127.0.0.1"]
+    }
+  },
+  "tools": {
+    "profile": "coding",
+    "alsoAllow": ["browser"]
+  }
+}
+```
+
+Key requirements:
+- `browser.enabled: true` — activates the bundled browser plugin
+- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
+- `browser.ssrfPolicy` — several eval tasks need localhost access
+- Gateway must bind to loopback with token auth; export the matching
+  `OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar`
+
+### Step 2: Deploy MLflow
+
+```bash
+./scripts/k8s/deploy.sh --mlflow-only
+```
+
+Verify:
+
+```bash
+kubectl get pods -n mlflow
+# Expect: mlflow-xxxx  1/1  Running
+```
+
+This deploys a single-replica MLflow server with a SQLite backend into the
+`mlflow` namespace. The clawbench ConfigMap defaults to
+`http://mlflow-service.mlflow.svc.cluster.local:5000`.
+
+**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
+
+```bash
+export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
+export MLFLOW_EXPERIMENT_ID=4  # or MLFLOW_EXPERIMENT_NAME
+```
+
+### Step 3: Run the eval
+
+```bash
+./scripts/k8s/deploy.sh --add-sidecar
+```
+
+This patches the OpenClaw deployment to inject a clawbench sidecar that:
+
+1. Waits for the gateway (TCP check on port 18789, up to 3 min)
+2. Checks MLflow connectivity if configured
+3. Runs `clawbench run` with settings from the ConfigMap
+4. Logs results to MLflow on success
+5. Sleeps indefinitely so you can retrieve logs and results
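+
+Step 1 is a plain TCP readiness wait. A minimal sketch of such a loop, assuming
+bash (for its `/dev/tcp` redirection); `wait_for_tcp` is a hypothetical name,
+and the real sidecar entrypoint may be implemented differently:
+
+```bash
+# Hypothetical sketch of a TCP readiness wait; requires bash for /dev/tcp.
+# Usage: wait_for_tcp HOST PORT TIMEOUT_SECONDS
+wait_for_tcp() {
+  local host=$1 port=$2 timeout=$3 waited=0
+  # Try to open a TCP connection in a subshell; the socket closes when it exits.
+  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
+    if [ "$waited" -ge "$timeout" ]; then
+      echo "gateway not reachable on ${host}:${port} after ${timeout}s" >&2
+      return 1
+    fi
+    sleep 1
+    waited=$((waited + 1))
+  done
+}
+
+# Example: wait up to 3 minutes for the gateway on loopback
+# wait_for_tcp localhost 18789 180 && echo "Starting eval..."
+```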
+
+Verify:
+
+```bash
+kubectl get pods -n $CLAWBENCH_NAMESPACE
+# Expect: openclaw-xxxx  2/2  Running  (gateway + clawbench)
+
+./scripts/k8s/deploy.sh --logs
+# Should show "Waiting for gateway..." then "Starting eval..."
+```
+
+When finished, remove the sidecar:
+
+```bash
+./scripts/k8s/deploy.sh --remove-sidecar
+```
+
+---
+
+## ConfigMap tuning
+
+The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
+behavior. Override at deploy time via env vars, or patch after deploy:
+
+| Key | Default | What it controls |
+|-----|---------|-----------------|
+| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
+| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks × 3 runs = 57 runs total) |
+| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
+| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
+| `CLAWBENCH_TASKS` | *(empty — runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
+| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
+| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
+| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
+| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
+| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
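+
+For the patch-after-deploy route, something like the following may work. It is
+a sketch under assumptions: the ConfigMap name `clawbench-config` is a
+placeholder (check `scripts/k8s/manifests/configmap.yaml` for the real name),
+and `configmap_patch` is a hypothetical helper that only builds the
+merge-patch JSON.
+
+```bash
+# Hypothetical helper: build a merge patch that sets one ConfigMap data key.
+# Values are assumed not to contain double quotes or backslashes.
+configmap_patch() {
+  printf '{"data":{"%s":"%s"}}' "$1" "$2"
+}
+
+# Example (the ConfigMap name is a placeholder):
+# kubectl patch configmap clawbench-config -n "$CLAWBENCH_NAMESPACE" \
+#   --type merge -p "$(configmap_patch CLAWBENCH_RUNS 1)"
+```
+
+ConfigMap values consumed as environment variables are only read when a
+container starts, so a sidecar that is already running would likely need to be
+restarted before it sees the change.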
+
+---
+
+## MLflow integration
+
+Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
+
+**What gets logged:**
+- **Params**: model, provider, benchmark version, OpenClaw version, judge model
+- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
+  reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
+- **Tags**: submission ID, timestamp, certified flag
+- **Artifacts**: full benchmark result JSON
+
+---
+
+## Building images
+
+### ClawBench image
+
+The default image, `quay.io/sallyom/clawbench:latest`, is public.
+
+To build your own image for Kubernetes, use the lightweight sidecar Dockerfile
+(`scripts/k8s/Dockerfile`); it includes only the eval harness and the MLflow client:
+
+```bash
+docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
+
+# For Kind clusters, load directly instead of pushing to a registry:
+kind load docker-image clawbench:latest --name openclaw
+
+# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
+# Ensure you build for the right architecture, usually amd64 for non-local k8s
+```
+
+Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
+
+---
+
+## Cleanup
+
+```bash
+# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
+./scripts/k8s/deploy.sh --remove-sidecar
+
+# Delete eval namespace (keeps MLflow running)
+./scripts/k8s/deploy.sh --teardown
+
+# Delete the Kind cluster entirely
+kind delete cluster --name openclaw
+```