diff --git a/README.md b/README.md index 026c67e..cad482b 100644 --- a/README.md +++ b/README.md @@ -461,6 +461,26 @@ python3 scripts/run_posterior_dynamics_pipeline.py \ clawbench diagnose profiles/local_ollama_gpt_oss.yaml ``` +### Running on Kubernetes + +See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short +version: + +```bash +export CLAWBENCH_NAMESPACE=clawbench-eval +export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc. +export CLAWBENCH_MODEL="openai/gpt-5.5" +# export MLFLOW_NAMESPACE="mlflow" # MLflow deploys in a separate namespace (default: mlflow) + +./scripts/k8s/deploy.sh # deploys OpenClaw + MLflow + starts eval +./scripts/k8s/deploy.sh --logs # follow progress +./scripts/k8s/deploy.sh --teardown # tear down openclaw & eval (does not delete MLflow) +``` + +API keys are stored in a Kubernetes Secret created by the deploy script. +MLflow is deployed in its own namespace (default: `mlflow`, configurable via +`MLFLOW_NAMESPACE`). + --- ## Partner Trace Spec diff --git a/docs/kubernetes.md b/docs/kubernetes.md new file mode 100644 index 0000000..cf5e4af --- /dev/null +++ b/docs/kubernetes.md @@ -0,0 +1,367 @@ +# Running ClawBench on Kubernetes + +ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar +connects to the gateway over loopback (`ws://localhost:18789`), runs the +19-task eval suite, and optionally logs results to MLflow. + +``` +┌─── OpenClaw Pod ─────────────────────────────┐ +│ gateway container (ws://localhost:18789) │ +│ clawbench sidecar ──► gateway via loopback │ +└──────────────────────────────────────────────┘ + │ │ + ▼ ▼ + Model provider API MLflow (optional) +``` + +All commands use `scripts/k8s/deploy.sh`. 
The script has these modes: + +| Flag | What it does | +|------|-------------| +| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar | +| `--openclaw-only` | Deploy OpenClaw gateway only | +| `--mlflow-only` | Deploy MLflow only | +| `--add-sidecar` | Inject clawbench sidecar (starts eval) | +| `--remove-sidecar` | Remove clawbench sidecar | +| `--logs` | Tail sidecar logs | +| `--teardown` | Delete eval namespace (keeps MLflow) | + +--- + +## Prerequisites + +- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds) +- A container image for ClawBench (see [Building images](#building-images)) +- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) + +For local testing with Kind: +https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind + +--- + +## Environment variables + +Set these **before** running `deploy.sh`. + +### Required + +| Variable | Purpose | +|----------|---------| +| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. 
`clawbench-eval`) | +| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) | + +### Optional + +| Variable | Default | Purpose | +|----------|---------|---------| +| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image | +| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image | +| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway | +| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate | +| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace | +| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set | +| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID | +| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name | +| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image | +| `ANTHROPIC_API_KEY` | | Added to K8s secret if set | +| `OPENROUTER_API_KEY` | | Added to K8s secret if set | +| `GEMINI_API_KEY` | | Added to K8s secret if set | +| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. 
vLLM, Ollama); patched into gateway config | + +### Model routing + +The gateway routes by provider prefix: + +| Model string | Required variables | +|-------------|-------------------| +| `openai/gpt-5.5` | `OPENAI_API_KEY` | +| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` | +| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` | +| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` | + +For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model +server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/` +prefix for the model name: + +```bash +export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B" +export OPENAI_API_KEY="none" # dummy value if the endpoint doesn't require auth +export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1" +``` + +--- + +## Full deploy (quick start) + +Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command. + +```bash +export CLAWBENCH_NAMESPACE=clawbench-eval + +# Export API keys before running. The script stores them in a K8s Secret +# ("clawbench-secrets") that the gateway and sidecar containers read. +export OPENAI_API_KEY="sk-..." 
+ +# Model to evaluate (default: openai/gpt-5.5) +# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6" + +./scripts/k8s/deploy.sh +``` + +Verify: + +```bash +# Should show 2/2 containers (gateway + clawbench) +kubectl get pods -n clawbench-eval + +# Follow eval progress +./scripts/k8s/deploy.sh --logs +``` + +When the eval finishes, copy results and clean up: + +```bash +# Copy results from the sidecar +POD=$(kubectl get pod -n "$CLAWBENCH_NAMESPACE" -l app=openclaw -o jsonpath='{.items[0].metadata.name}') +kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json + +# Remove the sidecar (keeps OpenClaw + MLflow running) +./scripts/k8s/deploy.sh --remove-sidecar + +# Or delete the eval namespace (MLflow keeps running) +./scripts/k8s/deploy.sh --teardown +``` + +--- + +## Existing cluster + existing MLflow + +If you already have an OpenShift or Kubernetes cluster and an MLflow instance, +you only need to deploy OpenClaw and run the eval; no cluster or MLflow setup +is required. + +```bash +export CLAWBENCH_NAMESPACE=clawbench-eval + +# API keys: export before running deploy.sh. The script creates a +# Kubernetes Secret ("clawbench-secrets") from whichever keys are set. +# At least one provider key is required. +export OPENAI_API_KEY="sk-..." +# export ANTHROPIC_API_KEY="sk-ant-..." +# export OPENROUTER_API_KEY="sk-or-..." +# export GEMINI_API_KEY="..." + +# Model to evaluate (default: openai/gpt-5.5); its provider prefix must match a key exported above +export CLAWBENCH_MODEL="openai/gpt-5.5" + +# If attaching to an existing OpenClaw gateway, this must match that gateway. +# If deploy.sh creates OpenClaw, it generates this token for you. +# export OPENCLAW_GATEWAY_TOKEN="..." 
+ +# Point to your existing MLflow +export MLFLOW_TRACKING_URI="https://mlflow.example.com" +export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5" # or use MLFLOW_EXPERIMENT_ID=42 + +# Deploy OpenClaw gateway into your cluster +./scripts/k8s/deploy.sh --openclaw-only +``` + +Verify OpenClaw is running: + +```bash +kubectl get pods -n clawbench-eval +# Expect: openclaw-xxxx 1/1 Running +``` + +Then start the eval: + +```bash +./scripts/k8s/deploy.sh --add-sidecar +./scripts/k8s/deploy.sh --logs +``` + +Because `MLFLOW_TRACKING_URI` is set, the deploy script never deploys its own +MLflow; `--add-sidecar` patches the URI and the experiment name/ID into the +clawbench ConfigMap. When the eval completes successfully, +`scripts/log_to_mlflow.py` logs results to your MLflow under that experiment. + +With `MLFLOW_EXPERIMENT_NAME`, the experiment is created if it doesn't exist. +`MLFLOW_EXPERIMENT_ID` requires an existing experiment. + +--- + +## Step-by-step deploy + +Use this when you want to deploy components individually or bring your own +OpenClaw/MLflow. + +### Step 1: Deploy OpenClaw gateway + +```bash +export CLAWBENCH_NAMESPACE=clawbench-eval +export OPENAI_API_KEY="sk-..." +./scripts/k8s/deploy.sh --openclaw-only +``` + +Verify: + +```bash +kubectl get pods -n clawbench-eval +# Expect: openclaw-xxxx 1/1 Running +``` + +This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token +auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway +token and creates the `clawbench-secrets` Secret automatically. + +**Skip this step** if you already have an OpenClaw deployment. 
Your existing +gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`): + +```json +{ + "browser": { + "enabled": true, + "headless": true, + "noSandbox": true, + "ssrfPolicy": { + "allowedHostnames": ["localhost", "127.0.0.1"] + } + }, + "tools": { + "profile": "coding", + "alsoAllow": ["browser"] + } +} +``` + +Key requirements: +- `browser.enabled: true` — activates the bundled browser plugin +- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default +- `browser.ssrfPolicy` — several eval tasks need localhost access +- Gateway must bind to loopback with token auth; export the matching + `OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar` + +### Step 2: Deploy MLflow + +```bash +./scripts/k8s/deploy.sh --mlflow-only +``` + +Verify: + +```bash +kubectl get pods -n mlflow +# Expect: mlflow-xxxx 1/1 Running +``` + +Deploys a single-replica MLflow server with SQLite backend into the `mlflow` +namespace. The clawbench ConfigMap defaults to +`http://mlflow-service.mlflow.svc.cluster.local:5000`. + +**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`: + +```bash +export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000 +export MLFLOW_EXPERIMENT_ID=4 # or MLFLOW_EXPERIMENT_NAME +``` + +### Step 3: Run the eval + +```bash +./scripts/k8s/deploy.sh --add-sidecar +``` + +This patches the OpenClaw deployment to inject a clawbench sidecar that: + +1. Waits for the gateway (TCP check on port 18789, up to 3 min) +2. Checks MLflow connectivity if configured +3. Runs `clawbench run` with settings from the ConfigMap +4. Logs results to MLflow on success +5. Sleeps indefinitely so you can retrieve logs and results + +Verify: + +```bash +kubectl get pods -n $CLAWBENCH_NAMESPACE +# Expect: openclaw-xxxx 2/2 Running (gateway + clawbench) + +./scripts/k8s/deploy.sh --logs +# Should show "Waiting for gateway..." then "Starting eval..." 
+``` + +When finished, remove the sidecar: + +```bash +./scripts/k8s/deploy.sh --remove-sidecar +``` + +--- + +## ConfigMap tuning + +The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval +behavior. `deploy.sh` patches the model and MLflow keys from env vars; edit the manifest or patch the ConfigMap for other keys: + +| Key | Default | What it controls | +|-----|---------|-----------------| +| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test | +| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) | +| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes | +| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) | +| `CLAWBENCH_TASKS` | *(empty: runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) | +| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds | +| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds | +| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run in seconds | +| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint | +| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name | + +--- + +## MLflow integration + +Results are logged via `scripts/log_to_mlflow.py` after a successful eval. + +**What gets logged:** +- **Params**: model, provider, benchmark version, OpenClaw version, judge model +- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior, + reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores +- **Tags**: submission ID, timestamp, certified flag +- **Artifacts**: full benchmark result JSON + +--- + +## Building images + +### ClawBench image + +`quay.io/sallyom/clawbench:latest` is public and is the default `CLAWBENCH_IMAGE`. + +To build your own image for Kubernetes, use the lightweight sidecar Dockerfile; +it includes only the eval harness and the MLflow client: + +```bash +docker build -t clawbench:latest -f scripts/k8s/Dockerfile . 
+ +# For Kind clusters, load directly instead of pushing to a registry: +kind load docker-image clawbench:latest --name openclaw + +# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly +# Ensure you build for the right architecture, usually amd64 for non-local k8s +``` + +Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it. + +--- + +## Cleanup + +```bash +# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval) +./scripts/k8s/deploy.sh --remove-sidecar + +# Delete eval namespace (keeps MLflow running) +./scripts/k8s/deploy.sh --teardown + +# Delete the Kind cluster entirely +kind delete cluster --name openclaw +``` diff --git a/pyproject.toml b/pyproject.toml index 84b701e..a952a1c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -33,6 +33,9 @@ dev = [ "pre-commit>=4.0,<5", "ruff>=0.9,<1", ] +mlflow = [ + "mlflow>=2.10,<3", +] hermes = [ "hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main", ] diff --git a/scripts/k8s/Dockerfile b/scripts/k8s/Dockerfile new file mode 100644 index 0000000..c987d4c --- /dev/null +++ b/scripts/k8s/Dockerfile @@ -0,0 +1,33 @@ +# Lightweight ClawBench image for Kubernetes sidecar use. +# Does NOT include the full OpenClaw server or Chromium — the gateway runs +# in a separate container. Node.js is copied from the OpenClaw image for +# the device-identity handshake required by the gateway protocol. 
+FROM ghcr.io/openclaw/openclaw:latest AS openclaw + +FROM python:3.12-slim + +COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node + +RUN apt-get update && \ + apt-get install -y --no-install-recommends git && \ + rm -rf /var/lib/apt/lists/* + +WORKDIR /app + +COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./ +COPY clawbench/ clawbench/ +COPY tasks-public/ tasks-public/ +COPY tasks-domain/ tasks-domain/ +COPY profiles/ profiles/ +COPY baselines/ baselines/ +COPY scripts/ scripts/ + +RUN pip install --no-cache-dir ".[mlflow]" + +RUN mkdir -p /results && chmod 777 /results + +RUN useradd -m -d /home/node clawbench +USER clawbench +ENV HOME=/home/node + +ENTRYPOINT ["clawbench"] diff --git a/scripts/k8s/deploy.sh b/scripts/k8s/deploy.sh new file mode 100755 index 0000000..31b7a8b --- /dev/null +++ b/scripts/k8s/deploy.sh @@ -0,0 +1,486 @@ +#!/usr/bin/env bash +# Deploy ClawBench evals on Kubernetes (works on OpenShift too). +# +# 0-to-hero pipeline: +# Step 0: Create a cluster (see --help for Kind instructions) +# Step 1: Deploy OpenClaw gateway (optional — bring your own) +# Step 2: Deploy MLflow tracking server (optional — bring your own) +# Step 3: Run evals via sidecar (add / remove) +# +# Usage: +# ./scripts/k8s/deploy.sh # Full deploy: OpenClaw + MLflow + eval +# ./scripts/k8s/deploy.sh --openclaw-only # Step 1: deploy OpenClaw gateway +# ./scripts/k8s/deploy.sh --mlflow-only # Step 2: deploy MLflow +# ./scripts/k8s/deploy.sh --add-sidecar # Step 3: add eval sidecar (starts eval) +# ./scripts/k8s/deploy.sh --remove-sidecar # Step 3: remove eval sidecar +# ./scripts/k8s/deploy.sh --logs # Tail clawbench sidecar logs +# ./scripts/k8s/deploy.sh --teardown # Delete eval namespace (keeps MLflow) +# +# Environment (required): +# CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval +# OPENAI_API_KEY Model provider API key (or another provider key) +# +# Environment (optional): +# CLAWBENCH_IMAGE Clawbench image (default: 
quay.io/sallyom/clawbench:latest) +# OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest) +# OPENCLAW_GATEWAY_TOKEN Existing gateway token (generated if unset) +# CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5) +# MLFLOW_NAMESPACE MLflow namespace (default: mlflow) +# MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy if set) +# MLFLOW_EXPERIMENT_ID MLflow experiment ID +# MLFLOW_EXPERIMENT_NAME MLflow experiment name +# MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3) +# ANTHROPIC_API_KEY Anthropic key (added to secret if set) +# OPENROUTER_API_KEY OpenRouter key (added to secret if set) +# GEMINI_API_KEY Gemini key (added to secret if set) +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +NS="${CLAWBENCH_NAMESPACE:-}" +MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}" +CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}" +OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}" +MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}" + +# --------------------------------------------------------------------------- +if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then + cat <<'HELP' +ClawBench Kubernetes Deployment +=============================== + +0-to-hero pipeline for running ClawBench evals on Kubernetes. 
+ + Step 0: Create a cluster + For local testing with Kind, see: + https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind + + Step 1: Deploy OpenClaw gateway (optional — skip if you have one) + Step 2: Deploy MLflow tracking server (optional — skip if you have one) + Step 3: Run evals via sidecar (add/remove to OpenClaw deployment) + +Usage: + ./scripts/k8s/deploy.sh Full deploy (steps 1+2+3) + ./scripts/k8s/deploy.sh --openclaw-only Step 1: OpenClaw only + ./scripts/k8s/deploy.sh --mlflow-only Step 2: MLflow only + ./scripts/k8s/deploy.sh --add-sidecar Step 3: add eval sidecar (starts eval) + ./scripts/k8s/deploy.sh --remove-sidecar Step 3: remove eval sidecar + ./scripts/k8s/deploy.sh --logs Tail clawbench sidecar logs + ./scripts/k8s/deploy.sh --teardown Delete eval namespace (keeps MLflow) + +Required environment: + CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval + OPENAI_API_KEY Model provider API key (or ANTHROPIC_API_KEY, etc.) + +Optional environment: + CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest) + OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest) + OPENCLAW_GATEWAY_TOKEN Existing gateway token (generated if unset) + CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5) + MLFLOW_NAMESPACE MLflow namespace (default: mlflow) + MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy) + MLFLOW_EXPERIMENT_ID MLflow experiment ID + MLFLOW_EXPERIMENT_NAME MLflow experiment name + MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3) + ANTHROPIC_API_KEY Anthropic key (added to secret if set) + OPENROUTER_API_KEY OpenRouter key (added to secret if set) + GEMINI_API_KEY Gemini key (added to secret if set) + +Works on Kubernetes and OpenShift. +HELP + exit 0 +fi + +command -v kubectl &>/dev/null || { echo "Missing: kubectl" >&2; exit 1; } + +if [[ -z "$NS" ]]; then + echo "CLAWBENCH_NAMESPACE is required." 
>&2 + echo " export CLAWBENCH_NAMESPACE=clawbench-eval" >&2 + exit 1 +fi + +MODE="full" +while [[ $# -gt 0 ]]; do + case "$1" in + --openclaw-only) MODE="openclaw-only" ;; + --mlflow-only) MODE="mlflow-only" ;; + --add-sidecar) MODE="add-sidecar" ;; + --remove-sidecar) MODE="remove-sidecar" ;; + --logs) MODE="logs" ;; + --teardown) MODE="teardown" ;; + *) echo "Unknown option: $1" >&2; exit 1 ;; + esac + shift +done + +kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; } + +# --------------------------------------------------------------------------- +# --logs +# --------------------------------------------------------------------------- +if [[ "$MODE" == "logs" ]]; then + kubectl logs deploy/openclaw -c clawbench -n "$NS" -f + exit 0 +fi + +# --------------------------------------------------------------------------- +# --teardown +# --------------------------------------------------------------------------- +if [[ "$MODE" == "teardown" ]]; then + echo "Deleting namespace '$NS'..." + kubectl delete namespace "$NS" --ignore-not-found + echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted." + exit 0 +fi + +# --------------------------------------------------------------------------- +# --remove-sidecar +# --------------------------------------------------------------------------- +if [[ "$MODE" == "remove-sidecar" ]]; then + echo "Removing clawbench sidecar from openclaw in namespace '$NS'..." + INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \ + | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))") + if [[ "$INDEX" == "-1" ]]; then + echo "No clawbench sidecar found." + else + kubectl patch deploy/openclaw -n "$NS" --type=json \ + -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" + echo "Sidecar removed." 
+ fi + exit 0 +fi + +# --------------------------------------------------------------------------- +# Create namespace + secret +# --------------------------------------------------------------------------- +ensure_namespace_and_secret() { + if ! kubectl get namespace "$NS" &>/dev/null; then + echo "Creating namespace '$NS'..." + kubectl create namespace "$NS" + fi + + if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then + echo "Creating clawbench-secrets..." + if [[ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]]; then + GATEWAY_TOKEN="$OPENCLAW_GATEWAY_TOKEN" + GATEWAY_TOKEN_SOURCE="from OPENCLAW_GATEWAY_TOKEN" + else + GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())") + GATEWAY_TOKEN_SOURCE="generated" + fi + + SECRET_ARGS=( + --from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN" + ) + [[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY") + [[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY") + [[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY") + [[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY") + + if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then + echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2 + fi + + kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}" + echo " Gateway token: $GATEWAY_TOKEN_SOURCE" + [[ -n "${OPENAI_API_KEY:-}" ]] && echo " OPENAI_API_KEY: set" + [[ -n "${ANTHROPIC_API_KEY:-}" ]] && echo " ANTHROPIC_API_KEY: set" + [[ -n "${OPENROUTER_API_KEY:-}" ]] && echo " OPENROUTER_API_KEY: set" + [[ -n "${GEMINI_API_KEY:-}" ]] && echo " GEMINI_API_KEY: set" + else + echo "Secret clawbench-secrets already exists in '$NS'." 
+ fi + return 0 +} + +# --------------------------------------------------------------------------- +# Step 1: Deploy OpenClaw +# --------------------------------------------------------------------------- +deploy_openclaw() { + echo "" + echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..." + + kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS" + + # Patch gateway config with custom OpenAI-compatible base URL + if [[ -n "${OPENAI_API_BASE:-}" ]]; then + echo " Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE" + EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}') + PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c " +import json, sys, os +cfg = json.load(sys.stdin) +openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {}) +openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE'] +openai_cfg.setdefault('models', []) +json.dump(cfg, sys.stdout, indent=2) +") + kubectl create configmap openclaw-config -n "$NS" \ + --from-literal="openclaw.json=$PATCHED_JSON" \ + --dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null + fi + + kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS" + kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS" + + if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then + kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS" + kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS" + else + kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS" + fi + + echo "Waiting for OpenClaw rollout..." + kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \ + echo " (rollout still in progress)" + echo "OpenClaw deployed." 
+} + +# --------------------------------------------------------------------------- +# Step 2: Deploy MLflow +# --------------------------------------------------------------------------- +deploy_mlflow() { + if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then + echo "" + echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)" + return + fi + + echo "" + echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..." + + if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then + kubectl create namespace "$MLFLOW_NS" + fi + + kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS" + kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS" + + if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then + kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS" + kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS" + else + kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS" + fi + + echo "Waiting for MLflow rollout..." + kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \ + echo " (rollout still in progress)" + + MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000" + echo "MLflow deployed: $MLFLOW_TRACKING_URI" +} + +# --------------------------------------------------------------------------- +# Step 3: Add clawbench sidecar (starts eval) +# --------------------------------------------------------------------------- +add_sidecar() { + echo "" + echo "Step 3: Adding clawbench eval sidecar..." + + echo "Applying clawbench ConfigMap..." 
+ kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null + + if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then + kubectl patch configmap clawbench-config -n "$NS" \ + --type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null + echo " Model: $CLAWBENCH_MODEL" + fi + + if [[ -n "${OPENAI_API_BASE:-}" ]]; then + kubectl patch configmap clawbench-config -n "$NS" \ + --type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null + echo " OpenAI API base: $OPENAI_API_BASE" + fi + + # Patch MLflow settings into ConfigMap + PATCH_DATA="" + MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}" + PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\"" + if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then + PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\"" + fi + if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then + PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\"" + fi + kubectl patch configmap clawbench-config -n "$NS" \ + --type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null + echo " MLflow URI: $MLFLOW_URI" + [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo " MLflow experiment ID: $MLFLOW_EXPERIMENT_ID" + [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo " MLflow experiment name: $MLFLOW_EXPERIMENT_NAME" + + # Check if sidecar already exists + HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \ + | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')") + + if [[ "$HAS_SIDECAR" == "yes" ]]; then + echo "Removing existing clawbench sidecar..." 
+ INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \ + | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))") + kubectl patch deploy/openclaw -n "$NS" --type=json \ + -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null + fi + + # Find the OpenClaw home volume, and capture existing volumes so add-sidecar + # also works with bring-your-own deployments that lack this repo's PVC layout. + VOLUME_INFO=$(kubectl get deploy/openclaw -n "$NS" -o json \ + | python3 -c " +import json, sys +spec = json.load(sys.stdin)['spec']['template']['spec'] +volume_names = [v.get('name') for v in spec.get('volumes', []) if v.get('name')] +home_volume = 'openclaw-home' +for c in spec['containers']: + if c['name'] == 'gateway': + for vm in c.get('volumeMounts', []): + if vm['mountPath'] == '/home/node/.openclaw': + home_volume = vm['name'] + break +print(json.dumps({ + 'home_volume': home_volume, + 'volumes_present': 'volumes' in spec, + 'volume_names': volume_names, +})) +") + + echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..." + + PATCH=$(VOLUME_INFO="$VOLUME_INFO" CLAWBENCH_IMG="$CLAWBENCH_IMG" python3 - <<'PY' +import json +import os + +info = json.loads(os.environ["VOLUME_INFO"]) +home_volume = info["home_volume"] + +command = r"""echo "Waiting for gateway on localhost:18789..." +for i in $(seq 1 90); do + python3 -c "import socket; s=socket.create_connection((\"127.0.0.1\",18789),2); s.close()" 2>/dev/null && echo "Gateway ready" && break + sleep 2 +done + +if [ -n "${MLFLOW_TRACKING_URI:-}" ]; then + echo "Checking MLflow at ${MLFLOW_TRACKING_URI}..." + python3 -c "import httpx,os; r=httpx.get(os.environ[\"MLFLOW_TRACKING_URI\"]+\"/health\"); print(\"MLflow OK:\",r.status_code)" 2>&1 || echo "MLflow pre-check failed (will retry at log time)" +fi + +echo "Starting eval..." 
+clawbench run \ + --model "${CLAWBENCH_MODEL}" \ + --gateway-token "${OPENCLAW_GATEWAY_TOKEN}" \ + --runs "${CLAWBENCH_RUNS}" \ + --concurrency "${CLAWBENCH_CONCURRENCY}" \ + ${CLAWBENCH_JUDGE_MODEL:+--judge-model "${CLAWBENCH_JUDGE_MODEL}"} \ + $([ -n "${CLAWBENCH_TASKS:-}" ] && for t in ${CLAWBENCH_TASKS}; do printf -- "-t %s " "$t"; done) \ + -o /results/benchmark.json +RC=$? +if [ $RC -eq 0 ] && [ -n "${MLFLOW_TRACKING_URI:-}" ]; then + python scripts/log_to_mlflow.py /results/benchmark.json +fi +echo "ClawBench finished (exit=$RC)" +sleep infinity""" + +container = { + "name": "clawbench", + "image": os.environ["CLAWBENCH_IMG"], + "imagePullPolicy": "IfNotPresent", + "command": ["/bin/bash", "-c", command], + "envFrom": [{"configMapRef": {"name": "clawbench-config"}}], + "env": [ + { + "name": "OPENCLAW_GATEWAY_TOKEN", + "valueFrom": { + "secretKeyRef": { + "name": "clawbench-secrets", + "key": "OPENCLAW_GATEWAY_TOKEN", + } + }, + } + ], + "resources": { + "requests": {"memory": "1Gi", "cpu": "500m"}, + "limits": {"memory": "4Gi", "cpu": "2"}, + }, + "volumeMounts": [ + {"name": home_volume, "mountPath": "/home/node/.openclaw"}, + {"name": "clawbench-results", "mountPath": "/results"}, + {"name": "tmp-volume", "mountPath": "/tmp"}, + ], + "securityContext": { + "allowPrivilegeEscalation": False, + "capabilities": {"drop": ["ALL"]}, + }, +} + +patch = [{"op": "add", "path": "/spec/template/spec/containers/-", "value": container}] + +existing_volumes = set(info["volume_names"]) +required_volumes = [ + {"name": home_volume, "emptyDir": {}}, + {"name": "clawbench-results", "emptyDir": {}}, + {"name": "tmp-volume", "emptyDir": {}}, +] +missing_volumes = [] +for volume in required_volumes: + if volume["name"] not in existing_volumes and volume["name"] not in { + item["name"] for item in missing_volumes + }: + missing_volumes.append(volume) + +if missing_volumes: + if info["volumes_present"]: + patch.extend( + {"op": "add", "path": "/spec/template/spec/volumes/-", 
"value": volume} + for volume in missing_volumes + ) + else: + patch.append( + {"op": "add", "path": "/spec/template/spec/volumes", "value": missing_volumes} + ) + +print(json.dumps(patch)) +PY +) + + kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null + + echo "" + echo "Waiting for rollout..." + kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \ + echo " (rollout timeout — eval runs for 30-60 min)" + + echo "" + echo "Eval is running. Follow logs with:" + echo " ./scripts/k8s/deploy.sh --logs" + echo "" + echo "When finished, remove the sidecar with:" + echo " ./scripts/k8s/deploy.sh --remove-sidecar" +} + +# --------------------------------------------------------------------------- +# Execute +# --------------------------------------------------------------------------- +case "$MODE" in + full) + ensure_namespace_and_secret + deploy_openclaw + deploy_mlflow + add_sidecar + ;; + openclaw-only) + ensure_namespace_and_secret + deploy_openclaw + echo "" + echo "OpenClaw is running. Next steps:" + echo " ./scripts/k8s/deploy.sh --mlflow-only # Deploy MLflow" + echo " ./scripts/k8s/deploy.sh --add-sidecar # Start eval" + ;; + mlflow-only) + deploy_mlflow + ;; + add-sidecar) + if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then + echo "Deployment 'openclaw' not found in namespace '$NS'." 
>&2 + echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2 + exit 1 + fi + ensure_namespace_and_secret + add_sidecar + ;; +esac diff --git a/scripts/k8s/manifests/configmap.yaml b/scripts/k8s/manifests/configmap.yaml new file mode 100644 index 0000000..04d379a --- /dev/null +++ b/scripts/k8s/manifests/configmap.yaml @@ -0,0 +1,18 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: clawbench-config + labels: + app: clawbench +data: + CLAWBENCH_MODEL: "openai/gpt-5.5" + OPENAI_API_BASE: "" + CLAWBENCH_RUNS: "3" + CLAWBENCH_CONCURRENCY: "4" + CLAWBENCH_JUDGE_MODEL: "" + CLAWBENCH_TASKS: "" + CLAWBENCH_CONNECT_TIMEOUT: "120" + CLAWBENCH_REQUEST_TIMEOUT: "300" + CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600" + MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000" + MLFLOW_EXPERIMENT_NAME: "clawbench" diff --git a/scripts/k8s/manifests/secret.yaml b/scripts/k8s/manifests/secret.yaml new file mode 100644 index 0000000..2071a6c --- /dev/null +++ b/scripts/k8s/manifests/secret.yaml @@ -0,0 +1,15 @@ +# Reference template — do NOT apply directly. +# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically +# from exported environment variables (OPENAI_API_KEY, etc.). 
+apiVersion: v1 +kind: Secret +metadata: + name: clawbench-secrets + labels: + app: clawbench +type: Opaque +stringData: + OPENAI_API_KEY: "REPLACE_ME" + # Add other provider keys as needed: + # ANTHROPIC_API_KEY: "REPLACE_ME" + # OPENROUTER_API_KEY: "REPLACE_ME" diff --git a/scripts/k8s/mlflow/deployment.yaml b/scripts/k8s/mlflow/deployment.yaml new file mode 100644 index 0000000..f2745a3 --- /dev/null +++ b/scripts/k8s/mlflow/deployment.yaml @@ -0,0 +1,68 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: mlflow + labels: + app: mlflow +spec: + replicas: 1 + strategy: + type: Recreate + selector: + matchLabels: + app: mlflow + template: + metadata: + labels: + app: mlflow + spec: + containers: + - name: mlflow + image: ghcr.io/mlflow/mlflow:v2.21.3 + command: + - mlflow + - server + - --host + - "0.0.0.0" + - --port + - "5000" + - --backend-store-uri + - sqlite:///mlflow/mlflow.db + - --default-artifact-root + - /mlflow/artifacts + - --serve-artifacts + ports: + - name: http + containerPort: 5000 + protocol: TCP + livenessProbe: + httpGet: + path: /health + port: 5000 + initialDelaySeconds: 15 + periodSeconds: 30 + readinessProbe: + httpGet: + path: /health + port: 5000 + initialDelaySeconds: 5 + periodSeconds: 10 + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + cpu: 500m + memory: 1Gi + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + volumeMounts: + - name: mlflow-data + mountPath: /mlflow + volumes: + - name: mlflow-data + persistentVolumeClaim: + claimName: mlflow-data-pvc diff --git a/scripts/k8s/mlflow/pvc.yaml b/scripts/k8s/mlflow/pvc.yaml new file mode 100644 index 0000000..e9d2c7a --- /dev/null +++ b/scripts/k8s/mlflow/pvc.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: mlflow-data-pvc + labels: + app: mlflow +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 5Gi diff --git a/scripts/k8s/mlflow/service.yaml 
b/scripts/k8s/mlflow/service.yaml new file mode 100644 index 0000000..49649c9 --- /dev/null +++ b/scripts/k8s/mlflow/service.yaml @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: mlflow-service + labels: + app: mlflow +spec: + type: ClusterIP + selector: + app: mlflow + ports: + - name: http + port: 5000 + targetPort: 5000 + protocol: TCP diff --git a/scripts/k8s/openclaw/configmap.yaml b/scripts/k8s/openclaw/configmap.yaml new file mode 100644 index 0000000..20cf6dd --- /dev/null +++ b/scripts/k8s/openclaw/configmap.yaml @@ -0,0 +1,36 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: openclaw-config + labels: + app: openclaw +data: + openclaw.json: | + { + "gateway": { + "mode": "local", + "bind": "loopback", + "port": 18789, + "auth": { + "mode": "token" + } + }, + "browser": { + "enabled": true, + "headless": true, + "noSandbox": true, + "ssrfPolicy": { + "allowedHostnames": ["localhost", "127.0.0.1"] + } + }, + "tools": { + "profile": "coding", + "alsoAllow": ["browser"] + }, + "agents": { + "defaults": { + "workspace": "~/.openclaw/workspace" + } + }, + "cron": { "enabled": false } + } diff --git a/scripts/k8s/openclaw/deployment.yaml b/scripts/k8s/openclaw/deployment.yaml new file mode 100644 index 0000000..3ccc5df --- /dev/null +++ b/scripts/k8s/openclaw/deployment.yaml @@ -0,0 +1,146 @@ +# OpenClaw gateway deployment for ClawBench evals. +# +# Build the image with browser support: +# docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \ +# -t quay.io/yourorg/openclaw:eval . 
+# +# Or use upstream without browser (browser eval tasks will score 0): +# image: ghcr.io/openclaw/openclaw:latest +apiVersion: apps/v1 +kind: Deployment +metadata: + name: openclaw + labels: + app: openclaw +spec: + replicas: 1 + strategy: + type: Recreate + selector: + matchLabels: + app: openclaw + template: + metadata: + labels: + app: openclaw + spec: + initContainers: + - name: init-config + image: registry.access.redhat.com/ubi9-minimal:latest + command: + - sh + - -c + - | + cp /config/openclaw.json /home/node/.openclaw/openclaw.json + chmod 666 /home/node/.openclaw/openclaw.json + mkdir -p /home/node/.openclaw/workspace + mkdir -p /home/node/.openclaw/agents + chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents + echo "Config initialized" + volumeMounts: + - name: openclaw-home + mountPath: /home/node/.openclaw + - name: config-template + mountPath: /config + resources: + limits: + cpu: 200m + memory: 128Mi + requests: + cpu: 50m + memory: 64Mi + containers: + - name: gateway + image: ghcr.io/openclaw/openclaw:latest + imagePullPolicy: IfNotPresent + command: + - sh + - -c + - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured + env: + - name: HOME + value: /home/node + - name: NODE_ENV + value: production + - name: OPENCLAW_CONFIG_DIR + value: /home/node/.openclaw + - name: OPENCLAW_STATE_DIR + value: /home/node/.openclaw + - name: OPENCLAW_GATEWAY_TOKEN + valueFrom: + secretKeyRef: + name: clawbench-secrets + key: OPENCLAW_GATEWAY_TOKEN + - name: OPENAI_API_KEY + valueFrom: + secretKeyRef: + name: clawbench-secrets + key: OPENAI_API_KEY + optional: true + - name: ANTHROPIC_API_KEY + valueFrom: + secretKeyRef: + name: clawbench-secrets + key: ANTHROPIC_API_KEY + optional: true + - name: OPENROUTER_API_KEY + valueFrom: + secretKeyRef: + name: clawbench-secrets + key: OPENROUTER_API_KEY + optional: true + - name: GEMINI_API_KEY + valueFrom: + secretKeyRef: + name: 
clawbench-secrets + key: GEMINI_API_KEY + optional: true + ports: + - name: gateway + containerPort: 18789 + protocol: TCP + livenessProbe: + exec: + command: + - node + - -e + - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))" + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + readinessProbe: + exec: + command: + - node + - -e + - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))" + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + resources: + requests: + cpu: 250m + memory: 1Gi + limits: + cpu: "2" + memory: 4Gi + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + volumeMounts: + - name: openclaw-home + mountPath: /home/node/.openclaw + - name: tmp-volume + mountPath: /tmp + terminationGracePeriodSeconds: 30 + volumes: + - name: openclaw-home + persistentVolumeClaim: + claimName: openclaw-home-pvc + - name: config-template + configMap: + name: openclaw-config + - name: tmp-volume + emptyDir: {} diff --git a/scripts/k8s/openclaw/pvc.yaml b/scripts/k8s/openclaw/pvc.yaml new file mode 100644 index 0000000..e834e78 --- /dev/null +++ b/scripts/k8s/openclaw/pvc.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: openclaw-home-pvc + labels: + app: openclaw +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi diff --git a/scripts/k8s/openclaw/secret.yaml b/scripts/k8s/openclaw/secret.yaml new file mode 100644 index 0000000..b58e4dc --- /dev/null +++ b/scripts/k8s/openclaw/secret.yaml @@ -0,0 +1,17 @@ +# Reference template — do NOT apply directly. +# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically +# from exported environment variables (OPENAI_API_KEY, etc.). 
+apiVersion: v1 +kind: Secret +metadata: + name: clawbench-secrets + labels: + app: openclaw +type: Opaque +stringData: + OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME" + OPENAI_API_KEY: "REPLACE_ME" + # Add other provider keys as needed: + # ANTHROPIC_API_KEY: "REPLACE_ME" + # OPENROUTER_API_KEY: "REPLACE_ME" + # GEMINI_API_KEY: "REPLACE_ME" diff --git a/scripts/k8s/openclaw/service.yaml b/scripts/k8s/openclaw/service.yaml new file mode 100644 index 0000000..41df621 --- /dev/null +++ b/scripts/k8s/openclaw/service.yaml @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: openclaw + labels: + app: openclaw +spec: + type: ClusterIP + selector: + app: openclaw + ports: + - name: gateway + port: 18789 + targetPort: 18789 + protocol: TCP diff --git a/scripts/log_to_mlflow.py b/scripts/log_to_mlflow.py new file mode 100644 index 0000000..79a2ca0 --- /dev/null +++ b/scripts/log_to_mlflow.py @@ -0,0 +1,125 @@ +#!/usr/bin/env python3 +"""Log a ClawBench BenchmarkResult to MLflow. + +Standalone script -- not imported by the clawbench package. +Requires: pip install mlflow (or pip install clawbench[mlflow]) + +Usage: + python scripts/log_to_mlflow.py /results/benchmark.json + +Environment: + MLFLOW_TRACKING_URI MLflow tracking server (default: http://localhost:5000) + MLFLOW_EXPERIMENT_NAME Experiment name (default: clawbench) +""" + +from __future__ import annotations + +import json +import os +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) + + +def main(result_path: str) -> None: + try: + import mlflow + except ImportError: + print( + "mlflow is not installed. 
Install with: pip install mlflow" + " (or pip install clawbench[mlflow])", + file=sys.stderr, + ) + sys.exit(1) + + from clawbench.schemas import BenchmarkResult + + with open(result_path, encoding="utf-8") as f: + result = BenchmarkResult(**json.load(f)) + + experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID") + if experiment_id: + experiment = mlflow.set_experiment(experiment_id=experiment_id) + else: + experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench")) + + run_name = f"{result.model}-{result.submission_id[:8]}" + with mlflow.start_run(run_name=run_name): + mlflow.log_params( + { + "model": result.model, + "provider": result.provider, + "benchmark_version": result.benchmark_version, + "openclaw_version": result.openclaw_version or "unknown", + "judge_model": result.judge_model or "none", + "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown", + } + ) + + mlflow.log_metrics( + { + "overall_score": result.overall_score, + "overall_completion": result.overall_completion, + "overall_trajectory": result.overall_trajectory, + "overall_behavior": result.overall_behavior, + "overall_reliability": result.overall_reliability, + "overall_pass_hat_k": result.overall_pass_hat_k, + "overall_judge_score": result.overall_judge_score, + "overall_judge_confidence": result.overall_judge_confidence, + "overall_judge_pass_rate": result.overall_judge_pass_rate, + "judge_task_coverage": result.judge_task_coverage, + "overall_weighted_query_score": result.overall_weighted_query_score, + "overall_median_latency_ms": result.overall_median_latency_ms, + "overall_p95_latency_ms": result.overall_p95_latency_ms, + "overall_total_tokens": result.overall_total_tokens, + "overall_cost_usd": result.overall_cost_usd, + "overall_tokens_per_pass": result.overall_tokens_per_pass, + "overall_cost_per_pass": result.overall_cost_per_pass, + "overall_ci_lower": result.overall_ci_lower, + "overall_ci_upper": result.overall_ci_upper, + } + ) + 
+ + for tier in result.tier_results: + mlflow.log_metrics( + { + f"{tier.tier}/score": tier.mean_task_score, + f"{tier.tier}/completion": tier.mean_completion, + f"{tier.tier}/trajectory": tier.mean_trajectory, + f"{tier.tier}/behavior": tier.mean_behavior, + f"{tier.tier}/reliability": tier.mean_reliability, + } + ) + + for i, task in enumerate(result.task_results): + mlflow.log_metrics( + { + f"task/{task.task_id}/score": task.mean_task_score, + f"task/{task.task_id}/reliability": task.reliability_score, + }, + step=i, + ) + + mlflow.set_tags( + { + "submission_id": result.submission_id, + "timestamp": result.timestamp, + "certified": str(result.certified), + } + ) + + try: + mlflow.log_artifact(result_path) + except Exception as e: + print(f"Warning: artifact upload failed: {e}", file=sys.stderr) + print("Metrics and params were logged successfully.", file=sys.stderr) + + print(f"Logged to MLflow: experiment={experiment.name} run={run_name}") + + +if __name__ == "__main__": + if len(sys.argv) != 2: + print(f"Usage: {sys.argv[0]} <result_path>", file=sys.stderr) + sys.exit(1) + main(sys.argv[1])
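`scripts/log_to_mlflow.py` passes the `overall_*` fields straight to `mlflow.log_metrics`, which accepts only numeric values. If some of those fields can be `None` (for example the judge metrics when no judge model is configured; this is an assumption, the `BenchmarkResult` schema is not shown in this diff), the call may fail. A minimal sketch of a filter that could be applied to each metrics dict first; `numeric_metrics` is a hypothetical helper, not part of the script:

```python
from __future__ import annotations

import math
from typing import Any, Mapping


def numeric_metrics(raw: Mapping[str, Any]) -> dict[str, float]:
    """Return only the finite numeric entries of `raw`, converted to float."""
    clean: dict[str, float] = {}
    for key, value in raw.items():
        # bool is a subclass of int; skip it so flags are not logged as 0/1.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # NaN and +/-inf are dropped as well, since they are not useful metrics.
        if math.isfinite(value):
            clean[key] = float(value)
    return clean


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = {
        "overall_score": 0.82,
        "overall_judge_score": None,
        "overall_total_tokens": 12345,
    }
    print(numeric_metrics(sample))
```

In the script this would wrap each dict, e.g. `mlflow.log_metrics(numeric_metrics({...}))`, so a missing judge score skips that one metric rather than failing the whole logging step.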