Improve browser-run reliability and resource controls #57

Merged
joyzoursky merged 14 commits into main from resources-review on Mar 17, 2026

@joyzoursky
Collaborator

Summary

  • Move browser run dispatching to a dedicated browser worker loop with adaptive polling/backoff, and tighten cancellation responsiveness with SLA-oriented guardrails.
  • Add runner resiliency paths (credential verification + repair API) and auto-repair stale local runner state during startup.
  • Reduce default runtime/resource footprint and add explicit Helm resource profiles (low, standard, high) plus dedicated browser-worker deployment/HPA controls.
  • Add CI/browser load-gate tooling and supporting docs for dependency lifecycle and macOS runner operations.
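
To make the adaptive polling/backoff mentioned above concrete, here is a minimal sketch of one way a worker loop could choose its next poll delay. The names (`BackoffOptions`, `nextPollDelay`) and the example intervals are hypothetical and are not taken from this PR's code.

```typescript
// Hypothetical sketch: not the implementation merged in this PR.
interface BackoffOptions {
  minMs: number;  // delay used right after a poll that found work
  maxMs: number;  // upper bound on the idle delay
  factor: number; // growth multiplier applied after each idle poll
}

// Returns the next polling delay in milliseconds.
// A poll that found work resets the delay to minMs; an idle poll grows the
// previous delay geometrically, capped at maxMs.
function nextPollDelay(
  prevMs: number,
  foundWork: boolean,
  opts: BackoffOptions,
): number {
  if (foundWork) {
    return opts.minMs;
  }
  const grown = Math.max(prevMs, opts.minMs) * opts.factor;
  return Math.min(grown, opts.maxMs);
}
```

With `{ minMs: 250, maxMs: 4000, factor: 2 }`, consecutive idle polls starting from 250 ms would wait 500, 1000, 2000, and then 4000 ms, and the delay drops back to 250 ms as soon as a run is found.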

Changes

  • Added browser worker runtime and orchestration changes in web runtime, API routing, and Makefile local dev startup.
  • Extended runner protocol/registration/device sync behavior across control-plane, CLI, and macOS runner integration.
  • Added control-plane APIs for runner credential verification and runner repair.
  • Added Helm profile files, browser-worker deployment template, and HPA/config updates.
  • Added load-gate workflow/scripts and updated operator/maintainer docs.

Validation

  • Lint + type-check + dependency audit: npm run verify (pass)

Breaking Changes

  • Browser-run dispatch now depends on the browser worker process/deployment (SKYTEST_BROWSER_WORKER=true locally, browser-worker deployment in Helm).
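
As a rough illustration of what this dependency means for callers, the sketch below shows a worker-mode guard keyed on `SKYTEST_BROWSER_WORKER`. Only the variable name comes from this PR; the function names, the `Env` type, and the exact parsing rule (case-insensitive `true`) are assumptions.

```typescript
// Hypothetical sketch: only the env variable name is taken from the PR.
type Env = Record<string, string | undefined>;

// Assumption: worker mode is enabled only by the value "true"
// (case-insensitive, surrounding whitespace ignored).
function isBrowserWorkerMode(env: Env): boolean {
  const raw = env["SKYTEST_BROWSER_WORKER"];
  return raw !== undefined && raw.trim().toLowerCase() === "true";
}

// Throws when a process that is not a browser worker tries to dispatch,
// so that dispatching stays confined to the dedicated worker loop.
function assertBrowserWorkerMode(env: Env): void {
  if (!isBrowserWorkerMode(env)) {
    throw new Error(
      "Browser runs are dispatched only by the browser worker " +
        "(set SKYTEST_BROWSER_WORKER=true).",
    );
  }
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the guard easy to unit test.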

Risks

  • This is a cross-cutting change touching runtime scheduling, runner registration/state sync, and deployment defaults; regressions may appear under high concurrency.
  • Resource profile shifts may require environment-specific tuning in clusters with strict quotas.

Follow-ups

  • Run staging load-gate scenarios to validate cancellation SLA and dispatch throughput.
  • Monitor runner repair frequency and claim/lease behavior after rollout.

- make runner transport cadence and claim retry behavior configurable
- add per-pod local browser concurrency guard with dispatch locking
- reduce runner API rate-limit DB writes via in-memory mode option
- tune default polling/reaper intervals and document new env knobs
- add memory-aware HPA metric support
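
The per-pod concurrency guard in this commit could look roughly like the in-process sketch below. The class and method names are hypothetical, and the sketch covers only the in-memory slot accounting; any cross-pod or database-level dispatch locking the real change uses is not shown.

```typescript
// Hypothetical sketch of a per-process concurrency guard.
class LocalBrowserGuard {
  private readonly maxConcurrent: number;
  private readonly activeRunIds = new Set<string>();

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  // Claims a slot for runId. Returns false when the run already holds a
  // slot (prevents double dispatch) or when all slots are in use.
  tryAcquire(runId: string): boolean {
    if (this.activeRunIds.has(runId)) {
      return false;
    }
    if (this.activeRunIds.size >= this.maxConcurrent) {
      return false;
    }
    this.activeRunIds.add(runId);
    return true;
  }

  // Frees the slot held by runId; a no-op if the run holds none.
  release(runId: string): void {
    this.activeRunIds.delete(runId);
  }

  activeCount(): number {
    return this.activeRunIds.size;
  }
}
```

A dispatcher would call `tryAcquire` before launching a browser and `release` in a `finally` block once the run ends, so a crashed run does not leak its slot.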
- add bounded env validation for resource-related runtime knobs
- implement adaptive backoff for SSE polling and local run status polling
- lower default device-sync cadence risk by aligning defaults
- optimize runner device sync writes to avoid per-device unchanged upserts
- document new polling interval controls in Helm docs and env example
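
Bounded env validation of the kind this commit describes can be sketched as a small parsing helper. The function name, the fallback-on-invalid behavior, and the example bounds are assumptions for illustration; the actual knobs and limits are defined in the PR's code and docs.

```typescript
// Hypothetical sketch: parse an integer env knob and clamp it to a safe range.
function readBoundedInt(
  raw: string | undefined,
  fallback: number,
  min: number,
  max: number,
): number {
  // Unset or blank values use the default.
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  // Assumption: non-integer input falls back to the default instead of failing startup.
  if (!Number.isInteger(parsed)) {
    return fallback;
  }
  // Clamp valid integers into [min, max].
  return Math.min(max, Math.max(min, parsed));
}
```

For example, with a default of 1000 and bounds of 100 to 5000, an input of `"50"` would resolve to 100 and `"999999"` to 5000, which prevents a typo from producing a hot polling loop or an effectively disabled one.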
- add k6-based runner claim load gate workflow with DB tx/s and RSS checks
- add seed and k6 scripts for repeatable claim-load benchmarking
- add maintainer dependency lifecycle policy and docs index link
- add profile override files for low, standard, and high deployment sizes
- document profile-specific install commands
- include runtime env tuning guidance for each profile
- switch control-plane defaults to a low-cost 1-replica baseline
- reduce default polling and concurrency to lower CPU/memory pressure
- make memory-backed rate limits the default to reduce DB write load
- add a browser-run gate script that dispatches real local browser executions
- add a second load-gate workflow job with p95 latency, RSS, and OOM thresholds
- wire a workspace npm script for browser gate execution
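
For readers unfamiliar with p95 gating, the sketch below shows how a latency threshold check of this kind can be computed. It uses the nearest-rank percentile method, which is an assumption; the gate script in this PR may compute percentiles differently, and the function names are hypothetical.

```typescript
// Hypothetical sketch of a p95 latency gate using the nearest-rank method.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: the smallest 1-based rank covering p percent of samples.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Returns true when the p95 latency is at or below the threshold.
function passesLatencyGate(latenciesMs: number[], thresholdMs: number): boolean {
  return percentile(latenciesMs, 95) <= thresholdMs;
}
```

Gating on p95 rather than the mean keeps a small number of very slow dispatches from being averaged away while still tolerating an occasional outlier.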
- remove direct browser dispatch from API, MCP, cancellation, and lease-reaper paths
- add browser-runner worker loop and worker-mode guard for dispatcher
- keep CI browser load gate aligned with worker mode and clean related tests
- add browserWorker values and deployment template
- disable browser worker mode in control-plane pods explicitly
- update sizing profiles and docs for split control-plane/browser-worker topology
- add integration-style local browser runner cancel SLA assertion
- align device sync transaction test mocks with batch sync implementation
- add active-run non-abort SLA coverage for local browser runner reconciliation
- add device sync no-op skip-path coverage for unchanged recent devices
- make max cancellation poll interval configurable and document env knob
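
The device sync skip path covered here can be pictured as a filter that writes only new, changed, or stale devices. This is a sketch under assumptions: the `DeviceRecord` shape, the `refreshAfterMs` staleness window, and the function name are hypothetical and do not reflect the actual schema.

```typescript
// Hypothetical device shape; the real schema is not shown in this PR page.
interface DeviceRecord {
  id: string;
  state: string;
  lastSeenAtMs: number;
}

// Returns only the reported devices that need a database write:
// new devices, devices whose state changed, and unchanged devices whose
// stored record is older than refreshAfterMs.
function selectDevicesToWrite(
  stored: Map<string, DeviceRecord>,
  reported: DeviceRecord[],
  nowMs: number,
  refreshAfterMs: number,
): DeviceRecord[] {
  return reported.filter((device) => {
    const previous = stored.get(device.id);
    if (previous === undefined) {
      return true; // new device
    }
    if (previous.state !== device.state) {
      return true; // state changed
    }
    // Unchanged: skip the write unless the stored record has gone stale.
    return nowMs - previous.lastSeenAtMs >= refreshAfterMs;
  });
}
```

When every reported device is unchanged and recent, the result is empty and the sync can skip the transaction entirely.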
- set control-plane default and low profile to 50m/96Mi requests and 250m/256Mi limits
- keep browser worker at 500m/1Gi requests with 2 CPU / 2Gi limits
@joyzoursky changed the title from "runner: improve browser-run reliability and resource controls" to "Improve browser-run reliability and resource controls" on Mar 17, 2026
@joyzoursky merged commit 2274b3d into main on Mar 17, 2026
1 check passed
@joyzoursky deleted the resources-review branch on March 17, 2026 at 11:26