Improve browser-run reliability and resource controls by joyzoursky · Pull Request #57 · oursky/skytest-agent

joyzoursky · 2026-03-17T11:23:29Z

Summary

Move browser run dispatching to a dedicated browser worker loop with adaptive polling/backoff, and tighten cancellation responsiveness with SLA-oriented guardrails.
Add runner resiliency paths (credential verification + repair API) and auto-repair stale local runner state during startup.
Reduce default runtime/resource footprint and add explicit Helm resource profiles (low, standard, high) plus dedicated browser-worker deployment/HPA controls.
Add CI/browser load-gate tooling and supporting docs for dependency lifecycle and macOS runner operations.
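The adaptive polling/backoff idea in the first summary item can be sketched roughly as follows. This is a minimal illustration, not the worker loop from this PR: the `AdaptiveBackoff` class, its option names, and the delay values are all hypothetical.

```typescript
// Hypothetical sketch: names and defaults are illustrative, not taken from the PR.
interface BackoffOptions {
  minDelayMs: number; // delay used right after work was found
  maxDelayMs: number; // upper bound while the queue stays empty
  factor: number; // growth multiplier applied after each empty poll
}

class AdaptiveBackoff {
  private readonly options: BackoffOptions;
  private currentDelayMs: number;

  constructor(options: BackoffOptions) {
    if (
      options.minDelayMs <= 0 ||
      options.maxDelayMs < options.minDelayMs ||
      options.factor <= 1
    ) {
      throw new RangeError("invalid backoff options");
    }
    this.options = options;
    this.currentDelayMs = options.minDelayMs;
  }

  // Call after each poll; returns how long to wait before the next poll.
  next(foundWork: boolean): number {
    if (foundWork) {
      // Work is available, so poll again at the fastest cadence.
      this.currentDelayMs = this.options.minDelayMs;
    } else {
      // Nothing to do: grow the delay, but never beyond the cap.
      this.currentDelayMs = Math.min(
        this.currentDelayMs * this.options.factor,
        this.options.maxDelayMs,
      );
    }
    return this.currentDelayMs;
  }
}
```

A worker loop would call `next(true)` after claiming a run and `next(false)` after an empty poll, then sleep for the returned delay. Capping the delay keeps idle pods cheap while bounding how long a newly queued run can wait.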

Changes

Added browser worker runtime and orchestration changes in web runtime, API routing, and Makefile local dev startup.
Extended runner protocol/registration/device sync behavior across control-plane, CLI, and macOS runner integration.
Added control-plane APIs for runner credential verification and runner repair.
Added Helm profile files, browser-worker deployment template, and HPA/config updates.
Added load-gate workflow/scripts and updated operator/maintainer docs.

Validation

Lint + type-check + dependency audit: npm run verify (pass)

Breaking Changes

Browser-run dispatch now depends on the browser worker process/deployment (SKYTEST_BROWSER_WORKER=true locally, browser-worker deployment in Helm).
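For readers adapting local setups, a worker-mode guard along these lines may help picture the new dependency. Only the `SKYTEST_BROWSER_WORKER` variable name comes from this PR; the function names and the assumption that only the literal string `true` enables worker mode are illustrative.

```typescript
// Hypothetical sketch of a worker-mode guard.
// Assumption: only the exact value "true" turns worker mode on.
type Env = Record<string, string | undefined>;

function isBrowserWorkerEnabled(env: Env): boolean {
  return env["SKYTEST_BROWSER_WORKER"] === "true";
}

// Throws when browser dispatch is attempted outside a worker process.
function assertBrowserWorkerMode(env: Env): void {
  if (!isBrowserWorkerEnabled(env)) {
    throw new Error(
      "browser dispatch requires SKYTEST_BROWSER_WORKER=true in this process",
    );
  }
}
```

In a Node process the guard would typically receive `process.env`; passing the environment in explicitly keeps the check easy to test.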

Risks

This is a cross-cutting change touching runtime scheduling, runner registration/state sync, and deployment defaults; regressions may appear under high concurrency.
Resource profile shifts may require environment-specific tuning in clusters with strict quotas.

Follow-ups

Run staging load-gate scenarios to validate cancellation SLA and dispatch throughput.
Monitor runner repair frequency and claim/lease behavior after rollout.

- make runner transport cadence and claim retry behavior configurable
- add per-pod local browser concurrency guard with dispatch locking
- reduce runner API rate-limit DB writes via in-memory mode option
- tune default polling/reaper intervals and document new env knobs
- add memory-aware HPA metric support
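A per-pod concurrency guard of the kind mentioned above could look roughly like this. It is a sketch under assumptions: the `ConcurrencyGuard` class and its API are hypothetical, and it only covers in-process slot counting, not any cross-pod or database-level locking the PR may use.

```typescript
// Hypothetical in-process guard limiting concurrent local browser runs.
class ConcurrencyGuard {
  private readonly limit: number;
  private active = 0;

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new RangeError("limit must be a positive integer");
    }
    this.limit = limit;
  }

  get activeCount(): number {
    return this.active;
  }

  // Returns a release function when a slot is free, or null when the pod is full.
  tryAcquire(): (() => void) | null {
    if (this.active >= this.limit) {
      return null;
    }
    this.active += 1;
    let released = false;
    return () => {
      // Releasing twice must not free a slot that another run now holds.
      if (released) return;
      released = true;
      this.active -= 1;
    };
  }
}
```

Returning an idempotent release function avoids double-decrement bugs when both a completion handler and a cancellation path try to free the same slot.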

- add bounded env validation for resource-related runtime knobs
- implement adaptive backoff for SSE polling and local run status polling
- lower default device-sync cadence risk by aligning defaults
- optimize runner device sync writes to avoid per-device unchanged upserts
- document new polling interval controls in Helm docs and env example
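Bounded env validation, as named in the first item above, might be sketched like this. The helper name and the choice to clamp out-of-range values (rather than reject them) are assumptions for illustration; the PR may handle invalid input differently.

```typescript
// Hypothetical helper: parse an integer env knob and keep it within [min, max].
// Assumption: unset or non-integer input falls back to the default,
// while out-of-range integers are clamped to the nearest bound.
function readBoundedInt(
  raw: string | undefined,
  fallback: number,
  min: number,
  max: number,
): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (!Number.isInteger(parsed)) {
    return fallback;
  }
  return Math.min(Math.max(parsed, min), max);
}
```

For example, a polling interval knob with a default of 500 ms and bounds of 100 to 5000 ms would read as `readBoundedInt(value, 500, 100, 5000)`, so a typo such as `50` cannot turn the poller into a busy loop.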


joyzoursky added 14 commits March 17, 2026 15:38

Add load-gate CI and dependency lifecycle policy

f948ca9

- add k6-based runner claim load gate workflow with DB tx/s and RSS checks
- add seed and k6 scripts for repeatable claim-load benchmarking
- add maintainer dependency lifecycle policy and docs index link

Add Helm low/standard/high resource profiles

52a1241

- add profile override files for low, standard, and high deployment sizes
- document profile-specific install commands
- include runtime env tuning guidance for each profile

Lower default runtime and Helm resource footprint

8da2feb

- switch control-plane defaults to a low-cost 1-replica baseline
- reduce default polling and concurrency to lower CPU/memory pressure
- make memory-backed rate limits the default to reduce DB write load

Add browser execution CI load gate

868896f

- add a browser-run gate script that dispatches real local browser executions
- add a second load-gate workflow job with p95 latency, RSS, and OOM thresholds
- wire a workspace npm script for browser gate execution
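A p95 latency threshold check like the one this gate describes can be sketched as below. The function names and the nearest-rank percentile method are assumptions; the actual gate script and its thresholds are not shown in this PR text.

```typescript
// Hypothetical sketch: nearest-rank percentile over latency samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("no samples collected");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("rank out of range");
  }
  return value;
}

// The gate passes when the p95 latency does not exceed the threshold.
function passesP95Gate(latenciesMs: number[], thresholdMs: number): boolean {
  return percentile(latenciesMs, 95) <= thresholdMs;
}
```

Gating on p95 rather than the mean keeps a handful of fast runs from masking a slow tail, which is usually what users notice under load.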

Route browser runs through dedicated worker process

ed32de8

- remove direct browser dispatch from API, MCP, cancellation, and lease-reaper paths
- add browser-runner worker loop and worker-mode guard for dispatcher
- keep CI browser load gate aligned with worker mode and clean related tests

Add dedicated browser-worker deployment to Helm chart

b9d46e4

- add browserWorker values and deployment template
- disable browser worker mode in control-plane pods explicitly
- update sizing profiles and docs for split control-plane/browser-worker topology

Tighten cancellation responsiveness and autoscaling controls

512bfb2

Add cancellation SLA integration test

338ab08

- add integration-style local browser runner cancel SLA assertion
- align device sync transaction test mocks with batch sync implementation

Add cancellation guardrail knob and test coverage

050dd3e

- add active-run non-abort SLA coverage for local browser runner reconciliation
- add device sync no-op skip-path coverage for unchanged recent devices
- make max cancellation poll interval configurable and document env knob
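The device sync skip path for unchanged recent devices could work along these lines. The record shapes, field names, and freshness window are hypothetical; this only illustrates the idea of filtering out no-op writes before a batch upsert.

```typescript
// Hypothetical shapes: the real device model in the PR is not shown here.
interface ReportedDevice {
  id: string;
  name: string;
  state: string;
}

interface StoredDevice extends ReportedDevice {
  lastSyncedAtMs: number;
}

// Keep only devices that are new, changed, or not refreshed within the window.
function selectDevicesToUpsert(
  reported: ReportedDevice[],
  stored: Map<string, StoredDevice>,
  nowMs: number,
  freshnessMs: number,
): ReportedDevice[] {
  return reported.filter((device) => {
    const existing = stored.get(device.id);
    if (existing === undefined) {
      return true; // never seen before
    }
    const changed =
      existing.name !== device.name || existing.state !== device.state;
    const stale = nowMs - existing.lastSyncedAtMs >= freshnessMs;
    return changed || stale;
  });
}
```

Still writing stale-but-unchanged devices periodically keeps a last-seen timestamp meaningful while removing most per-sync writes.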

Lower Helm baseline for control-plane resources

25f8a6a

- set control-plane default and low profile to 50m/96Mi requests and 250m/256Mi limits
- keep browser worker at 500m/1Gi requests with 2 CPU / 2Gi limits

Start browser worker in make dev

5e35759

feat(runner): auto-repair startup and sync stale local state

5553c94

joyzoursky changed the title ~~runner: improve browser-run reliability and resource controls~~ Improve browser-run reliability and resource controls Mar 17, 2026

joyzoursky merged commit 2274b3d into main Mar 17, 2026
1 check passed

joyzoursky deleted the resources-review branch March 17, 2026 11:26
