Skip to content

Add running latency metric and compression labels#168

Merged
sjmiller609 merged 2 commits into
mainfrom
codex/add-time-to-running-and-compression-labels
Mar 25, 2026
Merged

Add running latency metric and compression labels#168
sjmiller609 merged 2 commits into
mainfrom
codex/add-time-to-running-and-compression-labels

Conversation

@sjmiller609

@sjmiller609 sjmiller609 commented Mar 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • add hypeman_instances_time_to_running_seconds for boot-to-running latency
  • label standby and restore duration histograms with compression algorithm and level
  • cover the new metric/label behavior with focused instance metrics tests

Testing

  • go test ./lib/instances -run 'TestSnapshotCompressionMetrics_RecordAndObserve|TestInstanceOldestInStateMetric_ObserveOldestAgePerState|TestInstanceTimeToRunningMetric_RecordWhenBootMarkersPersisted|TestLifecycleDurationMetrics_RecordCompressionLabels'

Note

Low Risk
Low risk instrumentation change: adds new metrics and labels and a small amount of extra metadata/filesystem inspection during standby/restore, without changing core lifecycle behavior.

Overview
Adds a new instance latency histogram, hypeman_instances_time_to_running_seconds, recorded when boot markers are persisted and the VM has reached Running.

Extends standby/restore duration histograms to include snapshot compression metadata (algorithm, level) via a shared recordDurationWithCompression helper; restore derives these labels from existing snapshot artifacts and policy resolution. Updates OTel docs and adds focused metrics tests for the new histogram and label behavior.

Written by Cursor Bugbot for commit e5a02c8. This will update automatically on new commits. Configure here.

@sjmiller609 sjmiller609 marked this pull request as ready for review March 25, 2026 17:42
@sjmiller609 sjmiller609 merged commit 66259ff into main Mar 25, 2026
6 checks passed
@sjmiller609 sjmiller609 deleted the codex/add-time-to-running-and-compression-labels branch March 25, 2026 18:19
hiroTamada added a commit that referenced this pull request Mar 27, 2026
* Add waitForState endpoint for blocking state transitions

Adds GET /instances/{id}/wait that blocks until an instance reaches
a target state, the timeout expires, or a terminal/error state is
detected. This avoids client-side polling when waiting for state
changes like waiting for an instance to become Running.

- Polls internally every 1s using existing GetInstance state derivation
- Default timeout 60s, max 5m, configurable via query parameter
- Returns immediately on Stopped (terminal) or Unknown (error) states
- Handles client disconnects via context cancellation
- Returns current state + timed_out flag in response

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add unit tests for waitForState endpoint

Tests cover: already-in-target-state, terminal state early return,
invalid/negative timeout validation, missing resolved instance (500),
instance deleted during wait (404), context cancellation, timeout
capping, and default timeout behavior.

All tests use fake middleware contexts to avoid Docker Hub / KVM
dependencies while still exercising the handler logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add waitForState endpoint to RouteScopes

The GET /instances/{id}/wait endpoint was missing from the scope
mapping, causing TestAllRoutesHaveScopes to fail in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Move waitForState polling logic into lib/instances/wait.go

Extracts the poll loop, terminal/error state detection, and timeout
handling from the HTTP handler into a reusable WaitForState function.
The handler now only parses request params and maps the result to
HTTP responses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address bugbot review: terminal state checks and missing StateError

- Add Standby as terminal state for early return (alongside Stopped)
- Add initial terminal/error state check before entering poll loop
- Include StateError in context cancellation return path
- Extract isTerminalForWait helper for consistent terminal detection
- Add StandbyTerminalEarlyReturn test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add running latency metric and compression labels (#168)

* Fix timeout case to check if target state was reached

When the timeout timer fires, after fetching the latest state, check
whether it actually matches the target before reporting TimedOut: true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Propagate ErrNotFound in timer path for consistency with ticker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor waitForState to channel-based subscription with polling fallback

Add a subscription mechanism (subscribe.go) that lets WaitForState receive
state changes via channels instead of relying solely on polling. Manager
methods (create, start, stop, standby, restore, fork, delete) now broadcast
state changes to subscribers via notifyStateChange. WaitForState uses a
3-way select: subscription channel, 5s polling fallback (with warning log
if it detects a missed event), and context cancellation/timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix gofmt formatting in wait_test.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix ForkInstance source notify and panic safety in Notify

Remove incorrect notifyStateChange(id, forked) in ForkInstance that
passed the fork result as the source instance state. Add trySend helper
with deferred recover to prevent panic if CloseAll closes a channel
between Notify's copy and send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Downgrade polling fallback log from warn to debug

Derived state transitions (e.g. Initializing→Running via boot markers)
are not broadcast via subscription since they happen inside GetInstance.
The polling fallback detecting these is expected, not a missed event.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix race window in WaitForState and remove no-op notifications

Move Subscribe() call before initial state checks in WaitForState to
close the TOCTOU race between checking state and subscribing.

Remove notifyStateChange calls from CreateInstance, ForkInstance, and
ForkSnapshot — newly created instance IDs have no subscribers, so these
notifications were no-ops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Rename wait_for_state to wait in stainless config

Follow the existing convention where instance method names match the
URL suffix (standby, restore, fork, start, stop, logs, stat, stats).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Paused to terminal state check in WaitForState

Paused requires explicit user action (resume/shutdown/standby) to
progress, so WaitForState should return early instead of polling
until timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add WaitForState integration test and Paused terminal state

Add TestWaitForState_Integration: a real integration test that creates
an nginx instance and uses WaitForState (subscription + polling fallback)
to wait for it to reach Running, proving the mechanism works end-to-end.

Also add Paused to isTerminalForWait since it requires explicit user
action to progress, with a unit test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Shutdown to terminal state check in WaitForState

Shutdown requires explicit user action (StopInstance or restart) to progress,
so WaitForState should return early instead of polling until timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add waitForState endpoint to stainless SDK config

Map the GET /instances/{id}/wait endpoint as instances.wait_for_state
and add WaitForStateResponse as a model in the instances resource.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Steven Miller <sjmiller609@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant