Add running latency metric and compression labels#168
Merged
sjmiller609 merged 2 commits intoMar 25, 2026
Conversation
hiroTamada
pushed a commit
that referenced
this pull request
Mar 25, 2026
hiroTamada
added a commit
that referenced
this pull request
Mar 27, 2026
* Add waitForState endpoint for blocking state transitions
Adds GET /instances/{id}/wait that blocks until an instance reaches
a target state, the timeout expires, or a terminal/error state is
detected. This avoids client-side polling when waiting for state
changes like waiting for an instance to become Running.
- Polls internally every 1s using existing GetInstance state derivation
- Default timeout 60s, max 5m, configurable via query parameter
- Returns immediately on Stopped (terminal) or Unknown (error) states
- Handles client disconnects via context cancellation
- Returns current state + timed_out flag in response
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add unit tests for waitForState endpoint
Tests cover: already-in-target-state, terminal state early return,
invalid/negative timeout validation, missing resolved instance (500),
instance deleted during wait (404), context cancellation, timeout
capping, and default timeout behavior.
All tests use fake middleware contexts to avoid Docker Hub / KVM
dependencies while still exercising the handler logic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add waitForState endpoint to RouteScopes
The GET /instances/{id}/wait endpoint was missing from the scope
mapping, causing TestAllRoutesHaveScopes to fail in CI.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Move waitForState polling logic into lib/instances/wait.go
Extracts the poll loop, terminal/error state detection, and timeout
handling from the HTTP handler into a reusable WaitForState function.
The handler now only parses request params and maps the result to
HTTP responses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Address bugbot review: terminal state checks and missing StateError
- Add Standby as terminal state for early return (alongside Stopped)
- Add initial terminal/error state check before entering poll loop
- Include StateError in context cancellation return path
- Extract isTerminalForWait helper for consistent terminal detection
- Add StandbyTerminalEarlyReturn test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add running latency metric and compression labels (#168)
* Fix timeout case to check if target state was reached
When the timeout timer fires, after fetching the latest state, check
whether it actually matches the target before reporting TimedOut: true.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Propagate ErrNotFound in timer path for consistency with ticker
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Refactor waitForState to channel-based subscription with polling fallback
Add a subscription mechanism (subscribe.go) that lets WaitForState receive
state changes via channels instead of relying solely on polling. Manager
methods (create, start, stop, standby, restore, fork, delete) now broadcast
state changes to subscribers via notifyStateChange. WaitForState uses a
3-way select: subscription channel, 5s polling fallback (with warning log
if it detects a missed event), and context cancellation/timeout.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix gofmt formatting in wait_test.go
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix ForkInstance source notify and panic safety in Notify
Remove incorrect notifyStateChange(id, forked) in ForkInstance that
passed the fork result as the source instance state. Add trySend helper
with deferred recover to prevent panic if CloseAll closes a channel
between Notify's copy and send.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Downgrade polling fallback log from warn to debug
Derived state transitions (e.g. Initializing→Running via boot markers)
are not broadcast via subscription since they happen inside GetInstance.
The polling fallback detecting these is expected, not a missed event.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix race window in WaitForState and remove no-op notifications
Move Subscribe() call before initial state checks in WaitForState to
close the TOCTOU race between checking state and subscribing.
Remove notifyStateChange calls from CreateInstance, ForkInstance, and
ForkSnapshot — newly created instance IDs have no subscribers, so these
notifications were no-ops.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Rename wait_for_state to wait in stainless config
Follow the existing convention where instance method names match the
URL suffix (standby, restore, fork, start, stop, logs, stat, stats).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add Paused to terminal state check in WaitForState
Paused requires explicit user action (resume/shutdown/standby) to
progress, so WaitForState should return early instead of polling
until timeout.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add WaitForState integration test and Paused terminal state
Add TestWaitForState_Integration: a real integration test that creates
an nginx instance and uses WaitForState (subscription + polling fallback)
to wait for it to reach Running, proving the mechanism works end-to-end.
Also add Paused to isTerminalForWait since it requires explicit user
action to progress, with a unit test.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add Shutdown to terminal state check in WaitForState
Shutdown requires explicit user action (StopInstance or restart) to progress,
so WaitForState should return early instead of polling until timeout.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add waitForState endpoint to stainless SDK config
Map the GET /instances/{id}/wait endpoint as instances.wait_for_state
and add WaitForStateResponse as a model in the instances resource.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Steven Miller <sjmiller609@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
hypeman_instances_time_to_running_secondsfor boot-to-running latencyalgorithmandlevelTesting
go test ./lib/instances -run 'TestSnapshotCompressionMetrics_RecordAndObserve|TestInstanceOldestInStateMetric_ObserveOldestAgePerState|TestInstanceTimeToRunningMetric_RecordWhenBootMarkersPersisted|TestLifecycleDurationMetrics_RecordCompressionLabels'Note
Low Risk
Low risk instrumentation change: adds new metrics and labels and a small amount of extra metadata/filesystem inspection during standby/restore, without changing core lifecycle behavior.
Overview
Adds a new instance latency histogram,
hypeman_instances_time_to_running_seconds, recorded when boot markers are persisted and the VM has reachedRunning.Extends standby/restore duration histograms to include snapshot compression metadata (
algorithm,level) via a sharedrecordDurationWithCompressionhelper; restore derives these labels from existing snapshot artifacts and policy resolution. Updates OTel docs and adds focused metrics tests for the new histogram and label behavior.Written by Cursor Bugbot for commit e5a02c8. This will update automatically on new commits. Configure here.