Track per-phase duration on each instance#223
Conversation
Add a small phasetracking package that records cumulative time spent in each lifecycle phase (running, standby, paused, etc.) using transition bookkeeping. Tracker state is persisted with the instance's stored metadata, so it survives process restarts without a DB migration. Instrument transitions directly in create/start/stop/standby/restore/fork rather than subscribing to the lifecycle event stream — the subscription is lossy, and these numbers will feed billing. Expose current_phase, current_phase_since, and phase_durations_ms on the Instance API so callers (notably the kernel-api billing pipeline) can compute true running time instead of wall-clock since CreatedAt.
✱ Stainless preview builds for hypemanThis PR will update the
|
Piggyback on the firecracker/QEMU standby-restore cycles and the cloud-hypervisor fork-from-running test to assert end-to-end that transition-site instrumentation is wired up: - after standby: Current == standby, Cumulative[running] > 0 - after restore: Current == running, Cumulative[standby] > 0 - after fork-from-running: fork's Cumulative[running] is zero while source's is non-zero — locks down the Phases.Reset() semantics No new tests, no added sleeps. The assertions read state at the same points where the tests already check State.
Monitoring Plan CreatedThis PR introduces a new The change is purely additive — no existing fields or logic are modified — and pre-existing instance metadata gracefully degrades (zero-value tracker starts accumulating on the first state transition after deploy). The main risks to watch are metadata write failures during state transitions (which could leave on-disk metadata stale) and any disruption to the standby/restore cycle caused by the new Status updates will be posted automatically on this PR as monitoring progresses. |
The phase tracker's Since field is persisted and exposed in the API as current_phase_since. standby/stop were initializing `now` as local time while create/start/restore use UTC, leaving downstream consumers with mixed timezone offsets in the serialized value depending on which transition last occurred. Align all transition sites on UTC. StoppedAt moves to UTC as a byproduct, which is the correct normalization anyway.
cloneStoredMetadata previously shallow-copied the Phases tracker, which aliased the Cumulative map between source and forked metadata — a subsequent Record on either side would mutate both. Add Tracker.Clone and use it from cloneStoredMetadata. Also normalise the fork transition timestamp to UTC for consistency with the other transition sites.
The recording sites previously jumped straight to PhaseRunning the moment the VMM was up, but the public State machine stays in Initializing until both ProgramStartedAt and GuestAgentReadyAt are hydrated from the guest serial log. That meant Phases.Current reported "running" while the API reported "Initializing". Make phase tracking honest: - create/start record PhaseInitializing on VM boot - restore inspects the preserved markers and records whichever phase the guest is actually in (Running in the common case) - hydrateBootMarkersFromLogs / persistBootMarkers detect the Initializing → Running boundary and Record(PhaseRunning) using the later marker timestamp, so the accrued Initializing duration matches real guest boot time rather than the wall clock when hydration ran Transient internal substates (Paused/Shutdown inside Standby/Stop) remain unrecorded — they're sub-ms blips inside non-yielding orchestration that no external observer can see.
…se-tracking # Conflicts: # lib/instances/query_test.go
They were never recorded — internal Paused/Shutdown substates happen inside non-yielding orchestration calls and are intentionally not tracked (already documented in the package doc).
After a restore from early-standby (instance standbyed before boot markers ever hydrated), Phases.Since is set at restore time. The markers parsed afterwards can carry timestamps from the pre-standby boot session, predating Since by the entire standby interval. Without the clamp, Record would silently skip the negative-elapsed accrual but still move Since backwards — and every subsequent transition would then over-count Running. Since this field feeds billing, clamp forward so Since is monotonic. Adds a regression test covering the early-standby restore path.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit a29fd64. Configure here.

Summary
Adds a small
phasetrackingpackage underlib/instances/phasetrackingthat records how long each instance has spent in each lifecycle phase (running, standby, paused, stopped, etc.). Tracker state lives onStoredMetadata, so it persists with the existingmetadata.jsonper instance — no schema migration.Instrumentation is done directly at the transition sites in
create/start/stop/standby/restore/fork. We intentionally do not subscribe to the existing lifecycle event stream — that pipeline is lossy and these numbers will feed billing, so we want a direct write at the point of transition.Each
InstanceAPI response now carries:current_phasecurrent_phase_sincephase_durations_ms(cumulative ms per phase, with the live phase counted up throughnow)Why
Today the billing pipeline that consumes hypeman instances charges based on wall-clock since
CreatedAt, which double-counts intra-session standby and any other non-running time. To fix that properly we need hypeman itself to report real per-phase durations so the upstream billing calculation can compute true running time. Once this ships, the kernel-api side can switch the formula for hypeman CPU back to a platform-uptime model.Test plan
go build ./...go test ./lib/instances/phasetracking/...go test -run TestInstanceToOAPI ./cmd/api/api/...phase_durations_msmatches expectationsNote
Medium Risk
Touches instance lifecycle transition paths (
create/start/stop/standby/restore/fork) and adds new API fields used for billing, so incorrect phase recording could impact cost/analytics calculations despite being additive and well-tested.Overview
Adds persistent per-instance lifecycle phase accounting via new
lib/instances/phasetrackingtracker, recording cumulative time spent in phases at each externally observable transition (including special handling to advanceinitializing→runningbased on boot markers).Exposes this data in the Instances API as
current_phase,current_phase_since, andphase_durations_ms(snapshotting live time through response time), and updates OpenAPI/generatedoapimodels accordingly.Adjusts fork behavior to reset phase history (and deep-clone metadata to avoid shared maps), and expands unit/integration tests to validate phase accrual across create/standby/restore/fork and API emission/omission rules.
Reviewed by Cursor Bugbot for commit 3a95db6. Bugbot is set up for automated code reviews on this repo. Configure here.