Add resource capacity monitoring metrics #162

Merged
sjmiller609 merged 6 commits into main from codex/resource-capacity-metrics
Mar 24, 2026

@sjmiller609 (Collaborator) commented Mar 24, 2026

Summary

  • add a configurable metrics refresh interval for cached resource capacity monitoring
  • add cached OTel resource capacity gauges for core resources, disk/image storage, and GPU slots
  • start the resource monitoring refresh loop during API startup while keeping /resources live

Testing

  • go test ./cmd/api/config ./lib/resources

Notes

  • A broader go test ./cmd/api/... run is currently blocked in this worktree by missing embedded binaries (vz-shim/vz-shim and guest_agent/guest-agent).

Note

Medium Risk
Adds a new background monitoring loop at API startup that periodically snapshots host resource status for OTel metrics; incorrect intervals or snapshot failures could impact observability and introduce extra load, but core request paths are mostly unchanged.

Overview
Adds cached OpenTelemetry resource-capacity metrics driven by a new ResourceManager.StartMonitoring refresh loop that periodically snapshots GetFullStatus and publishes observable gauges for capacity/limits/allocations, disk/image storage breakdown, and GPU slot availability.

Introduces a new metrics.resource_refresh_interval config (default 120s) with validation and examples, wires the refresh loop into API startup, and updates/extends tests (new lib/resources/monitoring_test.go, plus small stability tweaks in integration/registry tests around async waits and timeouts).

Written by Cursor Bugbot for commit 45bf9f4. This will update automatically on new commits.

@sjmiller609 sjmiller609 marked this pull request as ready for review March 24, 2026 15:28
@sjmiller609 sjmiller609 requested a review from hiroTamada March 24, 2026 16:56
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@hiroTamada (Contributor) left a comment

solid design — snapshot-based caching with a single OTel callback is clean. one minor comment on goroutine durability.
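
The thread does not spell out the durability concern, but one common reading is that a panic inside a single refresh should not end the background loop. A minimal sketch under that assumption, with `safeRefresh` as a hypothetical wrapper name:

```go
package main

import "fmt"

// safeRefresh runs one refresh and converts a panic into an error,
// so the caller's loop can keep ticking.
func safeRefresh(refresh func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("refresh panicked: %v", r)
		}
	}()
	return refresh()
}

func main() {
	calls := 0
	refresh := func() error {
		calls++
		if calls == 2 {
			panic("simulated snapshot failure")
		}
		return nil
	}
	// Three simulated ticks; the second one panics and is reported as an error.
	for i := 0; i < 3; i++ {
		if err := safeRefresh(refresh); err != nil {
			fmt.Println("tick", i, "error:", err)
			continue
		}
		fmt.Println("tick", i, "ok")
	}
}
```

Recovering per iteration keeps the goroutine alive; whether to also log or count such failures is a separate choice.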

@sjmiller609 sjmiller609 merged commit dce3318 into main Mar 24, 2026
6 checks passed
@sjmiller609 sjmiller609 deleted the codex/resource-capacity-metrics branch March 24, 2026 19:14