feat(spawner): add Prometheus metrics export #821
Merged
2 issues found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="cmd/kelos-spawner/metrics_test.go">
<violation number="1" location="cmd/kelos-spawner/metrics_test.go:72">
P3: This histogram test is too weak: `CollectAndCount` does not verify that `Observe` increased histogram samples, so regressions in observation logic can go undetected.</violation>
</file>
<file name="internal/manifests/charts/kelos/templates/podmonitoring.yaml">
<violation number="1" location="internal/manifests/charts/kelos/templates/podmonitoring.yaml:2">
P2: PodMonitoring is a non-core CRD; rendering it unconditionally will cause Helm installs to fail on clusters that don’t have monitoring.googleapis.com/v1 installed (e.g., the kind cluster shown in the quick start). Gate these resources behind a values flag or add an alternative (e.g., ServiceMonitor) so installs don’t break on vanilla clusters.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
cmd/kelos-spawner/metrics_test.go
Outdated
func TestDiscoveryDurationSecondsHistogram(t *testing.T) {
	discoveryDurationSeconds.Observe(1.5)
	count := testutil.CollectAndCount(discoveryDurationSeconds)
P3: This histogram test is too weak: CollectAndCount does not verify that Observe increased histogram samples, so regressions in observation logic can go undetected.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At cmd/kelos-spawner/metrics_test.go, line 72:
<comment>This histogram test is too weak: `CollectAndCount` does not verify that `Observe` increased histogram samples, so regressions in observation logic can go undetected.</comment>
<file context>
@@ -0,0 +1,76 @@
+
+func TestDiscoveryDurationSecondsHistogram(t *testing.T) {
+ discoveryDurationSeconds.Observe(1.5)
+ count := testutil.CollectAndCount(discoveryDurationSeconds)
+ if count == 0 {
+ t.Error("expected discoveryDurationSeconds to have collected metrics")
</file context>
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/deploy-dev.yaml">
<violation number="1" location=".github/workflows/deploy-dev.yaml:87">
P2: The spawner PodMonitoring is pinned to `kelos-system` instead of the workflow’s `KELOS_NAMESPACE`, so spawner metrics won’t be scraped when spawners run in a different namespace.</violation>
</file>
9ae3010 to c9819a6
Enable the metrics server in kelos-spawner and instrument the discovery cycle with five new metrics: discovery_total, discovery_errors_total, items_discovered_total, tasks_created_total, and discovery_duration_seconds. Add a metrics container port to spawner Deployments with drift detection so existing deployments pick up the port on the next reconcile. Also wire the controller's --metrics-bind-address flag, which was parsed but never passed to the controller-runtime Manager.
Apply Google Cloud PodMonitoring resources for both controller and spawner in the deploy-dev workflow, keeping the GCP-specific CRD out of the generic Helm chart.
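The drift detection mentioned in this commit message can be pictured as comparing the ports on an existing Deployment's container with the desired ones and updating when they differ. The following is a minimal stdlib-only sketch of that idea, not the code in this PR: `ContainerPort` is a simplified, hypothetical stand-in for the Kubernetes container port type, and `portsDrifted` is an illustrative name.

```go
package main

import (
	"fmt"
	"reflect"
)

// ContainerPort is a simplified, hypothetical stand-in for a Kubernetes
// container port entry.
type ContainerPort struct {
	Name          string
	ContainerPort int32
	Protocol      string
}

// portsDrifted reports whether the existing ports differ from the desired
// ports. Nil and empty slices are treated as equal to avoid spurious updates.
func portsDrifted(existing, desired []ContainerPort) bool {
	if len(existing) == 0 && len(desired) == 0 {
		return false
	}
	return !reflect.DeepEqual(existing, desired)
}

func main() {
	desired := []ContainerPort{{Name: "metrics", ContainerPort: 8080, Protocol: "TCP"}}

	// A Deployment created before the metrics port existed has no ports.
	var existing []ContainerPort

	if portsDrifted(existing, desired) {
		// In a reconciler this is where the Deployment would be updated.
		existing = desired
		fmt.Println("ports updated:", existing)
	}
}
```

On the next reconcile after the update, `portsDrifted` returns false, so the Deployment is not rewritten again.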
c9819a6 to 8d741ac
This was referenced Mar 28, 2026
kelos-bot (bot) pushed a commit that referenced this pull request on Mar 29, 2026
Add three new agent conventions from recent PR review feedback:
1. Per-TaskSpawner configuration should be CRD fields, not controller flags (PR #838 - gjkim42 review)
2. CRD API backward compatibility - never rename JSON field tags (PR #838 - P1 review finding)
3. Gate optional CRDs behind Helm values flags (PR #821 - PodMonitoring broke installs on clusters without monitoring.googleapis.com)
Also includes previously proposed conventions from PR #786:
- Consistent guidance across surfaces
- Provider-agnostic API design
- Idiomatic Helm values
- Deploy-dev workflow sync
- Controller-driven migration
- Release note user action requirements
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds Prometheus metrics export to the kelos-spawner, which previously had metrics explicitly disabled (`BindAddress: "0"`). This brings spawner observability in line with the controller.
Spawner changes:
- Enable the metrics server on `:8080`
- Add five metrics: `kelos_spawner_discovery_total`, `kelos_spawner_discovery_errors_total`, `kelos_spawner_items_discovered_total`, `kelos_spawner_tasks_created_total`, `kelos_spawner_discovery_duration_seconds`
- Instrument `runCycleWithSource` to record these metrics
- Add a `metrics` container port (8080/TCP) to spawner Deployments
Controller fix:
- Wire the `--metrics-bind-address` flag (previously parsed but unused) through to the controller-runtime Manager
Which issue(s) this PR is related to:
N/A
Special notes for your reviewer:
- The default metrics bind address (`:8080`) matches the deployment template, so behavior is unchanged.
Does this PR introduce a user-facing change?
🤖 Generated with Claude Code
Summary by cubic
Adds Prometheus metrics to `kelos-spawner` on :8080 and instruments the discovery cycle for better visibility. Also wires the controller metrics bind address and deploys PodMonitoring in dev.
New Features
- Enable the `kelos-spawner` metrics server on :8080 and expose a `metrics` container port.
- Add metrics: `kelos_spawner_discovery_total`, `kelos_spawner_discovery_errors_total`, `kelos_spawner_items_discovered_total`, `kelos_spawner_tasks_created_total`, `kelos_spawner_discovery_duration_seconds`.
- Deploy PodMonitoring in dev for the controller (`kelos-system`) and spawner in `${KELOS_NAMESPACE}`; excludes CronJob pods.
Bug Fixes
- Pass `--metrics-bind-address` to the controller-runtime Manager.
- Include `Ports` in Deployment drift updates so existing spawners pick up the metrics port.
Written for commit 8d741ac. Summary will update on new commits.
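The "parsed but never passed" bug class behind the `--metrics-bind-address` fix can be illustrated with a small stdlib-only sketch. It assumes a `:8080` default and uses a hypothetical `managerOptions` struct in place of the real controller-runtime options; it is not the controller's actual code.

```go
package main

import (
	"flag"
	"fmt"
)

// managerOptions is a hypothetical stand-in for the options handed to the
// manager; the real controller-runtime type is not reproduced here.
type managerOptions struct {
	MetricsBindAddress string
}

// buildManagerOptions parses the flag and, crucially, copies the parsed value
// into the options. Omitting that copy is the "parsed but unused" bug.
func buildManagerOptions(args []string) (managerOptions, error) {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	addr := fs.String("metrics-bind-address", ":8080", "address the metrics endpoint binds to")
	if err := fs.Parse(args); err != nil {
		return managerOptions{}, err
	}
	return managerOptions{MetricsBindAddress: *addr}, nil
}

func main() {
	opts, err := buildManagerOptions([]string{"--metrics-bind-address=:9090"})
	if err != nil {
		panic(err)
	}
	fmt.Println("metrics bind address:", opts.MetricsBindAddress)
}
```

Keeping the parse and the options construction in one function makes the wiring straightforward to unit test with different argument lists.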