Skip to content

feat(spawner): add Prometheus metrics export#821

Merged
gjkim42 merged 2 commits intomainfrom
feat/spawner-prometheus-metrics
Mar 28, 2026
Merged

feat(spawner): add Prometheus metrics export#821
gjkim42 merged 2 commits intomainfrom
feat/spawner-prometheus-metrics

Conversation

@gjkim42
Copy link
Copy Markdown
Collaborator

@gjkim42 gjkim42 commented Mar 28, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds Prometheus metrics export to the kelos-spawner, which previously had metrics explicitly disabled (BindAddress: "0"). This brings spawner observability in line with the controller.

Spawner changes:

  • Enables the metrics server on :8080
  • Adds 5 new metrics: kelos_spawner_discovery_total, kelos_spawner_discovery_errors_total, kelos_spawner_items_discovered_total, kelos_spawner_tasks_created_total, kelos_spawner_discovery_duration_seconds
  • Instruments runCycleWithSource to record these metrics
  • Adds a metrics container port (8080/TCP) to spawner Deployments
  • Adds a PodMonitoring resource for spawner pods

Controller fix:

  • Wires the existing --metrics-bind-address flag (previously parsed but unused) through to the controller-runtime Manager

Which issue(s) this PR is related to:

N/A

Special notes for your reviewer:

  • CronJob-based spawners (one-shot mode) do not start the manager, so metrics are not served in that mode — this is intentional since the process exits immediately.
  • The controller fix is a separate commit for easy review. The flag default (:8080) matches the deployment template, so behavior is unchanged.

Does this PR introduce a user-facing change?

Add Prometheus metrics export to kelos-spawner for discovery cycle observability

🤖 Generated with Claude Code


Summary by cubic

Adds Prometheus metrics to kelos-spawner on :8080 and instruments the discovery cycle for better visibility. Also wires the controller metrics bind address and deploys PodMonitoring in dev.

  • New Features

    • Enable kelos-spawner metrics server on :8080 and expose a metrics container port.
    • Export: kelos_spawner_discovery_total, kelos_spawner_discovery_errors_total, kelos_spawner_items_discovered_total, kelos_spawner_tasks_created_total, kelos_spawner_discovery_duration_seconds.
    • Instrument discovery for counts, errors, and duration; success counted only after status write; CronJob spawners don’t serve metrics.
    • Add GCP PodMonitoring for controller (kelos-system) and spawner in ${KELOS_NAMESPACE}; excludes CronJob pods.
  • Bug Fixes

    • Pass controller --metrics-bind-address to the controller-runtime Manager.
    • Include container Ports in Deployment drift updates so existing spawners pick up the metrics port.

Written for commit 8d741ac. Summary will update on new commits.

@github-actions github-actions bot added kind/feature Categorizes issue or PR as related to a new feature needs-triage needs-priority needs-actor release-note labels Mar 28, 2026
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="cmd/kelos-spawner/metrics_test.go">

<violation number="1" location="cmd/kelos-spawner/metrics_test.go:72">
P3: This histogram test is too weak: `CollectAndCount` does not verify that `Observe` increased histogram samples, so regressions in observation logic can go undetected.</violation>
</file>

<file name="internal/manifests/charts/kelos/templates/podmonitoring.yaml">

<violation number="1" location="internal/manifests/charts/kelos/templates/podmonitoring.yaml:2">
P2: PodMonitoring is a non-core CRD; rendering it unconditionally will cause Helm installs to fail on clusters that don’t have monitoring.googleapis.com/v1 installed (e.g., the kind cluster shown in the quick start). Gate these resources behind a values flag or add an alternative (e.g., ServiceMonitor) so installs don’t break on vanilla clusters.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


func TestDiscoveryDurationSecondsHistogram(t *testing.T) {
discoveryDurationSeconds.Observe(1.5)
count := testutil.CollectAndCount(discoveryDurationSeconds)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: This histogram test is too weak: CollectAndCount does not verify that Observe increased histogram samples, so regressions in observation logic can go undetected.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At cmd/kelos-spawner/metrics_test.go, line 72:

<comment>This histogram test is too weak: `CollectAndCount` does not verify that `Observe` increased histogram samples, so regressions in observation logic can go undetected.</comment>

<file context>
@@ -0,0 +1,76 @@
+
+func TestDiscoveryDurationSecondsHistogram(t *testing.T) {
+	discoveryDurationSeconds.Observe(1.5)
+	count := testutil.CollectAndCount(discoveryDurationSeconds)
+	if count == 0 {
+		t.Error("expected discoveryDurationSeconds to have collected metrics")
</file context>
Fix with Cubic

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/deploy-dev.yaml">

<violation number="1" location=".github/workflows/deploy-dev.yaml:87">
P2: The spawner PodMonitoring is pinned to `kelos-system` instead of the workflow’s `KELOS_NAMESPACE`, so spawner metrics won’t be scraped when spawners run in a different namespace.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@gjkim42 gjkim42 force-pushed the feat/spawner-prometheus-metrics branch 7 times, most recently from 9ae3010 to c9819a6 Compare March 28, 2026 13:13
gjkim42 added 2 commits March 28, 2026 22:16
Enable the metrics server in kelos-spawner and instrument the discovery
cycle with five new metrics: discovery_total, discovery_errors_total,
items_discovered_total, tasks_created_total, and
discovery_duration_seconds.

Add a metrics container port to spawner Deployments with drift detection
so existing deployments pick up the port on the next reconcile.

Also wire the controller's --metrics-bind-address flag, which was parsed
but never passed to the controller-runtime Manager.
Apply Google Cloud PodMonitoring resources for both controller and
spawner in the deploy-dev workflow, keeping the GCP-specific CRD out
of the generic Helm chart.
@gjkim42 gjkim42 force-pushed the feat/spawner-prometheus-metrics branch from c9819a6 to 8d741ac Compare March 28, 2026 13:16
@gjkim42 gjkim42 merged commit 5963cc9 into main Mar 28, 2026
9 checks passed
@gjkim42 gjkim42 deleted the feat/spawner-prometheus-metrics branch March 28, 2026 13:16
kelos-bot bot pushed a commit that referenced this pull request Mar 29, 2026
Add three new agent conventions from recent PR review feedback:

1. Per-TaskSpawner configuration should be CRD fields, not controller
   flags (PR #838 - gjkim42 review)
2. CRD API backward compatibility - never rename JSON field tags
   (PR #838 - P1 review finding)
3. Gate optional CRDs behind Helm values flags (PR #821 - PodMonitoring
   broke installs on clusters without monitoring.googleapis.com)

Also includes previously proposed conventions from PR #786:
- Consistent guidance across surfaces
- Provider-agnostic API design
- Idiomatic Helm values
- Deploy-dev workflow sync
- Controller-driven migration
- Release note user action requirements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant