feat(spawner): add Prometheus metrics export by gjkim42 · Pull Request #821 · kelos-dev/kelos

gjkim42 · 2026-03-28T07:31:46Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds Prometheus metrics export to the kelos-spawner, which previously had metrics explicitly disabled (BindAddress: "0"). This brings spawner observability in line with the controller.

Spawner changes:

Enables the metrics server on :8080
Adds 5 new metrics: kelos_spawner_discovery_total, kelos_spawner_discovery_errors_total, kelos_spawner_items_discovered_total, kelos_spawner_tasks_created_total, kelos_spawner_discovery_duration_seconds
Instruments runCycleWithSource to record these metrics
Adds a metrics container port (8080/TCP) to spawner Deployments
Adds a PodMonitoring resource for spawner pods

Controller fix:

Wires the existing --metrics-bind-address flag (previously parsed but unused) through to the controller-runtime Manager

Which issue(s) this PR is related to:

N/A

Special notes for your reviewer:

CronJob-based spawners (one-shot mode) do not start the manager, so metrics are not served in that mode — this is intentional since the process exits immediately.
The controller fix is a separate commit for easy review. The flag default (:8080) matches the deployment template, so behavior is unchanged.

Does this PR introduce a user-facing change?

Add Prometheus metrics export to kelos-spawner for discovery cycle observability

🤖 Generated with Claude Code

Summary by cubic

Adds Prometheus metrics to kelos-spawner on :8080 and instruments the discovery cycle for better visibility. Also wires the controller metrics bind address and deploys PodMonitoring in dev.

New Features
- Enable kelos-spawner metrics server on :8080 and expose a metrics container port.
- Export: kelos_spawner_discovery_total, kelos_spawner_discovery_errors_total, kelos_spawner_items_discovered_total, kelos_spawner_tasks_created_total, kelos_spawner_discovery_duration_seconds.
- Instrument discovery for counts, errors, and duration; success counted only after status write; CronJob spawners don’t serve metrics.
- Add GCP PodMonitoring for controller (kelos-system) and spawner in ${KELOS_NAMESPACE}; excludes CronJob pods.
Bug Fixes
- Pass controller --metrics-bind-address to the controller-runtime Manager.
- Include container Ports in Deployment drift updates so existing spawners pick up the metrics port.

^{Written for commit 8d741ac. Summary will update on new commits.}

cubic-dev-ai

2 issues found across 7 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="cmd/kelos-spawner/metrics_test.go">

<violation number="1" location="cmd/kelos-spawner/metrics_test.go:72">
P3: This histogram test is too weak: `CollectAndCount` does not verify that `Observe` increased histogram samples, so regressions in observation logic can go undetected.</violation>
</file>

<file name="internal/manifests/charts/kelos/templates/podmonitoring.yaml">

<violation number="1" location="internal/manifests/charts/kelos/templates/podmonitoring.yaml:2">
P2: PodMonitoring is a non-core CRD; rendering it unconditionally will cause Helm installs to fail on clusters that don’t have monitoring.googleapis.com/v1 installed (e.g., the kind cluster shown in the quick start). Gate these resources behind a values flag or add an alternative (e.g., ServiceMonitor) so installs don’t break on vanilla clusters.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

internal/manifests/charts/kelos/templates/podmonitoring.yaml

cubic-dev-ai · 2026-03-28T07:37:18Z

cmd/kelos-spawner/metrics_test.go

+
+func TestDiscoveryDurationSecondsHistogram(t *testing.T) {
+	discoveryDurationSeconds.Observe(1.5)
+	count := testutil.CollectAndCount(discoveryDurationSeconds)


P3: This histogram test is too weak: CollectAndCount does not verify that Observe increased histogram samples, so regressions in observation logic can go undetected.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At cmd/kelos-spawner/metrics_test.go, line 72: <comment>This histogram test is too weak: `CollectAndCount` does not verify that `Observe` increased histogram samples, so regressions in observation logic can go undetected.</comment> <file context> @@ -0,0 +1,76 @@ + +func TestDiscoveryDurationSecondsHistogram(t *testing.T) { + discoveryDurationSeconds.Observe(1.5) + count := testutil.CollectAndCount(discoveryDurationSeconds) + if count == 0 { + t.Error("expected discoveryDurationSeconds to have collected metrics") </file context>

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/deploy-dev.yaml">

<violation number="1" location=".github/workflows/deploy-dev.yaml:87">
P2: The spawner PodMonitoring is pinned to `kelos-system` instead of the workflow’s `KELOS_NAMESPACE`, so spawner metrics won’t be scraped when spawners run in a different namespace.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

.github/workflows/deploy-dev.yaml

Enable the metrics server in kelos-spawner and instrument the discovery cycle with five new metrics: discovery_total, discovery_errors_total, items_discovered_total, tasks_created_total, and discovery_duration_seconds. Add a metrics container port to spawner Deployments with drift detection so existing deployments pick up the port on the next reconcile. Also wire the controller's --metrics-bind-address flag, which was parsed but never passed to the controller-runtime Manager.

Apply Google Cloud PodMonitoring resources for both controller and spawner in the deploy-dev workflow, keeping the GCP-specific CRD out of the generic Helm chart.

Add three new agent conventions from recent PR review feedback: 1. Per-TaskSpawner configuration should be CRD fields, not controller flags (PR #838 - gjkim42 review) 2. CRD API backward compatibility - never rename JSON field tags (PR #838 - P1 review finding) 3. Gate optional CRDs behind Helm values flags (PR #821 - PodMonitoring broke installs on clusters without monitoring.googleapis.com) Also includes previously proposed conventions from PR #786: - Consistent guidance across surfaces - Provider-agnostic API design - Idiomatic Helm values - Deploy-dev workflow sync - Controller-driven migration - Release note user action requirements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added kind/feature Categorizes issue or PR as related to a new feature needs-triage needs-priority needs-actor release-note labels Mar 28, 2026

cubic-dev-ai bot reviewed Mar 28, 2026

View reviewed changes

.github/workflows/deploy-dev.yaml Outdated Show resolved Hide resolved

gjkim42 force-pushed the feat/spawner-prometheus-metrics branch 7 times, most recently from 9ae3010 to c9819a6 Compare March 28, 2026 13:13

gjkim42 added 2 commits March 28, 2026 22:16

feat(ci): deploy PodMonitoring for metrics scraping in dev

8d741ac

Apply Google Cloud PodMonitoring resources for both controller and spawner in the deploy-dev workflow, keeping the GCP-specific CRD out of the generic Helm chart.

gjkim42 force-pushed the feat/spawner-prometheus-metrics branch from c9819a6 to 8d741ac Compare March 28, 2026 13:16

gjkim42 merged commit 5963cc9 into main Mar 28, 2026
9 checks passed

gjkim42 deleted the feat/spawner-prometheus-metrics branch March 28, 2026 13:16

gjkim42 added priority/important-longterm triage-accepted labels Mar 28, 2026

github-actions bot removed needs-triage needs-priority labels Mar 28, 2026

This was referenced Mar 28, 2026

Replace app.kubernetes.io/* labels with kelos.dev/* labels #826

Merged

Add CRD design, API compatibility, and Helm gating conventions #786

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(spawner): add Prometheus metrics export#821

feat(spawner): add Prometheus metrics export#821
gjkim42 merged 2 commits intomainfrom
feat/spawner-prometheus-metrics

gjkim42 commented Mar 28, 2026 •

edited by cubic-dev-ai bot

Loading

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

Uh oh!

cubic-dev-ai bot Mar 28, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gjkim42 commented Mar 28, 2026 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Summary by cubic

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai bot Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gjkim42 commented Mar 28, 2026 •

edited by cubic-dev-ai bot

Loading

cubic-dev-ai bot Mar 28, 2026 •

edited

Loading