
HYPERFLEET-856 - feat: add deletion observability metrics and alerts#115

Open
rafabene wants to merge 2 commits into openshift-hyperfleet:main from rafabene:HYPERFLEET-856-add-deletion-observability-metrics

Conversation

Contributor

@rafabene rafabene commented Apr 28, 2026

Summary

  • Add three API-side Prometheus metrics for deletion observability:
    • hyperfleet_api_resource_pending_deletion_total (counter) — tracks resources entering Pending Deletion state
    • hyperfleet_api_resource_pending_deletion_duration_seconds (histogram) — measures soft-delete to hard-delete duration (populated when hard-delete flow lands)
    • hyperfleet_api_resource_pending_deletion_stuck (gauge via collector) — queries DB on each scrape for resources stuck beyond configurable threshold
  • Add PrometheusRule template with two alerts:
    • HyperFleetResourceDeletionStuck (warning) — resources stuck >30min for 5m
    • HyperFleetResourceDeletionTimeout (critical) — resources stuck >30min for 30m
  • Add --metrics-deletion-stuck-threshold config flag (default 30m)
  • Instrument SoftDelete in cluster and nodepool services with RecordPendingDeletion()
  • Add unit tests for all metrics (counter, histogram, collector, labels, buckets, reset)
  • Add integration test for PendingDeletionCollector against real PostgreSQL via testcontainers
  • Document metrics, alerts, and PromQL queries in docs/metrics.md

Test plan

  • make test — 796 unit tests passing
  • make lint — 0 issues
  • make test-helm — all chart templates OK (including new PrometheusRule)
  • make test-integration — 93 integration tests passing (2 new for PendingDeletionCollector)

HYPERFLEET-856

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics for monitoring resource deletion progress, including pending deletion counters and duration tracking.
    • Added alerts for detecting stuck pending deletions at warning and critical severity levels.
    • Introduced configurable threshold for identifying stuck deletions (default: 30 minutes).
  • Documentation

    • Added deletion observability section with metrics examples and alert details.
  • Tests

    • Added integration tests for pending deletion monitoring.

@openshift-ci openshift-ci Bot requested review from Mischulee and jsell-rh April 28, 2026 14:10

openshift-ci Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jsell-rh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This pull request introduces observability for resources stuck in "Pending Deletion" state across the hyperfleet-api system. It adds a new deletion_stuck_threshold configuration option (default 30m) to identify resources whose soft-deletion has not completed. The metrics subsystem records pending-deletion events via a counter and duration histogram labeled by resource type, exposes a gauge for stuck resources queried directly from the database at scrape time, and defines two Prometheus alert rules (warning at 5m, critical at 30m) that fire when stuck resources are detected. Service layer modifications record metrics during soft-delete operations for both clusters and node pools, while database migrations add performance indexes on deleted_time columns. Configuration validation ensures the threshold is positive, and comprehensive test coverage validates the metrics recording and collector behavior.

Sequence Diagram

sequenceDiagram
    participant Client as User/Client
    participant Service as Service Layer
    participant Metrics as Metrics Subsystem
    participant DB as Database
    participant Collector as Pending Deletion Collector
    participant Prometheus as Prometheus

    Client->>Service: Delete Cluster/NodePool (SoftDelete)
    Service->>DB: Set deleted_time, deleted_by, increment generation
    DB-->>Service: Persisted
    Service->>Metrics: RecordPendingDeletion(resource_type)
    Metrics->>Metrics: Increment counter, observe duration
    Prometheus->>Collector: Scrape metrics
    Collector->>DB: SELECT COUNT(*) WHERE deleted_time < (now - threshold)
    DB-->>Collector: Count of stuck resources per type
    Collector->>Prometheus: Emit hyperfleet_api_resource_pending_deletion_stuck gauge
    Prometheus->>Prometheus: Evaluate alert rules
    alt Stuck > 0 for 5m
        Prometheus-->>Client: Fire HyperFleetResourceDeletionStuckWarning
    end
    alt Stuck > 0 for 30m
        Prometheus-->>Client: Fire HyperFleetResourceDeletionStuckCritical
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Coverage is 45.45%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed. The PR title directly and clearly describes the main change: adding deletion observability metrics and alerts, the primary objective across all modified files.
  • Linked Issues check: ✅ Passed (check skipped; no linked issues were found for this pull request)
  • Out of Scope Changes check: ✅ Passed (check skipped; no linked issues were found for this pull request)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/config/metrics.go (1)

16-30: ⚠️ Potential issue | 🟠 Major

Reject non-positive deletion thresholds.

DeletionStuckThreshold is only marked required, so a negative duration still gets through. That would invert the cutoff used by the stuck-resource collector and make the new deletion alerts/metrics meaningless. Please add an explicit positive-value check in config validation before this value is consumed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/config/metrics.go` around lines 16 - 30, Add explicit validation to
reject non-positive DeletionStuckThreshold values: in the MetricsConfig
validation path (e.g., the method that validates or finalizes MetricsConfig
before use) check MetricsConfig.DeletionStuckThreshold > 0 and return an error
if it is <= 0; update any constructor/factory logic (NewMetricsConfig remains
fine) to rely on this validation so the stuck-resource collector and alerting
never receive a zero or negative duration. Ensure the error message references
DeletionStuckThreshold so callers know which config field is invalid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alerts HyperFleetResourceDeletionStuck and
HyperFleetResourceDeletionTimeout currently use the raw metric
hyperfleet_api_resource_terminating_stuck > 0 which will fire per pod; change
the expr to aggregate across pods (e.g., use max by
(resource_type)(hyperfleet_api_resource_terminating_stuck) > 0) to deduplicate,
and replace the hard-coded durations in the annotations' description ("30
minutes"/"1 hour") with the templated for values (use the same {{
.Values.prometheusRule.rules.deletionStuck.for }} and {{
.Values.prometheusRule.rules.deletionTimeout.for }} references used for the for
fields) so the text matches configured alert durations for the alerts named
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout.

In `@pkg/metrics/deletion.go`:
- Around line 134-149: Collect currently uses blocking DB calls via QueryRow;
change TerminatingCollector.Collect to use QueryRowContext with a bounded
context deadline so scrapes fail fast on slow DBs: create a context with timeout
(e.g., ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
and defer cancel()) and call c.db.QueryRowContext(ctx, q.query, threshold)
instead of c.db.QueryRow(...); ensure TerminatingCollector has a configurable
query timeout field (e.g., queryTimeout time.Duration) or use a sensible
constant, and preserve the existing error handling (log on Scan error) while
returning/continuing promptly when the context times out.

In `@pkg/services/cluster.go`:
- Around line 132-136: The cluster cascade path bypasses the node pool
soft-delete hook so nodepool termination metrics are not recorded; update the
cluster deletion flow (around the call to s.nodePoolDao.SoftDeleteByOwner) to
record nodepool terminations by invoking metrics.RecordTerminating("nodepool")
for the affected node pools (or ensure SoftDeleteByOwner emits that metric), so
the existing node_pool.go metric counts are preserved when deletions are
triggered by cluster cascade.

---

Outside diff comments:
In `@pkg/config/metrics.go`:
- Around line 16-30: Add explicit validation to reject non-positive
DeletionStuckThreshold values: in the MetricsConfig validation path (e.g., the
method that validates or finalizes MetricsConfig before use) check
MetricsConfig.DeletionStuckThreshold > 0 and return an error if it is <= 0;
update any constructor/factory logic (NewMetricsConfig remains fine) to rely on
this validation so the stuck-resource collector and alerting never receive a
zero or negative duration. Ensure the error message references
DeletionStuckThreshold so callers know which config field is invalid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2ad75a9a-1f66-4293-8c1f-68b67c9c0ec6

📥 Commits

Reviewing files that changed from the base of the PR and between 12333a0 and e63fdd4.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go

Comment thread pkg/metrics/deletion.go Outdated
Comment thread pkg/services/cluster.go Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 6487c64 to d429144 Compare April 28, 2026 14:28

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert uses aggregation "max by (resource_type)" over
hyperfleet_api_resource_terminating_stuck which merges across releases because
the metric lacks a namespace label; don't attempt to add namespace to the
aggregation without the label. Fix by one of the suggested approaches: (a) add a
relabeling rule in the ServiceMonitor (or ServiceMonitor template) to preserve
__meta_kubernetes_namespace as a metric label and change the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout to
aggregate with "max by (namespace, resource_type)"; or (b) emit "namespace" from
the application so hyperfleet_api_resource_terminating_stuck includes namespace
and then aggregate by namespace; or (c) require and validate that
serviceMonitor.namespace is set (documenting multi-release limitations) so
scraping is release-scoped. Reference symbols to change: the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout,
the metric hyperfleet_api_resource_terminating_stuck, and the
ServiceMonitor/serviceMonitor.namespace/relabel_configs to implement option (a).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: a59c3d7c-dcbd-4ed1-90ac-ec5a21b50b8b

📥 Commits

Reviewing files that changed from the base of the PR and between 6019453 and d429144.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/metrics/deletion_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/config/flags.go
  • pkg/services/cluster.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
  • docs/metrics.md
  • pkg/config/metrics.go
  • charts/values.yaml
  • pkg/metrics/deletion.go

Comment thread charts/templates/prometheusrule.yaml Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from d429144 to 79ebbb0 Compare April 28, 2026 14:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
charts/templates/prometheusrule.yaml (1)

16-37: ⚠️ Potential issue | 🟠 Major

Align the alert aggregation with the metric labels.

hyperfleet_api_resource_terminating_stuck only exports resource_type, component, and version, so max by (namespace, resource_type) collapses releases that share Prometheus and does not scope the alert by namespace. Either add a namespace label at scrape/metric time or remove namespace from the aggregation.

Please verify the label mismatch with a read-only search:

#!/bin/bash
set -euo pipefail
rg -n 'labelResourceType|labelComponent|labelVersion|stuckDesc|max by \(namespace, resource_type\)' pkg/metrics/deletion.go charts/templates/prometheusrule.yaml

Expected result: the metric definition has no namespace label while the alert rule groups by namespace, confirming the cross-release aggregation risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/templates/prometheusrule.yaml` around lines 16 - 37, The alert
aggregation uses "max by (namespace, resource_type)" but the metric
hyperfleet_api_resource_terminating_stuck (defined in pkg/metrics/deletion.go
via labelResourceType/labelComponent/labelVersion) does not include a namespace
label, causing cross-release aggregation; fix the two alerts
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in
charts/templates/prometheusrule.yaml by removing "namespace" from the
aggregation (change "max by (namespace, resource_type)" to "max by
(resource_type)") or alternatively ensure the metric is exported with a
namespace label at scrape time—pick one approach and apply it consistently to
both alert rules and their descriptions/runbook handling.
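The simpler of the two fixes, sketched as the relevant rule fragment. The `expr` lines follow the comment's recommendation; the `for` templating mirrors the `| default` style the chart already uses elsewhere, and the exact field layout is illustrative rather than the chart's verbatim template:

```yaml
- alert: HyperFleetResourceDeletionStuck
  expr: max by (resource_type) (hyperfleet_api_resource_terminating_stuck) > 0
  for: {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}
  labels:
    severity: warning
- alert: HyperFleetResourceDeletionTimeout
  expr: max by (resource_type) (hyperfleet_api_resource_terminating_stuck) > 0
  for: {{ .Values.prometheusRule.rules.deletionTimeout.for | default "30m" }}
  labels:
    severity: critical
```

Aggregating only over labels the metric actually exports avoids silently merging series; grouping by a label that does not exist neither errors nor filters.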
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 321-339: The PromQL aggregations are malformed and need to be
rewritten to the proper "sum by (...) (...)" form and the first rate() converted
to per-minute: change sum(rate(hyperfleet_api_resource_terminating_total[5m]))
by (resource_type) to sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_total[5m])) * 60; change
sum(hyperfleet_api_resource_terminating_stuck) by (resource_type) to sum by
(resource_type) (hyperfleet_api_resource_terminating_stuck); for the average
terminating duration replace both numerator and denominator forms sum(... ) by
(resource_type) with sum by (resource_type) (...): sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m])) / sum by
(resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m])); and
update the histogram_quantile call to histogram_quantile(0.99, sum by (le,
resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m]))).
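Collected in one place, the corrected queries from this comment read as follows (metric names as given in the comment):

```promql
# Soft-deletes per minute, by resource type
sum by (resource_type) (rate(hyperfleet_api_resource_terminating_total[5m])) * 60

# Currently stuck resources, by resource type
sum by (resource_type) (hyperfleet_api_resource_terminating_stuck)

# Average terminating duration
  sum by (resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m]))
/ sum by (resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m]))

# p99 terminating duration
histogram_quantile(0.99,
  sum by (le, resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m])))
```

Note that `histogram_quantile` requires `le` to survive the aggregation, which is why it appears in the `by` clause.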

In `@pkg/services/cluster.go`:
- Around line 143-150: The metric emission for nodepool termination is happening
after the follow-up read (FindSoftDeletedByOwner) which can fail and skip
emitting metrics; update the deletion flow to emit the metrics earlier or change
the SoftDeleteByOwner API to return the affected count/list. Concretely: either
modify SoftDeleteByOwner(...) to return (affectedCount int, affectedPools
[]NodePool, err error) and call metrics.RecordTerminating("nodepool") based on
that return value inside the deletion path, or move the loop that calls
metrics.RecordTerminating("nodepool") to immediately after the
SoftDeleteByOwner(...) call (before calling FindSoftDeletedByOwner), ensuring
metrics.RecordTerminating is invoked even if subsequent reads fail; adjust
references to SoftDeleteByOwner, FindSoftDeletedByOwner, and
metrics.RecordTerminating("nodepool") accordingly.

---

Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert aggregation uses "max by (namespace,
resource_type)" but the metric hyperfleet_api_resource_terminating_stuck
(defined in pkg/metrics/deletion.go via
labelResourceType/labelComponent/labelVersion) does not include a namespace
label, causing cross-release aggregation; fix the two alerts
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in
charts/templates/prometheusrule.yaml by removing "namespace" from the
aggregation (change "max by (namespace, resource_type)" to "max by
(resource_type)") or alternatively ensure the metric is exported with a
namespace label at scrape time—pick one approach and apply it consistently to
both alert rules and their descriptions/runbook handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3252c284-6c38-4779-9ca9-a7fe3e9a0937

📥 Commits

Reviewing files that changed from the base of the PR and between d429144 and 79ebbb0.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/config/flags.go
  • pkg/services/node_pool.go
  • pkg/config/loader.go
  • pkg/config/metrics.go

Comment thread docs/metrics.md
Comment thread pkg/services/cluster.go
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 79ebbb0 to a29272c Compare April 28, 2026 15:04

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
pkg/services/cluster.go (1)

138-145: ⚠️ Potential issue | 🟠 Major

Nodepool terminating events can be permanently missed on read failure.

If FindSoftDeletedByOwner fails at Line 138 after SoftDeleteByOwner succeeded at Line 134, the function exits before Lines 143-145 emit nodepool metrics. Later retries short-circuit at Line 118, so those missed events are never recovered. Emit nodepool metric counts from the cascade write result (e.g., affected rows/returned IDs) instead of depending on a follow-up read.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/cluster.go` around lines 138 - 145, The current implementation
records nodepool terminating metrics by reading soft-deleted nodepools via
FindSoftDeletedByOwner (nodePools) after SoftDeleteByOwner, so if that read
fails the metrics are never emitted; update the cascade delete flow to emit
metrics from the cascade write result instead of the follow-up read: change
SoftDeleteByOwner (or the DAO method it calls) to return the affected row count
or the list of deleted IDs, have the caller capture that return value (instead
of relying on nodePools from FindSoftDeletedByOwner), and call
metrics.RecordTerminating the appropriate number of times (or once with the
count) based on that returned count/IDs; retain or remove the
FindSoftDeletedByOwner call as needed but ensure metrics emission no longer
depends on it.
charts/templates/prometheusrule.yaml (1)

17-17: ⚠️ Potential issue | 🟠 Major

namespace scoping in the alert expression is currently ineffective.

hyperfleet_api_resource_terminating_stuck doesn’t expose a namespace label, so max by (namespace, resource_type) still collapses across releases/namespaces for each resource_type. This can mix signals across deployments.

Run this to verify label shape and whether ServiceMonitor relabeling injects namespace:

#!/bin/bash
set -euo pipefail

echo "== Collector metric descriptor/emit labels =="
rg -n -C3 'resource_terminating_stuck|NewDesc\(|MustNewConstMetric\(' pkg/metrics/deletion.go

echo
echo "== Alert expressions =="
rg -n -C2 'HyperFleetResourceDeletion(Stuck|Timeout)|max by|hyperfleet_api_resource_terminating_stuck' charts/templates/prometheusrule.yaml

echo
echo "== ServiceMonitor relabeling (namespace injection) =="
fd -i 'servicemonitor.yaml' charts/templates | while read -r f; do
  echo "--- ${f} ---"
  rg -n -C3 '__meta_kubernetes_namespace|metricRelabelings|relabelings|targetLabels|namespace' "$f" || true
done

Expected: no namespace label in collector metric unless explicitly injected by scrape relabeling; if absent, current by(namespace, resource_type) doesn’t isolate per namespace.

Also applies to: 28-28

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by a
non-existent namespace label so signals are mixed; either inject namespace at
scrape time or stop grouping by it: update the Prometheus relabeling in your
ServiceMonitor (servicemonitor.yaml) to add target label "namespace" from
__meta_kubernetes_namespace so the collector metric
hyperfleet_api_resource_terminating_stuck has a namespace label, or change the
rule in charts/templates/prometheusrule.yaml (expr referencing max by
(namespace, resource_type)) to group only by existing labels (e.g., max by
(resource_type)) and remove namespace from the by() clause to avoid incorrect
scoping.
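Option (a) can be sketched as a ServiceMonitor fragment. The field paths follow the Prometheus Operator CRD; names and values here are illustrative, not the chart's actual manifest:

```yaml
# Preserve the scrape target's Kubernetes namespace as a metric label so
# the PrometheusRule can safely aggregate with:
#   max by (namespace, resource_type) (...)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperfleet-api
spec:
  endpoints:
    - port: metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
          action: replace
```

With the label injected at scrape time, the application code needs no change and each release's alerts stay scoped to its own namespace.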
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by a non-existent namespace label so signals are
mixed; either inject namespace at scrape time or stop grouping by it: update the
Prometheus relabeling in your ServiceMonitor (servicemonitor.yaml) to add target
label "namespace" from __meta_kubernetes_namespace so the collector metric
hyperfleet_api_resource_terminating_stuck has a namespace label, or change the
rule in charts/templates/prometheusrule.yaml (expr referencing max by
(namespace, resource_type)) to group only by existing labels (e.g., max by
(resource_type)) and remove namespace from the by() clause to avoid incorrect
scoping.

In `@pkg/services/cluster.go`:
- Around line 138-145: The current implementation records nodepool terminating
metrics by reading soft-deleted nodepools via FindSoftDeletedByOwner (nodePools)
after SoftDeleteByOwner, so if that read fails the metrics are never emitted;
update the cascade delete flow to emit metrics from the cascade write result
instead of the follow-up read: change SoftDeleteByOwner (or the DAO method it
calls) to return the affected row count or the list of deleted IDs, have the
caller capture that return value (instead of relying on nodePools from
FindSoftDeletedByOwner), and call metrics.RecordTerminating the appropriate
number of times (or once with the count) based on that returned count/IDs;
retain or remove the FindSoftDeletedByOwner call as needed but ensure metrics
emission no longer depends on it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2a707980-1e96-49d8-88f5-25adfb2a8e1f

📥 Commits

Reviewing files that changed from the base of the PR and between 79ebbb0 and a29272c.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/config/flags.go
  • pkg/services/node_pool.go
  • charts/values.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • docs/metrics.md

Comment thread charts/templates/prometheusrule.yaml Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch 2 times, most recently from 964d474 to 3f46062 Compare April 28, 2026 17:22
Comment thread charts/templates/prometheusrule.yaml Outdated
summary: "HyperFleet resources stuck in Pending Deletion state"
description: >-
{{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been in
Pending Deletion state for more than {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}.
Contributor


Is this description correct? Shouldn't it say 35m rather than 5m?

Contributor Author


Good catch! You're right — the pending_deletion_stuck metric already embeds a 30m threshold from the collector (config.metrics.deletion_stuck_threshold), so by the time the Warning alert fires after its 5m for delay, the resource has actually been stuck for ~35m, not 5m.

Fixed in 39d7989 — the description now shows both components: 30m (stuck threshold) + 5m (alert delay).

Comment thread charts/values.yaml
Comment on lines +274 to +277
for: "5m"
runbookUrl: ""
deletionTimeout:
for: "30m"
Contributor


Curious what was your reasoning behind these defaults?

Contributor Author


The deletionStuck.for: 5m is meant as an early warning — the collector already flags resources stuck beyond the 30m threshold (deletion_stuck_threshold), so 5m of sustained signal is just enough to filter out transient scrape noise before paging.

The deletionTimeout.for: 30m escalates to critical — at that point the resource has been stuck for ~60m total (30m threshold + 30m alert delay), which signals something is genuinely broken and needs immediate attention.

Happy to adjust if you think different values make more sense for the team's SLOs.

Contributor


I am leaning towards stupid defaults TBH, to avoid un-wanted noise for our partner teams. Most will prob just roll the defaults until they align on their SLO's. I would lean on warning after an hour and timeout at 2 or 3 hours.

Contributor Author


Makes total sense — bumped the defaults to conservative values: warning at 30m (fires ~1h total) and critical at 2h (fires ~2.5h total). Better to start quiet and let GCP/ROSA teams tighten when they define their SLOs, rather than generating noise out of the box.

Comment thread pkg/services/cluster.go

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/services/cluster.go (1)

138-145: ⚠️ Potential issue | 🟠 Major

Nodepool metric emission still depends on a brittle follow-up read.

If SoftDeleteByOwner succeeds but FindSoftDeletedByOwner fails, nodepools are already soft-deleted but no nodepool pending-deletion metrics are emitted, and retries won’t replay because the cluster is already marked deleted.

Consider moving emission to data returned directly by SoftDeleteByOwner (e.g., affected count/list) so it doesn’t depend on a second query.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/cluster.go` around lines 138 - 145, The current flow calls
s.nodePoolDao.SoftDeleteByOwner(...) then separately calls
s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if
the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the
affected nodepool count or list (e.g., return (count int, ids []string, error))
and update callers in cluster.go to call that result directly and invoke
metrics.RecordPendingDeletion("nodepool") using the returned count/list so
metric emission does not depend on FindSoftDeletedByOwner; ensure error handling
still returns on SoftDeleteByOwner failure and that the new return shape is
propagated where needed.
charts/templates/prometheusrule.yaml (1)

17-17: ⚠️ Potential issue | 🟠 Major

namespace grouping is still ineffective for this metric.

On Line 17 and Line 29, max by (namespace, resource_type) won’t scope alerts per namespace unless the metric actually has a namespace label. hyperfleet_api_resource_pending_deletion_stuck is emitted without namespace in pkg/metrics/deletion.go, so this can still merge signals across releases.

#!/bin/bash
# Verify metric labels for pending deletion stuck collector
rg -n -C3 'resource_pending_deletion_stuck|NewDesc\(|labelResourceType|labelComponent|labelVersion' pkg/metrics/deletion.go

# Verify whether ServiceMonitor injects namespace as a metric label
rg -n -C3 'ServiceMonitor|metricRelabelings|relabelings|__meta_kubernetes_namespace|namespace' charts/templates

Expected result:

  1. pkg/metrics/deletion.go shows only resource_type as variable label for this metric.
  2. No relabeling that preserves Kubernetes namespace into metric labels, confirming aggregation still cannot partition per namespace.

Also applies to: 29-29

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by
(namespace, resource_type) but the metric
hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace
label (see pkg/metrics/deletion.go), so change one of two things: either add a
"namespace" label to the metric at its declaration/recording in
pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector
sets the namespace value when observing), or if you cannot emit namespace,
remove namespace from the PrometheusRule expression in
charts/templates/prometheusrule.yaml and use max by (resource_type) (and the
identical change for the other occurrence at line 29) so alerts aren’t
misleadingly grouped by a non-existent label.
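The SoftDeleteByOwner refactor suggested for pkg/services/cluster.go can be sketched as below. The interface and function names mirror the review comment, but the exact signatures are assumptions; the point is that metric emission is driven by the rows the write actually touched, not a follow-up read.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// NodePoolDao sketches the suggested signature change: SoftDeleteByOwner
// returns the IDs of the rows it actually updated, so callers emit metrics
// from the write path instead of calling FindSoftDeletedByOwner afterwards.
// (Names follow the review comment; the exact signature is an assumption.)
type NodePoolDao interface {
	SoftDeleteByOwner(ctx context.Context, ownerID string, t time.Time, deletedBy string) ([]string, error)
}

// recordCascadeMetrics emits one pending-deletion sample per nodepool
// that this cascade actually soft-deleted.
func recordCascadeMetrics(affectedIDs []string, record func(resourceType string)) {
	for range affectedIDs {
		record("nodepool")
	}
}

func main() {
	emitted := 0
	recordCascadeMetrics([]string{"np-1", "np-2"}, func(string) { emitted++ })
	fmt.Println(emitted) // prints 2
}
```

With this shape, a transient read failure cannot drop the metric, and nodepools that were already soft-deleted before the cascade are never counted twice.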
🧹 Nitpick comments (1)
test/integration/deletion_metrics_test.go (1)

33-35: Assert delete response status to prevent false-positive test results.

Right now these checks only assert transport success (err == nil). If delete returns non-2xx, the second subtest can still pass spuriously.

Proposed test hardening
-		_, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+		delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
 		Expect(err).NotTo(HaveOccurred())
+		Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+		Expect(delResp.StatusCode()).To(BeNumerically("<", 300))
...
-		_, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+		delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
 		Expect(err).NotTo(HaveOccurred())
+		Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+		Expect(delResp.StatusCode()).To(BeNumerically("<", 300))

Also applies to: 61-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/deletion_metrics_test.go` around lines 33 - 35, The test
currently only checks that client.DeleteClusterByIdWithResponse returned no
transport error (err == nil) which can hide non-2xx delete responses; update the
two places calling DeleteClusterByIdWithResponse (the call using cluster.ID and
the second similar call around lines 61-63) to capture the response object and
assert its HTTP status is the expected success code (e.g., 200/204) using the
response's status/status code accessor before or in addition to
Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses;
reference the DeleteClusterByIdWithResponse call, the returned response
variable, and the Expect assertions when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/metrics/deletion.go`:
- Around line 135-136: Metrics scrape COUNT queries for clusters and node_pools
(the entries with "SELECT COUNT(*) FROM clusters WHERE deleted_time IS NOT NULL
AND deleted_time < $1" and "SELECT COUNT(*) FROM node_pools WHERE deleted_time
IS NOT NULL AND deleted_time < $1" in pkg/metrics/deletion.go) need supporting
indexes; add a migration that creates either simple indexes on deleted_time
(CREATE INDEX ON clusters(deleted_time), CREATE INDEX ON
node_pools(deleted_time)) or partial indexes filtering NOT NULL (CREATE INDEX
... WHERE deleted_time IS NOT NULL) and also drop the stale indexes left on the
old deleted_at column (drop idx_clusters_deleted_at and
idx_node_pools_deleted_at) so the metrics queries use the new indexes and avoid
full-table scans.
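The index changes described above can be sketched in SQL. Table and column names come from the queries quoted in the comment; the partial-index form is one of the two options the comment offers, and the exact index names are assumptions.

```sql
-- Partial indexes: only soft-deleted rows are indexed, which keeps the
-- index small and matches the collector's WHERE clause exactly.
CREATE INDEX idx_clusters_deleted_time
    ON clusters (deleted_time)
    WHERE deleted_time IS NOT NULL;

CREATE INDEX idx_node_pools_deleted_time
    ON node_pools (deleted_time)
    WHERE deleted_time IS NOT NULL;

-- Drop the stale indexes left on the old deleted_at column.
DROP INDEX IF EXISTS idx_clusters_deleted_at;
DROP INDEX IF EXISTS idx_node_pools_deleted_at;
```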

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ab847be3-0cd7-404c-b0c0-96a4da969b4e

📥 Commits

Reviewing files that changed from the base of the PR and between 79ebbb0 and 39d7989.

📒 Files selected for processing (13)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/config/flags.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • charts/values.yaml
  • pkg/config/metrics.go
  • docs/metrics.md

Comment thread pkg/metrics/deletion.go
@rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 39d7989 to 7b9a35f on April 29, 2026 13:58
@rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 7b9a35f to 429af79 on April 29, 2026 14:00
@coderabbitai Bot left a comment

♻️ Duplicate comments (1)
pkg/services/cluster.go (1)

134-160: ⚠️ Potential issue | 🟠 Major

Emit nodepool metrics from the write path, not the follow-up read.

This still derives the nodepool counter from FindSoftDeletedByOwner(), so a transient read failure drops the metric entirely, and any nodepools that were already soft-deleted before this cascade will be counted again. Please have SoftDeleteByOwner() return the affected rows/IDs, or move the metric emission into that write path.

♻️ Proposed fix
-	if cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy); cascadeErr != nil {
-		return nil, handleSoftDeleteError("NodePool", cascadeErr)
-	}
-
-	nodePools, err := s.nodePoolDao.FindSoftDeletedByOwner(ctx, id)
-	if err != nil {
-		return nil, errors.GeneralError("Failed to fetch cascade-deleted nodepools: %s", err)
-	}
-
-	for range nodePools {
+	nodePools, cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy)
+	if cascadeErr != nil {
+		return nil, handleSoftDeleteError("NodePool", cascadeErr)
+	}
+
+	for range nodePools {
 		metrics.RecordPendingDeletion("nodepool")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/services/cluster.go` around lines 134 - 160, The current metric emission
uses FindSoftDeletedByOwner() after SoftDeleteByOwner(), which can lose metrics
on transient reads and double-count pre-existing soft-deletes; change
nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of
affected nodepool IDs (or count), have SoftDeleteByOwner itself call
metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the
count), and update this caller (and any other callers) to stop iterating over
results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and
batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read
metric loop so metrics are emitted reliably from the write path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: b561f099-048e-47c0-8991-8c567303bafc

📥 Commits

Reviewing files that changed from the base of the PR and between 39d7989 and 429af79.

📒 Files selected for processing (15)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/config/flags.go
  • pkg/services/node_pool.go
  • docs/metrics.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/config/loader.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • charts/values.yaml
  • test/integration/deletion_metrics_test.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion_test.go

@rafabene (Contributor Author)

/retest

@rafabene (Contributor Author)

/retest

@rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 429af79 to 066adcf on May 4, 2026 14:04
@rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 066adcf to a170052 on May 4, 2026 14:06
@coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 18-30: The alert default "for" durations are too long; update the
template to use the intended SLO windows by changing the default for
.Values.prometheusRule.rules.deletionStuck.for from "30m" to "5m" and
.Values.prometheusRule.rules.deletionTimeout.for from "2h" to "30m" so the
HyperFleetResourceDeletionStuck warning/critical alerts (see the alert blocks
referencing deletionStuck.for and deletionTimeout.for and the
HyperFleetResourceDeletionStuckCritical alert) fire at the intended 5m and 30m
defaults when chart values are not provided.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7a48a973-4b64-4818-bed2-38e4966a17e0

📥 Commits

Reviewing files that changed from the base of the PR and between 429af79 and a170052.

📒 Files selected for processing (15)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (6)
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/config/loader.go
  • pkg/services/cluster.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/services/node_pool.go
  • pkg/config/metrics.go
  • test/integration/deletion_metrics_test.go
  • charts/values.yaml

Comment thread charts/templates/prometheusrule.yaml Outdated
@rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from d7b4c7d to fc6a6af on May 5, 2026 13:09
@coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
pkg/dao/mocks/node_pool.go (1)

38-41: 💤 Low value

Save appends instead of updates.

The Save method appends the nodepool to the slice rather than replacing an existing entry by ID. This could cause duplicate entries in tests that call Save on an existing nodepool. However, if the mock is only used for basic scenarios where this behavior is acceptable, this is a minor concern.

♻️ Optional fix for update semantics
 func (d *nodePoolDaoMock) Save(ctx context.Context, nodePool *api.NodePool) error {
+	for i, np := range d.nodePools {
+		if np.ID == nodePool.ID {
+			d.nodePools[i] = nodePool
+			return nil
+		}
+	}
 	d.nodePools = append(d.nodePools, nodePool)
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/dao/mocks/node_pool.go` around lines 38 - 41, The Save method on
nodePoolDaoMock currently appends instead of updating; modify
nodePoolDaoMock.Save(ctx context.Context, nodePool *api.NodePool) to search
d.nodePools for an existing entry with the same ID (compare nodePool.Id or
nodePool.ID field used by api.NodePool), replace that slice element with the
incoming nodePool if found, and only append to d.nodePools if no matching ID
exists; ensure you update the element in-place on d.nodePools to preserve slice
ordering and return nil as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/dao/mocks/cluster.go`:
- Around line 37-40: The mock Save method currently appends every cluster
causing duplicate IDs; change clusterDaoMock.Save to upsert by ID: search
d.clusters for an existing entry with the same cluster.ID and replace that
element if found, otherwise append the new cluster; update d.clusters in-place
so Get (and other methods) return the latest value for that ID.

In `@pkg/db/migrations/migration_structs.go`:
- Around line 35-38: MigrationList entries are out of ascending ID order; move
addReconciledIndex() so the list is sorted by migration ID. Edit the
MigrationList registration where addSoftDeleteSchema(),
addNodePoolOwnerDeletedIndex(), addReconciledIndex(), addDeletedTimeIndexes()
are listed and swap the positions of addNodePoolOwnerDeletedIndex and
addReconciledIndex so addReconciledIndex() appears before
addNodePoolOwnerDeletedIndex(), ensuring the migration registration order is
ascending and deterministic.

In `@pkg/services/node_pool.go`:
- Around line 102-123: The SoftDelete method's in-memory check
(nodePool.DeletedTime) can race; change the flow so the soft-delete is performed
atomically in the DB (or under row lock) to ensure only one request flips
deleted_time and increments metrics: replace the Get+check+Save pattern in
SoftDelete with a single conditional update (e.g., UPDATE ... SET
deleted_time=?, deleted_by=?, generation=generation+1 WHERE id=? AND
deleted_time IS NULL) or perform SELECT ... FOR UPDATE then update, and only
call metrics.RecordPendingDeletion("nodepool") when the conditional update
reports rows affected == 1 (or when the locked row shows it was previously NULL)
so concurrent requests do not both run side effects; keep references to
nodePoolDao.Save / nodePoolDao.Get and RecordPendingDeletion when implementing
the atomic update/locking behavior.

In `@pkg/telemetry/otel.go`:
- Around line 82-85: Fix the typo in the comment above the propagator selection
in otel.go: change "envivronment" to "environment" in the comment that
references OTEL_PROPAGATORS (the block immediately above the use of
autoprop.NewTextMapPropagator and otel.SetTextMapPropagator) so the comment
reads correctly.

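The atomic soft-delete suggested for pkg/services/node_pool.go amounts to a single conditional UPDATE. Column names follow the other queries quoted in this review; the rows-affected result is what gates the metric, so only the request that actually flips deleted_time calls RecordPendingDeletion.

```sql
-- Only one concurrent request can match the deleted_time IS NULL
-- predicate. The caller emits RecordPendingDeletion("nodepool") only
-- when this statement reports exactly one affected row.
UPDATE node_pools
   SET deleted_time = $1,
       deleted_by   = $2,
       generation   = generation + 1
 WHERE id = $3
   AND deleted_time IS NULL;
```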

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 109f893a-3660-4bd9-81fd-53fb80bc9a2d

📥 Commits

Reviewing files that changed from the base of the PR and between d7b4c7d and fc6a6af.

⛔ Files ignored due to path filters (3)
  • .bingo/golangci-lint.sum is excluded by !**/*.sum
  • .bingo/oapi-codegen.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (69)
  • .bingo/Variables.mk
  • .bingo/oapi-codegen.mod
  • .bingo/variables.env
  • CHANGELOG.md
  • Makefile
  • charts/templates/_helpers.tpl
  • charts/templates/deployment.yaml
  • charts/templates/postgresql.yaml
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/config.md
  • docs/database.md
  • docs/metrics.md
  • go.mod
  • hack/verify-migrations.sh
  • openapi/openapi.yaml
  • pkg/api/cluster_types.go
  • pkg/api/metadata_types.go
  • pkg/api/node_pool_types.go
  • pkg/api/presenters/cluster.go
  • pkg/api/presenters/node_pool.go
  • pkg/api/status_types.go
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/dao/adapter_status.go
  • pkg/dao/cluster.go
  • pkg/dao/mocks/cluster.go
  • pkg/dao/mocks/node_pool.go
  • pkg/dao/node_pool.go
  • pkg/db/migrations/202511111044_add_clusters.go
  • pkg/db/migrations/202511111055_add_node_pools.go
  • pkg/db/migrations/202511111105_add_adapter_status.go
  • pkg/db/migrations/202601210001_add_conditions_gin_index.go
  • pkg/db/migrations/202604160001_soft_delete_schema.go
  • pkg/db/migrations/202604211859_add_reconciled_index.go
  • pkg/db/migrations/202604230001_add_nodepool_owner_deleted_index.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/handlers/cluster.go
  • pkg/handlers/cluster_nodepools.go
  • pkg/handlers/cluster_nodepools_test.go
  • pkg/handlers/cluster_status.go
  • pkg/handlers/framework.go
  • pkg/handlers/node_pool.go
  • pkg/handlers/nodepool_status.go
  • pkg/handlers/rest.go
  • pkg/handlers/validation.go
  • pkg/handlers/validation_test.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/middleware/otel_test.go
  • pkg/services/aggregation.go
  • pkg/services/aggregation_test.go
  • pkg/services/cluster.go
  • pkg/services/cluster_test.go
  • pkg/services/generic.go
  • pkg/services/node_pool.go
  • pkg/services/node_pool_test.go
  • pkg/services/status_helpers.go
  • pkg/services/util.go
  • pkg/telemetry/otel.go
  • plugins/clusters/plugin.go
  • test/helper.go
  • test/integration/clusters_test.go
  • test/integration/deletion_metrics_test.go
  • test/integration/node_pools_test.go
💤 Files with no reviewable changes (7)
  • pkg/handlers/rest.go
  • pkg/db/migrations/202511111044_add_clusters.go
  • pkg/middleware/otel_test.go
  • pkg/db/migrations/202601210001_add_conditions_gin_index.go
  • pkg/db/migrations/202511111105_add_adapter_status.go
  • charts/templates/postgresql.yaml
  • pkg/db/migrations/202511111055_add_node_pools.go
✅ Files skipped from review due to trivial changes (12)
  • .bingo/oapi-codegen.mod
  • pkg/db/migrations/202604211859_add_reconciled_index.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/202604230001_add_nodepool_owner_deleted_index.go
  • pkg/api/status_types.go
  • CHANGELOG.md
  • docs/config.md
  • pkg/config/loader.go
  • pkg/config/flags.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/metrics.md
  • pkg/metrics/deletion.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • test/integration/deletion_metrics_test.go
  • pkg/config/metrics.go
  • pkg/metrics/deletion_test.go

Comment thread pkg/dao/mocks/cluster.go
Comment on lines +37 to +40
func (d *clusterDaoMock) Save(ctx context.Context, cluster *api.Cluster) error {
d.clusters = append(d.clusters, cluster)
return nil
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Save should upsert by ID, not always append

This mock currently accumulates duplicate IDs, which can make Get return stale data and reduce test reliability for update paths.

Proposed fix
 func (d *clusterDaoMock) Save(ctx context.Context, cluster *api.Cluster) error {
-	d.clusters = append(d.clusters, cluster)
+	for i, existing := range d.clusters {
+		if existing.ID == cluster.ID {
+			d.clusters[i] = cluster
+			return nil
+		}
+	}
+	d.clusters = append(d.clusters, cluster)
 	return nil
 }

Comment on lines +35 to +38
addSoftDeleteSchema(),
addNodePoolOwnerDeletedIndex(),
addReconciledIndex(),
addDeletedTimeIndexes(),

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Migration registration order is not ascending

MigrationList should follow ascending migration IDs, but addNodePoolOwnerDeletedIndex() is placed before addReconciledIndex(). Please reorder these entries to match ID order and keep migration execution deterministic.

Proposed fix
 	addSoftDeleteSchema(),
-	addNodePoolOwnerDeletedIndex(),
 	addReconciledIndex(),
+	addNodePoolOwnerDeletedIndex(),
 	addDeletedTimeIndexes(),

Comment thread pkg/services/node_pool.go
Comment thread pkg/telemetry/otel.go
Comment on lines +82 to +85
// Select the propagator based on envivronment variable OTEL_PROPAGATORS
// If OTEL_PROPAGATORS is not provided, uses default "tracecontext,baggage"
textMapProp := autoprop.NewTextMapPropagator()
otel.SetTextMapPropagator(textMapProp)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Typo in comment: "envivronment" → "environment".

-	// Select the propagator based on envivronment variable OTEL_PROPAGATORS
+	// Select the propagator based on environment variable OTEL_PROPAGATORS

@rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from fc6a6af to b24d3d0 on May 5, 2026 13:18
@coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openapi/openapi.yaml`:
- Around line 190-197: Remove the duplicate "type: Reconciled" condition entries
in the delete-response examples under status.conditions: locate the extra blocks
where "type: Reconciled" appears a second time (the delete examples) and delete
those duplicated condition objects so each example only contains one each of
Ready, Reconciled, and Available; ensure the remaining Reconciled condition
keeps its original fields (status, reason, message, observed_generation,
created_time, last_updated_time, last_transition_time).

In `@pkg/services/cluster.go`:
- Around line 127-134: The current metrics loop uses nodePools from
FindSoftDeletedByOwner which includes pre-existing soft-deletes and overcounts;
change SoftDeleteByOwner (on nodePoolDao) to return the list of affected
nodepool IDs or an integer count of rows actually updated, then in cluster.go
use that returned set/count (instead of calling FindSoftDeletedByOwner) to call
metrics.RecordPendingDeletion exactly once per newly soft-deleted row and to
drive any downstream status updates; update any callers of SoftDeleteByOwner to
handle the new return value and remove the separate FindSoftDeletedByOwner-based
metric emission.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3d172262-8cde-48ea-97f2-b0ef0dc73556

📥 Commits

Reviewing files that changed from the base of the PR and between fc6a6af and b24d3d0.

📒 Files selected for processing (16)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • openapi/openapi.yaml
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (5)
  • pkg/config/loader.go
  • docs/metrics.md
  • test/integration/deletion_metrics_test.go
  • pkg/metrics/deletion_test.go
  • pkg/metrics/deletion.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/config/metrics.go
  • pkg/config/flags.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/db/migrations/migration_structs.go

Comment thread openapi/openapi.yaml Outdated
Comment thread pkg/services/cluster.go Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from b24d3d0 to b4ffe3e Compare May 5, 2026 13:30
@rafabene
Contributor Author

rafabene commented May 5, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/metrics/deletion.go`:
- Around line 144-152: The shared timeout context created before the loop causes
the second DB query to occasionally fail with a deadline error; move the context
creation into the loop so each stuckQueries iteration gets its own timeout:
inside the for _, q := range stuckQueries loop create ctx, cancel :=
context.WithTimeout(context.Background(), c.queryTimeout) and use that ctx with
c.db.QueryRowContext(ctx, q.query, threshold), then call cancel() at the end of
the iteration (do not defer cancel() inside the loop). Remove the outer
ctx/cancel that currently surrounds the loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: f1367179-e9e7-4331-b984-70af03fca193

📥 Commits

Reviewing files that changed from the base of the PR and between fc6a6af and 6601f28.

📒 Files selected for processing (18)
  • charts/templates/prometheusrule.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • docs/metrics.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/metrics.go
  • pkg/dao/mocks/node_pool.go
  • pkg/dao/node_pool.go
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • pkg/db/migrations/migration_structs.go
  • pkg/metrics/deletion.go
  • pkg/metrics/deletion_test.go
  • pkg/services/cluster.go
  • pkg/services/node_pool.go
  • pkg/services/node_pool_test.go
  • test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (7)
  • pkg/db/migrations/migration_structs.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/config/loader.go
  • pkg/config/flags.go
  • cmd/hyperfleet-api/server/metrics_middleware.go
  • pkg/config/metrics.go
  • pkg/services/node_pool.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/db/migrations/202604290001_add_deleted_time_indexes.go
  • test/integration/deletion_metrics_test.go
  • pkg/services/cluster.go
  • pkg/metrics/deletion_test.go
  • charts/values.yaml
  • docs/metrics.md

Comment thread pkg/metrics/deletion.go Outdated
@rafabene rafabene force-pushed the HYPERFLEET-856-add-deletion-observability-metrics branch from 6601f28 to 8fcb3e8 Compare May 5, 2026 14:00
@rafabene
Contributor Author

rafabene commented May 5, 2026

/retest
