HYPERFLEET-856 - feat: add deletion observability metrics and alerts#115
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: … The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files: … Approvers can indicate their approval by writing …
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the … setting.
Walkthrough
This pull request introduces observability for resources stuck in "Pending Deletion" state across the hyperfleet-api system. It adds a new …
Sequence Diagram
sequenceDiagram
participant Client as User/Client
participant Service as Service Layer
participant Metrics as Metrics Subsystem
participant DB as Database
participant Collector as Pending Deletion Collector
participant Prometheus as Prometheus
Client->>Service: Delete Cluster/NodePool (SoftDelete)
Service->>DB: Set deleted_time, deleted_by, increment generation
DB-->>Service: Persisted
Service->>Metrics: RecordPendingDeletion(resource_type)
Metrics->>Metrics: Increment counter, observe duration
Prometheus->>Collector: Scrape metrics
Collector->>DB: SELECT COUNT(*) WHERE deleted_time < (now - threshold)
DB-->>Collector: Count of stuck resources per type
Collector->>Prometheus: Emit hyperfleet_api_resource_pending_deletion_stuck gauge
Prometheus->>Prometheus: Evaluate alert rules
alt Stuck > 0 for 5m
Prometheus-->>Client: Fire HyperFleetResourceDeletionStuckWarning
end
alt Stuck > 0 for 30m
Prometheus-->>Client: Fire HyperFleetResourceDeletionStuckCritical
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/config/metrics.go (1)
16-30: ⚠️ Potential issue | 🟠 Major
Reject non-positive deletion thresholds.
`DeletionStuckThreshold` is only marked `required`, so a negative duration still gets through. That would invert the cutoff used by the stuck-resource collector and make the new deletion alerts/metrics meaningless. Please add an explicit positive-value check in config validation before this value is consumed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/config/metrics.go` around lines 16 - 30, Add explicit validation to reject non-positive DeletionStuckThreshold values: in the MetricsConfig validation path (e.g., the method that validates or finalizes MetricsConfig before use) check MetricsConfig.DeletionStuckThreshold > 0 and return an error if it is <= 0; update any constructor/factory logic (NewMetricsConfig remains fine) to rely on this validation so the stuck-resource collector and alerting never receive a zero or negative duration. Ensure the error message references DeletionStuckThreshold so callers know which config field is invalid.
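A minimal sketch of the suggested check, assuming a MetricsConfig shape like the one discussed here (the actual struct in pkg/config/metrics.go may differ):

package config

import (
	"fmt"
	"time"
)

// MetricsConfig mirrors only the field under discussion; the real struct
// has more fields.
type MetricsConfig struct {
	DeletionStuckThreshold time.Duration
}

// Validate rejects zero or negative thresholds so the stuck-resource
// collector never receives an inverted cutoff.
func (c *MetricsConfig) Validate() error {
	if c.DeletionStuckThreshold <= 0 {
		return fmt.Errorf("config: DeletionStuckThreshold must be positive, got %v", c.DeletionStuckThreshold)
	}
	return nil
}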
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alerts HyperFleetResourceDeletionStuck and
HyperFleetResourceDeletionTimeout currently use the raw metric
hyperfleet_api_resource_terminating_stuck > 0 which will fire per pod; change
the expr to aggregate across pods (e.g., use max by
(resource_type)(hyperfleet_api_resource_terminating_stuck) > 0) to deduplicate,
and replace the hard-coded durations in the annotations' description ("30
minutes"/"1 hour") with the templated for values (use the same {{
.Values.prometheusRule.rules.deletionStuck.for }} and {{
.Values.prometheusRule.rules.deletionTimeout.for }} references used for the for
fields) so the text matches configured alert durations for the alerts named
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout.
In `@pkg/metrics/deletion.go`:
- Around line 134-149: Collect currently uses blocking DB calls via QueryRow;
change TerminatingCollector.Collect to use QueryRowContext with a bounded
context deadline so scrapes fail fast on slow DBs: create a context with timeout
(e.g., ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
and defer cancel()) and call c.db.QueryRowContext(ctx, q.query, threshold)
instead of c.db.QueryRow(...); ensure TerminatingCollector has a configurable
query timeout field (e.g., queryTimeout time.Duration) or use a sensible
constant, and preserve the existing error handling (log on Scan error) while
returning/continuing promptly when the context times out. (A Go sketch of this pattern follows this prompt block.)
In `@pkg/services/cluster.go`:
- Around line 132-136: The cluster cascade path bypasses the node pool
soft-delete hook so nodepool termination metrics are not recorded; update the
cluster deletion flow (around the call to s.nodePoolDao.SoftDeleteByOwner) to
record nodepool terminations by invoking metrics.RecordTerminating("nodepool")
for the affected node pools (or ensure SoftDeleteByOwner emits that metric), so
the existing node_pool.go metric counts are preserved when deletions are
triggered by cluster cascade.
---
Outside diff comments:
In `@pkg/config/metrics.go`:
- Around line 16-30: Add explicit validation to reject non-positive
DeletionStuckThreshold values: in the MetricsConfig validation path (e.g., the
method that validates or finalizes MetricsConfig before use) check
MetricsConfig.DeletionStuckThreshold > 0 and return an error if it is <= 0;
update any constructor/factory logic (NewMetricsConfig remains fine) to rely on
this validation so the stuck-resource collector and alerting never receive a
zero or negative duration. Ensure the error message references
DeletionStuckThreshold so callers know which config field is invalid.
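For the pkg/metrics/deletion.go comment above, a minimal sketch of a bounded-deadline scrape query; the struct and field names are assumptions, not the actual collector code:

package metrics

import (
	"context"
	"database/sql"
	"time"
)

// pendingDeletionCollector is a stand-in for the real collector type.
type pendingDeletionCollector struct {
	db           *sql.DB
	queryTimeout time.Duration // e.g. 5 * time.Second
}

// countStuck runs one COUNT query under its own deadline so a slow database
// makes the scrape fail fast instead of hanging Prometheus.
func (c *pendingDeletionCollector) countStuck(query string, threshold time.Time) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), c.queryTimeout)
	defer cancel()

	var count int
	err := c.db.QueryRowContext(ctx, query, threshold).Scan(&count)
	return count, err
}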
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: 2ad75a9a-1f66-4293-8c1f-68b67c9c0ec6
📒 Files selected for processing (13)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
Force-pushed 6487c64 to d429144 (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert uses aggregation "max by (resource_type)" over
hyperfleet_api_resource_terminating_stuck which merges across releases because
the metric lacks a namespace label; don't attempt to add namespace to the
aggregation without the label. Fix by one of the suggested approaches: (a) add a
relabeling rule in the ServiceMonitor (or ServiceMonitor template) to preserve
__meta_kubernetes_namespace as a metric label and change the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout to
aggregate with "max by (namespace, resource_type)"; or (b) emit "namespace" from
the application so hyperfleet_api_resource_terminating_stuck includes namespace
and then aggregate by namespace; or (c) require and validate that
serviceMonitor.namespace is set (documenting multi-release limitations) so
scraping is release-scoped. Reference symbols to change: the PrometheusRule
alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout,
the metric hyperfleet_api_resource_terminating_stuck, and the
ServiceMonitor/serviceMonitor.namespace/relabel_configs to implement option (a).
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: a59c3d7c-dcbd-4ed1-90ac-ec5a21b50b8b
📒 Files selected for processing (13)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/metrics/deletion_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
- pkg/config/flags.go
- pkg/services/cluster.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/services/node_pool.go
- test/integration/deletion_metrics_test.go
- docs/metrics.md
- pkg/config/metrics.go
- charts/values.yaml
- pkg/metrics/deletion.go
Force-pushed d429144 to 79ebbb0 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
charts/templates/prometheusrule.yaml (1)
16-37: ⚠️ Potential issue | 🟠 Major
Align the alert aggregation with the metric labels.
`hyperfleet_api_resource_terminating_stuck` only exports `resource_type`, `component`, and `version`, so `max by (namespace, resource_type)` collapses releases that share Prometheus and does not scope the alert by namespace. Either add a namespace label at scrape/metric time or remove `namespace` from the aggregation. Please verify the label mismatch with a read-only search:
#!/bin/bash
set -euo pipefail
rg -n 'labelResourceType|labelComponent|labelVersion|stuckDesc|max by \(namespace, resource_type\)' pkg/metrics/deletion.go charts/templates/prometheusrule.yaml
Expected result: the metric definition has no namespace label while the alert rule groups by namespace, confirming the cross-release aggregation risk.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/templates/prometheusrule.yaml` around lines 16 - 37, The alert aggregation uses "max by (namespace, resource_type)" but the metric hyperfleet_api_resource_terminating_stuck (defined in pkg/metrics/deletion.go via labelResourceType/labelComponent/labelVersion) does not include a namespace label, causing cross-release aggregation; fix the two alerts HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in charts/templates/prometheusrule.yaml by removing "namespace" from the aggregation (change "max by (namespace, resource_type)" to "max by (resource_type)") or alternatively ensure the metric is exported with a namespace label at scrape time—pick one approach and apply it consistently to both alert rules and their descriptions/runbook handling.
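A sketch of the deduplicated warning rule, reusing the values paths and escaping style quoted later in this thread; the severity label and exact layout are assumptions:

- alert: HyperFleetResourceDeletionStuck
  # max by (resource_type) deduplicates the per-pod gauge across replicas
  expr: max by (resource_type) (hyperfleet_api_resource_terminating_stuck) > 0
  for: {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}
  labels:
    severity: warning
  annotations:
    summary: "HyperFleet resources stuck in Pending Deletion state"
    description: >-
      {{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) stuck for more than
      {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}.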
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 321-339: The PromQL aggregations are malformed and need to be
rewritten to the proper "sum by (...) (...)" form and the first rate() converted
to per-minute: change sum(rate(hyperfleet_api_resource_terminating_total[5m]))
by (resource_type) to sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_total[5m])) * 60; change
sum(hyperfleet_api_resource_terminating_stuck) by (resource_type) to sum by
(resource_type) (hyperfleet_api_resource_terminating_stuck); for the average
terminating duration replace both numerator and denominator forms sum(... ) by
(resource_type) with sum by (resource_type) (...): sum by (resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m])) / sum by
(resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m])); and
update the histogram_quantile call to histogram_quantile(0.99, sum by (le,
resource_type)
(rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m]))). (The corrected queries are collected in a block after this prompt.)
In `@pkg/services/cluster.go`:
- Around line 143-150: The metric emission for nodepool termination is happening
after the follow-up read (FindSoftDeletedByOwner) which can fail and skip
emitting metrics; update the deletion flow to emit the metrics earlier or change
the SoftDeleteByOwner API to return the affected count/list. Concretely: either
modify SoftDeleteByOwner(...) to return (affectedCount int, affectedPools
[]NodePool, err error) and call metrics.RecordTerminating("nodepool") based on
that return value inside the deletion path, or move the loop that calls
metrics.RecordTerminating("nodepool") to immediately after the
SoftDeleteByOwner(...) call (before calling FindSoftDeletedByOwner), ensuring
metrics.RecordTerminating is invoked even if subsequent reads fail; adjust
references to SoftDeleteByOwner, FindSoftDeletedByOwner, and
metrics.RecordTerminating("nodepool") accordingly.
---
Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 16-37: The alert aggregation uses "max by (namespace,
resource_type)" but the metric hyperfleet_api_resource_terminating_stuck
(defined in pkg/metrics/deletion.go via
labelResourceType/labelComponent/labelVersion) does not include a namespace
label, causing cross-release aggregation; fix the two alerts
HyperFleetResourceDeletionStuck and HyperFleetResourceDeletionTimeout in
charts/templates/prometheusrule.yaml by removing "namespace" from the
aggregation (change "max by (namespace, resource_type)" to "max by
(resource_type)") or alternatively ensure the metric is exported with a
namespace label at scrape time—pick one approach and apply it consistently to
both alert rules and their descriptions/runbook handling.
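For reference, a clean copy of the corrected PromQL from the docs/metrics.md comment above (nothing new, just collected in one place):

# Resources entering Pending Deletion, per minute
sum by (resource_type) (rate(hyperfleet_api_resource_terminating_total[5m])) * 60

# Resources currently stuck, per type
sum by (resource_type) (hyperfleet_api_resource_terminating_stuck)

# Average terminating duration
  sum by (resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_sum[5m]))
/ sum by (resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_count[5m]))

# p99 terminating duration
histogram_quantile(0.99, sum by (le, resource_type) (rate(hyperfleet_api_resource_terminating_duration_seconds_bucket[5m])))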
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: 3252c284-6c38-4779-9ca9-a7fe3e9a0937
📒 Files selected for processing (13)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/config/flags.go
- pkg/services/node_pool.go
- pkg/config/loader.go
- pkg/config/metrics.go
Force-pushed 79ebbb0 to a29272c (Compare)
♻️ Duplicate comments (2)
pkg/services/cluster.go (1)
138-145: ⚠️ Potential issue | 🟠 Major
Nodepool terminating events can be permanently missed on read failure.
If `FindSoftDeletedByOwner` fails at Line 138 after `SoftDeleteByOwner` succeeded at Line 134, the function exits before Lines 143-145 emit nodepool metrics. Later retries short-circuit at Line 118, so those missed events are never recovered. Emit nodepool metric counts from the cascade write result (e.g., affected rows/returned IDs) instead of depending on a follow-up read.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 138 - 145, The current implementation records nodepool terminating metrics by reading soft-deleted nodepools via FindSoftDeletedByOwner (nodePools) after SoftDeleteByOwner, so if that read fails the metrics are never emitted; update the cascade delete flow to emit metrics from the cascade write result instead of the follow-up read: change SoftDeleteByOwner (or the DAO method it calls) to return the affected row count or the list of deleted IDs, have the caller capture that return value (instead of relying on nodePools from FindSoftDeletedByOwner), and call metrics.RecordTerminating the appropriate number of times (or once with the count) based on that returned count/IDs; retain or remove the FindSoftDeletedByOwner call as needed but ensure metrics emission no longer depends on it.

charts/templates/prometheusrule.yaml (1)
17-17: ⚠️ Potential issue | 🟠 Major
`namespace` scoping in the alert expression is currently ineffective.
`hyperfleet_api_resource_terminating_stuck` doesn't expose a `namespace` label, so `max by (namespace, resource_type)` still collapses across releases/namespaces for each `resource_type`. This can mix signals across deployments. Run this to verify label shape and whether ServiceMonitor relabeling injects namespace:
#!/bin/bash
set -euo pipefail
echo "== Collector metric descriptor/emit labels =="
rg -n -C3 'resource_terminating_stuck|NewDesc\(|MustNewConstMetric\(' pkg/metrics/deletion.go
echo
echo "== Alert expressions =="
rg -n -C2 'HyperFleetResourceDeletion(Stuck|Timeout)|max by|hyperfleet_api_resource_terminating_stuck' charts/templates/prometheusrule.yaml
echo
echo "== ServiceMonitor relabeling (namespace injection) =="
fd -i 'servicemonitor.yaml' charts/templates | while read -r f; do
  echo "--- ${f} ---"
  rg -n -C3 '__meta_kubernetes_namespace|metricRelabelings|relabelings|targetLabels|namespace' "$f" || true
done
Expected: no `namespace` label in collector metric unless explicitly injected by scrape relabeling; if absent, the current `by (namespace, resource_type)` doesn't isolate per namespace.
Also applies to: 28-28
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by a non-existent namespace label so signals are mixed; either inject namespace at scrape time or stop grouping by it: update the Prometheus relabeling in your ServiceMonitor (servicemonitor.yaml) to add target label "namespace" from __meta_kubernetes_namespace so the collector metric hyperfleet_api_resource_terminating_stuck has a namespace label, or change the rule in charts/templates/prometheusrule.yaml (expr referencing max by (namespace, resource_type)) to group only by existing labels (e.g., max by (resource_type)) and remove namespace from the by() clause to avoid incorrect scoping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by a non-existent namespace label so signals are
mixed; either inject namespace at scrape time or stop grouping by it: update the
Prometheus relabeling in your ServiceMonitor (servicemonitor.yaml) to add target
label "namespace" from __meta_kubernetes_namespace so the collector metric
hyperfleet_api_resource_terminating_stuck has a namespace label, or change the
rule in charts/templates/prometheusrule.yaml (expr referencing max by
(namespace, resource_type)) to group only by existing labels (e.g., max by
(resource_type)) and remove namespace from the by() clause to avoid incorrect
scoping.
In `@pkg/services/cluster.go`:
- Around line 138-145: The current implementation records nodepool terminating
metrics by reading soft-deleted nodepools via FindSoftDeletedByOwner (nodePools)
after SoftDeleteByOwner, so if that read fails the metrics are never emitted;
update the cascade delete flow to emit metrics from the cascade write result
instead of the follow-up read: change SoftDeleteByOwner (or the DAO method it
calls) to return the affected row count or the list of deleted IDs, have the
caller capture that return value (instead of relying on nodePools from
FindSoftDeletedByOwner), and call metrics.RecordTerminating the appropriate
number of times (or once with the count) based on that returned count/IDs;
retain or remove the FindSoftDeletedByOwner call as needed but ensure metrics
emission no longer depends on it.
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: 2a707980-1e96-49d8-88f5-25adfb2a8e1f
📒 Files selected for processing (13)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
- pkg/config/flags.go
- pkg/services/node_pool.go
- charts/values.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- cmd/hyperfleet-api/servecmd/cmd.go
- pkg/config/loader.go
- pkg/config/metrics.go
- docs/metrics.md
Force-pushed 964d474 to 3f46062 (Compare)
summary: "HyperFleet resources stuck in Pending Deletion state"
description: >-
  {{ "{{ $value }}" }} {{ "{{ $labels.resource_type }}" }} resource(s) have been in
  Pending Deletion state for more than {{ .Values.prometheusRule.rules.deletionStuck.for | default "5m" }}.
Is this description correct, is it not hitting 35m and not 5m?
Good catch! You're right — the pending_deletion_stuck metric already embeds a 30m threshold from the collector (config.metrics.deletion_stuck_threshold), so by the time the Warning alert fires after its 5m for delay, the resource has actually been stuck for ~35m, not 5m.
Fixed in 39d7989 — the description now shows both components: 30m (stuck threshold) + 5m (alert delay).
for: "5m"
runbookUrl: ""
deletionTimeout:
  for: "30m"
Curious what was your reasoning behind these defaults?
The deletionStuck.for: 5m is meant as an early warning — the collector already flags resources stuck beyond the 30m threshold (deletion_stuck_threshold), so 5m of sustained signal is just enough to filter out transient scrape noise before paging.
The deletionTimeout.for: 30m escalates to critical — at that point the resource has been stuck for ~60m total (30m threshold + 30m alert delay), which signals something is genuinely broken and needs immediate attention.
Happy to adjust if you think different values make more sense for the team's SLOs.
I am leaning towards stupid defaults TBH, to avoid un-wanted noise for our partner teams. Most will prob just roll the defaults until they align on their SLO's. I would lean on warning after an hour and timeout at 2 or 3 hours.
Makes total sense — bumped the defaults to conservative values: warning at 30m (fires ~1h total) and critical at 2h (fires ~2.5h total). Better to start quiet and let GCP/ROSA teams tighten when they define their SLOs, rather than generating noise out of the box.
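As a sketch, the agreed conservative defaults in values.yaml, assuming the prometheusRule.rules layout quoted above (runbookUrl placement is an assumption):

prometheusRule:
  rules:
    deletionStuck:
      for: "30m"   # warning fires ~1h after deletion starts (30m stuck threshold + 30m hold)
      runbookUrl: ""
    deletionTimeout:
      for: "2h"    # critical fires ~2.5h total (30m stuck threshold + 2h hold)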
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/services/cluster.go (1)
138-145: ⚠️ Potential issue | 🟠 Major
Nodepool metric emission still depends on a brittle follow-up read.
If `SoftDeleteByOwner` succeeds but `FindSoftDeletedByOwner` fails, nodepools are already soft-deleted but no `nodepool` pending-deletion metrics are emitted, and retries won't replay because the cluster is already marked deleted. Consider moving emission to data returned directly by `SoftDeleteByOwner` (e.g., affected count/list) so it doesn't depend on a second query.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 138 - 145, The current flow calls s.nodePoolDao.SoftDeleteByOwner(...) then separately calls s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the affected nodepool count or list (e.g., return (count int, ids []string, error)) and update callers in cluster.go to call that result directly and invoke metrics.RecordPendingDeletion("nodepool") using the returned count/list so metric emission does not depend on FindSoftDeletedByOwner; ensure error handling still returns on SoftDeleteByOwner failure and that the new return shape is propagated where needed.

charts/templates/prometheusrule.yaml (1)
17-17: ⚠️ Potential issue | 🟠 Major
`namespace` grouping is still ineffective for this metric.
On Line 17 and Line 29, `max by (namespace, resource_type)` won't scope alerts per namespace unless the metric actually has a `namespace` label. `hyperfleet_api_resource_pending_deletion_stuck` is emitted without `namespace` in pkg/metrics/deletion.go, so this can still merge signals across releases.
#!/bin/bash
# Verify metric labels for pending deletion stuck collector
rg -n -C3 'resource_pending_deletion_stuck|NewDesc\(|labelResourceType|labelComponent|labelVersion' pkg/metrics/deletion.go
# Verify whether ServiceMonitor injects namespace as a metric label
rg -n -C3 'ServiceMonitor|metricRelabelings|relabelings|__meta_kubernetes_namespace|namespace' charts/templates
Expected result:
- pkg/metrics/deletion.go shows only resource_type as variable label for this metric.
- No relabeling that preserves Kubernetes namespace into metric labels, confirming aggregation still cannot partition per namespace.
Also applies to: 29-29
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/templates/prometheusrule.yaml` at line 17, The alert groups by (namespace, resource_type) but the metric hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace label (see pkg/metrics/deletion.go), so change one of two things: either add a "namespace" label to the metric at its declaration/recording in pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector sets the namespace value when observing), or if you cannot emit namespace, remove namespace from the PrometheusRule expression in charts/templates/prometheusrule.yaml and use max by (resource_type) (and the identical change for the other occurrence at line 29) so alerts aren’t misleadingly grouped by a non-existent label.
🧹 Nitpick comments (1)
test/integration/deletion_metrics_test.go (1)
33-35: Assert delete response status to prevent false-positive test results.
Right now these checks only assert transport success (`err == nil`). If delete returns non-2xx, the second subtest can still pass spuriously.
Proposed test hardening:
- _, err = client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
+ delResp, err := client.DeleteClusterByIdWithResponse(ctx, cluster.ID, test.WithAuthToken(ctx))
  Expect(err).NotTo(HaveOccurred())
+ Expect(delResp.StatusCode()).To(BeNumerically(">=", 200))
+ Expect(delResp.StatusCode()).To(BeNumerically("<", 300))
(the same change applies to the second delete call)
Also applies to: 61-63
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/deletion_metrics_test.go` around lines 33 - 35, The test currently only checks that client.DeleteClusterByIdWithResponse returned no transport error (err == nil) which can hide non-2xx delete responses; update the two places calling DeleteClusterByIdWithResponse (the call using cluster.ID and the second similar call around lines 61-63) to capture the response object and assert its HTTP status is the expected success code (e.g., 200/204) using the response's status/status code accessor before or in addition to Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses; reference the DeleteClusterByIdWithResponse call, the returned response variable, and the Expect assertions when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/deletion.go`:
- Around line 135-136: Metrics scrape COUNT queries for clusters and node_pools
(the entries with "SELECT COUNT(*) FROM clusters WHERE deleted_time IS NOT NULL
AND deleted_time < $1" and "SELECT COUNT(*) FROM node_pools WHERE deleted_time
IS NOT NULL AND deleted_time < $1" in pkg/metrics/deletion.go) need supporting
indexes; add a migration that creates either simple indexes on deleted_time
(CREATE INDEX ON clusters(deleted_time), CREATE INDEX ON
node_pools(deleted_time)) or partial indexes filtering NOT NULL (CREATE INDEX
... WHERE deleted_time IS NOT NULL) and also drop the stale indexes left on the
old deleted_at column (drop idx_clusters_deleted_at and
idx_node_pools_deleted_at) so the metrics queries use the new indexes and avoid
full-table scans. (An illustrative SQL sketch follows this prompt block.)
---
Duplicate comments:
In `@charts/templates/prometheusrule.yaml`:
- Line 17: The alert groups by (namespace, resource_type) but the metric
hyperfleet_api_resource_pending_deletion_stuck is emitted without a namespace
label (see pkg/metrics/deletion.go), so change one of two things: either add a
"namespace" label to the metric at its declaration/recording in
pkg/metrics/deletion.go (update the NewDesc/label list and ensure the collector
sets the namespace value when observing), or if you cannot emit namespace,
remove namespace from the PrometheusRule expression in
charts/templates/prometheusrule.yaml and use max by (resource_type) (and the
identical change for the other occurrence at line 29) so alerts aren’t
misleadingly grouped by a non-existent label.
In `@pkg/services/cluster.go`:
- Around line 138-145: The current flow calls
s.nodePoolDao.SoftDeleteByOwner(...) then separately calls
s.nodePoolDao.FindSoftDeletedByOwner(...) to emit metrics, which is brittle if
the second read fails; modify SoftDeleteByOwner on nodePoolDao to return the
affected nodepool count or list (e.g., return (count int, ids []string, error))
and update callers in cluster.go to call that result directly and invoke
metrics.RecordPendingDeletion("nodepool") using the returned count/list so
metric emission does not depend on FindSoftDeletedByOwner; ensure error handling
still returns on SoftDeleteByOwner failure and that the new return shape is
propagated where needed.
---
Nitpick comments:
In `@test/integration/deletion_metrics_test.go`:
- Around line 33-35: The test currently only checks that
client.DeleteClusterByIdWithResponse returned no transport error (err == nil)
which can hide non-2xx delete responses; update the two places calling
DeleteClusterByIdWithResponse (the call using cluster.ID and the second similar
call around lines 61-63) to capture the response object and assert its HTTP
status is the expected success code (e.g., 200/204) using the response's
status/status code accessor before or in addition to
Expect(err).NotTo(HaveOccurred()) so the test fails on non-2xx responses;
reference the DeleteClusterByIdWithResponse call, the returned response
variable, and the Expect assertions when making this change.
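For the deleted_time index comment above, an illustrative sketch of the migration SQL; the new index names are assumptions (the real migration is pkg/db/migrations/202604290001_add_deleted_time_indexes.go):

-- Partial indexes matching the collector's COUNT(*) ... WHERE deleted_time IS NOT NULL queries.
CREATE INDEX IF NOT EXISTS idx_clusters_deleted_time
    ON clusters (deleted_time) WHERE deleted_time IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_node_pools_deleted_time
    ON node_pools (deleted_time) WHERE deleted_time IS NOT NULL;

-- Drop the stale indexes left on the old deleted_at column.
DROP INDEX IF EXISTS idx_clusters_deleted_at;
DROP INDEX IF EXISTS idx_node_pools_deleted_at;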
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: ab847be3-0cd7-404c-b0c0-96a4da969b4e
📒 Files selected for processing (13)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/config/flags.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- charts/values.yaml
- pkg/config/metrics.go
- docs/metrics.md
Force-pushed 39d7989 to 7b9a35f (Compare)
Force-pushed 7b9a35f to 429af79 (Compare)
♻️ Duplicate comments (1)
pkg/services/cluster.go (1)
134-160: ⚠️ Potential issue | 🟠 Major
Emit nodepool metrics from the write path, not the follow-up read.
This still derives the `nodepool` counter from `FindSoftDeletedByOwner()`, so a transient read failure drops the metric entirely, and any nodepools that were already soft-deleted before this cascade will be counted again. Please have `SoftDeleteByOwner()` return the affected rows/IDs, or move the metric emission into that write path.
♻️ Proposed fix
- if cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy); cascadeErr != nil {
-     return nil, handleSoftDeleteError("NodePool", cascadeErr)
- }
-
- nodePools, err := s.nodePoolDao.FindSoftDeletedByOwner(ctx, id)
- if err != nil {
-     return nil, errors.GeneralError("Failed to fetch cascade-deleted nodepools: %s", err)
- }
-
- for range nodePools {
+ nodePools, cascadeErr := s.nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy)
+ if cascadeErr != nil {
+     return nil, handleSoftDeleteError("NodePool", cascadeErr)
+ }
+
+ for range nodePools {
      metrics.RecordPendingDeletion("nodepool")
  }
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 134 - 160, The current metric emission uses FindSoftDeletedByOwner() after SoftDeleteByOwner(), which can lose metrics on transient reads and double-count pre-existing soft-deletes; change nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of affected nodepool IDs (or count), have SoftDeleteByOwner itself call metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the count), and update this caller (and any other callers) to stop iterating over results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read metric loop so metrics are emitted reliably from the write path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/services/cluster.go`:
- Around line 134-160: The current metric emission uses FindSoftDeletedByOwner()
after SoftDeleteByOwner(), which can lose metrics on transient reads and
double-count pre-existing soft-deletes; change
nodePoolDao.SoftDeleteByOwner(ctx, id, t, deletedBy) to return the list of
affected nodepool IDs (or count), have SoftDeleteByOwner itself call
metrics.RecordPendingDeletion("nodepool") for each affected ID (or record the
count), and update this caller (and any other callers) to stop iterating over
results from FindSoftDeletedByOwner(); keep UpdateClusterStatusFromAdapters and
batchUpdateNodePoolStatusesFromAdapters calls as-is but remove the post-read
metric loop so metrics are emitted reliably from the write path.
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: b561f099-048e-47c0-8991-8c567303bafc
📒 Files selected for processing (15)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/db/migrations/202604290001_add_deleted_time_indexes.go, pkg/db/migrations/migration_structs.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (3)
- pkg/config/flags.go
- pkg/services/node_pool.go
- docs/metrics.md
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/config/loader.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- charts/values.yaml
- test/integration/deletion_metrics_test.go
- pkg/config/metrics.go
- pkg/metrics/deletion_test.go
/retest

/retest
Force-pushed 429af79 to 066adcf (Compare)
Force-pushed 066adcf to a170052 (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@charts/templates/prometheusrule.yaml`:
- Around line 18-30: The alert default "for" durations are too long; update the
template to use the intended SLO windows by changing the default for
.Values.prometheusRule.rules.deletionStuck.for from "30m" to "5m" and
.Values.prometheusRule.rules.deletionTimeout.for from "2h" to "30m" so the
HyperFleetResourceDeletionStuck warning/critical alerts (see the alert blocks
referencing deletionStuck.for and deletionTimeout.for and the
HyperFleetResourceDeletionStuckCritical alert) fire at the intended 5m and 30m
defaults when chart values are not provided.
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: 7a48a973-4b64-4818-bed2-38e4966a17e0
📒 Files selected for processing (15)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/db/migrations/202604290001_add_deleted_time_indexes.go, pkg/db/migrations/migration_structs.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (6)
- pkg/db/migrations/202604290001_add_deleted_time_indexes.go
- pkg/db/migrations/migration_structs.go
- pkg/config/loader.go
- pkg/services/cluster.go
- pkg/metrics/deletion.go
- pkg/metrics/deletion_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/services/node_pool.go
- pkg/config/metrics.go
- test/integration/deletion_metrics_test.go
- charts/values.yaml
Force-pushed d7b4c7d to fc6a6af (Compare)
Actionable comments posted: 4
🧹 Nitpick comments (1)
pkg/dao/mocks/node_pool.go (1)
38-41: 💤 Low value
Save appends instead of updates.
The `Save` method appends the nodepool to the slice rather than replacing an existing entry by ID. This could cause duplicate entries in tests that call `Save` on an existing nodepool. However, if the mock is only used for basic scenarios where this behavior is acceptable, this is a minor concern.
♻️ Optional fix for update semantics
  func (d *nodePoolDaoMock) Save(ctx context.Context, nodePool *api.NodePool) error {
+     for i, np := range d.nodePools {
+         if np.ID == nodePool.ID {
+             d.nodePools[i] = nodePool
+             return nil
+         }
+     }
      d.nodePools = append(d.nodePools, nodePool)
      return nil
  }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/dao/mocks/node_pool.go` around lines 38 - 41, The Save method on nodePoolDaoMock currently appends instead of updating; modify nodePoolDaoMock.Save(ctx context.Context, nodePool *api.NodePool) to search d.nodePools for an existing entry with the same ID (compare nodePool.Id or nodePool.ID field used by api.NodePool), replace that slice element with the incoming nodePool if found, and only append to d.nodePools if no matching ID exists; ensure you update the element in-place on d.nodePools to preserve slice ordering and return nil as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/dao/mocks/cluster.go`:
- Around line 37-40: The mock Save method currently appends every cluster
causing duplicate IDs; change clusterDaoMock.Save to upsert by ID: search
d.clusters for an existing entry with the same cluster.ID and replace that
element if found, otherwise append the new cluster; update d.clusters in-place
so Get (and other methods) return the latest value for that ID.
In `@pkg/db/migrations/migration_structs.go`:
- Around line 35-38: MigrationList entries are out of ascending ID order; move
addReconciledIndex() so the list is sorted by migration ID. Edit the
MigrationList registration where addSoftDeleteSchema(),
addNodePoolOwnerDeletedIndex(), addReconciledIndex(), addDeletedTimeIndexes()
are listed and swap the positions of addNodePoolOwnerDeletedIndex and
addReconciledIndex so addReconciledIndex() appears before
addNodePoolOwnerDeletedIndex(), ensuring the migration registration order is
ascending and deterministic.
In `@pkg/services/node_pool.go`:
- Around line 102-123: The SoftDelete method's in-memory check
(nodePool.DeletedTime) can race; change the flow so the soft-delete is performed
atomically in the DB (or under row lock) to ensure only one request flips
deleted_time and increments metrics: replace the Get+check+Save pattern in
SoftDelete with a single conditional update (e.g., UPDATE ... SET
deleted_time=?, deleted_by=?, generation=generation+1 WHERE id=? AND
deleted_time IS NULL) or perform SELECT ... FOR UPDATE then update, and only
call metrics.RecordPendingDeletion("nodepool") when the conditional update
reports rows affected == 1 (or when the locked row shows it was previously NULL)
so concurrent requests do not both run side effects; keep references to
nodePoolDao.Save / nodePoolDao.Get and RecordPendingDeletion when implementing
the atomic update/locking behavior. (A sketch of the conditional update follows this prompt block.)
In `@pkg/telemetry/otel.go`:
- Around line 82-85: Fix the typo in the comment above the propagator selection
in otel.go: change "envivronment" to "environment" in the comment that
references OTEL_PROPAGATORS (the block immediately above the use of
autoprop.NewTextMapPropagator and otel.SetTextMapPropagator) so the comment
reads correctly.
---
Nitpick comments:
In `@pkg/dao/mocks/node_pool.go`:
- Around line 38-41: The Save method on nodePoolDaoMock currently appends
instead of updating; modify nodePoolDaoMock.Save(ctx context.Context, nodePool
*api.NodePool) to search d.nodePools for an existing entry with the same ID
(compare nodePool.Id or nodePool.ID field used by api.NodePool), replace that
slice element with the incoming nodePool if found, and only append to
d.nodePools if no matching ID exists; ensure you update the element in-place on
d.nodePools to preserve slice ordering and return nil as before.
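For the pkg/services/node_pool.go comment above, a minimal sketch of the conditional soft-delete using database/sql; the real service goes through the DAO, and the names here are illustrative:

package services

import (
	"context"
	"database/sql"
	"time"
)

// softDeleteNodePool flips deleted_time atomically; only the request whose
// UPDATE reports RowsAffected == 1 runs the metric side effect, so two
// concurrent deletes cannot both bump the counter.
func softDeleteNodePool(ctx context.Context, db *sql.DB, id, deletedBy string, record func(resourceType string)) error {
	res, err := db.ExecContext(ctx,
		`UPDATE node_pools
		    SET deleted_time = $1, deleted_by = $2, generation = generation + 1
		  WHERE id = $3 AND deleted_time IS NULL`,
		time.Now(), deletedBy, id)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 1 {
		record("nodepool") // e.g. metrics.RecordPendingDeletion
	}
	return nil
}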
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: 109f893a-3660-4bd9-81fd-53fb80bc9a2d
⛔ Files ignored due to path filters (3)
- .bingo/golangci-lint.sum is excluded by !**/*.sum
- .bingo/oapi-codegen.sum is excluded by !**/*.sum
- go.sum is excluded by !**/*.sum
📒 Files selected for processing (69)
.bingo/Variables.mk, .bingo/oapi-codegen.mod, .bingo/variables.env, CHANGELOG.md, Makefile, charts/templates/_helpers.tpl, charts/templates/deployment.yaml, charts/templates/postgresql.yaml, charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/config.md, docs/database.md, docs/metrics.md, go.mod, hack/verify-migrations.sh, openapi/openapi.yaml, pkg/api/cluster_types.go, pkg/api/metadata_types.go, pkg/api/node_pool_types.go, pkg/api/presenters/cluster.go, pkg/api/presenters/node_pool.go, pkg/api/status_types.go, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/dao/adapter_status.go, pkg/dao/cluster.go, pkg/dao/mocks/cluster.go, pkg/dao/mocks/node_pool.go, pkg/dao/node_pool.go, pkg/db/migrations/202511111044_add_clusters.go, pkg/db/migrations/202511111055_add_node_pools.go, pkg/db/migrations/202511111105_add_adapter_status.go, pkg/db/migrations/202601210001_add_conditions_gin_index.go, pkg/db/migrations/202604160001_soft_delete_schema.go, pkg/db/migrations/202604211859_add_reconciled_index.go, pkg/db/migrations/202604230001_add_nodepool_owner_deleted_index.go, pkg/db/migrations/202604290001_add_deleted_time_indexes.go, pkg/db/migrations/migration_structs.go, pkg/handlers/cluster.go, pkg/handlers/cluster_nodepools.go, pkg/handlers/cluster_nodepools_test.go, pkg/handlers/cluster_status.go, pkg/handlers/framework.go, pkg/handlers/node_pool.go, pkg/handlers/nodepool_status.go, pkg/handlers/rest.go, pkg/handlers/validation.go, pkg/handlers/validation_test.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/middleware/otel_test.go, pkg/services/aggregation.go, pkg/services/aggregation_test.go, pkg/services/cluster.go, pkg/services/cluster_test.go, pkg/services/generic.go, pkg/services/node_pool.go, pkg/services/node_pool_test.go, pkg/services/status_helpers.go, pkg/services/util.go, pkg/telemetry/otel.go, plugins/clusters/plugin.go, test/helper.go, test/integration/clusters_test.go, test/integration/deletion_metrics_test.go, test/integration/node_pools_test.go
💤 Files with no reviewable changes (7)
- pkg/handlers/rest.go
- pkg/db/migrations/202511111044_add_clusters.go
- pkg/middleware/otel_test.go
- pkg/db/migrations/202601210001_add_conditions_gin_index.go
- pkg/db/migrations/202511111105_add_adapter_status.go
- charts/templates/postgresql.yaml
- pkg/db/migrations/202511111055_add_node_pools.go
✅ Files skipped from review due to trivial changes (12)
- .bingo/oapi-codegen.mod
- pkg/db/migrations/202604211859_add_reconciled_index.go
- pkg/db/migrations/202604290001_add_deleted_time_indexes.go
- pkg/db/migrations/202604230001_add_nodepool_owner_deleted_index.go
- pkg/api/status_types.go
- CHANGELOG.md
- docs/config.md
- pkg/config/loader.go
- pkg/config/flags.go
- cmd/hyperfleet-api/servecmd/cmd.go
- docs/metrics.md
- pkg/metrics/deletion.go
🚧 Files skipped from review as they are similar to previous changes (4)
- cmd/hyperfleet-api/server/metrics_middleware.go
- test/integration/deletion_metrics_test.go
- pkg/config/metrics.go
- pkg/metrics/deletion_test.go
func (d *clusterDaoMock) Save(ctx context.Context, cluster *api.Cluster) error {
	d.clusters = append(d.clusters, cluster)
	return nil
}
Save should upsert by ID, not always append
This mock currently accumulates duplicate IDs, which can make Get return stale data and reduce test reliability for update paths.
Proposed fix
func (d *clusterDaoMock) Save(ctx context.Context, cluster *api.Cluster) error {
- d.clusters = append(d.clusters, cluster)
+ for i, existing := range d.clusters {
+ if existing.ID == cluster.ID {
+ d.clusters[i] = cluster
+ return nil
+ }
+ }
+ d.clusters = append(d.clusters, cluster)
return nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/dao/mocks/cluster.go` around lines 37 - 40, The mock Save method
currently appends every cluster causing duplicate IDs; change
clusterDaoMock.Save to upsert by ID: search d.clusters for an existing entry
with the same cluster.ID and replace that element if found, otherwise append the
new cluster; update d.clusters in-place so Get (and other methods) return the
latest value for that ID.
addSoftDeleteSchema(),
addNodePoolOwnerDeletedIndex(),
addReconciledIndex(),
addDeletedTimeIndexes(),
Migration registration order is not ascending
MigrationList should follow ascending migration IDs, but addNodePoolOwnerDeletedIndex() is placed before addReconciledIndex(). Please reorder these entries to match ID order and keep migration execution deterministic.
Proposed fix
addSoftDeleteSchema(),
- addNodePoolOwnerDeletedIndex(),
addReconciledIndex(),
+ addNodePoolOwnerDeletedIndex(),
addDeletedTimeIndexes(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/db/migrations/migration_structs.go` around lines 35 - 38, MigrationList
entries are out of ascending ID order; move addReconciledIndex() so the list is
sorted by migration ID. Edit the MigrationList registration where
addSoftDeleteSchema(), addNodePoolOwnerDeletedIndex(), addReconciledIndex(),
addDeletedTimeIndexes() are listed and swap the positions of
addNodePoolOwnerDeletedIndex and addReconciledIndex so addReconciledIndex()
appears before addNodePoolOwnerDeletedIndex(), ensuring the migration
registration order is ascending and deterministic.
// Select the propagator based on envivronment variable OTEL_PROPAGATORS
// If OTEL_PROPAGATORS is not provided, uses default "tracecontext,baggage"
textMapProp := autoprop.NewTextMapPropagator()
otel.SetTextMapPropagator(textMapProp)
Typo in comment: "envivronment" → "environment".
- // Select the propagator based on envivronment variable OTEL_PROPAGATORS
+ // Select the propagator based on environment variable OTEL_PROPAGATORS
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/telemetry/otel.go` around lines 82 - 85, Fix the typo in the comment
above the propagator selection in otel.go: change "envivronment" to
"environment" in the comment that references OTEL_PROPAGATORS (the block
immediately above the use of autoprop.NewTextMapPropagator and
otel.SetTextMapPropagator) so the comment reads correctly.
Force-pushed fc6a6af to b24d3d0 (Compare)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@openapi/openapi.yaml`:
- Around line 190-197: Remove the duplicate "type: Reconciled" condition entries
in the delete-response examples under status.conditions: locate the extra blocks
where "type: Reconciled" appears a second time (the delete examples) and delete
those duplicated condition objects so each example only contains one each of
Ready, Reconciled, and Available; ensure the remaining Reconciled condition
keeps its original fields (status, reason, message, observed_generation,
created_time, last_updated_time, last_transition_time).
In `@pkg/services/cluster.go`:
- Around line 127-134: The current metrics loop uses nodePools from
FindSoftDeletedByOwner which includes pre-existing soft-deletes and overcounts;
change SoftDeleteByOwner (on nodePoolDao) to return the list of affected
nodepool IDs or an integer count of rows actually updated, then in cluster.go
use that returned set/count (instead of calling FindSoftDeletedByOwner) to call
metrics.RecordPendingDeletion exactly once per newly soft-deleted row and to
drive any downstream status updates; update any callers of SoftDeleteByOwner to
handle the new return value and remove the separate FindSoftDeletedByOwner-based
metric emission.
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: 3d172262-8cde-48ea-97f2-b0ef0dc73556
📒 Files selected for processing (16)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, openapi/openapi.yaml, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/db/migrations/202604290001_add_deleted_time_indexes.go, pkg/db/migrations/migration_structs.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (5)
- pkg/config/loader.go
- docs/metrics.md
- test/integration/deletion_metrics_test.go
- pkg/metrics/deletion_test.go
- pkg/metrics/deletion.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/config/metrics.go
- pkg/config/flags.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/db/migrations/202604290001_add_deleted_time_indexes.go
- cmd/hyperfleet-api/servecmd/cmd.go
- pkg/db/migrations/migration_structs.go
Force-pushed b24d3d0 to b4ffe3e (Compare)
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/metrics/deletion.go`:
- Around line 144-152: The shared timeout context created before the loop causes
the second DB query to occasionally fail with a deadline error; move the context
creation into the loop so each stuckQueries iteration gets its own timeout:
inside the for _, q := range stuckQueries loop create ctx, cancel :=
context.WithTimeout(context.Background(), c.queryTimeout) and use that ctx with
c.db.QueryRowContext(ctx, q.query, threshold), then call cancel() at the end of
the iteration (do not defer cancel() inside the loop). Remove the outer
ctx/cancel that currently surrounds the loop.
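A minimal sketch of the per-iteration timeout described above; stuckQuery and the helper are illustrative, not the actual collector code:

package metrics

import (
	"context"
	"database/sql"
	"log"
	"time"
)

type stuckQuery struct {
	resourceType string
	query        string
}

// collectStuckCounts gives each COUNT query its own deadline; cancel is
// called at the end of every iteration instead of deferred, so contexts do
// not pile up until Collect returns.
func collectStuckCounts(db *sql.DB, queries []stuckQuery, threshold time.Time, timeout time.Duration) map[string]int {
	counts := make(map[string]int, len(queries))
	for _, q := range queries {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		var n int
		err := db.QueryRowContext(ctx, q.query, threshold).Scan(&n)
		cancel()
		if err != nil {
			log.Printf("pending-deletion collector: %s count failed: %v", q.resourceType, err)
			continue
		}
		counts[q.resourceType] = n
	}
	return counts
}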
ℹ️ Review info: Configuration used: Organization UI | Review profile: CHILL | Plan: Enterprise | Run ID: f1367179-e9e7-4331-b984-70af03fca193
📒 Files selected for processing (18)
charts/templates/prometheusrule.yaml, charts/values.yaml, cmd/hyperfleet-api/servecmd/cmd.go, cmd/hyperfleet-api/server/metrics_middleware.go, docs/metrics.md, pkg/config/flags.go, pkg/config/loader.go, pkg/config/metrics.go, pkg/dao/mocks/node_pool.go, pkg/dao/node_pool.go, pkg/db/migrations/202604290001_add_deleted_time_indexes.go, pkg/db/migrations/migration_structs.go, pkg/metrics/deletion.go, pkg/metrics/deletion_test.go, pkg/services/cluster.go, pkg/services/node_pool.go, pkg/services/node_pool_test.go, test/integration/deletion_metrics_test.go
✅ Files skipped from review due to trivial changes (7)
- pkg/db/migrations/migration_structs.go
- cmd/hyperfleet-api/servecmd/cmd.go
- pkg/config/loader.go
- pkg/config/flags.go
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/config/metrics.go
- pkg/services/node_pool.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/db/migrations/202604290001_add_deleted_time_indexes.go
- test/integration/deletion_metrics_test.go
- pkg/services/cluster.go
- pkg/metrics/deletion_test.go
- charts/values.yaml
- docs/metrics.md
…prevent overcounting
Force-pushed 6601f28 to 8fcb3e8 (Compare)
/retest
Summary
- `hyperfleet_api_resource_pending_deletion_total` (counter) — tracks resources entering Pending Deletion state
- `hyperfleet_api_resource_pending_deletion_duration_seconds` (histogram) — measures soft-delete to hard-delete duration (populated when hard-delete flow lands)
- `hyperfleet_api_resource_pending_deletion_stuck` (gauge via collector) — queries DB on each scrape for resources stuck beyond configurable threshold
- `HyperFleetResourceDeletionStuck` (warning) — resources stuck >30min for 5m
- `HyperFleetResourceDeletionTimeout` (critical) — resources stuck >30min for 30m
- `--metrics-deletion-stuck-threshold` config flag (default 30m)
- `SoftDelete` in cluster and nodepool services with `RecordPendingDeletion()`
- `PendingDeletionCollector` against real PostgreSQL via testcontainers
- `docs/metrics.md`

Test plan
- `make test` — 796 unit tests passing
- `make lint` — 0 issues
- `make test-helm` — all chart templates OK (including new PrometheusRule)
- `make test-integration` — 93 integration tests passing (2 new for PendingDeletionCollector)

HYPERFLEET-856
Summary by CodeRabbit
New Features
Documentation
Tests