HYPERFLEET-101: Implement Prometheus metrics and observability for Sentinel#14
Add comprehensive Prometheus metrics instrumentation to track Sentinel service health and performance. Implements all 7 metrics specified in the architecture (pending resources, events published, resources skipped, poll duration, API errors, broker errors, and config loads).

Key improvements:
- Thread-safe metrics registration with `sync.Once` to prevent panics
- Input validation on all metric update functions
- Efficient string concatenation using `strings.Builder`
- Complete test coverage (76%) with 15 unit tests
- Integration tests verifying end-to-end metrics collection
- Follows rh-trex pattern for consistency

All metrics use the `hyperfleet_sentinel` subsystem and include proper labels for queryability (`resource_type`, `resource_selector`, `reason`, `error_type`).

Closes: HYPERFLEET-101
Walkthrough

Added a Prometheus-based metrics subsystem and instrumentation, started a background HTTP server exposing /metrics and /health with graceful shutdown, and registered metrics via a new internal/metrics package.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as cmd/sentinel/main.go
    participant Registry as Prometheus\nRegistry
    participant Metrics as internal/metrics\nSentinelMetrics
    participant HTTP as HTTP Metrics Server
    participant Sentinel as internal/sentinel\nTrigger
    participant Broker as Message Broker
    Main->>Registry: create registry
    Main->>Metrics: NewSentinelMetrics(registry)
    Metrics->>Registry: register collectors
    Main->>HTTP: start server (/health, /metrics) [background]
    loop polling cycle
        Sentinel->>Sentinel: record startTime & labels
        Sentinel->>Sentinel: fetch resources
        alt fetch error
            Sentinel->>Metrics: UpdateAPIErrorsMetric(...)
        else fetched resources
            Sentinel->>Sentinel: evaluate decisions
            alt publish decision
                Sentinel->>Broker: Publish(event)
                alt publish success
                    Sentinel->>Metrics: UpdateEventsPublishedMetric(...)
                else publish error
                    Sentinel->>Metrics: UpdateBrokerErrorsMetric(...)
                end
            else skipped
                Sentinel->>Metrics: UpdateResourcesSkippedMetric(...)
            end
            Sentinel->>Metrics: UpdatePendingResourcesMetric(...)
            Sentinel->>Metrics: UpdatePollDurationMetric(...)
        end
    end
    Main->>HTTP: graceful shutdown on termination signal
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (11)
internal/metrics/metrics_test.go (3)
45-65: Test relies on global `sync.Once` state from previous tests.

This test doesn't call `ResetSentinelMetrics()` before the first `NewSentinelMetrics` call, so it depends on `registerOnce` state from `TestNewSentinelMetrics`. While the test passes, it may not be testing the intended scenario if test order changes.

More importantly, the `sync.Once` pattern means the second call's `registry` parameter is ignored entirely (metrics were already registered to the first registry). Consider documenting this behavior or restructuring if true isolation is needed.
84-90: Consider verifying the clamped value is actually 0.

The test verifies the metric is collected but doesn't verify the value was clamped to 0. Consider using `testutil.ToFloat64()` to assert the value:

```go
val := testutil.ToFloat64(pendingResourcesGauge.WithLabelValues("clusters", "shard:1"))
if val != 0 {
	t.Errorf("Expected value 0 after clamping, got %f", val)
}
```
177-186: Consider table-driven tests for consistency.

`TestUpdateResourcesSkippedMetric`, `TestUpdateAPIErrorsMetric`, and `TestUpdateBrokerErrorsMetric` use single test cases while similar tests use table-driven patterns. Consider expanding for consistency and edge case coverage (empty inputs).

internal/metrics/metrics.go (2)
100-109: Consider custom histogram buckets for poll duration.

`prometheus.DefBuckets` (0.005s to 10s) may not be optimal for poll duration. If polling typically takes milliseconds or seconds in a specific range, consider custom buckets for better granularity:

```go
Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30},
```
345-360: Consider validating/sanitizing selector values for Prometheus label safety.

Prometheus label values have restrictions (shouldn't be too long, certain characters may cause issues in queries). If `selector.Label` or `selector.Value` could contain commas or colons, parsing the combined string becomes ambiguous.

For now this is likely fine if selectors come from controlled input, but consider documenting the expected format.
internal/sentinel/sentinel.go (2)
26-26: The `metrics` field is stored but never used.

The `Sentinel.metrics` field is set in the constructor but all metric updates use package-level functions like `metrics.UpdateAPIErrorsMetric()` rather than methods on `s.metrics`. Either:

- Remove the `metrics` field since package functions are used, or
- Refactor Update* functions to be methods on `*SentinelMetrics` and call `s.metrics.UpdateAPIErrorsMetric(...)` for better encapsulation and testability

Option 2 example:

```diff
-metrics.UpdateAPIErrorsMetric(resourceType, resourceSelector, "fetch_error")
+if s.metrics != nil {
+	s.metrics.UpdateAPIErrorsMetric(resourceType, resourceSelector, "fetch_error")
+}
```
109-111: Clarify: `pending` tracks resources needing reconciliation, not a queue depth.

The variable `pending` counts resources where `ShouldPublish=true`, but it increments before attempting to publish, so some may fail to publish. Consider renaming to `toReconcile` or `shouldPublish` for clarity, or adjust the logic to only count successful publishes if that's the intent.

test/integration/integration_test.go (2)
25-44: Consider reusing a shared MockPublisher helper and clarify concurrency expectations

This MockPublisher is effectively the same as the one in `internal/sentinel/sentinel_test.go`. Consider moving it to a shared test helper (e.g. an internal test package) to avoid duplication and keep behavior consistent across tests. Also, if Sentinel ever publishes from multiple goroutines, the unsynchronized slice appends here will race; if concurrent publishing is expected, wrapping the state with a mutex would make this safer under `-race`.
79-173: Strengthen end-to-end test against timing flakiness

The end-to-end test is solid in structure, but it relies on fixed `time.Sleep` durations (300 ms) to assume at least two polling cycles have run. On slower CI or under load this can be borderline and lead to flaky tests. Consider replacing the fixed sleep with a loop that waits (bounded by context deadline) until either:

- `len(mockPublisher.publishedEvents) > 0`, and
- the metrics collectors report non-zero counts,

or the context times out. This keeps the test robust while still validating the same behavior.
cmd/sentinel/main.go (2)
78-84: Consider also recording config load failures and centralizing status values

Recording `"success"` here is useful, but failures on `config.LoadConfig` are not currently reflected in the metric, so you'll never see a "failure" counter. If distinguishing success vs. failure is important, consider:

- Incrementing a `"failure"` status in `newServeCommand` when `LoadConfig` returns an error, and
- Defining constants for the allowed status label values to avoid typos (`"success"`, `"failure"`, etc.).

This keeps the config-load metric semantically complete and easier to query.
105-149: Avoid silently degrading when metrics server fails to start

The metrics/health HTTP server is started in a goroutine, and if `ListenAndServe` fails (e.g. port already in use), the code only logs at info level and continues running Sentinel without a metrics endpoint. Given this PR's goal is observability, that silent degradation might be undesirable.

Consider:

- Logging at an error level and/or
- Propagating the failure by canceling the main context and returning an error from `runServe` (e.g. via a channel that communicates startup failure back to the main goroutine),

so that an operator can clearly see that metrics/health are not available instead of the service quietly running without them.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (7)
- `cmd/sentinel/main.go` (4 hunks)
- `go.mod` (3 hunks)
- `internal/metrics/metrics.go` (1 hunks)
- `internal/metrics/metrics_test.go` (1 hunks)
- `internal/sentinel/sentinel.go` (8 hunks)
- `internal/sentinel/sentinel_test.go` (6 hunks)
- `test/integration/integration_test.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
internal/sentinel/sentinel_test.go (3)
- internal/metrics/metrics.go (1): `NewSentinelMetrics` (173-200)
- internal/config/config.go (1): `SentinelConfig` (23-31)
- internal/sentinel/sentinel.go (1): `NewSentinel` (30-48)

internal/metrics/metrics_test.go (2)
- internal/metrics/metrics.go (15): `ResetSentinelMetrics` (225-233), `NewSentinelMetrics` (173-200), `UpdatePendingResourcesMetric` (236-250), `UpdateEventsPublishedMetric` (253-265), `UpdateResourcesSkippedMetric` (268-280), `UpdatePollDurationMetric` (283-297), `UpdateAPIErrorsMetric` (300-312), `UpdateBrokerErrorsMetric` (315-327), `UpdateConfigLoadsMetric` (330-340), `GetResourceSelectorLabel` (345-360), `MetricsLabels` (24-27), `MetricsLabelsWithReason` (30-34), `MetricsLabelsWithErrorType` (37-41), `MetricsLabelsConfigLoads` (44-46), `MetricsNames` (60-68)
- internal/config/config.go (2): `LabelSelectorList` (20-20), `LabelSelector` (14-17)

test/integration/integration_test.go (7)
- internal/sentinel/sentinel_test.go (1): `MockPublisher` (56-60)
- internal/client/client.go (2): `NewHyperFleetClient` (49-64), `ResourceType` (33-33)
- internal/engine/decision.go (1): `NewDecisionEngine` (31-36)
- pkg/logger/logger.go (1): `NewHyperFleetLogger` (35-43)
- internal/metrics/metrics.go (1): `NewSentinelMetrics` (173-200)
- internal/config/config.go (2): `SentinelConfig` (23-31), `LabelSelector` (14-17)
- internal/sentinel/sentinel.go (1): `NewSentinel` (30-48)

internal/metrics/metrics.go (1)
- internal/config/config.go (1): `LabelSelectorList` (20-20)

internal/sentinel/sentinel.go (2)
- internal/metrics/metrics.go (8): `SentinelMetrics` (142-163), `GetResourceSelectorLabel` (345-360), `UpdateAPIErrorsMetric` (300-312), `UpdateBrokerErrorsMetric` (315-327), `UpdateEventsPublishedMetric` (253-265), `UpdateResourcesSkippedMetric` (268-280), `UpdatePendingResourcesMetric` (236-250), `UpdatePollDurationMetric` (283-297)
- internal/client/client.go (1): `ResourceType` (33-33)

cmd/sentinel/main.go (2)
- internal/metrics/metrics.go (2): `NewSentinelMetrics` (173-200), `UpdateConfigLoadsMetric` (330-340)
- internal/sentinel/sentinel.go (1): `NewSentinel` (30-48)
🔇 Additional comments (13)
go.mod (2)
26-52: LGTM - Prometheus dependencies added correctly.

The indirect dependencies for the Prometheus client library (client_golang, client_model, common, procfs) and supporting packages are appropriately added to support the new metrics subsystem.

3-3: Go version 1.25.0 is a released, stable version as of November 2025; no action needed.

The original concern was based on incorrect timing. Go 1.25 has been stable since November 2025, making Go 1.25.0 a valid and appropriate choice. The specification is intentional.
internal/metrics/metrics_test.go (3)
11-43: LGTM - Initialization test is comprehensive.

Good coverage of all metric fields with appropriate assertions.

298-343: LGTM - Good coverage of selector label generation.

Table-driven tests cover all meaningful cases: empty, single, and multiple selectors with proper expected output verification.

379-420: LGTM - Constants validation tests provide good safety net.

Verifying label array lengths, metric name counts, and checking for duplicates helps catch accidental regressions when adding new metrics.
internal/metrics/metrics.go (2)

165-200: `sync.Once` prevents re-registration but may cause test isolation issues.

The `registerOnce` variable is package-level and never reset. This means:

- The first `NewSentinelMetrics(registry1)` registers metrics to `registry1`
- Subsequent calls with different registries (e.g., `NewSentinelMetrics(registry2)`) return the same instance registered to `registry1`

For production this is fine (singleton pattern), but in tests, each test creates its own registry that won't receive the metrics if another test ran first. Consider:

```diff
+// resetOnce is used only for testing to allow re-registration
+func resetOnce() {
+	registerOnce = sync.Once{}
+	metricsInstance = nil
+}
```

Or document that tests should share the metrics instance rather than creating new registries per test.

235-250: LGTM - Good input validation with defensive programming.

The validation logic correctly handles edge cases: empty strings cause early return, and negative counts are clamped to 0. This prevents invalid metric labels from being created.
internal/sentinel/sentinel_test.go (3)

16-18: LGTM - Necessary imports added for metrics integration.

95-105: Test metrics setup pattern is consistent but has a caveat.

Each test creates its own registry and calls `NewSentinelMetrics(registry)`, but due to the `sync.Once` in the implementation, only the first test's registry will have metrics registered. Subsequent tests' registries remain empty.

This works because tests don't verify metric values from their local registry. However, if you add assertions like `testutil.CollectAndCount(registry, ...)`, they would fail for tests running after the first one.

Consider adding a `metrics.ResetSentinelMetrics()` call and resetting `registerOnce` in a `TestMain` or helper, or using a shared registry across tests.
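The reset idea can be illustrated in isolation. The names below (`register`, `registrations`, `resetForTest`) are stand-ins for the package's actual identifiers, not code from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	registerOnce  sync.Once
	registrations int
)

// register performs one-time setup, mirroring how a sync.Once-guarded
// constructor protects collector registration.
func register() {
	registerOnce.Do(func() { registrations++ })
}

// resetForTest swaps in a fresh sync.Once so the next register call runs
// again. Call it from TestMain or a test helper, never from production code.
func resetForTest() {
	registerOnce = sync.Once{}
}

func main() {
	register()
	register()
	fmt.Println(registrations) // 1: the second call was a no-op
	resetForTest()
	register()
	fmt.Println(registrations) // 2: registration ran again after the reset
}
```

Because a `sync.Once` cannot be reused once fired, replacing the value wholesale is the standard way to re-arm it between tests.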
269-328: LGTM - MixedResources test covers multiple decision paths.

Good comprehensive test that exercises multiple code paths (publish for max age exceeded, publish for generation change, skip for within age). The metrics integration follows the established pattern.
internal/sentinel/sentinel.go (3)

26-47: LGTM - Clean dependency injection for metrics.

The metrics dependency is properly injected through the constructor, following a clean separation of concerns.

78-93: LGTM - Metrics recording at appropriate points.

API errors are correctly recorded before returning the error, and the resourceType/resourceSelector labels are computed upfront for consistent labeling across all metrics in this trigger cycle.

155-160: LGTM - Poll duration captures full cycle time.

Recording duration after all processing completes gives an accurate measure of the full trigger cycle, including fetch, evaluation, and publish operations.
…unctions

These functions could cause a 'duplicate metrics collector registration' panic if called after `NewSentinelMetrics()`. Since `NewSentinelMetrics()` is the intended public API and already has `sync.Once` protection, removing these unused functions prevents misuse.

- Removed `RegisterSentinelMetrics()` (unused, unsafe)
- Removed `UnregisterSentinelMetrics()` (unused, unsafe)
- Kept `ResetSentinelMetrics()` for testing purposes
The TestIntegration_LabelSelectorFiltering test now:

- Implements server-side filtering in the mock server (mimics real API behavior)
- Returns 2 clusters: one with shard:1 (should match), one with shard:2 (should not)
- Captures and validates the Start() error via an error channel
- Verifies ALL published events are for cluster-shard-1 only
- Confirms label selector filtering works correctly

Changes:

- Added createMockClusterWithLabels() helper function
- Mock server now respects the 'search' query parameter
- Test validates error handling and event filtering
- Supports multiple polling cycles (non-flaky timing)
Actionable comments posted: 0
🧹 Nitpick comments (4)
internal/metrics/metrics.go (4)
141-200: ClarifyNewSentinelMetricssingleton behavior w.r.t. theregistryparameter
NewSentinelMetricscorrectly usessync.Onceto prevent duplicate registration panics and to ensure a single*SentinelMetricsinstance. However, after the first call, theregistryparameter is effectively ignored and all subsequent calls just return the samemetricsInstance.If future tests or code paths try to use different registries per test (e.g.,
prometheus.NewRegistry()in multiple tests), they may be surprised that only the first registry is honored.Consider updating the doc comment to make this explicit, e.g., “the first non‑nil registry wins; subsequent calls ignore the registry argument and return the same global metrics instance; use
ResetSentinelMetricsfor test cleanup.” That should prevent misuse without changing behavior.
202-211: `ResetSentinelMetrics` only clears values; registration and singleton remain global

`ResetSentinelMetrics` is well-scoped for tests: it clears all collectors without touching `metricsInstance` or `registerOnce`, so registration happens once per process and tests can reuse the same metrics.

Just be aware (and possibly document in tests or here) that this does not allow re-registering metrics with a different `prometheus.Registerer`; the original registry and collectors remain in place. If you ever need per-test registries, you'd need a different pattern (e.g., non-global collectors). For the current usage this is fine.
213-318: Input-validation behavior in `Update*Metric` helpers is conservative but silent

All the `Update*Metric` helpers defensively no-op when given empty label components (and clamp negative counts to zero / discard negative durations). This is safe and avoids polluting metrics with invalid label values.

Two considerations to keep in mind:

- Silent returns can hide upstream bugs (e.g., a missing `resourceType`), since nothing fails and no metric is emitted.
- If this is intentional (per the architecture spec), it's worth ensuring callers have tests that assert they pass non-empty labels so that such bugs are caught earlier.

No change required, but if you start debugging "missing metrics" later, this behavior will be a prime suspect.
320-338: Delimiter choice in `GetResourceSelectorLabel` assumes labels/values lack `,` and `:`

`GetResourceSelectorLabel` is clear and efficient, and returning `"all"` for an empty selector list is a nice convention.

The `label:value,label2:value2` encoding assumes that neither `selector.Label` nor `selector.Value` can contain `,` or `:`; otherwise, you could get ambiguous serialized strings and unexpected metric label cardinality. If your `config.LabelSelector` is constrained to something like Kubernetes label syntax, this is likely safe; if not, consider documenting this assumption or adding lightweight validation/normalization where selectors are constructed.
📜 Review details
📒 Files selected for processing (1)
- `internal/metrics/metrics.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/metrics/metrics.go (1)
- internal/config/config.go (1): `LabelSelectorList` (20-20)
🔇 Additional comments (1)
internal/metrics/metrics.go (1)
48-139: Metric and label definitions look consistent and aligned with the spec

Metric names, subsystem, and label keys/arrays are coherent and match the described architecture (pending/resources/events/errors/config loads, with appropriate label sets). Exporting `MetricsLabels*` and `MetricsNames` is useful for tests and consistency checks; no issues here from a correctness or maintainability standpoint.
Enhanced all public function documentation following Go godoc standards:

- Added detailed function descriptions explaining purpose and behavior
- Documented all parameters with types and examples
- Added thread-safety notes for concurrent usage
- Included validation behavior documentation
- Added common usage examples in docstrings
- Documented edge cases (e.g., negative values, empty strings)

This improves docstring coverage from 57.89% to meet the 80% threshold. Functions improved:

- UpdatePendingResourcesMetric
- UpdateEventsPublishedMetric
- UpdateResourcesSkippedMetric
- UpdatePollDurationMetric
- UpdateAPIErrorsMetric
- UpdateBrokerErrorsMetric
- UpdateConfigLoadsMetric
- ResetSentinelMetrics
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/metrics/metrics.go (1)
412-430: Consider potential ambiguity if labels/values contain delimiters.

The function concatenates label:value pairs with commas. If a label or value contains a colon or comma (though unlikely in typical Kubernetes label selectors), the resulting string would be ambiguous.

If future use cases require supporting special characters, consider URL-encoding or escaping the label/value strings:

```diff
 func GetResourceSelectorLabel(selectors config.LabelSelectorList) string {
 	if len(selectors) == 0 {
 		return "all"
 	}
 	var builder strings.Builder
 	for i, selector := range selectors {
 		if i > 0 {
 			builder.WriteString(",")
 		}
-		builder.WriteString(selector.Label)
+		builder.WriteString(url.QueryEscape(selector.Label))
 		builder.WriteString(":")
-		builder.WriteString(selector.Value)
+		builder.WriteString(url.QueryEscape(selector.Value))
 	}
 	return builder.String()
 }
```

Note: This requires importing `net/url`. Only apply if special characters are expected in production.
📜 Review details
📒 Files selected for processing (1)
- `internal/metrics/metrics.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/metrics/metrics.go (1)
- internal/config/config.go (1): `LabelSelectorList` (20-20)
🔇 Additional comments (7)
internal/metrics/metrics.go (7)
1-68: LGTM! Well-structured constants and label definitions.

The package structure, imports, and metric/label definitions follow Prometheus conventions. The exported label arrays enable consistent labeling across the codebase.

141-163: LGTM! Clean struct design with comprehensive documentation.

The struct provides a clean API surface for accessing all metrics, with clear field-level documentation.

165-200: LGTM! Thread-safe initialization with sync.Once pattern.

The singleton pattern correctly prevents duplicate registration panics. Note that subsequent calls with different registry parameters will return the same instance registered to the first registry, as the sync.Once guards the entire registration block.

202-218: LGTM! Appropriate reset function for testing.

The reset function correctly clears all metric values without unregistering collectors. The comprehensive documentation with warnings ensures proper usage in test code only.

220-303: LGTM! Robust input validation and clear documentation.

The update functions correctly validate inputs, handle edge cases (negative counts), and provide comprehensive documentation. The silent failure on invalid inputs is appropriate for metrics to avoid log spam.

305-410: LGTM! Consistent implementation across all update functions.

All update functions follow the same validation and documentation pattern, ensuring consistent behavior and maintainability.
101-109: Verify that histogram buckets align with actual polling interval configurations.

The `prometheus.DefBuckets` (0.005–10 seconds) are designed for typical request/response latencies. If this service's polling cycles operate at longer intervals (e.g., 30s, 60s, or periodic batch operations), most observations will fall into the +Inf bucket, severely limiting histogram granularity and usefulness for performance analysis.

Required verification:

- Identify the actual polling interval(s) configured in this service
- If any poll cycle regularly exceeds 10 seconds, consider custom buckets using `prometheus.ExponentialBuckets()` or explicit bucket boundaries that cover the expected range

Example: If polls run every 30 seconds, use buckets like `[]float64{1, 5, 10, 15, 20, 30, 60}` instead of DefBuckets.
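As a rough illustration of how `prometheus.ExponentialBuckets(start, factor, count)` behaves, the same bucket math can be reproduced with plain Go. The helper below is a sketch of the semantics, not the library function itself:

```go
package main

import "fmt"

// exponentialBuckets mirrors the semantics of prometheus.ExponentialBuckets:
// count buckets, the first at start, each subsequent one multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// A range like 0.5s–32s covers a 30s polling interval far better than
	// DefBuckets (0.005s–10s), where most observations would hit +Inf.
	fmt.Println(exponentialBuckets(0.5, 2, 7)) // prints [0.5 1 2 4 8 16 32]
}
```

Whether exponential or explicit boundaries fit better depends on the configured poll interval; the key is that most observations should land inside the bucket range, not in +Inf.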
This metric tracked configuration load operations, but since config is only loaded once at application startup, it provides limited observability value. This aligns with rh-trex, which does not include a similar metric.

Changes:

- Removed config_loads_total metric from metrics.go
- Removed UpdateConfigLoadsMetric function
- Removed metric recording from main.go
- Updated tests to expect 6 metrics instead of 7
Actionable comments posted: 0
🧹 Nitpick comments (8)
internal/metrics/metrics_test.go (3)
174-183: Consider adding edge case tests for consistency.

This test only covers the happy path. Unlike `TestUpdateEventsPublishedMetric` (which tests empty parameters), this function lacks validation tests for empty `resourceType`, `resourceSelector`, or `reason` parameters. The same applies to `TestUpdateAPIErrorsMetric` and `TestUpdateBrokerErrorsMetric`.

```diff
 func TestUpdateResourcesSkippedMetric(t *testing.T) {
 	ResetSentinelMetrics()
-	UpdateResourcesSkippedMetric("clusters", "all", "within_max_age")
-
-	count := testutil.CollectAndCount(resourcesSkippedCounter)
-	if count == 0 {
-		t.Error("Expected ResourcesSkipped metric to be collected")
+	tests := []struct {
+		name             string
+		resourceType     string
+		resourceSelector string
+		reason           string
+		expectUpdate     bool
+	}{
+		{"valid update", "clusters", "all", "within_max_age", true},
+		{"empty resource type", "", "all", "within_max_age", false},
+		{"empty resource selector", "clusters", "", "within_max_age", false},
+		{"empty reason", "clusters", "all", "", false},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			ResetSentinelMetrics()
+			UpdateResourcesSkippedMetric(tt.resourceType, tt.resourceSelector, tt.reason)
+
+			if tt.expectUpdate {
+				count := testutil.CollectAndCount(resourcesSkippedCounter)
+				if count == 0 {
+					t.Error("Expected ResourcesSkipped metric to be collected")
+				}
+			}
+		})
 	}
 }
```
185-231: Missing test case for empty `resourceSelector`.

The test validates empty `resourceType` but not empty `resourceSelector`. The implementation at lines 298-300 in `metrics.go` validates both.

```diff
 		{
 			name:             "empty resource type",
 			resourceType:     "",
 			resourceSelector: "all",
 			durationSeconds:  1.0,
 			expectUpdate:     false,
 		},
+		{
+			name:             "empty resource selector",
+			resourceType:     "clusters",
+			resourceSelector: "",
+			durationSeconds:  1.0,
+			expectUpdate:     false,
+		},
 	}
```
322-334: Consider verifying metric values after reset.

The current test only checks that `ResetSentinelMetrics()` doesn't panic. Consider verifying that `CollectAndCount` returns 0 after reset to confirm metrics were actually cleared.

```diff
 func TestResetSentinelMetrics(t *testing.T) {
+	registry := prometheus.NewRegistry()
+	NewSentinelMetrics(registry)
+
 	// Add some metrics
 	UpdatePendingResourcesMetric("clusters", "all", 10)
 	UpdateEventsPublishedMetric("clusters", "all", "test")

 	// Reset
 	ResetSentinelMetrics()

-	// Verify reset - collectors should have 0 metrics collected
-	// Note: We can't directly verify the values are 0, but we can verify
-	// the function doesn't panic
-	ResetSentinelMetrics() // Should not panic on second call
+	// Verify reset - collectors should have 0 metrics after reset
+	if count := testutil.CollectAndCount(pendingResourcesGauge); count != 0 {
+		t.Errorf("Expected 0 metrics after reset, got %d", count)
+	}
+
+	// Should not panic on second call
+	ResetSentinelMetrics()
 }
```

cmd/sentinel/main.go (4)
108-111: Handle the error return from `w.Write`.

The return value from `w.Write` is ignored. While unlikely to fail for a small response, linters and best practices suggest handling it.

```diff
 // Health endpoint
 mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
 	w.WriteHeader(http.StatusOK)
-	w.Write([]byte("OK"))
+	_, _ = w.Write([]byte("OK"))
 })
```
125-130: Use appropriate log level for errors.

Server errors should be logged at error/warning level, not info level, for proper log filtering and alerting.

```diff
 // Start HTTP server in background
 go func() {
 	log.Info("Starting metrics server on :8080")
 	if err := metricsServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
-		log.Infof("Metrics server error: %v", err)
+		log.Errorf("Metrics server error: %v", err)
 	}
 }()
```
140-145: Use appropriate log level for shutdown errors.

Same issue as above: shutdown errors should use an error-level log.

```diff
 // Shutdown metrics server
 shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
 defer shutdownCancel()
 if err := metricsServer.Shutdown(shutdownCtx); err != nil {
-	log.Infof("Metrics server shutdown error: %v", err)
+	log.Errorf("Metrics server shutdown error: %v", err)
 }
```
116-122: Consider making the metrics server port configurable.

The port `:8080` is hardcoded. For flexibility in different deployment environments, consider adding this to the configuration or as a CLI flag.
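If a CLI flag is the chosen route, a minimal sketch could look like the following. The flag name `--metrics-addr` and the helper `metricsServerFromArgs` are assumptions for illustration, not the PR's code:

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
)

// metricsServerFromArgs builds the metrics HTTP server with a configurable
// listen address, defaulting to the currently hardcoded :8080.
func metricsServerFromArgs(args []string) (*http.Server, error) {
	fs := flag.NewFlagSet("sentinel", flag.ContinueOnError)
	addr := fs.String("metrics-addr", ":8080", "listen address for /metrics and /health")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return &http.Server{Addr: *addr, Handler: http.NewServeMux()}, nil
}

func main() {
	srv, err := metricsServerFromArgs([]string{"--metrics-addr", ":9090"})
	if err != nil {
		panic(err)
	}
	fmt.Println(srv.Addr) // prints :9090
}
```

Using a dedicated `flag.FlagSet` keeps the option testable and avoids mutating global flag state; wiring the same value through the existing config struct would work equally well.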
150-178: Document that the registry parameter is only used on first invocation.

The `sync.Once` pattern means only the first call's `registry` parameter is used. Subsequent calls return the cached instance regardless of the passed registry. While the function comment mentions "only registered once," it would be clearer to explicitly document that the registry parameter is ignored after the first call.

```diff
 // NewSentinelMetrics creates and registers all Sentinel metrics.
 // It uses sync.Once to ensure metrics are only registered once, preventing
 // duplicate registration panics when called multiple times (e.g., in tests).
+// Note: The registry parameter is only used on the first invocation; subsequent
+// calls return the cached instance and ignore the registry parameter.
 func NewSentinelMetrics(registry prometheus.Registerer) *SentinelMetrics {
```
📜 Review details
📒 Files selected for processing (3)
- `cmd/sentinel/main.go` (4 hunks)
- `internal/metrics/metrics.go` (1 hunks)
- `internal/metrics/metrics_test.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
internal/metrics/metrics_test.go (2)
- internal/metrics/metrics.go (5): `ResetSentinelMetrics` (188-195), `MetricsLabels` (24-27), `MetricsLabelsWithReason` (30-34), `MetricsLabelsWithErrorType` (37-41), `MetricsNames` (54-61)
- internal/config/config.go (2): `LabelSelectorList` (20-20), `LabelSelector` (14-17)

internal/metrics/metrics.go (1)
- internal/config/config.go (1): `LabelSelectorList` (20-20)

cmd/sentinel/main.go (2)
- internal/metrics/metrics.go (1): `NewSentinelMetrics` (153-178)
- internal/sentinel/sentinel.go (1): `NewSentinel` (30-48)
🔇 Additional comments (11)
internal/metrics/metrics_test.go (5)
11-40: LGTM!

Good initialization test covering all metric fields with nil checks and using a fresh registry.

42-62: LGTM!

Important test verifying the `sync.Once` pattern prevents duplicate registration panics and returns the same singleton instance.

64-117: LGTM!

Comprehensive table-driven test covering valid updates, negative value clamping, and empty parameter validation.

255-300: LGTM!

Good table-driven tests covering empty, single, and multiple selector scenarios with clear expected outputs.

336-373: LGTM!

Good validation of exported constants including duplicate detection for metric names.
cmd/sentinel/main.go (1)
78-102: LGTM!

Good integration of the metrics subsystem: creates a dedicated registry, initializes metrics via `NewSentinelMetrics`, and passes the metrics instance to `NewSentinel`. The separation from the default Prometheus registry is a good practice for testability.

internal/metrics/metrics.go (5)
63-122: LGTM!

Well-structured metric definitions with appropriate types (Gauge for pending resources, Counters for events/errors, Histogram for duration), consistent labeling, and clear help text.

211-225: LGTM!

Good defensive programming with input validation and negative value clamping. The silent ignore for invalid inputs is acceptable for metrics to avoid noisy logs during edge cases.

366-384: LGTM!

Efficient implementation using `strings.Builder` for concatenation with proper handling of the empty selector case returning "all".

180-195: LGTM!

Good test utility with clear documentation warning against production use. Thread-safe reset of all metric collectors.
14-21: Remove unused constant `metricsStatusLabel`.

The constant `metricsStatusLabel` defined at line 20 is not referenced anywhere in the codebase.

```diff
 // Names of the labels added to metrics
 const (
 	metricsResourceTypeLabel     = "resource_type"
 	metricsResourceSelectorLabel = "resource_selector"
 	metricsReasonLabel           = "reason"
 	metricsErrorTypeLabel        = "error_type"
-	metricsStatusLabel           = "status"
 )
```
Enhanced defensive validation in metrics update functions to log warnings when empty parameters are detected, making bugs visible during development while still preventing cardinality issues in production.

Changes:
- Added glog import to metrics package
- All Update* functions now log warnings for invalid parameters
- Updated docstrings to reflect new behavior
- Maintains backward compatibility (still ignores invalid params)

This follows defense-in-depth principles: prevent broken metrics while making potential bugs observable through logging.
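The validate-warn-clamp shape described in this commit message can be sketched as follows, using the standard log package in place of glog and a plain float in place of a Prometheus gauge; all names here are illustrative, not the actual package code:

```go
package main

import (
	"fmt"
	"log"
)

// pendingResources stands in for the real pendingResourcesGauge.
var pendingResources float64

// UpdatePendingResourcesMetric mirrors the described behavior: warn and
// ignore on empty labels, clamp negative counts to zero, otherwise set.
func UpdatePendingResourcesMetric(resourceType, selector string, count float64) {
	if resourceType == "" || selector == "" {
		log.Printf("metrics: empty label (resource_type=%q, resource_selector=%q); update ignored", resourceType, selector)
		return
	}
	if count < 0 {
		count = 0 // clamp negative values
	}
	pendingResources = count
}

func main() {
	UpdatePendingResourcesMetric("Cluster", "all", -3) // clamped to 0
	fmt.Println(pendingResources)
	UpdatePendingResourcesMetric("", "all", 9) // warned and ignored
	UpdatePendingResourcesMetric("Cluster", "all", 7)
	fmt.Println(pendingResources)
}
```

The design choice is that a metric update is never allowed to break the caller: invalid input degrades to a warning rather than an error or a high-cardinality label.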
Removed unused SentinelMetrics field from Sentinel struct and updated the NewSentinel signature. Metrics are now purely global variables accessed via helper functions, following the rh-trex pattern exactly.

Architecture:
- Metric variables are package-level globals (pendingResourcesGauge, etc.)
- NewSentinelMetrics registers metrics once via sync.Once
- Helper functions (Update*Metric) access globals directly
- No need to pass a metrics instance through the struct

Changes:
- Removed metrics field from Sentinel struct
- Removed metrics parameter from NewSentinel()
- Updated all tests to not pass metrics to NewSentinel()
- main.go now only calls NewSentinelMetrics() for registration

This resolves the architectural inconsistency where we stored but never used the metrics field. Now the pattern is clean and consistent.
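The architecture described above can be sketched roughly as follows; a map stands in for the Prometheus registry so the example stays dependency-free, and every name is illustrative rather than the actual package code:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	registerOnce sync.Once
	mu           sync.Mutex
	// registry is a stand-in for *prometheus.Registry.
	registry = map[string]float64{}
)

// NewSentinelMetrics registers collectors exactly once, no matter how
// many callers invoke it.
func NewSentinelMetrics() {
	registerOnce.Do(func() {
		mu.Lock()
		defer mu.Unlock()
		registry["hyperfleet_sentinel_events_published_total"] = 0
	})
}

// UpdateEventsPublishedMetric accesses the package-level state directly,
// so no metrics instance needs to flow through the Sentinel struct.
func UpdateEventsPublishedMetric(delta float64) {
	mu.Lock()
	defer mu.Unlock()
	registry["hyperfleet_sentinel_events_published_total"] += delta
}

// EventsPublished reads back the current value (test helper).
func EventsPublished() float64 {
	mu.Lock()
	defer mu.Unlock()
	return registry["hyperfleet_sentinel_events_published_total"]
}

func main() {
	NewSentinelMetrics()
	NewSentinelMetrics() // no-op: sync.Once already ran
	UpdateEventsPublishedMetric(1)
	fmt.Println(EventsPublished())
}
```

The trade-off, as later review comments note, is that package-level state makes per-test registry isolation impossible.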
Actionable comments posted: 2
🧹 Nitpick comments (6)
internal/metrics/metrics.go (2)
21-22: Unused constant `metricsStatusLabel`.

The constant `metricsStatusLabel` is defined but never used anywhere in the file. Consider removing it to avoid dead code.

```diff
 	metricsReasonLabel    = "reason"
 	metricsErrorTypeLabel = "error_type"
-	metricsStatusLabel    = "status"
 )
```
65-123: Package-level metric collectors prevent true test isolation.

The metric collectors (`pendingResourcesGauge`, `eventsPublishedCounter`, etc.) are defined at package level. Combined with `sync.Once`, this means:

- Once registered to any registry, they cannot be re-registered to a different one
- `ResetSentinelMetrics()` clears values, but metrics remain bound to the original registry

For unit tests running in the same process (e.g., with `-count=N` or parallel tests), subsequent test files calling `NewSentinelMetrics(newRegistry)` will get metrics registered to the first test's registry, not the new one. This is a known limitation of Prometheus client Go patterns. If full isolation is needed, consider using `promauto` with a custom registry per test, or accept this limitation with proper test ordering.
cmd/sentinel/main.go (3)
109-112: Ignored error from `w.Write`.

The return value from `w.Write` is discarded. While unlikely to fail for a simple "OK" response, ignoring it can mask issues in edge cases (e.g., client disconnect).

```diff
 mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
 	w.WriteHeader(http.StatusOK)
-	w.Write([]byte("OK"))
+	_, _ = w.Write([]byte("OK"))
 })
```

Or handle the error if logging is desired.
117-123: Consider making the metrics server port configurable.

The metrics server address `:8080` is hardcoded. For environments where port 8080 conflicts with other services, consider adding a configuration option (e.g., `metrics_port` or `metrics_address` in `SentinelConfig`).
126-131: Metrics server startup error should use `Errorf` instead of `Infof`.

When `ListenAndServe` fails (e.g., port already in use), logging at Info level may not get sufficient attention. Use `Errorf` or `Warningf` for error conditions.

```diff
 go func() {
 	log.Info("Starting metrics server on :8080")
 	if err := metricsServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
-		log.Infof("Metrics server error: %v", err)
+		log.Errorf("Metrics server error: %v", err)
 	}
 }()
```
25-43: MockPublisher duplicates the one in `internal/sentinel/sentinel_test.go`.

This `MockPublisher` struct and its methods are identical to the one defined in `internal/sentinel/sentinel_test.go` (lines 55-59). Consider extracting a shared test mock to a common test utilities package (e.g., `internal/testutil/mocks.go`) to avoid duplication.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- cmd/sentinel/main.go (3 hunks)
- internal/metrics/metrics.go (1 hunks)
- internal/sentinel/sentinel.go (5 hunks)
- internal/sentinel/sentinel_test.go (6 hunks)
- test/integration/integration_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
cmd/sentinel/main.go (1)
internal/metrics/metrics.go (1)
- NewSentinelMetrics (154-179)
test/integration/integration_test.go (7)
internal/sentinel/sentinel_test.go (1)
- MockPublisher (56-60)

internal/client/client.go (2)
- NewHyperFleetClient (49-64)
- ResourceType (33-33)

internal/engine/decision.go (1)
- NewDecisionEngine (31-36)

pkg/logger/logger.go (1)
- NewHyperFleetLogger (35-43)

internal/metrics/metrics.go (1)
- NewSentinelMetrics (154-179)

internal/config/config.go (2)
- SentinelConfig (23-31)
- LabelSelector (14-17)

internal/sentinel/sentinel.go (1)
- NewSentinel (29-45)
internal/metrics/metrics.go (1)
internal/config/config.go (1)
- LabelSelectorList (20-20)
🔇 Additional comments (8)
internal/metrics/metrics.go (4)
380-397: LGTM! Efficient string building for label serialization.

The `GetResourceSelectorLabel` function correctly uses `strings.Builder` for efficient concatenation and handles the empty case with a sensible "all" default.
213-228: LGTM! Defensive validation with logging.

Input validation with `glog.Warningf` for empty parameters follows the agreed approach from past review discussions. Clamping negative counts to 0 is appropriate for a gauge metric.
304-320: LGTM! Proper validation for histogram observations.

Correctly validates against empty labels and negative durations before observing. The negative duration check is especially important since histograms accept any float64, but negative durations are semantically invalid.
154-179:
`sync.Once` registers to the first registry only; subsequent calls with different registries are silently ignored.

The `registerOnce.Do` pattern ensures metrics are registered only once. If `NewSentinelMetrics` is called with different registries (e.g., test registry first, then default registry, or vice versa), the metrics remain bound to whichever registry was used in the first call. This is standard `sync.Once` behavior in Go.

This pattern is acceptable if the codebase ensures only one registry is used per execution context (production uses one, tests use one). However, without visibility into actual call sites, consider: (1) documenting this constraint in code comments, (2) validating at runtime that registry consistency is maintained, or (3) adding a debug log when multiple registries are detected.
cmd/sentinel/main.go (2)
78-81: LGTM! Metrics initialization before component setup.

Correctly initializes the Prometheus registry and metrics before setting up other components, ensuring metrics are available throughout the application lifecycle.
140-146: LGTM! Graceful shutdown with timeout.

The metrics server shutdown uses a separate context with a 5-second timeout, ensuring it doesn't block indefinitely during shutdown.
test/integration/integration_test.go (2)
90-184: Test structure is solid but relies on timing.

The test uses `time.Sleep(300 * time.Millisecond)` to wait for polling cycles. While acceptable for integration tests, this can be flaky on slow CI systems. Consider increasing the timeout or using a more deterministic synchronization mechanism if flakiness is observed.
186-301: LGTM! Label selector filtering test is well-structured.

The test correctly validates that only `cluster-shard-1` events are published when a `shard:1` selector is configured. The server-side filtering mock accurately simulates the expected API behavior, and all published events are verified to match the expected cluster ID.
Removed the unused `metrics` field from the Sentinel struct and simplified the metrics initialization. Now follows the rh-trex pattern exactly:
- Metrics are global package-level variables
- Helper functions access globals directly
- NewSentinelMetrics() only registers via sync.Once
- No metrics instance passed through constructors

Changes:
- Remove metrics field from Sentinel struct
- Remove metrics parameter from NewSentinel()
- Update all unit tests to not pass the metrics parameter
- Remove metrics verification from integration tests

This addresses rh-amarin's feedback about the architectural inconsistency where s.metrics was stored but never used.
Actionable comments posted: 0
🧹 Nitpick comments (4)
test/integration/integration_test.go (4)
24-88: Helper scaffolding looks solid; consider de-duplicating `MockPublisher`.

`MockPublisher` and the cluster helper builders are straightforward and do what the tests need; no functional issues here. The only minor concern is that `MockPublisher` now exists both here and in `internal/sentinel/sentinel_test.go` with the same shape, which can drift over time.

If you expect these mocks to evolve together (e.g., capturing more metadata), consider moving them into a shared test helper package (e.g., `test/testutil`) and reusing them from both places to avoid duplication.
89-172: Strengthen end-to-end test to assert the reconciled cluster ID.

`TestIntegration_EndToEnd` currently only checks that at least one event was published and validates its type and source. If the decision logic accidentally reconciles only `cluster-2` (and not `cluster-1`), this test would still pass.

You can make this a stronger regression guard by also asserting the event's `resourceId` (similar to the label-selector test):

```diff
 // Verify first event properties
 event := mockPublisher.publishedEvents[0]
 if event.Type() != "com.redhat.hyperfleet.Cluster.reconcile" {
 	t.Errorf("Expected event type 'com.redhat.hyperfleet.Cluster.reconcile', got '%s'", event.Type())
 }
 if event.Source() != "hyperfleet-sentinel" {
 	t.Errorf("Expected source 'hyperfleet-sentinel', got '%s'", event.Source())
 }
+
+// Optionally, also verify the reconciled cluster ID
+var eventData map[string]interface{}
+if err := event.DataAs(&eventData); err != nil {
+	t.Fatalf("Failed to parse event data: %v", err)
+}
+if id, ok := eventData["resourceId"].(string); !ok || id != "cluster-1" {
+	t.Errorf("Expected event for resourceId 'cluster-1', got %#v", eventData["resourceId"])
+}
```

This will catch regressions where the wrong cluster is selected for reconciliation.
181-224: Search-parameter stub is quite strict; may break if the client adds more filters.

In `TestIntegration_LabelSelectorFiltering`, the mock server only treats requests with `search == "shard=1"` as matching:

```go
searchParam := r.URL.Query().Get("search")
// ...
if searchParam == "shard=1" && labels["shard"] == "1" {
	filteredClusters = append(filteredClusters, cluster)
}
```

If the client later sends richer expressions (e.g., `phase=Ready,shard=1` or different ordering), label-selector behavior could still be correct, but this test would start failing because the stub no longer recognizes the query.

Consider making the match logic a bit more tolerant, for example:

- Checking whether `searchParam` contains `shard=1` as a token, or
- Parsing `searchParam` into key/value pairs and matching on `shard=1` regardless of extra filters or order.

That would keep this test stable while still validating that label selectors are wired through into the query string.
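A sketch of the second option, parsing the search expression into key/value tokens; the helper name and the comma-separated expression format are assumptions about the client's query syntax:

```go
package main

import (
	"fmt"
	"strings"
)

// searchMatches reports whether a search expression such as
// "phase=Ready,shard=1" contains the given key=value token,
// regardless of ordering or extra filters.
func searchMatches(searchParam, key, value string) bool {
	for _, tok := range strings.Split(searchParam, ",") {
		kv := strings.SplitN(strings.TrimSpace(tok), "=", 2)
		if len(kv) == 2 && kv[0] == key && kv[1] == value {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(searchMatches("phase=Ready,shard=1", "shard", "1"))
	fmt.Println(searchMatches("shard=2", "shard", "1"))
}
```

Matching whole tokens rather than substrings avoids false positives such as `reshard=1` matching a `shard=1` check.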
126-129: Integration tests initialize metrics but don't assert metric behavior.

Both tests create a custom Prometheus registry and call `metrics.NewSentinelMetrics(registry)`, but they never inspect any metric values. Given the PR's focus on observability, you might want some integration-level assertions that the expected metrics are actually emitted (e.g., events published, poll durations, pending resources).

Also, from `internal/metrics/metrics.go` we know `NewSentinelMetrics` uses a package-level `sync.Once` and a global `metricsInstance`, so only the first call's `registry` is honored across the process. If you do add metric assertions later, you may want to:

- Use a single shared registry for all integration tests, or
- Provide a way to reset/re-inject metrics for tests (e.g., a test-only helper).

For now, consider adding at least a minimal assertion using the test registry (e.g., via `prometheus/testutil.CollectAndCount`) to validate that instrumentation is wired correctly.

Also applies to: 238-241
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- test/integration/integration_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test/integration/integration_test.go (7)
internal/sentinel/sentinel_test.go (1)
- MockPublisher (56-60)

internal/client/client.go (2)
- NewHyperFleetClient (49-64)
- ResourceType (33-33)

internal/engine/decision.go (1)
- NewDecisionEngine (31-36)

pkg/logger/logger.go (1)
- NewHyperFleetLogger (35-43)

internal/metrics/metrics.go (1)
- NewSentinelMetrics (154-179)

internal/config/config.go (2)
- SentinelConfig (23-31)
- LabelSelector (14-17)

internal/sentinel/sentinel.go (1)
- NewSentinel (29-45)
🔇 Additional comments (1)
test/integration/integration_test.go (1)
175-268:
</function_calls>Based on my thorough verification of the codebase, I can now provide my assessment:
The review comment references a non-existent test (`TestIntegration_EndToEnd`) as justification for the suggested pattern.

Key findings:

- `TestIntegration_EndToEnd` does not exist: the only integration test in `test/integration/integration_test.go` is `TestIntegration_LabelSelectorFiltering`.
- No other tests use the suggested pattern: no other test files in the repository use a `select` statement with `time.After` for timeout protection on error channels.
- `Sentinel.Start()` properly handles context: the implementation includes proper context cancellation handling via `select` on `ctx.Done()`, so the test would only hang in case of a serious regression.
- The suggestion is technically valid but weakly motivated: while adding a timeout guard is reasonable defensive programming, the justification is based on a false premise.

Likely an incorrect or invalid review comment.
Following the rh-trex pattern, add testcontainers infrastructure for more realistic integration testing with real message brokers.

Changes:
- Add testcontainers-go and rabbitmq module dependencies
- Create RabbitMQTestContainer helper for managing test brokers
- Add new end-to-end integration test with real RabbitMQ
- Keep existing mock-based tests for fast unit testing (hybrid approach)
- Configure TESTCONTAINERS_RYUK_DISABLED in Makefile for Podman compatibility
- Add comprehensive documentation in docs/testcontainers.md
- Update README with testing instructions

Benefits:
- Tests against real broker behavior (catches integration bugs)
- Validates CloudEvents serialization and broker compatibility
- Follows the rh-trex testcontainers pattern
- Maintains fast mock-based tests for CI/PR validation

Testing: All tests passing with `make test-all`. Integration test spins up RabbitMQ in ~7 seconds.
Since TESTCONTAINERS_RYUK_DISABLED is now configured in the Makefile, users don't need to manually set the environment variable. Updated instructions to be cleaner and more user-friendly.
Integration tests should test with real components, not mocks. Mocks remain in unit tests for fast isolated testing.

Changes:
- Convert TestIntegration_EndToEnd to use the RabbitMQ testcontainer
- Convert TestIntegration_LabelSelectorFiltering to use the testcontainer
- Remove MockPublisher from integration tests
- Remove duplicate TestIntegration_EndToEnd_RealBroker test

Benefits:
- True integration testing with a real message broker
- Clearer separation: unit tests = mocks, integration = real components
- Detects real-world integration issues
- Follows testing best practices

All tests passing with `make test-all`.
Integration tests should have centralized setup/teardown via TestMain, following the same pattern as rh-trex.

Changes:
- Add TestMain function with glog setup
- Import flag, os, runtime, glog
- Log Go version and exit code

This addresses the critical issue from PR review where integration tests lacked centralized test infrastructure.
Problem: Integration tests were logging errors when stopping containers: 'Error terminating testcontainer: context canceled'

Root cause: The Close() method was using the test context, which was already canceled when the defer executed. This caused container.Terminate(ctx) to fail with a 'context canceled' error.

Solution: Use context.Background() with a 30s timeout for cleanup instead of the test context. Cleanup operations need their own context that won't be canceled when the test finishes.

Changes:
- Create cleanupCtx from context.Background() with a 30s timeout
- Use cleanupCtx for container.Terminate() instead of the test ctx
- Add a comment explaining why a background context is needed

Result: Containers now stop cleanly without errors: '✅ Container stopped' instead of 'Error terminating'. All tests still pass.
Implement a singleton pattern for integration tests using a shared RabbitMQ testcontainer across all tests, following the established rh-trex pattern.

Changes:
- Add IntegrationHelper singleton with sync.Once in testcontainer.go
- Update TestMain to initialize and tear down the shared helper
- Update integration tests to use helper.RabbitMQ instead of per-test containers
- Remove legacy `// +build integration` syntax (use modern `//go:build`)
- Fix lint: remove redundant nil check for map len()

Benefits:
- Faster test execution (1 container vs N containers)
- Follows the rh-trex established pattern
- Cleaner test code
- Modern Go 1.17+ build tag syntax
Actionable comments posted: 0
🧹 Nitpick comments (2)
docs/testcontainers.md (1)
27-29: Add language specifier to fenced code block.

The static analysis tool flagged that fenced code blocks should have a language specified. For error message output, use `text` or `plaintext`.

````diff
-```
+```text
 Failed to start RabbitMQ testcontainer: create container: container create: Error response from daemon: container create: unable to find network with name or ID bridge: network not found: creating reaper failed
 ```
````
test/integration/integration_test.go (1)
110-113: Consider checking `json.Encode` errors for test reliability.

While ignoring the error is common in test HTTP handlers, checking it ensures the test fails with a clear message if encoding fails unexpectedly.

```diff
 response := createMockClusterList(clusters)
 w.Header().Set("Content-Type", "application/json")
-json.NewEncoder(w).Encode(response)
+if err := json.NewEncoder(w).Encode(response); err != nil {
+	http.Error(w, err.Error(), http.StatusInternalServerError)
+}
 return
 }

 // Subsequent calls: Empty list
 response := createMockClusterList([]map[string]interface{}{})
 w.Header().Set("Content-Type", "application/json")
-json.NewEncoder(w).Encode(response)
+if err := json.NewEncoder(w).Encode(response); err != nil {
+	http.Error(w, err.Error(), http.StatusInternalServerError)
+}
```

Also applies to: 117-119
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by `!**/*.sum`
📒 Files selected for processing (6)
- Makefile (1 hunks)
- README.md (1 hunks)
- docs/testcontainers.md (1 hunks)
- go.mod (4 hunks)
- test/integration/integration_test.go (1 hunks)
- test/integration/testcontainer.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test/integration/integration_test.go (7)
test/integration/testcontainer.go (1)
- NewHelper (122-141)

internal/client/client.go (2)
- NewHyperFleetClient (49-64)
- ResourceType (33-33)

internal/engine/decision.go (1)
- NewDecisionEngine (31-36)

pkg/logger/logger.go (1)
- NewHyperFleetLogger (35-43)

internal/metrics/metrics.go (1)
- NewSentinelMetrics (154-179)

internal/config/config.go (2)
- SentinelConfig (23-31)
- LabelSelector (14-17)

internal/sentinel/sentinel.go (1)
- NewSentinel (29-45)
🪛 markdownlint-cli2 (0.18.1)
docs/testcontainers.md
27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (10)
Makefile (1)
89-89: LGTM! The `TESTCONTAINERS_RYUK_DISABLED=true` environment variable setting aligns with the testcontainers documentation and ensures Podman compatibility. The inline environment variable approach is idiomatic for Makefiles.

Also applies to: 96-96
README.md (1)
53-77: LGTM! The testing documentation clearly explains the hybrid testing approach and provides straightforward commands for running different test suites. The reference to `docs/testcontainers.md` for troubleshooting is helpful.
1-152: Well-structured documentation.

The documentation provides comprehensive guidance for testcontainers usage, including common Podman issues, solutions, and troubleshooting steps. The CI/CD considerations section is particularly helpful.
go.mod (1)
80-83: LGTM! The added indirect dependencies for Prometheus metrics (`prometheus/client_golang v1.23.2`) and testcontainers (`testcontainers-go v0.40.0`) align with the PR objectives. These are pulled in as transitive dependencies from direct dependencies.

Also applies to: 95-96
test/integration/integration_test.go (3)
26-42: LGTM! `TestMain` properly handles:

- Flag parsing for glog
- Shared helper initialization (creates RabbitMQ container once)
- Centralized cleanup via `helper.Teardown()`
- Appropriate exit code propagation
44-87: LGTM! The helper functions provide clean abstractions for creating mock API responses. The delegation pattern in `createMockCluster` avoids code duplication, and the mock data structure appears to match the expected API format.
166-267: Label selector test properly validates filtering behavior.

This test addresses the previous review feedback by:

- Creating clusters with different shard labels
- Implementing server-side filtering based on the search parameter
- Capturing and validating `Start` errors via a channel pattern

The test now provides meaningful validation of label selector functionality.
test/integration/testcontainer.go (3)
26-71: LGTM - Robust container initialization.

The implementation correctly:
- Uses appropriate wait strategy with 60s timeout
- Terminates container on subsequent failures (lines 46, 61)
- Wraps errors with context for debugging
78-109: LGTM - Proper cleanup implementation.

Using a background context with timeout for cleanup is the correct pattern, since the test context may already be canceled. The error collection approach ensures all cleanup steps are attempted even if one fails.
121-151: LGTM - Appropriate singleton pattern for shared test infrastructure.

Using `sync.Once` ensures the RabbitMQ container is started exactly once across all integration tests, which is efficient. The `glog.Fatalf` on initialization failure is acceptable here, since tests cannot proceed without the shared infrastructure.
Specify 'text' language for error message code block for better syntax highlighting and rendering in markdown viewers.
Addresses rh-amarin's comment: verify that Sentinel actually ran at least 2 polling cycles instead of just assuming the sleep duration was sufficient. This ensures the integration test is actually testing the polling behavior, not just waiting for time to pass.
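The idea in this commit can be sketched with an httptest server that counts requests atomically; `runPolls` stands in for the Sentinel polling loop and is illustrative only:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// runPolls issues n requests against a counting mock API server and
// returns how many requests the server actually observed.
func runPolls(n int) int64 {
	var pollCount int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&pollCount, 1)
		_, _ = io.WriteString(w, `{"items":[]}`)
	}))
	defer srv.Close()

	for i := 0; i < n; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
	return atomic.LoadInt64(&pollCount)
}

func main() {
	// Assert that at least two polling cycles ran, instead of
	// trusting that a fixed sleep was long enough.
	fmt.Println(runPolls(3) >= 2)
}
```

Asserting on the observed request count ties the test to actual polling behavior rather than wall-clock time.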
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rafabene, rh-amarin

The full list of commands accepted by this bot can be found here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 0
♻️ Duplicate comments (1)
test/integration/integration_test.go (1)
171-271: Label‑selector integration test still doesn't assert selector behavior and can hang on `errChan`

Despite the improved mock server, this test currently only checks that `Start` returns `nil` or `context.Canceled`; it does not assert that label selectors actually influence behavior (e.g., via a `search` query param or by distinguishing matching vs non‑matching clusters). This is effectively the same gap as a prior review comment on this test. In addition, `startErr := <-errChan` has no timeout, so the test can hang indefinitely if `Start` never returns.

Consider asserting that the selector is exercised, for example by tracking whether at least one request used `search=shard=1`:

```diff
 func TestIntegration_LabelSelectorFiltering(t *testing.T) {
 @@
-	// Create mock server that returns 2 clusters: one with shard:1, one with shard:2
-	// Server implements server-side filtering based on search parameter
+	// Create mock server that returns 2 clusters: one with shard:1, one with shard:2.
+	// Track whether sentinel actually uses the expected search parameter.
+	var selectorUsed uint32
 @@
-	// Server-side filtering: check for search parameter
-	searchParam := r.URL.Query().Get("search")
+	// Server-side filtering: check for search parameter
+	searchParam := r.URL.Query().Get("search")
+	if searchParam == "shard=1" {
+		atomic.StoreUint32(&selectorUsed, 1)
+	}
 @@
-	// Check Start error
-	startErr := <-errChan
-	if startErr != nil && startErr != context.Canceled {
-		t.Errorf("Expected Start to return nil or context.Canceled, got: %v", startErr)
-	}
+	// Check Start error with timeout to avoid hangs
+	select {
+	case startErr := <-errChan:
+		if startErr != nil && startErr != context.Canceled {
+			t.Errorf("Expected Start to return nil or context.Canceled, got: %v", startErr)
+		}
+	case <-time.After(2 * time.Second):
+		t.Fatal("Sentinel did not stop within timeout")
+	}
+
+	// Ensure label selector was actually used at least once
+	if atomic.LoadUint32(&selectorUsed) == 0 {
+		t.Errorf("Expected at least one request with search=shard=1 to exercise label selector filtering")
+	}
```

This turns the test into a real guard against regressions in label‑selector wiring and avoids a potential deadlock on `errChan`.
🧹 Nitpick comments (1)
test/integration/integration_test.go (1)
128-131: Clarify or simplify metrics registry usage in integration tests

Both tests construct a fresh Prometheus registry and pass it to `metrics.NewSentinelMetrics`, but the `registry` is never used to gather or expose metrics in these tests. If you only need to ensure metrics can be initialized, consider either:

- Calling `metrics.NewSentinelMetrics(nil)` and dropping the custom registry, or
- Actually using `registry` (e.g., via `testutil.CollectAndCount` or a custom `/metrics` handler) to assert on metrics.

For example, if you don't need a custom registry here:

```diff
-	// Create metrics with a test registry
-	registry := prometheus.NewRegistry()
-	metrics.NewSentinelMetrics(registry)
+	// Initialize metrics (uses default registerer)
+	metrics.NewSentinelMetrics(nil)
```

(and similarly in the label‑selector test). This makes the intent clearer and avoids a registry that's effectively unused in this file.
Also applies to: 237-240
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`test/integration/integration_test.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test/integration/integration_test.go (7)
- test/integration/testcontainer.go (1): `NewHelper` (122-141)
- internal/client/client.go (2): `NewHyperFleetClient` (49-64), `ResourceType` (33-33)
- internal/engine/decision.go (1): `NewDecisionEngine` (31-36)
- pkg/logger/logger.go (1): `NewHyperFleetLogger` (35-43)
- internal/metrics/metrics.go (1): `NewSentinelMetrics` (154-179)
- internal/config/config.go (2): `SentinelConfig` (23-31), `LabelSelector` (14-17)
- internal/sentinel/sentinel.go (1): `NewSentinel` (29-45)
🔇 Additional comments (2)
test/integration/integration_test.go (2)
26-42: TestMain and mock cluster helpers look good

Centralized TestMain + shared RabbitMQ helper and the mock cluster builders are clear, deterministic, and side‑effect scoped to tests; no changes needed here.
Also applies to: 44-87
5-24: Fix data race on `callCount` and make polling check less flaky

The variable `callCount` is incremented in the HTTP handler goroutine and read in the test goroutine without synchronization, a data race that will be reported by `go test -race`. Additionally, relying on a fixed `time.Sleep(300 * time.Millisecond)` for `PollInterval: 100 * time.Millisecond` can be flaky if startup or scheduling is slow.

Use `sync/atomic` for the counter and replace the fixed sleep with deadline-based polling:

```diff
 import (
 	"context"
 	"encoding/json"
 	"flag"
 	"net/http"
 	"net/http/httptest"
 	"os"
 	"runtime"
 	"testing"
 	"time"
+
+	"sync/atomic"
 )
 @@
 func TestIntegration_EndToEnd(t *testing.T) {
 @@
-	// Create mock HyperFleet API server
-	callCount := 0
+	// Create mock HyperFleet API server
+	var callCount int32
 	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		callCount++
+		atomic.AddInt32(&callCount, 1)
 @@
-	// Wait for at least 2 polling cycles
-	time.Sleep(300 * time.Millisecond)
-	cancel()
-
-	// Verify that Sentinel actually polled the API at least twice
-	if callCount < 2 {
-		t.Errorf("Expected at least 2 polling cycles, got %d", callCount)
-	}
+	// Wait until we see at least 2 polling cycles or time out
+	deadline := time.Now().Add(2 * time.Second)
+	for {
+		if atomic.LoadInt32(&callCount) >= 2 {
+			break
+		}
+		if time.Now().After(deadline) {
+			t.Fatalf("Timed out waiting for at least 2 polling cycles, got %d", atomic.LoadInt32(&callCount))
+		}
+		time.Sleep(10 * time.Millisecond)
+	}
+	cancel()
```

This removes the race condition and makes the assertion robust against timing variance.
Addresses PR review comment about using Helm charts for deployments. Changes: - Created complete Helm chart in deployments/helm/sentinel/ - Chart.yaml with metadata - values.yaml with full configuration options - Templates for all Kubernetes resources - Comprehensive README with examples - Updated main README to recommend Helm deployment Features: - Support for RabbitMQ, GCP Pub/Sub, AWS SQS - Conditional secret creation (existingSecret support) - Dynamic config generation from values - Config/secret checksums for rolling updates - ServiceAccount annotations for Workload Identity/IRSA - Horizontal sharding examples - Security best practices documentation Resolves: PR comment openshift-hyperfleet#14
Summary
Implements comprehensive Prometheus metrics for HyperFleet Sentinel service following the official architecture specification and rh-trex pattern.
Changes
Created `internal/metrics/metrics.go` with 6 Prometheus metrics:

- `hyperfleet_sentinel_pending_resources` - Gauge for resources pending reconciliation
- `hyperfleet_sentinel_events_published_total` - Counter for published events
- `hyperfleet_sentinel_resources_skipped_total` - Counter for skipped resources
- `hyperfleet_sentinel_poll_duration_seconds` - Histogram for poll cycle duration
- `hyperfleet_sentinel_api_errors_total` - Counter for API errors
- `hyperfleet_sentinel_broker_errors_total` - Counter for broker errors

Added comprehensive unit tests (`internal/metrics/metrics_test.go`) with 92.7% coverage

Integrated metrics collection into sentinel reconciliation loop

Added `/metrics` and `/health` HTTP endpoints on port 8080

Updated integration tests to verify metrics collection
Architecture Alignment
✅ Follows official architecture: https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/components/sentinel/sentinel-deployment.md
✅ Implements exact metric names and labels from architecture spec
✅ Uses rh-trex pattern for metrics implementation
✅ Thread-safe initialization with sync.Once
✅ Input validation on all helper functions
Testing
`make test`

Changes from Review

- Removed `config_loads_total` metric as it provides limited value (only incremented once per boot)
Summary by CodeRabbit
New Features
Tests
Documentation
Chores