🌱 OPRUN-4242: Calibrate Prometheus alert thresholds using memory profiling data #2308
Conversation
✅ Deploy Preview for olmv1 ready!
I am OK with this.
It would be nice to get an LGTM from @dtfranz, who is the author of it.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #2308      +/-   ##
==========================================
- Coverage   74.32%   74.24%    -0.09%
==========================================
  Files          90       91        +1
  Lines        7008     7046       +38
==========================================
+ Hits         5209     5231       +22
- Misses       1392     1402       +10
- Partials      407      413        +6
helm/prometheus/values.yaml (Outdated)
  highMemoryThresholds:
    enabled: false
How about we put this under the prometheus config block and have actual threshold values for the various alarms:

prometheus:
  thresholds:
    memoryGrowth: 100_000
    catalogdMemoryUsage: 75_000_000
    operatorMemoryUsage: 100_000_000
    ...

It gives us more flexibility later: we do not need to modify the chart at all, and we can tailor the configuration by overriding the default values at chart installation time.
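For illustration, a chart consumer could then raise the thresholds at install time without touching the chart itself. A minimal sketch, assuming the key names proposed above; the override file name, release name, and chart path are placeholders, and the quoted digit-grouped values simply mirror the style used elsewhere in the chart's values.yaml:

```yaml
# values-override.yaml -- hypothetical override passed at install time, e.g.:
#   helm install prometheus ./helm/prometheus -f values-override.yaml
# Only the keys being changed need to appear; everything else keeps the chart defaults.
prometheus:
  thresholds:
    memoryGrowth: "200_000"            # bytes/sec, raised from the proposed 100_000 default
    operatorMemoryUsage: "150_000_000" # bytes, raised from the proposed 100_000_000 default
```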
Thinking...
Done!
Force-pushed from aacc60e to 4224099
helm/prometheus/values.yaml (Outdated)
  memoryUsage: "75_000_000"
  cpuUsage: 20
  apiCallRate: 5
  highMemoryThresholds:
this is not needed anymore, right?
Dammit! I thought I removed that
Fixed.
Analyze baseline memory usage patterns and adjust Prometheus alert thresholds to eliminate false positives while maintaining sensitivity to real issues. This is based on memory profiling done against BoxcutterRuntime, which has increased memory load.

Memory Analysis:
- Peak RSS: 107.9MB, Peak Heap: 54.74MB during e2e tests
- Memory stabilizes at 106K heap (heap19-21 show 0K growth for 3 snapshots)
- Conclusion: NOT a memory leak, but normal operational behavior

Memory Breakdown:
- JSON Deserialization: 24.64MB (45%) - inherent to OLM's dynamic nature
- Informer Lists: 9.87MB (18%) - optimization possible via field selectors
- OpenAPI Schemas: 3.54MB (6%) - already optimized (73% reduction)
- Runtime Overhead: 53.16MB (49%) - normal for Go applications

Alert Threshold Updates:
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec

Rationale:
Baseline profiling showed that 132.4kB/sec episodic growth during informer sync and 107.9MB peak usage are normal. The previous thresholds caused false positive alerts during normal e2e test execution.

Verification:
- Baseline test (old thresholds): 2 alerts triggered (false positives)
- Verification test (new thresholds): 0 alerts triggered ✅
- Memory patterns remain consistent (~55MB heap, 79-171MB RSS)
- Transient spikes don't trigger alerts due to the "for: 5m" clause

Recommendation:
Accept 107.9MB as normal operational behavior for test/development environments. Production deployments may need different thresholds based on workload characteristics (number of resources, reconciliation frequency).

Non-viable Optimizations:
- Cannot replace unstructured with typed clients (breaks OLM flexibility)
- Cannot reduce runtime overhead (inherent to Go)
- JSON deserialization is unavoidable for dynamic resource handling

🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
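For context, here is a minimal sketch of what alerting rules with the new thresholds and a 5-minute hold could look like. This is not the chart's actual rule file: the alert names follow the list above, but the metric, label selectors, and severity are assumptions.

```yaml
# Hypothetical Prometheus rule fragment illustrating the updated thresholds.
groups:
  - name: operator-controller-memory
    rules:
      - alert: operator-controller-memory-usage
        # Resident memory above 150MB (previously 100MB); the selector is an assumption.
        expr: process_resident_memory_bytes{job="operator-controller"} > 150000000
        for: 5m   # transient spikes shorter than 5 minutes never fire the alert
        labels:
          severity: warning
      - alert: operator-controller-memory-growth
        # Sustained growth above 200kB/sec (previously 100kB/sec), averaged over 5 minutes.
        expr: deriv(process_resident_memory_bytes{job="operator-controller"}[5m]) > 200000
        for: 5m
        labels:
          severity: warning
```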
Force-pushed from 4224099 to 9b44975
/lgtm
/lgtm too. Thanks for doing this, @tmshort, and for the thorough justification!

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86, pedjak, tmshort

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Merged commit 05ee601 into operator-framework:main
Analyze baseline memory usage patterns and adjust Prometheus alert thresholds to eliminate false positives while maintaining sensitivity to real issues.
This is based on memory profiling done against BoxcutterRuntime, which has increased memory load. It's set up to only increase the thresholds when running in experimental mode.
🤖 Generated with Claude Code
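The description notes that the higher thresholds only apply in experimental mode. A rough, hypothetical sketch of how a Helm template could switch on such a flag; `.Values.options.experimentalMode` and the literal values are invented for this illustration, not the chart's actual keys:

```yaml
# Hypothetical Helm template fragment -- not the chart's actual code.
# Use the higher memory-usage threshold only when experimental mode is enabled.
{{- $memoryUsageThreshold := "100_000_000" }}
{{- if .Values.options.experimentalMode }}
  {{- $memoryUsageThreshold = "150_000_000" }}
{{- end }}
- alert: operator-controller-memory-usage
  # Strip the digit-grouping underscores before emitting the PromQL number.
  expr: process_resident_memory_bytes{job="operator-controller"} > {{ $memoryUsageThreshold | replace "_" "" }}
  for: 5m
```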
Originally in ALERT_THRESHOLD_VERIFICATION.md
Alert Threshold Verification
Summary
Successfully verified that updated Prometheus alert thresholds eliminate false positive alerts during normal e2e test execution.
Test Results
Baseline Test (Before Threshold Updates)
Alerts Triggered:
- operator-controller-memory-growth: 132.4kB/sec (threshold: 100kB/sec)
- operator-controller-memory-usage: 107.9MB (threshold: 100MB)

Memory Profile:
- Peak RSS 107.9MB, peak heap 54.74MB during the e2e run
Verification Test (After Threshold Updates)
Alerts Triggered:
- None (0 alerts during normal e2e test execution)

Memory Profile:
- Memory patterns remained consistent (~55MB heap, 79-171MB RSS)
Alert Threshold Changes
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec
Memory Growth Analysis
Baseline Memory Growth Rate (5min avg):
- 132.4kB/sec episodic growth during informer sync

Memory Usage Pattern:
- ~55MB heap and 79-171MB RSS across the run, with a transient spike (171MB) at test completion
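For reference, the growth rate, usage pattern, and firing alerts in a verification run like this can be checked with queries along the following lines; a hypothetical sketch, with metric names and selectors as assumptions rather than the queries used in the original runs:

```yaml
# Hypothetical PromQL queries for this kind of verification; selectors are assumptions.
queries:
  # Per-second memory growth averaged over 5 minutes (compare against the 200kB/sec threshold).
  memoryGrowthRate: |
    deriv(process_resident_memory_bytes{job="operator-controller"}[5m])
  # Current resident memory, for tracking the usage pattern over a test run.
  memoryUsage: |
    process_resident_memory_bytes{job="operator-controller"}
  # Alerts currently firing (Prometheus' built-in ALERTS series).
  firingAlerts: |
    ALERTS{alertstate="firing", alertname=~".*memory.*"}
```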
Conclusion
✅ Verification Successful
The updated alert thresholds are correctly calibrated for test/development environments.
Important Notes
Thresholds are calibrated for test/development environments
Production deployments may need different thresholds based on workload characteristics such as the number of resources and reconciliation frequency
The "for: 5m" clause in alerts ensures transient spikes (like the 171MB spike at test completion) don't trigger alerts
Reference
See #2290 for detailed breakdown of memory usage patterns and optimization opportunities.
Reviewer Checklist