🌱 OPRUN-4242: Calibrate Prometheus alert thresholds using memory profiling data #2308
Conversation
✅ Deploy Preview for olmv1 ready!
I am OK with this.
It would be nice to get an LGTM from @dtfranz, who is the author of it.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #2308      +/-   ##
==========================================
- Coverage   74.32%   74.24%    -0.09%
==========================================
  Files          90       91        +1
  Lines        7008     7046       +38
==========================================
+ Hits         5209     5231       +22
- Misses       1392     1402       +10
- Partials      407      413        +6
helm/prometheus/values.yaml (Outdated)
  highMemoryThresholds:
    enabled: false
How about we put this under the prometheus config block and have actual threshold values for the various alarms:

prometheus:
  thresholds:
    memoryGrowth: 100_000
    catalogdMemoryUsage: 75_000_000
    operatorMemoryUsage: 100_000_000
    ...

It gives us more flexibility later: we do not need to modify the chart at all, and we can tailor the configuration by overriding the default values at chart installation time.
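For illustration, a chart consumer could then raise the thresholds at install time without touching the chart itself. A minimal sketch, assuming the key names proposed above; the override file name, release name, and chart path are placeholders, and the quoted digit-grouped values simply mirror the style used elsewhere in the chart's values.yaml:

```yaml
# values-override.yaml -- hypothetical override passed at install time, e.g.:
#   helm install prometheus ./helm/prometheus -f values-override.yaml
# Only the keys being changed need to appear; everything else keeps the chart defaults.
prometheus:
  thresholds:
    memoryGrowth: "200_000"            # bytes/sec, raised from the proposed 100_000 default
    operatorMemoryUsage: "150_000_000" # bytes, raised from the proposed 100_000_000 default
```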
Thinking...
Done!
Force-pushed from aacc60e to 4224099
helm/prometheus/values.yaml (Outdated)
  memoryUsage: "75_000_000"
  cpuUsage: 20
  apiCallRate: 5
  highMemoryThresholds:
this is not needed anymore, right?
Dammit! I thought I removed that
Fixed.
Analyze baseline memory usage patterns and adjust Prometheus alert thresholds to eliminate false positives while maintaining sensitivity to real issues. This is based on memory profiling done against BoxcutterRuntime, which has increased memory load.

Memory Analysis:
- Peak RSS: 107.9MB, Peak Heap: 54.74MB during e2e tests
- Memory stabilizes at 106K heap (heap19-21 show 0K growth for 3 snapshots)
- Conclusion: NOT a memory leak, but normal operational behavior

Memory Breakdown:
- JSON Deserialization: 24.64MB (45%) - inherent to OLM's dynamic nature
- Informer Lists: 9.87MB (18%) - optimization possible via field selectors
- OpenAPI Schemas: 3.54MB (6%) - already optimized (73% reduction)
- Runtime Overhead: 53.16MB (49%) - normal for Go applications

Alert Threshold Updates:
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec

Rationale:
Baseline profiling showed that 132.4kB/sec episodic growth during informer sync and 107.9MB peak usage are normal. The previous thresholds caused false positive alerts during normal e2e test execution.

Verification:
- Baseline test (old thresholds): 2 alerts triggered (false positives)
- Verification test (new thresholds): 0 alerts triggered ✅
- Memory patterns remain consistent (~55MB heap, 79-171MB RSS)
- Transient spikes don't trigger alerts due to the "for: 5m" clause

Recommendation:
Accept 107.9MB as normal operational behavior for test/development environments. Production deployments may need different thresholds based on workload characteristics (number of resources, reconciliation frequency).

Non-viable Optimizations:
- Cannot replace unstructured with typed clients (breaks OLM flexibility)
- Cannot reduce runtime overhead (inherent to Go)
- JSON deserialization is unavoidable for dynamic resource handling

🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
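For context, here is a minimal sketch of what alerting rules with the new thresholds and a 5-minute hold could look like. This is not the chart's actual rule file: the alert names follow the list above, but the metric, label selectors, and severity are assumptions.

```yaml
# Hypothetical Prometheus rule fragment illustrating the updated thresholds.
groups:
  - name: operator-controller-memory
    rules:
      - alert: operator-controller-memory-usage
        # Resident memory above 150MB (previously 100MB); the selector is an assumption.
        expr: process_resident_memory_bytes{job="operator-controller"} > 150000000
        for: 5m   # transient spikes shorter than 5 minutes never fire the alert
        labels:
          severity: warning
      - alert: operator-controller-memory-growth
        # Sustained growth above 200kB/sec (previously 100kB/sec), averaged over 5 minutes.
        expr: deriv(process_resident_memory_bytes{job="operator-controller"}[5m]) > 200000
        for: 5m
        labels:
          severity: warning
```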
Force-pushed from 4224099 to 9b44975
/lgtm
/lgtm too. Thanks for doing this, @tmshort, and for the thorough justification!

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86, pedjak, tmshort

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Merged commit 05ee601 into operator-framework:main
Analyze baseline memory usage patterns and adjust Prometheus alert thresholds to eliminate false positives while maintaining sensitivity to real issues.
This is based on memory profiling done against BoxcutterRuntime, which has increased memory load. It's set up to only increase the thresholds when running in experimental mode.
🤖 Generated with Claude Code
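The description notes that the higher thresholds only apply in experimental mode. A rough, hypothetical sketch of how a Helm template could switch on such a flag; `.Values.options.experimentalMode` and the literal values are invented for this illustration, not the chart's actual keys:

```yaml
# Hypothetical Helm template fragment -- not the chart's actual code.
# Use the higher memory-usage threshold only when experimental mode is enabled.
{{- $memoryUsageThreshold := "100_000_000" }}
{{- if .Values.options.experimentalMode }}
  {{- $memoryUsageThreshold = "150_000_000" }}
{{- end }}
- alert: operator-controller-memory-usage
  # Strip the digit-grouping underscores before emitting the PromQL number.
  expr: process_resident_memory_bytes{job="operator-controller"} > {{ $memoryUsageThreshold | replace "_" "" }}
  for: 5m
```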
Originally in ALERT_THRESHOLD_VERIFICATION.md
Alert Threshold Verification
Summary
Successfully verified that updated Prometheus alert thresholds eliminate false positive alerts during normal e2e test execution.
Test Results
Baseline Test (Before Threshold Updates)
Alerts Triggered:
- operator-controller-memory-growth: 132.4kB/sec (threshold: 100kB/sec)
- operator-controller-memory-usage: 107.9MB (threshold: 100MB)

Memory Profile:
- Peak RSS 107.9MB, peak heap 54.74MB during the e2e run
Verification Test (After Threshold Updates)
Alerts Triggered:
- None (0 alerts during normal e2e test execution)

Memory Profile:
- Memory patterns remained consistent (~55MB heap, 79-171MB RSS)
Alert Threshold Changes
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec
Memory Growth Analysis
Baseline Memory Growth Rate (5min avg):
- 132.4kB/sec episodic growth during informer sync

Memory Usage Pattern:
- ~55MB heap and 79-171MB RSS across the run, with a transient spike (171MB) at test completion
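For reference, the growth rate, usage pattern, and firing alerts in a verification run like this can be checked with queries along the following lines; a hypothetical sketch, with metric names and selectors as assumptions rather than the queries used in the original runs:

```yaml
# Hypothetical PromQL queries for this kind of verification; selectors are assumptions.
queries:
  # Per-second memory growth averaged over 5 minutes (compare against the 200kB/sec threshold).
  memoryGrowthRate: |
    deriv(process_resident_memory_bytes{job="operator-controller"}[5m])
  # Current resident memory, for tracking the usage pattern over a test run.
  memoryUsage: |
    process_resident_memory_bytes{job="operator-controller"}
  # Alerts currently firing (Prometheus' built-in ALERTS series).
  firingAlerts: |
    ALERTS{alertstate="firing", alertname=~".*memory.*"}
```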
Conclusion
✅ Verification Successful
The updated alert thresholds are correctly calibrated for test/development environments.
Important Notes
Thresholds are calibrated for test/development environments
Production deployments may need different thresholds based on workload characteristics such as the number of resources and reconciliation frequency
The "for: 5m" clause in alerts ensures transient spikes (like the 171MB spike at test completion) don't trigger alerts
Reference
See #2290 for detailed breakdown of memory usage patterns and optimization opportunities.
Reviewer Checklist