Skip to content

Conversation

@tmshort
Copy link
Contributor

@tmshort tmshort commented Nov 5, 2025

Analyze baseline memory usage patterns and adjust Prometheus alert thresholds to eliminate false positives while maintaining sensitivity to real issues.

This is based on memory profiling done against BoxcutterRuntime, which has increased memory load. It's set up to only increase the thresholds when running in experimental mode.

Memory Analysis:

  • Peak RSS: 107.9MB, Peak Heap: 54.74MB during e2e tests
  • Memory stabilizes at 106K heap (heap19-21 show 0K growth for 3 snapshots)
  • Conclusion: NOT a memory leak, but normal operational behavior

Memory Breakdown:

  • JSON Deserialization: 24.64MB (45%) - inherent to OLM's dynamic nature
  • Informer Lists: 9.87MB (18%) - optimization possible via field selectors
  • OpenAPI Schemas: 3.54MB (6%) - already optimized (73% reduction)
  • Runtime Overhead: 53.16MB (49%) - normal for Go applications

Alert Threshold Updates:

  • operator-controller-memory-growth: 100kB/sec → 200kB/sec
  • operator-controller-memory-usage: 100MB → 150MB
  • catalogd-memory-growth: 100kB/sec → 200kB/sec

Rationale:
Baseline profiling showed 132.4kB/sec episodic growth during informer sync and 107.9MB peak usage are normal. Previous thresholds caused false positive alerts during normal e2e test execution.

Verification:

  • Baseline test (old thresholds): 2 alerts triggered (false positives)
  • Verification test (new thresholds): 0 alerts triggered ✅
  • Memory patterns remain consistent (~55MB heap, 79-171MB RSS)
  • Transient spikes don't trigger alerts due to "for: 5m" clause

Recommendation:
Accept 107.9MB as normal operational behavior for test/development environments. Production deployments may need different thresholds based on workload characteristics (number of resources, reconciliation frequency).

Non-viable Optimizations:

  • Cannot replace unstructured with typed clients (breaks OLM flexibility)
  • Cannot reduce runtime overhead (inherent to Go)
  • JSON deserialization is unavoidable for dynamic resource handling

🤖 Generated with Claude Code

Originally in ALERT_THRESHOLD_VERIFICATION.md

Alert Threshold Verification

Summary

Successfully verified that updated Prometheus alert thresholds eliminate false positive alerts during normal e2e test execution.

Test Results

Baseline Test (Before Threshold Updates)

Alerts Triggered:

  • ⚠️ operator-controller-memory-growth: 132.4kB/sec (threshold: 100kB/sec)
  • ⚠️ operator-controller-memory-usage: 107.9MB (threshold: 100MB)

Memory Profile:

  • operator-controller: 25 profiles, peak heap24.pprof (160K)
  • catalogd: 25 profiles, peak heap24.pprof (44K)
  • Peak heap memory: 54.74MB
  • Peak RSS memory: 107.9MB

Verification Test (After Threshold Updates)

Alerts Triggered:

  • None - Zero alerts fired

Memory Profile:

  • operator-controller: 25 profiles, peak heap24.pprof (168K)
  • catalogd: 25 profiles, peak heap24.pprof (44K)
  • Peak heap memory: ~55MB (similar to baseline)
  • RSS memory: Stayed mostly 79-90MB with final spike to 171MB (did not sustain for 5min)

Alert Threshold Changes

Alert Old Threshold New Threshold Rationale
operator-controller-memory-growth 100 kB/sec 200 kB/sec Baseline shows 132.4kB/sec episodic growth is normal
operator-controller-memory-usage 100 MB 150 MB Baseline shows 107.9MB peak is normal operational usage
catalogd-memory-growth 100 kB/sec 200 kB/sec Aligned with operator-controller for consistency
catalogd-memory-usage 75 MB 75 MB No change needed (16.9MB peak well under threshold)

Memory Growth Analysis

Baseline Memory Growth Rate (5min avg):

  • Observed: 109.4 KB/sec max in verification test
  • Pattern: Episodic spikes during informer sync and reconciliation
  • Not a continuous leak - memory stabilizes during normal operation

Memory Usage Pattern:

  • Initialization: 12K → 19K (minimal)
  • Informer sync: 19K → 64K (rapid growth)
  • Steady operation: 64K → 106K (gradual)
  • Stabilization: 106K (heap19-21 show 0K growth for 3 snapshots)

Conclusion

Verification Successful

The updated alert thresholds are correctly calibrated for test/development environments:

  1. No false positive alerts during normal e2e test execution
  2. Thresholds still detect anomalies: Set high enough to avoid false positives but low enough to catch actual issues
  3. Memory behavior is consistent: Both baseline and verification tests show similar memory patterns

Important Notes

  • Thresholds are calibrated for test/development environments

  • Production deployments may need different thresholds based on:

    • Number of managed ClusterExtensions
    • Reconciliation frequency
    • Cluster size and API server load
    • Number of ClusterCatalogs and bundle complexity
  • The "for: 5m" clause in alerts ensures transient spikes (like the 171MB spike at test completion) don't trigger alerts

Reference

See #2290 for detailed breakdown of memory usage patterns and optimization opportunities.

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort tmshort requested a review from a team as a code owner November 5, 2025 17:02
@netlify
Copy link

netlify bot commented Nov 5, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 9b44975
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/690ba8ba44d1720008e38651
😎 Deploy Preview https://deploy-preview-2308--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@openshift-ci openshift-ci bot requested review from bentito and thetechnick November 5, 2025 17:02
Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with.
It would be nice get either LGTM from @dtfranz who is the author of it.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2025
@codecov
Copy link

codecov bot commented Nov 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.24%. Comparing base (18142b3) to head (9b44975).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2308      +/-   ##
==========================================
- Coverage   74.32%   74.24%   -0.09%     
==========================================
  Files          90       91       +1     
  Lines        7008     7046      +38     
==========================================
+ Hits         5209     5231      +22     
- Misses       1392     1402      +10     
- Partials      407      413       +6     
Flag Coverage Δ
e2e 45.91% <ø> (-0.06%) ⬇️
experimental-e2e 48.20% <ø> (-0.03%) ⬇️
unit 58.58% <ø> (-0.25%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 11 to 23
highMemoryThresholds:
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about that we put this under prometheus config block and have actual threshold values for various alarms:

prometheus:
  thresholds:
     memoryGrowth: 100_000
     catalogdMemoryUsage: 75_000_000
     operatorMemoryUsage: 100_000_000
    .
    .

It gives us more flexibility later, we do not need to modify the chart at all, and we can configure this in a particular way by overriding the default values at the chart installation time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2025
memoryUsage: "75_000_000"
cpuUsage: 20
apiCallRate: 5
highMemoryThresholds:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed anymore, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dammit! I thought I removed that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Analyze baseline memory usage patterns and adjust Prometheus alert thresholds
to eliminate false positives while maintaining sensitivity to real issues.

This is based on memory profiling done against BoxcutterRuntime, which has
increased memory load.

**Memory Analysis:**
- Peak RSS: 107.9MB, Peak Heap: 54.74MB during e2e tests
- Memory stabilizes at 106K heap (heap19-21 show 0K growth for 3 snapshots)
- Conclusion: NOT a memory leak, but normal operational behavior

**Memory Breakdown:**
- JSON Deserialization: 24.64MB (45%) - inherent to OLM's dynamic nature
- Informer Lists: 9.87MB (18%) - optimization possible via field selectors
- OpenAPI Schemas: 3.54MB (6%) - already optimized (73% reduction)
- Runtime Overhead: 53.16MB (49%) - normal for Go applications

**Alert Threshold Updates:**
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec

**Rationale:**
Baseline profiling showed 132.4kB/sec episodic growth during informer sync
and 107.9MB peak usage are normal. Previous thresholds caused false positive
alerts during normal e2e test execution.

**Verification:**
- Baseline test (old thresholds): 2 alerts triggered (false positives)
- Verification test (new thresholds): 0 alerts triggered ✅
- Memory patterns remain consistent (~55MB heap, 79-171MB RSS)
- Transient spikes don't trigger alerts due to "for: 5m" clause

**Recommendation:**
Accept 107.9MB as normal operational behavior for test/development
environments. Production deployments may need different thresholds based
on workload characteristics (number of resources, reconciliation frequency).

**Non-viable Optimizations:**
- Cannot replace unstructured with typed clients (breaks OLM flexibility)
- Cannot reduce runtime overhead (inherent to Go)
- JSON deserialization is unavoidable for dynamic resource handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Copy link
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2025
@dtfranz
Copy link
Contributor

dtfranz commented Nov 5, 2025

/lgtm too thanks for doing this @tmshort and for the thorough justification!

@tmshort
Copy link
Contributor Author

tmshort commented Nov 6, 2025

/approve

@openshift-ci
Copy link

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86, pedjak, tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 05ee601 into operator-framework:main Nov 6, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants