🐛 OPRUN-4239: Memory usage improvements #2290
Conversation
Force-pushed 888d28f to d28dfea
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2290 +/- ##
==========================================
- Coverage 74.32% 74.29% -0.04%
==========================================
Files 90 91 +1
Lines 7008 7046 +38
==========================================
+ Hits 5209 5235 +26
- Misses 1392 1400 +8
- Partials 407 411 +4
Force-pushed d28dfea to e0e4ec0
Force-pushed 2f55115 to a8afab3
Pull Request Overview
This PR implements memory optimizations for operator-controller and catalogd based on profiling analysis. The changes reduce memory usage by approximately 16-20% through cache transformations, discovery client caching, and strategic pre-allocations. Additionally, it introduces a comprehensive e2e profiling toolset and updates Prometheus alert thresholds based on empirical memory usage data.
- Memory optimization: Cache transforms to strip managed fields and large annotations, discovery client caching to prevent redundant OpenAPI fetches
- Profiling infrastructure: Complete e2e profiling toolkit with automated collection, analysis, and comparison capabilities (a sample collection step is sketched after this list)
- Alert calibration: Updated Prometheus thresholds based on actual memory usage patterns observed during profiling
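The profiling toolkit itself lives in hack/tools/e2e-profiling/. As a rough illustration of the kind of collection step such a toolkit automates, here is a hypothetical sketch that pulls a heap profile from the pprof endpoint the manifests enable via --pprof-bind-address (the port and file name are assumptions, not the toolkit's actual values):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// NOTE: localhost:8083 is an assumed pprof bind address; the real value
	// comes from the --pprof-bind-address flag set in the manifests.
	resp, err := http.Get("http://localhost:8083/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Save the raw profile; it can be analyzed later with `go tool pprof heap.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```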
Reviewed Changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/shared/util/cache/transform.go | New cache transform function to strip memory-heavy fields from cached objects |
| internal/operator-controller/applier/boxcutter.go | Memory optimizations: pre-allocated maps, stripped managed fields, improved slice allocation estimates |
| internal/catalogd/storage/localdir.go | Pre-allocate metaChans slice with correct capacity |
| internal/catalogd/garbagecollection/garbage_collector.go | Pre-allocate removed slice to avoid reallocation |
| cmd/operator-controller/main.go | Apply cache transforms and wrap discovery client with memory cache |
| cmd/catalogd/main.go | Apply cache transforms to catalogd's cache configuration |
| manifests/*.yaml | Add pprof-bind-address flags to enable profiling in e2e tests |
| helm/olmv1/templates/*.yml | Conditionally enable pprof for e2e environments |
| helm/prometheus/templates/prometheusrule-controller-alerts.yml | Updated alert thresholds based on profiling data |
| hack/tools/e2e-profiling/* | Complete profiling toolkit with scripts for collection, analysis, and comparison |
| docs/project/*.md | Memory analysis documentation and alert threshold verification |
| .gitignore | Exclude e2e-profiles directory and refined Claude file exclusions |
| .claude/commands/e2e-profile.md | Claude Code integration for profiling commands |
Force-pushed 48a627b to 25b2461
Used Claude to fix Copilot review issues.
Force-pushed 25b2461 to aebb6e9
I split the PR... see #2298
if labels == nil {
    labels = map[string]string{}
}
// Optimize: avoid cloning if we're going to add labels anyway
This comment is going to be confusing (context-less) without a diff to see.
Also I don't understand the change here. Cloning creates a completely new slice (independent of the original), and creates the backing storage itself automatically.
If we want to copy into a new array instead, we have to create the new storage manually, and then copy from the source (original slice) to dest (new slice). So it's doing the same thing cloning does, but manually?
Like the other make() optimizations, this pre-allocates the size of the map, rather than allowing the code to reallocate as needed. So the map is sized according to the two maps, and then the other maps are copied into the new map.
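For readers without the diff, a minimal sketch of the pre-allocation pattern being described (hypothetical names, not the actual boxcutter code): the destination map is sized for both inputs up front, so it never has to grow while copying.

```go
package labelsutil

// mergeLabels illustrates the make() pre-allocation pattern: size the
// destination for both source maps, then copy each one in. No reallocation
// happens during the copies because the capacity hint covers both inputs.
func mergeLabels(existing, additional map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(additional))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range additional {
		merged[k] = v
	}
	return merged
}
```

The same idea applies to the slice pre-allocations elsewhere in the PR: make([]T, 0, n) with a known n avoids repeated grow-and-copy cycles.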
👍🏽 about the allocation. How do you feel about removing the comment? It's a nit, though.
I can remove the comments
docs/project/MEMORY_ANALYSIS.md (Outdated)
@@ -0,0 +1,302 @@
# Memory Analysis & Optimization Plan
I don't know how I feel about committing these files to the project. Projects don't normally create and commit these kinds of files to the code base, even though LLMs speak this lingo now.
Yeah, I hear you. But it does provide context for this PR. I can remove it before merge.
👍 I would add them to PR description.
I added it to the (now very long) PR description.
Better than having them as files committed to main 👍🏽
cmd/catalogd/main.go (Outdated)
// Memory optimization: strip managed fields and large annotations from cached objects
DefaultTransform: cacheutil.StripManagedFieldsAndAnnotations,
Can we quantify the saving here? Is it worth the effort?
If I'm reading the MEMORY_ANALYSIS.md correctly, it's saving ~10MB or 18%
perhaps you can state the saving in the comment above.
Those savings only apply for the test against which this was run, and may be more, less, or may change over time, depending on the actual usage. So, I'd rather not give explicit numbers.
> Those savings only apply for the test against which this was run, and may be more, less, or may change over time, depending on the actual usage. So, I'd rather not give explicit numbers.

Sure, I was thinking more of having a slightly more explicit comment stating that stripping these fields significantly reduces memory usage.
//
// Note: This function signature returns an error to satisfy the cache.TransformFunc
// interface, but it never returns an error in practice.
func StripManagedFieldsAndAnnotations(obj interface{}) (interface{}, error) {
Is the additional saving significant compared to just using TransformStripManagedFields from controller-runtime?
Apparently. Claude did not know about that function.
This does more than just strip the managed fields; it also removes the last-applied-configuration annotation, which can be quite large as well.
According to Claude:
● Based on my analysis, here's the comparison:
What Each Function Does
TransformStripManagedFields (controller-runtime v0.22.1)
- Package: sigs.k8s.io/controller-runtime/pkg/cache
- Signature: func TransformStripManagedFields() toolscache.TransformFunc
- Purpose: Strips only managed fields from objects before caching
- Added in: controller-runtime v0.18.0
StripManagedFieldsAndAnnotations (your custom function)
- Location: internal/shared/util/cache/transform.go:20
- Does:
a. Strips managed fields (same as controller-runtime's function)
b. ALSO strips the kubectl.kubernetes.io/last-applied-configuration annotation (which can be very large)
c. Clones the annotations map to avoid modifying shared references

Answer to Your Question
- Instead of: ❌ No. The controller-runtime function only strips managed fields, while your custom function also strips a memory-heavy annotation. Using only TransformStripManagedFields would lose the annotation-stripping optimization.
- In addition to: ❌ No. This would be redundant since both strip managed fields, so you'd be stripping managed fields twice.
- As part of: ✅ Your custom function already includes the functionality. Your StripManagedFieldsAndAnnotations essentially does what TransformStripManagedFields does PLUS additional annotation cleanup.

Recommendation
Keep using your custom StripManagedFieldsAndAnnotations because it provides more memory optimization by also removing the kubectl.kubernetes.io/last-applied-configuration annotation, which the controller-runtime function doesn't handle. Your implementation is more comprehensive for memory reduction. However, you could potentially refactor it to compose with the controller-runtime function if you wanted to reduce code duplication, but the current implementation is simple, clear, and well-documented.
> However, you could potentially refactor it to compose with the controller-runtime function if you wanted to reduce code duplication, but the current implementation is simple, clear, and well-documented.

This would be nice to do, to reduce the code duplication.
> Keep using your custom StripManagedFieldsAndAnnotations because it provides more memory optimization by also removing the kubectl.kubernetes.io/last-applied-configuration annotation, which the controller-runtime function doesn't handle.

Right, but in our e2e test we cannot measure it, because the resources are not created using kubectl, hence the annotation does not exist.
That's fair, but it's still a potential for memory savings when kubectl is used.
//
// Note: This function signature returns an error to satisfy the cache.TransformFunc
// interface, but it never returns an error in practice.
func StripManagedFieldsAndAnnotations(obj interface{}) (interface{}, error) {
Should we not actually return the TransformFunc type, to improve readability?
Done... this introduced a helper function because boxcutter used the function as well...
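For context, here is a minimal sketch of what that TransformFunc-returning shape could look like, an approximation based on the behavior discussed above rather than the exact contents of internal/shared/util/cache/transform.go:

```go
package cache

import (
	"k8s.io/apimachinery/pkg/api/meta"
	toolscache "k8s.io/client-go/tools/cache"
)

const lastAppliedAnnotation = "kubectl.kubernetes.io/last-applied-configuration"

// StripManagedFieldsAndAnnotations returns a cache.TransformFunc that drops
// memory-heavy metadata before objects are stored in the informer cache.
// The error return exists only to satisfy the TransformFunc signature.
func StripManagedFieldsAndAnnotations() toolscache.TransformFunc {
	return func(obj interface{}) (interface{}, error) {
		m, err := meta.Accessor(obj)
		if err != nil {
			// Not a metav1.Object (e.g. a tombstone); cache it unchanged.
			return obj, nil
		}
		m.SetManagedFields(nil)
		if ann := m.GetAnnotations(); ann != nil {
			if _, ok := ann[lastAppliedAnnotation]; ok {
				// Clone before mutating so shared references stay untouched.
				cloned := make(map[string]string, len(ann)-1)
				for k, v := range ann {
					if k != lastAppliedAnnotation {
						cloned[k] = v
					}
				}
				m.SetAnnotations(cloned)
			}
		}
		return obj, nil
	}
}
```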
Force-pushed aebb6e9 to 6ab096e
description: container {{`{{ $labels.container }}`}} of pod {{`{{ $labels.pod }}`}} experienced OOM event(s); count={{`{{ $value }}`}}
expr: container_oom_events_total > 0
# Memory growth alerts - thresholds calibrated based on baseline memory profiling
# See MEMORY_ANALYSIS.md for details on normal operational memory patterns
There's no MEMORY_ANALYSIS.md file anymore right?
Right, it's part of the PR comment.
Force-pushed 6ab096e to 84edafb
# Threshold: 75MB (baseline shows 16.9MB peak, well under threshold)
expr: sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"}) > 75_000_000
for: 5m
keep_firing_for: 1d
👍 to increasing the thresholds; it will solve the flakes.
cacheOptions := crcache.Options{
    ByObject: map[client.Object]crcache.ByObject{},
    // Memory optimization: strip managed fields and large annotations from cached objects
    DefaultTransform: cacheutil.StripManagedFieldsAndAnnotations(),
Suggestion:
Just a small nit — this isn’t valid only for Boxcutter since we’re not protecting it with a feature flag.
Could we clarify that in the title? As it stands, it gives the impression that the change only applies when the flag is enabled.
Alternatively, if we don’t want it to be used by default, we could protect the change with the flag.
# Threshold: 150MB (baseline shows 107.9MB peak is normal, stabilizes at 78-88MB)
expr: sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"}) > 150_000_000
I find the changes in this file a bit unexpected, given the overall PR goal: if we have improved memory usage (i.e. reduced it), I would expect to keep the current thresholds, if not even decrease them.
We didn’t have a proper benchmark defined for those values.
Since @tmshort performed the checks, he updated them accordingly.
However, I agree that this change is out of scope for this PR.
It would be more appropriate to have a separate PR specifically for that, for example: "Update thresholds based on benchmark tests". @tmshort Could we do that, so we can LGTM this ASAP?
Another split...
@pedjak
Boxcutter uses more memory in general.
This PR is attempting to reduce that extra memory usage, but it can't eliminate it, so the thresholds that were in place for the helm applier do not apply when boxcutter is enabled.
> @pedjak Boxcutter uses more memory in general. This PR is attempting to reduce that extra memory usage, but it can't eliminate it, so the thresholds that were in place for the helm applier do not apply when boxcutter is enabled.
Fully agree that the PR contains a useful improvement, but would the new threshold (150MB) be reached on the main branch today? If not, I would suggest we lower it a bit (e.g. 120MB), so that we can consistently catch regressions.
> Fully agree that the PR contains a useful improvement, but would the new threshold (150MB) be reached on the main branch today? If not, I would suggest we lower it a bit (e.g. 120MB), so that we can consistently catch regressions.
I've created #2308, it will only update the threshold for test-experimental-e2e. The thresholds for test-e2e don't change.
This PR is just the memory optimizations now
Implement multiple memory optimization strategies to reduce heap allocations and RSS memory usage during operator execution:

**OpenAPI Schema Caching:**
- Wrap discovery client with memory.NewMemCacheClient to cache OpenAPI schemas
- Prevents redundant schema fetches from API server
- Applied to both operator-controller and catalogd

**Cache Transform Functions:**
- Strip managed fields from cached objects (can be several KB per object)
- Remove large annotations (kubectl.kubernetes.io/last-applied-configuration)
- Shared transform function in internal/shared/util/cache/transform.go

**Memory Efficiency Improvements:**
- Pre-allocate slices with known capacity to reduce grow operations
- Reduce unnecessary deep copies of large objects
- Optimize JSON deserialization paths

**Impact:**
These optimizations significantly reduce memory overhead, especially for large-scale deployments with many resources. OpenAPI caching alone reduces allocations by ~73% (from 13MB to 3.5MB per profiling data).

See MEMORY_ANALYSIS.md for detailed breakdown of memory usage patterns.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
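The OpenAPI schema caching piece of that commit boils down to wrapping the discovery client. A minimal sketch of that wiring (illustrative, not the exact main.go code):

```go
package discoverycache

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/rest"
)

// newCachedDiscoveryClient wraps the discovery client in an in-memory cache
// so repeated discovery/OpenAPI lookups don't re-fetch from the API server.
func newCachedDiscoveryClient(cfg *rest.Config) (discovery.CachedDiscoveryInterface, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	// Call Invalidate() on the returned client if API resources change
	// (e.g. after new CRDs are installed).
	return memory.NewMemCacheClient(dc), nil
}
```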
Force-pushed 84edafb to b7ca109
/lgtm
He already gave his LGTM, but the "Changes Requested" needs to be removed.
@@ -0,0 +1,71 @@
package cache
nit: perhaps a better location for this file is internal/shared/controllers?
/lgtm

/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: pedjak, tmshort
Merged f0a9ded into operator-framework:main
Memory Optimization
This PR implements comprehensive memory optimizations to reduce memory usage and establish evidence-based alerting thresholds.
This was triggered by Boxcutter using more memory than the helm applier. Some of these optimizations are for Boxcutter, but some are also more general.
Summary of Changes
⚡ Memory Optimizations
OpenAPI Schema Caching:
- Wrap the discovery client with memory.NewMemCacheClient() to cache OpenAPI schemas

Cache Transform Functions:
- Strip managed fields and kubectl.kubernetes.io/last-applied-configuration annotations via the shared transform in internal/shared/util/cache/transform.go

Memory Efficiency Improvements:
Documentation
Testing
All changes have been validated through:
Impact
Memory Usage:
Developer Experience:
Notes
🤖 Generated with Claude Code
These were originally part of the PR as documents, but have been added to the PR description:
MEMORY_ANALYSIS.md
Memory Analysis & Optimization Plan
Executive Summary
Current State:
Alerts Triggered:
Key Finding: Memory stabilizes during normal operation (heap19-21 @ 106K for 3 snapshots), suggesting this is NOT a memory leak but rather expected operational memory usage.
Memory Breakdown
Peak Memory (54.74MB Heap)
Memory Growth Pattern
Critical Insight: heap19-21 shows 0K growth for 3 consecutive snapshots. This proves memory stabilizes during normal operation and is NOT continuously leaking.
Root Cause Analysis
1. JSON Deserialization (24.64MB / 45%) 🔴
What's happening:
Why:
- unstructured.Unstructured types for dynamic manifests from bundles
- map[string]interface{} conversion allocates heavily

Call Stack:
Is this a problem?
Optimization Potential: ⚠️ Limited
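To make the allocation pattern concrete, a small self-contained illustration (hypothetical manifest, not code from this PR): decoding JSON into unstructured.Unstructured builds nested map[string]interface{} values, and every key and string value is a separate heap allocation.

```go
package main

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
	// A tiny stand-in for a bundle manifest; real manifests are far larger.
	manifest := []byte(`{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"demo"}}`)

	u := &unstructured.Unstructured{}
	// Unmarshal fans the document out into nested maps and strings, which is
	// why large manifests produce many small heap objects.
	if err := json.Unmarshal(manifest, &u.Object); err != nil {
		panic(err)
	}
	fmt.Println(u.GetKind(), u.GetName())
}
```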
2. Informer List Operations (9.87MB / 18%) 🟡
What's happening:
Why:
Current Configuration:
Optimization Potential: ✅ Medium (~3-5MB savings)
3. OpenAPI Schema Retrieval (3.54MB / 6%) ✅
Status: Already optimized in commit 446a595
What was done:
Results:
Remaining 3.54MB: Legitimate one-time schema fetch + conversion costs
Optimization Potential: ✅ Already done
Recommended Optimizations
Priority 1: Document Expected Behavior ✅ RECOMMEND
Action: Accept current memory usage as normal operational behavior
Reasoning:
Implementation:
Effort: Low
Risk: None
Impact: Eliminates false positive alerts
Priority 2: Add Pagination to Informer Lists ⚠️ EVALUATE
Goal: Reduce initial list memory from 9.87MB → ~5-6MB
Problem: Informers load entire resource collections at startup:
Solution: Use pagination for initial sync
Implementation Research Needed:
Current controller-runtime may not support pagination for informers. Would need to:
- Verify whether cache.Options supports list pagination

Field Selector Approach (easier; see the sketch below):
Caveat: Only helps if significant % of resources match filter
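A sketch of what the field selector approach could look like with controller-runtime's cache.Options; the object type and selector value here are hypothetical, chosen only to show the shape:

```go
package config

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	crcache "sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cacheOptions restricts what the informer lists and caches, so the initial
// sync holds fewer objects in memory.
func cacheOptions() crcache.Options {
	return crcache.Options{
		ByObject: map[client.Object]crcache.ByObject{
			// Hypothetical: only cache Secrets from one namespace rather
			// than every Secret in the cluster.
			&corev1.Secret{}: {
				Field: fields.OneTermEqualSelector("metadata.namespace", "olmv1-system"),
			},
		},
	}
}
```

As the caveat above says, this only pays off when a meaningful fraction of the resources are excluded by the selector.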
Effort: Medium (if controller-runtime supports) / High (if requires upstream changes)
Risk: Low (just cache configuration)
Potential Savings: 3-5MB (30-50% of list operation memory)
Priority 3: Reduce JSON String Allocation ⚠️ RESEARCH
Goal: Reduce json.unquote overhead (10.14MB)
Problem: Heavy string interning during JSON parsing
Possible Approaches:
Research Required:
Effort: High (requires careful implementation + testing)
Risk: Medium (could introduce subtle bugs)
Potential Savings: 2-3MB (20-30% of string allocation)
Non-Viable Optimizations
❌ Replace Unstructured with Typed Clients
Why not: Operator deals with arbitrary CRDs from bundles that aren't known at compile time
Tradeoff: Would need to:
Verdict: Not feasible for OLM
❌ Reduce Runtime Overhead (53MB)
Why not: This is inherent Go runtime memory (GC, goroutine stacks, fragmentation)
Normal Ratio: 50% overhead (heap → RSS) is typical for Go applications
Verdict: Cannot optimize without affecting functionality
Recommended Action Plan
Phase 1: Acceptance (Immediate)
Rationale: Data shows memory is stable, not leaking
Phase 2: Optional Optimizations (Future)
Only pursue if production shows issues
Investigate field selectors for informer lists (Low effort, low risk)
Research pagination support in controller-runtime (Medium effort)
Benchmark object pooling for unstructured types (High effort)
Success Metrics
Current Baseline
Target (if optimizations pursued)
Production Monitoring
Conclusion
Current State: ✅ HEALTHY
The memory profile shows:
Recommendation: Accept current behavior and adjust alert thresholds
Optional Future Work: Investigate informer list optimizations if production deployments show issues with large numbers of resources