What version of Go are you using (go version)?
go1.19.5, linux/amd64
Does this issue reproduce with the latest release?
yes
What did you do?
I ran a slowly leaking memory application with GOMEMLIMIT and GOGC=-1.
What did you expect to see?
I expected the application to run fine for a while. After a few days, I expected GC CPU utilization to increase gradually (slightly, by 25-50% at most), and then the application to be OOM-killed by the kernel.
What did you see instead?
After the initial expected GC CPU utilization increase, the utilized CPU increased dramatically (75% of the CPU time available to the container). The application remained in a degraded state for a long time (5-8+ hours).
Report
This is a continuation of a conversation started in #48409 (comment) - I thought I'd start a new issue.
tldr;
- GOMEMLIMIT works great for 95% of our applications; we'd like to enable it globally
- the remaining 5% are applications that slowly leak memory; fixing them is infeasible
- for these applications, we see significant degradation for a couple of hours before the container is OOM-killed
- we want these applications to die faster (and be restarted, as they are now) - the degradation is too "expensive".
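For context, the configuration under discussion can be set programmatically as well as via the environment. A minimal sketch (the 10 GiB container size and 95% headroom are taken from the numbers later in this report; `limitFor` is a hypothetical helper, not something we run in production):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// limitFor computes a soft memory limit as a percentage of the
// container's memory, e.g. 95% of the cgroup limit.
func limitFor(containerBytes, pct int64) int64 {
	return containerBytes * pct / 100
}

func main() {
	// Programmatic equivalent of GOGC=-1: disable the proportional
	// pacer, leaving the memory limit as the only GC trigger.
	debug.SetGCPercent(-1)

	// Programmatic equivalent of GOMEMLIMIT: cap the runtime's total
	// memory at 95% of a 10 GiB container.
	limit := limitFor(10<<30, 95)
	debug.SetMemoryLimit(limit)
	fmt.Printf("GC percent off, memory limit %d bytes\n", limit)
}
```

With GOGC disabled, the GC only runs as the live heap approaches the limit, which is exactly the regime where the degradation below appears.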
Investigation
Hey @mknyszek, I come bringing more data. In summary:
- testing "slow memory leaks" is slow
- I'm able to reproduce the degradation on both staging and production consistently
- I should be able to get gctraces; I'm giving the staging applications more time to crash
- the CPU utilization degradation is significant (75% of the container)
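Since reproducing a slow leak in staging takes days, the shape of the workload can be compressed into a toy reproducer. This is a sketch, not our application: the limit, chunk size, and tick are made-up values chosen so the runtime approaches its limit in seconds rather than days.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

// leaked is never released, so every append permanently grows the live heap.
var leaked [][]byte

// leakN appends n chunks of size bytes to the retained slice and
// returns the number of bytes added by this call.
func leakN(n, size int) int {
	for i := 0; i < n; i++ {
		leaked = append(leaked, make([]byte, size))
	}
	return n * size
}

func main() {
	debug.SetGCPercent(-1)         // GOGC=-1, as in the report
	debug.SetMemoryLimit(64 << 20) // tiny limit so the GC hits it quickly
	for i := 0; i < 32; i++ {
		leakN(1, 1<<20) // "leak" 1 MiB per tick (vastly faster than production)
		fmt.Printf("leaked %d MiB total\n", len(leaked))
		time.Sleep(10 * time.Millisecond)
	}
}
```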
I was testing two of our applications via dynamically adjusting GOMEMLIMIT.
1. A latency-sensitive proxy in production.
Observations:
- I'm able to consistently trigger the degradation by setting the memory manually
- surprising: the CPU usage increase is not gradual but sudden (1.5 cores -> 3.8 cores)
- a 200 MiB difference in GOMEMLIMIT can trigger the degraded behavior
- RSS grows by ~40 MiB/hour, so it would take ~10 hours for the OOM killer to kick in with GOMEMLIMIT at 95% of the container memory.
The container runs with GOMAXPROCS=4 (4-core cgroup limit) and 10 GiB of memory. At the time of the tests, the cgroup-reported RSS was 9.2 GiB.
The test was:
- I started reducing GOMEMLIMIT from 10 GiB downwards in 100 MiB batches (the first spike on the graph)
-- 9563676416 - normal
-- 9463676416 - CPU increases to 150%
-- 9363676416 - CPU increases to 300%; I reverted GOMEMLIMIT to 9563676416
- after 1 hour we start seeing some CPU increase, after 2 hours CPU utilization is at ~1.3-1.4 core
- after 4 hours the utilization goes through the roof.
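The step-down procedure above can be sketched as follows. This is an illustration, not the actual tool we used; the 10 GiB starting point and 100 MiB batches mirror the test, and the zero pause is only so the sketch runs instantly (the real test waited between steps and observed CPU usage).

```go
package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

// stepDown lowers the runtime's soft memory limit from start by step
// bytes, n times, pausing between steps, and returns the final limit.
func stepDown(start, step int64, n int, pause time.Duration) int64 {
	limit := start
	for i := 0; i < n; i++ {
		limit -= step
		debug.SetMemoryLimit(limit)
		fmt.Printf("GOMEMLIMIT now %d bytes\n", limit)
		time.Sleep(pause)
	}
	return limit
}

func main() {
	final := stepDown(10<<30, 100<<20, 3, 0)
	fmt.Printf("stopped at %d bytes\n", final)
	// Reverting, as in the test, is just another call:
	debug.SetMemoryLimit(9563676416)
}
```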
Container CPU utilization had been fairly stable for ~2 days before the test, so I don't think any external factors affected the results. The earlier spikes are me adjusting GOMEMLIMIT manually.

Benchmarks
I ran three benchmarks during the "normal" and "high CPU" periods: (1) unlimited throughput, (2) 1000 RPS, (3) 100 RPS. Throughput is (surprisingly) only ~10% down; the latency changes are clearly visible, though. This is the same host, with GOMEMLIMIT set at all times, without restarts.
All benchmarks below show the "degraded" run followed by the "non-degraded" one.
Unlimited throughput:
Benchmark parameters:
CPUs: 32
Connections: 64
Concurrency: 1
Max requests: 0
Max duration: 1m0s
Max RPS: 0
Latencies:
0.5000: 1.505503ms
0.9000: 3.116004ms
0.9500: 3.940367ms
0.9900: 7.068518ms
0.9990: 29.626833ms
0.9995: 97.738173ms
1.0000: 901.724891ms
Elapsed time (seconds): 60.05
Total requests: 1899024
RPS: 31622.10
Benchmark parameters:
CPUs: 32
Connections: 64
Concurrency: 1
Max requests: 0
Max duration: 1m0s
Max RPS: 0
Latencies:
0.5000: 1.186803ms
0.9000: 3.468514ms
0.9500: 4.749452ms
0.9900: 9.265477ms
0.9990: 27.223769ms
0.9995: 65.71919ms
1.0000: 538.934235ms
Elapsed time (seconds): 60.01
Total requests: 2091010
RPS: 34842.09
1000 RPS:
Benchmark parameters:
CPUs: 32
Connections: 64
Concurrency: 1
Max requests: 60000
Max duration: 1m0s
Max RPS: 1000
Latencies:
0.5000: 649.981µs
0.9000: 1.512459ms
0.9500: 16.059954ms
0.9900: 612.542139ms
0.9990: 840.800883ms
0.9995: 860.559206ms
1.0000: 930.54864ms
Elapsed time (seconds): 60.00
Total requests: 52173
RPS: 869.55
Benchmark parameters:
CPUs: 32
Connections: 64
Concurrency: 1
Max requests: 60000
Max duration: 1m0s
Max RPS: 1000
Latencies:
0.5000: 422.508µs
0.9000: 694.131µs
0.9500: 1.017692ms
0.9900: 33.982008ms
0.9990: 112.48922ms
0.9995: 120.384829ms
1.0000: 132.146788ms
Elapsed time (seconds): 60.00
Total requests: 59770
RPS: 996.16
100 RPS:
Benchmark parameters:
CPUs: 32
Connections: 64
Concurrency: 1
Max requests: 6000
Max duration: 1m0s
Max RPS: 100
Latencies:
0.5000: 846.241µs
0.9000: 97.532361ms
0.9500: 348.847315ms
0.9900: 643.727969ms
0.9990: 773.982663ms
0.9995: 798.205955ms
1.0000: 827.131386ms
Elapsed time (seconds): 60.00
Total requests: 5992
RPS: 99.87
Benchmark parameters:
CPUs: 32
Connections: 64
Concurrency: 1
Max requests: 6000
Max duration: 1m0s
Max RPS: 100
Latencies:
0.5000: 620.282µs
0.9000: 940.553µs
0.9500: 1.44535ms
0.9900: 70.505245ms
0.9990: 145.183217ms
0.9995: 157.922399ms
1.0000: 187.452379ms
Elapsed time (seconds): 60.00
Total requests: 6000
RPS: 100.00
2. A leader-elected, not-so-important application.
The leader allocates a lot. The backup containers gather some data (memory "leaks" on all of them) but don't allocate as much. Containers run with 3 CPUs and 16 GiB of memory.
The test: I restarted all containers with GOMEMLIMIT=4GiB set and let them run.
We see:
- gradual CPU usage increase over time (this is the GC; there's no external work affecting the usage)
- after 24 hours, my GOMEMLIMIT override expires (we have an enforced timeout) and CPU utilization briefly drops back to the baseline
- I reset GOMEMLIMIT to 4GiB; CPU usage is now higher
- the CPU is stable for 2 hours
- after 2 hours, CPU usage starts climbing toward 3 cores (the maximum)
- the container crashes for an unrelated reason
- a backup picks up the work and is immediately throttled (it should hold about as much leaked data as the previous leader)
- memory continues to leak at ~100 MiB per 5 hours; with GOMEMLIMIT at 95% of the container memory, it would take ~8 hours before the crash.
This case is interesting because it's similar to a "failover" scenario where the application might suddenly get 50% extra traffic.
We measure the cgroup's nr_throttled periods and calculate a "healthiness" score from them.

I'll try to get gctraces over the next few days.