
runtime/debug: GOMEMLIMIT prolonged high GC CPU utilization before container OOM #58106


Description

@rabbbit

What version of Go are you using (go version)?

go1.19.5, linux/amd64

Does this issue reproduce with the latest release?

yes

What did you do?

I ran a slowly leaking memory application with GOMEMLIMIT and GOGC=-1.
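
For reproduction purposes, the shape of the workload matters more than the code: a steady leak plus constant allocation churn. A minimal sketch of that setup (an illustration, not the actual service), with the limit and GC settings passed via the environment:

package main

// Run with, e.g.:
//   GOMEMLIMIT=9563676416 GOGC=-1 GODEBUG=gctrace=1 ./leaker
// GOGC=-1 disables proportional GC pacing, leaving GOMEMLIMIT as the
// only trigger for garbage collection.

import "time"

var (
	leaked [][]byte // grows forever: the "slow leak"
	sink   []byte   // keeps the churn allocation from being optimized away
)

func main() {
	// Background churn: short-lived garbage that gives the GC real work.
	go func() {
		for {
			sink = make([]byte, 1<<20)
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// The leak: retain an extra 1 MiB every second.
	for {
		leaked = append(leaked, make([]byte, 1<<20))
		time.Sleep(time.Second)
	}
}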

What did you expect to see?

I expected the application to run fine for a while. After a few days, I expected CPU utilization to increase gradually (slightly, by <25-50%). I then expected the application to be OOM-killed by the kernel.

What did you see instead?

After the initial, expected GC CPU utilization increase, CPU usage increased dramatically (to ~75% of the CPU time available to the container). The application remained in this degraded state for a long time (5-8+ hours).

Report

This is a continuation of a conversation started in #48409 (comment) - I thought I'd start a new issue.

tldr;

  • GOMEMLIMIT works great for 95% of our applications; we'd like to enable it globally
  • the remaining 5% are applications that slowly leak memory; fixing them is infeasible
  • for these applications, we see significant degradation for a couple of hours before the container is OOM-killed
  • we'd rather the applications die faster (and be restarted, as they are now) - the degradation is too "expensive".

Investigation

Hey @mknyszek, I come bearing more data. In summary:

  • testing "slow memory leaks" is slow
  • I can consistently reproduce the degradation in both staging and production
  • I should be able to get gctraces; I'm giving the staging applications more time to crash
  • the CPU utilization degradation is significant (75% of the container)

I tested two of our applications by dynamically adjusting GOMEMLIMIT.
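
(The knob here is runtime/debug.SetMemoryLimit, which is what the GOMEMLIMIT environment variable feeds at startup. A minimal sketch of adjusting the limit at runtime - the HTTP endpoint is a hypothetical stand-in for our internal tooling, not part of the runtime:)

package main

import (
	"fmt"
	"net/http"
	"runtime/debug"
	"strconv"
)

func main() {
	// Hypothetical admin endpoint: /memlimit?bytes=9563676416
	http.HandleFunc("/memlimit", func(w http.ResponseWriter, r *http.Request) {
		n, err := strconv.ParseInt(r.URL.Query().Get("bytes"), 10, 64)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		prev := debug.SetMemoryLimit(n) // returns the previous limit
		fmt.Fprintf(w, "GOMEMLIMIT: %d -> %d\n", prev, n)
	})
	http.ListenAndServe("localhost:8080", nil)
}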

1. A latency-sensitive proxy in production.

Observations:

  • I'm able to consistently trigger the degradation by setting the memory limit manually
  • surprisingly, the CPU usage increase is not gradual but sudden (1.5 cores -> 3.8 cores)
  • a 200 MiB difference in GOMEMLIMIT can trigger the degraded behavior
  • RSS grows by ~40 MiB/hour, so it would take ~10 hours for the OOM to kick in with GOMEMLIMIT at 95% of the container (~500 MiB of headroom).

The container runs with GOMAXPROCS=4 (a 4-CPU cgroup limit) and 10 GiB of memory. At the time of the tests, cgroup-reported RSS was 9.2 GiB.

The test was:

  • I reduced GOMEMLIMIT from 10 GiB downward in 100 MiB steps (the first spike on the graph):
    -- 9563676416 - normal
    -- 9463676416 - CPU increases to ~150% of baseline
    -- 9363676416 - CPU increases to ~300% of baseline
  • I reverted GOMEMLIMIT to 9563676416,
  • after 1 hour we start seeing some CPU increase; after 2 hours CPU utilization is at ~1.3-1.4 cores,
  • after 4 hours the utilization goes through the roof.

[graph: container CPU utilization during the GOMEMLIMIT test]

Container CPU utilization was fairly stable for ~2 days before the test, so I don't think any external factors affected it. The earlier spikes are me adjusting GOMEMLIMIT manually.
[graph: container CPU utilization over the preceding ~2 days]
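
(As an aside: the same effect should be visible in-process via runtime.MemStats.GCCPUFraction. A sketch of a sampling loop, in case it's useful - the graphs above come from container-level metrics, not from this:)

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		// ReadMemStats briefly stops the world; fine at this frequency.
		runtime.ReadMemStats(&m)
		fmt.Printf("gc-cpu-fraction=%.3f heap-alloc=%d MiB\n",
			m.GCCPUFraction, m.HeapAlloc>>20)
	}
}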

Benchmarks

I ran three benchmarks during the "normal" and "high CPU" periods: (1) unlimited throughput, (2) 1000 RPS, (3) 100 RPS. Throughput is (surprisingly) only down ~10%; the latency changes are clearly visible, though. This is the same host, with GOMEMLIMIT set at all times, without restarts.

Each pair below shows the "degraded" run first, followed by the "non-degraded" run.

Unlimited throughput:

Benchmark parameters:                       
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    0
  Max duration:    1m0s
  Max RPS:         0
                                                      
                                                      
Latencies:       
  0.5000: 1.505503ms
  0.9000: 3.116004ms
  0.9500: 3.940367ms   
  0.9900: 7.068518ms
  0.9990: 29.626833ms
  0.9995: 97.738173ms
  1.0000: 901.724891ms
Elapsed time (seconds):         60.05
Total requests:                 1899024
RPS:                            31622.10

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    0
  Max duration:    1m0s
  Max RPS:         0
Latencies:
  0.5000: 1.186803ms
  0.9000: 3.468514ms
  0.9500: 4.749452ms
  0.9900: 9.265477ms
  0.9990: 27.223769ms
  0.9995: 65.71919ms
  1.0000: 538.934235ms
Elapsed time (seconds):         60.01
Total requests:                 2091010
RPS:                            34842.09

1000 RPS:

Benchmark parameters:
  CPUs:            32
  Connections:     64                                 
  Concurrency:     1 
  Max requests:    60000
  Max duration:    1m0s
  Max RPS:         1000
Latencies:              
  0.5000: 649.981µs    
  0.9000: 1.512459ms   
  0.9500: 16.059954ms  
  0.9900: 612.542139ms 
  0.9990: 840.800883ms
  0.9995: 860.559206ms
  1.0000: 930.54864ms 
Elapsed time (seconds):         60.00
Total requests:                 52173
RPS:                            869.55

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    60000
  Max duration:    1m0s
  Max RPS:         1000
Latencies:
  0.5000: 422.508µs
  0.9000: 694.131µs
  0.9500: 1.017692ms
  0.9900: 33.982008ms
  0.9990: 112.48922ms
  0.9995: 120.384829ms
  1.0000: 132.146788ms
Elapsed time (seconds):         60.00
Total requests:                 59770
RPS:                            996.16

100 RPS:

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    6000
  Max duration:    1m0s
  Max RPS:         100
Latencies:
  0.5000: 846.241µs
  0.9000: 97.532361ms
  0.9500: 348.847315ms
  0.9900: 643.727969ms
  0.9990: 773.982663ms
  0.9995: 798.205955ms
  1.0000: 827.131386ms
Elapsed time (seconds):         60.00
Total requests:                 5992
RPS:                            99.87

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    6000
  Max duration:    1m0s
  Max RPS:         100

Latencies:
  0.5000: 620.282µs
  0.9000: 940.553µs
  0.9500: 1.44535ms
  0.9900: 70.505245ms
  0.9990: 145.183217ms
  0.9995: 157.922399ms
  1.0000: 187.452379ms
Elapsed time (seconds):         60.00
Total requests:                 6000
RPS:                            100.00  

2. A leader-elected, not-so-important application.

The leader allocates a lot. The backup containers gather some data (memory "leaks" on all of them) but don't allocate as much. Containers run with 3 CPUs and 16 GiB of memory.

The test: I restarted all containers, set GOMEMLIMIT=4GiB on each, and let them run.

[graph: container CPU utilization during the test]

We see:

  • gradual CPU usage increase over time (this is the GC; there's no external work affecting the usage)
  • after 24 hours my GOMEMLIMIT override expires (we have an enforced timeout), and CPU utilization briefly drops back to the baseline
  • I reset GOMEMLIMIT to 4 GiB; CPU usage is now higher
  • the CPU is stable for 2 hours
  • after 2 hours the CPU starts climbing to 3 cores (the max)
  • the container crashes for an unrelated reason
  • a backup picks up the work and is immediately throttled (it should hold roughly as much leaked data as the previous leader)
  • the memory continues to leak at ~100 MiB per 5 hours - with GOMEMLIMIT at 95% of the container memory, it would be ~8 hours before the crash.

This case is interesting because it's similar to a "failover" scenario where the application might suddenly get 50% extra traffic.

We measure cgroup nr_throttled periods and compute a "healthiness" score from them; a sketch of the computation follows the graph.
[graph: cgroup-throttling-based "healthiness" over the test period]
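
(Roughly, the score is the non-throttled fraction of cgroup enforcement periods. A sketch, assuming cgroup v2 and our own definition of "healthy":)

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// cgroup v2: cpu.stat exposes nr_periods and nr_throttled counters.
	f, err := os.Open("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	stats := map[string]int64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if fields := strings.Fields(sc.Text()); len(fields) == 2 {
			stats[fields[0]], _ = strconv.ParseInt(fields[1], 10, 64)
		}
	}

	// "Healthiness": the fraction of periods that were NOT throttled.
	if p := stats["nr_periods"]; p > 0 {
		fmt.Printf("healthy=%.3f\n", 1-float64(stats["nr_throttled"])/float64(p))
	}
}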

I'll try to get gctraces over the next few days.

Metadata

Labels: NeedsInvestigation, compiler/runtime
