
Memory leak when using the /metrics endpoint (PrometheusExporter) #95

Open
krynju opened this issue Nov 8, 2023 · 4 comments

Comments

@krynju (Collaborator) commented Nov 8, 2023

EDIT: This is observable on 1.9, but not on 1.10.

I've identified a memory leak in a long-running service, coming from OpenTelemetry.

I'm attaching an MWE split off from the main service, which reproduces the issue identically.
The attached example is fairly extreme and can get you to ~2 GB of memory usage in about 10 minutes.

The leak depends on measurement activity and on capturing measurements through the /metrics endpoint.

  1. metricsspam emulates a running service, which generates some metric activity
  2. to emulate metrics capture, we use ab to call the /metrics endpoint extensively

Version info:
julia 1.9.3, linux
OpenTelemetry versions as in the attached Manifest.toml,
or more simply, this repo at commit 4975ecd (a Pkg sketch for adding the subdirectories follows the list below):

subdir="src/api",                      # 0.3.0
subdir="src/sdk",                      # 0.2.1
subdir="src/proto",                    # 0.13.1
subdir="src/exporter/otlp/proto/grpc", # 0.2.1
subdir="src/exporter/prometheus",      # 0.1.4

Reproducer: memleak.zip

Archive contents:

  1. run.jl - reproducer code
  2. make.jl - code to generate project/manifest with exact package revisions used
  3. Project.toml/Manifest.toml

Steps:

  1. extract the zip
  2. run the reproducer using julia --project=. -e "include(\"run.jl\");init();"
  3. run the metrics capture emulation using ab -c 500 -n 1000000 http://localhost:9967/
  4. observe endlessly rising memory usage in the process-monitoring tool of your choice (or with the in-process logging sketch below)
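As an alternative to an external monitoring tool, here is a minimal in-process logging sketch (my addition, not part of the attached reproducer) that can be started alongside the reproducer to watch memory grow; note that Sys.maxrss() reports the peak resident set size, so it only ever increases:

using Printf

# Periodically log the peak resident set size and the bytes the GC considers live.
# Run `errormonitor(@async log_memory())` in the same session as the reproducer.
function log_memory(; period = 10)
    while true
        @printf "maxrss = %s, gc_live = %s\n" Base.format_bytes(Sys.maxrss()) Base.format_bytes(Base.gc_live_bytes())
        sleep(period)
    end
end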

Example runtime (~4 GB in 25 min):

[screenshot: process memory usage climbing to ~4 GB over ~25 minutes]

@lamdor commented Jan 12, 2024

@krynju I think this may be an issue of the incremental GC not being able to keep up. I was trying to recreate this to see if I could fix it, and what I ended up trying was having a background task do a full GC every few seconds. With that, I don't see the memory continue to grow, even after an hour. Before the GC.gc() change, the memory would grow by about 150 MB/minute.

Here's my diff:

4c4
< using OpenTelemetryExporterPrometheus
---
> using OpenTelemetryExporterPrometheus
7a8,9
> GC.enable_logging(true)
>
172a175,182
> function gc_full_occasionally()
>     while true
>         sleep(2)
>         GC.gc()
>     end
>
> end
>
198c208,217
<
---
>         @async begin
>             try
>                 gc_full_occasionally()
>             catch ex
>                 @error(
>                     "exception initializing",
>                     exception = (ex, catch_backtrace())
>                 )
>             end
>         end
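For readability, the same workaround as a self-contained sketch rather than a diff (it swaps the diff's try/catch around the task for errormonitor; hooking start_gc_task() into run.jl's init() is assumed):

# Log full collections so they show up in the output.
GC.enable_logging(true)

# Force a full collection every `period` seconds in a background task.
function gc_full_occasionally(period = 2)
    while true
        sleep(period)
        GC.gc()   # GC.gc() defaults to a full collection
    end
end

# errormonitor surfaces task failures instead of swallowing them silently.
start_gc_task() = errormonitor(@async gc_full_occasionally())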

@krynju (Collaborator, Author) commented Jan 16, 2024

Notes from today:

  1. GC in a loop at a 2-second period: ~540 MB after 2 h
  2. GC called from the interactive REPL occasionally, and extensively at the end: ~7.5 GB after 2 h
  3. GC in a loop at a 60-second period: ~645 MB after ~30 min

@krynju (Collaborator, Author) commented Feb 13, 2024

Some experiments with heap snapshots.

After the 4th snapshot I ran GC manually, and:

  1. the heap actually had elements removed correctly (comparing the 4th and 5th snapshots shows a clear delta)
  2. the process memory footprint stayed the same

I'm starting to think this is some hidden Julia issue. I would expect the process memory footprint to go down along with the cleanup of the heap.

[screenshot: heap snapshot comparison showing a clear delta between the 4th and 5th snapshots]

Process info: the heap dump says the heap is only ~150 MB, but the process info says it's ~1.5 GB.

[screenshot: process memory info showing ~1.5 GB of resident memory]
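For anyone repeating this, a heap snapshot can be captured from the running session roughly like this (a sketch, Julia 1.9+; the resulting .heapsnapshot files open in Chromium DevTools' Memory tab for comparison):

using Profile

# Capture a snapshot before and after a manual full collection,
# then compare the two files in a DevTools Memory view.
Profile.take_heap_snapshot("before_gc.heapsnapshot")
GC.gc()
Profile.take_heap_snapshot("after_gc.heapsnapshot")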

@krynju (Collaborator, Author) commented Feb 13, 2024

I ran the reproducer on 1.10, on a custom branch of OpenTelemetry.jl that has proper 1.10 support.
In two runs I could not reproduce the leak: memory usage stays stable under load, and GC seems to clean up most of it after I run it manually.

This is potentially good news, but I'm not 100% sure of it yet.
Will do more testing.
