
Memory leak when using the /metrics endpoint (PrometheusExporter) #95

Open
krynju opened this issue Nov 8, 2023 · 4 comments

Comments

@krynju (Collaborator) commented Nov 8, 2023

EDIT: This is observable on 1.9, but not on 1.10.

I've identified a memory leak in a long-running service, coming from OpenTelemetry.

I'm attaching an MWE split off from the main service, which reproduces the issue identically.
The attached example is fairly extreme and can get you to ~2 GB of memory usage in about 10 minutes.

The leak depends on measurement activity and on capturing measurements through the /metrics endpoint.

  1. metricsspam emulates a running service, which generates some metric activity
  2. to emulate metrics capture, we use ab to call the /metrics endpoint extensively

Version info:
julia 1.9.3, linux
OpenTelemetry versions as in the attached Manifest.toml,
or more simply, this repo at commit 4975ecd (a Pkg sketch for adding the subdirectories follows the list below):

subdir="src/api",                      # 0.3.0
subdir="src/sdk",                      # 0.2.1
subdir="src/proto",                    # 0.13.1
subdir="src/exporter/otlp/proto/grpc", # 0.2.1
subdir="src/exporter/prometheus",      # 0.1.4

Reproducer: memleak.zip

Archive contents:

  1. run.jl - reproducer code
  2. make.jl - code to generate project/manifest with exact package revisions used
  3. Project.toml/Manifest.toml

Steps:

  1. extract the zip
  2. run the reproducer using julia --project=. -e "include(\"run.jl\");init();"
  3. run the metrics capture emulation using ab -c 500 -n 1000000 http://localhost:9967/
  4. observe endlessly rising memory usage in the process-monitoring tool of your choice (or with the in-process logging sketch below)
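As an alternative to an external monitoring tool, here is a minimal in-process logging sketch (my addition, not part of the attached reproducer) that can be started alongside the reproducer to watch memory grow; note that Sys.maxrss() reports the peak resident set size, so it only ever increases:

using Printf

# Periodically log the peak resident set size and the bytes the GC considers live.
# Run `errormonitor(@async log_memory())` in the same session as the reproducer.
function log_memory(; period = 10)
    while true
        @printf "maxrss = %s, gc_live = %s\n" Base.format_bytes(Sys.maxrss()) Base.format_bytes(Base.gc_live_bytes())
        sleep(period)
    end
end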

Example runtime (~4 GB in 25 min):

[screenshot: process memory usage climbing to ~4 GB over ~25 minutes]

@lamdor commented Jan 12, 2024

@krynju I think this may be an issue of the incremental GC not being able to keep up. I was trying to recreate this to see if I could fix it, and what I ended up trying was having a background task do a full GC every few seconds. With that, I don't see the memory continue to grow, even after an hour. Before the GC.gc() change, the memory would grow by about 150 MB/minute.

Here's my diff:

4c4
< using OpenTelemetryExporterPrometheus
---
> using OpenTelemetryExporterPrometheus
7a8,9
> GC.enable_logging(true)
>
172a175,182
> function gc_full_occasionally()
>     while true
>         sleep(2)
>         GC.gc()
>     end
>
> end
>
198c208,217
<
---
>         @async begin
>             try
>                 gc_full_occasionally()
>             catch ex
>                 @error(
>                     "exception initializing",
>                     exception = (ex, catch_backtrace())
>                 )
>             end
>         end
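For readability, the same workaround as a self-contained sketch rather than a diff (it swaps the diff's try/catch around the task for errormonitor; hooking start_gc_task() into run.jl's init() is assumed):

# Log full collections so they show up in the output.
GC.enable_logging(true)

# Force a full collection every `period` seconds in a background task.
function gc_full_occasionally(period = 2)
    while true
        sleep(period)
        GC.gc()   # GC.gc() defaults to a full collection
    end
end

# errormonitor surfaces task failures instead of swallowing them silently.
start_gc_task() = errormonitor(@async gc_full_occasionally())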

@krynju (Collaborator, Author) commented Jan 16, 2024

Notes from today:

  1. GC in a loop at a 2-second period: ~540 MB after 2 h
  2. GC called from the interactive REPL occasionally, and extensively at the end: ~7.5 GB after 2 h
  3. GC in a loop at a 60-second period: ~645 MB after ~30 min

@krynju (Collaborator, Author) commented Feb 13, 2024

Some experiments with heap snapshots.

After the 4th snapshot I ran GC manually, and:

  1. the heap actually had elements removed correctly (comparing the 4th and 5th snapshots shows a clear delta)
  2. the process memory footprint stayed the same

I'm starting to think this is some hidden Julia issue. I would expect the process memory footprint to go down along with the cleanup of the heap.

[screenshot: heap snapshot comparison showing a clear delta between the 4th and 5th snapshots]

Process info: the heap dump says the heap is only ~150 MB, but the process info says it's ~1.5 GB.

[screenshot: process memory info showing ~1.5 GB of resident memory]
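For anyone repeating this, a heap snapshot can be captured from the running session roughly like this (a sketch, Julia 1.9+; the resulting .heapsnapshot files open in Chromium DevTools' Memory tab for comparison):

using Profile

# Capture a snapshot before and after a manual full collection,
# then compare the two files in a DevTools Memory view.
Profile.take_heap_snapshot("before_gc.heapsnapshot")
GC.gc()
Profile.take_heap_snapshot("after_gc.heapsnapshot")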

@krynju (Collaborator, Author) commented Feb 13, 2024

I ran the reproducer on 1.10, on a custom branch of OpenTelemetry.jl that has proper 1.10 support.
In two runs I could not reproduce the leak: memory usage stays stable under load, and GC seems to clean up most of it after I run it manually.

This is potentially good news, but I'm not 100% sure of it yet.
Will do more testing.
