Memory leak in OT Tracing 0.4.1 / 0.6.1 #249

Closed
mrjaekim opened this issue Nov 2, 2020 · 2 comments · Fixed by #251

Comments


mrjaekim commented Nov 2, 2020

We use OpenTelemetry tracing in our product and log spans on almost all requests to our Community web application. One of our production sites (app name: xxxxxxx.prod, app id: xxxxxxxxx) on one of our hosts (hostname.domain.com) started seeing a large spike in memory use on 10/15, until we turned off the OpenTelemetry tracing just before 10/21. We took a memory snapshot and found about 10.77 GB was being used by the class com.newrelic.telemetry.TelemetryClient.

It seems to have backed up 10.77 GB of RunnableScheduledFutures holding com.newrelic.telemetry.spans.SpanBatch objects waiting to be sent to New Relic in a ScheduledThreadPoolExecutor.DelayedWorkQueue.

We checked our logs and found steadily increasing log messages from 10/14/2020 23:57:52.947 -0700 to 10/20/2020 23:55:49.481 -0700 (when we turned off OpenTelemetry tracing for this community web app) that looked like this:

2020-10-15 08:57:52,947 +0200 [8-thread-1] INFO [cid=, tx=, rh=, userId=] newrelic.telemetry.TelemetryClient - Metric batch sending failed. Backing off 0 MILLISECONDS
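To illustrate the pattern we believe we're seeing, here is a simplified sketch (not the actual TelemetryClient code; the class and method names are made up for illustration) of how re-scheduling every failed batch on a ScheduledThreadPoolExecutor lets RunnableScheduledFutures accumulate in its unbounded DelayedWorkQueue during an extended outage:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not the actual TelemetryClient internals.
public class RetryPileUpSketch {

    // Hypothetical stand-in for com.newrelic.telemetry.spans.SpanBatch.
    record Batch(byte[] payload) {}

    private final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

    void send(Batch batch, long backoffMillis) {
        boolean delivered = trySend(batch); // stays false while the endpoint is unreachable
        if (!delivered) {
            // Each failure parks another RunnableScheduledFuture (holding the batch) in the
            // executor's unbounded DelayedWorkQueue. During a long outage nothing drains,
            // so memory grows with every new batch the application produces.
            executor.schedule(() -> send(batch, Math.max(backoffMillis, 1) * 2),
                    backoffMillis, TimeUnit.MILLISECONDS);
        }
    }

    boolean trySend(Batch batch) {
        return false; // simulate an extended network outage
    }
}
```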

We currently have OpenTelemetry tracing turned off for this app, but we'd like to turn it back on and make sure this doesn't happen again, here or on any of our other community apps. Can you see what might have caused this issue?

I can provide any further info that's needed. Thanks!

breedx-nr (Contributor) commented

Hi @mrjaekim.

Thanks for sharing your experience with us... it sounds a lot like #189, and like an unusual case where communication was down for an extended period of time (not just a network blip). Do you have any idea what was failing? Did anything interesting happen with the network, firewall, DNS, proxies, routes, or anything else at that time? Was there any other indication in the logs about why things were failing?

For the sake of discussion, I would like to entertain the idea of allowing data over a certain size to "age out" of the buffer. That is, once you've buffered more than a certain number of spans, the oldest are replaced as new spans are buffered. This approach should keep memory usage constrained, at the cost of potential data loss.

Do you have a sense of how many spans were buffered to reach that 10.77 GB? And do you have a sense of what a reasonable number of spans to buffer might be?

breedx-nr (Contributor) commented

Well, this got auto-closed... but it would be great if you could try out the snapshot and let us know whether it genuinely helps your problem. Thanks!
