Memory leak in OT Tracing 0.4.1 / 0.6.1 #249

Closed
mrjaekim opened this issue Nov 2, 2020 · 2 comments · Fixed by #251

Comments


mrjaekim commented Nov 2, 2020

We use OpenTelemetry tracing in our product and log spans on almost all requests to our Community web application. One of our production sites (app name: xxxxxxx.prod, app id: xxxxxxxxx) on one of our hosts (hostname.domain.com) started seeing a large spike in memory use on 10/15, until we turned off the OpenTelemetry tracing just before 10/21. We took a memory snapshot and found about 10.77 GB was being used by the class com.newrelic.telemetry.TelemetryClient.

It seems to have backed up 10.77 GB of RunnableScheduledFutures holding com.newrelic.telemetry.spans.SpanBatch objects waiting to be sent to New Relic in a ScheduledThreadPoolExecutor.DelayedWorkQueue.

We checked our logs and found steadily increasing log messages from 10/14/2020 23:57:52.947 -0700 to 10/20/2020 23:55:49.481 -0700 (when we turned off OpenTelemetry tracing for this community web app) that looked like this:

2020-10-15 08:57:52,947 +0200 [8-thread-1] INFO [cid=, tx=, rh=, userId=] newrelic.telemetry.TelemetryClient - Metric batch sending failed. Backing off 0 MILLISECONDS
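To illustrate the pattern we believe we're seeing, here is a simplified sketch (not the actual TelemetryClient code; the class and method names are made up for illustration) of how re-scheduling every failed batch on a ScheduledThreadPoolExecutor lets RunnableScheduledFutures accumulate in its unbounded DelayedWorkQueue during an extended outage:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not the actual TelemetryClient internals.
public class RetryPileUpSketch {

    // Hypothetical stand-in for com.newrelic.telemetry.spans.SpanBatch.
    record Batch(byte[] payload) {}

    private final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

    void send(Batch batch, long backoffMillis) {
        boolean delivered = trySend(batch); // stays false while the endpoint is unreachable
        if (!delivered) {
            // Each failure parks another RunnableScheduledFuture (holding the batch) in the
            // executor's unbounded DelayedWorkQueue. During a long outage nothing drains,
            // so memory grows with every new batch the application produces.
            executor.schedule(() -> send(batch, Math.max(backoffMillis, 1) * 2),
                    backoffMillis, TimeUnit.MILLISECONDS);
        }
    }

    boolean trySend(Batch batch) {
        return false; // simulate an extended network outage
    }
}
```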

We currently have OpenTelemetry tracing turned off for this app, but we'd like to turn it back on and make sure this doesn't happen again, here or on any of our other community apps. Can you see what might have caused this issue?

I can provide any further info that's needed. Thanks!

breedx-nr (Contributor) commented

Hi @mrjaekim.

Thanks for sharing your experience with us... it sounds a lot like #189, and like an unusual case where communication was down for an extended period of time (not just a network blip). Do you have any idea what was failing? Did anything interesting happen with the network, firewall, DNS, proxies, routes, or anything else at that time? Was there any other indication in the logs about why things were failing?

For the sake of discussion, I would like to entertain the idea of allowing data over a certain size to "age out" of the buffer. That is, once you've buffered more than a certain number of spans, the oldest are replaced as new spans are buffered. This approach should keep memory usage constrained, at the cost of potential data loss.

Do you have a sense of how many spans were buffered to reach that 10.77 GB? And do you have a sense of what a reasonable number of spans to buffer might be?

breedx-nr (Contributor) commented

Well, this got auto-closed... but it would be great if you could try out the snapshot and let us know whether it genuinely helps your problem. Thanks!
