We use OpenTelemetry tracing in our product and log spans on almost all requests to our Community web application. One of our production sites (app name: xxxxxxx.prod, app id: xxxxxxxxx) on one of our hosts (hostname.domain.com) started seeing a large spike in memory use on 10/15, until we turned off OpenTelemetry tracing just before 10/21. We took a memory snapshot and found about 10.77 GB being used by the class com.newrelic.telemetry.TelemetryClient.
It appears to have backed up 10.77 GB of RunnableScheduledFutures holding com.newrelic.telemetry.spans.SpanBatch objects to send to New Relic in a ScheduledThreadPoolExecutor.DelayedWorkQueue.
We checked our logs and found steadily increasing log messages from 10/14/2020 23:57:52.947 -0700 to 10/20/2020 23:55:49.481 -0700 (when we turned off OpenTelemetry tracing for this community web app) that looked like this:
2020-10-15 08:57:52,947 +0200 [8-thread-1] INFO [cid=, tx=, rh=, userId=] newrelic.telemetry.TelemetryClient - Metric batch sending failed. Backing off 0 MILLISECONDS
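The accumulation described above can be reproduced in miniature: if failed span batches are scheduled for retry on a `ScheduledThreadPoolExecutor` while the endpoint stays unreachable, the internal `DelayedWorkQueue` is unbounded and simply grows. This is an illustrative sketch under that assumption, not the SDK's actual `TelemetryClient` code; the class and method names are hypothetical.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class UnboundedRetryDemo {

    // Schedules n "span batch" retries and returns how many
    // RunnableScheduledFutures are sitting in the DelayedWorkQueue afterwards.
    static int queueFailedBatches(int n) {
        // Single-threaded scheduler; nothing about it bounds its work queue.
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        for (int batch = 0; batch < n; batch++) {
            byte[] payload = new byte[1024]; // stand-in for a SpanBatch
            // A long delay simulates backoff while the endpoint is down:
            // the future (and the payload it captures) just sits in the queue.
            scheduler.schedule(() -> { payload.hashCode(); }, 1, TimeUnit.HOURS);
        }
        int queued = scheduler.getQueue().size();
        scheduler.shutdownNow();
        return queued;
    }

    public static void main(String[] args) {
        System.out.println("queued futures: " + queueFailedBatches(1000));
    }
}
```

Every queued future pins its batch payload in memory, which matches the heap snapshot: days of failed sends accumulate until the queue itself accounts for gigabytes.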
We currently have open telemetry tracing turned off for this app, but we'd like to turn it back on and make sure that it doesn't happen again (or on any of our other community apps). Can you see what might have happened to cause this issue?
I can provide any further info that's needed. thanks!
Thanks for sharing your experience with us... it sounds a lot like #189 and like an unusual case where the communication was down for an extended period of time (not just a network blip). Do you have any idea what was failing? Did anything interesting happen with the network or firewall or DNS or proxies or routes or anything else at that time? Was there any other indication in the logs about why things were failing?
For the sake of discussion, I would like to entertain the idea of allowing data over a certain size to "age out" of the buffer. That is, once more than a configured number of spans have been buffered, the oldest are replaced as new spans arrive. This approach should keep memory usage constrained, at the cost of potential data loss.
Do you have a sense of how many spans were being buffered to give the 10.77GB size? Do you have a sense of what might be a reasonable number of spans to buffer?
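The age-out idea above could be sketched as a bounded buffer that evicts its oldest entry when full, so memory has a hard ceiling regardless of how long the backend is unreachable. The names here are hypothetical and this is not proposed SDK API, just a minimal illustration of the eviction policy:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Bounded "age-out" buffer: once capacity is reached, the oldest buffered
// item is dropped to make room for the newest one.
class AgeOutBuffer<T> {
    private final Deque<T> buffer = new ArrayDeque<>();
    private final int capacity;
    private long dropped = 0;

    AgeOutBuffer(int capacity) {
        this.capacity = capacity;
    }

    synchronized void add(T batch) {
        if (buffer.size() == capacity) {
            buffer.removeFirst(); // age out the oldest buffered batch
            dropped++;            // track data loss for observability
        }
        buffer.addLast(batch);
    }

    synchronized int size() { return buffer.size(); }
    synchronized long droppedCount() { return dropped; }
}
```

Counting dropped batches (rather than discarding them silently) would also give a log or metric signal that the exporter has been failing for an extended period.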