I'm reporting what I think is a memory leak in the `prometheusremotewrite` exporter that is triggered when the downstream Prometheus endpoint is either slow to respond or failing entirely. This leak ultimately puts the collector into a GC loop that it never recovers from, affecting all of the work the collector is doing, not just the pipeline with the downstream problems. I've spent the last ~2 days troubleshooting this with AWS and talking through it with @Aneurysm9.
Steps to Reproduce
In my test case, we have a set of OTel Collectors called `metric-aggregators` which accept inbound OTLP metric data (generally sourced from Prometheus receivers) and write the data into two different pipelines - call them a production and a debug pipeline. The data going into these pipelines can be the same, or it can be totally unique. In this case, the data is unique... essentially I have `data=foo -> production` and `data=bar -> debug`.
Once the pipeline is humming along, introduce intentional throttling at the Prometheus endpoint on the debug pipeline. I did this by setting `resources.requests.cpu=1` and `resources.limits.cpu=1` on the Prometheus pod... and we're writing ~50-80k datapoints/sec, so that was enough to introduce throttling.
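To make the topology concrete, here's a minimal sketch of the two-pipeline setup. The endpoint URLs are illustrative placeholders, not my actual configuration (which isn't included above); the exporter and pipeline names match the graphs discussed later.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
  batch:

exporters:
  prometheusremotewrite/amp:
    # illustrative AMP endpoint
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-example/api/v1/remote_write
  prometheusremotewrite/debug:
    # illustrative internal Prometheus pod endpoint
    endpoint: http://prometheus-operated:9090/api/v1/write

service:
  pipelines:
    metrics/prometheus_prod:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/amp]
    metrics/prometheus_beta:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/debug]
```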
Expected Result
My expectation is that the debug pipeline will start failing requests (I'd expect to see `context deadline exceeded` messages), and data would ultimately be refused by the `batch` processor, which would in turn refuse data upstream. I expect the production pipeline to continue to operate just fine because there's no impact to its downstream targets.
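For context, the amount of failing data the exporter itself can hold on to is governed by the standard queue and retry settings. The values below are illustrative defaults, not my actual configuration - the point is that these knobs are supposed to bound memory on the failing path:

```yaml
exporters:
  prometheusremotewrite/debug:
    endpoint: http://prometheus-operated:9090/api/v1/write  # illustrative
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 50ms
      max_elapsed_time: 1m
    remote_write_queue:
      enabled: true
      queue_size: 10000   # max batches held while the endpoint is failing
      num_consumers: 5
```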
Actual Result
Interestingly, we see impact that starts with the debug pipeline but then spreads to all of the pipelines in the collector. After a period of time (~20-40m), the collectors are completely stuck and are in a GC loop triggered by the `memory_limiter`. Data then fails to write to the production pipeline. Additionally, when we un-clog the Prometheus debug endpoint, the collector doesn't self-recover... it is stuck in this GC loop essentially indefinitely until we restart the pods.
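For reference, the `memory_limiter` circuit breaker at play here is driven by settings like these (illustrative values; my actual config isn't shown above). When heap usage crosses the limits it starts forcing GCs and refusing incoming data, which is the behavior observed below:

```yaml
processors:
  memory_limiter:
    check_interval: 1s           # how often heap usage is checked
    limit_percentage: 80         # hard limit on memory usage
    spike_limit_percentage: 25   # headroom; the soft limit is the hard limit minus this
```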
I think the only way to explain the flow here is to start with a picture, and then talk through the timeline. In this picture, we have 5 graphs that are important to see at the same time.
- **Metric Datapoints Exported**: the graph of successfully exported metrics per exporter. The blue and yellow lines are two exporters connected to the `metrics/prometheus_prod` pipeline. They are sending "production" data that we've validated. The orange line is the `metrics/prometheus_beta` pipeline, which is sending data we haven't yet validated - but a high volume of it. The green line is a debug output; it can be ignored.
- **Percentage of Metrics Exported**: the success-rate graph for each of the exporters described above.
- **HTTP Response Times - ...amazonaws.com**: the response-time graph for the `prometheusremotewrite/amp` exporter (a local AMP endpoint in the same region/account).
- **HTTP Response Times - ...com**: the `prometheusremotewrite/centralProd` exporter, which happens to be an AMP endpoint but is cross-account and cross-region (going through a proxy).
- **HTTP Response Times - prometheus-operated**: the internal `prometheusremotewrite/debug` endpoint, a single internal Prometheus pod attached to the `prometheus_beta` pipeline that I used to introduce throttling.
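For anyone reproducing this, the success-rate graphs above can be derived from the collector's own self-telemetry. These queries are an assumption about how you'd build them (my exact dashboard queries aren't shown), using the standard `otelcol_exporter_*` metrics:

```promql
sum by (exporter) (rate(otelcol_exporter_sent_metric_points[5m]))
/
(
    sum by (exporter) (rate(otelcol_exporter_sent_metric_points[5m]))
  + sum by (exporter) (rate(otelcol_exporter_send_failed_metric_points[5m]))
)
```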
Timeline
- **9:40:00**: Everything is roughly humming along just fine...
- **9:44:00**: Throttling is introduced to the `prometheusremotewrite/debug` exporter by reducing the CPU limits on the Prometheus pod. Latency starts to creep up.
- **10:07:00**: We finally start to see the success rate for the `prometheusremotewrite/debug` exporter tank. Note that at this point the other two exporters are still operating just fine.
- **10:33:00**: We see a dip in the success rate for the `prometheusremotewrite/centralProd` and `prometheusremotewrite/amp` endpoints.
- **10:37:00**: Success rate tanks on all exporters other than the debug exporter.
- ...
- **11:12:00**: I un-cork the Prometheus pod by unlimiting its CPU and letting it restart. We see an immediate response in the latencies for the `prometheusremotewrite/debug` exporter (though it does not drop enough, and starts climbing again).
- ... recovery never happens automatically...
Logs
Obviously we have lots of logs... but here are two graphs that are interesting. First, just the high-level graph of `error` log lines:
Rather than looking at the logs individually, I started looking at them in terms of two key messages: `Forcing a GC` and `out of order` errors:
We can see that at roughly 10:03 the `Forcing a GC` message rate starts climbing, and at 10:07 and 10:36 respectively we see corresponding jumps in the `out of order` error messages from the upstream Prometheus endpoints. We never see these metrics recover, even after we un-corked the downstream `prometheusremotewrite/debug` endpoint.
Finally - PPROF...
At @Aneurysm9's suggestion, I grabbed a profile and a few heap dumps from one pod during this time frame:
We can see in the CPU profile that most of the time is spent in GC:
When we look at the heap, we can see interesting memory usage in the `prometheusremotewrite` code:
Final thoughts
In this scenario, I expect the `batch` processor to prevent data from getting into the pipeline after the initial ~8-16k datapoints are collected and fail to send. Once they fail to send, I expect that pressure to push upstream all the way to the receiver. I would then expect to see no real memory problems during this outage - just a blockage of the data going to the `prometheusremotewrite/debug` endpoint.
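The ~8-16k datapoint figure lines up with the `batch` processor's size settings. A sketch of the relevant knobs, assuming defaults (`send_batch_size: 8192`, `timeout: 200ms`); my exact configuration isn't shown above:

```yaml
processors:
  batch:
    send_batch_size: 8192      # default; matches the ~8k figure above
    send_batch_max_size: 16384 # upper bound on a single batch
    timeout: 200ms             # default
```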
Instead, I believe we see a memory leak in the `prometheusremotewrite` code. When that leak happens, it eventually trips the `memory_limiter` circuit breaker, which starts forcing GCs... but these GCs can't reclaim the memory because the data is still referenced, so the cycle just repeats over and over. This then impacts the rest of the data flowing through the collector.
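To illustrate why the forced GCs can't help once the heap is full of still-referenced data, here's a small self-contained Go sketch (not collector code, just the mechanism): a forced `runtime.GC()` reclaims nothing while the allocations remain reachable, so a limiter that keeps re-triggering just burns CPU in GC without freeing anything.

```go
package main

import (
	"fmt"
	"runtime"
)

// liveHeapAfterGC allocates nMiB mebibytes, keeps them reachable, forces a
// garbage collection, and reports HeapAlloc before and after the GC.
func liveHeapAfterGC(nMiB int) (before, after uint64) {
	retained := make([][]byte, nMiB)
	for i := range retained {
		retained[i] = make([]byte, 1<<20) // 1 MiB each, still referenced below
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	before = m.HeapAlloc

	runtime.GC() // what the memory_limiter does repeatedly when it trips
	runtime.ReadMemStats(&m)
	after = m.HeapAlloc

	runtime.KeepAlive(retained) // the "leak": everything is still reachable
	return before, after
}

func main() {
	before, after := liveHeapAfterGC(128)
	fmt.Printf("HeapAlloc before GC: %d MiB, after GC: %d MiB\n", before>>20, after>>20)
}
```

Run this and the before/after numbers stay essentially identical, which matches the behavior in the graphs: forced GC after forced GC with no drop in memory.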
Lastly, I think this memory leak also has a critical impact on the data payloads themselves sent to Prometheus: it causes duplicate or out-of-order samples to be sent that normally would not be, which further exacerbates the problem.
Oh, there is a secondary question here - why does the p99 graph pin at 10s? All of my timeouts are far higher than that. I found that pretty suspicious too.
Component(s)
exporter/prometheusremotewrite
Collector version
0.101.0
Environment information
Environment
OS: BottleRocket 1.19.4
OpenTelemetry Collector configuration
Additional context
![image](https://private-user-images.githubusercontent.com/768067/335706092-86d52d6c-ed16-42ff-9386-9cc08d44b0dc.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5MTU3NTIsIm5iZiI6MTcyMTkxNTQ1MiwicGF0aCI6Ii83NjgwNjcvMzM1NzA2MDkyLTg2ZDUyZDZjLWVkMTYtNDJmZi05Mzg2LTljYzA4ZDQ0YjBkYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNVQxMzUwNTJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mZDhiZmY0MGM4YTQ0NWRkMGYxNjM2MjEyYWJiZTFjOTg0NWQ3MTEwNmM2MjA1YzkxNTMyYjMzNDY2YzVhMTA2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.Io7eUmx-sO1pYZWgPegvFkiNJHv2LqG3Wv0q4YuLVFw)
*High-level graph of `error` log lines.*
![image](https://private-user-images.githubusercontent.com/768067/335707924-cc19eb1d-e635-46d3-9690-282a22d9bb0b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5MTU3NTIsIm5iZiI6MTcyMTkxNTQ1MiwicGF0aCI6Ii83NjgwNjcvMzM1NzA3OTI0LWNjMTllYjFkLWU2MzUtNDZkMy05NjkwLTI4MmEyMmQ5YmIwYi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNVQxMzUwNTJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02MWViZTVjOTM2OTYwZjUzMDIyNjUyNTA0ZDIyNzg4YTEyNzA0ODRjNjE1MDI0YjhiZjE5NTI0M2UzZjhlYmQ2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.7RyReHk4OfgEMEWuzFeutO-iJMJv-aCbvvVVg8OT5TM)
*Heap profile showing memory usage in the `prometheusremotewrite` code.*
Heap dumps were captured at 11:04:00, 11:11:00, 11:14:00, and 11:26:00.