Agent uses a lot of memory even though the volume of scraped metrics is not high #55
Comments
10 GB definitely seems like a lot. I suspect this might be caused by Go >=1.12's behavior of using MADV_FREE when returning memory to the OS, which leaves freed pages counted in RSS until the kernel reclaims them. Checking the Go runtime's heap-in-use metrics rather than the container's RSS should show how much memory is actually live. If you look at the memory usage after setting GODEBUG=madvdontneed=1, the reported numbers should better reflect real usage.
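For reference, here is a hedged sketch of how that environment variable could be set on the agent's container; the deployment, container, and image names are placeholders, not anything from this thread:

```yaml
# Kubernetes container spec fragment (names are placeholders).
# GODEBUG=madvdontneed=1 makes the Go runtime release memory with MADV_DONTNEED,
# so the pod's RSS drops as soon as the heap shrinks instead of waiting for the
# kernel to reclaim MADV_FREE'd pages.
containers:
  - name: grafana-agent
    image: grafana/agent:latest
    env:
      - name: GODEBUG
        value: madvdontneed=1
```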
In my testing, most of the memory usage tends to boil down to how many series exist in a single scraped target. If you have one target with a significant number of series, the memory allocated for scraping that target sticks around in a pool even when the target isn't being actively scraped. Likewise, even though strings are being interned, having mostly long label names and values will negatively influence the memory usage, as will a significant number of series.
Thanks for your help @rfratto! We tried to set the GODEBUG env variable as you suggested, but the same issue persists: memory spikes and then the pod OOMs, since we have set a memory limit. Does this indicate that it's not a memory-management issue?

This is what the size per series looks like (I added the env variable at 16:54): [graph]

Looking at the heap profiles from an agent with the issue and an agent without:

- remote-write-agent3 (with the memory spiking problem): [profile]
- remote-write-agent0 (no problems): [profile]

Here are the heap dumps: [attachments]

Does anything look suspicious to you in those heap dumps?
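For readers who want to capture a similar profile, here is a minimal sketch, assuming the agent's HTTP server exposes Go's standard /debug/pprof endpoints; the pod name and ports are placeholders:

```sh
# Forward the agent's HTTP port locally (pod name and ports are placeholders).
kubectl port-forward pod/remote-write-agent3 8080:80 &

# Pull and inspect an in-use heap profile from the agent.
go tool pprof -inuse_space http://localhost:8080/debug/pprof/heap
# In the interactive prompt, `top` and `tree` show where heap memory is held.
```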
Thanks for providing the heap dumps! Unfortunately nothing seems out of the ordinary; most of the memory usage is coming from building labels during the relabeling process (which I assume is because of the hashmod sharding rules). From what I can tell, I agree that this looks like a giant target that's moving between agents and pushing your pod beyond its memory limits. How many active series were on remote-write-agent3 when its memory spiked? I've reached out to my team to see if we can get a second pair of eyes on this, just in case I'm missing something here.
Yeah, the number of active series seems like the prime suspect here.

Number of active series (same pattern as the memory usage): [graph]

Number of samples (roughly the same across agents): [graph]

It seems like there's probably a target with a high number of unique metric series that's moving between the agents. Do you know the best way to analyze which targets have the most active series? I saw a blog post about analyzing cardinality and churn for Prometheus; I'm wondering if we can do something similar for the remote write agent.
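One hedged way to approximate this is to run cardinality queries against the Prometheus-compatible backend the agents remote write to, since the agent itself has no query API; the label names and the instance value below are assumptions, not anything confirmed in this thread:

```promql
# Top 10 scrape targets by number of active series, as seen by the backend.
topk(10, count by (job, instance) ({__name__=~".+"}))

# Per-metric cardinality for a suspect target (instance value is a placeholder).
topk(10, count by (__name__) ({instance="suspect-target:9100"}))
```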
That information isn't exposed in the Agent yet, unfortunately. Prometheus can expose information about targets and series per target in its scraper, but I'm not using it yet (I had planned to expose it in an API that I vaguely described in #6). Short term, the easiest way to find the problematic target might be to hack on the Agent to print out that metadata. Off the top of my head, I'd suggest adding a goroutine in the instance code that polls the scrape manager and logs targets with a high number of series (a rough sketch follows). You could then build a new image from your modified code and run it in place of the released one.
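A minimal sketch of that idea, assuming access to the instance's *scrape.Manager; the seriesCountFor helper is a placeholder, since the scrape manager doesn't expose per-target series counts and you would have to wire it up from the agent's own bookkeeping:

```go
package instancedebug

import (
	"log"
	"time"

	"github.com/prometheus/prometheus/scrape"
)

// seriesCountFor is a hypothetical helper; Prometheus' scrape.Target does not
// expose a series count directly, so this would need to be backed by the
// agent's own per-target accounting.
func seriesCountFor(t *scrape.Target) int {
	return 0 // TODO: wire up to real per-target series tracking
}

// logBigTargets periodically walks the active targets and logs any whose
// series count exceeds the threshold.
func logBigTargets(mgr *scrape.Manager, threshold int, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for job, targets := range mgr.TargetsActive() {
				for _, t := range targets {
					if n := seriesCountFor(t); n > threshold {
						log.Printf("high-series target: job=%s target=%s series=%d", job, t.URL(), n)
					}
				}
			}
		}
	}
}
```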
Thanks @rfratto, will give it a try. By the way, I'm curious why we still need to keep active queries if it is pure remote write. Is this a Prometheus limitation? :)
Hi @yuecong, I'm not sure what you mean by needing to keep active queries; could you explain a little more?
Say I have M metrics scraped across all the targets for one agent, roughly 15K. I am wondering whether the agent could avoid caring about the cardinality of each metric and just call the remote write API on the storage backend, so that the agent itself doesn't suffer from high cardinality. I agree the storage backend will still suffer from it, for sure. :) Hope this is clearer.
@yuecong Because of the way Prometheus' WAL and remote write system are designed (the agent is built on top of both), there's no way to not care about active series. The WAL has several record types, but the ones that matter for remote write are series records and sample records. Series records are written when a new series starts (or in checkpoints if a series is still active) and contain a reference ID and the labels for that series. Sample records contain only the reference ID and the current timestamp/value for that series. For remote write to be more reliable by reusing the WAL, the remote write system has to cache the data it reads from series records. Series churn over short periods of time will therefore lead to increased memory usage with remote write.
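An illustrative sketch of those two record shapes; the field names are simplified stand-ins for Prometheus' record types, not the exact upstream definitions:

```go
package walsketch

// A series record is written once when a series first appears (and again in
// checkpoints while it stays active). Remote write must cache its labels,
// keyed by Ref, so later samples can be resolved back to a full label set --
// this cache is what grows with the number of active series.
type seriesRecord struct {
	Ref    uint64            // reference ID for the series
	Labels map[string]string // full label set, e.g. {__name__: "up", job: "node"}
}

// A sample record carries only the reference ID plus timestamp and value,
// which is why it is cheap on its own but depends on the cached series record.
type sampleRecord struct {
	Ref uint64  // points back to a previously written series record
	T   int64   // timestamp in milliseconds
	V   float64 // sample value
}
```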
Thanks for your help @cstyan. We found the target with the large number of series. We added an external label with the shard number of each agent so that we could query which targets were scraped by each agent and narrow it down from there. Moving forward, we're setting up a dedicated agent for that target with higher memory requests. Thanks again!
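A hedged sketch of that external-label approach, shown in plain Prometheus-style configuration; the exact nesting in the agent's config file may differ, and the label name and shard value are placeholders:

```yaml
# Per-shard config fragment (label name and value are placeholders).
# Every series this agent remote-writes carries the shard label, so the backend
# can be queried per shard to see which agent scraped which targets.
global:
  external_labels:
    agent_shard: "3"
```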
We use hashmod to shard the agents that remote write to some Prometheus remote-write-compatible backends. We found that some agents take much more memory than the others even though they aren't scraping that many metrics. Any ideas why some agents can take so much memory?
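A minimal sketch of the hashmod sharding pattern described here, using standard Prometheus relabel rules; the job name, modulus, shard index, and source label are placeholders rather than the poster's actual values:

```yaml
scrape_configs:
  - job_name: sharded-job          # placeholder job name
    relabel_configs:
      # Hash each target's address into one of N buckets...
      - source_labels: [__address__]
        modulus: 4                 # total number of agent shards (placeholder)
        target_label: __tmp_hash
        action: hashmod
      # ...and keep only the targets that belong to this agent's shard.
      - source_labels: [__tmp_hash]
        regex: "3"                 # this agent's shard index (placeholder)
        action: keep
```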