
Agent uses high memory even when the volume of scraped metrics is not high #55

Closed
yuecong opened this issue Apr 28, 2020 · 13 comments

Comments

@yuecong commented Apr 28, 2020

We use hashmod to shard the agents, which remote write to some Prometheus remote-write compatible backends. We found that some agents are taking much more memory than the others, even though the volume of metrics they scrape is not that high. Any ideas why some agents take so much memory?

[screenshots: per-agent memory usage]

@rfratto (Member) commented Apr 28, 2020

10 GB definitely seems like a lot. I suspect this might be caused by Go >= 1.12 using MADV_FREE for memory management, which leads to a higher reported RSS even though a portion of it is available to be reclaimed by the kernel when it needs to. You can revert to the Go 1.11 behavior by setting the GODEBUG environment variable to madvdontneed=1, which gives a more accurate RSS size.
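For example, in a Kubernetes pod spec that would look roughly like the following (the container name and image below are placeholders):

```yaml
# Sketch: force the Go 1.11 MADV_DONTNEED behavior via GODEBUG.
containers:
  - name: agent                    # placeholder container name
    image: grafana/agent:latest    # pin to the version you actually run
    env:
      - name: GODEBUG
        value: madvdontneed=1
```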

Checking the go_memstats_heap_inuse_bytes metric from the Agent will give a clearer picture here on how much memory is actively being used. Dividing that by agent_wal_storage_active_series will also help you find out the average bytes per series; we tend to see around 9KB for our 20 agents.
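For example, per agent instance:

```promql
# Average in-use heap bytes per active series, per agent
go_memstats_heap_inuse_bytes / agent_wal_storage_active_series
```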

If you look at go_memstats_heap_inuse_bytes and it's still unexpectedly high, it would be useful to share a heap profile. You can generate one by fetching /debug/pprof/heap from the Agent's HTTP server with wget.
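Something along these lines (the port is a placeholder; use whatever http_listen_port your Agent is configured with):

```sh
# Grab a heap profile from the Agent's HTTP server.
wget -O heap.pprof http://localhost:12345/debug/pprof/heap

# Inspect the top allocations locally.
go tool pprof -top heap.pprof
```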

@rfratto (Member) commented Apr 28, 2020

In my testing, most of the memory usage tends to boil down to:

a) how many series exist in a single scraped target
b) the length of label names and values
c) the number of active series

If you have one target with a significant number of series, the memory allocated for scraping that target sticks around in a pool even when the target isn't being actively scraped.

Likewise, even though strings are interned, having mostly long label names and values will negatively affect memory usage, as will a large number of active series.

@yywandb commented Apr 29, 2020

Thanks for your help @rfratto!

We tried setting the GODEBUG env variable as you suggested; however, the same issue persists (memory spikes and then the pod OOMs, as we have set a memory limit). Does this indicate that it's not a memory-management issue?

This is what the size per series looks like (I added the env variable at 16:54):
[screenshot: bytes per series over time]

Looking at the heap profile from an agent with the issue and an agent without:

remote-write-agent3 (with the memory spiking problem)
[screenshot: heap profile for remote-write-agent3]

remote-write-agent0 (no problems)
[screenshot: heap profile for remote-write-agent0]

Here are the heap dumps:
heapdump.zip

Does anything look suspicious to you in those heapdumps?

@yywandb commented Apr 29, 2020

Update:
go_memstats_heap_inuse_bytes:
[screenshot: go_memstats_heap_inuse_bytes per agent]

container_memory_rss (from cAdvisor metrics):
[screenshot: container_memory_rss per agent]

After updating the agents with the env variable at 16:54, remote-write-agent-3 spiked in memory usage and was OOMKilled at 17:07. Its memory usage seems stable now, but we are beginning to see the memory usage of remote-write-agent-2 creep up.

Do you think this indicates that there's a scrape target with large metrics (e.g. long labels) that has moved between the agents? We shard the scrape targets based on pod name, so it's possible that a pod name changed, causing the scrape target under the new pod name to be scraped by a different agent.
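For context, the sharding rule is roughly the usual hashmod pattern, something like this (a simplified sketch with illustrative values, not our exact config):

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    modulus: 4                # total number of agent shards (example)
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "2"                # this agent's shard number (example)
    action: keep
```

Since the hash is taken over the pod name, a renamed pod can land on a different shard.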

@rfratto (Member) commented Apr 29, 2020

Thanks for providing the heap dumps! Unfortunately nothing seems out of the ordinary; most of the memory usage comes from building labels during relabeling (which I assume is because hashmod is applied to all the series).

From what I can tell, I agree that this seems like there's just a giant target that's moving between agents and pushing your pod beyond its memory limits. How many active series were on remote-write-agent-3 before and after the memory usage moved to -2?

I've reached out to my team to see if we can get a second pair of eyes on this just in case I'm missing something here.

@yywandb commented Apr 29, 2020

Yeah, the # active series seems like the prime suspect here.

Number of active series (same pattern as memory usage):
[screenshot: active series per agent]

Number of samples (seems to be around the same across agents):
[screenshot: samples per agent]

It seems like there's probably a target with a high number of unique metric series that's moving between the agents.

Do you know the best way to find out which targets have the most active series? I saw this blog post about analyzing cardinality and churn for Prometheus; I'm wondering if we can do something similar for the remote-write agent.

@rfratto (Member) commented Apr 29, 2020

That information isn't exposed yet in the Agent unfortunately. Prometheus can expose information about targets and series per target in its scraper, but I'm not using it yet (I had planned to expose it in an API that I vaguely described in #6).

Short term, the easiest way for you to find out the problematic target might be to hack on the Agent to print out that metadata. Off the top of my head, I'd suggest adding a goroutine in the instance code that polls the scrape manager and logs out targets with a high number of series. You could then build a new image by running RELEASE_BUILD=1 make agent-image.
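Alternatively, if your remote-write backend is queryable, the synthetic per-target series that the Prometheus scraper appends (and which get remote-written along with everything else, assuming they aren't dropped by write_relabel_configs) can help narrow it down without code changes:

```promql
# Targets adding the most new series per scrape (a churn indicator)
topk(10, scrape_series_added)

# Targets with the most samples kept after relabeling (roughly series per scrape)
topk(10, scrape_samples_post_metric_relabeling)
```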

@yuecong (Author) commented Apr 29, 2020

Thanks @rfratto, will give it a try and see. BTW, curious why we still need to keep active queries if it is pure remote write? Is this a Prometheus limitation? :)

@rfratto (Member) commented Apr 29, 2020

Hi @yuecong, I'm not sure what you mean by needing to keep active queries; could you explain a little more?

@yuecong (Author) commented Apr 29, 2020

Say one agent scrapes M metrics across all of its targets, let's say about 15K.
Some of the scraped metrics have high cardinality and their label values change from scrape to scrape, which pushes the number of active series much higher than the number of metrics scraped in any single scrape. I think this is likely what is happening in our system.

So I am wondering whether the agent could avoid caring about the cardinality of each metric and just call the remote write API on the storage backend, so that the agent does not suffer from the high cardinality. I agree the storage backend will still suffer from it, for sure. :)

Hope this is clearer.

@cstyan commented Apr 29, 2020

@yuecong Because of the way Prometheus' WAL and remote write system are designed (which is what the Agent is built on top of), there's no way to not care about active series.

The WAL has several record types, but the ones that matter for remote write are series records and sample records. A series record is written when a new series starts (or in a checkpoint if the series is still active) and contains a reference ID plus the labels for that series. Sample records contain only the reference ID and the current timestamp/value for that series. For remote write to be reliable while reusing the WAL, the remote-write system has to cache the data it reads from series records, so series churn over short periods of time leads to increased memory usage with remote write.
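A simplified sketch of why that cache has to exist (illustrative types only, not the real tsdb/record structs):

```go
package main

import "fmt"

// Simplified shapes of the two WAL record types that matter for remote write.
type seriesRecord struct {
	Ref    uint64            // reference ID assigned when the series starts
	Labels map[string]string // full label set, written only once per series
}

type sampleRecord struct {
	Ref uint64  // refers back to a previously seen series record
	T   int64   // timestamp in milliseconds
	V   float64 // sample value
}

func main() {
	// The remote-write reader must remember ref -> labels for every active
	// series, because sample records alone carry no labels.
	seriesCache := map[uint64]map[string]string{}

	s := seriesRecord{Ref: 1, Labels: map[string]string{"__name__": "http_requests_total", "job": "api"}}
	seriesCache[s.Ref] = s.Labels // this cache is what grows with series churn

	sample := sampleRecord{Ref: 1, T: 1588118400000, V: 42}
	fmt.Println(seriesCache[sample.Ref], sample.V) // labels recovered only via the cache
}
```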

@yywandb commented May 1, 2020

Thanks for your help @cstyan. We found the target with the large number of metrics. We added an external label with each agent's shard number so that we could query which targets each agent was scraping, and narrowed it down from there.
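For reference, the external label itself is just a couple of lines of config (the label name and shard value here are illustrative; exact nesting depends on the Agent config schema):

```yaml
global:
  external_labels:
    agent_shard: "3"   # one distinct value per agent
```

From there, a count by (instance, job) query on the backend, filtered on that label, shows which targets carry the most series.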

Moving forward, we're setting up a dedicated agent for that target with higher memory requests.

Thanks again!

@yuecong (Author) commented May 1, 2020

Thanks @cstyan and @rfratto! Closing this issue.

@yuecong yuecong closed this as completed May 1, 2020
@github-actions github-actions bot added the frozen-due-to-age label (locked due to a period of inactivity; please open new issues or PRs if more discussion is needed) Feb 26, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 26, 2024