
Agent uses high memory even when the volume of scraped metrics is not high #55

Closed
yuecong opened this issue Apr 28, 2020 · 13 comments

Comments

@yuecong commented Apr 28, 2020

We use hashmod to shard the agents, which remote write to some Prometheus remote-write compatible backends. We found that some agents are taking much more memory than the others, even though the volume of metrics they scrape is not that high. Any ideas why some agents take so much memory?

[screenshots: per-agent memory usage]

@rfratto (Member) commented Apr 28, 2020

10 GB definitely seems like a lot. I suspect this might be caused by Go >= 1.12 using MADV_FREE for memory management, which leads to a higher reported RSS even though a portion of it is available to be reclaimed by the kernel when it needs to. You can revert to the Go 1.11 behavior by setting the GODEBUG environment variable to madvdontneed=1, which gives a more accurate RSS size.
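For example, in a Kubernetes pod spec that would look roughly like the following (the container name and image below are placeholders):

```yaml
# Sketch: force the Go 1.11 MADV_DONTNEED behavior via GODEBUG.
containers:
  - name: agent                    # placeholder container name
    image: grafana/agent:latest    # pin to the version you actually run
    env:
      - name: GODEBUG
        value: madvdontneed=1
```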

Checking the go_memstats_heap_inuse_bytes metric from the Agent will give a clearer picture here on how much memory is actively being used. Dividing that by agent_wal_storage_active_series will also help you find out the average bytes per series; we tend to see around 9KB for our 20 agents.
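For example, per agent instance:

```promql
# Average in-use heap bytes per active series, per agent
go_memstats_heap_inuse_bytes / agent_wal_storage_active_series
```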

If you look at go_memstats_heap_inuse_bytes and it's still unexpectedly high, it would be useful to share a heap profile. You can generate one by fetching /debug/pprof/heap from the Agent's HTTP server with wget.
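Something along these lines (the port is a placeholder; use whatever http_listen_port your Agent is configured with):

```sh
# Grab a heap profile from the Agent's HTTP server.
wget -O heap.pprof http://localhost:12345/debug/pprof/heap

# Inspect the top allocations locally.
go tool pprof -top heap.pprof
```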

@rfratto (Member) commented Apr 28, 2020

In my testing, most of the memory usage tends to boil down to:

a) how many series exist in a single scraped target
b) the length of label names and values
c) the number of active series

If you have one target with a significant number of series, the memory allocated for scraping that target sticks around in a pool even when the target isn't being actively scraped.

Likewise, even though strings are interned, having mostly long label names and values will negatively affect memory usage, as will a large number of active series.

@yywandb commented Apr 29, 2020

Thanks for your help @rfratto!

We tried setting the GODEBUG env variable as you suggested; however, the same issue persists (memory spikes and then the pod OOMs, as we have set a memory limit). Does this indicate that it's not a memory-management issue?

This is what the size per series looks like (I added the env variable at 16:54):
[screenshot: bytes per series over time]

Looking at the heap profile from an agent with the issue and an agent without:

remote-write-agent3 (with the memory spiking problem)
[screenshot: heap profile for remote-write-agent3]

remote-write-agent0 (no problems)
[screenshot: heap profile for remote-write-agent0]

Here are the heap dumps:
heapdump.zip

Does anything look suspicious to you in those heapdumps?

@yywandb commented Apr 29, 2020

Update:
go_memstats_heap_inuse_bytes:
[screenshot: go_memstats_heap_inuse_bytes per agent]

container_memory_rss (from cAdvisor metrics):
[screenshot: container_memory_rss per agent]

After updating the agents with the env variable at 16:54, remote-write-agent-3 spiked in memory usage and was OOMKilled at 17:07. Its memory usage seems stable now, but we are beginning to see the memory usage of remote-write-agent-2 creep up.

Do you think this indicates that there's a scrape target with large metrics (e.g. long labels) that has moved between the agents? We shard the scrape targets based on pod name, so it's possible that a pod name changed, causing the scrape target under the new pod name to be scraped by a different agent.
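For context, the sharding rule is roughly the usual hashmod pattern, something like this (a simplified sketch with illustrative values, not our exact config):

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    modulus: 4                # total number of agent shards (example)
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "2"                # this agent's shard number (example)
    action: keep
```

Since the hash is taken over the pod name, a renamed pod can land on a different shard.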

@rfratto (Member) commented Apr 29, 2020

Thanks for providing the heap dumps! Unfortunately nothing seems out of the ordinary; most of the memory usage comes from building labels during relabeling (which I assume is because hashmod is applied to all the series).

From what I can tell, I agree that this seems like there's just a giant target that's moving between agents and pushing your pod beyond its memory limits. How many active series were on remote-write-agent-3 before and after the memory usage moved to -2?

I've reached out to my team to see if we can get a second pair of eyes on this just in case I'm missing something here.

@yywandb commented Apr 29, 2020

Yeah, the # active series seems like the prime suspect here.

Number of active series (same pattern as memory usage):
[screenshot: active series per agent]

Number of samples (seems to be around the same across agents):
[screenshot: samples per agent]

It seems like there's probably a target with a high number of unique metric series that's moving between the agents.

Do you know the best way to find out which targets have the most active series? I saw this blog post about analyzing cardinality and churn for Prometheus; I'm wondering if we can do something similar for the remote-write agent.

@rfratto (Member) commented Apr 29, 2020

That information isn't exposed yet in the Agent unfortunately. Prometheus can expose information about targets and series per target in its scraper, but I'm not using it yet (I had planned to expose it in an API that I vaguely described in #6).

Short term, the easiest way for you to find out the problematic target might be to hack on the Agent to print out that metadata. Off the top of my head, I'd suggest adding a goroutine in the instance code that polls the scrape manager and logs out targets with a high number of series. You could then build a new image by running RELEASE_BUILD=1 make agent-image.
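Alternatively, if your remote-write backend is queryable, the synthetic per-target series that the Prometheus scraper appends (and which get remote-written along with everything else, assuming they aren't dropped by write_relabel_configs) can help narrow it down without code changes:

```promql
# Targets adding the most new series per scrape (a churn indicator)
topk(10, scrape_series_added)

# Targets with the most samples kept after relabeling (roughly series per scrape)
topk(10, scrape_samples_post_metric_relabeling)
```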

@yuecong (Author) commented Apr 29, 2020

Thanks @rfratto, will give it a try and see. BTW, curious why we still need to keep active queries if it is pure remote write? Is this a Prometheus limitation? :)

@rfratto (Member) commented Apr 29, 2020

Hi @yuecong, I'm not sure what you mean by needing to keep active queries; could you explain a little more?

@yuecong (Author) commented Apr 29, 2020

Say one agent scrapes M metrics across all of its targets, let's say about 15K.
Some of the scraped metrics have high cardinality and their label values change from scrape to scrape, which pushes the number of active series much higher than the number of metrics scraped in any single scrape. I think this is likely what is happening in our system.

So I am wondering whether the agent could avoid caring about the cardinality of each metric and just call the remote write API on the storage backend, so that the agent does not suffer from the high cardinality. I agree the storage backend will still suffer from it, for sure. :)

Hope this is clearer.

@cstyan commented Apr 29, 2020

@yuecong Because of the way Prometheus' WAL and remote write system are designed (which is what the Agent is built on top of), there's no way to not care about active series.

The WAL has several record types, but the ones that matter for remote write are series records and sample records. A series record is written when a new series starts (or in a checkpoint if the series is still active) and contains a reference ID plus the labels for that series. Sample records contain only the reference ID and the current timestamp/value for that series. For remote write to be reliable while reusing the WAL, the remote-write system has to cache the data it reads from series records, so series churn over short periods of time leads to increased memory usage with remote write.
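A simplified sketch of why that cache has to exist (illustrative types only, not the real tsdb/record structs):

```go
package main

import "fmt"

// Simplified shapes of the two WAL record types that matter for remote write.
type seriesRecord struct {
	Ref    uint64            // reference ID assigned when the series starts
	Labels map[string]string // full label set, written only once per series
}

type sampleRecord struct {
	Ref uint64  // refers back to a previously seen series record
	T   int64   // timestamp in milliseconds
	V   float64 // sample value
}

func main() {
	// The remote-write reader must remember ref -> labels for every active
	// series, because sample records alone carry no labels.
	seriesCache := map[uint64]map[string]string{}

	s := seriesRecord{Ref: 1, Labels: map[string]string{"__name__": "http_requests_total", "job": "api"}}
	seriesCache[s.Ref] = s.Labels // this cache is what grows with series churn

	sample := sampleRecord{Ref: 1, T: 1588118400000, V: 42}
	fmt.Println(seriesCache[sample.Ref], sample.V) // labels recovered only via the cache
}
```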

@yywandb commented May 1, 2020

Thanks for your help @cstyan. We found the target with the large number of metrics. We added an external label with each agent's shard number so that we could query which targets each agent was scraping, and narrowed it down from there.
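For reference, the external label itself is just a couple of lines of config (the label name and shard value here are illustrative; exact nesting depends on the Agent config schema):

```yaml
global:
  external_labels:
    agent_shard: "3"   # one distinct value per agent
```

From there, a count by (instance, job) query on the backend, filtered on that label, shows which targets carry the most series.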

Moving forward, we're setting up a dedicated agent for that target with higher memory requests.

Thanks again!

@yuecong (Author) commented May 1, 2020

Thanks @cstyan and @rfratto! Closing this issue.

@yuecong yuecong closed this as completed May 1, 2020
@github-actions github-actions bot added the frozen-due-to-age label (locked due to a period of inactivity; please open new issues or PRs if more discussion is needed) Feb 26, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 26, 2024