Behavior similar to memory leak in Loki Distributor #5569
Comments
Thanks for reporting this, @putrasattvika. Please provide the config you are using as well; it will help us investigate this apparent bug.
This is the config we're currently using:
We let the Distributor run for a couple of days without restarting it, and in our case the peak memory usage was ~290MiB. We have multiple Kubernetes CronJobs with a relatively high frequency (e.g. every 5 minutes), and we use the pod's name as a log label, which might explain why our Distributors use more memory than liguozhong's. The pods OOM-ed before I could take a heap dump, so I can't confirm whether the increase in memory usage is actually caused by the LRU cache. Still, I think it would be better if the cache size were user-configurable and the usual cache-related metrics were exported (cache size, cache hits & misses, cache evictions, etc.).
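To illustrate the kind of instrumentation being asked for, here is a minimal sketch (not Loki's actual implementation) of an LRU cache whose size comes from configuration and which exports size, hit, miss, and eviction metrics. The metric names, the flag name, and the cache size below are made up for the example:

```go
package main

import (
	"net/http"

	lru "github.com/hashicorp/golang-lru"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// instrumentedCache wraps an LRU cache of configurable size and counts
// hits, misses, and evictions, while exposing the current entry count.
type instrumentedCache struct {
	cache  *lru.Cache
	hits   prometheus.Counter
	misses prometheus.Counter
}

func newInstrumentedCache(size int, reg prometheus.Registerer) (*instrumentedCache, error) {
	evictions := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "distributor_label_cache_evictions_total", // hypothetical name
		Help: "Total number of entries evicted from the label cache.",
	})

	cache, err := lru.NewWithEvict(size, func(_, _ interface{}) { evictions.Inc() })
	if err != nil {
		return nil, err
	}

	c := &instrumentedCache{
		cache: cache,
		hits: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "distributor_label_cache_hits_total", // hypothetical name
			Help: "Total number of label cache hits.",
		}),
		misses: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "distributor_label_cache_misses_total", // hypothetical name
			Help: "Total number of label cache misses.",
		}),
	}

	// Report the current number of entries so growth is visible on a dashboard.
	entries := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "distributor_label_cache_entries", // hypothetical name
		Help: "Current number of entries in the label cache.",
	}, func() float64 { return float64(cache.Len()) })

	reg.MustRegister(c.hits, c.misses, evictions, entries)
	return c, nil
}

func (c *instrumentedCache) Get(key string) (interface{}, bool) {
	v, ok := c.cache.Get(key)
	if ok {
		c.hits.Inc()
	} else {
		c.misses.Inc()
	}
	return v, ok
}

func (c *instrumentedCache) Add(key string, value interface{}) {
	c.cache.Add(key, value)
}

func main() {
	// The size would come from a (hypothetical) flag such as
	// -distributor.label-cache-size instead of being hard-coded.
	cache, err := newInstrumentedCache(10000, prometheus.DefaultRegisterer)
	if err != nil {
		panic(err)
	}
	cache.Add(`{job="cronjob", pod="my-job-27459123-abcde"}`, struct{}{})

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```

With something like this in place, a dashboard could show whether the cache settles at its configured maximum or keeps growing, which would help confirm or rule out the suspicion described in the issue.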
Describe the bug
After upgrading to v2.3 and then to v2.4 (from v2.1), we noticed constant growth in memory usage on our Distributor pods. Their memory usage always increases over time until it almost reaches our resource limit and the pods have to be restarted. This growth does not seem to be related to our ingestion rate. Below are graphs over a 24-hour period: (left) memory usage of the Distributor pods in MiB, (right) rate of bytes received by the Distributor pods over 5-minute windows.
To be fair, the memory we allocated to our Distributor pods is relatively low (~128MiB), but this was more than enough for v2.1 where the memory usage was relatively constant (see below).
We suspect (but currently have no proof) that this is caused by an LRU cache implemented in #3092 (this feature is not present in v2.1). Unfortunately, we can't really confirm this suspicion as there are no metrics that expose the current size of the cache.
Here's a heap profile of one of our Distributor pods after running for 16hr: loki-distributor.22-03-08T10-32-46.pb.gz (also uploaded to flamegraph.com).
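For anyone else whose pods are OOM-killed before they can grab a dump, here is a rough sketch of fetching a heap profile over HTTP, assuming the distributor exposes Go's standard /debug/pprof endpoints on its HTTP listen port (3100 by default); the host name below is a placeholder:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Placeholder address; point this at the distributor's HTTP port.
	url := "http://loki-distributor:3100/debug/pprof/heap"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Name the file with a timestamp, e.g. loki-distributor.22-03-08T10-32-46.pb.gz.
	name := fmt.Sprintf("loki-distributor.%s.pb.gz", time.Now().Format("06-01-02T15-04-05"))
	f, err := os.Create(name)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		panic(err)
	}
	fmt.Println("wrote", name)
}
```

Running this periodically shortly before the pod hits its memory limit makes it easier to capture a profile in time; the saved file can be opened directly with go tool pprof.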
To Reproduce
Steps to reproduce the behavior:
Run the Distributor on v2.3 or v2.4 for an extended period and watch its container_memory_working_set_bytes metric grow over time.
Expected behavior
Memory usage of the Distributor in v2.3+ should be relatively constant, as it was in v2.1. Below is the memory usage graph (in MiB) of our v2.1 Distributors before we upgraded.
Environment: