
Behavior similar to memory leak in Loki Distributor #5569

Closed
putrasattvika opened this issue Mar 8, 2022 · 5 comments
Labels
component/distributor · need-investigation · stale (A stale issue or PR that will automatically be closed) · type/bug (Something is not working as expected)

Comments

@putrasattvika
Contributor

Describe the bug
After upgrading from v2.1 to v2.3 and then to v2.4, we noticed constant growth in memory usage on our Distributor pods. Their memory usage always increases over time, to the point where it nearly reaches our resource limit and the pods have to be restarted. This growth does not appear to be related to our ingestion rate. Below are graphs over a 24-hour period: (left) memory usage of the Distributor pods in MiB, (right) rate of bytes received by the Distributor pods over a 5-minute window.

[Graph: 24h memory usage of Distributor pods (MiB), steadily increasing, alongside the received-bytes rate]

To be fair, the memory we allocate to our Distributor pods is relatively low (~128 MiB), but it was more than enough on v2.1, where memory usage stayed relatively constant (see below).

We suspect (but currently have no proof) that this is caused by the LRU cache introduced in #3092 (the feature is not present in v2.1). Unfortunately, we can't really confirm this suspicion, as there are no metrics exposing the current size of that cache.
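For context on why a bounded cache can look like a leak until it fills, here is a minimal, hypothetical sketch (our own, not Loki's actual code) of an entry-capped LRU along the lines we suspect, assuming github.com/hashicorp/golang-lru and an illustrative maxLabelCacheSize constant. Memory keeps climbing as new label sets arrive until the entry cap is reached, after which usage should plateau at whatever the cached keys happen to cost in bytes.

  package main

  import (
      "fmt"

      lru "github.com/hashicorp/golang-lru"
  )

  // Illustrative entry cap, not taken from Loki's source.
  const maxLabelCacheSize = 100000

  func main() {
      // An LRU bounded by entry count, not by bytes: it grows with every
      // distinct key until the cap is hit, then evicts the oldest entry.
      cache, err := lru.New(maxLabelCacheSize)
      if err != nil {
          panic(err)
      }

      // Simulate many distinct label sets (e.g. pod names from frequent CronJobs).
      for i := 0; i < 250000; i++ {
          key := fmt.Sprintf(`{job="cron", pod="cron-%d"}`, i)
          cache.Add(key, struct{}{})
      }

      // Len never exceeds the cap, but the bytes held depend on key size,
      // so the memory plateau is not directly configurable.
      fmt.Println("entries:", cache.Len())
  }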

Here's a heap profile of one of our Distributor pods after running for 16hr: loki-distributor.22-03-08T10-32-46.pb.gz (also uploaded to flamegraph.com).

To Reproduce
Steps to reproduce the behavior:

  1. Start Loki (v2.4.2)
  2. Ingest our logs as usual
  3. Query memory usage as exported by cadvisor (container_memory_working_set_bytes); see the example query below
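For reference, step 3 amounts to a PromQL query against the cadvisor metrics along the lines of sum by (pod) (container_memory_working_set_bytes{container=~".*distributor.*"}); the exact label selector depends on how your pods and containers are named, so treat it as a sketch rather than a drop-in query.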

Expected behavior
Memory usage of the Distributor in v2.3 and later should be relatively constant, as it was in v2.1. Below is the memory usage graph (in MiB) of our v2.1 Distributors, before we upgraded.
[Graph: memory usage of v2.1 Distributor pods (MiB), relatively flat]

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: our own Kubernetes YAML files, based on jsonnet
@dannykopping
Contributor

Thanks for reporting this @putrasattvika

Please provide the config you are using as well. This will help us investigate this apparent bug.

@putrasattvika
Contributor Author

This is the config we're currently using:

chunk_store_config:
  chunk_cache_config:
    enable_fifocache: false
    memcached:
      batch_size: 100
    memcached_client:
      host: loki.memcached.<REDACTED>

  max_look_back_period: 8904h

  write_dedupe_cache_config:
    enable_fifocache: false
    memcached:
      batch_size: 100
    memcached_client:
      host: loki.memcached.<REDACTED>

ingester:
  chunk_idle_period: 30m
  chunk_retain_period: 10m

  lifecycler:
    interface_names:
    - eth0
    join_after: 1m
    ring:
      kvstore:
        etcd:
          endpoints: <REDACTED>
        prefix: /loki/collectors/
        store: etcd
      replication_factor: 3

  max_transfer_retries: 0

  wal:
    dir: /loki/wal
    enabled: true
    replay_memory_ceiling: 512MB

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
  remote_timeout: 1s

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

schema_config:
  configs:
  - chunks:
      period: 168h
      prefix: loki_prod_chunk_
    from: '2019-11-25'
    index:
      period: 168h
      prefix: loki_prod_index_
    object_store: gcs
    schema: v10
    store: aws

  - from: '2020-07-22'
    index:
      period: 168h
      prefix: loki_prod_index_
    object_store: gcs
    schema: v11
    store: aws

server:
  graceful_shutdown_timeout: 5s
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_server_idle_timeout: 120s

storage_config:
  aws:
    dynamodb:
      dynamodb_url: dynamodb://<REDACTED>

  gcs:
    bucket_name: <REDACTED>

  index_queries_cache_config:
    enable_fifocache: false
    memcached:
      batch_size: 100
    memcached_client:
      host: loki.memcached.<REDACTED>

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 1
    inactive_write_throughput: 1
    provisioned_read_throughput: 1
    provisioned_write_throughput: 1

  index_tables_provisioning:
    enable_inactive_throughput_on_demand_mode: true
    enable_ondemand_throughput_mode: true

  retention_deletes_enabled: true
  retention_period: 8904h

@liguozhong
Contributor

maxLabelCacheSize = 100000

The distributor will stabilize at around 200 MB and will not continue to rise.

[Graph: Distributor memory usage stabilizing around 200 MB]
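A rough back-of-the-envelope check (our own estimate, not taken from Loki's source): since the cache is capped by entry count rather than by bytes, at 100,000 entries and on the order of 0.5–2 KB per entry (the serialized label-set key, the cached value, plus map and LRU bookkeeping), a full cache works out to roughly 50–200 MB on top of the distributor's baseline, which is in the same ballpark as the ~200 MB plateau shown above.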

@putrasattvika
Contributor Author

The distributor will stabilize at around 200 MB and will not continue to rise.

We let the Distributors run for a couple of days without restarting them, and in our case peak memory usage was ~290 MiB. We have multiple Kubernetes CronJobs that run at a relatively high frequency (e.g. every 5 minutes), and we use the pod's name as a log label, which might explain why our Distributors use more memory than liguozhong's.
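To put a rough number on that: a CronJob firing every 5 minutes produces 24 × 60 / 5 = 288 distinct pod names per day, and since the pod name is a label, each of those is a label set the distributor has never cached before, so the cache keeps taking on new entries instead of re-hitting existing ones.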

[Graph: Distributor memory usage over several days, peaking at ~290 MiB]

The pods OOM-ed before I could take a heap dump of them, so I can't really confirm whether the increase in memory usage is actually caused by the LRU cache. Still, I think it would be better if the cache size were user-configurable and the usual cache metrics were exported (cache size, cache hits & misses, cache evictions, etc.).
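To illustrate the kind of instrumentation meant here, below is a minimal sketch (our own, hypothetical: the loki_distributor_label_cache_* metric names are made up, and it again assumes github.com/hashicorp/golang-lru rather than Loki's actual cache) of wrapping an LRU so that its size, hits, and misses are exported, with the cache size taken as a parameter so it could come from user configuration.

  package labelcache

  import (
      lru "github.com/hashicorp/golang-lru"
      "github.com/prometheus/client_golang/prometheus"
  )

  // Hypothetical metrics; the names are illustrative, not real Loki metrics.
  var (
      cacheEntries = prometheus.NewGauge(prometheus.GaugeOpts{
          Name: "loki_distributor_label_cache_entries",
          Help: "Current number of entries in the label cache.",
      })
      cacheHits = prometheus.NewCounter(prometheus.CounterOpts{
          Name: "loki_distributor_label_cache_hits_total",
          Help: "Total label cache hits.",
      })
      cacheMisses = prometheus.NewCounter(prometheus.CounterOpts{
          Name: "loki_distributor_label_cache_misses_total",
          Help: "Total label cache misses.",
      })
  )

  // InstrumentedLabelCache wraps an entry-capped LRU with the metrics above.
  type InstrumentedLabelCache struct {
      cache *lru.Cache
  }

  // New takes the cache size as a parameter so it could be driven by config.
  func New(size int) (*InstrumentedLabelCache, error) {
      prometheus.MustRegister(cacheEntries, cacheHits, cacheMisses)
      c, err := lru.New(size)
      if err != nil {
          return nil, err
      }
      return &InstrumentedLabelCache{cache: c}, nil
  }

  // Get records a hit or miss for every lookup.
  func (c *InstrumentedLabelCache) Get(key string) (interface{}, bool) {
      v, ok := c.cache.Get(key)
      if ok {
          cacheHits.Inc()
      } else {
          cacheMisses.Inc()
      }
      return v, ok
  }

  // Add stores an entry and updates the size gauge.
  func (c *InstrumentedLabelCache) Add(key string, value interface{}) {
      c.cache.Add(key, value)
      cacheEntries.Set(float64(c.cache.Len()))
  }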

@stale

stale bot commented Apr 16, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale (A stale issue or PR that will automatically be closed) label Apr 16, 2022
stale bot closed this as completed May 1, 2022
chaudum added the type/bug (Something is not working as expected) label Jun 14, 2023