
Behavior similar to memory leak in Loki Distributor #5569

Closed
putrasattvika opened this issue Mar 8, 2022 · 5 comments
Labels
component/distributor · need-investigation · stale (A stale issue or PR that will automatically be closed) · type/bug (Something is not working as expected)

Comments

@putrasattvika
Contributor

Describe the bug
After upgrading from v2.1 to v2.3 and then to v2.4, we noticed constant growth in memory usage on our Distributor pods. Their memory usage always increases over time, to the point where it nearly reaches our resource limit and the pods have to be restarted. This growth does not appear to be related to our ingestion rate. Below are graphs over a 24-hour period: (left) memory usage of the Distributor pods in MiB, (right) rate of bytes received by the Distributor pods over a 5-minute window.

[Graph: 24h memory usage of Distributor pods (MiB), steadily increasing, alongside the received-bytes rate]

To be fair, the memory we allocate to our Distributor pods is relatively low (~128 MiB), but it was more than enough on v2.1, where memory usage stayed relatively constant (see below).

We suspect (but currently have no proof) that this is caused by the LRU cache introduced in #3092 (the feature is not present in v2.1). Unfortunately, we can't really confirm this suspicion, as there are no metrics exposing the current size of that cache.
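For context on why a bounded cache can look like a leak until it fills, here is a minimal, hypothetical sketch (our own, not Loki's actual code) of an entry-capped LRU along the lines we suspect, assuming github.com/hashicorp/golang-lru and an illustrative maxLabelCacheSize constant. Memory keeps climbing as new label sets arrive until the entry cap is reached, after which usage should plateau at whatever the cached keys happen to cost in bytes.

  package main

  import (
      "fmt"

      lru "github.com/hashicorp/golang-lru"
  )

  // Illustrative entry cap, not taken from Loki's source.
  const maxLabelCacheSize = 100000

  func main() {
      // An LRU bounded by entry count, not by bytes: it grows with every
      // distinct key until the cap is hit, then evicts the oldest entry.
      cache, err := lru.New(maxLabelCacheSize)
      if err != nil {
          panic(err)
      }

      // Simulate many distinct label sets (e.g. pod names from frequent CronJobs).
      for i := 0; i < 250000; i++ {
          key := fmt.Sprintf(`{job="cron", pod="cron-%d"}`, i)
          cache.Add(key, struct{}{})
      }

      // Len never exceeds the cap, but the bytes held depend on key size,
      // so the memory plateau is not directly configurable.
      fmt.Println("entries:", cache.Len())
  }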

Here's a heap profile of one of our Distributor pods after running for 16hr: loki-distributor.22-03-08T10-32-46.pb.gz (also uploaded to flamegraph.com).

To Reproduce
Steps to reproduce the behavior:

  1. Start Loki (v2.4.2)
  2. Ingest our logs as usual
  3. Query memory usage as exported by cadvisor (container_memory_working_set_bytes); see the example query below
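For reference, step 3 amounts to a PromQL query against the cadvisor metrics along the lines of sum by (pod) (container_memory_working_set_bytes{container=~".*distributor.*"}); the exact label selector depends on how your pods and containers are named, so treat it as a sketch rather than a drop-in query.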

Expected behavior
Memory usage of the Distributor in v2.3 and later should be relatively constant, as it was in v2.1. Below is the memory usage graph (in MiB) of our v2.1 Distributors, before we upgraded.
[Graph: memory usage of v2.1 Distributor pods (MiB), relatively flat]

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: our own Kubernetes YAML files, based on jsonnet
@dannykopping
Contributor

Thanks for reporting this @putrasattvika

Please provide the config you are using as well. This will help us investigate this apparent bug.

@putrasattvika
Contributor Author

This is the config we're currently using:

chunk_store_config:
  chunk_cache_config:
    enable_fifocache: false
    memcached:
      batch_size: 100
    memcached_client:
      host: loki.memcached.<REDACTED>

  max_look_back_period: 8904h

  write_dedupe_cache_config:
    enable_fifocache: false
    memcached:
      batch_size: 100
    memcached_client:
      host: loki.memcached.<REDACTED>

ingester:
  chunk_idle_period: 30m
  chunk_retain_period: 10m

  lifecycler:
    interface_names:
    - eth0
    join_after: 1m
    ring:
      kvstore:
        etcd:
          endpoints: <REDACTED>
        prefix: /loki/collectors/
        store: etcd
      replication_factor: 3

  max_transfer_retries: 0

  wal:
    dir: /loki/wal
    enabled: true
    replay_memory_ceiling: 512MB

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
  remote_timeout: 1s

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

schema_config:
  configs:
  - chunks:
      period: 168h
      prefix: loki_prod_chunk_
    from: '2019-11-25'
    index:
      period: 168h
      prefix: loki_prod_index_
    object_store: gcs
    schema: v10
    store: aws

  - from: '2020-07-22'
    index:
      period: 168h
      prefix: loki_prod_index_
    object_store: gcs
    schema: v11
    store: aws

server:
  graceful_shutdown_timeout: 5s
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_server_idle_timeout: 120s

storage_config:
  aws:
    dynamodb:
      dynamodb_url: dynamodb://<REDACTED>

  gcs:
    bucket_name: <REDACTED>

  index_queries_cache_config:
    enable_fifocache: false
    memcached:
      batch_size: 100
    memcached_client:
      host: loki.memcached.<REDACTED>

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 1
    inactive_write_throughput: 1
    provisioned_read_throughput: 1
    provisioned_write_throughput: 1

  index_tables_provisioning:
    enable_inactive_throughput_on_demand_mode: true
    enable_ondemand_throughput_mode: true

  retention_deletes_enabled: true
  retention_period: 8904h

@liguozhong
Contributor

maxLabelCacheSize = 100000

The distributor will stabilize at around 200 MB and will not continue to rise.

[Graph: Distributor memory usage stabilizing around 200 MB]
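A rough back-of-the-envelope check (our own estimate, not taken from Loki's source): since the cache is capped by entry count rather than by bytes, at 100,000 entries and on the order of 0.5–2 KB per entry (the serialized label-set key, the cached value, plus map and LRU bookkeeping), a full cache works out to roughly 50–200 MB on top of the distributor's baseline, which is in the same ballpark as the ~200 MB plateau shown above.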

@putrasattvika
Contributor Author

The distributor will stabilize at around 200 MB and will not continue to rise.

We let the Distributors run for a couple of days without restarting them, and in our case peak memory usage was ~290 MiB. We have multiple Kubernetes CronJobs that run at a relatively high frequency (e.g. every 5 minutes), and we use the pod's name as a log label, which might explain why our Distributors use more memory than liguozhong's.
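To put a rough number on that: a CronJob firing every 5 minutes produces 24 × 60 / 5 = 288 distinct pod names per day, and since the pod name is a label, each of those is a label set the distributor has never cached before, so the cache keeps taking on new entries instead of re-hitting existing ones.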

[Graph: Distributor memory usage over several days, peaking at ~290 MiB]

The pods OOM-ed before I could take a heap dump of them, so I can't really confirm whether the increase in memory usage is actually caused by the LRU cache. Still, I think it would be better if the cache size were user-configurable and the usual cache metrics were exported (cache size, cache hits & misses, cache evictions, etc.).
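To illustrate the kind of instrumentation meant here, below is a minimal sketch (our own, hypothetical: the loki_distributor_label_cache_* metric names are made up, and it again assumes github.com/hashicorp/golang-lru rather than Loki's actual cache) of wrapping an LRU so that its size, hits, and misses are exported, with the cache size taken as a parameter so it could come from user configuration.

  package labelcache

  import (
      lru "github.com/hashicorp/golang-lru"
      "github.com/prometheus/client_golang/prometheus"
  )

  // Hypothetical metrics; the names are illustrative, not real Loki metrics.
  var (
      cacheEntries = prometheus.NewGauge(prometheus.GaugeOpts{
          Name: "loki_distributor_label_cache_entries",
          Help: "Current number of entries in the label cache.",
      })
      cacheHits = prometheus.NewCounter(prometheus.CounterOpts{
          Name: "loki_distributor_label_cache_hits_total",
          Help: "Total label cache hits.",
      })
      cacheMisses = prometheus.NewCounter(prometheus.CounterOpts{
          Name: "loki_distributor_label_cache_misses_total",
          Help: "Total label cache misses.",
      })
  )

  // InstrumentedLabelCache wraps an entry-capped LRU with the metrics above.
  type InstrumentedLabelCache struct {
      cache *lru.Cache
  }

  // New takes the cache size as a parameter so it could be driven by config.
  func New(size int) (*InstrumentedLabelCache, error) {
      prometheus.MustRegister(cacheEntries, cacheHits, cacheMisses)
      c, err := lru.New(size)
      if err != nil {
          return nil, err
      }
      return &InstrumentedLabelCache{cache: c}, nil
  }

  // Get records a hit or miss for every lookup.
  func (c *InstrumentedLabelCache) Get(key string) (interface{}, bool) {
      v, ok := c.cache.Get(key)
      if ok {
          cacheHits.Inc()
      } else {
          cacheMisses.Inc()
      }
      return v, ok
  }

  // Add stores an entry and updates the size gauge.
  func (c *InstrumentedLabelCache) Add(key string, value interface{}) {
      c.cache.Add(key, value)
      cacheEntries.Set(float64(c.cache.Len()))
  }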

@stale

stale bot commented Apr 16, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale (A stale issue or PR that will automatically be closed) label Apr 16, 2022
stale bot closed this as completed May 1, 2022
chaudum added the type/bug (Something is not working as expected) label Jun 14, 2023