
Doc: Have a dedicated doc about caching in Loki #6201

Open
kavirajk opened this issue May 19, 2022 · 13 comments
Labels: component/cache, help wanted, keepalive, type/docs

Comments

@kavirajk
Collaborator


Describe the solution you'd like

  1. Explain the different kinds of caches (chunk_cache, index_cache, result_cache, etc.)
  2. When the in-memory FIFO cache is used versus an external cache (like Memcached)
  3. Which components mutate each cache and which components consume it
  4. Key configs and key metrics for operating and observing the different caches
@kavirajk kavirajk changed the title Doc: Have a dedicated doc for caching in Loki Doc: Have a dedicated doc about caching in Loki May 19, 2022
@alexandre1984rj
Contributor

I have deployed Loki with Promtail to EKS using Helm and the loki-distributed chart. I have also configured caching with Redis. The problem is that my configuration seems right, but when I check the logs from the ingester, it seems it is still using fifocache.

When I port-forward to the ingester service and check localhost:3100/config, the index_queries_cache_config does not show the Redis configuration (endpoint and password); instead I get enable_fifocache: true and ingester logs with:

level=warn ts=2022-05-17T19:01:48.039860391Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=warn ts=2022-05-17T19:01:48.040721736Z caller=experimental.go:20 msg="experimental feature in use" feature="Redis cache"

Even when I disable it with:

storage_config:
  index_queries_cache_config:
    enable_fifocache: false

and with extraArgs set to -store.index-cache-read.cache.enable-fifocache=false.

Here is my configuration:

host_redis: ~
pass_redis: ~

loki:
  structuredConfig:
    auth_enabled: false

    query_range:
      cache_results: true
      align_queries_with_step: true
      results_cache:
        cache:
          enable_fifocache: false
          redis:
            endpoint: {{ .Values.host_redis }}
            expiration: 30m
            timeout: 5s
            password: {{ .Values.pass_redis }}
            tls_enabled: true
            
    storage_config:
      aws:
        s3: "s3://us-east-1/"
        bucketnames: {{ .Values.bucketName | quote }}
      boltdb_shipper:
        shared_store: s3
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 24h
      index_queries_cache_config:
        enable_fifocache: false
        redis:
          endpoint: {{ .Values.host_redis }}
          expiration: 30m
          timeout: 5s
          password: {{ .Values.pass_redis }}
          tls_enabled: true
    
    chunk_store_config:
      max_look_back_period: 0s
      chunk_cache_config:
        enable_fifocache: false
        redis:
          endpoint: {{ .Values.host_redis }}
          expiration: 30m
          timeout: 5s
          password: {{ .Values.pass_redis }}
          tls_enabled: true

    server:
      http_server_read_timeout: 300s
      http_server_write_timeout: 300s
      grpc_listen_port: 9095

    distributor:
      ring:
        kvstore:
          store: memberlist
    
    frontend:
      compress_responses: true
      log_queries_longer_than: 15s
      max_outstanding_per_tenant: 2048
      tail_proxy_url: http://{{ .Release.Name }}-loki-distributed-querier:3100

    frontend_worker:
      frontend_address: {{ .Release.Name }}-loki-distributed-query-frontend:9095

    querier:
      query_timeout: 5m
      query_ingesters_within: 1h
      engine:
        timeout: 5m
        
    memberlist:
      join_members:
        - {{ .Release.Name }}-loki-distributed-memberlist
          
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
      chunk_idle_period: 30m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 1m
      # https://grafana.com/docs/loki/latest/best-practices/#use-chunk_target_size
      chunk_target_size: 5242880
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal

    compactor:
      shared_store: s3
      # Without this the Compactor will only compact tables
      retention_enabled: true
      # Directory where marked chunks and temporary tables will be saved
      working_directory: /var/loki/compactor/retention
      # Dictates how often compaction and/or retention is applied. If the Compactor falls behind, compaction and/or retention occur as soon as possible.
      compaction_interval: 10m
      # Delay after which the compactor will delete marked chunks
      retention_delete_delay: 2h
      # Specifies the maximum quantity of goroutine workers instantiated to delete chunks
      retention_delete_worker_count: 150

    # Retention period is configured within the limits_config configuration section
    limits_config:
      ingestion_rate_strategy: "local"
      enforce_metric_name: false
      split_queries_by_interval: 1h
      retention_period: 168h
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      ingestion_rate_mb: 20
      per_stream_rate_limit: 20MB
      ingestion_burst_size_mb: 20
      
    schema_config:
      configs:
      - from: 2021-03-30
        store: boltdb-shipper
        object_store: aws
        schema: v11
        index:
          prefix: loki_
          period: 24h
      - from: 2022-05-12
        store: boltdb-shipper
        object_store: aws
        schema: v12
        index:
          prefix: loki_
          period: 24h

@kavirajk
Collaborator Author

I did some investigation; it looks like the log message msg="experimental feature in use" feature="Redis cache" comes from the chunk cache, not the index cache (index_queries_cache_config) as you suspected.

The reason is that even though the default value of enable_fifocache is false for all the caches (result_cache, index_cache, and chunk_cache), there is some additional logic when setting up the chunk_cache here: we set the FIFO cache to true by default if neither Memcached nor Redis is configured.

The tricky thing is that we only check redis.Endpoint != "" to determine whether a Redis config is set. I think that is what is happening in your case: the value for that endpoint comes from the values file ({{ .Values.host_redis }}), which I think is empty.

chunk_store_config:
      max_look_back_period: 0s
      chunk_cache_config:
        enable_fifocache: false
        redis:
          endpoint: {{ .Values.host_redis }}
          expiration: 30m
          timeout: 5s
          password: {{ .Values.pass_redis }}
          tls_enabled: true

I'm aware you use the same values for the other caches as well (result_cache and index_cache). But in those places there is no special logic to pick a cache; they have the FIFO cache disabled by default, so you don't see that experimental warning there.
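
Roughly, the chunk cache selection behaves like this (a simplified Go sketch of the check described above; the types and function names are illustrative assumptions, not Loki's actual source):

package main

import "fmt"

// Illustrative types only: the field names mirror the YAML keys discussed in
// this thread, not Loki's actual Go structs.
type RedisConfig struct {
    Endpoint string
}

type MemcachedClientConfig struct {
    Host      string
    Addresses string
}

type CacheConfig struct {
    EnableFifoCache bool
    Redis           RedisConfig
    MemcachedClient MemcachedClientConfig
}

// applyChunkCacheDefault sketches the behaviour described above: if neither
// Redis nor Memcached appears to be configured, the chunk cache falls back to
// the in-memory FIFO cache even when enable_fifocache is explicitly false.
func applyChunkCacheDefault(cfg *CacheConfig) {
    redisSet := cfg.Redis.Endpoint != ""
    memcachedSet := cfg.MemcachedClient.Host != "" || cfg.MemcachedClient.Addresses != ""
    if !redisSet && !memcachedSet {
        cfg.EnableFifoCache = true
    }
}

func main() {
    // An empty {{ .Values.host_redis }} renders to an empty endpoint, so the
    // FIFO fallback kicks in despite enable_fifocache: false.
    cfg := CacheConfig{EnableFifoCache: false}
    applyChunkCacheDefault(&cfg)
    fmt.Println("enable_fifocache:", cfg.EnableFifoCache) // prints: enable_fifocache: true
}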

One thing we can do is make the experimental warning clearer by stating which kind of cache (index, results, or chunk) it refers to. I will fix that in a separate PR.

@alexandre1984rj
Contributor

alexandre1984rj commented May 23, 2022

@kavirajk really appreciate the feedback. Concerning {{ .Values.host_redis }}, I believe it is not empty, because when I check the ConfigMap for the Loki configuration, and check the querier, I can see the Redis data correctly.

Doing a port-forward to the ingester service, I can get the Redis config under chunk_cache_config but not under index_queries_cache_config:

$ k port-forward service/observability-loki-loki-distributed-ingester 3100:3100                                       
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100

[Screenshots attached: 2022-05-23_16-44, 2022-05-23_16-48]

@sureshgoli25

I am having the same problem. I am trying to use Memcached, but it is still configured with fifocache.

Below is my configuration snippet.

    storage_config:
      engine: chunks
      max_parallel_get_chunk: 300
      index_cache_validity: 5m0s
      index_queries_cache_config:
        enable_fifocache: false
        background:
          writeback_goroutines: 30
          writeback_buffer: 10000
        memcached:
          batch_size: 100
          parallelism: 100
        memcached_client:
          consistent_hash: true
          host: xxx-memcached-index-queries.XXX.svc.cluster.local
          service: http

I am using the loki-distributed (0.48.4) Helm chart.

@trevorwhitney
Collaborator

@sureshgoli25 would you be able to paste the output of the ConfigMap generated by helm template with your most recent values.yaml? That config for using memcached looks good to me, so I'm curious why Loki isn't reflecting it on /config.

@sureshgoli25

@trevorwhitney below is a snippet of the ConfigMap generated from my latest values.

    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 168h
        index_gateway_client:
          server_address: dns:///xxx-index-gateway:9095
        resync_interval: 3m0s
        shared_store: s3
        shared_store_key_prefix: index/
      engine: chunks
      filesystem:
        directory: null
      index_cache_validity: 5m0s
      index_queries_cache_config:
        background:
          writeback_buffer: 10000
          writeback_goroutines: 30
        enable_fifocache: false
        memcached:
          batch_size: 100
          parallelism: 100
        memcached_client:
          consistent_hash: true
          host: xxx-memcached-index-queries.XXX.svc.cluster.local
          service: http
      max_parallel_get_chunk: 300

As you can see, helm template rendered the correct config file for Loki, but when I call the /config endpoint, the index queries cache config shows fifocache as enabled:

      index_queries_cache_config:
        enable_fifocache: true

@trevorwhitney
Collaborator

I understand better now, thank you for providing that, though from reading above I thought the original post was about the loki-distributed Helm chart. @sureshgoli25 are you running in SSD mode?

Currently, in SSD mode the index query cache is hard-coded to use the FIFO cache. You can provide an external cache for results and chunks, but not for index queries. This is probably something we should document better. Is this a problem for your use case, or are you just calling out the need for documentation around this?
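
In other words, something along these lines happens to the index query cache config in SSD mode (a simplified Go sketch; the types, the ssdMode flag, and the hostname are illustrative assumptions, not Loki's actual source):

package main

import "fmt"

// Illustrative types only; the field names mirror the config keys in this
// thread, not Loki's actual Go structs.
type CacheConfig struct {
    EnableFifoCache bool
    MemcachedHost   string
}

type StorageConfig struct {
    IndexQueriesCacheConfig CacheConfig
}

// applySSDIndexCacheOverride sketches the hard-coding described above: when
// running in SSD (simple scalable) mode, the index query cache is replaced
// with the in-memory FIFO cache, and any external cache configured for it is
// ignored.
func applySSDIndexCacheOverride(ssdMode bool, cfg *StorageConfig) {
    if ssdMode {
        cfg.IndexQueriesCacheConfig = CacheConfig{EnableFifoCache: true}
    }
}

func main() {
    cfg := StorageConfig{
        IndexQueriesCacheConfig: CacheConfig{
            EnableFifoCache: false,
            // Hypothetical memcached host, standing in for the user's value.
            MemcachedHost: "example-memcached-index-queries.ns.svc.cluster.local",
        },
    }
    applySSDIndexCacheOverride(true, &cfg)
    // /config would now report enable_fifocache: true with no memcached host.
    fmt.Printf("%+v\n", cfg.IndexQueriesCacheConfig)
}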

@sureshgoli25

@trevorwhitney thank you for the feedback.
I am using the loki-distributed Helm chart.
Maybe my configuration is wrong? If possible, kindly advise based on the complete configuration below, which I am passing through the Helm chart.

Cloud Provider: AWS
Kubernetes Cluster: RKE2 v1.21.7

auth_enabled: true
    
common:
  replication_factor: 6
  instance_interface_names:
  - eth0
  - en0
  - lo
  ring:
    kvstore:
      store: memberlist

  storage:
    s3:
      s3: ""
      s3forcepathstyle: true
      bucketnames: XXXX
      endpoint: https://XXXXX
      region: us-east-1
      access_key_id: XXXXX
      secret_access_key: XXXXXXXX
      insecure: false
      sse_encryption: false
      http_config:
        idle_conn_timeout: 5m0s
        response_header_timeout: 2m0s
        insecure_skip_verify: false
        ca_file: ""
      signature_version: v4
      backoff_config:
        min_period: 100ms
        max_period: 3s
        max_retries: 5

distributor:
  ring:
    instance_addr: 127.0.0.1

server:
  log_level: debug
  http_listen_port: 3100
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 1073741824
  grpc_server_max_send_msg_size: 1073741824
  grpc_server_max_concurrent_streams: 0
  http_server_read_timeout: 120s
  http_server_write_timeout: 120s
  http_server_idle_timeout: 2m0s

querier:
  query_timeout: 2m0s
  query_ingesters_within: 3h
  engine:
    timeout: 5m0s
    max_look_back_period: 60s

ingester_client:
  pool_config:
    client_cleanup_period: 60s
    health_check_ingesters: true
    remote_timeout: 15s
  remote_timeout: 30s
  grpc_client_config:
    max_send_msg_size: 1073741824
    max_recv_msg_size: 1073741824

ingester:
  lifecycler:
    ring:
      zone_awareness_enabled: false
      replication_factor: 5
    heartbeat_period: 5s
  chunk_idle_period: 15m
  max_chunk_age: 15m
  chunk_block_size: 262144
  chunk_target_size: 1572864
  chunk_encoding: snappy
  chunk_retain_period: 1m
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
    flush_on_shutdown: true
    replay_memory_ceiling: 4GB

storage_config:
  engine: chunks
  max_parallel_get_chunk: 300
  index_cache_validity: 5m0s
  index_queries_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxxxxx-memcached-index-queries.ns.svc.cluster.local
      service: http
  boltdb_shipper:
    shared_store: s3
    shared_store_key_prefix: index/
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
    resync_interval: 3m0s
    index_gateway_client:
      server_address: dns:///xxxxxx-index-gateway:9095
  filesystem:
    directory: null

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxxxxx-memcached-chunks.ns.svc.cluster.local
      service: http
      timeout: 600ms
  write_dedupe_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxxxxx-memcached-index-writes.ns.svc.cluster.local
      service: http
      timeout: 600ms

schema_config:
  configs:
  - from: 2020-09-07
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: loki_index_
      period: 24h
    chunks:
      prefix: loki_chunks_
      period: 24h
    row_shards: 32

limits_config:
  ingestion_rate_strategy: "local"
  enforce_metric_name: false
  reject_old_samples: false
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 30m
  split_queries_by_interval: 1h
  retention_period: 168h
  per_stream_rate_limit: 2048MB
  per_stream_rate_limit_burst: 2048MB
  ingestion_rate_mb: 2048
  ingestion_burst_size_mb: 2048
  max_entries_limit_per_query: 100000
  max_global_streams_per_user: 100000
  max_streams_matchers_per_query: 100000
  max_concurrent_tail_requests: 100
  max_query_parallelism: 64

table_manager:
  retention_deletes_enabled: true
  retention_period: 31d

frontend_worker:
  frontend_address: xxxxxx-query-frontend:9095
  grpc_client_config:
    max_send_msg_size: 1073741824
    max_recv_msg_size: 1073741824
  parallelism: 18

frontend:
  max_body_size: 1073741824
  log_queries_longer_than: 15s
  compress_responses: true
  tail_proxy_url: http://xxxxxx-querier:3100
  grpc_client_config:
    max_send_msg_size: 1073741824
    max_recv_msg_size: 1073741824

query_range:
  align_queries_with_step: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: false
      default_validity: 1h0m0s
      background:
        writeback_goroutines: 100
        writeback_buffer: 100000
      memcached:
        batch_size: 100
        parallelism: 100
      memcached_client:
        consistent_hash: true
        host: xxxxxx-memcached-frontend.ns.svc.cluster.local
        max_idle_conns: 16
        service: http
        timeout: 1500ms
        update_interval: 1m

memberlist:
  join_members:
    - {{ include "loki.fullname" . }}-memberlist
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_port: 7946

compactor:
  shared_store: s3

query_scheduler:
  max_outstanding_requests_per_tenant: 1000
  grpc_client_config:
    max_recv_msg_size: 1073741824
    max_send_msg_size: 1073741824

analytics:
  reporting_enabled: false

ruler:
  enable_api: true
  alertmanager_url: XXXXX
  enable_alertmanager_discovery: false
  alertmanager_client:
    tls_insecure_skip_verify: true
  storage:
    type: s3

@trevorwhitney
Collaborator

@sureshgoli25 that config looks good. Can you check whether all components are overriding your disabling of the FIFO cache? We do always override this in the ingester, but we should not in your queriers.

@sureshgoli25

@trevorwhitney thanks for the pointers. I can see in the queriers that memcached is used for the chunk cache. I was looking at the ingesters and thought the configuration was the same across all components, so I was always looking at the ingester level.

chunk_store_config:
  chunk_cache_config:
    enable_fifocache: false
    default_validity: 1h0m0s
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      expiration: 0s
      batch_size: 100
      parallelism: 100
    memcached_client:
      host: XXXX-memcached-chunks.YYY.svc.cluster.local
      service: http
      addresses: ""
      timeout: 600ms
      max_idle_conns: 16
      max_item_size: 0
      update_interval: 1m0s
      consistent_hash: true
      circuit_breaker_consecutive_failures: 10
      circuit_breaker_timeout: 10s
      circuit_breaker_interval: 10s

@stale

stale bot commented Jul 10, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly review closed issues that have a stale label, sorted by thumbs-up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Jul 10, 2022
@kavirajk kavirajk added the keepalive An issue or PR that will be kept alive and never marked as stale. label Aug 15, 2022
@stale stale bot removed the stale A stale issue or PR that will automatically be closed. label Aug 15, 2022
@osg-grafana osg-grafana added type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories and removed area/docs labels Oct 19, 2022
@GrafanaWriter
Contributor

@JStickler - can you please investigate and assess priority with @kristiandeppe and @minhdanh ?

@JStickler JStickler added the help wanted We would love help on these issues. Please come help us! label Sep 27, 2023
@alex-berger

For me it is still not clear which components access which caches (chunks, frontend, index-queries) and in what mode (read or write). I would love it if someone could shed some light on this.
