
Backend deletes tsdb files before compacting them #12105

Open
hervenicol opened this issue Mar 1, 2024 · 10 comments
Labels
type/bug Something is not working as expected

Comments

@hervenicol
Contributor

Bug description

I noticed my loki-backend pods keep some deleted files open, as lsof shows:

1       /usr/bin/loki   89      /var/loki/compactor/loki_index_19732/1704893247-loki-write-3-1704893247327457542.tsdb (deleted)
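For reference, this is roughly how I'm finding them (a rough sketch; the pod and container names are placeholders for my deployment, and it assumes lsof or /proc access inside the container):

    # list open files whose link count is 0, i.e. deleted but still held open
    kubectl exec loki-backend-0 -c loki -- lsof +L1
    # fallback without lsof, assuming Loki runs as PID 1 in the container
    kubectl exec loki-backend-0 -c loki -- sh -c 'ls -l /proc/1/fd | grep deleted'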

It outputs some error logs that may be related:

level=error ts=2024-02-20T08:53:49.175429725Z caller=compactor.go:601 msg="failed to compact files" table=loki_index_19732 err="invalid tsdb path: /var/loki/compactor/loki_index_19732/loki-write-0-1701118454604792126-1704893400"
level=error ts=2024-02-20T08:53:49.175477321Z caller=compactor.go:523 msg="failed to run compaction" err="invalid tsdb path: /var/loki/compactor/loki_index_19732/loki-write-0-1701118454604792126-1704893400"

I noticed this because disks eventually fill up, as the deleted files are never closed.
I couldn't find reports of a similar issue, but maybe I didn't use the right github-search wizardry.

Expected behavior

Deleted files should be released and disk space freed.

Environment:

Extra info

I have multiple clusters running a similar loki+promtail setup, and around 1 out of 10 does this, so I'm not sure how to reproduce it.
However, on the clusters where this happens, disk usage growth is very linear. Example:
[screenshot: disk usage graph growing linearly over time]
Usage goes back to 0 when the pod is restarted, but starts growing again right after.

I'm not sure how to investigate further, but I'll do my best to provide whatever extra info could help.
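One check that does seem to confirm the growth comes from deleted-but-open files rather than files visible on disk (again a sketch, pod name is a placeholder):

    # if df reports far more used space than du can account for,
    # the difference is space pinned by handles on deleted files
    kubectl exec loki-backend-0 -c loki -- df -h /var/loki
    kubectl exec loki-backend-0 -c loki -- du -sh /var/loki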

@JStickler added the type/bug Something is not working as expected label Mar 11, 2024
@jon-rei

jon-rei commented Mar 20, 2024

We see a similar problem in 1 of our 5 clusters: the disk usage of 1 of our 3 Loki backend pods just keeps increasing until we restart it.
For us the problem appeared after upgrading from v2.9.2 to v2.9.4 (helm chart 5.39.0 to 5.43.1).

@hervenicol did you find a solution for this problem?

@hervenicol
Contributor Author

Happy to see I'm not the only one 😅
But no @jon-rei, I don't have a solution yet. I still have to restart some pods once in a while 😞

@jon-rei

jon-rei commented Apr 1, 2024

Hi @hervenicol, I was looking into this again as it was getting really annoying restarting the pods all the time.
While investigating, I found out that we're actually running into another problem.

caller=compactor.go:601 msg="failed to compact files" table=tsdb_index_123456 err="failed to get s3 object: NoSuchKey: \n\tstatus code: 404, request id: xxxx-s3, host id: "

This error is happening due to a problem with our S3. But the behavior of the backend pods is kind of the same: it seems that when there is a problem with the compactor, disk usage grows until there is no space left. I suspect that the compaction cycle can't be completed and Loki is downloading the same files from the S3 bucket over and over again.

I discovered the corrupt file in S3 by trying to download the entire tsdb_index_ folder to my local machine. Can you verify that the file you are getting the error for actually exists?
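Something along these lines, with the AWS CLI (a sketch; the endpoint, bucket name and key prefix are placeholders, and the key should come from your failing log line):

    # check whether the object the compactor is asking for actually exists
    aws --endpoint-url https://my-s3-endpoint s3api head-object \
      --bucket my-loki-bucket --key index/tsdb_index_123456/FILE_FROM_THE_ERROR
    # or pull the whole table locally and compare it with the listing
    aws --endpoint-url https://my-s3-endpoint s3 sync \
      s3://my-loki-bucket/index/tsdb_index_123456 ./tsdb_index_123456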

@dragoangel

> This error is happening due to a problem with our S3. [...] Can you verify that the file you are getting the error for actually exists?

After updating to 3.0.0 and tsdb I also get the same 404, which was not the case on 2.9.3 with boltdb-shipper. Can you please share more details about which problems with S3 you mean?

@ikogan

ikogan commented May 2, 2024

I'm also getting this now after upgrading to 3.0.0, though I'm not seeing the 404. I'm using the latest helm chart in single binary mode; here are some relevant values:

    storage:
      type: filesystem
    compactor:
      delete_request_store: filesystem
      retention_enabled: true
    schemaConfig:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: index_
        object_store: filesystem
        schema: v11
        store: boltdb-shipper
      - from: "2022-11-16"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v12
        store: boltdb-shipper
      - from: "2024-04-20"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v13
        store: tsdb
    structuredConfig:
      ingester:
        chunk_idle_period: 3m
        chunk_retain_period: 1m
        lifecycler:
          ring:
            replication_factor: 1
      querier:
        engine:
          max_look_back_period: 0s
      limits_config:
        max_concurrent_tail_requests: 40
        retention_period: 336h

I'm not seeing the files from the error stuck open in lsof, though.
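For anyone comparing notes, a quick way to see whether the compactor ever finishes a cycle (a sketch; the pod name is a placeholder for my single binary deployment):

    # look for compaction/retention errors vs. successful runs in the logs
    kubectl logs loki-0 | grep -iE "compactor|failed to compact|retention"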

@jon-rei

jon-rei commented May 2, 2024

> Can you please share more details about which problems with S3 you mean?

I'm running Ceph Object Gateway, which seems to have lost some files: you can list them with the S3 CLI, but downloading them fails. Even deleting them did not work. I fixed this by uploading a dummy file with the same name and then deleting it. After that, the Loki compactor was no longer stuck on the tsdb index and was able to finish its work.
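Roughly what the workaround looked like (a sketch; the RGW endpoint, bucket and object key are placeholders for our setup):

    # overwrite the broken object with a dummy of the same key, then delete it
    echo dummy > dummy-object
    aws --endpoint-url https://rgw.example.com s3 cp dummy-object \
      s3://my-loki-bucket/index/tsdb_index_123456/BROKEN_OBJECT_KEY
    aws --endpoint-url https://rgw.example.com s3 rm \
      s3://my-loki-bucket/index/tsdb_index_123456/BROKEN_OBJECT_KEY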

@dragoangel

dragoangel commented May 2, 2024

> I'm running Ceph Object Gateway, which seems to have lost some files. [...] I fixed this by uploading a dummy file with the same name and then deleting it.

I have the same situation! So are you saying it keeps hanging (losing files again and again), or did you upload a dummy file once and the issue was solved? Did you face this issue only on 3.0.0, or with tsdb in general? I didn't see such an issue with boltdb-shipper and 2.9.x versions, but it looks like you're right that this is the Ceph issue at https://tracker.ceph.com/issues/47866; my Ceph version is 16.2.13.

@dragoangel

@jon-rei I downgraded back to 2.9.x but kept tsdb as the index, and so far I don't see any issues regarding the NoSuchKey 404 error. Based on this I assume the issue is in the new Loki 3.0.0, or maybe somewhere from 2.9.4 onwards. I've generally seen that 2.9.4+ versions give many issues.

@jon-rei

jon-rei commented May 3, 2024

Glad to hear it. I didn't actually try to roll back the version and thought it was just a Ceph issue, since the same problem appears with other tools I use.

@dragoangel

dragoangel commented May 3, 2024

@jon-rei can you please share which Ceph version you are using? If it's 16.2.x, maybe you can share the names of the other tools where you had such issues. I'm sorry to everybody here for going off-topic, I'm just very interested in potential problems so I can be aware of them.
