
Backend deletes tsdb files before compacting them #12105

Open
hervenicol opened this issue Mar 1, 2024 · 10 comments
Labels
type/bug Something is not working as expected

Comments

@hervenicol
Contributor

Bug description

I noticed my loki-backend pods keep some deleted files open, as lsof shows:

1       /usr/bin/loki   89      /var/loki/compactor/loki_index_19732/1704893247-loki-write-3-1704893247327457542.tsdb (deleted)
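For reference, this is roughly how I'm finding them (a rough sketch; the pod and container names are placeholders for my deployment, and it assumes lsof or /proc access inside the container):

    # list open files whose link count is 0, i.e. deleted but still held open
    kubectl exec loki-backend-0 -c loki -- lsof +L1
    # fallback without lsof, assuming Loki runs as PID 1 in the container
    kubectl exec loki-backend-0 -c loki -- sh -c 'ls -l /proc/1/fd | grep deleted'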

It outputs some error logs that may be related:

level=error ts=2024-02-20T08:53:49.175429725Z caller=compactor.go:601 msg="failed to compact files" table=loki_index_19732 err="invalid tsdb path: /var/loki/compactor/loki_index_19732/loki-write-0-1701118454604792126-1704893400"
level=error ts=2024-02-20T08:53:49.175477321Z caller=compactor.go:523 msg="failed to run compaction" err="invalid tsdb path: /var/loki/compactor/loki_index_19732/loki-write-0-1701118454604792126-1704893400"

I noticed this because disks eventually fill up, as the deleted files are never closed.
I couldn't find reports of a similar issue, but maybe I didn't use the right github-search wizardry.

Expected behavior

Deleted files should be released and disk space freed.

Environment:

Extra info

I have multiple clusters running a similar loki+promtail setup, and around 1 out of 10 does this, so I'm not sure how to reproduce it.
However, on the clusters where this happens, disk usage growth is very linear. Example:
[screenshot: disk usage graph growing linearly over time]
Usage goes back to 0 when the pod is restarted, but starts growing again right after.

I'm not sure how to investigate further, but I'll do my best to provide whatever extra info could help.
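One check that does seem to confirm the growth comes from deleted-but-open files rather than files visible on disk (again a sketch, pod name is a placeholder):

    # if df reports far more used space than du can account for,
    # the difference is space pinned by handles on deleted files
    kubectl exec loki-backend-0 -c loki -- df -h /var/loki
    kubectl exec loki-backend-0 -c loki -- du -sh /var/loki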

@JStickler added the type/bug Something is not working as expected label Mar 11, 2024
@jon-rei

jon-rei commented Mar 20, 2024

We see a similar problem in 1 of our 5 clusters: the disk usage of 1 of our 3 Loki backend pods just keeps increasing until we restart it.
For us the problem appeared after upgrading from v2.9.2 to v2.9.4 (helm chart 5.39.0 to 5.43.1).

@hervenicol did you find a solution for this problem?

@hervenicol
Contributor Author

Happy to see I'm not the only one 😅
But no @jon-rei, I don't have a solution yet. I still have to restart some pods once in a while 😞

@jon-rei

jon-rei commented Apr 1, 2024

Hi @hervenicol, I was looking into this again as it was getting really annoying restarting the pods all the time.
While investigating, I found out that we're actually running into another problem.

caller=compactor.go:601 msg="failed to compact files" table=tsdb_index_123456 err="failed to get s3 object: NoSuchKey: \n\tstatus code: 404, request id: xxxx-s3, host id: "

This error is happening due to a problem with our S3. But the behavior of the backend pods is kind of the same: it seems that when there is a problem with the compactor, disk usage grows until there is no space left. I suspect that the compaction cycle can't be completed and Loki is downloading the same files from the S3 bucket over and over again.

I discovered the corrupt file in S3 by trying to download the entire tsdb_index_ folder to my local machine. Can you verify that the file you are getting the error for actually exists?
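Something along these lines, with the AWS CLI (a sketch; the endpoint, bucket name and key prefix are placeholders, and the key should come from your failing log line):

    # check whether the object the compactor is asking for actually exists
    aws --endpoint-url https://my-s3-endpoint s3api head-object \
      --bucket my-loki-bucket --key index/tsdb_index_123456/FILE_FROM_THE_ERROR
    # or pull the whole table locally and compare it with the listing
    aws --endpoint-url https://my-s3-endpoint s3 sync \
      s3://my-loki-bucket/index/tsdb_index_123456 ./tsdb_index_123456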

@dragoangel

> This error is happening due to a problem with our S3. [...] Can you verify that the file you are getting the error for actually exists?

After updating to 3.0.0 and tsdb I also get the same 404, which was not the case on 2.9.3 with boltdb-shipper. Can you please share more details about which problems with S3 you mean?

@ikogan

ikogan commented May 2, 2024

I'm also getting this now after upgrading to 3.0.0, though I'm not seeing the 404. I'm using the latest helm chart in single binary mode; here are some relevant values:

    storage:
      type: filesystem
    compactor:
      delete_request_store: filesystem
      retention_enabled: true
    schemaConfig:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: index_
        object_store: filesystem
        schema: v11
        store: boltdb-shipper
      - from: "2022-11-16"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v12
        store: boltdb-shipper
      - from: "2024-04-20"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v13
        store: tsdb
    structuredConfig:
      ingester:
        chunk_idle_period: 3m
        chunk_retain_period: 1m
        lifecycler:
          ring:
            replication_factor: 1
      querier:
        engine:
          max_look_back_period: 0s
      limits_config:
        max_concurrent_tail_requests: 40
        retention_period: 336h

I'm not seeing the files from the error stuck open in lsof, though.
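For anyone comparing notes, a quick way to see whether the compactor ever finishes a cycle (a sketch; the pod name is a placeholder for my single binary deployment):

    # look for compaction/retention errors vs. successful runs in the logs
    kubectl logs loki-0 | grep -iE "compactor|failed to compact|retention"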

@jon-rei

jon-rei commented May 2, 2024

> Can you please share more details about which problems with S3 you mean?

I'm running Ceph Object Gateway, which seems to have lost some files: you can list them with the S3 CLI, but downloading them fails. Even deleting them did not work. I fixed this by uploading a dummy file with the same name and then deleting it. After that, the Loki compactor was no longer stuck on the tsdb index and was able to finish its work.
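Roughly what the workaround looked like (a sketch; the RGW endpoint, bucket and object key are placeholders for our setup):

    # overwrite the broken object with a dummy of the same key, then delete it
    echo dummy > dummy-object
    aws --endpoint-url https://rgw.example.com s3 cp dummy-object \
      s3://my-loki-bucket/index/tsdb_index_123456/BROKEN_OBJECT_KEY
    aws --endpoint-url https://rgw.example.com s3 rm \
      s3://my-loki-bucket/index/tsdb_index_123456/BROKEN_OBJECT_KEY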

@dragoangel

dragoangel commented May 2, 2024

> I'm running Ceph Object Gateway, which seems to have lost some files. [...] I fixed this by uploading a dummy file with the same name and then deleting it.

I have the same situation! So are you saying it keeps hanging (losing files again and again), or did you upload a dummy file once and the issue was solved? Did you face this issue only on 3.0.0, or with tsdb in general? I didn't see such an issue with boltdb-shipper and 2.9.x versions, but it looks like you're right that this is the Ceph issue at https://tracker.ceph.com/issues/47866; my Ceph version is 16.2.13.

@dragoangel

@jon-rei I downgraded back to 2.9.x but kept tsdb as the index, and so far I don't see any issues regarding the NoSuchKey 404 error. Based on this I assume the issue is in the new Loki 3.0.0, or maybe somewhere from 2.9.4 onwards. I've generally seen that 2.9.4+ versions give many issues.

@jon-rei

jon-rei commented May 3, 2024

Glad to hear it. I didn't actually try to roll back the version and thought it was just a Ceph issue, since the same problem appears with other tools I use.

@dragoangel

dragoangel commented May 3, 2024

@jon-rei can you please share which Ceph version you are using? If it's 16.2.x, maybe you can share the names of the other tools where you had such issues. I'm sorry to everybody here for going off-topic, I'm just very interested in potential problems so I can be aware of them.
