Backend deletes tsdb files before compacting them #12105
We see a similar problem in 1 of our 5 clusters. The disk space of 1 of our Loki backend pods (out of 3) just keeps increasing until we reboot it. @hervenicol did you find a solution for this problem?
Happy to see I'm not the only one 😅
Hi @hervenicol, I was looking into this again, as it was getting really annoying to restart the pods all the time.
This error is happening due to a problem with our S3. But the behavior of the backend pods is much the same: it seems that when there is a problem with the compactor, disk usage grows until there is no space left. I suspect the compaction cycle can't be completed and Loki downloads some files from the S3 bucket over and over again. I discovered the corrupt file in S3 by trying to download the entire tsdb_index_ folder to my local machine. Can you verify that the file you are getting the error for actually exists?
After updating to 3.0.0 and tsdb I also get the same 404, which was not the case on 2.9.3 with boltdb-shipper. Can you please share more details about which problems with S3 you mean?
I'm also getting this now as of upgrading to 3.0.0, though I'm not seeing the 404. I'm using the latest Helm chart in single-binary mode; here are some relevant values:

```yaml
storage:
  type: filesystem
compactor:
  delete_request_store: filesystem
  retention_enabled: true
schemaConfig:
  configs:
    - from: "2020-10-24"
      index:
        period: 24h
        prefix: index_
      object_store: filesystem
      schema: v11
      store: boltdb-shipper
    - from: "2022-11-16"
      index:
        period: 24h
        prefix: loki_index_
      object_store: filesystem
      schema: v12
      store: boltdb-shipper
    - from: "2024-04-20"
      index:
        period: 24h
        prefix: loki_index_
      object_store: filesystem
      schema: v13
      store: tsdb
structuredConfig:
  ingester:
    chunk_idle_period: 3m
    chunk_retain_period: 1m
    lifecycler:
      ring:
        replication_factor: 1
  querier:
    engine:
      max_look_back_period: 0s
  limits_config:
    max_concurrent_tail_requests: 40
    retention_period: 336h
```

Although, I'm not seeing the files from the error stuck open.
I'm running Ceph Object Gateway, which seems to have lost some files. Meaning you can list them with the S3 CLI, but when you try downloading them they are gone. Even deleting them did not work. I fixed this by uploading a dummy file with the same name and then deleting it. After that, the Loki compactor wasn't stuck on the tsdb index anymore and was able to finish its work.
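In shell terms, that workaround looks roughly like the following. This is a sketch assuming the AWS CLI against the RGW endpoint; the bucket and key names are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch of the workaround described above: overwrite the "lost" object
# with a dummy file so the gateway has a real object under that key again,
# then delete it cleanly so the listing no longer advertises it.
unstick_object() {
  bucket=$1
  key=$2
  tmp=$(mktemp)
  echo "dummy" > "$tmp"
  aws s3 cp "$tmp" "s3://$bucket/$key"   # re-create the object under the stuck key
  aws s3 rm "s3://$bucket/$key"          # now the delete actually sticks
  rm -f "$tmp"
}

# Example (placeholder names):
# unstick_object my-loki-bucket index/tsdb_index_19823/lost-file.tsdb.gz
```

After this, the compactor should stop retrying the phantom key on its next cycle.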
I have the same situation! So are you saying it keeps hanging indefinitely (losing files again and again), or that uploading a file once solved the issue? Did you face this issue only on 3.0.0, or with tsdb in general? I didn't see such an issue on boltdb-shipper and 2.9.x versions, but it looks like you're right that this is the Ceph issue from https://tracker.ceph.com/issues/47866; my Ceph version is 16.2.13.
@jon-rei I downgraded back to 2.9.x but kept tsdb as the index, and so far I don't see any issues regarding the NoSuchKey 404 error. Based on this I assume the issue is in the new Loki 3.0.0, or maybe was introduced somewhere from 2.9.4 onward. I've generally seen that the 2.9.4+ versions give many issues.
Glad to hear it. I didn't actually try to roll back the version; I thought it was just a Ceph issue, since the same problem appears with other tools I use.
@jon-rei can you please share which Ceph version you are using? If it's 16.2.x, maybe you can also share the names of the other tools where you had such issues. I'm very sorry to everybody here for going off topic; I'm just very interested in these potential problems so I can be aware of them.
Bug description
I noticed my loki-backend pods keep some deleted files open, as lsof can show me:
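The "deleted but still held open" pattern that lsof reveals can also be reproduced and inspected via Linux's /proc, which is handy inside minimal containers that don't ship lsof. A minimal sketch (the `tail -f` process is just a stand-in for the Loki process holding the file):

```shell
#!/bin/sh
# Minimal reproduction of the pattern described above: a process keeps a
# file descriptor to a file that has been unlinked, so the disk space is
# not freed until the process closes it (or is restarted, as the report
# notes). /proc/<pid>/fd marks such entries "(deleted)".
tmp=$(mktemp)
tail -f "$tmp" &            # stand-in for the process holding the file open
pid=$!
sleep 1                     # give tail time to open the file
rm "$tmp"                   # unlink: the name is gone, the data is not
ls -l /proc/"$pid"/fd | grep deleted   # the still-open unlinked file
kill "$pid"
```

This is why `df` keeps reporting the disk as full even though `du` over the visible files does not add up.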
It outputs some error logs that may be related:
I noticed this because disks eventually fill up, as deleted files are never closed.
I couldn't find reports of a similar issue, but maybe I didn't use the right GitHub-search wizardry.
Expected behavior
Deleted files should be released and disk space freed.
Environment:
Extra info
I have multiple clusters running similar loki+promtail setup, and around 1 out of 10 does this. So I'm not sure how to reproduce.
However, on the clusters where this happens, disk usage growth is very linear. Example:
Usage goes back to 0 when the pod is restarted, but starts growing again right after.
I'm not sure how to investigate further, but I'll do my best to provide whatever extra info could help.