
tsdb: Checkpoint closes mmaped chunk file despite open ChunkQuerier query; causing SIGSEGV #8217

Open
bwplotka opened this issue Nov 23, 2020 · 10 comments


@bwplotka
Member

bwplotka commented Nov 23, 2020

Prometheus version used: v1.8.2-0.20201029103703-63be30dceed9

Details: thanos-io/thanos#3497

Funnily enough, we hit this issue on ALL Thanos receivers every 16h ;p Exactly every 16h.

@bwplotka
Member Author

bwplotka commented Nov 23, 2020

cc @codesome @pracucci

@roidelapluie
Member

Details: #8217

Inception

@bwplotka
Member Author

Sorry, it's late. Edited (: -> thanos-io/thanos#3497

@codesome
Member

Not really sure how checkpointing would close an m-mapped chunk file; we would have faced this panic if that were the case. Also, I don't see the panic pointing into the TSDB codebase (was it truncated?).

@codesome
Member

Are you by any chance running the checkpointing in parallel?

@bwplotka
Member Author

TODO: Double-check whether the simple iterator is also affected by this truncation & chunkDiskMapper bug.

@bwplotka
Member Author

To potentially add: pending reader tracking as we have for blocks.

@roidelapluie
Copy link
Member

> To potentially add: pending reader tracking as we have for blocks.

Covered by #5877, I think.

@codesome
Member

To add more info: once we close the m-mapped file, the m-mapped byte slice is no longer valid. Hence the panic: the query has already obtained the chunk, and while it is still reading it, the m-mapped file gets closed and truncated.

@fpetkovski
Contributor

fpetkovski commented Jan 26, 2023

We are still experiencing this issue in Thanos, exactly as reported here: thanos-io/thanos#3497. Prometheus version is 2.40.

I see that #8723 is merged, but it either does not fix the root cause, or there is another place where already released memory is accessed.
