tsdb: Checkpoint closes mmaped chunk file despite open ChunkQuerier query; causing SIGSEGV #8217

bwplotka · 2020-11-23T19:31:26Z

Prometheus version used: v1.8.2-0.20201029103703-63be30dceed9

Details: thanos-io/thanos#3497

Funnily enough, we hit this issue on ALL Thanos receivers every 16h ;p Exactly every 16h.

bwplotka · 2020-11-23T19:35:56Z

cc @codesome @pracucci

roidelapluie · 2020-11-23T19:38:38Z

Details: #8217

Inception

bwplotka · 2020-11-23T19:40:28Z

Sorry, it's late here. Edited (: -> thanos-io/thanos#3497

codesome · 2020-11-24T15:08:08Z

Not really sure how checkpointing would close an m-mapped chunk file; we would have faced this panic if that were the case. Also, I am not seeing the panic pointing to the TSDB codebase (was it truncated?).

codesome · 2020-11-24T15:09:12Z

Are you by any chance running the checkpointing in parallel?

bwplotka · 2020-11-25T11:37:32Z

TODO: Double-check whether the simple iterator is also affected by this truncation & chunkDiskMapper bug.

bwplotka · 2020-11-25T11:40:01Z

To potentially add: pending reader tracking as we have for blocks.

roidelapluie · 2020-11-25T12:30:09Z

> To potentially add: pending reader tracking as we have for blocks.

Covered by #5877 I think

codesome · 2020-11-25T13:11:20Z

To add more info: once we close the m-map file, the byte slice that was m-mapped is no longer valid. Hence the panic: the query has already obtained the chunk, and while it is still reading from it, the m-map file gets closed and truncated.
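
For illustration, here is a minimal sketch of the "pending reader tracking" idea mentioned earlier in this thread. It is not the Prometheus implementation: `trackedFile`, `Acquire`, `Close`, and `ErrClosed` are hypothetical names, and a plain byte slice stands in for the m-mapped region so the example runs anywhere. The point it shows is only the ordering: a close must wait until every reader that holds a slice into the mapping has released it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrClosed is returned when a reader asks for a file that is already closed.
// Hypothetical name, for illustration only.
var ErrClosed = errors.New("chunk file closed")

// trackedFile is a hypothetical stand-in for one m-mapped head chunk file.
// data plays the role of the m-mapped byte slice.
type trackedFile struct {
	mtx     sync.Mutex
	closed  bool
	pending sync.WaitGroup // readers that still hold a slice into data
	data    []byte
}

// Acquire registers a pending reader and returns the bytes together with a
// release function. The release function is safe to call more than once.
func (f *trackedFile) Acquire() ([]byte, func(), error) {
	f.mtx.Lock()
	defer f.mtx.Unlock()
	if f.closed {
		return nil, nil, ErrClosed
	}
	f.pending.Add(1)
	var once sync.Once
	return f.data, func() { once.Do(f.pending.Done) }, nil
}

// Close rejects new readers, waits for pending ones, then invalidates data.
// With a real mmap, unmapping before the wait returns is what would make a
// reader touch invalid memory.
func (f *trackedFile) Close() {
	f.mtx.Lock()
	f.closed = true
	f.mtx.Unlock()
	f.pending.Wait()
	f.data = nil
}

func main() {
	f := &trackedFile{data: []byte("chunk-bytes")}

	b, release, err := f.Acquire()
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		f.Close() // blocks until the reader calls release
		close(done)
	}()

	got := string(b) // still safe: Close cannot finish before release
	release()
	<-done

	_, _, err = f.Acquire()
	fmt.Println(got, err)
}
```

The trade-off in this sketch is that a long-running query delays truncation instead of crashing the process; whether that is acceptable for head chunk files is a separate design question.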

fpetkovski · 2023-01-26T19:01:32Z

We are still experiencing this issue in Thanos, exactly as reported here: thanos-io/thanos#3497. Prometheus version is 2.40.

I see that #8723 is merged, but either it does not fix the root cause, or there is another place where already-released memory is accessed.

bwplotka added component/tsdb kind/bug labels Nov 23, 2020

bwplotka mentioned this issue Nov 23, 2020

receive: Query failure on Seg fault thanos-io/thanos#3497

Closed

bwplotka added the help wanted label Nov 23, 2020

This was referenced Mar 16, 2021

Prometheus receives SIGTERM after temporal storage disruption #8318

Closed

Ingester segfault when querying chunks while head chunks a truncated cortexproject/cortex#3907

Closed

pracucci mentioned this issue Apr 1, 2021

Added test to reproduce panic on TSDB head chunks truncated while querying #8681

Closed

This was referenced Apr 14, 2021

Experiment to fix head chunks panic #8722

Closed

Stop the bleed on chunk mapper panic #8723

Merged
