This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Blocks can contain chunks that go past MaxTime #347

Closed
BenoitKnecht opened this issue Jun 12, 2018 · 10 comments

Comments

@BenoitKnecht
Contributor

BenoitKnecht commented Jun 12, 2018

I've been investigating thanos-io/thanos#354, and I think I've finally identified what's going on.

The issue came up because Thanos checks that all the chunks in a block fall between block.MinTime and block.MaxTime, but in some circumstances it detected chunks with block.MaxTime < chunk.MaxTime. Those chunks also happened to satisfy block.MaxTime == chunk.MinTime, which, as we'll see, is no coincidence. After initially suspecting that the issue came from Thanos, I'm now pretty sure it's in tsdb.

New chunks are created when appending data to the head using append(), specifically this part:

if t >= s.nextAt {
	c = s.cut(t)
	chunkCreated = true
}

where t is the timestamp of the data being added, and s.nextAt can be equal to block.MaxTime. So if Prometheus happens to scrape a metric whose timestamp is equal to block.MaxTime, it will create a chunk with chunk.MinTime == block.MaxTime.
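
To make the boundary case concrete, here's a minimal, self-contained sketch (not the actual tsdb code; the 7200000 boundary is just an illustrative 2-hour mark in milliseconds):

package main

import "fmt"

func main() {
	// Hypothetical 2-hour block boundary in milliseconds.
	blockMaxTime := int64(7200000)

	// s.nextAt can line up exactly with the block boundary...
	nextAt := blockMaxTime
	// ...and a scrape can produce a sample with exactly that timestamp.
	t := blockMaxTime

	// Same condition as in append(): a new chunk is cut starting at t,
	// so chunk.MinTime ends up equal to block.MaxTime.
	if t >= nextAt {
		chunkMinTime := t
		fmt.Printf("chunk.MinTime=%d == block.MaxTime=%d\n", chunkMinTime, blockMaxTime)
	}
}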

Now, when the head needs to be persisted to disk as a new block, this function determines which chunks are part of the block, specifically:

// Do not expose chunks that are outside of the specified range.
if c == nil || !intervalOverlap(mint, maxt, h.mint, h.maxt) {
	return nil, ErrNotFound
}

with

func intervalOverlap(amin, amax, bmin, bmax int64) bool {
	// Checks Overlap: http://stackoverflow.com/questions/3269434/
	return amin <= bmax && bmin <= amax
}

So a chunk is included in the block if it overlaps it in any way, including the case where chunk.MinTime == block.MaxTime.

This isn't ideal, because it means that most of the data in that last chunk is useless (it won't be read, since it's outside the range [block.MinTime, block.MaxTime]), and the chunk is also going to be part of the next block by the same criterion (it overlaps it as well), so we end up storing the same data twice.

It seems to me that the whole issue stems from the fact that by the way intervalOverlap() is defined, blocks overlap by 1 millisecond. For instance, with 2-hour blocks, if blocks[0].MinTime = 0, blocks[0].MaxTime = 7200000, then the next block will be blocks[1].MinTime = 7200000, blocks[1].MaxTime = 14400000. So a metric with t = 7200000 would be in blocks[0] and blocks[1] at the same time.
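
To illustrate, this small self-contained snippet reuses the intervalOverlap definition quoted above and shows that a chunk starting exactly on the boundary is considered part of both blocks (the chunk range is just an example):

package main

import "fmt"

// Same definition as in tsdb (quoted above).
func intervalOverlap(amin, amax, bmin, bmax int64) bool {
	return amin <= bmax && bmin <= amax
}

func main() {
	// Example chunk with chunk.MinTime == blocks[0].MaxTime == blocks[1].MinTime.
	chunkMin, chunkMax := int64(7200000), int64(7201000)

	fmt.Println(intervalOverlap(chunkMin, chunkMax, 0, 7200000))        // true: included in blocks[0]
	fmt.Println(intervalOverlap(chunkMin, chunkMax, 7200000, 14400000)) // true: included in blocks[1]
}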

A possible way to fix this would be to say that block.MaxTime is an exclusive upper bound (so a chunk would only be included if chunk.MinTime < block.MaxTime && block.MinTime <= chunk.MaxTime). The other option would be to make sure that blocks don't overlap by the current definition of intervalOverlap(), i.e. make sure that blocks[1].MinTime = blocks[0].MaxTime + 1.
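
For comparison, here is a minimal sketch of what the first option could look like (an illustration of the exclusive upper bound, not the code from the proof of concept):

package main

import "fmt"

// Sketch: include a chunk only if it starts strictly before block.MaxTime
// (exclusive upper bound) while still overlapping the block.
func chunkInBlock(chunkMin, chunkMax, blockMin, blockMax int64) bool {
	return chunkMin < blockMax && blockMin <= chunkMax
}

func main() {
	// The same boundary chunk as in the previous snippet.
	chunkMin, chunkMax := int64(7200000), int64(7201000)

	fmt.Println(chunkInBlock(chunkMin, chunkMax, 0, 7200000))        // false: no longer included in blocks[0]
	fmt.Println(chunkInBlock(chunkMin, chunkMax, 7200000, 14400000)) // true: included in blocks[1] only
}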

I've taken that second approach in this proof of concept.

To make reproducing the issue and testing the fix easier, I patched Prometheus to round timestamps down to the minute (so that they would fall on the block boundary): https://github.com/BenoitKnecht/prometheus/tree/fix-block-boundaries. You can use it with

$ git clone -b fix-block-boundaries https://github.com/BenoitKnecht/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --log.level=debug --storage.tsdb.min-block-duration=10m

with prometheus.yml containing:

scrape_configs:
- job_name: self
  metrics_path: /metrics
  scrape_interval: 60s
  static_configs:
  - targets:
    - localhost:9090
@BenoitKnecht
Contributor Author

In order to check for blocks with "outsider" chunks (i.e. chunks that fall outside the block's time range), I use this script:

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/prometheus/tsdb"
	"github.com/prometheus/tsdb/chunks"
	"github.com/prometheus/tsdb/index"
	"github.com/prometheus/tsdb/labels"
)

func ms(t int64) time.Time {
	return time.Unix(0, t*1e6)
}

func check(dir string) {
	block, err := tsdb.OpenBlock(dir, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer block.Close()

	meta := block.Meta()
	fmt.Printf("block=%s, MinTime=%s, MaxTime=%s, compaction=%d\n", block, ms(meta.MinTime), ms(meta.MaxTime), meta.Compaction.Level)

	r, err := block.Index()
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	p, err := r.Postings(index.AllPostingsKey())
	if err != nil {
		log.Fatal(err)
	}

	var (
		lset labels.Labels
		chks []chunks.Meta
	)

	for p.Next() {
		if err := r.Series(p.At(), &lset, &chks); err != nil {
			log.Fatal(err)
		}
		for _, c := range chks {
			if c.MinTime < meta.MinTime || meta.MaxTime < c.MaxTime {
				fmt.Printf("posting=%d, MinTime=%s, maxTime=%s\n", p.At(), ms(c.MinTime), ms(c.MaxTime))
			}
		}
	}
	if err := p.Err(); err != nil {
		log.Fatal(err)
	}
}

func main() {
	for _, dir := range os.Args[1:] {
		check(dir)
	}
}
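
For reference, assuming the script above is saved as check.go somewhere the github.com/prometheus/tsdb packages are available, it can be pointed at one or more block directories (the path below is only an example):

$ go run check.go ~/prometheus/data/01C*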

@bwplotka
Contributor

bwplotka commented Jun 12, 2018

This is a nice finding!

Can you propose BenoitKnecht@23b26e7 as a PR? That would maybe speed things up. Even if it's not ready to go, we can iterate on it.

@bwplotka
Contributor

@fabxc I think this issue is actually the real root cause of this: thanos-io/thanos#183

@bwplotka
Contributor

Will propose a PR to fix this here.

BenoitKnecht added a commit to BenoitKnecht/tsdb that referenced this issue Jun 13, 2018
Reduce `block.MaxTime` by 1 millisecond so that `blocks[1].MinTime ==
blocks[0].MaxTime + 1`.

This prevents chunks that start exactly at the block end (i.e.
`chunk.MinTime == block.MaxTime`) from being included in that block.

See prometheus-junkyard#347 for a detailed explanation.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>
@BenoitKnecht
Contributor Author

@Bplotka Sorry, I didn't see your message before submitting my PR. Feel free to comment on mine or open a separate one with your fix, if you think it's appropriate.

@bwplotka
Contributor

bwplotka commented Jun 13, 2018

No no, it's super nice that you added yours, because I tried a totally different approach and I'm not sure which one is better.
Initially, I wanted to put this in Chunk(ref):

	// Do not expose chunks that are outside of the specified range.
	if c == nil || !intervalOverlap(mint, maxt, h.mint, h.maxt) {
		return nil, ErrNotFound
	}
	// Explicitly exclude chunks that would be duplicated.
	if mint == h.maxt {
		return nil, ErrNotFound
	}

But this is not ideal either, because technically when you are asking for the 0-10 range you kind of want the 10-20 chunk as well, since a sample with t=10 v=... is still within 0-10.
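
To spell out that concern with concrete numbers (a purely illustrative sketch, not tsdb code): the chunk that would be rejected by the mint == h.maxt check can still hold a sample the query is asking for.

package main

import "fmt"

func main() {
	// Hypothetical query range and the chunk that starts exactly at its end.
	queryMin, queryMax := int64(0), int64(10)
	chunkMin := int64(10)

	sampleT := int64(10) // sample stored in the 10-20 chunk

	wantedByQuery := queryMin <= sampleT && sampleT <= queryMax
	rejectedByCheck := chunkMin == queryMax // the `mint == h.maxt` exclusion above

	// Both are true: the excluded chunk holds a sample the query still wants.
	fmt.Println(wantedByQuery, rejectedByCheck)
}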

At the same time, I am not sure the approach from your PR works. For example, based on your explanation here:

where t is the timestamp of the data being added, and s.nextAt can be equal to block.MaxTime

This is very true, but there is no way to avoid this. Basically, chunks should be nicely non-overlapping:
0-10, 12-20, 21-29, 35-45 (the mint and maxt for chunks depend on scrape times), and there is no guarantee that we can find a block range that will NOT touch some chunk's mint. As far as I can tell, your PR #348 makes sure that TSDB blocks end and start 1 millisecond apart. I think they could even be spread 1h apart, and still some block A with range 0-1000 could touch some chunk with time range 1000-1010, which would then be included. (EDIT: This is not the only change; I missed the rangeTimeout change that affects the block range, plus chunk ranges are made to not cross boundaries.)

Does it make sense? (EDIT: Nope, not entirely)

@bwplotka
Contributor

bwplotka commented Jun 13, 2018

OK, I missed the fact that chunks are actually cut in a way that makes sure they don't cross boundaries. So we could potentially do as proposed in PR #348. I commented there on why I think it might be risky. Maybe we should add proper handling on compaction instead?

EDIT: just not allowing these chunks into the first block seems OK, since a 0-10 block is assumed to cover [0, 10) (the 10 timestamp being exclusive).

bwplotka added a commit to thanos-io/thanos that referenced this issue Jun 13, 2018
TSDB issue: prometheus-junkyard/tsdb#347

How do we handle it?
- segregate this issue into a special stat entry in the verifier.
- auto-fix broken plan blocks before Thanos compaction.
- add a repair job to run offline batch repairs for individual blocks.

NOTE: At this point we have no way of fixing the bug when someone uses local compaction ):

Fixes #354

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka
Contributor

@BenoitKnecht, @krasi-georgiev was asking a really good question: why are blocks overlapping in the first place? We need to know the answer, and whether changing blocks to actually be a millisecond apart breaks anything. @krasi-georgiev is right that our attempts to fix it are only workarounds... ):

Any ideas? @fabxc, do you remember why there is this inconsistency between block and chunk ranges? And is it safe to make blocks non-overlapping, like chunks?

@BenoitKnecht
Contributor Author

@BenoitKnecht, @krasi-georgiev was asking a really good question: why are blocks overlapping in the first place? We need to know the answer, and whether changing blocks to actually be a millisecond apart breaks anything. @krasi-georgiev is right that our attempts to fix it are only workarounds... ):

Yes, that's how I intended to fix this issue initially. It makes things much easier in tsdb, because all intervals would be treated the same way (see #348 (comment)). But I was told that treating blocks as [a, b) intervals was intentional, so I reworked the PR based on that.

I'm happy to revert to the initial implementation though, if the consensus is that it's the better approach.

@bwplotka
Contributor

I think we should merge your PR to actually fix the duplicated chunk bug, and aim for fixing the inconsistency (or figuring out why it is that way) as a next step.

bwplotka added a commit to thanos-io/thanos that referenced this issue Jun 15, 2018
…e. (#375)

TSDB issue: prometheus-junkyard/tsdb#347

How do we handle it?
- segregate this issue into a special stat entry in the verifier.
- auto-fix broken plan blocks before Thanos compaction.
- add a repair job to run offline batch repairs for individual blocks.

NOTE: At this point we have no way of fixing the bug when someone uses local compaction ):

Fixes #354

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
BenoitKnecht added a commit to BenoitKnecht/tsdb that referenced this issue Jun 18, 2018
Due to the way blocks used to overlap by 1 millisecond (see prometheus-junkyard#347), when
requesting a 2-hour interval starting at `blocks[1].MinTime`, the
`Querier` would consider three blocks: `blocks[0]`, `blocks[1]` and
`blocks[2]`, because `blocks[0].MaxTime` and `blocks[2].MinTime` were in
that interval.

However, if the blocks don't overlap, only two blocks should be
returned: `blocks[1]` and `blocks[2]`. This test ensures that it's
indeed the case.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>