This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Blocks can contain chunks that go past MaxTime #347

Closed
BenoitKnecht opened this issue Jun 12, 2018 · 10 comments

Comments

@BenoitKnecht
Contributor

BenoitKnecht commented Jun 12, 2018

I've been investigating thanos-io/thanos#354, and I think I've finally identified what's going on.

The issue came up because Thanos checks that all the chunks in a block fall between block.MinTime and block.MaxTime, but in some circumstances it detected chunks with block.MaxTime < chunk.MaxTime. Those chunks also happened to satisfy block.MaxTime == chunk.MinTime, which, as we'll see, is no coincidence. After initially suspecting that the issue came from Thanos, I'm now pretty sure it's in tsdb.

New chunks are created when appending data to the head using append(), specifically this part:

if t >= s.nextAt {
	c = s.cut(t)
	chunkCreated = true
}

where t is the timestamp of the data being added, and s.nextAt can be equal to block.MaxTime. So if Prometheus happens to scrape a metric whose timestamp is equal to block.MaxTime, it will create a chunk with chunk.MinTime == block.MaxTime.
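
To make the boundary case concrete, here's a minimal, self-contained sketch (not the actual tsdb code; the 7200000 boundary is just an illustrative 2-hour mark in milliseconds):

package main

import "fmt"

func main() {
	// Hypothetical 2-hour block boundary in milliseconds.
	blockMaxTime := int64(7200000)

	// s.nextAt can line up exactly with the block boundary...
	nextAt := blockMaxTime
	// ...and a scrape can produce a sample with exactly that timestamp.
	t := blockMaxTime

	// Same condition as in append(): a new chunk is cut starting at t,
	// so chunk.MinTime ends up equal to block.MaxTime.
	if t >= nextAt {
		chunkMinTime := t
		fmt.Printf("chunk.MinTime=%d == block.MaxTime=%d\n", chunkMinTime, blockMaxTime)
	}
}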

Now, when the head needs to be persisted to disk as a new block, this function determines which chunks are part of the block, specifically:

// Do not expose chunks that are outside of the specified range.
if c == nil || !intervalOverlap(mint, maxt, h.mint, h.maxt) {
	return nil, ErrNotFound
}

with

func intervalOverlap(amin, amax, bmin, bmax int64) bool {
	// Checks Overlap: http://stackoverflow.com/questions/3269434/
	return amin <= bmax && bmin <= amax
}

So a chunk is included in the block if it overlaps it in any way, including the case where chunk.MinTime == block.MaxTime.

This isn't ideal, because it means that most of the data in that last chunk is useless (it won't be read, since it's outside the range [block.MinTime, block.MaxTime]), and the chunk is also going to be part of the next block by the same criterion (it overlaps it as well), so we end up storing the same data twice.

It seems to me that the whole issue stems from the fact that by the way intervalOverlap() is defined, blocks overlap by 1 millisecond. For instance, with 2-hour blocks, if blocks[0].MinTime = 0, blocks[0].MaxTime = 7200000, then the next block will be blocks[1].MinTime = 7200000, blocks[1].MaxTime = 14400000. So a metric with t = 7200000 would be in blocks[0] and blocks[1] at the same time.
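
To illustrate, this small self-contained snippet reuses the intervalOverlap definition quoted above and shows that a chunk starting exactly on the boundary is considered part of both blocks (the chunk range is just an example):

package main

import "fmt"

// Same definition as in tsdb (quoted above).
func intervalOverlap(amin, amax, bmin, bmax int64) bool {
	return amin <= bmax && bmin <= amax
}

func main() {
	// Example chunk with chunk.MinTime == blocks[0].MaxTime == blocks[1].MinTime.
	chunkMin, chunkMax := int64(7200000), int64(7201000)

	fmt.Println(intervalOverlap(chunkMin, chunkMax, 0, 7200000))        // true: included in blocks[0]
	fmt.Println(intervalOverlap(chunkMin, chunkMax, 7200000, 14400000)) // true: included in blocks[1]
}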

A possible way to fix this would be to say that block.MaxTime is an exclusive upper bound (so a chunk would only be included if chunk.MinTime < block.MaxTime && block.MinTime <= chunk.MaxTime). The other option would be to make sure that blocks don't overlap by the current definition of intervalOverlap(), i.e. make sure that blocks[1].MinTime = blocks[0].MaxTime + 1.
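
For comparison, here is a minimal sketch of what the first option could look like (an illustration of the exclusive upper bound, not the code from the proof of concept):

package main

import "fmt"

// Sketch: include a chunk only if it starts strictly before block.MaxTime
// (exclusive upper bound) while still overlapping the block.
func chunkInBlock(chunkMin, chunkMax, blockMin, blockMax int64) bool {
	return chunkMin < blockMax && blockMin <= chunkMax
}

func main() {
	// The same boundary chunk as in the previous snippet.
	chunkMin, chunkMax := int64(7200000), int64(7201000)

	fmt.Println(chunkInBlock(chunkMin, chunkMax, 0, 7200000))        // false: no longer included in blocks[0]
	fmt.Println(chunkInBlock(chunkMin, chunkMax, 7200000, 14400000)) // true: included in blocks[1] only
}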

I've taken that second approach in this proof of concept.

To make reproducing the issue and testing the fix easier, I patched Prometheus to round timestamps down to the minute (so that they would fall on the block boundary): https://github.com/BenoitKnecht/prometheus/tree/fix-block-boundaries. You can use it with

$ git clone -b fix-block-boundaries https://github.com/BenoitKnecht/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --log.level=debug --storage.tsdb.min-block-duration=10m

with prometheus.yml containing:

scrape_configs:
- job_name: self
  metrics_path: /metrics
  scrape_interval: 60s
  static_configs:
  - targets:
    - localhost:9090
@BenoitKnecht
Contributor Author

In order to check for blocks with "outsider" chunks (i.e. chunks that fall outside the block's time range), I use this script:

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/prometheus/tsdb"
	"github.com/prometheus/tsdb/chunks"
	"github.com/prometheus/tsdb/index"
	"github.com/prometheus/tsdb/labels"
)

func ms(t int64) time.Time {
	return time.Unix(0, t*1e6)
}

func check(dir string) {
	block, err := tsdb.OpenBlock(dir, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer block.Close()

	meta := block.Meta()
	fmt.Printf("block=%s, MinTime=%s, MaxTime=%s, compaction=%d\n", block, ms(meta.MinTime), ms(meta.MaxTime), meta.Compaction.Level)

	r, err := block.Index()
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	p, err := r.Postings(index.AllPostingsKey())
	if err != nil {
		log.Fatal(err)
	}

	var (
		lset labels.Labels
		chks []chunks.Meta
	)

	for p.Next() {
		if err := r.Series(p.At(), &lset, &chks); err != nil {
			log.Fatal(err)
		}
		for _, c := range chks {
			if c.MinTime < meta.MinTime || meta.MaxTime < c.MaxTime {
				fmt.Printf("posting=%d, MinTime=%s, maxTime=%s\n", p.At(), ms(c.MinTime), ms(c.MaxTime))
			}
		}
	}
	if err := p.Err(); err != nil {
		log.Fatal(err)
	}
}

func main() {
	for _, dir := range os.Args[1:] {
		check(dir)
	}
}
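
For reference, assuming the script above is saved as check.go somewhere the github.com/prometheus/tsdb packages are available, it can be pointed at one or more block directories (the path below is only an example):

$ go run check.go ~/prometheus/data/01C*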

@bwplotka
Contributor

bwplotka commented Jun 12, 2018

This is a nice finding!

Can you propose BenoitKnecht@23b26e7 as a PR? That would maybe speed things up. Even if it's not ready to go, we can iterate on it.

@bwplotka
Contributor

@fabxc I think this issue is actually the real root cause of this: thanos-io/thanos#183

@bwplotka
Contributor

Will propose a PR to fix this here.

BenoitKnecht added a commit to BenoitKnecht/tsdb that referenced this issue Jun 13, 2018
Reduce `block.MaxTime` by 1 millisecond so that `blocks[1].MinTime ==
blocks[0].MaxTime + 1`.

This prevents chunks that start exactly at the block end (i.e.
`chunk.MinTime == block.MaxTime`) from being included in that block.

See prometheus-junkyard#347 for a detailed explanation.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>
@BenoitKnecht
Contributor Author

@Bplotka Sorry, I didn't see your message before submitting my PR. Feel free to comment on mine or open a separate one with your fix, if you think it's appropriate.

@bwplotka
Contributor

bwplotka commented Jun 13, 2018

No no, it's super nice that you added yours, because I tried a totally different approach and I'm not sure which one is better.
Initially, I wanted to put this in Chunk(ref):

	// Do not expose chunks that are outside of the specified range.
	if c == nil || !intervalOverlap(mint, maxt, h.mint, h.maxt) {
		return nil, ErrNotFound
	}
	// Explicitly exclude chunks that would be duplicated.
	if mint == h.maxt {
		return nil, ErrNotFound
	}

But this is not ideal either, because technically when you are asking for the 0-10 range you kind of want the 10-20 chunk as well, since a sample with t=10 v=... is still within 0-10.
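
To spell out that concern with concrete numbers (a purely illustrative sketch, not tsdb code): the chunk that would be rejected by the mint == h.maxt check can still hold a sample the query is asking for.

package main

import "fmt"

func main() {
	// Hypothetical query range and the chunk that starts exactly at its end.
	queryMin, queryMax := int64(0), int64(10)
	chunkMin := int64(10)

	sampleT := int64(10) // sample stored in the 10-20 chunk

	wantedByQuery := queryMin <= sampleT && sampleT <= queryMax
	rejectedByCheck := chunkMin == queryMax // the `mint == h.maxt` exclusion above

	// Both are true: the excluded chunk holds a sample the query still wants.
	fmt.Println(wantedByQuery, rejectedByCheck)
}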

At the same time, I am not sure the approach from your PR works. For example, based on your explanation here:

where t is the timestamp of the data being added, and s.nextAt can be equal to block.MaxTime

This is very true, but there is no way to avoid this. Basically, chunks should be nicely non-overlapping:
0-10, 12-20, 21-29, 35-45 (the mint and maxt for chunks depend on scrape times), and there is no guarantee that we can find a block range that will NOT touch some chunk's mint. As far as I can tell, your PR #348 makes sure that TSDB blocks end and start 1 millisecond apart. I think they could even be spread 1h apart, and still some block A with range 0-1000 could touch some chunk with time range 1000-1010, which would then be included. (EDIT: This is not the only change; I missed the rangeTimeout change that affects the block range, plus chunk ranges are made to not cross boundaries.)

Does it make sense? (EDIT: Nope, not entirely)

@bwplotka
Contributor

bwplotka commented Jun 13, 2018

OK, I missed the fact that chunks are actually cut in a way that makes sure they don't cross boundaries. So we could potentially do as proposed in PR #348. I commented there on why I think it might be risky. Maybe we should add proper handling on compaction instead?

EDIT: just not allowing these chunks into the first block seems OK, since a 0-10 block is assumed to cover [0, 10) (the 10 timestamp being exclusive).

bwplotka added a commit to thanos-io/thanos that referenced this issue Jun 13, 2018
TSDB issue: prometheus-junkyard/tsdb#347

How do we handle it?
- segregate this issue into a special stat entry in the verifier.
- auto-fix broken plan blocks before Thanos compaction.
- add a repair job to run offline batch repairs for individual blocks.

NOTE: At this point we have no way of fixing the bug when someone uses local compaction ):

Fixes #354

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka
Contributor

@BenoitKnecht, @krasi-georgiev was asking a really good question: why are blocks overlapping in the first place? We need to know the answer, and whether changing blocks to actually be a millisecond apart breaks anything. @krasi-georgiev is right that our attempts to fix it are only workarounds... ):

Any ideas? @fabxc, do you remember why there is this inconsistency between block and chunk ranges? And is it safe to make blocks non-overlapping, like chunks?

@BenoitKnecht
Contributor Author

@BenoitKnecht, @krasi-georgiev was asking a really good question: why are blocks overlapping in the first place? We need to know the answer, and whether changing blocks to actually be a millisecond apart breaks anything. @krasi-georgiev is right that our attempts to fix it are only workarounds... ):

Yes, that's how I intended to fix this issue initially. It makes things much easier in tsdb, because all intervals would be treated the same way (see #348 (comment)). But I was told that treating blocks as [a, b) intervals was intentional, so I reworked the PR based on that.

I'm happy to revert to the initial implementation though, if the consensus is that it's the better approach.

@bwplotka
Contributor

I think we should merge your PR to actually fix the duplicated chunk bug, and aim for fixing the inconsistency (or figuring out why it is that way) as a next step.

bwplotka added a commit to thanos-io/thanos that referenced this issue Jun 15, 2018
…e. (#375)

TSDB issue: prometheus-junkyard/tsdb#347

How do we handle it?
- segregate this issue into a special stat entry in the verifier.
- auto-fix broken plan blocks before Thanos compaction.
- add a repair job to run offline batch repairs for individual blocks.

NOTE: At this point we have no way of fixing the bug when someone uses local compaction ):

Fixes #354

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
BenoitKnecht added a commit to BenoitKnecht/tsdb that referenced this issue Jun 18, 2018
Due to the way blocks used to overlap by 1 millisecond (see prometheus-junkyard#347), when
requesting a 2-hour interval starting at `blocks[1].MinTime`, the
`Querier` would consider three blocks: `blocks[0]`, `blocks[1]` and
`blocks[2]`, because `blocks[0].MaxTime` and `blocks[2].MinTime` were in
that interval.

However, if the blocks don't overlap, only two blocks should be
returned: `blocks[1]` and `blocks[2]`. This test ensures that it's
indeed the case.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>