
MaxBlockDuration is 31 days when only using size based retention configuration #6857

Open
richardwilko opened this issue Feb 21, 2020 · 10 comments

@richardwilko

cfg.tsdb.MaxBlockDuration defaults to 31 days, not 10% of the retention period, when only size based retention is used.

The bug is on line 313 of cmd/prometheus/main.go: the 10% scaling is only applied when the retention duration is non-zero, but the retention duration is zero when only storage.tsdb.retention.size is set.

This is currently an issue on master.
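
For reference, the logic in question looks roughly like this (a paraphrased, runnable sketch of the behaviour described above, not the verbatim upstream source):

```go
package main

import (
	"fmt"
	"time"
)

// pickMaxBlockDuration paraphrases the default-selection logic discussed
// above (around line 313 of cmd/prometheus/main.go); it is a sketch, not
// the verbatim upstream code.
func pickMaxBlockDuration(configured, retention time.Duration) time.Duration {
	if configured != 0 {
		return configured // explicitly set via --storage.tsdb.max-block-duration
	}
	maxBlockDuration := 31 * 24 * time.Hour // default: 31d
	// The 10% cap is only applied when a time-based retention is set, so a
	// size-only configuration (retention == 0) keeps the full 31d default.
	if retention != 0 && retention/10 < maxBlockDuration {
		maxBlockDuration = retention / 10
	}
	return maxBlockDuration
}

func main() {
	// Size-only retention: no time retention, so the result is the full 31d.
	fmt.Println(pickMaxBlockDuration(0, 0)) // 744h0m0s
	// With --storage.tsdb.retention.time=34d the cap applies: 34d/10 = 3.4d.
	fmt.Println(pickMaxBlockDuration(0, 34*24*time.Hour)) // 81h36m0s
}
```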

@brian-brazil
Contributor

That seems right to me; we want the default to be 31d in that case.

@richardwilko
Author

I currently have a size based cutoff (50GB), and the first retention clean-up deleted almost all my metric history, because the largest block contained almost all of it.

Maybe 50GB is pretty small compared to a usual case, but it's quite unexpected to lose almost all my history.

Clearly I can set both a time based and a size based retention to 'fix' this, as it will force smaller block sizes, but it's not obvious. Maybe it's just a case of updating the docs to make this clear?

@brian-brazil
Contributor

If you want to keep 30d of history, you're going to need ~60d of disk space given how everything works: a block is only deleted once its entire range falls outside the retention window, so with blocks approaching 30d long you can briefly hold close to double the retention. Changing the retention period doesn't really change that.

@dprittie

@brian-brazil - I don't think that is true. Surely if storage.tsdb.retention.time = 34d then this bit of code from cmd/prometheus/main.go comes into effect: maxBlockDuration = cfg.tsdb.RetentionDuration / 10. The max block duration would then be set to 3.4 days, so when the retention time of 34d is exceeded a 3.4 day block would be removed, leaving at least 30 days of data always visible.

I have exactly the same problem as @richardwilko. I am currently only using size based retention, and I see blocks created which are as big as 45% of my retention size, whereas ideally this would never be larger than 10%.

Would it not be possible to implement a similar strategy for max block size as is currently done for database retention, i.e. consider both size and length and take whichever limit is hit first? I am going to take a look and see if I can put together a PR for that, but obviously it would not be worth doing if you don't think this is a valid approach.

@dprittie

@richardwilko - my current workaround is to set storage.tsdb.max-block-duration=2d. This is obviously not ideal, as it requires writing a little logic into whatever method you use to launch Prometheus to determine how many days' worth of data your size based retention can handle, then multiplying that by 0.10. If, like us, the amount of data being ingested varies quite a bit over time, you need to restart Prometheus regularly in order to recalculate a sensible value for storage.tsdb.max-block-duration.
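
For illustration, the launch-time calculation this implies might look like the following sketch (the function name and the ingestion-rate figure are made up; you'd substitute your own measured bytes-per-day):

```go
package main

import (
	"fmt"
	"time"
)

// estimateMaxBlockDuration returns 10% of the time the size-based retention
// can hold, given an estimated ingestion rate. Both inputs are things you
// have to measure yourself; this helper is hypothetical, not a Prometheus API.
func estimateMaxBlockDuration(retentionSizeBytes, bytesPerDay int64) time.Duration {
	daysRetained := float64(retentionSizeBytes) / float64(bytesPerDay)
	return time.Duration(daysRetained * 0.10 * 24 * float64(time.Hour))
}

func main() {
	const gib = int64(1) << 30
	// e.g. 50GiB retention at ~2.5GiB/day ingested => ~20 days retained,
	// so max-block-duration comes out at roughly 2 days (48h).
	d := estimateMaxBlockDuration(50*gib, int64(2.5*float64(gib)))
	fmt.Printf("--storage.tsdb.max-block-duration=%s\n", d)
}
```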

@brian-brazil
Contributor

The problem is that if there's a size configured but no time, we have no idea what the size translates to in time terms.

@dprittie

So it's not possible to work out how much data has been written to a block and stop writing once you hit 10% of your storage.tsdb.retention.size?

Does that mean that the storage.tsdb.retention.size setting is intended to only ever be used in conjunction with storage.tsdb.retention.time?

@brian-brazil
Contributor

That's not how compaction works; once we've chosen to compact, we work series by series rather than in time slices.

@bboreham
Member

bboreham commented Apr 9, 2024

Reviewing this at the bug scrub, we agreed both with the sentiment that 31 days is a very big block for most people, and that Prometheus can't easily target a size in bytes.

Some suggestions came up in discussion:

  • If a block is already over 10% of the max retention size, then don't include it in further compaction (see the sketch after this list). This will avoid the worst symptoms.
  • Drop the default max from 31 days to something more like 4 days; then it will work fine for most people, and those who really want enormous blocks can configure them. This might need to be a Prometheus 3.0 change.
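
As a very rough sketch of the first suggestion (all names and types below are hypothetical, not Prometheus's actual compaction planner):

```go
package tsdbsketch

// blockMeta stands in for real tsdb block metadata; the type and the
// function below are hypothetical illustrations of the idea above.
type blockMeta struct {
	sizeBytes int64
}

// filterOversizedBlocks drops blocks that are already over 10% of the
// size-based retention from the candidate set, so they are never selected
// for further compaction.
func filterOversizedBlocks(blocks []blockMeta, retentionSizeBytes int64) []blockMeta {
	limit := retentionSizeBytes / 10
	kept := blocks[:0]
	for _, b := range blocks {
		if b.sizeBytes > limit {
			continue // already oversized: exclude from future compactions
		}
		kept = append(kept, b)
	}
	return kept
}
```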

@frittentheke

frittentheke commented Apr 10, 2024

@bboreham I know everybody loves their own bugs the most, but while reading your suggestions I remembered running into and reporting yet another issue in relation to size based retention: #11112. My issue is more about running out of disk space during rotations and compactions, but I'd still love for Prometheus to be able to just work with the given space (volume), without any manual tuning, adjustments or guesses about how much churn there is or how large compaction might become.
