
Need overflow check for `--storage.tsdb.retention.time` #5398

Closed
beorn7 opened this Issue Mar 22, 2019 · 8 comments

Comments

@beorn7
Member

beorn7 commented Mar 22, 2019

Bug Report

What did you do?

Ran prometheus --storage.tsdb.retention.time=300y

What did you expect to see?

A Prometheus server deleting only samples more than 300y old. Or, more realistically, an error that the retention time is too long.

What did you see instead? Under which circumstances?

The retention time, according to the /flags page, was set to -8985944073709ms.

Environment

  • Prometheus version:
prometheus, version 2.8.0 (branch: master, revision: b95f4337a8f8e4b4ddea6fcab4a6e4886e7ba978)
  build user:       bjoern@notlenovo
  build date:       20190322-15:00:39
  go version:       go1.12

Context

We are well aware that our retention time is limited to what is expressible with the Go time.Duration type, i.e. ~290y, which should be enough for most purposes.

However, with the new --storage.tsdb.retention.size flag, users might be tempted to set storage.tsdb.retention.time to something absurdly high to make sure samples are only deleted once the disk is filled. If they are unlucky, the time.Duration overflow might result in a very short retention time.
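
For illustration, here is a minimal standalone Go sketch of that wraparound (assuming 1y is treated as 365 days, as model.ParseDuration does; this is plain int64 arithmetic, not Prometheus code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// One "y" (365 days) in nanoseconds, the unit underlying time.Duration.
	year := int64(365 * 24 * time.Hour)

	// 300 * year is ~9.46e18 ns, which exceeds math.MaxInt64 (~9.22e18 ns,
	// i.e. roughly 292 years), so the multiplication silently wraps around.
	d := 300 * year

	fmt.Println(d)                               // -8985944073709551616 (ns)
	fmt.Println(d/int64(time.Millisecond), "ms") // -8985944073709 ms, the value shown on /flags
}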

@pranjaltale16


pranjaltale16 commented Mar 23, 2019

Hello, I would like to work on this issue. I think we can add a check here. I'm new to Prometheus, so I'm not sure whether this is the best way to fix it. Let me know if there is some other way to fix this issue.

@codesome

Member

codesome commented Mar 23, 2019

There is already a check:

// Check for overflows. This limits our max retention to 100y.
if cfg.tsdb.RetentionDuration < 0 {
	y, err := model.ParseDuration("100y")
	if err != nil {
		panic(err)
	}
	cfg.tsdb.RetentionDuration = y
	level.Info(logger).Log("msg", "time retention value is too high. Limiting to: "+y.String())
}

It should have been set to 100y if there was an overflow. Not sure why it showed negative on the UI.

@beorn7 was there any log mentioning that time retention value is too high?

@codesome

Member

codesome commented Mar 23, 2019

On another note, /flags shows what was passed on the command line for time and size; /status shows what was actually applied.

When I pass 300y, I see -8985944073709ms in /flags and 100y in /status. @beorn7 can you confirm?

@pranjaltale16


pranjaltale16 commented Mar 23, 2019

I think the problem is that we are not updating the newFlagRetentionDuration value. If we pass 300y, the value of cfg.tsdb.RetentionDuration is set to a negative number here. After checking the value of cfg.tsdb.RetentionDuration we set it to 100y, but we never update the flag value. This can be fixed by updating the newFlagRetentionDuration value in this block:

if cfg.tsdb.RetentionDuration < 0 {
	y, err := model.ParseDuration("100y")
	if err != nil {
		panic(err)
	}
	cfg.tsdb.RetentionDuration = y
	if newFlagRetentionDuration != 0 {
		newFlagRetentionDuration = y
	}
	level.Info(logger).Log("msg", "time retention value is too high. Limiting to: "+y.String())
}
@codesome

Member

codesome commented Mar 23, 2019

As I said, /flags shows what was passed on the command line for time and size; /status shows what was actually applied. So /flags should display 300y here.

The fix for this could be to display the string passed in the flag if we encounter an overflow. Note that this is just for the UI; the corrected value is already being used via cfg.tsdb.RetentionDuration.

@brian-brazil

Member

brian-brazil commented Mar 23, 2019

So /flags should display 300y here.

That depends on kingpin library internals. I wouldn't go to a lot of effort to adjust this; we already have a log message and the right value on /status.

@beorn7

Member Author

beorn7 commented Mar 23, 2019

All good, saw the logging now, and the correction. I was just confused by the /flags output.
As the case is pretty rare, it's probably not worth creating a special case in flag handling here, as @brian-brazil said.
My only remaining thought is that fixing a flag value should perhaps be logged at warn level rather than only info. But that's a minor concern. Closing this.

beorn7 closed this Mar 23, 2019

@brian-brazil

Member

brian-brazil commented Mar 23, 2019

Yeah, warn would be better.
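
For reference, a minimal sketch of that change against the overflow check quoted earlier in this thread (illustrative only; the only difference is the log level):

if cfg.tsdb.RetentionDuration < 0 {
	y, err := model.ParseDuration("100y")
	if err != nil {
		panic(err)
	}
	cfg.tsdb.RetentionDuration = y
	// Bumped from level.Info to level.Warn as suggested above.
	level.Warn(logger).Log("msg", "time retention value is too high. Limiting to: "+y.String())
}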
