tsdb/agent: fix validation of default options #9876

Merged
3 commits merged into prometheus:main from agent-fix-defaults on Sep 27, 2022

Conversation

@rfratto (Member) commented Nov 26, 2021

MaxTS is supposed to be constrained to no less than the truncation interval, but this calculation was incorrect causing MaxTS to be multiplied by 1h.
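
For context, a minimal sketch of the kind of floor described above, assuming MaxWALTime is the value being clamped (maxTS is later derived from it) and that options are carried in milliseconds; the exact expressions are illustrative, not copied from db.go:

```go
package main

import (
	"fmt"
	"time"
)

// Options carries only the fields relevant to this sketch; the real
// tsdb/agent Options struct has more fields and may differ in types.
type Options struct {
	TruncateFrequency time.Duration // how often the WAL truncation loop runs
	MaxWALTime        int64         // maximum age of data kept in the WAL, in milliseconds
}

func validate(opts *Options) {
	// MaxWALTime should be floored at one truncation interval, expressed in ms.
	//
	// Buggy shape (roughly what this PR fixes): an extra time.Hour factor
	// makes the floor vastly larger than intended:
	//   floor := int64(opts.TruncateFrequency * time.Hour / time.Millisecond)
	//
	// Corrected shape: convert the duration to milliseconds exactly once.
	floor := int64(opts.TruncateFrequency / time.Millisecond)
	if opts.MaxWALTime < floor {
		opts.MaxWALTime = floor
	}
}

func main() {
	opts := &Options{
		TruncateFrequency: 2 * time.Hour,
		MaxWALTime:        (1 * time.Hour).Milliseconds(),
	}
	validate(opts)
	fmt.Println(opts.MaxWALTime) // 7200000, i.e. 2h in milliseconds
}
```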

Signed-off-by: Robert Fratto robertfratto@gmail.com

@codesome (Member) left a comment

I just realised that we have one fundamental thing different from Prometheus TSDB, which I would like to fix/sync: not assuming the unit of time, even for options, since they are related to the samples.

I see that we are comparing MaxWALTime with the wall clock in milliseconds, and also with TruncateFrequency in milliseconds.

Could we make it all a function of the timestamps on the samples? For the validation between TruncateFrequency and MaxWALTime in validateOptions, we can document the relationship and expect the user of this DB to pass the right numbers, since the caller can assume the unit of the timestamps.
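
As a rough illustration of the mixing being described (names assumed, not taken verbatim from db.go), the truncation bound is derived from the wall clock, so the DB itself fixes the unit of MaxWALTime instead of leaving it to the caller:

```go
package main

import (
	"fmt"
	"time"
)

// maxTS computes the oldest timestamp the WAL should still retain, assuming
// maxWALTime is expressed in milliseconds. nowMs plays the role of the
// timestamp.FromTime(time.Now()) mentioned later in the thread:
// wall-clock milliseconds, regardless of what unit samples actually use.
func maxTS(now time.Time, maxWALTime int64) int64 {
	nowMs := now.UnixNano() / int64(time.Millisecond)
	return nowMs - maxWALTime
}

func main() {
	// Anything older than four hours ago (by the wall clock) may be truncated.
	fmt.Println(maxTS(time.Now(), (4 * time.Hour).Milliseconds()))
}
```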

The stale bot added the stale label on Jan 31, 2022
@rfratto (Member, Author) commented Feb 17, 2022

Sorry, I'm a bit lost on the suggestion.

MinWALTime/MaxWALTime are wall-clock based right now. The original intent was that, if no samples are coming in, we can still delete stale series by comparing against the wall clock via timestamp.FromTime(time.Now()).

Is there another way for us to remove samples without using the wall clock?

@codesome (Member) commented, quoting rfratto's question:

Is there another way for us to remove samples without using the wall clock?

If no samples are coming in for any series, then the wall clock would be the way. If there is at least one series that is receiving samples, its maxt can be used as a reference to check whether other series are old.

But since the agent is remote-write only, it might be OK to use the wall clock, since once we are done with the remote write there is no further use for that WAL.
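
A hypothetical sketch of that alternative (all names here are invented for illustration): prefer the highest appended sample timestamp as the staleness reference, and fall back to the wall clock only when nothing is being appended:

```go
package sketch

// referenceTS picks the timestamp used to judge whether series are stale.
// highestAppended is the max sample timestamp across all series, or 0 if no
// series has received a sample; nowMs supplies wall-clock milliseconds.
func referenceTS(highestAppended int64, nowMs func() int64) int64 {
	if highestAppended > 0 {
		// The unit follows whatever the caller appends, as suggested above.
		return highestAppended
	}
	// No samples at all: the wall clock is the only reference left.
	return nowMs()
}
```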

@tdenof commented Sep 21, 2022

Hello, thanks for this PR.
This issue causes the WAL to grow indefinitely when remote_write is unavailable, because the check in the main loop that sets mint to maxTS is never true (https://github.com/prometheus/prometheus/blob/main/tsdb/agent/db.go#L542): maxTS is negative, since MaxWALTime was multiplied by 1 hour.
Is there any chance of seeing this PR merged soon? Maybe I should create an issue?
(I also agree that either MaxWALTime should be set to MinWALTime when MaxWALTime < MinWALTime, or this should simply be prevented when parsing the command-line options, but I guess that can be the subject of another PR, since in practice it doesn't cause any harm or data loss.)
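
To make the symptom concrete, a hedged sketch (values and field names are illustrative, not taken from db.go) of why an inflated MaxWALTime pushes maxTS below zero, plus the min/max clamp mentioned in the parenthesis above:

```go
package main

import (
	"fmt"
	"time"
)

type Options struct {
	MinWALTime int64 // milliseconds
	MaxWALTime int64 // milliseconds
}

func main() {
	nowMs := time.Now().UnixNano() / int64(time.Millisecond)

	// With the bug, MaxWALTime ends up far larger than the current Unix time
	// in milliseconds, so maxTS = now - MaxWALTime is negative and the branch
	// that caps mint at maxTS never fires. The exact factor doesn't matter;
	// here a 4h MaxWALTime is inflated by the number of milliseconds in an hour.
	inflated := (4 * time.Hour).Milliseconds() * int64(time.Hour/time.Millisecond)
	fmt.Println(nowMs - inflated) // negative

	// The clamp suggested above: never let MaxWALTime fall below MinWALTime.
	opts := &Options{
		MinWALTime: (5 * time.Minute).Milliseconds(),
		MaxWALTime: (1 * time.Minute).Milliseconds(),
	}
	if opts.MaxWALTime < opts.MinWALTime {
		opts.MaxWALTime = opts.MinWALTime
	}
	fmt.Println(opts.MaxWALTime) // 300000, i.e. 5m in milliseconds
}
```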

@phillebaba commented Sep 22, 2022

@tdenof I've basically come to the same conclusion as you. I have been very confused seeing the WAL keep growing during network issues. I would also like to see this merged, as there is no real workaround by overriding the validation.

I'm going to build my own fork with these changes and report back with the results.

@tdenof commented Sep 22, 2022

@phillebaba I did the same and built the binary with these changes, and I can confirm it works as expected: the checkpoint WAL size on the filesystem stops growing after reaching MaxWALTime, and the debug log produced before truncating the WAL shows the right timestamp for ts: https://github.com/prometheus/prometheus/blob/main/tsdb/agent/db.go#L546
To speed up the test, I used --storage.agent.wal-truncate-frequency 1m --storage.agent.retention.max-time 5m as CLI args (the truncate frequency doesn't appear in --help because it's a hidden CLI parameter; I don't know why, to be honest).

@phillebaba commented
I am seeing the same: the disk no longer keeps growing when the connection to the remote write target goes down. It also seems to solve restart issues I have seen after this occurs.

MaxTS was being incorrectly constrained to the truncation interval

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
@rfratto (Member, Author) commented Sep 23, 2022

@codesome since this fixes an issue people are running into, can we merge this as-is and revisit your earlier request in a separate issue/PR?

@rfratto (Member, Author) commented Sep 23, 2022

This had fallen off my radar for quite a while, sorry everyone. Hopefully we can get this merged soon.

@codesome merged commit 448cfda into prometheus:main on Sep 27, 2022
@codesome mentioned this pull request on Sep 27, 2022
@rfratto deleted the agent-fix-defaults branch on September 27, 2022 14:16
roidelapluie pushed a commit to roidelapluie/prometheus that referenced this pull request Nov 4, 2022
* tsdb/agent: fix application of defaults

MaxTS was being incorrectly constrained to the truncation interval

* add more tests to check validation

* force MaxWALTime = MinWALTime if min > max

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
roidelapluie pushed a commit that referenced this pull request Nov 4, 2022 (same commit message and sign-off as above)