tsdb/agent: fix validation of default options #9876
I just realised that one fundamental thing differs from the Prometheus TSDB, and I would like to fix/sync it: we should not assume the unit of time, even for options, since they relate to the samples.
I see that we are comparing MaxWALTime against the wall clock in milliseconds, and also against TruncateFrequency in milliseconds.
Could we make it all a function of the timestamp on the sample? For the validations that we do between TruncateFrequency and MaxWALTime in validateOptions, we can document the relationship and expect the user of this DB to give the right numbers, since the caller can assume the unit of the timestamp.
Sorry, I'm a bit lost on the suggestion. MinWALTime/MaxWALTime are wall-clock based right now. The original intent was that if we have no samples coming in, we can still delete stale series by comparing against the wall clock. Is there another way for us to remove samples without using the wall clock?
0c759c0 to 5b3a767
If no samples are incoming for any series, then the wall clock would be the way. If at least one series is getting samples, its maxt can be used as a reference to check whether other series are old. But since the agent is remote write only, it might be OK to use the wall clock, since once we are done with the remote write there is no more use for that WAL.
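The trade-off described above can be sketched as follows. This is a minimal illustration, not the agent's code: `seriesState`, `referenceTime`, and `staleRefs` are hypothetical names, and timestamps are assumed to be in milliseconds. It shows that a sample-based reference never marks the newest series as stale, while a wall-clock reference eventually marks every idle series.

```go
package main

import "fmt"

// seriesState is a hypothetical record of one series and the timestamp
// of its most recent sample (assumed to be milliseconds in this sketch).
type seriesState struct {
	ref    uint64
	lastTs int64
}

// referenceTime returns the newest sample timestamp across all series.
// When there are no series at all, it falls back to the wall clock.
func referenceTime(series []seriesState, wallClockMs int64) int64 {
	if len(series) == 0 {
		return wallClockMs
	}
	ref := series[0].lastTs
	for _, s := range series[1:] {
		if s.lastTs > ref {
			ref = s.lastTs
		}
	}
	return ref
}

// staleRefs returns the refs of series whose last sample is more than
// maxAgeMs older than refTime.
func staleRefs(series []seriesState, refTime, maxAgeMs int64) []uint64 {
	var out []uint64
	for _, s := range series {
		if refTime-s.lastTs > maxAgeMs {
			out = append(out, s.ref)
		}
	}
	return out
}

func main() {
	series := []seriesState{{ref: 1, lastTs: 1_000}, {ref: 2, lastTs: 900_000}}
	maxAge := int64(600_000)

	// Sample-based reference: only series 1 is old relative to series 2.
	fmt.Println(staleRefs(series, referenceTime(series, 5_000_000), maxAge)) // [1]

	// Wall-clock reference: both series are old once ingestion has stopped.
	fmt.Println(staleRefs(series, 5_000_000, maxAge)) // [1 2]
}
```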
Hello, thanks for this PR.
@tdenof I basically came to the same conclusion as you. I was very confused to see the WAL keep growing during network issues. I would also like to see this merged, as there is no real workaround by overriding the validation. I am going to build my own fork with these changes and report back with the results.
@phillebaba I did the same and built the binary with these changes. I can confirm it works as expected: the checkpoint WAL size stops growing on the filesystem after reaching MaxWALTime, and the debug log produced before truncating the WAL shows the right timestamp for
I am seeing the same: the disk no longer keeps growing when the connection to the remote write target goes down. It also seems to solve the restart issues I have seen after this occurs.
MaxTS was being incorrectly constrained to the truncation interval Signed-off-by: Robert Fratto <robertfratto@gmail.com>
5b3a767 to 87fe99c
This had fallen off my radar for quite a while, sorry everyone. Hopefully we can get this merged soon.
* tsdb/agent: fix application of defaults
  MaxTS was being incorrectly constrained to the truncation interval
* add more tests to check validation
* force MaxWALTime = MinWALTime if min > max
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
MaxTS is supposed to be constrained to no less than the truncation interval, but this calculation was incorrect, causing MaxTS to be multiplied by 1h.
Signed-off-by: Robert Fratto <robertfratto@gmail.com>