tsdb/agent: fix validation of default options #9876

Merged
3 commits merged into prometheus:main from agent-fix-defaults on Sep 27, 2022

Conversation

@rfratto (Member) commented Nov 26, 2021

MaxTS is supposed to be constrained to no less than the truncation interval, but this calculation was incorrect causing MaxTS to be multiplied by 1h.
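
For context, a minimal sketch of the kind of floor described above, assuming MaxWALTime is the value being clamped (maxTS is later derived from it) and that options are carried in milliseconds; the exact expressions are illustrative, not copied from db.go:

```go
package main

import (
	"fmt"
	"time"
)

// Options carries only the fields relevant to this sketch; the real
// tsdb/agent Options struct has more fields and may differ in types.
type Options struct {
	TruncateFrequency time.Duration // how often the WAL truncation loop runs
	MaxWALTime        int64         // maximum age of data kept in the WAL, in milliseconds
}

func validate(opts *Options) {
	// MaxWALTime should be floored at one truncation interval, expressed in ms.
	//
	// Buggy shape (roughly what this PR fixes): an extra time.Hour factor
	// makes the floor vastly larger than intended:
	//   floor := int64(opts.TruncateFrequency * time.Hour / time.Millisecond)
	//
	// Corrected shape: convert the duration to milliseconds exactly once.
	floor := int64(opts.TruncateFrequency / time.Millisecond)
	if opts.MaxWALTime < floor {
		opts.MaxWALTime = floor
	}
}

func main() {
	opts := &Options{
		TruncateFrequency: 2 * time.Hour,
		MaxWALTime:        (1 * time.Hour).Milliseconds(),
	}
	validate(opts)
	fmt.Println(opts.MaxWALTime) // 7200000, i.e. 2h in milliseconds
}
```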

Signed-off-by: Robert Fratto robertfratto@gmail.com

@codesome (Member) left a comment

I just realised that we have one fundamental thing different from Prometheus TSDB, which I would like to fix/sync: not assuming the unit of time, even for options, since they are related to the samples.

I see that we are comparing MaxWALTime with the wall clock in milliseconds, and also with TruncateFrequency in milliseconds.

Could we make it all a function of the timestamps on the samples? For the validation between TruncateFrequency and MaxWALTime in validateOptions, we can document the relationship and expect the user of this DB to pass the right numbers, since the caller can assume the unit of the timestamps.
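
As a rough illustration of the mixing being described (names assumed, not taken verbatim from db.go), the truncation bound is derived from the wall clock, so the DB itself fixes the unit of MaxWALTime instead of leaving it to the caller:

```go
package main

import (
	"fmt"
	"time"
)

// maxTS computes the oldest timestamp the WAL should still retain, assuming
// maxWALTime is expressed in milliseconds. nowMs plays the role of the
// timestamp.FromTime(time.Now()) mentioned later in the thread:
// wall-clock milliseconds, regardless of what unit samples actually use.
func maxTS(now time.Time, maxWALTime int64) int64 {
	nowMs := now.UnixNano() / int64(time.Millisecond)
	return nowMs - maxWALTime
}

func main() {
	// Anything older than four hours ago (by the wall clock) may be truncated.
	fmt.Println(maxTS(time.Now(), (4 * time.Hour).Milliseconds()))
}
```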

The stale bot added the stale label on Jan 31, 2022
@rfratto (Member, Author) commented Feb 17, 2022

Sorry, I'm a bit lost on the suggestion.

MinWALTime/MaxWALTime are wall-clock based right now. The original intent was that, if no samples are coming in, we can still delete stale series by comparing against the wall clock via timestamp.FromTime(time.Now()).

Is there another way for us to remove samples without using the wall clock?

@codesome (Member) commented, quoting rfratto's question:

Is there another way for us to remove samples without using the wall clock?

If no samples are coming in for any series, then the wall clock would be the way. If there is at least one series that is receiving samples, its maxt can be used as a reference to check whether other series are old.

But since the agent is remote-write only, it might be OK to use the wall clock, since once we are done with the remote write there is no further use for that WAL.
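
A hypothetical sketch of that alternative (all names here are invented for illustration): prefer the highest appended sample timestamp as the staleness reference, and fall back to the wall clock only when nothing is being appended:

```go
package sketch

// referenceTS picks the timestamp used to judge whether series are stale.
// highestAppended is the max sample timestamp across all series, or 0 if no
// series has received a sample; nowMs supplies wall-clock milliseconds.
func referenceTS(highestAppended int64, nowMs func() int64) int64 {
	if highestAppended > 0 {
		// The unit follows whatever the caller appends, as suggested above.
		return highestAppended
	}
	// No samples at all: the wall clock is the only reference left.
	return nowMs()
}
```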

@tdenof commented Sep 21, 2022

Hello, thanks for this PR.
This issue causes the WAL to grow indefinitely when remote_write is unavailable, because the check in the main loop that sets mint to maxTS is never true (https://github.com/prometheus/prometheus/blob/main/tsdb/agent/db.go#L542): maxTS is negative, since MaxWALTime was multiplied by 1 hour.
Is there any chance of seeing this PR merged soon? Maybe I should create an issue?
(I also agree that either MaxWALTime should be set to MinWALTime when MaxWALTime < MinWALTime, or this should simply be prevented when parsing the command-line options, but I guess that can be the subject of another PR, since in practice it doesn't cause any harm or data loss.)
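
To make the symptom concrete, a hedged sketch (values and field names are illustrative, not taken from db.go) of why an inflated MaxWALTime pushes maxTS below zero, plus the min/max clamp mentioned in the parenthesis above:

```go
package main

import (
	"fmt"
	"time"
)

type Options struct {
	MinWALTime int64 // milliseconds
	MaxWALTime int64 // milliseconds
}

func main() {
	nowMs := time.Now().UnixNano() / int64(time.Millisecond)

	// With the bug, MaxWALTime ends up far larger than the current Unix time
	// in milliseconds, so maxTS = now - MaxWALTime is negative and the branch
	// that caps mint at maxTS never fires. The exact factor doesn't matter;
	// here a 4h MaxWALTime is inflated by the number of milliseconds in an hour.
	inflated := (4 * time.Hour).Milliseconds() * int64(time.Hour/time.Millisecond)
	fmt.Println(nowMs - inflated) // negative

	// The clamp suggested above: never let MaxWALTime fall below MinWALTime.
	opts := &Options{
		MinWALTime: (5 * time.Minute).Milliseconds(),
		MaxWALTime: (1 * time.Minute).Milliseconds(),
	}
	if opts.MaxWALTime < opts.MinWALTime {
		opts.MaxWALTime = opts.MinWALTime
	}
	fmt.Println(opts.MaxWALTime) // 300000, i.e. 5m in milliseconds
}
```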

@phillebaba commented Sep 22, 2022

@tdenof I've basically come to the same conclusion as you. I have been very confused seeing the WAL keep growing during network issues. I would also like to see this merged, as there is no real workaround by overriding the validation.

I'm going to build my own fork with these changes and report back with the results.

@tdenof commented Sep 22, 2022

@phillebaba I did the same and built the binary with these changes, and I can confirm it works as expected: the checkpoint WAL size on the filesystem stops growing after reaching MaxWALTime, and the debug log produced before truncating the WAL shows the right timestamp for ts: https://github.com/prometheus/prometheus/blob/main/tsdb/agent/db.go#L546
To speed up the test, I used --storage.agent.wal-truncate-frequency 1m --storage.agent.retention.max-time 5m as CLI args (the truncate frequency doesn't appear in --help because it's a hidden CLI parameter; I don't know why, to be honest).

@phillebaba commented
I am seeing the same: the disk no longer keeps growing when the connection to the remote write target goes down. It also seems to solve restart issues I have seen after this occurs.

MaxTS was being incorrectly constrained to the truncation interval

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
@rfratto (Member, Author) commented Sep 23, 2022

@codesome since this fixes an issue people are running into, can we merge this as-is and revisit your earlier request in a separate issue/PR?

@rfratto (Member, Author) commented Sep 23, 2022

This had fallen off my radar for quite a while, sorry everyone. Hopefully we can get this merged soon.

@codesome merged commit 448cfda into prometheus:main on Sep 27, 2022
@codesome mentioned this pull request on Sep 27, 2022
@rfratto deleted the agent-fix-defaults branch on September 27, 2022 14:16
roidelapluie pushed a commit to roidelapluie/prometheus that referenced this pull request Nov 4, 2022
* tsdb/agent: fix application of defaults

MaxTS was being incorrectly constrained to the truncation interval

* add more tests to check validation

* force MaxWALTime = MinWALTime if min > max

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
roidelapluie pushed a commit that referenced this pull request Nov 4, 2022 (same commit message and sign-off as above)