-
Notifications
You must be signed in to change notification settings - Fork 8.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
promtool: backfill: allow configuring block duration #8919
promtool: backfill: allow configuring block duration #8919
Conversation
Thanks. I prefer to have the duration of the blocks, as it would be more telling for the users. Power is really obscure and it's an implementation detail. We could in the help list a few examples (2h, 4h, 8h). If you think it is too much to ask for the user, we could round down the duration provided (with a warning). e.g. if you pass 20h you would have 16h blocks. But I don't think it is necessary for a first step. Note: the documentation should be on the storage page. |
I disagree - in your code you very specifically validate this |
@roidelapluie sorry, would you mind linking where this page is, so I could edit it? |
06926be
to
23e8729
Compare
Moved the conversation about when to use the flag to |
What if, if we reject, we print to stderr the value below and above? So the user can pick easily? I am really affraid that users will write "24" or even "10", making gigantic blocks by accident, plus this is adding a lot of complexity (you need to compute the power instead of picking from the few examples we show in the --help) |
The other way to think about this is this is an API (ok, outside the Prometheus API stability guarantees currently), but by forcing this to be a particular power it means this becomes coupled to the implementation of the block storage. By using a duration that is validated, it's nicer to users but also does not get in the way of any potential (even if unlikely) changes to the block format. |
@dgl the maintainers made it very clear to me that a) only very specific values are allowed b) if you use other values, things will go very wrong. What is the benefit of planning for an unlikely future where the default block size changes? I want to build a useful tool today that makes it impossible for users to make the wrong decision, in a space where the decision is hard to make and incorrect choices are deleterious. |
@roidelapluie IDK if you and other maintainers feel really strongly I can change it but I honestly believe that giving users an infinite input space, making it seem like there's choice, and validating it down to very, very distinct values is simply terrible UX. Compare the two cases:
Why is 2 better? edit: and, overall - I've been lead to understand by @brian-brazil that this choice is very specific, very hard to do right and very important. I explicitly do not want users to have an easy time choosing any old random duration without understanding what will happen to their TSDB afterword. If this flag doesn't lead people to read the doc and make an informed decision after thinking through the implications, that's a failure in my book. There is no simple choice to make and there is no simple duration they can use other than the default. |
I am quite confident that we should not expose those kind of details to our users. What we could to is to propose a --max-block-duration=2h option instead, and we take the power of 2 below it. --max-block-duration=24h would create 16h blocks. |
@roidelapluie that is a good approach as well. Do we expect users to want a specific max value, or should we add something like |
--use-largest-blocks will probably be too big for e.g. 1 year of data. |
@roidelapluie what would be reasonable bounds for the block size? No smaller than 2h and no larger than a week? A month? three months? |
e6ea5c5
to
5d37cc3
Compare
@roidelapluie ok - I've added the flag to choose a number of blocks and added validation so it's not too large or too small. I chose a random value for the largest allowable block, but please take a look and advise me as to the best option here. |
5d37cc3
to
2e82c3d
Compare
(I think we'll need to do the close/reopen thing here too to get CircleCI to run?) |
I still don't think this is the correct fix. I stand by my proposal of a max block duration because that is what you pass to the prometheus server. Then you can have the same max block duration on backfill and prometheus. Prometheus does not have the option "keep 10 blocks". |
@roidelapluie you should have said so... your previous comments definitely sounded like you endorsing this approach. With a duration, though, could you please explain to me why it's the better approach to ask the user for a raw duration when almost all possible values are invalid? Does |
If passed storage.tsdb.max-block-duration=24h, prometheus would create blocks up to 16h, like in my proposal. I would not error. |
We ask the user for a max duration, not for an exact duration. That mimics the options of Prometheus, so it is safe for users to use the same value as their Prometheus server. |
Well, ya should have said that! If it's an existing UX used elsewhere, there's nothing to do but copy it here :) |
(to be fair, the code you linked me that you wrote did the validation and sent an error... I would really love to have learned of the Prometheus TSDB flag earlier so we could have saved SO much time talking about all this ...) |
2e82c3d
to
03a737d
Compare
Code now uses identical logic to the |
@roidelapluie I'm not sure how you got CircleCI to work on my other PR but I think you will need to use that magic again here :| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Minor changes required
36cbc45
to
20a785e
Compare
@codesome thanks for the review! Updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not noticing it before: we should be adding a couple of test cases in backfill_test.go -> TestBackfill
with higher max block duration which does not produce 2h blocks. You can add an optional variable in tests
variable for this.
When backfilling large amounts of data across long periods of time, it may in certain circumstances be useful to use a longer block duration to increase the efficiency and speed of the backfilling process. This patch adds a flag --block-duration-power to allow a user to choose the power N where the block duration is 2^(N+1)h. Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
When someone new breaks a test, seeing "expected: false, got: true" is really not useful. A nice message helps here. Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
3d2bb7d
to
875155f
Compare
@codesome of course - added a test for a longer duration. Let me know if it's ok, but I also changed the tests to run using Go subtests so I could iterate more quickly on one test case, and added failure messages to the asserts as they were pretty cryptic before. There was one assert I simply removed because to my eyes all it did was ensure that all blocks were the default duration long - without a message I can't be sure. |
Let's bring that back. We need to ensure that the blocks don't cross the expected size and also the boundaries defined by those block duration. |
@codesome sure - added tests for those edges and wrote a check for the span of all blocks. |
875155f
to
3c22a9c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last change and we are good to go! Thanks!
3c22a9c
to
0c84bfc
Compare
@codesome thanks for the thorough review! Hopefully the English wording in the tests makes it easier what, specifically, is being asserted about for the next bloke to break them ;) |
Closed and reopened to re-run the CI as they were stuck |
CI is failing, looks like you are missing gofmt on your changes :) |
0c84bfc
to
6d05f78
Compare
@codesome haha sorry! Fighting with a new editor setup 🤦 |
A test that uses a long block duration to write bigger blocks is added. The check to make sure all blocks are the default duration is removed. Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
6d05f78
to
b4a5bce
Compare
@codesome sorry about that ... looks all green now :) |
When backfilling large amounts of data across long periods of time, it
may in certain circumstances be useful to use a longer block duration to
increase the efficiency and speed of the backfilling process. This patch
adds a flag --block-duration-power to allow a user to choose the power N
where the block duration is 2^(N+1)h.
Signed-off-by: Steve Kuznetsov skuznets@redhat.com
Fixes #8388
Builds on top of #8917 so as to not create merge conflicts. I do not know the best place to leave the discussion about block durations, so I added it to the helptext. I certainly think it is useful for users who are not familiar with TSDB basics to have that discussion present when they are choosing a block duration.
/assign @roidelapluie @brian-brazil