Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promtool: backfill: allow configuring block duration #8919

Merged

Conversation

stevekuznetsov
Copy link
Contributor

When backfilling large amounts of data across long periods of time, it
may in certain circumstances be useful to use a longer block duration to
increase the efficiency and speed of the backfilling process. This patch
adds a flag --block-duration-power to allow a user to choose the power N
where the block duration is 2^(N+1)h.

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


Fixes #8388
Builds on top of #8917 so as to not create merge conflicts. I do not know the best place to leave the discussion about block durations, so I added it to the helptext. I certainly think it is useful for users who are not familiar with TSDB basics to have that discussion present when they are choosing a block duration.
/assign @roidelapluie @brian-brazil

@roidelapluie
Copy link
Member

Thanks.

I prefer to have the duration of the blocks, as it would be more telling for the users. Power is really obscure and it's an implementation detail.

We could in the help list a few examples (2h, 4h, 8h).

If you think it is too much to ask for the user, we could round down the duration provided (with a warning). e.g. if you pass 20h you would have 16h blocks. But I don't think it is necessary for a first step.

Note: the documentation should be on the storage page.

@stevekuznetsov
Copy link
Contributor Author

I disagree - in your code you very specifically validate this 2^(N+1)h rule. Why should we expose a flag to users where the vast majority of values are wrong? In the other thread you and @brian-brazil were very clear that choosing an incorrect value of duration is very bad, and only a very narrow range of durations would ever work. Why should we allow users to choose wrong values, only to validate them and send them an error? The only real choice users have is the N in the above equation, so that's what they should provide IMHO.

@stevekuznetsov
Copy link
Contributor Author

Note: the documentation should be on the storage page.

@roidelapluie sorry, would you mind linking where this page is, so I could edit it?

@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 06926be to 23e8729 Compare June 11, 2021 20:12
@stevekuznetsov
Copy link
Contributor Author

Moved the conversation about when to use the flag to docs/storage.md

@roidelapluie
Copy link
Member

What if, if we reject, we print to stderr the value below and above? So the user can pick easily?

I am really affraid that users will write "24" or even "10", making gigantic blocks by accident, plus this is adding a lot of complexity (you need to compute the power instead of picking from the few examples we show in the --help)

@dgl
Copy link
Member

dgl commented Jun 11, 2021

The other way to think about this is this is an API (ok, outside the Prometheus API stability guarantees currently), but by forcing this to be a particular power it means this becomes coupled to the implementation of the block storage.

By using a duration that is validated, it's nicer to users but also does not get in the way of any potential (even if unlikely) changes to the block format.

@stevekuznetsov
Copy link
Contributor Author

@dgl the maintainers made it very clear to me that a) only very specific values are allowed b) if you use other values, things will go very wrong. What is the benefit of planning for an unlikely future where the default block size changes? I want to build a useful tool today that makes it impossible for users to make the wrong decision, in a space where the decision is hard to make and incorrect choices are deleterious.

@stevekuznetsov
Copy link
Contributor Author

stevekuznetsov commented Jun 14, 2021

@roidelapluie IDK if you and other maintainers feel really strongly I can change it but I honestly believe that giving users an infinite input space, making it seem like there's choice, and validating it down to very, very distinct values is simply terrible UX. Compare the two cases:

  1. the user has a --block-duration-power flag, reads the documentation on how to use it, chooses a power and goes from there
  2. the user has a --block-duration flag, makes a guess, chooses an invalid duration, has to go read the doc to figure out which values are valid, implicitly chooses a power, goes from there

Why is 2 better?

edit: and, overall - I've been lead to understand by @brian-brazil that this choice is very specific, very hard to do right and very important. I explicitly do not want users to have an easy time choosing any old random duration without understanding what will happen to their TSDB afterword. If this flag doesn't lead people to read the doc and make an informed decision after thinking through the implications, that's a failure in my book. There is no simple choice to make and there is no simple duration they can use other than the default.

@roidelapluie
Copy link
Member

I am quite confident that we should not expose those kind of details to our users.

What we could to is to propose a --max-block-duration=2h option instead, and we take the power of 2 below it.

--max-block-duration=24h would create 16h blocks.

@stevekuznetsov
Copy link
Contributor Author

@roidelapluie that is a good approach as well. Do we expect users to want a specific max value, or should we add something like --use-largest-blocks that looks at the input OpenMetrics data and simply chooses the block size that would fit all of the data? Or maybe a --split-across-blocks=N and choose some "correct" duration whereby N blocks result?

@roidelapluie
Copy link
Member

--use-largest-blocks will probably be too big for e.g. 1 year of data. --split-across-blocks=N might work but it would also lie to the user which has data for less than e.g. 2 hours.

@stevekuznetsov
Copy link
Contributor Author

@roidelapluie what would be reasonable bounds for the block size? No smaller than 2h and no larger than a week? A month? three months?

@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch 2 times, most recently from e6ea5c5 to 5d37cc3 Compare June 16, 2021 16:23
@stevekuznetsov
Copy link
Contributor Author

@roidelapluie ok - I've added the flag to choose a number of blocks and added validation so it's not too large or too small. I chose a random value for the largest allowable block, but please take a look and advise me as to the best option here.

@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 5d37cc3 to 2e82c3d Compare June 16, 2021 16:28
@stevekuznetsov
Copy link
Contributor Author

(I think we'll need to do the close/reopen thing here too to get CircleCI to run?)

@roidelapluie
Copy link
Member

I still don't think this is the correct fix. I stand by my proposal of a max block duration because that is what you pass to the prometheus server. Then you can have the same max block duration on backfill and prometheus. Prometheus does not have the option "keep 10 blocks".

@stevekuznetsov
Copy link
Contributor Author

stevekuznetsov commented Jun 16, 2021

@roidelapluie you should have said so... your previous comments definitely sounded like you endorsing this approach. With a duration, though, could you please explain to me why it's the better approach to ask the user for a raw duration when almost all possible values are invalid? Does prometheus do the same validation when you pass it a block duration flag?

@roidelapluie
Copy link
Member

If passed storage.tsdb.max-block-duration=24h, prometheus would create blocks up to 16h, like in my proposal. I would not error.

@roidelapluie
Copy link
Member

We ask the user for a max duration, not for an exact duration. That mimics the options of Prometheus, so it is safe for users to use the same value as their Prometheus server.

@stevekuznetsov
Copy link
Contributor Author

Well, ya should have said that! If it's an existing UX used elsewhere, there's nothing to do but copy it here :)

@stevekuznetsov
Copy link
Contributor Author

(to be fair, the code you linked me that you wrote did the validation and sent an error... I would really love to have learned of the Prometheus TSDB flag earlier so we could have saved SO much time talking about all this ...)

@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 2e82c3d to 03a737d Compare June 16, 2021 23:21
@stevekuznetsov
Copy link
Contributor Author

Code now uses identical logic to the --storage.tsdb.max-block-duration flag in the main prometheus binary.

@stevekuznetsov
Copy link
Contributor Author

@roidelapluie I'm not sure how you got CircleCI to work on my other PR but I think you will need to use that magic again here :|

@roidelapluie roidelapluie reopened this Jun 20, 2021
Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Minor changes required

cmd/promtool/main.go Outdated Show resolved Hide resolved
cmd/promtool/backfill.go Outdated Show resolved Hide resolved
cmd/promtool/backfill.go Outdated Show resolved Hide resolved
docs/storage.md Outdated Show resolved Hide resolved
@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 36cbc45 to 20a785e Compare June 22, 2021 19:07
@stevekuznetsov
Copy link
Contributor Author

@codesome thanks for the review! Updated

Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not noticing it before: we should be adding a couple of test cases in backfill_test.go -> TestBackfill with higher max block duration which does not produce 2h blocks. You can add an optional variable in tests variable for this.

cmd/promtool/backfill.go Outdated Show resolved Hide resolved
When backfilling large amounts of data across long periods of time, it
may in certain circumstances be useful to use a longer block duration to
increase the efficiency and speed of the backfilling process. This patch
adds a flag --block-duration-power to allow a user to choose the power N
where the block duration is 2^(N+1)h.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
When someone new breaks a test, seeing "expected: false, got: true" is
really not useful. A nice message helps here.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 3d2bb7d to 875155f Compare June 23, 2021 16:03
@stevekuznetsov
Copy link
Contributor Author

@codesome of course - added a test for a longer duration. Let me know if it's ok, but I also changed the tests to run using Go subtests so I could iterate more quickly on one test case, and added failure messages to the asserts as they were pretty cryptic before. There was one assert I simply removed because to my eyes all it did was ensure that all blocks were the default duration long - without a message I can't be sure.

@codesome
Copy link
Member

There was one assert I simply removed because to my eyes all it did was ensure that all blocks were the default duration long

Let's bring that back. We need to ensure that the blocks don't cross the expected size and also the boundaries defined by those block duration. testBlocks can take an additional argument for expected block duration, and the test case struct can include that info (since max is not always the block duration). To keep it simple, 0 in the struct could mean the default block duration.

@stevekuznetsov
Copy link
Contributor Author

@codesome sure - added tests for those edges and wrote a check for the span of all blocks.

@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 875155f to 3c22a9c Compare June 24, 2021 16:04
Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last change and we are good to go! Thanks!

cmd/promtool/backfill_test.go Outdated Show resolved Hide resolved
cmd/promtool/backfill_test.go Outdated Show resolved Hide resolved
@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 3c22a9c to 0c84bfc Compare June 25, 2021 04:51
@stevekuznetsov
Copy link
Contributor Author

@codesome thanks for the thorough review! Hopefully the English wording in the tests makes it easier what, specifically, is being asserted about for the next bloke to break them ;)

@codesome codesome closed this Jun 25, 2021
@codesome codesome reopened this Jun 25, 2021
@codesome
Copy link
Member

Closed and reopened to re-run the CI as they were stuck

@codesome
Copy link
Member

CI is failing, looks like you are missing gofmt on your changes :)

@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 0c84bfc to 6d05f78 Compare June 25, 2021 14:56
@stevekuznetsov
Copy link
Contributor Author

@codesome haha sorry! Fighting with a new editor setup 🤦

@codesome codesome closed this Jun 28, 2021
@codesome codesome reopened this Jun 28, 2021
A test that uses a long block duration to write bigger blocks is added.
The check to make sure all blocks are the default duration is removed.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov stevekuznetsov force-pushed the skuznets/backfill-block-duration branch from 6d05f78 to b4a5bce Compare June 28, 2021 16:57
@stevekuznetsov
Copy link
Contributor Author

@codesome sorry about that ... looks all green now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backfill: change default block size
4 participants