
start-latest + backfill-earliest mode #488

Merged: 3 commits merged into linkedin:master from timbertson:backfill on Aug 14, 2019
Conversation

timbertson
Contributor

This is an attempt to address #459, although with a slightly different technique than described there.

The basic idea is to run with start-latest=true, but in parallel we also consume with start-latest=false. That doesn't introduce any new way of getting offsets; the only difficulty is how to combine the offsets from both sources. But since we already have a known ordering (the offset of the commit in the __consumer_offsets topic), that's not too difficult.

Note: It's probably best to review per-commit, since the overall diff ended up quite large due to type changes.

Main changes:

  • We insert offsets into the consumer offset ring according to their order, rather than always appending; commits that are older than every stored commit are dropped (see the sketch after this list)
  • When we're inserting historical commits, we can't calculate lag, since we don't know the historical broker offset. This means those offset entries get a nil lag
  • (when enabled) we start a second consumer, which consumes each partition up until the initial high water mark (queried when Burrow starts up)
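
Here's a rough sketch of the ordered-insert idea in illustrative Go (the type and function names are made up for this description, not the actual code in this PR):

```go
package offsets

import "sort"

// offsetEntry is a hypothetical stand-in for a stored consumer offset.
type offsetEntry struct {
	Order     int64   // offset of the commit record in __consumer_offsets
	Offset    int64   // the consumer group's committed offset
	Timestamp int64   // commit timestamp
	Lag       *uint64 // nil for backfilled commits: the historical broker offset is unknown
}

// insertOrdered places a new entry according to its Order rather than always
// appending, keeping at most maxLen entries and dropping the oldest ones.
func insertOrdered(ring []offsetEntry, e offsetEntry, maxLen int) []offsetEntry {
	i := sort.Search(len(ring), func(i int) bool { return ring[i].Order >= e.Order })
	ring = append(ring, offsetEntry{})
	copy(ring[i+1:], ring[i:])
	ring[i] = e
	if len(ring) > maxLen {
		// commits older than everything already stored fall off the front
		ring = ring[len(ring)-maxLen:]
	}
	return ring
}
```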

Benefits:

  • we'll see new commits as soon as they happen, so active consumers will quickly report the latest values
  • we'll still notice stopped consumers (eventually), once we finish backfilling and observe that their last commit was too far in the past

Inaccuracies:

  • During backfill, we may have some historical commits and also some new commits. This gives an exaggerated "commit window", but it will get more accurate as backfilling progresses.

  • Storing a nil lag isn't ideal, but the evaluation rules still mostly work (with reduced accuracy). And these nil lags will only stick around until the consumer has made enough commits (since Burrow started) to fill up the buffer.

New config:

  • backfill-earliest (defaults to false): when both this setting and start-latest are true, the backfill behaviour added in this PR is enabled (see the example below).
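
For reference, a minimal consumer section enabling this might look something like the following (a sketch: the cluster name and server are placeholders, and only backfill-earliest is new in this PR):

```toml
[consumer.mycluster]
class-name="kafka"
cluster="mycluster"
servers=[ "kafka01.example.com:9092" ]
start-latest=true
backfill-earliest=true
```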

Breaking changes:

Consumer offset records (e.g. start / end for a consumer detail response) returned in JSON responses are no longer guaranteed to have a lag value.

Future work:

When enabled, the code now has a good idea of when it's "caught up": once all backfill consumers have finished. It should be possible to expose this in the HTTP API, e.g. for use in a Kubernetes readiness check.
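
Purely illustrative (nothing like this is implemented in the PR): such a readiness endpoint could look roughly like this, assuming some hypothetical backfillDone accessor:

```go
package httpserver

import "net/http"

// readinessHandler returns 200 once all backfill consumers have finished,
// and 503 before that. backfillDone is a hypothetical accessor, not part of
// this PR or of Burrow's current coordinators.
func readinessHandler(backfillDone func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if backfillDone() {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("ready"))
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		_, _ = w.Write([]byte("still backfilling"))
	}
}
```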

@coveralls

coveralls commented Feb 12, 2019

Coverage Status

Coverage increased (+0.07%) to 74.78% when pulling 29e7ccc on timbertson:backfill into a449cc4 on linkedin:master.

@timbertson timbertson force-pushed the backfill branch 2 times, most recently from 79c6de8 to 83ac489 on May 21, 2019 06:14
@timbertson
Contributor Author

Coming back to this after a bit of a pause: we're now running this branch ourselves, and seeing some nice reductions in the number and duration of false positives Burrow reports on startup.

Here are some charts showing a restart of the current version (left) vs. this PR (right). Based on the CPU you can see that it takes longer to complete the full backfill (since it's now processing new events in parallel), but the actual results are inaccurate for a shorter period. It reports fewer stopped partitions for a shorter time, and the consumer lag spikes are just as high but recover much quicker:

[Screenshot: restart comparison charts, current version vs. this PR, taken 2019-05-20]

@timbertson
Contributor Author

👋 @bai I know this is a decent-sized change; have you had a chance to look at it? If you have any questions about the implementation I'm happy to answer them.

@bai
Collaborator

bai commented Aug 14, 2019

LGTM, thanks 👍

@bai bai merged commit 02f109d into linkedin:master Aug 14, 2019
CGA1123 pushed a commit to CGA1123/Burrow that referenced this pull request Dec 31, 2020
kill release step

Move main.go to cmd/burrow

Moves the current main module into a sub-directory `cmd/burrow` in order
to allow us to compile multiple binaries within the same project.

Add heroku specific code

Adds the logic required to get Burrow running on Heroku for us. This
includes:
* `cmd/configure/main.go` which writes a Burrow configuration file from
  environment variables, required as Heroku Kafka provisions its details
  as environment variables. The file needs to be written dynamically as
  the app boots because these environment variables may change
  arbitrarily as Heroku performs maintenance.

* `shims/shims.go` which contains a custom TLS certificate verification
  function, required as Heroku Kafka provisions TLS certificates that
  do not match the cluster hostname(s).
  Also contains a basic auth middleware implementation.
  All of this logic and the 'hooks' to apply it to Burrow are
  contained in this file so as to reduce the change footprint on the
  upstream repository, which should make pulling upstream changes in the
  future easier.

* `Dockerfile.web` & `entrypoint.sh` which contain the image definition
  and custom entrypoint. The only change is compiling multiple binaries and
  running them both, instead of only the `burrow` binary.

Remove zookeeper dependent coordinators

The `notifier` and `zookeeper` coordinators have a hard dependency on
Zookeeper. Heroku Kafka does not expose access to Zookeeper, and it
doesn't make sense to run our own.

We don't need the functionality of these modules: our Heroku Kafka
consumers do not commit offsets to Zookeeper, and we can get richer
alerts via Honeycomb.

Apply shims

This adds the custom logic implemented in the `shims` module.
- Adding basic auth middleware to the http server(s)
- Setting the certificate verification function on the sarama TLS config (a rough sketch follows)
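
Roughly, the verification approach looks like this (illustrative sketch only; the function names are placeholders, not the actual `shims.go` contents): skip the built-in hostname check but still verify the certificate chain against a CA pool.

```go
package shims

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// verifyChainIgnoringHostname verifies the presented certificate chain
// against roots without checking the hostname.
func verifyChainIgnoringHostname(roots *x509.CertPool) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return errors.New("no certificates presented")
		}
		certs := make([]*x509.Certificate, 0, len(rawCerts))
		for _, raw := range rawCerts {
			cert, err := x509.ParseCertificate(raw)
			if err != nil {
				return err
			}
			certs = append(certs, cert)
		}
		opts := x509.VerifyOptions{Roots: roots, Intermediates: x509.NewCertPool()}
		for _, cert := range certs[1:] {
			opts.Intermediates.AddCert(cert)
		}
		_, err := certs[0].Verify(opts)
		return err
	}
}

// applyToTLSConfig wires the custom verification into a *tls.Config such as
// the one handed to sarama (placeholder helper, not the upstream hook).
func applyToTLSConfig(cfg *tls.Config, roots *x509.CertPool) {
	cfg.InsecureSkipVerify = true                                  // disable the default hostname verification...
	cfg.VerifyPeerCertificate = verifyChainIgnoringHostname(roots) // ...but keep chain verification
}
```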

[PLAT-26714] Configure CI to deploy to heroku (#2)

Uses [akhileshns/heroku-deploy] to build and deploy the `Dockerfile.web`
to Heroku

[akhileshns/heroku-deploy]: https://github.com/AkhileshNS/heroku-deploy

Tune our configuration to reduce false-positives (#3)

These changes update the way Burrow behaves on boot by setting:
* `start-latest` to `true` causing the Burrow consumer to begin reading
  the `__consumer_offsets` topic from the log HEAD.
* `backfill-earliest` to `true` causing Burrow to spin up a second
  (temporary) consumer that will backfill values from the earliest
  message in `__consumer_offsets` to HEAD on boot.

This will reduce the false positives that Burrow emits when starting up
as it works to catch up on the `__consumer_offsets` topic and get an
accurate view of how all consumers are behaving.

We also set the cluster `offset-refresh` to `20` (up from the default of
`10`). This effectively doubles the sliding window that Burrow uses to
determine the status of partitions within consumer groups (from 100s to
200s), which will also reduce false positives by giving consumers some
extra leniency.
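
For reference, those settings correspond roughly to a Burrow configuration like this (a sketch rather than our actual config; host and section names are placeholders):

```toml
[cluster.mycluster]
class-name="kafka"
servers=[ "kafka01.example.com:9092" ]
offset-refresh=20

[consumer.mycluster]
class-name="kafka"
cluster="mycluster"
servers=[ "kafka01.example.com:9092" ]
start-latest=true
backfill-earliest=true
```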

See: linkedin#488 for information on `backfill-earliest`
See: [Consumer Lag Evaluation Window] for information on the sliding
window

[Consumer Lag Evaluation Window]: https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules#evaluation-window

Signed-off-by: Christian Gregg <christian@bissy.io>
CGA1123 pushed a commit to CGA1123/Burrow that referenced this pull request Dec 31, 2020

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request Dec 31, 2020

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request May 18, 2021

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request May 18, 2021

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request May 18, 2021