
start-latest + backfill-earliest mode #488

Merged: 3 commits merged into linkedin:master from timbertson:backfill on Aug 14, 2019
Conversation

timbertson
Contributor

This is an attempt to address #459, although with a slightly different technique than described there.

The basic idea is to run with start-latest=true, but in parallel we also consume with start-latest=false. That doesn't introduce any new way of getting offsets; the only difficulty is how to combine the offsets from both sources. But since we already have a known ordering (the offset of the commit in the __consumer_offsets topic), that's not too difficult.

Note: It's probably best to review per-commit, since the overall diff ended up quite large due to type changes.

Main changes:

  • We insert offsets into the consumer offset ring according to their order, rather than always appending; commits that are older than every stored commit are dropped (see the sketch after this list)
  • When we're inserting historical commits, we can't calculate lag, since we don't know the historical broker offset. This means those offset entries get a nil lag
  • (when enabled) we start a second consumer, which consumes each partition up until the initial high water mark (queried when Burrow starts up)
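
Here's a rough sketch of the ordered-insert idea in illustrative Go (the type and function names are made up for this description, not the actual code in this PR):

```go
package offsets

import "sort"

// offsetEntry is a hypothetical stand-in for a stored consumer offset.
type offsetEntry struct {
	Order     int64   // offset of the commit record in __consumer_offsets
	Offset    int64   // the consumer group's committed offset
	Timestamp int64   // commit timestamp
	Lag       *uint64 // nil for backfilled commits: the historical broker offset is unknown
}

// insertOrdered places a new entry according to its Order rather than always
// appending, keeping at most maxLen entries and dropping the oldest ones.
func insertOrdered(ring []offsetEntry, e offsetEntry, maxLen int) []offsetEntry {
	i := sort.Search(len(ring), func(i int) bool { return ring[i].Order >= e.Order })
	ring = append(ring, offsetEntry{})
	copy(ring[i+1:], ring[i:])
	ring[i] = e
	if len(ring) > maxLen {
		// commits older than everything already stored fall off the front
		ring = ring[len(ring)-maxLen:]
	}
	return ring
}
```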

Benefits:

  • we'll see new commits as soon as they happen, so active consumers will quickly report the latest values
  • we'll still notice stopped consumers (eventually), once we finish backfilling and observe that their last commit was too far in the past

Inaccuracies:

  • During backfill, we may have some historical commits and also some new commits. This gives an exaggerated "commit window", but it will get more accurate as backfilling progresses.

  • Storing a nil lag isn't ideal, but the evaluation rules still mostly work (with reduced accuracy). And these nil lags will only stick around until the consumer has made enough commits (since Burrow started) to fill up the buffer.

New config:

  • backfill-earliest (defaults to false): when both this setting and start-latest are true, the backfill behaviour added in this PR is enabled (see the example below).
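
For reference, a minimal consumer section enabling this might look something like the following (a sketch: the cluster name and server are placeholders, and only backfill-earliest is new in this PR):

```toml
[consumer.mycluster]
class-name="kafka"
cluster="mycluster"
servers=[ "kafka01.example.com:9092" ]
start-latest=true
backfill-earliest=true
```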

Breaking changes:

Consumer offset records (e.g. start / end for a consumer detail response) returned in JSON responses are no longer guaranteed to have a lag value.

Future work:

When enabled, the code now has a good idea of when it's "caught up": once all backfill consumers have finished. It should be possible to expose this in the HTTP API, e.g. for use in a Kubernetes readiness check.
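
Purely illustrative (nothing like this is implemented in the PR): such a readiness endpoint could look roughly like this, assuming some hypothetical backfillDone accessor:

```go
package httpserver

import "net/http"

// readinessHandler returns 200 once all backfill consumers have finished,
// and 503 before that. backfillDone is a hypothetical accessor, not part of
// this PR or of Burrow's current coordinators.
func readinessHandler(backfillDone func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if backfillDone() {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("ready"))
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		_, _ = w.Write([]byte("still backfilling"))
	}
}
```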

@coveralls

coveralls commented Feb 12, 2019

Coverage Status

Coverage increased (+0.07%) to 74.78% when pulling 29e7ccc on timbertson:backfill into a449cc4 on linkedin:master.

@timbertson timbertson force-pushed the backfill branch 2 times, most recently from 79c6de8 to 83ac489 on May 21, 2019 06:14
@timbertson
Contributor Author

Coming back to this after a bit of a pause: we're now running this branch ourselves, and seeing some nice reductions in the number and duration of false positives Burrow reports on startup.

Here are some charts showing a restart of the current version (left) vs. this PR (right). Based on the CPU you can see that it takes longer to complete the full backfill (since it's now processing new events in parallel), but the actual results are inaccurate for a shorter period. It reports fewer stopped partitions for a shorter time, and the consumer lag spikes are just as high but recover much quicker:

[Screenshot: restart comparison charts, current version vs. this PR, taken 2019-05-20]

@timbertson
Contributor Author

👋 @bai I know this is a decent-sized change; have you had a chance to look at it? If you have any questions about the implementation I'm happy to answer them.

@bai
Collaborator

bai commented Aug 14, 2019

LGTM, thanks 👍

@bai bai merged commit 02f109d into linkedin:master Aug 14, 2019
CGA1123 pushed a commit to CGA1123/Burrow that referenced this pull request Dec 31, 2020
kill release step

Move main.go to cmd/burrow

Moves the current main module into a sub-directory `cmd/burrow` in order
to allow us to compile multiple binaries within the same project.

Add heroku specific code

Adds the logic required to get Burrow running on Heroku for us. This
includes:
* `cmd/configure/main.go` which writes a Burrow configuration file from
  environment variables, required as Heroku Kafka provisions its details
  as environment variables. The file needs to be written dynamically as
  the app boots because these environment variables may change
  arbitrarily as Heroku performs maintenance.

* `shims/shims.go` which contains a custom TLS certificate verification
  function, required as Heroku Kafka provisions TLS certificates that
  do not match the cluster hostname(s).
  Also contains a basic auth middleware implementation.
  All of this logic and the 'hooks' to apply it to Burrow are
  contained in this file so as to reduce the change footprint on the
  upstream repository, which should make pulling upstream changes in the
  future easier.

* `Dockerfile.web` & `entrypoint.sh` which contain the image definition
  and custom entrypoint. The only change is compiling multiple binaries and
  running them both, instead of only the `burrow` binary.

Remove zookeeper dependent coordinators

The `notifier` and `zookeeper` coordinators have a hard dependency on
Zookeeper. Heroku Kafka does not expose access to Zookeeper, and it
doesn't make sense to run our own.

We don't need the functionality of these modules: our Heroku Kafka
consumers do not commit offsets to Zookeeper, and we can get richer
alerts via Honeycomb.

Apply shims

This adds the custom logic implemented in the `shims` module.
- Adding basic auth middleware to the http server(s)
- Setting the certificate verification function on the sarama TLS config (a rough sketch follows)
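
Roughly, the verification approach looks like this (illustrative sketch only; the function names are placeholders, not the actual `shims.go` contents): skip the built-in hostname check but still verify the certificate chain against a CA pool.

```go
package shims

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// verifyChainIgnoringHostname verifies the presented certificate chain
// against roots without checking the hostname.
func verifyChainIgnoringHostname(roots *x509.CertPool) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return errors.New("no certificates presented")
		}
		certs := make([]*x509.Certificate, 0, len(rawCerts))
		for _, raw := range rawCerts {
			cert, err := x509.ParseCertificate(raw)
			if err != nil {
				return err
			}
			certs = append(certs, cert)
		}
		opts := x509.VerifyOptions{Roots: roots, Intermediates: x509.NewCertPool()}
		for _, cert := range certs[1:] {
			opts.Intermediates.AddCert(cert)
		}
		_, err := certs[0].Verify(opts)
		return err
	}
}

// applyToTLSConfig wires the custom verification into a *tls.Config such as
// the one handed to sarama (placeholder helper, not the upstream hook).
func applyToTLSConfig(cfg *tls.Config, roots *x509.CertPool) {
	cfg.InsecureSkipVerify = true                                  // disable the default hostname verification...
	cfg.VerifyPeerCertificate = verifyChainIgnoringHostname(roots) // ...but keep chain verification
}
```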

[PLAT-26714] Configure CI to deploy to heroku (#2)

Uses [akhileshns/heroku-deploy] to build and deploy the `Dockerfile.web`
to Heroku

[akhileshns/heroku-deploy]: https://github.com/AkhileshNS/heroku-deploy

Tune our configuration to reduce false-positives (#3)

These changes update the way Burrow behaves on boot by setting:
* `start-latest` to `true` causing the Burrow consumer to begin reading
  the `__consumer_offsets` topic from the log HEAD.
* `backfill-earliest` to `true` causing Burrow to spin up a second
  (temporary) consumer that will backfill values from the earliest
  message in `__consumer_offsets` to HEAD on boot.

This will reduce the false positives that Burrow emits when starting up
as it works to catch up on the `__consumer_offsets` topic and get an
accurate view of how all consumers are behaving.

We also set the cluster `offset-refresh` to `20` (up from the default of
`10`). This effectively doubles the sliding window that Burrow uses to
determine the status of partitions within consumer groups (from 100s to
200s), which will also reduce false positives by giving consumers some
extra leniency.
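
For reference, those settings correspond roughly to a Burrow configuration like this (a sketch rather than our actual config; host and section names are placeholders):

```toml
[cluster.mycluster]
class-name="kafka"
servers=[ "kafka01.example.com:9092" ]
offset-refresh=20

[consumer.mycluster]
class-name="kafka"
cluster="mycluster"
servers=[ "kafka01.example.com:9092" ]
start-latest=true
backfill-earliest=true
```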

See: linkedin#488 for information on `backfill-earliest`
See: [Consumer Lag Evaluation Window] for information on the sliding
window

[Consumer Lag Evaluation Window]: https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules#evaluation-window

Signed-off-by: Christian Gregg <christian@bissy.io>
CGA1123 pushed a commit to CGA1123/Burrow that referenced this pull request Dec 31, 2020

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request Dec 31, 2020

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request May 18, 2021

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request May 18, 2021

CGA1123 added a commit to CGA1123/Burrow that referenced this pull request May 18, 2021