
Burrow 1.0 - refresh metadata on leader failures #268

Merged

Conversation

toddpalino
Contributor

There's a case where fetching the leaders for partitions can fail. This adds the same RefreshMetadata call that is done later in the offset fetch process to the leader checks as well.
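
For context, a minimal sketch of the retry pattern described here, assuming a sarama.Client; the helper name fetchLeader is illustrative, not Burrow's actual code:

```go
package main

import "github.com/Shopify/sarama"

// fetchLeader looks up the leader broker for a partition. If the lookup
// fails (e.g. the leader moved after a broker failure), it refreshes the
// topic metadata and retries once.
func fetchLeader(client sarama.Client, topic string, partition int32) (*sarama.Broker, error) {
	broker, err := client.Leader(topic, partition)
	if err == nil {
		return broker, nil
	}
	// Same RefreshMetadata call the offset fetch path already uses.
	if refreshErr := client.RefreshMetadata(topic); refreshErr != nil {
		return nil, refreshErr
	}
	return client.Leader(topic, partition)
}
```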

@toddpalino toddpalino merged commit 16dfb09 into linkedin:burrow-1.0-RC Nov 13, 2017
@toddpalino toddpalino deleted the burrow-1.0-deletion-fix branch November 14, 2017 21:15
toddpalino added a commit that referenced this pull request Dec 1, 2017
* Merge burrow-1.0 RC branch

* Burrow 1.0 Release Candidate (#258)

* Replace burrow with the proposed 1.0 framework
Look, it's essentially a complete rewrite. There's almost nothing left of the original code here, and none of the modules have been fleshed out yet.

The overall changes:
* Make burrow itself a lib wrapped with main, so we can wrap it inside other applications
* Move to a modular framework with well-defined interfaces between components
* Switch logging to uber/zap and lumberjack (see the logging sketch after this list)
* Start with being able to have parallel operation (notifier active everywhere) so we can share load between instances
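
A minimal sketch of the zap-plus-lumberjack wiring mentioned above, assuming standard usage of both libraries; the filename and rotation settings are placeholders, not Burrow's configuration:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
	"gopkg.in/natefinch/lumberjack.v2"
)

func main() {
	// lumberjack handles file rotation; zap writes structured JSON to it.
	rotator := &lumberjack.Logger{
		Filename:   "burrow.log", // placeholder path
		MaxSize:    100,          // MB before rotating
		MaxBackups: 10,
		MaxAge:     30, // days
	}
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(rotator),
		zap.InfoLevel,
	)
	logger := zap.New(core)
	defer logger.Sync()
	logger.Info("burrow starting", zap.String("module", "main"))
}
```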

* Restructure a bit to resolve import cycles

* Make sure to gitignore the built binary

* Move modules to internal packages

* Tweak logging to work on windows

* Clean up coordinators a little more

* Fix syscalls for unix vs windows

* First pass at inmemory storage module

* tests for inmemory, and fixes found during testing

* Additional tests to make sure channels are closed after replies

* Actually start the mainLoop

* Ensure only one storage module is allowed, and add coordinator tests

* Fix storage code and tests for problems found while testing evaluators

* Add a fixture for storage to create a coordinator with storage module for testing code outside storage

* Fixes to evaluator code based on testing

* Tests for the evaluator coordinator and caching module

* Add a fixture for the evaluator that other testing can use

* Add start/stop and multiple request tests for the evaluator coordinator

* Remove extra parens

* Fix config name

* Add group whitelists to storage module, along with tests

* Fix a potential bug in min-distance where we would never create a new offset

* More logging

* Add a group delete request for storage modules

* Added expiration of group data via lazy deletion on request
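
A hedged sketch of the lazy-deletion idea: expired group data is dropped at read time rather than by a background reaper. The types here are illustrative, not the inmemory module's real structures:

```go
package main

import (
	"fmt"
	"time"
)

type groupData struct {
	lastCommit time.Time
}

type store struct {
	groups map[string]*groupData
	expiry time.Duration
}

// getGroup returns a group's data, deleting it on the spot if its last
// commit is older than the expiry window (lazy deletion on request).
func (s *store) getGroup(name string) (*groupData, bool) {
	g, ok := s.groups[name]
	if !ok {
		return nil, false
	}
	if time.Since(g.lastCommit) > s.expiry {
		delete(s.groups, name)
		return nil, false
	}
	return g, true
}

func main() {
	s := &store{groups: map[string]*groupData{}, expiry: 7 * 24 * time.Hour}
	s.groups["example"] = &groupData{lastCommit: time.Now()}
	_, ok := s.getGroup("example")
	fmt.Println("group present:", ok)
}
```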

* First pass at cluster module for kafka with limited tests

* Add a shim interface for sarama.Client and sarama.Broker
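
A sketch of the shim pattern named here: declare a local interface covering just the sarama methods the module calls, so tests can substitute a mock. The method subset below is illustrative, not Burrow's full shim:

```go
package cluster

import "github.com/Shopify/sarama"

// SaramaClient is a local shim over the subset of sarama.Client that the
// module uses; tests can implement it with a hand-rolled mock.
type SaramaClient interface {
	Partitions(topic string) ([]int32, error)
	Leader(topic string, partition int32) (*sarama.Broker, error)
	RefreshMetadata(topics ...string) error
	Close() error
}

// The real sarama.Client already satisfies the shim, so production code
// passes it through unchanged.
var _ SaramaClient = sarama.Client(nil)
```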

* Switch kafka cluster module to use the shim interface for sarama

* Add tests for the rest of the kafka cluster module

* Add a storage request for setting partition owner for a group

* Add kafka_client consumer module and tests

* Add consumer coordinator tests

* Move the storage request send helper to a new file

* Refactor names for the sarama shims

* Add a shim for go-zookeeper so we'll be able to test

* Implement the kafkazk consumer module and tests

* Add tests for validation routines

* comment fix

* Add tests for helpers

* Add whitelist support to consumers

* Have the PID creator also check if the process exists before exiting
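
A hedged sketch of the check this adds: before treating an existing PID file as fatal, probe whether that process is actually alive (signal 0 on Unix tests existence without side effects). The helper name is illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// pidIsRunning reports whether the process recorded in pidFile exists.
func pidIsRunning(pidFile string) bool {
	data, err := os.ReadFile(pidFile)
	if err != nil {
		return false
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return false
	}
	proc, err := os.FindProcess(pid) // always succeeds on Unix
	if err != nil {
		return false
	}
	// Signal 0 delivers nothing but fails if the process is gone.
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	fmt.Println(pidIsRunning("burrow.pid"))
}
```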

* Restructure main ZK as a coordinator to use the common interface

* Start notifiers, clean up some testing

* Add tests for HTTP notifier module

* Refactor notifier coordinator to move common logic out of the modules

* Refactor notifier whitelist and threshold accept logic to coordinator

* Move template execution up to a coordinator method for consistency

* Email notifier

* Slack notifier and tests

* Use asserts instead of panics for the HTTP tests

* Fix a case in the storage fixture where it won't get all the commits

* Check http notifier profile configs

* Make maxlag template helper use the CurrentLag field

* Rename NotifierModule to just Module

* Rename StorageModule to just Module

* Rename EvaluatorModule to just Module

* Add support for ZK locks, as well as tests
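
A minimal sketch of taking a ZooKeeper lock with go-zookeeper's stock lock recipe; the server address and lock path are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// zk.NewLock implements the standard ZooKeeper lock recipe using
	// ephemeral sequential znodes under the given path.
	lock := zk.NewLock(conn, "/burrow/notifier/lock", zk.WorldACL(zk.PermAll))
	if err := lock.Lock(); err != nil { // blocks until acquired
		log.Fatal(err)
	}
	defer lock.Unlock()
	// ... work done only while holding the lock ...
}
```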

* Add a ticker that can be stopped and restarted
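
A hedged sketch of a stoppable/restartable ticker of the kind this commit describes; the type name and details are illustrative, not Burrow's implementation:

```go
package helpers

import "time"

// Ticker wraps time.Ticker so it can be stopped and later restarted
// while consumers keep reading from the same channel C.
type Ticker struct {
	C        chan time.Time
	interval time.Duration
	quit     chan struct{}
}

func NewTicker(d time.Duration) *Ticker {
	return &Ticker{C: make(chan time.Time), interval: d}
}

func (t *Ticker) Start() {
	t.quit = make(chan struct{})
	inner := time.NewTicker(t.interval)
	go func() {
		defer inner.Stop()
		for {
			select {
			case now := <-inner.C:
				select {
				case t.C <- now: // deliver the tick
				case <-t.quit:
					return
				}
			case <-t.quit:
				return
			}
		}
	}()
}

// Stop halts ticks; Start may be called again afterwards.
func (t *Ticker) Stop() { close(t.quit) }
```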

* Make the notifier coordinator use a ZK lock with the restartable ticker

* Add HTTP server and tests

* Update dependencies

* Clean up HTTP tests so we test the router configuration code

* Few more HTTP server tests, and flesh out log level set/get

* Reorder imports

* Fix copyright comments

* Formatting cleanup

* Set httprouter to master, since it hasn't had a release in 2 years

* touch up logging

* Remember to set the config as valid

* Use master branch of testify

* Updates found in testing

* Check for null fields in member metadata

* Fixes to metadata handling

* Add a worker pool for inmemory to consistently process groups
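
A sketch of one common reading of "consistently process groups": hash each group name to a fixed worker so a given group is always handled, in order, by the same goroutine. This is an illustrative pattern, not the module's exact code:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type request struct{ group string }

func main() {
	const numWorkers = 4
	var wg sync.WaitGroup
	workers := make([]chan request, numWorkers)
	for i := range workers {
		workers[i] = make(chan request, 16)
		wg.Add(1)
		go func(ch chan request) {
			defer wg.Done()
			for req := range ch {
				fmt.Println("processing", req.group) // real work goes here
			}
		}(workers[i])
	}

	// dispatch routes a request to the worker owning its group, so all
	// requests for one group are serialized on a single goroutine.
	dispatch := func(req request) {
		h := fnv.New32a()
		h.Write([]byte(req.group))
		workers[h.Sum32()%numWorkers] <- req
	}
	dispatch(request{group: "example-consumer"})

	for _, ch := range workers {
		close(ch)
	}
	wg.Wait()
}
```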

* Remove the kafka_client mainLoop, as it's not useful

* Fix formatting and a duplicate logging field

* Add support for CORS headers on the HTTP server

* Add a template helper for formatting timestamps using normal Time format strings
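
A minimal sketch of such a helper, assuming notification timestamps in milliseconds; the helper name and layout below are illustrative:

```go
package main

import (
	"os"
	"text/template"
	"time"
)

func main() {
	tmpl := template.Must(template.New("notify").Funcs(template.FuncMap{
		// formattimestamp converts a millisecond timestamp using Go's
		// reference-time layout strings (e.g. "2006-01-02 15:04:05").
		"formattimestamp": func(ts int64, layout string) string {
			return time.Unix(0, ts*int64(time.Millisecond)).UTC().Format(layout)
		},
	}).Parse(`lag alert at {{formattimestamp .Timestamp "2006-01-02 15:04:05"}}` + "\n"))

	_ = tmpl.Execute(os.Stdout, struct{ Timestamp int64 }{Timestamp: 1512086400000})
}
```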

* Add support for basic auth in the HTTP notifier
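
A hedged sketch of what basic-auth support in an HTTP notifier typically looks like, using the standard library's SetBasicAuth; the parameters are placeholders for whatever the profile config carries:

```go
package notifier

import (
	"net/http"
	"strings"
)

// sendNotification posts a rendered template to the notifier endpoint,
// attaching basic-auth credentials when a username is configured.
func sendNotification(url, username, password, payload string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if username != "" {
		req.SetBasicAuth(username, password)
	}
	return http.DefaultClient.Do(req)
}
```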

* Refactor config to use viper instead of gcfg
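
A minimal sketch of the viper pattern, with placeholder keys rather than Burrow's exact configuration schema:

```go
package main

import (
	"fmt"
	"log"

	"github.com/spf13/viper"
)

func main() {
	viper.SetConfigName("burrow") // finds burrow.{toml,yaml,json,...}
	viper.AddConfigPath(".")
	// Defaults apply when the file omits a key.
	viper.SetDefault("httpserver.default.address", ":8000")
	if err := viper.ReadInConfig(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(viper.GetString("httpserver.default.address"))
}
```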

* add more logging in Kafka clients, and fix config loading

* fix typo in client-id config string

* Catch errors when starting coordinators

* Log the http listener info

* Clean up some of the logging

* Fix logging and notifiers from testing (#259)

* Fix notifier logic in 1.0 (#261)

* Fix how the extras field is pulled into the HTTP response structs

* Make sure the module accept group is always called

* Pause before testing stop on the storage coordinator

* Fix conditions where notifications are sent, and add a much more robust test

* 1.0 - Add jitter to notifier evaluations (#263)

* Change the evaluation loop so each consumer group's evaluations start on their own jittered timer
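
A sketch of the jitter idea: each group's evaluation loop begins after a random delay within the interval, so evaluations spread out instead of all firing on the same tick. Names are illustrative:

```go
package notifier

import (
	"math/rand"
	"time"
)

// startEvaluations launches one evaluation loop per group, staggered by
// a random initial delay in [0, interval).
func startEvaluations(groups []string, interval time.Duration, evaluate func(string)) {
	for _, group := range groups {
		go func(group string) {
			time.Sleep(time.Duration(rand.Int63n(int64(interval))))
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for range ticker.C {
				evaluate(group)
			}
		}(group)
	}
}
```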

* reorder imports

* Burrow 1.0 config defaults (#264)

* Add owners to consumer group status response

* If no storage module configured, use a default

* If no evaluator module configured, use a default

* Fix default http server

* ConfigurationValid gets set by Start, not before

* cleanup methods that don't need to be exported

* Burrow 1.0 group blacklist (#266)

* Add group blacklists

* Reduce logging level for storage purging expired groups

* Start evaluator and httpserver before clusters/consumers

* Remove the requirement that you must have a cluster and consumer module defined

* Explicitly update metadata for topics that had errors fetching offsets (#267)

* Refresh metadata on leader failures as well (#268)

* Make sure that whenever we are reading the cluster map in the notifier, we have a lock (#269)

* Burrow 1.0 - No negative lag (#271)

* Lag values should always be unsigned ints
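
A minimal sketch of the rule: with lag stored as an unsigned int, the subtraction must be clamped so a committed offset ahead of the last fetched high-water mark yields zero rather than underflowing:

```go
package storage

// computeLag clamps at zero: the commit may be newer than the last
// fetched broker offset, which would otherwise underflow the uint64.
func computeLag(highWaterMark, committed int64) uint64 {
	if committed >= highWaterMark {
		return 0
	}
	return uint64(highWaterMark - committed)
}
```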

* unnecessary cast

* Update deps

* Start notifier before clusters and consumers (#272)

* Remove slack notifier (#278)

* Remove slack notifier
* Add example slack templates

* Burrow 1.0 - Godocs for everything (#281)

* Godoc docs for everything, and resolve all golint issues

* Burrow 1.0 - Doc cleanup (#282)

* Update example configuration files

* Fix example email template

* Update docs