Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve staleness logic by improving error handling and retrying a subset of errors #32

Merged
merged 34 commits into from
Jul 10, 2019

Conversation

glaksh100
Copy link
Contributor

@glaksh100 glaksh100 commented Jun 18, 2019

This PR removes dependency on staleness duration. The application is now rolled back based on the error observed (and retried if applicable). It also makes error handling a bit more structured so that we can more easily specify retry-able errors.

Changelog

  • Adds a error_handler class that wraps around the error interface and adds information like error codes and method info to the error message
  • Defines a map of retryable error keys (method + error code)
  • Unifies all error logging in api.go to use the above error_handler
  • Adds a LastSeenError and a RetryCount to the FlinkApplicationStatus
  • Changes the logic within shouldRollback() in flink_state_machine.go to check if the LastSeenError can be retried and is within the maxRetries limit
  • Removes dependency on staleness duration

Note: go generate on Config.go seems to be failing with:

which pflags || (go get github.com/lyft/flytestdlib/cli/pflags)
# github.com/lyft/flytestdlib/promutils
../flytestdlib/promutils/workqueue.go:29:49: cannot use composite literal (type prometheusMetricsProvider) as type workqueue.MetricsProvider in argument to workqueue.SetProvider:
	prometheusMetricsProvider does not implement workqueue.MetricsProvider (missing NewDeprecatedAddsMetric method)
make: *** [gen-config] Error 2

@glaksh100
Copy link
Contributor Author

Many thanks to @anandswaminathan for helping debug and fix the pflags issue. For anyone who may run into this in their local dev:

  • make gen-config downloads flytestdlib to ~/src/go/blah... (does ago get`)
  • All references to flytestdlib are henceforth to your local flystdlib version and NOT to the one under flinkk8soperator/vendor/github.com/lyft/flytestdlib.
  • TL;DR run dep ensure on your local version of flytstdlib and the problem goes away.

@glaksh100
Copy link
Contributor Author

@mwylde @anandswaminathan After various fixes, I think this PR is now in a state to be reviewed. PTAL when you have a chance.

Copy link
Contributor

@mwylde mwylde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! A few comments, but overall my biggest concern is that in the current design we have to be very careful to avoid applications getting stuck. For example, in some local testing of the PR I saw a case where killing the JM during savepointing caused the application to get stuck midway with

ts="2019-06-20T18:01:38-07:00" level=error msg="Check savepoint status failed with response {\"errors\":[\"Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@e0ff9448\"]}" app_name=operator-test-app-ha ns=default phase=Savepointing src="api.go:248"

This is especially dangerous because there's currently no way for a user to interrupt a deploy, so they will have to delete and recreate the application.

pkg/controller/flink/client/api.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Show resolved Hide resolved
pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
@anandswaminathan
Copy link
Contributor

@glaksh100

One thing I have noticed with both Operator SDK and Controller runtime is that on occasions - Callbacks can happen at a much faster loop, and not only at the Resync period. Have tried debugging it but looks like several edge cases where this can happen.

What this can translate to - you might exhaust your maxRetries in very small interval. You will need to track time, and have logic like minTimeBetweenRetries or something, and set it to an acceptable levels.

pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
@glaksh100
Copy link
Contributor Author

@mwylde Thanks for the initial and the second review! I'm going to respond to comments individually and resolve ones that are outdated.

@glaksh100
Copy link
Contributor Author

Things I'm planning to address in the next pass:

  1. Improving the exponential back off with jitter and a bound
  2. Compile-check the error codes
  3. Make the retryhandler parameters configurable and move them to the configMap

@glaksh100
Copy link
Contributor Author

glaksh100 commented Jun 27, 2019

Changelog — it's perhaps easier to look at the entire diff than the ones for the commit.

  1. Refactored the retryable/failfast errors to not depend on a map at all. When the error is logged, the error itself dictates what category it belongs to along with its maxRetries.
  2. Increased defaultRetries and also added jitter to the retry delay calculation
  3. Added retry configurations to the configmap (basebackoff/maxbackoff/maxwaitonerror)

@glaksh100
Copy link
Contributor Author

As of yesterday, I've addressed all comments afaict. PTAL when you have a chance @mwylde @anandswaminathan

pkg/controller/flinkapplication/flink_state_machine.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/api.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/api.go Outdated Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Show resolved Hide resolved
pkg/controller/flink/client/error_handler.go Outdated Show resolved Hide resolved
…itjob, default time based application staleness, no error handling in running phase
@glaksh100
Copy link
Contributor Author

PTAL @mwylde when you have a chance!

Copy link
Contributor

@mwylde mwylde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This is great!

@glaksh100
Copy link
Contributor Author

Thanks for the thorough review @mwylde!

@glaksh100 glaksh100 merged commit 8d847d7 into master Jul 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants