destroy/gcp: bubble up errors after 5 minutes #3749
Conversation
This change allows errors to become visible to users running at a log level of warn or higher. It continues to log errors encountered while deleting cloud resources at DEBUG, while escalating them to WARN once every 5 minutes.
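For context, the escalation described here can be captured in a small per-key tracker. The sketch below is illustrative only: the diff in this PR exposes just `errorTracker`, `suppressWarning`, `item.key`, and `o.Logger`, so the map field, the `suppressDuration` constant, and the exact logging format are assumptions rather than the PR's actual code.

```go
package gcp

import (
	"time"

	"github.com/sirupsen/logrus"
)

// errorTracker remembers when a warning was last emitted for each resource
// key, so repeated failures are logged at DEBUG and escalated to WARN at
// most once per interval.
type errorTracker struct {
	history map[string]time.Time
}

// suppressDuration is the escalation interval discussed in this PR.
const suppressDuration = 5 * time.Minute

// suppressWarning logs err at DEBUG unless suppressDuration has elapsed
// since the last WARN for this key, in which case it logs at WARN and
// resets the key's timer. (Sketch only; names beyond those in the diff
// are assumed.)
func (t *errorTracker) suppressWarning(key string, err error, logger logrus.FieldLogger) {
	if t.history == nil {
		t.history = map[string]time.Time{}
	}
	if last, seen := t.history[key]; !seen || time.Since(last) >= suppressDuration {
		t.history[key] = time.Now()
		logger.Warnf("%s: %v", key, err)
		return
	}
	logger.Debugf("%s: %v", key, err)
}
```

Each resource handler would then call `suppressWarning` with its item's key whenever a delete fails, as in the diff excerpt quoted in the review below.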
/retest
for _, item := range items {
	err := o.deleteAddress(item)
	if err != nil {
		errs = append(errs, err)
		o.errorTracker.suppressWarning(item.key, err, o.Logger)
	}
}
why do we have to suppress the warning in each handler instead of handling it at the caller sites?
We don't know the "item.key" at the caller sites... so we have no way of knowing what object the error is related to. We can't use the error itself (as we found in AWS), because the error messages are sometimes dynamic and some errors will never bubble up. That is why we have to know which object the error is related to. As a result, we aren't really bubbling up a specific error. Instead, we are bubbling up "this object has seen errors".
What would it take to bubble up the right error messages so that we can suppress the errors, if we have to, at a higher level instead of inside each handler?
We could refactor such that we have a single list of items to delete by `SelfLink`, which includes `type`, and loop through those with a switch based on type that calls the appropriate delete function. The delete function could then just return an error and the main loop could handle the suppression. We would have to expand our discovery phase to populate that list. That sounds like a lot of work.
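As a rough sketch of that proposed shape (illustrative only: `cloudItem`, `deleteItems`, and the per-type delete helpers are hypothetical names, and it assumes the existing uninstaller type with the `errorTracker` and `Logger` fields visible in this PR):

```go
package gcp // sketch: meant to slot into the existing GCP destroy package

// cloudItem is a hypothetical flat record produced by discovery: enough to
// identify a resource (its SelfLink) and to dispatch its delete (its type).
type cloudItem struct {
	typeName string
	selfLink string
}

// deleteItems sketches the refactor: one loop owns dispatch and the
// warning-suppression policy, while each (hypothetical) delete helper only
// returns an error.
func (o *ClusterUninstaller) deleteItems(items []cloudItem) {
	for _, item := range items {
		var err error
		switch item.typeName {
		case "address":
			err = o.deleteAddressByLink(item.selfLink)
		case "forwardingRule":
			err = o.deleteForwardingRuleByLink(item.selfLink)
		// ... one case per resource type populated by discovery
		}
		if err != nil {
			// Suppression happens in exactly one place, at the caller.
			o.errorTracker.suppressWarning(item.selfLink, err, o.Logger)
		}
	}
}
```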
Why would we have to include new discovery? The delete knows which item was being deleted; would adding context about the item being deleted when returning the error not be enough?
The way it was before, we were returning aggregated errors as a single error. Breaking that down into individual components would require significant work as well.
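For illustration, the pre-change shape is roughly the standard aggregation pattern: per-item failures are folded into one error value, so the caller cannot recover which resource a sub-error belongs to without parsing message text. A minimal, self-contained example (the resource names and quota error are made up, and whether the installer uses `utilerrors.NewAggregate` specifically is an assumption here):

```go
package main

import (
	"errors"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

var errQuota = errors.New("quota exceeded")

// deleteAll folds per-item failures into a single aggregate error; the
// item identity survives only inside the formatted message strings.
func deleteAll() error {
	errs := []error{
		fmt.Errorf("failed to delete address addr-1: %w", errQuota),
		fmt.Errorf("failed to delete forwarding rule fw-1: %w", errQuota),
	}
	return utilerrors.NewAggregate(errs)
}

func main() {
	// Prints one combined message; there is no structured way to get back
	// to addr-1 or fw-1 from the returned error.
	fmt.Println(deleteAll())
}
```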
Just as a way to think it through: how would we go about changing the code so that we didn't have to suppress individually and could actually do this handling at a higher level, outside these runners? That way we can shape a follow-up to improve it.
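One possible shape for such a follow-up, sketched below with hypothetical names: handlers wrap each failure in a small typed error carrying the item key, and a higher-level loop unwraps it to drive the suppression. It assumes the existing `errorTracker` and `Logger` fields from this PR; `itemError` and `handleDeleteErrors` are invented for illustration.

```go
package gcp // sketch: meant to slot into the existing GCP destroy package

import "errors"

// itemError is a hypothetical wrapper that attaches the affected item's
// key to a failure, so a caller outside the handler can decide how to log
// or suppress it.
type itemError struct {
	key string
	err error
}

func (e *itemError) Error() string { return e.key + ": " + e.err.Error() }
func (e *itemError) Unwrap() error { return e.err }

// handleDeleteErrors shows how a higher-level loop could own the
// suppression policy: it unwraps each returned error to recover the item
// key instead of each handler calling suppressWarning itself.
func (o *ClusterUninstaller) handleDeleteErrors(errs []error) {
	for _, err := range errs {
		var ie *itemError
		if errors.As(err, &ie) {
			o.errorTracker.suppressWarning(ie.key, ie.err, o.Logger)
			continue
		}
		// Errors without item context fall back to DEBUG, as before.
		o.Logger.Debug(err)
	}
}
```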
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abhinavdahiya
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes.
@jstuever: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.