
destroy/gcp: bubble up errors after 5 minutes #3749

Merged
merged 1 commit into openshift:master on Jun 26, 2020

Conversation

jstuever
Contributor

This change makes errors visible to users running with a log level of warn or higher. It continues to log errors encountered while deleting cloud resources at DEBUG, while escalating them to WARN once every 5 minutes.

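For context, a minimal sketch of how such a per-key tracker might behave, assuming a map of first-seen timestamps and a five-minute escalation window. The `errorTracker` and `suppressWarning` names match the code under review below, but the fields and logic here are illustrative assumptions, not the installer's actual implementation:

```go
package gcp

import (
	"time"

	"github.com/sirupsen/logrus"
)

// errorTracker remembers when a failure was first seen for a given key so that
// repeated failures can be logged quietly and escalated only periodically.
type errorTracker struct {
	firstSeen map[string]time.Time
}

// suppressWarning logs err at DEBUG normally, and escalates it to WARN once the
// failure for this key has persisted for more than 5 minutes, then restarts the
// escalation window for that key.
func (t *errorTracker) suppressWarning(key string, err error, logger logrus.FieldLogger) {
	if t.firstSeen == nil {
		t.firstSeen = map[string]time.Time{}
	}
	since, seen := t.firstSeen[key]
	if !seen {
		t.firstSeen[key] = time.Now()
	}
	if seen && time.Since(since) > 5*time.Minute {
		logger.Warnf("%s: %v", key, err)
		t.firstSeen[key] = time.Now() // restart the window after escalating
	} else {
		logger.Debugf("%s: %v", key, err)
	}
}
```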
@jstuever
Contributor Author

/retest
/test e2e-gcp

```go
for _, item := range items {
	err := o.deleteAddress(item)
	if err != nil {
		errs = append(errs, err)
		o.errorTracker.suppressWarning(item.key, err, o.Logger)
	}
}
```
Contributor

Why do we have to suppress the warning in each handler instead of handling it at the call sites?

Contributor Author

We don't know the "item.key" at the call sites, so we have no way of knowing which object the error is related to. We can't use the error itself (as we found in AWS), because the error messages are sometimes dynamic and some errors will never bubble up. That is why we have to know which object the error relates to. As a result, we aren't really bubbling up a specific error; instead, we are bubbling up "this object has seen errors".

Contributor

What would it take to bubble up the right error messages so that we can suppress the errors, if we have to, at a higher level instead of inside each handler?

Contributor Author

We could refactor so that we have a single list of items to delete, keyed by SelfLink and including the type, and loop through those with a switch on type that calls the appropriate delete function. The delete function could then just return an error, and the main loop could handle the suppression. We would have to expand our discovery phase to populate that list, though. That sounds like a lot of work.
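Continuing the illustrative sketch above, the refactor being described might take roughly this shape. Every type, field, and function name here is hypothetical and not the installer's actual API:

```go
// cloudItem is a hypothetical unified record produced by the discovery phase:
// one entry per cloud resource, carrying its type and identity.
type cloudItem struct {
	typeName string // e.g. "address", "instance"
	key      string
	selfLink string
}

// ClusterUninstaller stands in for the destroyer type the diff's receiver "o"
// belongs to; only the fields used below are sketched.
type ClusterUninstaller struct {
	Logger       logrus.FieldLogger
	errorTracker errorTracker
}

// Per-type delete functions only return an error; they no longer need to know
// anything about warning suppression.
func (o *ClusterUninstaller) deleteAddress(item cloudItem) error  { return nil /* call the GCP API */ }
func (o *ClusterUninstaller) deleteInstance(item cloudItem) error { return nil /* call the GCP API */ }

// deleteItems dispatches on item type and handles suppression once, in the caller.
func (o *ClusterUninstaller) deleteItems(items []cloudItem) []error {
	var errs []error
	for _, item := range items {
		var err error
		switch item.typeName {
		case "address":
			err = o.deleteAddress(item)
		case "instance":
			err = o.deleteInstance(item)
		// ... one case per resource type
		}
		if err != nil {
			errs = append(errs, err)
			o.errorTracker.suppressWarning(item.key, err, o.Logger)
		}
	}
	return errs
}
```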

Contributor

Why would we have to include a new discovery? The delete function knows which item was being deleted. Would adding context about the item being deleted when returning the error not be enough?

Contributor Author

The way it was before, we were returning aggregated errors as a single error. Breaking that down into individual components would require significant work as well.
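To illustrate the pre-existing shape being described: each per-type pass collects its failures and returns them as one combined error, so by the time a caller sees it, the per-item keys needed for suppression are gone. A minimal sketch, reusing the hypothetical types above and assuming the Kubernetes apimachinery aggregate helper (whether the installer uses exactly this helper here is an assumption):

```go
import utilerrors "k8s.io/apimachinery/pkg/util/errors"

// deleteAddresses sketches the "aggregate and return a single error" pattern:
// individual failures are folded into one error, so the caller cannot tell
// which object each failure belongs to.
func (o *ClusterUninstaller) deleteAddresses(items []cloudItem) error {
	var errs []error
	for _, item := range items {
		if err := o.deleteAddress(item); err != nil {
			errs = append(errs, err)
		}
	}
	return utilerrors.NewAggregate(errs) // nil when errs is empty
}
```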

Contributor

Just as a way to think it through: how would we go about changing the code so that we didn't have to suppress individually and could actually do this handling at a higher level, outside these runners? That way we can shape a follow-up to improve it.

@abhinavdahiya
Contributor

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 26, 2020

@jstuever: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
--- | --- | --- | ---
ci/prow/e2e-openstack | 613f609 | link | /test e2e-openstack

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 9ff752b into openshift:master Jun 26, 2020
@jstuever jstuever deleted the cors1380 branch November 11, 2020 21:05