Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure virtual nodes aren't stranded in GC graph #57503

Merged
merged 2 commits into from Jan 2, 2018

Conversation

liggitt
Copy link
Member

@liggitt liggitt commented Dec 21, 2017

Fixes #56121

See #56121 (comment) for details on the sequence of events that can lead to virtual nodes getting stranded in the graph

Fixed garbage collection hang

(a branch with a commit that reliably triggers the cascading deletion test failure is at https://github.com/liggitt/kubernetes/commits/gc-debug-cascading... it's not easily made into a permanent test case because it only works when that test is run in isolation, and requires plumbing test hooks deep into the watch cache layer)

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 21, 2017
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 21, 2017
@liggitt liggitt added this to the v1.9 milestone Dec 21, 2017
@liggitt
Copy link
Member Author

liggitt commented Dec 21, 2017

@kubernetes/sig-api-machinery-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Dec 21, 2017
@liggitt liggitt added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Dec 21, 2017
@@ -259,6 +259,12 @@ func (gc *GarbageCollector) attemptToDeleteWorker() bool {
}
// retry if garbage collection of an object failed.
gc.attemptToDelete.AddRateLimited(item)
} else if !n.isObserved() {
// requeue if item hasn't been observed yet.
// otherwise a virtual node for an item added/removed during a watch outage can get orphaned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use orphaned. That does other things in this area.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworded comment

@@ -588,6 +592,9 @@ func (gb *GraphBuilder) processGraphChanges() bool {
glog.V(5).Infof("GraphBuilder process object: %s/%s, namespace %s, name %s, uid %s, event type %v", event.gvk.GroupVersion().String(), event.gvk.Kind, accessor.GetNamespace(), accessor.GetName(), string(accessor.GetUID()), event.eventType)
// Check if the node already exsits
existingNode, found := gb.uidToNode.Read(accessor.GetUID())
if found {
existingNode.markObserved()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ask discussed on slack, this is only safe because of callsite behavior. The method and workqueue are private and its not unusual in this area of code, but it is comment worthy.

This is what triggers marking the node observed in the "good" case when the informer sees the object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comments, and tightened up call sites to only expose enqueuing virtual delete events into graphChanges outside the normal informer paths

@deads2k
Copy link
Contributor

deads2k commented Dec 21, 2017

lgtm

@deads2k
Copy link
Contributor

deads2k commented Dec 21, 2017

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 21, 2017
@liggitt liggitt changed the title Ensure virtual nodes aren't orphaned in GC graph Ensure virtual nodes aren't stranded in GC graph Dec 21, 2017
@liggitt
Copy link
Member Author

liggitt commented Dec 21, 2017

comments addressed

@jennybuckley
Copy link

cc @jennybuckley

@BenTheElder
Copy link
Member

BenTheElder commented Dec 23, 2017

Can we make this one happen? (regarding the milestone spam above)
Also can / should we cherry pick this pack to previous releases?

@k8s-github-robot k8s-github-robot removed this from the v1.9 milestone Dec 25, 2017
@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Removed From Pull Request

@caesarxuchao @ironcladlou @liggitt

Important: This pull request was missing labels required for the v1.9 milestone for more than 3 days:

priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@deads2k
Copy link
Contributor

deads2k commented Jan 2, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 2, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, liggitt

Associated issue: #56121

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

/test all

Tests are more than 96 hours old. Re-running tests.

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jan 2, 2018

@liggitt: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-verify df60789 link /test pull-kubernetes-verify
pull-kubernetes-bazel-test df60789 link /test pull-kubernetes-bazel-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 57735, 57503). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit ff58401 into kubernetes:master Jan 2, 2018
@liggitt liggitt added this to the v1.9 milestone Jan 2, 2018
@liggitt liggitt deleted the gc-virtual-node-fix branch January 3, 2018 04:06
k8s-github-robot pushed a commit that referenced this pull request Jan 3, 2018
…3-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #57503: Ensure dependents are added to virtual node before

Cherry pick of #57503 on release-1.9.

#57503: Ensure dependents are added to virtual node before
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants