Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to handle hash collisions correctly for DaemonSets #66476

Conversation

mortent
Copy link
Member

@mortent mortent commented Jul 21, 2018

What this PR does / why we need it: This adds an integration test for the case where there is a hash collision when creating a ControllerRevision for a DaemonSet. It also fixes a shadowed variable that prevented this functionality from working as intended.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62519

Special notes for your reviewer:
/sig apps

Release note:

Fixes issue when updating a DaemonSet causes a hash collision.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 21, 2018
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jul 21, 2018
@mortent
Copy link
Member Author

mortent commented Jul 21, 2018

CLA should be fixed now

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 21, 2018
@dims
Copy link
Member

dims commented Jul 22, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 22, 2018
@janetkuo janetkuo self-assigned this Jul 23, 2018
done, err := Match(ds, existedHistory)
if err != nil {
return nil, err
done, matchErr := Match(ds, existedHistory)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is bound to happen again. Can you do something like

if outerErr := err; errors.IsAlreadExists(outerErr) {
  ...
  return outerErr
}

t.Fatalf("Failed to update DaemonSet: %v", err)
}

validateDaemonSetPodsAndMarkReady(podClient, podInformer, 1, t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you validate that the collision counter is bumped and that the pod has TerminationGracePeriodSeconds set to one?

@mortent
Copy link
Member Author

mortent commented Jul 25, 2018

Updated to address comments from @mikedanese.

Copy link
Member

@janetkuo janetkuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching the bug!

}

validateDaemonSetPodsAndMarkReady(podClient, podInformer, 1, t)
validateDaemonSetStatus(dsClient, ds.Name, 1, t)
Copy link
Member

@janetkuo janetkuo Jul 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validateDaemonSetPodsAndMarkReady checks DS has X number of pods and mark them Ready. It doesn't check if the Pods are updated or not. We can skip this step because we don't care if the pods are ready or not.

validateDaemonSetStatus checks whether the status matches current # of Pods or not, but it doesn't check if the Pods are updated or not, either. We can also skip this one.

There's a possible race that the Pods haven't been updated before we validate collision count and Pods. Suggest instead wait for new pods and new controllerrevision to be created, and then check DS's collision count.

one := int64(1)
ds.Spec.Template.Spec.TerminationGracePeriodSeconds = &one

newHash, newName := hashAndNameForDaemonSet(ds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest add comments for each step (or some steps).

@mortent
Copy link
Member Author

mortent commented Jul 26, 2018

@janetkuo Thanks for the detailed review.
I have updated the validation part of the test to wait for a pod to come up with the updated value for TerminationGracePeriodSeconds and then check that the CollisionCount is the expected value. I think this should avoid the race.
Also added some comments on the key steps in the test code.


// Look up the ControllerRevision for the DaemonSet
_, name := hashAndNameForDaemonSet(ds)
revision, err := clientset.AppsV1().ControllerRevisions(ds.Namespace).Get(name, metav1.GetOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that the CR haven't been created yet after the DS is created, so wait for CR to be created to avoid possible test flake.

@@ -372,6 +374,33 @@ func waitForPodsCreated(podInformer cache.SharedIndexInformer, num int) error {
})
}

func waitForDaemonSetCreated(c clientset.Interface, name string, namespace string) error {
return wait.Poll(2*time.Second, 30*time.Second, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use PollImmediate() instead of Poll(). Use shorter intervals (e.g. 100ms) and shorter timeouts (10s).

I'd like to change all Poll() to PollImmediate() and make interval and timeout constants, e.g.

const (
	interval = 100 * time.Millisecond
	shortTimeout  = 10 * time.Second
    longTimeout = 60 * time.Second
)

but that should be a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the test to use PollImmediate and shorter intervals and timeouts. I can do a separate PR to pull out the constants and switch the other tests in the file to use PollImmediate.

}

// Wait for the pod with the latest Spec to become available
err = wait.Poll(10*time.Second, 60*time.Second, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

t.Fatalf("Failed to update DaemonSet: %v", err)
}

// Wait for the pod with the latest Spec to become available
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Wait for any pod ... to exist"

objects := podInformer.GetIndexer().List()
for _, object := range objects {
pod := object.(*v1.Pod)
if *pod.Spec.TerminationGracePeriodSeconds == int64(1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil: == *ds.Spec.Template.Spec.TerminationGracePeriodSeconds

return false, nil
})
if err != nil {
t.Fatal(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Fatalf("Failed to wait for Pods with the latest Spec to be created: %v", err)

if err != nil {
t.Fatalf("Failed to create ControllerRevision: %v", err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: save ds's collision count here for comparison later. The count should be count+1 at the end of this test.

@mortent mortent force-pushed the IntegrationTestForDaemonSetHashCollision branch from 737cfdc to 0e28b6c Compare August 3, 2018 22:23
@mortent
Copy link
Member Author

mortent commented Aug 3, 2018

@janetkuo I have updated the PR based on the comments.

@mortent
Copy link
Member Author

mortent commented Aug 3, 2018

/retest

@mortent mortent force-pushed the IntegrationTestForDaemonSetHashCollision branch 2 times, most recently from 57ff30a to ab23bc4 Compare August 4, 2018 16:54
var orgCollisionCount int32
if ds.Status.CollisionCount != nil {
orgCollisionCount = *ds.Status.CollisionCount
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this down to L846 after DS's CR is created and a fresh DS is got back from the server. This is because collision count might change after CR creation.

@mortent mortent force-pushed the IntegrationTestForDaemonSetHashCollision branch from ab23bc4 to a93ea43 Compare August 8, 2018 20:54
@janetkuo
Copy link
Member

janetkuo commented Aug 9, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2018
@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 67090, 67159, 66866, 62111, 66476). If you want to cherry-pick this change to another branch, please follow the instructions here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add DaemonSet integration test for hash collision avoidance
6 participants