Fix race condition in setting node statusUpdateNeeded flag #32807

Merged
merged 1 commit into from Sep 23, 2016

10 participants

@jingxu97
Contributor
jingxu97 commented Sep 15, 2016 edited

This PR fixes a race condition in setting the node statusUpdateNeeded flag
in the master's attachdetach controller. The flag indicates whether a
node's status still needs to be updated by the node_status_updater. When
the updater finishes updating a node's status, it sets the flag to false.
When the node's state changes, for example when a volume is detached or a
new volume is attached, the flag is set to true so that the updater
updates the status again. The previous workflow had the following race
condition:

  1. The updater gets the list of currently attached volumes for the node
    that needs to be updated.
  2. A new volume A is attached to the same node right after step 1, which
    sets the flag to TRUE.
  3. The updater writes the node's attached volume list (which does not
    include volume A) and then sets the flag to FALSE.

The result is that volume A is never added to the attached volume list,
so from the node's side the volume is never attached.

With this PR, the flag is set to FALSE at the moment the updater reads the
attached volume list, as a single atomic operation. In the example above,
the flag is TRUE again after step 2, and in step 3 the updater does not
touch the flag if the update succeeds. The flag therefore stays TRUE, and
the node status is updated in the next round of updates.
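
The fixed workflow can be sketched roughly as follows. This is a minimal, simplified illustration and not the controller's actual code: the `actualState` type, the `AddVolume` and `VolumesToReport` methods, and the node and volume names are hypothetical. Only the idea of clearing the flag in the same critical section that reads the volume list comes from the description above. Using `SetNodeStatusUpdateNeeded` to re-arm the flag after a failed update is also an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeState is a hypothetical, simplified stand-in for per-node bookkeeping.
type nodeState struct {
	attachedVolumes    []string
	statusUpdateNeeded bool
}

// actualState is a hypothetical cache guarded by a single mutex.
type actualState struct {
	mu    sync.Mutex
	nodes map[string]*nodeState
}

func newActualState() *actualState {
	return &actualState{nodes: map[string]*nodeState{}}
}

// AddVolume records an attachment and marks the node as needing an update.
func (s *actualState) AddVolume(node, volume string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	n, ok := s.nodes[node]
	if !ok {
		n = &nodeState{}
		s.nodes[node] = n
	}
	n.attachedVolumes = append(n.attachedVolumes, volume)
	n.statusUpdateNeeded = true
}

// VolumesToReport snapshots the volumes of every node whose flag is set and
// clears the flag under the same lock, so a later attach cannot be lost.
func (s *actualState) VolumesToReport() map[string][]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := map[string][]string{}
	for name, n := range s.nodes {
		if n.statusUpdateNeeded {
			out[name] = append([]string(nil), n.attachedVolumes...)
			n.statusUpdateNeeded = false
		}
	}
	return out
}

// SetNodeStatusUpdateNeeded re-arms the flag (assumed use: a failed update).
func (s *actualState) SetNodeStatusUpdateNeeded(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n, ok := s.nodes[node]; ok {
		n.statusUpdateNeeded = true
	}
}

func main() {
	s := newActualState()
	s.AddVolume("node-1", "vol-0")

	snapshot := s.VolumesToReport() // step 1: snapshot taken, flag cleared
	s.AddVolume("node-1", "vol-A")  // step 2: a new attach sets the flag again
	// Step 3: the updater writes the snapshot and leaves the flag alone.
	fmt.Println(snapshot["node-1"]) // prints [vol-0]

	// The next round still sees the flag set and reports vol-A.
	fmt.Println(s.VolumesToReport()["node-1"]) // prints [vol-0 vol-A]
}
```

Because the read and the reset happen under one lock, an attach that arrives after the snapshot always leaves the flag set for the next round.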

@googlebot googlebot added the cla: yes label Sep 15, 2016
@jingxu97 jingxu97 added this to the v1.4 milestone Sep 15, 2016
@saad-ali
Member

Removing this from 1.4 milestone per offline discussion. We will get it merged, give it time to bake, and merge it into a 1.4.x release.

@saad-ali saad-ali removed this from the v1.4 milestone Sep 15, 2016
@jingxu97 jingxu97 changed the title from Fix race conditino in setting node statusUpdateNeeded flag to Fix race condition in setting node statusUpdateNeeded flag Sep 15, 2016
- ResetNodeStatusUpdateNeeded(nodeName string) error
+ // node to true indicating the AttachedVolume field of the Node's Status
+ // object needs to be updated by the node updater again.
+ ResetNodeStatusUpdateNeeded(nodeName string)
@saad-ali
saad-ali Sep 19, 2016 Member

Revert the "If no node with the..." portion of the comment since it is still applicable.

@jingxu97
jingxu97 Sep 21, 2016 Contributor

Here I removed the returned error because I could not see any use for it.

@saad-ali
saad-ali Sep 22, 2016 Member

This method can result in a failed to setNodeStatusUpdateNeeded error if nodeName does not exist. I like documentation comments to capture 1) what the input is, 2) what the normal output is, and 3) what results in an error. So I would leave the comment as is. That said, I'll leave it up to you to decide.

@jingxu97
jingxu97 Sep 22, 2016 Contributor

update the comments

- ResetNodeStatusUpdateNeeded(nodeName string) error
+ // node to true indicating the AttachedVolume field of the Node's Status
+ // object needs to be updated by the node updater again.
+ ResetNodeStatusUpdateNeeded(nodeName string)
@saad-ali
saad-ali Sep 19, 2016 Member

Since this method now sets the value of statusUpdateNeeded to true instead of false, change the name to reflect the behavior: SetNodeStatusUpdateNeeded(...)

@jingxu97
jingxu97 Sep 21, 2016 Contributor

done

@saad-ali

A couple minor comments, otherwise LGTM

@jingxu97
Contributor

@saad-ali PTAL

- "failed to ResetNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",
+ // should not happen
+ glog.Errorf(
+ "failed to setNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",
@saad-ali
saad-ali Sep 22, 2016 Member

Add needed bool to error message for clarity.

@jingxu97
jingxu97 Sep 22, 2016 Contributor

done

@saad-ali

A few more comments

+// Update the flag statusUpdateNeeded to indicate whether node status is already updated or
+// needs to be updated again by the node status updater.
+// This is an internal function and caller should acquire and release the lock
+func (asw *actualStateOfWorld) setNodeStatusUpdateNeeded(nodeName string, needed bool) {
@saad-ali
saad-ali Sep 22, 2016 Member

nit: to avoid confusing this with the SetNodeStatusUpdateNeeded which sets the value to true, maybe rename it to modifyNodeStatusUpdateNeeded?

@jingxu97
jingxu97 Sep 22, 2016 Contributor

done
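
For readers following this thread, the split between an exported setter and a lock-free internal helper might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the map-based `statusUpdateNeeded` field is a simplified, hypothetical layout, the helper uses the `modifyNodeStatusUpdateNeeded` name suggested above, and the standard `log` package stands in for `glog`.

```go
package main

import (
	"log"
	"sync"
)

// actualStateOfWorld is reduced here to the one flag under discussion.
type actualStateOfWorld struct {
	sync.RWMutex
	// statusUpdateNeeded is keyed by node name (hypothetical layout).
	statusUpdateNeeded map[string]bool
}

// SetNodeStatusUpdateNeeded sets statusUpdateNeeded for the given node to
// true. If no node with that name exists, an error is logged and the state
// is left unchanged.
func (asw *actualStateOfWorld) SetNodeStatusUpdateNeeded(nodeName string) {
	asw.Lock()
	defer asw.Unlock()
	asw.modifyNodeStatusUpdateNeeded(nodeName, true)
}

// modifyNodeStatusUpdateNeeded updates the flag to the given value.
// It is internal: the caller must already hold the lock.
func (asw *actualStateOfWorld) modifyNodeStatusUpdateNeeded(nodeName string, needed bool) {
	if _, ok := asw.statusUpdateNeeded[nodeName]; !ok {
		// Should not happen; include the requested value for clarity.
		log.Printf("failed to set statusUpdateNeeded to needed=%t: nodeName=%q does not exist",
			needed, nodeName)
		return
	}
	asw.statusUpdateNeeded[nodeName] = needed
}

func main() {
	asw := &actualStateOfWorld{statusUpdateNeeded: map[string]bool{"node-1": false}}
	asw.SetNodeStatusUpdateNeeded("node-1")
	asw.SetNodeStatusUpdateNeeded("missing") // logs an error, state unchanged
	log.Println(asw.statusUpdateNeeded["node-1"])
}
```

Keeping the helper lock-free lets other methods that already hold the lock, such as one that reads the volume list and clears the flag, call it without deadlocking.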

@jingxu97 jingxu97 Fix race conditino in setting node statusUpdateNeeded flag
This PR also changes a unit test due to the workflow changes
14cad20
@jingxu97
Contributor

@saad-ali PTAL

@saad-ali

LGTM

@saad-ali
Member

This PR should be cherry-picked to v1.4.1.

@jingxu97 jingxu97 added this to the v1.4 milestone Sep 22, 2016
@jingxu97 jingxu97 removed this from the v1.4 milestone Sep 22, 2016
@k8s-ci-robot
Collaborator

Jenkins GCE e2e failed for commit 14cad20. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@jingxu97
Contributor

@k8s-bot gce e2e test this

@k8s-merge-robot
Collaborator

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot
Collaborator

Automatic merge from submit-queue

@k8s-merge-robot k8s-merge-robot merged commit 0a4316f into kubernetes:master Sep 23, 2016

7 of 8 checks passed

Submit Queue: Github CI tests are not green.
Jenkins GCE Node e2e: Build finished. 691 tests run, 200 skipped, 0 failed.
Jenkins GCE e2e: Build succeeded.
Jenkins GKE smoke e2e: Build succeeded.
Jenkins Kubemark GCE e2e: Build succeeded.
Jenkins unit/integration: Build succeeded.
Jenkins verification: Build succeeded.
cla/google: All necessary CLAs are signed
@saad-ali saad-ali added this to the v1.4 milestone Sep 26, 2016
@saad-ali
Member

Adding cherrypick-candidate and v1.4 milestone to have this picked up for v1.4.1

@k8s-cherrypick-bot

Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@saad-ali
Member
saad-ali commented Oct 4, 2016 edited

Now that this has been cherry-picked to the 1.4 branch (for 1.4.1), let's also cherry-pick it to the 1.3 branch (for 1.3.9).

@saad-ali saad-ali modified the milestone: v1.3, v1.4 Oct 4, 2016
@k8s-cherrypick-bot

Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
