Fix race condition in setting node statusUpdateNeeded flag #32807

Merged
k8s-github-robot merged 1 commit into kubernetes:master from jingxu97:stateupdateNeeded-9-15 on Sep 23, 2016

@jingxu97
Contributor

jingxu97 commented Sep 15, 2016

This PR fixes a race condition in setting the node statusUpdateNeeded
flag in the master's attachdetach controller. The flag indicates whether
a node's status has already been updated by the node_status_updater.
When the updater finishes updating a node's status, the flag is set to
false. When the node's status changes, for example when a volume is
detached or a new volume is attached to the node, the flag is set to
true so that the updater updates the status again. The previous
workflow has the following race condition:

  1. The updater gets the list of currently attached volumes for the node that needs to be
    updated.
  2. A new volume A is attached to the same node right after step 1, which sets the
    flag to TRUE.
  3. The updater writes the attached volume list (which does not include volume A) to the node status and then sets the flag to FALSE.

The result is that volume A is never added to the attached volume
list, so on the node side this volume is never attached.
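
The three steps above can be replayed deterministically. The following is a minimal single-threaded sketch, not the controller's actual code: `nodeState`, `replayOldWorkflow`, and the volume names are hypothetical and exist only to show why the old ordering loses volume A.

```go
package main

import "fmt"

// nodeState is a hypothetical stand-in for the controller's per-node record.
type nodeState struct {
	attachedVolumes    []string
	statusUpdateNeeded bool
}

// replayOldWorkflow replays steps 1-3 of the old workflow in order and
// returns the list that was published to the node status plus the final flag.
func replayOldWorkflow() (published []string, flag bool) {
	n := &nodeState{
		attachedVolumes:    []string{"vol-existing"},
		statusUpdateNeeded: true,
	}

	// Step 1: the updater snapshots the attached volume list.
	snapshot := append([]string(nil), n.attachedVolumes...)

	// Step 2: volume A is attached right after step 1; the flag is set to true.
	n.attachedVolumes = append(n.attachedVolumes, "vol-A")
	n.statusUpdateNeeded = true

	// Step 3: the updater publishes the stale snapshot, then clears the flag.
	published = snapshot
	n.statusUpdateNeeded = false

	return published, n.statusUpdateNeeded
}

func main() {
	published, flag := replayOldWorkflow()
	// The published list lacks vol-A and the flag is false,
	// so nothing triggers another update.
	fmt.Println(published, flag)
}
```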

So in this PR, the flag is set to FALSE at the moment the updater gets
the attached volume list (both happen as one atomic operation). In the
example above, the flag is TRUE again after step 2, and in step 3 the
updater does not touch the flag if the update is successful. The flag
is therefore still TRUE, and the node status is updated in the next
round of updates.
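
A minimal sketch of this idea, assuming a mutex-protected map of nodes. The names `stateOfWorld`, `AddVolume`, `TakeVolumesToReport`, and `NeedsUpdate` are hypothetical and are not the identifiers used in the attachdetach controller. What matters is that the snapshot and the flag reset happen inside the same critical section.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeRecord is a hypothetical per-node record.
type nodeRecord struct {
	volumes            []string
	statusUpdateNeeded bool
}

// stateOfWorld is a hypothetical, simplified cache guarded by one mutex.
type stateOfWorld struct {
	mu    sync.Mutex
	nodes map[string]*nodeRecord
}

func newStateOfWorld() *stateOfWorld {
	return &stateOfWorld{nodes: make(map[string]*nodeRecord)}
}

// AddVolume records an attachment and marks the node as needing an update.
func (s *stateOfWorld) AddVolume(node, vol string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.nodes[node]
	if !ok {
		rec = &nodeRecord{}
		s.nodes[node] = rec
	}
	rec.volumes = append(rec.volumes, vol)
	rec.statusUpdateNeeded = true
}

// TakeVolumesToReport copies the volume list of every node whose flag is set
// and clears the flag while still holding the lock, so an attachment that
// arrives afterwards sets the flag again and is picked up in the next round.
func (s *stateOfWorld) TakeVolumesToReport() map[string][]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string][]string)
	for name, rec := range s.nodes {
		if rec.statusUpdateNeeded {
			out[name] = append([]string(nil), rec.volumes...)
			rec.statusUpdateNeeded = false
		}
	}
	return out
}

// NeedsUpdate reports whether the node is still marked for an update.
func (s *stateOfWorld) NeedsUpdate(node string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.nodes[node]
	return ok && rec.statusUpdateNeeded
}

func main() {
	s := newStateOfWorld()
	s.AddVolume("node-1", "vol-existing")

	snapshot := s.TakeVolumesToReport() // step 1: snapshot and clear the flag
	s.AddVolume("node-1", "vol-A")      // step 2: flag becomes true again
	// Step 3: the updater publishes snapshot and leaves the flag alone.

	fmt.Println(snapshot["node-1"], s.NeedsUpdate("node-1"))
	fmt.Println(s.TakeVolumesToReport()["node-1"]) // next round includes vol-A
}
```

In this sketch the updater never writes FALSE after publishing, which is the step that overwrote the TRUE from step 2 in the old workflow.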


@saad-ali
Member

saad-ali commented Sep 15, 2016

Removing this from 1.4 milestone per offline discussion. We will get it merged, give it time to bake, and merge it into a 1.4.x release.

@saad-ali saad-ali removed this from the v1.4 milestone Sep 15, 2016
@jingxu97 jingxu97 changed the title from "Fix race conditino in setting node statusUpdateNeeded flag" to "Fix race condition in setting node statusUpdateNeeded flag" Sep 15, 2016
@jingxu97 jingxu97 force-pushed the jingxu97:stateupdateNeeded-9-15 branch 2 times, most recently from dfeb259 to 25a0678 Sep 15, 2016
@k8s-bot

k8s-bot commented Sep 16, 2016

GCE e2e build/test passed for commit 25a0678.

ResetNodeStatusUpdateNeeded(nodeName string) error
// node to true indicating the AttachedVolume field of the Node's Status
// object needs to be updated by the node updater again.
ResetNodeStatusUpdateNeeded(nodeName string)

@saad-ali

saad-ali Sep 19, 2016

Member

Revert the "If no node with the..." portion of the comment since it is still applicable.

@jingxu97

jingxu97 Sep 21, 2016

Author Contributor

I removed the returned error here because I could not see any use for it.

@saad-ali

saad-ali Sep 22, 2016

Member

This method can result in a "failed to setNodeStatusUpdateNeeded" error if nodeName does not exist. I like documentation comments to capture 1) what the input is, 2) what the normal output is, and 3) what results in an error. So I would leave the comment as is. That said, I'll leave it up to you to decide.

@jingxu97

jingxu97 Sep 22, 2016

Author Contributor

Updated the comments.

ResetNodeStatusUpdateNeeded(nodeName string) error
// node to true indicating the AttachedVolume field of the Node's Status
// object needs to be updated by the node updater again.
ResetNodeStatusUpdateNeeded(nodeName string)

@saad-ali

saad-ali Sep 19, 2016

Member

Since this method now sets the value of statusUpdateNeeded to true instead of false, change the name to reflect the behavior: SetNodeStatusUpdateNeeded(...)

@jingxu97

jingxu97 Sep 21, 2016

Author Contributor

done

Member

saad-ali left a comment

A couple minor comments, otherwise LGTM

@jingxu97
Contributor Author

jingxu97 commented Sep 21, 2016

@saad-ali PTAL

@jingxu97
Contributor Author

jingxu97 commented Sep 22, 2016

@saad-ali PTAL

@jingxu97 jingxu97 force-pushed the jingxu97:stateupdateNeeded-9-15 branch from 25a0678 to d696d45 Sep 22, 2016
"failed to ResetNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",
// should not happen
glog.Errorf(
"failed to setNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",

@saad-ali

saad-ali Sep 22, 2016

Member

Add needed bool to error message for clarity.

@jingxu97

jingxu97 Sep 22, 2016

Author Contributor

done

Member

saad-ali left a comment

A few more comments

// Update the flag statusUpdateNeeded to indicate whether node status is already updated or
// needs to be updated again by the node status updater.
// This is an internal function and caller should acquire and release the lock
func (asw *actualStateOfWorld) setNodeStatusUpdateNeeded(nodeName string, needed bool) {

@saad-ali

saad-ali Sep 22, 2016

Member

nit: to avoid confusing this with the SetNodeStatusUpdateNeeded which sets the value to true, maybe rename it to modifyNodeStatusUpdateNeeded?

@jingxu97

jingxu97 Sep 22, 2016

Author Contributor

done
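
The convention discussed in this thread (an exported method that takes the lock and always sets the flag to true, plus an internal helper that assumes the caller already holds the lock) can be sketched as follows. This is an illustration under stated assumptions, not the merged code: the `world` type and its `dirty` map are hypothetical, and the helper returns an error built with `fmt.Errorf` instead of logging through glog so that the example needs only the standard library. The error text includes the `needed` value, as requested in the review.

```go
package main

import (
	"fmt"
	"sync"
)

// world is a hypothetical, simplified stand-in for actualStateOfWorld.
type world struct {
	mu    sync.Mutex
	dirty map[string]bool // node name -> statusUpdateNeeded
}

// SetNodeStatusUpdateNeeded marks the node as needing a status update.
// Input: the name of a tracked node.
// Normal result: the flag for nodeName is set to true and nil is returned.
// Error: returned if no node with the given name exists.
func (w *world) SetNodeStatusUpdateNeeded(nodeName string) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.modifyNodeStatusUpdateNeeded(nodeName, true)
}

// modifyNodeStatusUpdateNeeded sets the flag to the given value.
// It is an internal helper: the caller must already hold w.mu.
func (w *world) modifyNodeStatusUpdateNeeded(nodeName string, needed bool) error {
	if _, ok := w.dirty[nodeName]; !ok {
		return fmt.Errorf(
			"failed to modifyNodeStatusUpdateNeeded(nodeName=%q, needed=%t): nodeName does not exist",
			nodeName, needed)
	}
	w.dirty[nodeName] = needed
	return nil
}

func main() {
	w := &world{dirty: map[string]bool{"node-1": false}}
	fmt.Println(w.SetNodeStatusUpdateNeeded("node-1"), w.dirty["node-1"])
	fmt.Println(w.SetNodeStatusUpdateNeeded("missing"))
}
```

Keeping the lock acquisition in the exported method lets other locked code paths reuse the helper without trying to acquire the same mutex twice.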

This PR also changes a unit test to account for the workflow changes.
@jingxu97 jingxu97 force-pushed the jingxu97:stateupdateNeeded-9-15 branch from d696d45 to 14cad20 Sep 22, 2016
@jingxu97
Contributor Author

jingxu97 commented Sep 22, 2016

@saad-ali PTAL

Member

saad-ali left a comment

LGTM

@saad-ali
Member

saad-ali commented Sep 22, 2016

This PR should be cherry-picked to v1.4.1.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Sep 23, 2016

Jenkins GCE e2e failed for commit 14cad20. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@jingxu97
Contributor Author

jingxu97 commented Sep 23, 2016

@k8s-bot gce e2e test this

@k8s-github-robot
Contributor

k8s-github-robot commented Sep 23, 2016

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Contributor

k8s-github-robot commented Sep 23, 2016

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 0a4316f into kubernetes:master Sep 23, 2016
7 of 8 checks passed
Submit Queue: Github CI tests are not green.
Jenkins GCE Node e2e: Build finished. 691 tests run, 200 skipped, 0 failed.
Jenkins GCE e2e: Build succeeded.
Jenkins GKE smoke e2e: Build succeeded.
Jenkins Kubemark GCE e2e: Build succeeded.
Jenkins unit/integration: Build succeeded.
Jenkins verification: Build succeeded.
cla/google: All necessary CLAs are signed
@saad-ali saad-ali added this to the v1.4 milestone Sep 26, 2016
@saad-ali
Member

saad-ali commented Sep 26, 2016

Adding cherrypick-candidate and v1.4 milestone to have this picked up for v1.4.1

k8s-github-robot pushed a commit that referenced this pull request Oct 4, 2016
…07-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of #32807

Cherry pick of #32807 on release-1.4.
@k8s-cherrypick-bot

k8s-cherrypick-bot commented Oct 4, 2016

Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@saad-ali
Member

saad-ali commented Oct 4, 2016

Now that this has been cherry-picked to the 1.4 branch (for 1.4.1), let's also cherry-pick it to the 1.3 branch (for 1.3.9).

@saad-ali saad-ali modified the milestones: v1.3, v1.4 Oct 4, 2016
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ck-of-#32807-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of kubernetes#32807

Cherry pick of kubernetes#32807 on release-1.4.
@k8s-cherrypick-bot

k8s-cherrypick-bot commented Jan 13, 2017

Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
