Fix race condition in setting node statusUpdateNeeded flag #32807

Merged
k8s-merge-robot merged 1 commit into kubernetes:master on Sep 23, 2016

10 participants
@jingxu97
Contributor

jingxu97 commented Sep 15, 2016

This PR fixes a race condition in setting the node statusUpdateNeeded flag
in the master's attachdetach controller. The flag indicates whether a
node's status still needs to be updated by the node_status_updater. When
the updater finishes updating a node's status, the flag is set to false.
When the node's state changes, for example when a volume is detached or a
new volume is attached, the flag is set to true so that the updater
updates the status again. The previous workflow had the following race
condition:

  1. The updater gets the list of currently attached volumes for the node that needs to be
    updated.
  2. A new volume A is attached to the same node right after step 1, which sets the
    flag to TRUE.
  3. The updater writes the attached volume list (which does not include volume A) to the node status and then sets the flag to FALSE.

The result is that volume A is never added to the attached volume
list, so from the node's side the volume is never attached.

With this PR, the flag is set to FALSE at the moment the updater gets the
attached volume list, as a single atomic operation. In the example above,
step 2 then sets the flag back to TRUE, and in step 3 the updater does not
touch the flag if the update succeeds. The flag is therefore still TRUE
afterwards, and the node status is updated in the next round.
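
To make the new ordering concrete, here is a minimal sketch of the "read the list and clear the flag under one lock" idea. It is not the controller's actual code: `stateCache`, `nodeState`, `AddVolume`, and `GetVolumesToReport` are hypothetical, simplified names, and volumes are plain strings. Only `SetNodeStatusUpdateNeeded` borrows its name from this PR's review discussion.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeState is a hypothetical, simplified stand-in for the controller's
// per-node bookkeeping.
type nodeState struct {
	attachedVolumes    []string
	statusUpdateNeeded bool
}

// stateCache is a hypothetical cache guarded by a single mutex.
type stateCache struct {
	mu    sync.Mutex
	nodes map[string]*nodeState
}

func newStateCache() *stateCache {
	return &stateCache{nodes: make(map[string]*nodeState)}
}

// AddVolume records an attach and marks the node as needing a status update.
func (c *stateCache) AddVolume(node, volume string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[node]
	if !ok {
		n = &nodeState{}
		c.nodes[node] = n
	}
	n.attachedVolumes = append(n.attachedVolumes, volume)
	n.statusUpdateNeeded = true
}

// GetVolumesToReport returns a snapshot of the volumes for every node whose
// flag is set and clears the flag under the same lock, so an attach that
// arrives later sets the flag again instead of being overwritten.
func (c *stateCache) GetVolumesToReport() map[string][]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string][]string)
	for name, n := range c.nodes {
		if n.statusUpdateNeeded {
			out[name] = append([]string(nil), n.attachedVolumes...)
			n.statusUpdateNeeded = false
		}
	}
	return out
}

// SetNodeStatusUpdateNeeded re-arms the flag, for example after a failed update.
func (c *stateCache) SetNodeStatusUpdateNeeded(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n, ok := c.nodes[node]; ok {
		n.statusUpdateNeeded = true
	}
}

func main() {
	c := newStateCache()
	c.AddVolume("node-1", "vol-0")

	snapshot := c.GetVolumesToReport() // step 1: snapshot taken, flag cleared
	c.AddVolume("node-1", "vol-A")     // step 2: attach races in, flag set again
	// Step 3: the updater would write the snapshot to the node status here
	// and, on success, leave the flag alone.
	fmt.Println("first round:", snapshot["node-1"])
	fmt.Println("next round:", c.GetVolumesToReport()["node-1"])
}
```

In this replay of steps 1 to 3, the first snapshot misses volume A, but the flag is TRUE again after step 2, so the next round reports it.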

Fix race condition in setting node statusUpdateNeeded flag 

@saad-ali

saad-ali Sep 15, 2016

Member

Removing this from 1.4 milestone per offline discussion. We will get it merged, give it time to bake, and merge it into a 1.4.x release.

@saad-ali saad-ali removed this from the v1.4 milestone Sep 15, 2016

@jingxu97 jingxu97 changed the title from Fix race conditino in setting node statusUpdateNeeded flag to Fix race condition in setting node statusUpdateNeeded flag Sep 15, 2016

@k8s-bot

k8s-bot commented Sep 16, 2016

GCE e2e build/test passed for commit 25a0678.

pkg/controller/volume/attachdetach/cache/actual_state_of_world.go
ResetNodeStatusUpdateNeeded(nodeName string) error
// node to true indicating the AttachedVolume field of the Node's Status
// object needs to be updated by the node updater again.
ResetNodeStatusUpdateNeeded(nodeName string)

@saad-ali

saad-ali Sep 19, 2016

Member

Revert the "If no node with the..." portion of the comment, since it is still applicable.

@jingxu97

jingxu97 Sep 21, 2016

Contributor

Here I removed the returned error because I could not see any use for it.

@saad-ali

saad-ali Sep 22, 2016

Member

This method can result in a "failed to setNodeStatusUpdateNeeded" error if nodeName does not exist. I like documentation comments to capture 1) what the input is, 2) what the normal output is, and 3) what results in an error. So I would leave the comment as is. That said, I'll leave it up to you to decide.

@jingxu97

jingxu97 Sep 22, 2016

Contributor

update the comments

pkg/controller/volume/attachdetach/cache/actual_state_of_world.go
ResetNodeStatusUpdateNeeded(nodeName string) error
// node to true indicating the AttachedVolume field of the Node's Status
// object needs to be updated by the node updater again.
ResetNodeStatusUpdateNeeded(nodeName string)

@saad-ali

saad-ali Sep 19, 2016

Member

Since this method now sets the value of statusUpdateNeeded to true instead of false, change the name to reflect the behavior: SetNodeStatusUpdateNeeded(...)

@jingxu97

jingxu97 Sep 21, 2016

Contributor

done

@saad-ali

A couple minor comments, otherwise LGTM

@jingxu97

jingxu97 Sep 21, 2016

Contributor

@saad-ali PTAL

@jingxu97

jingxu97 Sep 22, 2016

Contributor

@saad-ali PTAL

pkg/controller/volume/attachdetach/cache/actual_state_of_world.go
"failed to ResetNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",
// should not happen
glog.Errorf(
"failed to setNodeStatusUpdateNeeded(nodeName=%q) nodeName does not exist",

@saad-ali

saad-ali Sep 22, 2016

Member

Add needed bool to error message for clarity.

@jingxu97

jingxu97 Sep 22, 2016

Contributor

done
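
For illustration only, one way the requested message could look once the `needed` value is included. This is a sketch, not the merged code: `missingNodeMessage` is a hypothetical helper, the exact wording is an assumption, and it uses `fmt` where the diff above calls `glog.Errorf`.

```go
package main

import "fmt"

// missingNodeMessage is a hypothetical helper that formats the log line for
// a flag update against an unknown node, including the requested value.
func missingNodeMessage(nodeName string, needed bool) string {
	return fmt.Sprintf(
		"failed to set statusUpdateNeeded to needed=%t (nodeName=%q): nodeName does not exist",
		needed, nodeName)
}

func main() {
	fmt.Println(missingNodeMessage("node-1", true))
}
```

With the value in the message, a log reader can tell whether the failed call was trying to arm or clear the flag.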

@saad-ali

A few more comments

pkg/controller/volume/attachdetach/cache/actual_state_of_world.go
// Update the flag statusUpdateNeeded to indicate whether node status is already updated or
// needs to be updated again by the node status updater.
// This is an internal function and caller should acquire and release the lock
func (asw *actualStateOfWorld) setNodeStatusUpdateNeeded(nodeName string, needed bool) {

@saad-ali

saad-ali Sep 22, 2016

Member

nit: to avoid confusing this with SetNodeStatusUpdateNeeded, which sets the value to true, maybe rename it to modifyNodeStatusUpdateNeeded?

@jingxu97

jingxu97 Sep 22, 2016

Contributor

done
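
A rough sketch of the split this thread is about, assuming the suggested name was adopted: an exported method that takes the lock and always sets the flag to true, and an unexported helper that writes any value and relies on the caller holding the lock. `flagCache` and its fields are hypothetical, and the helper here silently ignores unknown nodes, whereas the code in the diff logs an error.

```go
package main

import (
	"fmt"
	"sync"
)

// flagCache is a hypothetical, pared-down cache holding one flag per node.
type flagCache struct {
	mu     sync.Mutex
	needed map[string]bool
}

// SetNodeStatusUpdateNeeded is the exported entry point: it takes the lock
// and always sets the flag to true.
func (c *flagCache) SetNodeStatusUpdateNeeded(nodeName string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.modifyNodeStatusUpdateNeeded(nodeName, true)
}

// modifyNodeStatusUpdateNeeded writes an arbitrary value for a known node.
// It does no locking of its own; the caller must already hold c.mu.
func (c *flagCache) modifyNodeStatusUpdateNeeded(nodeName string, needed bool) {
	if _, ok := c.needed[nodeName]; !ok {
		return // unknown node: nothing to update in this sketch
	}
	c.needed[nodeName] = needed
}

func main() {
	c := &flagCache{needed: map[string]bool{"node-1": false}}
	c.SetNodeStatusUpdateNeeded("node-1")
	fmt.Println(c.needed["node-1"])
}
```

Keeping the locking in the exported wrapper lets other methods that already hold the lock reuse the helper without deadlocking.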

Fix race condition in setting node statusUpdateNeeded flag
This PR also changes a unit test due to the workflow changes
@jingxu97

jingxu97 Sep 22, 2016

Contributor

@saad-ali PTAL

@saad-ali

LGTM

@saad-ali

saad-ali Sep 22, 2016

Member

This PR should be cherry-picked to v1.4.1.

@k8s-ci-robot

k8s-ci-robot Sep 23, 2016

Jenkins GCE e2e failed for commit 14cad20. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@jingxu97

jingxu97 Sep 23, 2016

Contributor

@k8s-bot gce e2e test this

@k8s-merge-robot

k8s-merge-robot Sep 23, 2016

Contributor

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

k8s-merge-robot Sep 23, 2016

Contributor

Automatic merge from submit-queue

@k8s-merge-robot k8s-merge-robot merged commit 0a4316f into kubernetes:master Sep 23, 2016

7 of 8 checks passed

Submit Queue Github CI tests are not green.
Jenkins GCE Node e2e Build finished. 691 tests run, 200 skipped, 0 failed.
Jenkins GCE e2e Build succeeded.
Jenkins GKE smoke e2e Build succeeded.
Jenkins Kubemark GCE e2e Build succeeded.
Jenkins unit/integration Build succeeded.
Jenkins verification Build succeeded.
cla/google All necessary CLAs are signed

@saad-ali saad-ali added this to the v1.4 milestone Sep 26, 2016

@saad-ali

saad-ali Sep 26, 2016

Member

Adding cherrypick-candidate and v1.4 milestone to have this picked up for v1.4.1

k8s-merge-robot added a commit that referenced this pull request Oct 4, 2016

Merge pull request #34038 from jingxu97/automated-cherry-pick-of-#328…
…07-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of #32807

Cherry pick of #32807 on release-1.4.
@k8s-cherrypick-bot

k8s-cherrypick-bot Oct 4, 2016

Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@saad-ali

saad-ali Oct 4, 2016

Member

Now that this has been cherry-picked to the 1.4 branch (for 1.4.1), let's also cherry-pick it to the 1.3 branch (for 1.3.9).

@saad-ali saad-ali modified the milestones: v1.3, v1.4 Oct 4, 2016

shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016

Merge pull request #34038 from jingxu97/automated-cherry-pick-of-#328…
…07-upstream-release-1.4

Automatic merge from submit-queue

Automated cherry pick of #32807

Cherry pick of #32807 on release-1.4.
@k8s-cherrypick-bot

k8s-cherrypick-bot Jan 13, 2017

Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
