New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle failed attach operation leave uncertain volume attach state #71276

Open
wants to merge 3 commits into
base: master
from

Conversation

@jingxu97
Copy link
Contributor

jingxu97 commented Nov 20, 2018

This PR fixes issue #32727.

When an attach operation fails, it is still possible that the volume
will be attached to the node later. This PR adds the logic to record the
volume to node with attached state no matter whether the operation
succeeded or not. If the operation fails, mark the attached state to
false. If the operation succeeded, mark the attached state to true. The
reconciler will still issue attach operation until it returns
successfully. If the pod is removed in the meantime, the reconciler
will issue detach operations for all the volumes no matter what the
attached state is.

jingxu97 added some commits Oct 2, 2018

WIP: Handle failed attach operation leave uncertain volume attach state
This PR fixes issue #32727.

When an attach operation fails, it is still possible that the volume
will be attached to the node later. This PR adds the logic to record the
volume to node with attached state no matter whether the operation
succedded or not. If the operation fails, mark the attached state to
false. If the operation succeeded, mark the attached state to true. The
reconciler will still issue attach operation until it returns
successfully. If the pod is removed in the mean time, the reconciler
will issue detach operations for all the volumes no matter what is the
attached state.
Handle failed attach operation leave uncertain volume attach state
This commit adds the unit tests for the PR. It also includes some files
that are affected by the function name changes.
@nikopen

This comment has been minimized.

Copy link
Member

nikopen commented Nov 20, 2018

/kind bug
/priority critical-urgent

@jingxu97 jingxu97 force-pushed the jingxu97:Oct/uncertain branch 2 times, most recently from b0435cc to 61e97f6 Nov 20, 2018

@davidz627

This comment has been minimized.

Copy link
Contributor

davidz627 commented Nov 20, 2018

/lgtm

@AishSundar

This comment has been minimized.

Copy link
Contributor

AishSundar commented Nov 21, 2018

/test pull-kubernetes-bazel-test

@k8s-ci-robot k8s-ci-robot removed the lgtm label Nov 21, 2018

@nikopen

This comment has been minimized.

Copy link
Member

nikopen commented Nov 21, 2018

I1121 09:03:22.566] hack/make-rules/../../hack/verify-gofmt.sh
I1121 09:03:22.568] Makefile:128: recipe for target 'verify' failed
W1121 09:03:22.668] make: *** [verify] Error 1
W1121 09:03:23.977] Traceback (most recent call last):
W1121 09:03:23.978]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_verify.py", line 167, in <module>
W1121 09:03:23.978]     main(ARGS.branch, ARGS.script, ARGS.force, ARGS.prow)
W1121 09:03:23.978]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_verify.py", line 148, in main
W1121 09:03:23.978]     'bash', '-c', 'cd kubernetes && %s' % script,
W1121 09:03:23.978]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_verify.py", line 48, in check
W1121 09:03:23.978]     subprocess.check_call(cmd)
W1121 09:03:23.978]   File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
W1121 09:03:23.978]     raise CalledProcessError(retcode, cmd)
W1121 09:03:23.979] subprocess.CalledProcessError: Command '('docker', 'run', '--rm=true', '--privileged=true', '-v', '/var/run/docker.sock:/var/run/docker.sock', '-v', '/etc/localtime:/etc/localtime:ro', '-v', '/workspace/k8s.io/kubernetes:/go/src/k8s.io/kubernetes', '-v', '/workspace/_artifacts:/workspace/artifacts', '-e', 'KUBE_FORCE_VERIFY_CHECKS=n', '-e', 'KUBE_VERIFY_GIT_BRANCH=master', '-e', 'REPO_DIR=/workspace/k8s.io/kubernetes', 'gcr.io/k8s-testimages/kubekins-test:1.13-v20181105-ceed87206', 'bash', '-c', 'cd kubernetes && ./hack/jenkins/verify-dockerized.sh')' returned non-zero exit status 2
@AishSundar

This comment has been minimized.

Copy link
Contributor

AishSundar commented Nov 21, 2018

/test pull-kubernetes-verify

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Nov 26, 2018

Currently we have unit test to mimic the problematic attach behavior. It is hard to have real e2e test because we cannot control the attach behavior of a real volume driver.

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 8, 2019

/test pull-kubernetes-local-e2e-containerized

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 8, 2019

@jsafrane

This comment has been minimized.

Copy link
Member

jsafrane commented Jan 9, 2019

/test pull-kubernetes-godeps

Tested with #69697, works well.

/lgtm
/approve

@saad-ali

This comment has been minimized.

Copy link
Member

saad-ali commented Jan 9, 2019

Let's get it merged and let it bake in master for a week or so and monitor storage tests: https://k8s-testgrid.appspot.com/sig-storage#Summary&width=5

If storage tests look good we can cherry pick.

// Don't even try to start an operation if there is already one running
// This check must be done before we do any other checks, as otherwise the other checks
// may pass while at the same time the volume leaves the pending state, resulting in
// double detach attempts
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "") {
klog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
klog.V(5).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

Do we really want to change this verbosity for everyone? This loop is executed every 100ms

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

changed back to 10

}
}

return false
}

func (asw *actualStateOfWorld) GetAttachedVolumes() []AttachedVolume {
func (asw *actualStateOfWorld) GetAllVolumes() []AttachedVolume {

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

Why is this being renamed? It still returns objects in []AttachedVolume.

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

Here we are returning all volumes (the type is still called AttachedVolume) which might be attached or might not. Changing the name is just for clarification

This comment has been minimized.

@saad-ali

saad-ali Jan 11, 2019

Member

Calling code only sees the method name (e.g. GetAllVolumes(...)) not the returned type so in the calling code it becomes confusing when you read GetAllVolumes. Let's leave the name as is, and instead update the interface documentation to mention that AttachedVolumes are attached but in some cases may not be.

This comment has been minimized.

@jingxu97

jingxu97 Jan 11, 2019

Contributor

got it. changed it back.

@@ -185,7 +194,7 @@ type attachedVolume struct {
spec *volume.Spec

// nodesAttachedTo is a map containing the set of nodes this volume has
// successfully been attached to. The key in this map is the name of the
// been trying to be attached to. The key in this map is the name of the

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

just remove trying to be

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

fixed

// status for the node may not be set.
mountedByNodeSetCount uint
// attached indicates that the volume is confirmed to be attached to this node
attached bool

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

how about:

// attachConfirmed indicates that the storage system verified the volume has been attached to this node.
// This value is set to false when an attach or detach operation fails and the volume may be attached or detached.
attachConfirmed bool

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

done

@@ -192,6 +192,8 @@ type ActualStateOfWorldAttacherUpdater interface {
// volumes. See issue 29695.
MarkVolumeAsAttached(volumeName v1.UniqueVolumeName, volumeSpec *volume.Spec, nodeName types.NodeName, devicePath string) error

MarkVolumeAsUncertain(volumeName v1.UniqueVolumeName, volumeSpec *volume.Spec, nodeName types.NodeName) error

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

Add interface documentation (comments).

	// Marks the specified volume as *possibly* attached to the specified node.
	// If an attach or detach operation fails, the attach/detach controller does not know for certain if the volume is attached or not.
	// If the volume name is supplied, that volume name will be used.  If not, the
	// volume name is computed using the result from querying the plugin.
	MarkVolumeAsUncertain(...)

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

fixed

@@ -323,6 +323,12 @@ func (og *operationGenerator) GenerateAttachVolumeFunc(
}

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

DanglingAttachError results in MarkVolumeAsAttached. Do we still want to do MarkVolumeAsUncertain(...) in that case?

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

We don't need to. I modified it. PTAL

if addVolumeNodeErr != nil {
// On failure, return error. Caller will log and retry.
return volumeToAttach.GenerateError("AttachVolume.MarkVolumeAsUncertain failed", addVolumeNodeErr)
}
// On failure, return error. Caller will log and retry.
return volumeToAttach.GenerateError("AttachVolume.Attach failed", attachErr)

This comment has been minimized.

@saad-ali

saad-ali Jan 10, 2019

Member

What about Detach failures? Should they also result in MarkVolumeAsUncertain?

This comment has been minimized.

@jingxu97

jingxu97 Jan 10, 2019

Contributor

I don't think so. If detach failed or timed out, attach/detach controller just does not remove the volume from actual state. We don't need to use MarkVolumeAsUncertain to add it since the volume is already in actual state. Then in the next iteration, attach/detach controller just trigger a detach operation again.

@jingxu97 jingxu97 force-pushed the jingxu97:Oct/uncertain branch from dd3e8a2 to 5277fa5 Jan 10, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 10, 2019

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 10, 2019

@saad-ali , comments are addressed. PTAL. Thanks!

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 10, 2019

/test pull-kubernetes-bazel-test

@jingxu97 jingxu97 force-pushed the jingxu97:Oct/uncertain branch from 5277fa5 to bb2e3d0 Jan 10, 2019

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 11, 2019

/test pull-kubernetes-integration

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 11, 2019

/test pull-kubernetes-verify

Address comments
This commit addressed the comment and also add a unit test.

@jingxu97 jingxu97 force-pushed the jingxu97:Oct/uncertain branch from bb2e3d0 to 7bac6ca Jan 11, 2019

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 11, 2019

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce-100-performance

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 14, 2019

/test pull-kubernetes-local-e2e-containerized

@jingxu97

This comment has been minimized.

Copy link
Contributor

jingxu97 commented Jan 16, 2019

@saad-ali comments are addressed, PTAL. Thanks!

@saad-ali
Copy link
Member

saad-ali left a comment

/lgtm
/approve

Some nits feel free to address in a follow up PR.

// GetAttachedVolumesForNode generates and returns a list of volumes that added to
// the specified node reflecting which volumes are/or might be attached to that node
// based on the current actual state of the world. This function is currently used by
// attach_detach_controller to process VolumeInUse

This comment has been minimized.

@saad-ali

saad-ali Jan 16, 2019

Member

nit:
This function is currently used by attach_detach_controller to process VolumeInUse <- why does that need to be added?

// AddVolumeNode adds the given volume and node to the underlying store.
// If attached is set to true, it indicates the specified volume is already
// attached to the specified node. If attached set to false, it means that
// the volume is not confirmed to be attached to the node yet.

This comment has been minimized.

@saad-ali

saad-ali Jan 16, 2019

Member

nit: Change wording for clarity to:

The attached value indicates the specified volume was confirmed to be attached to the specified node. If an attach operation succeeds this value should be true. If an attach operation fails, and it is unclear if the volume is attached or not attached, this value should be false.

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 16, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jan 16, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97, jsafrane, saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jsafrane

This comment has been minimized.

Copy link
Member

jsafrane commented Jan 17, 2019

/hold cancel
@saad-ali approved the PR and it's 1.14 time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment