
fix: detach azure disk issue using dangling error #81266

Merged
merged 2 commits into kubernetes:master on Aug 20, 2019

Conversation

@andyzhangx
Member

commented Aug 11, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
fix: detach azure disk issue using dangling error

I think this PR fixes most of the disk attach issues caused by a disk detach that did not succeed (which can happen for many reasons). With this PR the recovery happens automatically; I will cherry-pick the fix to all old releases.

Actually this PR fixed two issues:

  • disk not found issue when detaching an azure disk
    This is fixed by comparing disk URIs case-insensitively (strings.EqualFold). Without this PR, the error logs look like:
azure_controller_standard.go:134] detach azure disk: disk  not found, diskURI: /subscriptions/xxx/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/xxx-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db
  • azure disk could not be attached to another node since disk is still attached to the original node
    This is fixed by returning a dangling error. After the fix, successful event logs look like:
Events:
  Type     Reason                  Age                    From                               Message
  ----     ------                  ----                   ----                               -------
  Normal   Scheduled               <unknown>              default-scheduler                  Successfully assigned default/nginx-azuredisk to k8s-agentpool-32686255-1
  Warning  FailedAttachVolume      2m46s (x7 over 3m18s)  attachdetach-controller            AttachVolume.Attach failed for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" : disk(/subscriptions/xxx/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/xxx-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db) already attached to node(k8s-agentpool-32686255-0), could not be attached to node(k8s-agentpool-32686255-1)
  Normal   SuccessfulAttachVolume  2m4s                   attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db"
  Warning  FailedMount             75s                    kubelet, k8s-agentpool-32686255-1  Unable to attach or mount volumes: unmounted volumes=[disk01], unattached volumes=[disk01 default-token-xw626]: timed out waiting for the condition
  Normal   Pulling                 59s                    kubelet, k8s-agentpool-32686255-1  Pulling image "nginx"
  Normal   Pulled                  59s                    kubelet, k8s-agentpool-32686255-1  Successfully pulled image "nginx"
  Normal   Created                 59s                    kubelet, k8s-agentpool-32686255-1  Created container nginx-azuredisk
  Normal   Started                 59s                    kubelet, k8s-agentpool-32686255-1  Started container nginx-azuredisk
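
The two fixes above can be sketched in simplified form. The following is illustrative Go only: diskURIMatches, attachDisk, and the DanglingAttachError type are stand-ins, not the real azure cloud-provider code (the actual PR uses strings.EqualFold for the URI comparison and volerr.NewDanglingError for the dangling case):

```go
package main

import (
	"fmt"
	"strings"
)

// DanglingAttachError is a simplified stand-in for the dangling error
// type in the cloud-provider volume errors package.
type DanglingAttachError struct {
	msg         string
	CurrentNode string
}

func (e *DanglingAttachError) Error() string { return e.msg }

// diskURIMatches compares two Azure disk URIs with strings.EqualFold.
// Azure resource IDs are not case-sensitive, so a plain == comparison
// can miss an attached disk and produce the "disk not found" error.
func diskURIMatches(a, b string) bool {
	return strings.EqualFold(a, b)
}

// attachDisk sketches the second fix: if the disk is already attached
// to a different node, return a dangling error so the attach/detach
// controller can detach it from the stale node and retry, instead of
// failing with "could not be attached" forever.
func attachDisk(diskURI, targetNode string, attachedTo map[string]string) error {
	for uri, node := range attachedTo {
		if diskURIMatches(uri, diskURI) && node != targetNode {
			return &DanglingAttachError{
				msg:         fmt.Sprintf("disk(%s) already attached to node(%s)", diskURI, node),
				CurrentNode: node,
			}
		}
	}
	return nil // not attached elsewhere; safe to attach
}

func main() {
	attached := map[string]string{
		"/subscriptions/xxx/resourceGroups/rg/providers/Microsoft.Compute/disks/pvc-1": "node-0",
	}
	// A mixed-case URI still matches thanks to the case-insensitive compare.
	err := attachDisk("/subscriptions/XXX/resourceGroups/RG/providers/Microsoft.Compute/disks/PVC-1", "node-1", attached)
	fmt.Println(err)
}
```

Case-insensitive matching alone fixes the first error; the returned dangling error is what lets the controller recover from the second.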

Which issue(s) this PR fixes:

Fixes #81079

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix: detach azure disk issue using dangling error

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/kind bug
/assign @gnufied
/priority important-soon
/sig cloud-provider
/area provider/azure

@andyzhangx

Member Author

commented Aug 11, 2019

hi @gnufied, I followed your guidance from #55491, but this PR does not seem to work (the detach volume operation never happens) even though it reaches volerr.NewDanglingError(attachErr .... Is there something else I have missed?

@gnufied

Member

commented Aug 12, 2019

@andyzhangx just to double-check: the volume will only be detached if another pod comes around and tries to use it. If no pod is attempting to use it, the volume will remain attached to the old node. Is that what happened?

If the volume still does not detach even when another pod tries to use it, that would be rather odd. Can we get the controller-manager logs from when this happened?
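
The pod-driven recovery loop described here can be modeled in a few lines. Everything below is an illustrative simplification, not Kubernetes code: the real logic lives in the attach/detach controller and operation executor, and every type name here is hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// danglingError mirrors the behavior of a dangling attach error:
// it records which node still holds the disk.
type danglingError struct{ currentNode string }

func (e *danglingError) Error() string {
	return fmt.Sprintf("disk already attached to node(%s)", e.currentNode)
}

// cloud tracks which node the disk is attached to ("" means detached).
type cloud struct{ attachedNode string }

func (c *cloud) Attach(node string) error {
	if c.attachedNode != "" && c.attachedNode != node {
		return &danglingError{currentNode: c.attachedNode}
	}
	c.attachedNode = node
	return nil
}

func (c *cloud) Detach(node string) {
	if c.attachedNode == node {
		c.attachedNode = ""
	}
}

// tryAttach models one reconcile pass: on a dangling error the disk is
// detached from the stale node, and the attach is retried on a later pass.
func tryAttach(c *cloud, target string) error {
	err := c.Attach(target)
	var derr *danglingError
	if errors.As(err, &derr) {
		c.Detach(derr.currentNode)
	}
	return err
}

func main() {
	c := &cloud{attachedNode: "node-0"} // disk left attached to node-0
	fmt.Println(tryAttach(c, "node-1")) // pass 1: dangling error, detach from node-0
	fmt.Println(tryAttach(c, "node-1")) // pass 2: attach to node-1 succeeds
}
```

Note how this matches the trigger condition above: nothing happens until tryAttach is called for the new node, i.e. until some pod actually needs the disk elsewhere.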

@andyzhangx

Member Author

commented Aug 13, 2019

I used the following steps to "repro" this issue:

  1. attach the disk to node#A manually
  2. create a pod with a PVC mount that uses that disk; the pod runs on node#B

So the disk would be detached from node#A, right? @gnufied

@gnufied

Member

commented Aug 13, 2019

@andyzhangx I think so, yeah. Can you post controller-manager logs with v4 verbosity?

@andyzhangx

Member Author

commented Aug 13, 2019

@gnufied here is the relevant logging; the detach volume operation is skipped:

I0813 05:33:22.709564       1 reconciler.go:200] Cannot detach volume because it is still mounted for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" (UniqueName: "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db") on node "k8s-agentpool-32686255-0"

It's due to this check in the reconciler:

if attachedVolume.MountedByNode && !timeout {
	klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Cannot detach volume because it is still mounted", ""))
	continue
}

At this time, the volume is in use by node#B.

And if I hack the above code by commenting out the continue statement, it goes into:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" from node "k8s-agentpool-32686255-0" with: volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" does not exist in volumesToReportAsAttached list or node "k8s-agentpool-32686255-0" does not exist in nodesToUpdateStatusFor list

And you could find all controller manager logs here: https://gist.githubusercontent.com/andyzhangx/6d6460ad8802f78016b0b4cbce292399/raw/3ea7142feadaf8f7bb9da7c7e75016ce7559771d/controller-manager.log

@gnufied

Member

commented Aug 13, 2019

okay, so:

I0813 05:33:22.709564       1 reconciler.go:200] Cannot detach volume because it is still mounted for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" (UniqueName: "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db") on node "k8s-agentpool-32686255-0"

This condition will automatically go away after maxWaitForUnmountDuration has expired, which is fine, since we are going to detach a volume that is in an "uncertain" state anyway.

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" from node "k8s-agentpool-32686255-0" with: volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" does not exist in volumesToReportAsAttached list or node "k8s-agentpool-32686255-0" does not exist in nodesToUpdateStatusFor list

I think this is a bug you have discovered. When I made the #78595 change to add dangling volumes as uncertain, we stopped reporting the volume in the node's status, and hence it is failing. But this looks like a general problem with detaching volumes that are in an uncertain state; I will open a PR to fix it. This also needs better test coverage to prevent regressions.

@andyzhangx andyzhangx force-pushed the andyzhangx:fix-detach-azuredisk-issue branch from 002c072 to 10a7eae Aug 14, 2019
@gnufied

Member

commented Aug 15, 2019

@andyzhangx actually, part of my comment was not true. The following error log:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume oes not exist in nodesToUpdateStatusFor list

does not stop the volume from being detached - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208 - the volume should still be detached.

andyzhangx added 2 commits Aug 11, 2019
fix: unit test failure
@andyzhangx andyzhangx force-pushed the andyzhangx:fix-detach-azuredisk-issue branch from 224133b to 1fe12be Aug 18, 2019
@andyzhangx andyzhangx changed the title [WIP]fix: detach azure disk issue using dangling error fix: detach azure disk issue using dangling error Aug 18, 2019
@andyzhangx

Member Author

commented Aug 18, 2019

@andyzhangx actually, part of my comment was not true. The following error log:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume oes not exist in nodesToUpdateStatusFor list

does not stop the volume from being detached - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208 - the volume should still be detached.

@gnufied thanks a lot for your help! I finally figured out that this was a separate issue in detaching the disk: the detach disk operation does happen, but it could not find the disk in the list because of the case-sensitive URI comparison. This PR fixes both issues.

So this PR is ready for code review.

cc @MSSedusch @dkistner @vlerenc
/assign @khenidak @feiskyer

Member

left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 19, 2019
@k8s-ci-robot

Contributor

commented Aug 19, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot


commented Aug 19, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@andyzhangx

Member Author

commented Aug 19, 2019

/hold
holding for one more day so anyone else can review or comment, thanks

@andyzhangx

Member Author

commented Aug 20, 2019

/hold cancel
ready to merge, thanks

@k8s-ci-robot k8s-ci-robot merged commit 1719ce7 into kubernetes:master Aug 20, 2019
23 checks passed

cla/linuxfoundation andyzhangx authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 20, 2019
@MSSedusch


commented Aug 20, 2019

@andyzhangx Can you cherry pick that fix to 1.15?

@andyzhangx

Member Author

commented Aug 21, 2019

@andyzhangx Can you cherry pick that fix to 1.15?

it has been cherry-picked to 1.13, 1.14, and 1.15

k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.13

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.15

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.14

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
@andyzhangx

Member Author

commented Aug 21, 2019

below are the fixed versions:

k8s version | fixed version
v1.12       | no fix
v1.13       | 1.13.11
v1.14       | 1.14.7
v1.15       | 1.15.4
v1.16       | 1.16.0