
fix: detach azure disk issue using dangling error #81266

Merged
merged 2 commits into kubernetes:master on Aug 20, 2019

Conversation

@andyzhangx
Member

commented Aug 11, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
fix: detach azure disk issue using dangling error

I think this PR fixes most of the disk attach issues caused by a disk detach that did not succeed (which can happen for many reasons). With this PR the recovery happens automatically; I will cherry-pick the fix to all old releases.

Actually this PR fixed two issues:

  • disk not found issue when detaching an azure disk
    This is fixed by comparing disk URIs case-insensitively (strings.EqualFold). Without this PR, the error logs look like:
azure_controller_standard.go:134] detach azure disk: disk  not found, diskURI: /subscriptions/xxx/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/xxx-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db
  • azure disk could not be attached to another node since disk is still attached to the original node
    This is fixed by returning a dangling error. After the fix, successful event logs look like:
Events:
  Type     Reason                  Age                    From                               Message
  ----     ------                  ----                   ----                               -------
  Normal   Scheduled               <unknown>              default-scheduler                  Successfully assigned default/nginx-azuredisk to k8s-agentpool-32686255-1
  Warning  FailedAttachVolume      2m46s (x7 over 3m18s)  attachdetach-controller            AttachVolume.Attach failed for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" : disk(/subscriptions/xxx/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/xxx-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db) already attached to node(k8s-agentpool-32686255-0), could not be attached to node(k8s-agentpool-32686255-1)
  Normal   SuccessfulAttachVolume  2m4s                   attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db"
  Warning  FailedMount             75s                    kubelet, k8s-agentpool-32686255-1  Unable to attach or mount volumes: unmounted volumes=[disk01], unattached volumes=[disk01 default-token-xw626]: timed out waiting for the condition
  Normal   Pulling                 59s                    kubelet, k8s-agentpool-32686255-1  Pulling image "nginx"
  Normal   Pulled                  59s                    kubelet, k8s-agentpool-32686255-1  Successfully pulled image "nginx"
  Normal   Created                 59s                    kubelet, k8s-agentpool-32686255-1  Created container nginx-azuredisk
  Normal   Started                 59s                    kubelet, k8s-agentpool-32686255-1  Started container nginx-azuredisk
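
The two fixes above can be sketched in simplified form. The following is illustrative Go only: diskURIMatches, attachDisk, and the DanglingAttachError type are stand-ins, not the real azure cloud-provider code (the actual PR uses strings.EqualFold for the URI comparison and volerr.NewDanglingError for the dangling case):

```go
package main

import (
	"fmt"
	"strings"
)

// DanglingAttachError is a simplified stand-in for the dangling error
// type in the cloud-provider volume errors package.
type DanglingAttachError struct {
	msg         string
	CurrentNode string
}

func (e *DanglingAttachError) Error() string { return e.msg }

// diskURIMatches compares two Azure disk URIs with strings.EqualFold.
// Azure resource IDs are not case-sensitive, so a plain == comparison
// can miss an attached disk and produce the "disk not found" error.
func diskURIMatches(a, b string) bool {
	return strings.EqualFold(a, b)
}

// attachDisk sketches the second fix: if the disk is already attached
// to a different node, return a dangling error so the attach/detach
// controller can detach it from the stale node and retry, instead of
// failing with "could not be attached" forever.
func attachDisk(diskURI, targetNode string, attachedTo map[string]string) error {
	for uri, node := range attachedTo {
		if diskURIMatches(uri, diskURI) && node != targetNode {
			return &DanglingAttachError{
				msg:         fmt.Sprintf("disk(%s) already attached to node(%s)", diskURI, node),
				CurrentNode: node,
			}
		}
	}
	return nil // not attached elsewhere; safe to attach
}

func main() {
	attached := map[string]string{
		"/subscriptions/xxx/resourceGroups/rg/providers/Microsoft.Compute/disks/pvc-1": "node-0",
	}
	// A mixed-case URI still matches thanks to the case-insensitive compare.
	err := attachDisk("/subscriptions/XXX/resourceGroups/RG/providers/Microsoft.Compute/disks/PVC-1", "node-1", attached)
	fmt.Println(err)
}
```

Case-insensitive matching alone fixes the first error; the returned dangling error is what lets the controller recover from the second.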

Which issue(s) this PR fixes:

Fixes #81079

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix: detach azure disk issue using dangling error

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/kind bug
/assign @gnufied
/priority important-soon
/sig cloud-provider
/area provider/azure

@andyzhangx

Member Author

commented Aug 11, 2019

hi @gnufied, I followed your guidance from #55491, but this PR does not seem to work (the detach volume operation never happens) even though it reaches volerr.NewDanglingError(attachErr .... Is there something else I have missed?

@gnufied

Member

commented Aug 12, 2019

@andyzhangx just to double-check: the volume will only be detached if another pod comes around and tries to use it. If no pod is attempting to use it, the volume will remain attached to the old node. Is that what happened?

If the volume still does not detach even when another pod tries to use it, that would be rather odd. Can we get the controller-manager logs from when this happened?
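
The pod-driven recovery loop described here can be modeled in a few lines. Everything below is an illustrative simplification, not Kubernetes code: the real logic lives in the attach/detach controller and operation executor, and every type name here is hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// danglingError mirrors the behavior of a dangling attach error:
// it records which node still holds the disk.
type danglingError struct{ currentNode string }

func (e *danglingError) Error() string {
	return fmt.Sprintf("disk already attached to node(%s)", e.currentNode)
}

// cloud tracks which node the disk is attached to ("" means detached).
type cloud struct{ attachedNode string }

func (c *cloud) Attach(node string) error {
	if c.attachedNode != "" && c.attachedNode != node {
		return &danglingError{currentNode: c.attachedNode}
	}
	c.attachedNode = node
	return nil
}

func (c *cloud) Detach(node string) {
	if c.attachedNode == node {
		c.attachedNode = ""
	}
}

// tryAttach models one reconcile pass: on a dangling error the disk is
// detached from the stale node, and the attach is retried on a later pass.
func tryAttach(c *cloud, target string) error {
	err := c.Attach(target)
	var derr *danglingError
	if errors.As(err, &derr) {
		c.Detach(derr.currentNode)
	}
	return err
}

func main() {
	c := &cloud{attachedNode: "node-0"} // disk left attached to node-0
	fmt.Println(tryAttach(c, "node-1")) // pass 1: dangling error, detach from node-0
	fmt.Println(tryAttach(c, "node-1")) // pass 2: attach to node-1 succeeds
}
```

Note how this matches the trigger condition above: nothing happens until tryAttach is called for the new node, i.e. until some pod actually needs the disk elsewhere.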

@andyzhangx

Member Author

commented Aug 13, 2019

I used the following steps to "repro" this issue:

  1. attach the disk to node#A manually
  2. create a pod with a PVC mount that uses that disk; the pod runs on node#B

So the disk would be detached from node#A, right? @gnufied

@gnufied

Member

commented Aug 13, 2019

@andyzhangx I think so, yeah. Can you post controller-manager logs with v4 verbosity?

@andyzhangx

Member Author

commented Aug 13, 2019

@gnufied here is the relevant logging; the detach volume operation is skipped:

I0813 05:33:22.709564       1 reconciler.go:200] Cannot detach volume because it is still mounted for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" (UniqueName: "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db") on node "k8s-agentpool-32686255-0"

It's due to this check in the reconciler:

if attachedVolume.MountedByNode && !timeout {
	klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Cannot detach volume because it is still mounted", ""))
	continue
}

At this time, the volume is in use by node#B.

And if I hack the above code by commenting out the continue statement, it goes into:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" from node "k8s-agentpool-32686255-0" with: volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" does not exist in volumesToReportAsAttached list or node "k8s-agentpool-32686255-0" does not exist in nodesToUpdateStatusFor list

And you could find all controller manager logs here: https://gist.githubusercontent.com/andyzhangx/6d6460ad8802f78016b0b4cbce292399/raw/3ea7142feadaf8f7bb9da7c7e75016ce7559771d/controller-manager.log

@gnufied

Member

commented Aug 13, 2019

okay, so:

I0813 05:33:22.709564       1 reconciler.go:200] Cannot detach volume because it is still mounted for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" (UniqueName: "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db") on node "k8s-agentpool-32686255-0"

This condition will automatically go away after maxWaitForUnmountDuration has expired, which is fine, since we are going to detach a volume that is in an "uncertain" state anyway.

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" from node "k8s-agentpool-32686255-0" with: volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" does not exist in volumesToReportAsAttached list or node "k8s-agentpool-32686255-0" does not exist in nodesToUpdateStatusFor list

I think this is a bug you have discovered. When I made the #78595 change to add dangling volumes as uncertain, we stopped reporting the volume in the node's status, and hence it is failing. But this looks like a general problem with detaching volumes that are in an uncertain state; I will open a PR to fix it. This also needs better test coverage to prevent regressions.

@andyzhangx andyzhangx force-pushed the andyzhangx:fix-detach-azuredisk-issue branch from 002c072 to 10a7eae Aug 14, 2019
@gnufied

Member

commented Aug 15, 2019

@andyzhangx actually, part of my comment was not true. The following error log:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume oes not exist in nodesToUpdateStatusFor list

does not stop the volume from being detached - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208 - the volume should still be detached.

andyzhangx added 2 commits Aug 11, 2019
fix: unit test failure
@andyzhangx andyzhangx force-pushed the andyzhangx:fix-detach-azuredisk-issue branch from 224133b to 1fe12be Aug 18, 2019
@andyzhangx andyzhangx changed the title [WIP]fix: detach azure disk issue using dangling error fix: detach azure disk issue using dangling error Aug 18, 2019
@andyzhangx

Member Author

commented Aug 18, 2019

@andyzhangx actually, part of my comment was not true. The following error log:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume oes not exist in nodesToUpdateStatusFor list

does not stop the volume from being detached - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208 - the volume should still be detached.

@gnufied thanks a lot for your help! I finally figured out that this was a separate issue in detaching the disk: the detach disk operation does happen, but it could not find the disk in the list because of the case-sensitive URI comparison. This PR fixes both issues.

So this PR is ready for code review.

cc @MSSedusch @dkistner @vlerenc
/assign @khenidak @feiskyer

Member

left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 19, 2019
@k8s-ci-robot

Contributor

commented Aug 19, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot


commented Aug 19, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@andyzhangx

Member Author

commented Aug 19, 2019

/hold
holding for one more day so anyone else can review or comment, thanks

@andyzhangx

Member Author

commented Aug 20, 2019

/hold cancel
ready to merge, thanks

@k8s-ci-robot k8s-ci-robot merged commit 1719ce7 into kubernetes:master Aug 20, 2019
23 checks passed

cla/linuxfoundation andyzhangx authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 20, 2019
@MSSedusch


commented Aug 20, 2019

@andyzhangx Can you cherry pick that fix to 1.15?

@andyzhangx

Member Author

commented Aug 21, 2019

@andyzhangx Can you cherry pick that fix to 1.15?

it has been cherry-picked to 1.13, 1.14, and 1.15

k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.13

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.15

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.14

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
@andyzhangx

Member Author

commented Aug 21, 2019

below are the fixed versions:

k8s version | fixed version
v1.12       | no fix
v1.13       | 1.13.11
v1.14       | 1.14.7
v1.15       | 1.15.4
v1.16       | 1.16.0