
fix device name change issue for azure disk #60346

Merged
merged 1 commit on Feb 25, 2018

Conversation


@andyzhangx (Member) commented Feb 24, 2018

What this PR does / why we need it:
Fix the device name change issue for Azure disks: since v1.7 the default host cache setting changed from None to ReadWrite, while the default host cache setting in the Azure portal is None.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60344, #57444
also fixes following issues:
Azure/acs-engine#1918
Azure/AKS#201

Special notes for your reviewer:
Since v1.7, the default host cache setting changed from None to ReadWrite. This can lead to device names changing after attaching multiple disks to an Azure VM, ultimately leaving a disk inaccessible from the pod.
For example:
a StatefulSet with 8 replicas (each with an Azure disk) on one node will always fail. According to my observation, adding the 6th data disk always triggers a device name change, after which some pods can no longer access their data disks.
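The change itself is tiny (size/XS): the volume plugin's default caching mode falls back to None instead of ReadWrite. Below is a minimal standalone sketch of that defaulting logic; the type and names are stubbed so it compiles on its own, and should not be read as the exact upstream azure_dd code:

```go
package main

import "fmt"

// AzureDataDiskCachingMode stubs the k8s.io/api/core/v1 type so this
// sketch is self-contained.
type AzureDataDiskCachingMode string

const (
	CachingNone      AzureDataDiskCachingMode = "None"
	CachingReadOnly  AzureDataDiskCachingMode = "ReadOnly"
	CachingReadWrite AzureDataDiskCachingMode = "ReadWrite"

	// The essence of this PR: default to None (matching the Azure portal
	// default) rather than ReadWrite.
	defaultCachingMode = CachingNone
)

// normalizeCachingMode applies the default when the user left cachingmode
// empty, and passes explicit, supported values through unchanged.
func normalizeCachingMode(mode AzureDataDiskCachingMode) (AzureDataDiskCachingMode, error) {
	switch mode {
	case "":
		return defaultCachingMode, nil
	case CachingNone, CachingReadOnly, CachingReadWrite:
		return mode, nil
	default:
		return "", fmt.Errorf("azureDisk - %q is not a supported cachingmode", mode)
	}
}

func main() {
	m, _ := normalizeCachingMode("") // no cachingmode set in the StorageClass
	fmt.Println(m)                   // None
}
```

Note that an explicit cachingmode still passes through unchanged, which is why users who set ReadWrite themselves can still hit the bug (see the discussion below).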

I have verified this fix on v1.8.4.
Without this PR, on one node (device names change):

azureuser@k8s-agentpool2-40588258-0:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdk
    ├── lun1 -> ../../../sdj
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi

With this PR, on one node (no device name change):

azureuser@k8s-agentpool2-40588258-1:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
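Side note: the by-LUN symlinks shown above are maintained by the Azure Linux agent's udev rules and stay stable across attach order, unlike the raw /dev/sdX names. A small illustrative sketch of resolving a LUN to its current kernel device (the lun0 path is an example, and this is not the volume plugin's actual lookup path):

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// lun0 is illustrative; substitute the LUN of the attached disk.
	lunLink := "/dev/disk/azure/scsi1/lun0"

	// EvalSymlinks follows the udev-managed link to the current kernel
	// device (e.g. /dev/sdc), which can differ between attaches or boots.
	dev, err := filepath.EvalSymlinks(lunLink)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Printf("%s -> %s\n", lunLink, dev)
}
```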

In the output below, myvm-0 and myvm-1 keep crashing due to the device name change; after replacing the controller manager (with this fix), the myvm2-x pods all run well.

Every 2.0s: kubectl get po        Sat Feb 24 04:16:26 2018

NAME      READY     STATUS             RESTARTS   AGE
myvm-0    0/1       CrashLoopBackOff   13         41m
myvm-1    0/1       CrashLoopBackOff   11         38m
myvm-2    1/1       Running            0          35m
myvm-3    1/1       Running            0          33m
myvm-4    1/1       Running            0          31m
myvm-5    1/1       Running            0          29m
myvm-6    1/1       Running            0          26m

myvm2-0   1/1       Running            0          17m
myvm2-1   1/1       Running            0          14m
myvm2-2   1/1       Running            0          12m
myvm2-3   1/1       Running            0          10m
myvm2-4   1/1       Running            0          8m
myvm2-5   1/1       Running            0          5m
myvm2-6   1/1       Running            0          3m

Release note:

```
fix disk unavailable issue when mounting multiple azure disks due to dev name change
```

/assign @karataliu
/sig azure
@feiskyer could you mark it for the v1.10 milestone?
@brendandburns @khenidak @rootfs @jdumars FYI

Since it's a critical bug, I will cherry-pick this fix to v1.7-v1.9. Note that v1.6 does not have this issue, since its default cachingmode is None.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/azure size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 24, 2018
@feiskyer feiskyer added this to the v1.10 milestone Feb 24, 2018
@feiskyer feiskyer added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 24, 2018
@feiskyer (Member)

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 24, 2018
@feiskyer (Member)

Since v1.7, the default host cache setting changed from None to ReadWrite. This can lead to device names changing after attaching multiple disks to an Azure VM, ultimately leaving a disk inaccessible from the pod.

Does this mean that if a user specifies cachingmode=ReadWrite, the problem still occurs?

@andyzhangx (Member, Author)

@feiskyer Yes, if a customer specifies cachingmode=ReadWrite, the issue can still occur. At the current stage we could tell customers not to do that; the Linux kernel team is still investigating the root cause. By default, though, we need to set cachingmode=None.

@feiskyer (Member)

Is there any performance penalty from changing this? Is there any documentation about it? I would think caching should achieve much better performance than none.

@andyzhangx (Member, Author)

@feiskyer Here is the Azure Disk Caching doc. The best host cache setting depends on the workload, so we should let customers decide the setting according to their actual workload if they really want to enable caching. By default, though, we should set it to None; otherwise the disk can become inaccessible.

Below is the doc text about host cache settings:
When you attach data disks to a VM with standard storage accounts, you have the option to enable caching at the Azure host level, for up to four disks per VM. If you have a light read workload you can enable it, but if you have big file contents (more than a few tens of GB) and/or a write-intensive workload, it is recommended to disable this option.

With premium storage accounts, all disks can have caching enabled, but only OS disks can use Read/Write; data disks can be set only to ReadOnly or None. For write-intensive I/O, it's recommended to set caching to None on the additional data disks.
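For illustration, the quoted guidance could be encoded as a small validation helper. This is a hypothetical sketch; validateDataDiskCaching and its rules come from the doc text above, not from any real Kubernetes or Azure API:

```go
package main

import "fmt"

// validateDataDiskCaching is a hypothetical helper encoding the quoted
// guidance: premium data disks may only use ReadOnly or None host caching.
func validateDataDiskCaching(premiumStorage bool, mode string) error {
	switch mode {
	case "None", "ReadOnly":
		return nil
	case "ReadWrite":
		if premiumStorage {
			return fmt.Errorf("premium data disks support only ReadOnly or None host caching")
		}
		return nil
	default:
		return fmt.Errorf("unknown caching mode %q", mode)
	}
}

func main() {
	fmt.Println(validateDataDiskCaching(true, "ReadWrite")) // rejected
	fmt.Println(validateDataDiskCaching(false, "None"))     // <nil>
}
```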

@feiskyer (Member)

@andyzhangx Thanks. Changing the default to None is reasonable.

@feiskyer (Member)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 24, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot (Contributor) commented Feb 25, 2018

@andyzhangx: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-unit | c3e8f68 | link | /test pull-kubernetes-unit |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 60346, 60135, 60289, 59643, 52640). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-cherrypick-bot

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

k8s-github-robot pushed a commit that referenced this pull request Mar 1, 2018
…0346-upstream-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #60346

Cherry pick of #60346 on release-1.7.

#60346: fix device name change issue for azure disk
**Release note**:

```
fix disk unavailable issue when mounting multiple azure disks due to dev name change
```
k8s-github-robot pushed a commit that referenced this pull request Mar 3, 2018
…0346-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60346: fix device name change issue for azure disk

Cherry pick of #60346 on release-1.9.

#60346: fix device name change issue for azure disk
**Release note**:

```
fix disk unavailable issue when mounting multiple azure disks due to dev name change
```
k8s-github-robot pushed a commit that referenced this pull request Mar 23, 2018
…0346-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #60346

Cherry pick of #60346 on release-1.8.

#60346: fix device name change issue for azure disk

**Release note**:

```
fix disk unavailable issue when mounting multiple azure disks due to dev name change
```
@andyzhangx deleted the fix-devname-change branch on May 8, 2018