
Rolling update not possible with a Persistent Volume on Azure. #52236

Closed
rtyler opened this issue Sep 9, 2017 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


rtyler commented Sep 9, 2017

/kind bug

What happened:

On an existing cluster with existing workloads, I used the Azure command line tool (az) to add a new agent machine to the cluster. Some time later I performed a rolling-update on a pod, which never completed.

What you expected to happen:

I expected the rolling-update to complete (obviously) and to successfully move the containers from the first agent to the newly created one.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy an Azure Container Service cluster (or any Azure Kubernetes cluster I presume), with 1 master and 1 agent.
  2. Load it up with a number of Replication Controllers which require Persistent Volumes (a minimal example manifest is sketched after these steps)
  3. Add another agent to the cluster (e.g. az acs scale -n my-cluster-name -g my-resource-group --new-agent-count 2)
  4. Perform a rolling-update of a pod which uses a Persistent Volume (e.g. kubectl rolling-update my-persistent-pod)
  5. Watch the controller's logs for items like:
I0909 16:07:26.840435       1 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"jenkins-codevalet-101cd79135bd4d881bcc0e86d1737f61-nffx5", UID:"8669a2e6-9578-11e7-b1c2-000d3a01b272", APIVersion:"v1", ResourceVersion:"4692710", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Failed to attach volume "pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272" on node "k8s-agent-5a6192fc-1" with: Attach volume "codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd" to instance "k8s-agent-5A6192FC-1" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="AcquireDiskLeaseFailed" Message="Failed to acquire lease while creating disk 'codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd' using blob with URI https://00oxk5tljgfa3lcagnt0.blob.core.windows.net/vhds/codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd. Blob is already in use."
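For reference, the kind of manifest I mean in step 2 looks roughly like the sketch below. It is only illustrative: the names, the container image, and the storageClassName "default" are placeholders, not the exact values from my cluster.

  # Hypothetical workload for step 2: a Replication Controller whose single pod
  # mounts a dynamically provisioned Azure disk PV. All names here are examples.
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: my-persistent-data
  spec:
    accessModes: ["ReadWriteOnce"]   # Azure disk PVs are RWO: attached to one node at a time
    storageClassName: default        # assumed dynamic-provisioning class on the ACS cluster
    resources:
      requests:
        storage: 10Gi
  ---
  apiVersion: v1
  kind: ReplicationController
  metadata:
    name: my-persistent-pod
  spec:
    replicas: 1
    selector:
      app: my-persistent-pod
    template:
      metadata:
        labels:
          app: my-persistent-pod
      spec:
        containers:
        - name: app
          image: jenkins/jenkins:lts   # example image; any stateful workload will do
          volumeMounts:
          - name: data
            mountPath: /var/jenkins_home
        volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-persistent-data

With replicas: 1 and an RWO Azure disk, the pod and the lease on its backing VHD blob are effectively pinned to whichever agent VM currently has the disk attached.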

Anything else we need to know?:

The gist of what I understand to be happening here is that the Page Blobs in the Azure Storage Account are held in a Leased state by the machine agent-0, because pod-alpha is using a Persistent Volume backed by one of them.

When the rolling-update occurs, Kubernetes appropriately tries to provision pod-beta on agent-1, which at this point is not running anything. Unfortunately, because the machine agent-0 still holds the Lease on the Page Blob in Azure Storage, the Page Blob cannot be properly attached to the new VM.
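If you want to confirm the lease directly, something along these lines should show it (a sketch; the account, container, and blob names are taken from the event above, authentication flags are omitted, and the exact output shape may vary by az version):

  # Inspect the lease held on the VHD page blob backing the PV
  az storage blob show \
    --account-name 00oxk5tljgfa3lcagnt0 \
    --container-name vhds \
    --name codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd \
    --query properties.lease
  # While agent-0 has the disk attached, the blob's lease shows up as locked/leased,
  # which is why the attach on agent-1 fails with AcquireDiskLeaseFailed.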

Work-around

The work-around I applied, which seemed to work, was to fully delete the Replication Controller (not the Persistent Volume), which released the Lease on the Page Blob. Then I recreated the Replication Controller, which caused the Page Blob to be leased by agent-1.

That seemed to work, but incurred a few unacceptable minutes of downtime 🙁
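In kubectl terms, the work-around was roughly the following (the RC name and the manifest path are placeholders, not my real ones):

  # Delete the Replication Controller and its pod, but not the PVC/PV;
  # once the pod is gone, the Azure lease on the page blob is released.
  kubectl delete rc my-persistent-pod

  # Recreate the Replication Controller; the new pod is scheduled onto agent-1
  # and the page blob is leased/attached there instead.
  kubectl create -f my-persistent-pod-rc.yaml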

Environment:

This environment was deployed on Azure via the Azure Container Service.

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

FYI @kris-nova

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 9, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 9, 2017

rtyler commented Sep 9, 2017

For what it's worth, I'm not sure what kind of clean fix can be accomplished here.

An outage is effectively required in order to detach the Page Blob from one VM, and allow it to be re-attached to a new VM in Azure.

Perhaps the ideal fix would be for Azure to fix their crusty old Storage system 😈


rtyler commented Sep 9, 2017

Homing in on this a bit more, I believe this isn't limited to the case where a new agent VM is added to the cluster. I believe that any rolling-update command which leads Kubernetes to relocate a Replication Controller's pods from one agent VM to another VM in the cluster will fail on Azure, if that Replication Controller has an associated Persistent Volume.

itowlson (Contributor) commented:

/sig azure

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 10, 2017

andyzhangx commented Oct 9, 2017

@rtyler Azure disk only supports ReadWriteOnce (RWO), which means only one node can mount a given Azure disk PV at a time. You could use Azure file, which supports RWX. See #26567 for more details.
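A rough sketch of what that looks like (the StorageClass name and parameters here are only an example, and the class must exist in your cluster):

  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: azurefile
  provisioner: kubernetes.io/azure-file
  parameters:
    skuName: Standard_LRS
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: my-shared-data
  spec:
    accessModes: ["ReadWriteMany"]   # Azure file supports RWX, so pods on different nodes can mount it
    storageClassName: azurefile
    resources:
      requests:
        storage: 5Gi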


jdumars commented Dec 26, 2017

Closing as resolved.
