
Rolling update not possible with a Persistent Volume on Azure. #52236

Closed
rtyler opened this issue Sep 9, 2017 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


rtyler commented Sep 9, 2017

/kind bug

What happened:

On an existing cluster with existing workloads, I used the Azure command line tool (az) to add a new agent machine to the cluster. Some time later I performed a rolling-update on a pod, which never completed.

What you expected to happen:

I expected the rolling-update to complete (obviously) and to successfully move the containers from the first agent to the newly created one.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy an Azure Container Service cluster (or any Azure Kubernetes cluster I presume), with 1 master and 1 agent.
  2. Load it up with a number of Replication Controllers which require Persistent Volumes (a minimal example manifest is sketched after these steps)
  3. Add another agent to the cluster (e.g. az acs scale -n my-cluster-name -g my-resource-group --new-agent-count 2)
  4. Perform a rolling-update of a pod which uses a Persistent Volume (e.g. kubectl rolling-update my-persistent-pod)
  5. Watch the controller's logs for items like:
I0909 16:07:26.840435       1 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"jenkins-codevalet-101cd79135bd4d881bcc0e86d1737f61-nffx5", UID:"8669a2e6-9578-11e7-b1c2-000d3a01b272", APIVersion:"v1", ResourceVersion:"4692710", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Failed to attach volume "pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272" on node "k8s-agent-5a6192fc-1" with: Attach volume "codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd" to instance "k8s-agent-5A6192FC-1" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="AcquireDiskLeaseFailed" Message="Failed to acquire lease while creating disk 'codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd' using blob with URI https://00oxk5tljgfa3lcagnt0.blob.core.windows.net/vhds/codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd. Blob is already in use."
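For reference, the kind of manifest I mean in step 2 looks roughly like the sketch below. It is only illustrative: the names, the container image, and the storageClassName "default" are placeholders, not the exact values from my cluster.

  # Hypothetical workload for step 2: a Replication Controller whose single pod
  # mounts a dynamically provisioned Azure disk PV. All names here are examples.
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: my-persistent-data
  spec:
    accessModes: ["ReadWriteOnce"]   # Azure disk PVs are RWO: attached to one node at a time
    storageClassName: default        # assumed dynamic-provisioning class on the ACS cluster
    resources:
      requests:
        storage: 10Gi
  ---
  apiVersion: v1
  kind: ReplicationController
  metadata:
    name: my-persistent-pod
  spec:
    replicas: 1
    selector:
      app: my-persistent-pod
    template:
      metadata:
        labels:
          app: my-persistent-pod
      spec:
        containers:
        - name: app
          image: jenkins/jenkins:lts   # example image; any stateful workload will do
          volumeMounts:
          - name: data
            mountPath: /var/jenkins_home
        volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-persistent-data

With replicas: 1 and an RWO Azure disk, the pod and the lease on its backing VHD blob are effectively pinned to whichever agent VM currently has the disk attached.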

Anything else we need to know?:

The gist of what I understand to be happening here is that the Page Blobs in the Azure Storage Account are held in a Leased state by the machine agent-0, because pod-alpha is using a Persistent Volume backed by one of them.

When the rolling-update occurs, Kubernetes appropriately tries to provision pod-beta on agent-1, which at this point is not running anything. Unfortunately, because the machine agent-0 still holds the Lease on the Page Blob in Azure Storage, the Page Blob cannot be properly attached to the new VM.
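If you want to confirm the lease directly, something along these lines should show it (a sketch; the account, container, and blob names are taken from the event above, authentication flags are omitted, and the exact output shape may vary by az version):

  # Inspect the lease held on the VHD page blob backing the PV
  az storage blob show \
    --account-name 00oxk5tljgfa3lcagnt0 \
    --container-name vhds \
    --name codevalet-prod-k8s-master-dynamic-pvc-b0fef60a-8898-11e7-b1c2-000d3a01b272.vhd \
    --query properties.lease
  # While agent-0 has the disk attached, the blob's lease shows up as locked/leased,
  # which is why the attach on agent-1 fails with AcquireDiskLeaseFailed.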

Work-around

The work-around I applied, which seemed to work, was to fully delete the Replication Controller (not the Persistent Volume), which released the Lease on the Page Blob. Then I recreated the Replication Controller, which caused the Page Blob to be leased by agent-1.

That seemed to work, but incurred a few unacceptable minutes of downtime 🙁
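In kubectl terms, the work-around was roughly the following (the RC name and the manifest path are placeholders, not my real ones):

  # Delete the Replication Controller and its pod, but not the PVC/PV;
  # once the pod is gone, the Azure lease on the page blob is released.
  kubectl delete rc my-persistent-pod

  # Recreate the Replication Controller; the new pod is scheduled onto agent-1
  # and the page blob is leased/attached there instead.
  kubectl create -f my-persistent-pod-rc.yaml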

Environment:

This environment was deployed on Azure via the Azure Container Service.

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

FYI @kris-nova

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 9, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 9, 2017

rtyler commented Sep 9, 2017

For what it's worth, I'm not sure what kind of clean fix can be accomplished here.

An outage is effectively required in order to detach the Page Blob from one VM, and allow it to be re-attached to a new VM in Azure.

Perhaps the ideal fix would be for Azure to fix their crusty old Storage system 😈


rtyler commented Sep 9, 2017

Homing in on this a bit more, I believe this isn't limited to the case where a new agent VM is added to the cluster. I believe that any rolling-update command which leads Kubernetes to relocate a Replication Controller's pods from one agent VM to another VM in the cluster will fail on Azure, if that Replication Controller has an associated Persistent Volume.

itowlson (Contributor) commented:

/sig azure

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 10, 2017

andyzhangx commented Oct 9, 2017

@rtyler Azure disk only supports ReadWriteOnce (RWO), which means only one node can mount a given Azure disk PV at a time. You could use Azure file, which supports RWX. See #26567 for more details.
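A rough sketch of what that looks like (the StorageClass name and parameters here are only an example, and the class must exist in your cluster):

  kind: StorageClass
  apiVersion: storage.k8s.io/v1
  metadata:
    name: azurefile
  provisioner: kubernetes.io/azure-file
  parameters:
    skuName: Standard_LRS
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: my-shared-data
  spec:
    accessModes: ["ReadWriteMany"]   # Azure file supports RWX, so pods on different nodes can mount it
    storageClassName: azurefile
    resources:
      requests:
        storage: 5Gi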


jdumars commented Dec 26, 2017

Closing as resolved.
