
Fix for Invalidation of DeviceMapping Cache when Detaching Volumes #14493

Conversation

BugRoger
Contributor

There is a deviceMapping cache that keeps the state of the block device mappings in local memory. When a device is detached, the cache does not get invalidated. A subsequent attachment of the same volume then falsely assumes that the device is already attached. It skips the actual API call to attach the volume and gets stuck in an endless loop waiting for the attachment to finish. It eventually times out and immediately starts waiting again. This only resolves once the kubelet is restarted.

This fix releases the device from the deviceMapping cache when a volume is detached.

With this fix, volume attach/detach operations work perfectly for me. I'm a bit unsure, though, how such a fundamental bug went unnoticed before. Did this ever work, or is this a regression? It could be that previous tests had the pods scheduled to a different kubelet with a cold cache. Maybe @justinsb has some insight.
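
For context, here is a minimal, self-contained sketch of the failure mode and of the fix in spirit. The names echo the ones used in this thread (getMountDevice, releaseVolume, deviceMappings), but the bodies are simplified illustrations, not the actual code in pkg/cloudprovider/providers/aws/aws.go:

package main

import "fmt"

// instanceCache stands in for the per-instance state the kubelet keeps in
// memory: a map from device suffix ("f" for /dev/xvdf) to EBS volume ID.
type instanceCache struct {
	deviceMappings map[string]string
}

// getMountDevice returns a device suffix for the volume and whether the
// cache believes the volume is already attached; if it does, the caller
// skips the real EC2 AttachVolume API call.
func (c *instanceCache) getMountDevice(volumeID string) (string, bool) {
	for device, vol := range c.deviceMappings {
		if vol == volumeID {
			return device, true // a stale entry means the API call is wrongly skipped
		}
	}
	device := "f" // simplified: pick a free suffix
	c.deviceMappings[device] = volumeID
	return device, false
}

// releaseVolume is the fix in spirit: drop the entry on detach so the next
// attach of the same volume goes through the real API again.
func (c *instanceCache) releaseVolume(volumeID string) {
	for device, vol := range c.deviceMappings {
		if vol == volumeID {
			delete(c.deviceMappings, device)
		}
	}
}

func main() {
	c := &instanceCache{deviceMappings: map[string]string{}}
	_, attached := c.getMountDevice("vol-abcd1234")
	fmt.Println(attached) // false: the first attach issues the API call

	c.releaseVolume("vol-abcd1234") // without this, the next call reports true
	_, attached = c.getMountDevice("vol-abcd1234")
	fmt.Println(attached) // false: detaching released the cache entry
}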

@k8s-bot

k8s-bot commented Sep 24, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 24, 2015
@k8s-github-robot

Labelling this PR as size/S

@justinsb
Member

Ouch - I think you are 100% right @BugRoger. It would be great to have an integration test that detects this, but I think that getting this into 1.1 is more important. (And likely cherry-picking to 1.0 also).

I think there's no real reason for it to be missing other than my mistake. Sorry.

The handling of the case where there is a timeout is tricky. It might be that the volume does eventually detach, or it might be that it never detaches. I think it would also be cleaner if we could reuse releaseMountDevice. But I think getting this fix in is more important. If you have time I would like to see:

  • a TODO comment to consider fully the timeout case
  • a TODO comment to reuse releaseMountDevice
  • a log-message if we call releaseVolume with a volume that isn't in the map (similar to releaseMountDevice)

If you don't have time, then no problem, I think this is bad enough that we should merge as is (the code looks great other than the above nits), and I can add those 3 nits later.
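
For illustration, the third nit might look roughly like the sketch below. The struct and field names (awsInstance, mutex, deviceMappings) are assumptions chosen to make the snippet self-contained, mirroring the releaseMountDevice warning pattern rather than quoting the actual change:

package aws

import (
	"sync"

	"github.com/golang/glog"
)

// awsInstance is sketched here only so the snippet compiles on its own; the
// field names are assumptions, not the real definitions.
type awsInstance struct {
	mutex          sync.Mutex
	deviceMappings map[string]string // device suffix -> volume ID
}

// releaseVolume removes the cache entry for a detached volume and, per the
// review nit, logs when asked to release a volume that was never recorded.
func (i *awsInstance) releaseVolume(volumeID string) {
	i.mutex.Lock()
	defer i.mutex.Unlock()

	for device, vol := range i.deviceMappings {
		if vol == volumeID {
			delete(i.deviceMappings, device)
			return
		}
	}
	glog.Warningf("releaseVolume called for %s, but it was not in the device map", volumeID)
}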

Thank you so much for tracking this down & providing a fix @BugRoger

(@smarterclayton - fine to assign this one to me if you'd prefer).

@justinsb justinsb added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/platform/aws labels Sep 27, 2015
@BugRoger
Contributor Author

Alright, I addressed your comments.

In my case (a private EC2-compatible cloud) the timeout of 60s is almost always too short and the detach operation times out. Fortunately, it does recover. When the volume gets attached again while it's still detaching, there's an error/retry loop until it eventually succeeds (not sure it ever gives up). Though not nice, it does work for now.

@BugRoger BugRoger force-pushed the fix_devicemapping_cache_invalidation branch from 09cb8dc to d90925b on September 28, 2015 12:26
@justinsb
Member

Going to close & reopen to try to get shippable to re-run tests.

@justinsb justinsb closed this Sep 28, 2015
@justinsb justinsb reopened this Sep 28, 2015
@k8s-bot

k8s-bot commented Sep 28, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@justinsb
Member

ok to test

@k8s-bot

k8s-bot commented Sep 28, 2015

Unit, integration and GCE e2e build/test failed for commit d90925bdf0fdac8df798a0e65e7cdc906e6bf7ec.

// At this point we are waiting for the volume being detached. This
// releases the volume and invalidates the cache even when there is a timeout.
//
// TODO: A timeout leaves the cache in an inconsitent state. The volume is still
@justinsb
Member

Typo nit: s/inconsitent/inconsistent

@justinsb
Member

I don't know what is wrong with shippable. I suggested a typo nit, which if you fix will have the nice side effect of getting shippable to try again :-)

@k8s-bot

k8s-bot commented Sep 28, 2015

Unit, integration and GCE e2e test build/test passed for commit a383b843358b5f91bb0c8372e70d6a9390b4f215.

@BugRoger
Contributor Author

Unfortunately I found another invalidation bug. The cache is a map from a single character (the device suffix) to the volume ID. As implemented here:

if strings.HasPrefix(mountpoint, "/dev/sd") {
    mountpoint = mountpoint[7:]
}
if strings.HasPrefix(mountpoint, "/dev/xvd") {
    mountpoint = mountpoint[8:]
}
deviceMappings[mountpoint] = orEmpty(blockDevice.EBS.VolumeID)

During the invalidation the whole mount path is passed in:

ec2Device := "/dev/xvd" + mountpoint
defer func() {
    if !attached {
        awsInstance.releaseMountDevice(disk.awsID, ec2Device)
    }
}()

This never works, and in case AttachDisk runs into a timeout, the kubelet ends up in yet another endless loop. The fix is easy, but there is still follow-up trouble here:
https://github.com/BugRoger/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1153
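
A sketch of the key-mismatch fix described here: normalize the device name back to the bare suffix the cache is keyed by before deleting the entry. The helper name is made up for illustration; this is not a quote of the committed code:

package main

import (
	"fmt"
	"strings"
)

// toCacheKey converts "/dev/sdf" or "/dev/xvdf" to the bare suffix "f" that
// deviceMappings is keyed by, matching the prefix-stripping done on insert.
func toCacheKey(device string) string {
	device = strings.TrimPrefix(device, "/dev/sd")
	device = strings.TrimPrefix(device, "/dev/xvd")
	return device
}

func main() {
	deviceMappings := map[string]string{"f": "vol-abcd1234"}

	// Buggy cleanup: the full path never matches a single-letter key.
	delete(deviceMappings, "/dev/xvdf")
	fmt.Println(len(deviceMappings)) // 1: the stale entry survives

	// Fixed cleanup: strip the prefix first, symmetrical with insertion.
	delete(deviceMappings, toCacheKey("/dev/xvdf"))
	fmt.Println(len(deviceMappings)) // 0: the entry is released
}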

I even implemented that TODO. That also didn't fix the problem completely. To be quite honest, I don't know where it still goes wrong, so in my fork I invalidate the whole cache upon timeouts:
BugRoger@b958735

To fix this properly, I feel like we need (better) tests, then a refactor that models the state carefully, maybe with periodic reconciliation. That can go into a separate PR, I hope.

@k8s-bot

k8s-bot commented Sep 29, 2015

Unit, integration and GCE e2e test build/test passed for commit 860321c34cd3ad90f457732172157f3cc8e6cc40.

@justinsb
Member

justinsb commented Oct 5, 2015

This LGTM; going to close & reopen to kick shippable.

@a-robinson
Contributor

Shippable has been incredibly slow today - I'm going to take @justinsb's last comment as an LGTM and merge. I assume that this should also be cherry-picked into 1.1?

@a-robinson a-robinson added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2015
@a-robinson
Contributor

Before merging, would you mind squashing your commits, @BugRoger?

@roberthbailey
Contributor

@FestivalBobcats - it did not make the 1.1.1 release, as it was never cherry picked into the release branch.

@roberthbailey
Contributor

I just created a cherry pick of this PR into the 1.1 release branch (#18559). Once that gets merged, this should be resolved with the next patch release (1.1.4).

@iameli

iameli commented Jan 1, 2016

If I get a pod/node into this state, is there an easy workaround I can use until 1.1.4 drops?

@BugRoger
Contributor Author

BugRoger commented Jan 3, 2016

If I get a pod/node into this state, is there an easy workaround I can use until 1.1.4 drops?

Restart the Kubelet 😦

brendandburns added a commit that referenced this pull request Jan 9, 2016
…-#14493-upstream-release-1.1

Automated cherry pick of #14493
@actionshrimp

Just upgraded to 1.1.4 and still getting this issue - had a look in the release and it looks like the PR didn't make it in?

@antoineco
Contributor

It should have. Did you build from the 1.1 release branch or are you using the beta?

@actionshrimp

Thanks for the quick reply. I looked at the source in the 1.1.4 release bundle here https://github.com/kubernetes/kubernetes/releases/tag/v1.1.4, and I'm using the binaries hosted on here: https://storage.googleapis.com/kubernetes-release/release/v1.1.4/bin/

@FestivalBobcats

I am also still seeing this issue on 1.1.4. It seems I can't ever re-attach the same volume to a node.

My team had been using 1.2.0-alpha3, and we had a frequent "/dev/xvdf is already in use" error. We thought this thread was relevant, but after building and deploying 1.1.4, we're now thinking that was a different error entirely.

On 1.1.4, we can't restart a pod with attached EBS without getting:

Error syncing pod, skipping: Timeout waiting for volume state

We're currently snapshotting and re-creating volumes as a workaround.

EDIT: Actually, it turns out we compiled without #18559. Testing that out now.

@FestivalBobcats

Correction: I'm completely wrong. Didn't build the right tag.

@antoineco
Contributor

(k8s 1.1.4)
This did not fix #15073, unfortunately; I'm still in the same situation as before: detached volumes are not attaching.

aws.go:1018] Waiting for volume state: actual=detached, desired=attached
...

After restarting kubelet:

kubelet[28901]: I0115 11:03:20.761060   28901 aws.go:898] Assigned mount device g -> volume vol-abcd1234
kubelet[28901]: I0115 11:03:21.131251   28901 aws.go:1132] AttachVolume request returned %v{
kubelet[28901]: AttachTime: 2016-01-15 11:03:21.117 +0000 UTC,
kubelet[28901]: Device: "/dev/xvdg",
kubelet[28901]: InstanceId: "i-abcd1234",
kubelet[28901]: State: "attaching",
kubelet[28901]: VolumeId: "vol-abcd1234"
kubelet[28901]: }

@roberthbailey
Contributor

@antoineco This wasn't cherry picked in time to catch the 1.1.4 release. You should see it fixed in the (pending) 1.1.7 release.

@stepanstipl

I still seem to have this (or a very similar) issue in 1.1.7. When a pod with EBS-backed block storage is recreated, it never succeeds in attaching the EBS volume:

FirstSeen     LastSeen        Count   From                                                    SubobjectPath   Reason          Message
  ─────────     ────────        ─────   ────                                                    ─────────────   ──────          ───────
  22m           22m             1       {scheduler }                                                            Scheduled       Successfully assigned monitoring-influxdb-v2-1-0q6zy to ip-172-20-217-186.eu-west-1.compute.internal
  20m           20m             1       {kubelet ip-172-20-217-186.eu-west-1.compute.internal}                  FailedMount     Unable to mount volumes for pod "monitoring-influxdb-v2-1-0q6zy_kube-system": Timeout waiting for volume state
  20m           20m             1       {kubelet ip-172-20-217-186.eu-west-1.compute.internal}                  FailedSync      Error syncing pod, skipping: Timeout waiting for volume state
  22m           5s              131     {kubelet ip-172-20-217-186.eu-west-1.compute.internal}                  FailedMount     Unable to mount volumes for pod "monitoring-influxdb-v2-1-0q6zy_kube-system": Error attaching EBS volume: VolumeInUse: vol-9b981f59 is already attached to an instance status code: 400, request id:
  22m           5s              131     {kubelet ip-172-20-217-186.eu-west-1.compute.internal}          FailedSync      Error syncing pod, skipping: Error attaching EBS volume: VolumeInUse: vol-9b981f59 is already attached to an instance status code: 400, request id:

@Morriz

Morriz commented Feb 26, 2016

I have tested this to still be an issue all the way up to 1.2.0-alpha.8. Can we get some more eyeballs on this?

@Morriz

Morriz commented Feb 26, 2016

I should be more specific: I don't get more than the timeout message:

FirstSeen   LastSeen    Count   From                            SubobjectPath   Reason      Message
  ─────────   ────────    ───── ────                            ───────────── ──────      ───────
  11m       11m     1   {scheduler }                                Scheduled   Successfully assigned grafana-3n68f to ip-10-0-0-168.eu-central-1.compute.internal
  10m       28s     11  {kubelet ip-10-0-0-168.eu-central-1.compute.internal}           FailedMount Unable to mount volumes for pod "grafana-3n68f_kube-system": Timeout waiting for volume state
  10m       28s     11  {kubelet ip-10-0-0-168.eu-central-1.compute.internal}           FailedSync  Error syncing pod, skipping: Timeout waiting for volume state

@justinsb
Member

@Morriz can you post your pod description please? I don't think the out-of-the-box grafana pod uses an EBS volume...

kubectl describe pod --namespace=kube-system grafana-3n68f_kube-system -ojson

@Morriz

Morriz commented Feb 27, 2016

Sorry, I am in a constant state of flux with the stack... trying to find the magic fit between components and their versions... so no backtracking there for now. Maybe later.

@stepanstipl

@justinsb What is interesting is that even if you delete and recreate the rc, and delete and recreate the actual EBS volume (so you end up with both a different pod and a different EBS volume ID), it will still keep failing. I should be able to post a description of an actual failing pod here tomorrow or on Monday.

In case it's useful in the meantime: the only difference compared to the out-of-the-box grafana is in the rc definition, where the one I'm using adds:

        volumeMounts:
        - name: secret-volume
          readOnly: true
          mountPath: /secrets
        - name: grafana-persistent-storage
          mountPath: /var/lib/grafana
      volumes:
      - name: secret-volume
        secret:
          secretName: monitoring-grafana
      - name: grafana-persistent-storage
        awsElasticBlockStore:
          volumeID: "aws://$AWS_ZONE/$AWS_EBS_ID"
          fsType: ext4

But although it coincidentally also happens with a grafana pod in this case, I have had this issue with other pods using EBS volumes and different images - for example influxdb or elasticsearch.

@Morriz

Morriz commented Feb 28, 2016

@stepanstipl: that grafana EBS volume ID has the old format; it is now just the volume ID (vol-1234567). Hope that was not causing the mount failure :)

@Morriz

Morriz commented Feb 28, 2016

@justinsb: I had only changed the default emptyDir storage to an EBS volume. Now trying to see under what conditions the mounts fail...

@james-thimont-bcgdv

@justinsb I have the same problem on v1.1.8.
Let me know if you need any more info

    Image ID:       
    QoS Tier:
      cpu:  Burstable
    Limits:
      cpu:  250m
    Requests:
      cpu:      150m
    State:      Waiting
      Reason:       ContainerCreating
    Ready:      False
    Restart Count:  0
    Environment Variables:
Conditions:
  Type      Status
  Ready     False 
Volumes:
  jenkins:
    Type:   AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://eu-west-1b/vol-3ca8ddf2
    FSType: ext4
    Partition:  0
    ReadOnly:   false
  ssl:
    Type:   Secret (a secret that should populate this volume)
    SecretName: jenkins-ssl
  default-token-bd5uj:
    Type:   Secret (a secret that should populate this volume)
    SecretName: default-token-bd5uj
Events:
  FirstSeen LastSeen    Count   From                            SubobjectPath   Reason      Message
  ─────────   ────────    ───── ────                            ───────────── ──────      ───────
  11m       11m     1   {scheduler }                                Scheduled   Successfully assigned jenkins-leader-i9kk1 to ip-10-0-0-94.eu-west-1.compute.internal
  9m        27s     10  {kubelet ip-10-0-0-94.eu-west-1.compute.internal}           FailedMount Unable to mount volumes for pod "jenkins-leader-i9kk1_default": Timeout waiting for volume state
  9m        27s     10  {kubelet ip-10-0-0-94.eu-west-1.compute.internal}           FailedSync  Error syncing pod, skipping: Timeout waiting for volume state

@Morriz

Morriz commented Mar 24, 2016

James, you are using the old volume ID notation, which was superseded by just the AWS notation ('vol-1234abcd'), as is reflected in the Kubernetes documentation as well.


shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ry-pick-of-#14493-upstream-release-1.1

Automated cherry pick of kubernetes#14493
shouhong pushed a commit to shouhong/kubernetes that referenced this pull request Feb 14, 2017
…ry-pick-of-#14493-upstream-release-1.1

Automated cherry pick of kubernetes#14493