
Fix for Invalidation of DeviceMapping Cache when Detaching Volumes #14493

Conversation

BugRoger
Contributor

There is a deviceMapping cache that keeps the state of the block device mappings in local memory. When a device is detached, the cache does not get invalidated. A subsequent attachment of the same volume then falsely assumes that the device is already attached. It skips the actual API call to attach the volume and gets stuck in an endless loop waiting for the attachment to finish. It eventually times out and immediately starts waiting again. This only resolves once the kubelet is restarted.

This fix releases the device from the deviceMapping cache when a volume is detached.

With this fix, volume attach/detach operations work perfectly for me. I'm a bit unsure, though, how such a fundamental bug went unnoticed before. Did this ever work, or is this a regression? It could be that previous tests had the pods scheduled to a different kubelet with a cold cache. Maybe @justinsb has some insight.
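
For context, here is a minimal, self-contained sketch of the failure mode and of the fix in spirit. The names echo the ones used in this thread (getMountDevice, releaseVolume, deviceMappings), but the bodies are simplified illustrations, not the actual code in pkg/cloudprovider/providers/aws/aws.go:

package main

import "fmt"

// instanceCache stands in for the per-instance state the kubelet keeps in
// memory: a map from device suffix ("f" for /dev/xvdf) to EBS volume ID.
type instanceCache struct {
	deviceMappings map[string]string
}

// getMountDevice returns a device suffix for the volume and whether the
// cache believes the volume is already attached; if it does, the caller
// skips the real EC2 AttachVolume API call.
func (c *instanceCache) getMountDevice(volumeID string) (string, bool) {
	for device, vol := range c.deviceMappings {
		if vol == volumeID {
			return device, true // a stale entry means the API call is wrongly skipped
		}
	}
	device := "f" // simplified: pick a free suffix
	c.deviceMappings[device] = volumeID
	return device, false
}

// releaseVolume is the fix in spirit: drop the entry on detach so the next
// attach of the same volume goes through the real API again.
func (c *instanceCache) releaseVolume(volumeID string) {
	for device, vol := range c.deviceMappings {
		if vol == volumeID {
			delete(c.deviceMappings, device)
		}
	}
}

func main() {
	c := &instanceCache{deviceMappings: map[string]string{}}
	_, attached := c.getMountDevice("vol-abcd1234")
	fmt.Println(attached) // false: the first attach issues the API call

	c.releaseVolume("vol-abcd1234") // without this, the next call reports true
	_, attached = c.getMountDevice("vol-abcd1234")
	fmt.Println(attached) // false: detaching released the cache entry
}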

@k8s-bot

k8s-bot commented Sep 24, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 24, 2015
@k8s-github-robot

Labelling this PR as size/S

@justinsb
Member

Ouch - I think you are 100% right @BugRoger. It would be great to have an integration test that detects this, but I think that getting this into 1.1 is more important. (And likely cherry-picking to 1.0 also).

I think there's no real reason for it to be missing other than my mistake. Sorry.

The handling of the case where there is a timeout is tricky. It might be that the volume does eventually detach, or it might be that it never detaches. I think it would also be cleaner if we could reuse releaseMountDevice. But I think getting this fix in is more important. If you have time I would like to see:

  • a TODO comment to consider fully the timeout case
  • a TODO comment to reuse releaseMountDevice
  • a log-message if we call releaseVolume with a volume that isn't in the map (similar to releaseMountDevice)

If you don't have time, then no problem, I think this is bad enough that we should merge as is (the code looks great other than the above nits), and I can add those 3 nits later.
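
For illustration, the third nit might look roughly like the sketch below. The struct and field names (awsInstance, mutex, deviceMappings) are assumptions chosen to make the snippet self-contained, mirroring the releaseMountDevice warning pattern rather than quoting the actual change:

package aws

import (
	"sync"

	"github.com/golang/glog"
)

// awsInstance is sketched here only so the snippet compiles on its own; the
// field names are assumptions, not the real definitions.
type awsInstance struct {
	mutex          sync.Mutex
	deviceMappings map[string]string // device suffix -> volume ID
}

// releaseVolume removes the cache entry for a detached volume and, per the
// review nit, logs when asked to release a volume that was never recorded.
func (i *awsInstance) releaseVolume(volumeID string) {
	i.mutex.Lock()
	defer i.mutex.Unlock()

	for device, vol := range i.deviceMappings {
		if vol == volumeID {
			delete(i.deviceMappings, device)
			return
		}
	}
	glog.Warningf("releaseVolume called for %s, but it was not in the device map", volumeID)
}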

Thank you so much for tracking this down & providing a fix @BugRoger

(@smarterclayton - fine to assign this one to me if you'd prefer).

@justinsb justinsb added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/platform/aws labels Sep 27, 2015
@BugRoger
Contributor Author

Alright, I addressed your comments.

In my case (a private EC2-compatible cloud) the timeout of 60s is almost always too short and the detach operation times out. Fortunately, it does recover. When the volume gets attached again while it's still detaching, there's an error/retry loop until it eventually succeeds (not sure it ever gives up). Though not nice, it does work for now.

@BugRoger BugRoger force-pushed the fix_devicemapping_cache_invalidation branch from 09cb8dc to d90925b on September 28, 2015 12:26
@justinsb
Member

Going to close & reopen to try to get shippable to re-run tests.

@justinsb justinsb closed this Sep 28, 2015
@justinsb justinsb reopened this Sep 28, 2015
@k8s-bot

k8s-bot commented Sep 28, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@justinsb
Member

ok to test

@k8s-bot

k8s-bot commented Sep 28, 2015

Unit, integration and GCE e2e build/test failed for commit d90925bdf0fdac8df798a0e65e7cdc906e6bf7ec.

// At this point we are waiting for the volume being detached. This
// releases the volume and invalidates the cache even when there is a timeout.
//
// TODO: A timeout leaves the cache in an inconsitent state. The volume is still
@justinsb
Member

Typo nit: s/inconsitent/inconsistent

@justinsb
Member

I don't know what is wrong with shippable. I suggested a typo nit, which if you fix will have the nice side effect of getting shippable to try again :-)

@k8s-bot

k8s-bot commented Sep 28, 2015

Unit, integration and GCE e2e test build/test passed for commit a383b843358b5f91bb0c8372e70d6a9390b4f215.

@BugRoger
Contributor Author

Unfortunately I found another invalidation bug. The cache is a map from a single character (the device suffix) to the volume ID. As implemented here:

if strings.HasPrefix(mountpoint, "/dev/sd") {
    mountpoint = mountpoint[7:]
}
if strings.HasPrefix(mountpoint, "/dev/xvd") {
    mountpoint = mountpoint[8:]
}
deviceMappings[mountpoint] = orEmpty(blockDevice.EBS.VolumeID)

During the invalidation the whole mount path is passed in:

ec2Device := "/dev/xvd" + mountpoint
defer func() {
    if !attached {
        awsInstance.releaseMountDevice(disk.awsID, ec2Device)
    }
}()

This never works, and in case AttachDisk runs into a timeout, the kubelet ends up in yet another endless loop. The fix is easy, but there is still follow-up trouble here:
https://github.com/BugRoger/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1153
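
A sketch of the key-mismatch fix described here: normalize the device name back to the bare suffix the cache is keyed by before deleting the entry. The helper name is made up for illustration; this is not a quote of the committed code:

package main

import (
	"fmt"
	"strings"
)

// toCacheKey converts "/dev/sdf" or "/dev/xvdf" to the bare suffix "f" that
// deviceMappings is keyed by, matching the prefix-stripping done on insert.
func toCacheKey(device string) string {
	device = strings.TrimPrefix(device, "/dev/sd")
	device = strings.TrimPrefix(device, "/dev/xvd")
	return device
}

func main() {
	deviceMappings := map[string]string{"f": "vol-abcd1234"}

	// Buggy cleanup: the full path never matches a single-letter key.
	delete(deviceMappings, "/dev/xvdf")
	fmt.Println(len(deviceMappings)) // 1: the stale entry survives

	// Fixed cleanup: strip the prefix first, symmetrical with insertion.
	delete(deviceMappings, toCacheKey("/dev/xvdf"))
	fmt.Println(len(deviceMappings)) // 0: the entry is released
}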

I even implemented that TODO. That also didn't fix the problem completely. To be quite honest, I don't know where it still goes wrong, so in my fork I invalidate the whole cache upon timeouts:
BugRoger@b958735

To fix this properly, I feel like we need (better) tests, then a refactor that models the state carefully, maybe with periodic reconciliation. That can go into a separate PR, I hope.

@k8s-bot

k8s-bot commented Sep 29, 2015

Unit, integration and GCE e2e test build/test passed for commit 860321c34cd3ad90f457732172157f3cc8e6cc40.

@justinsb
Member

justinsb commented Oct 5, 2015

This LGTM; going to close & reopen to kick shippable.

@a-robinson
Contributor

Shippable has been incredibly slow today - I'm going to take @justinsb's last comment as an LGTM and merge. I assume that this should also be cherry-picked into 1.1?

@a-robinson a-robinson added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2015
@a-robinson
Contributor

Before merging, would you mind squashing your commits, @BugRoger?

@roberthbailey
Contributor

@FestivalBobcats - it did not make the 1.1.1 release, as it was never cherry picked into the release branch.

@roberthbailey
Contributor

I just created a cherry pick of this PR into the 1.1 release branch (#18559). Once that gets merged, this should be resolved with the next patch release (1.1.4).

@iameli

iameli commented Jan 1, 2016

If I get a pod/node into this state, is there an easy workaround I can use until 1.1.4 drops?

@BugRoger
Contributor Author

BugRoger commented Jan 3, 2016

If I get a pod/node into this state, is there an easy workaround I can use until 1.1.4 drops?

Restart the Kubelet 😦

brendandburns added a commit that referenced this pull request Jan 9, 2016
…-#14493-upstream-release-1.1

Automated cherry pick of #14493
@actionshrimp

Just upgraded to 1.1.4 and still getting this issue - had a look in the release and it looks like the PR didn't make it in?

@antoineco
Contributor

It should have. Did you build from the 1.1 release branch or are you using the beta?

@actionshrimp

Thanks for the quick reply. I looked at the source in the 1.1.4 release bundle here https://github.com/kubernetes/kubernetes/releases/tag/v1.1.4, and I'm using the binaries hosted on here: https://storage.googleapis.com/kubernetes-release/release/v1.1.4/bin/

@FestivalBobcats

I am also still seeing this issue on 1.1.4. It seems I can't ever re-attach the same volume to a node.

My team had been using 1.2.0-alpha3, and we had a frequent "/dev/xvdf is already in use" error. We thought this thread was relevant, but after building and deploying 1.1.4, we're now thinking that was a different error entirely.

On 1.1.4, we can't restart a pod with attached EBS without getting:

Error syncing pod, skipping: Timeout waiting for volume state

We're currently snapshotting and re-creating volumes as a workaround.

EDIT: Actually, it turns out we compiled without #18559. Testing that out now.

@FestivalBobcats

Correction: I'm completely wrong. Didn't build the right tag.

@antoineco
Contributor

(k8s 1.1.4)
This did not fix #15073, unfortunately; I'm still in the same situation as before: detached volumes are not attaching.

aws.go:1018] Waiting for volume state: actual=detached, desired=attached
...

After restarting kubelet:

kubelet[28901]: I0115 11:03:20.761060   28901 aws.go:898] Assigned mount device g -> volume vol-abcd1234
kubelet[28901]: I0115 11:03:21.131251   28901 aws.go:1132] AttachVolume request returned %v{
kubelet[28901]: AttachTime: 2016-01-15 11:03:21.117 +0000 UTC,
kubelet[28901]: Device: "/dev/xvdg",
kubelet[28901]: InstanceId: "i-abcd1234",
kubelet[28901]: State: "attaching",
kubelet[28901]: VolumeId: "vol-abcd1234"
kubelet[28901]: }

@roberthbailey
Contributor

@antoineco This wasn't cherry picked in time to catch the 1.1.4 release. You should see it fixed in the (pending) 1.1.7 release.

@stepanstipl

I still seem to have this (or a very similar) issue in 1.1.7. When a pod with EBS-backed block storage is recreated, it never succeeds in attaching the EBS volume:

FirstSeen     LastSeen        Count   From                                                    SubobjectPath   Reason          Message
  ─────────     ────────        ─────   ────                                                    ─────────────   ──────          ───────
  22m           22m             1       {scheduler }                                                            Scheduled       Successfully assigned monitoring-influxdb-v2-1-0q6zy to ip-172-20-217-186.eu-west-1.compute.internal
  20m           20m             1       {kubelet ip-172-20-217-186.eu-west-1.compute.internal}                  FailedMount     Unable to mount volumes for pod "monitoring-influxdb-v2-1-0q6zy_kube-system": Timeout waiting for volume state
  20m           20m             1       {kubelet ip-172-20-217-186.eu-west-1.compute.internal}                  FailedSync      Error syncing pod, skipping: Timeout waiting for volume state
  22m           5s              131     {kubelet ip-172-20-217-186.eu-west-1.compute.internal}                  FailedMount     Unable to mount volumes for pod "monitoring-influxdb-v2-1-0q6zy_kube-system": Error attaching EBS volume: VolumeInUse: vol-9b981f59 is already attached to an instance status code: 400, request id:
  22m           5s              131     {kubelet ip-172-20-217-186.eu-west-1.compute.internal}          FailedSync      Error syncing pod, skipping: Error attaching EBS volume: VolumeInUse: vol-9b981f59 is already attached to an instance status code: 400, request id:

@Morriz

Morriz commented Feb 26, 2016

I have tested this to still be an issue all the way up to 1.2.0-alpha.8. Can we get some more eyeballs on this?

@Morriz

Morriz commented Feb 26, 2016

I should be more specific: I don't get more than the timeout message:

FirstSeen   LastSeen    Count   From                            SubobjectPath   Reason      Message
  ─────────   ────────    ───── ────                            ───────────── ──────      ───────
  11m       11m     1   {scheduler }                                Scheduled   Successfully assigned grafana-3n68f to ip-10-0-0-168.eu-central-1.compute.internal
  10m       28s     11  {kubelet ip-10-0-0-168.eu-central-1.compute.internal}           FailedMount Unable to mount volumes for pod "grafana-3n68f_kube-system": Timeout waiting for volume state
  10m       28s     11  {kubelet ip-10-0-0-168.eu-central-1.compute.internal}           FailedSync  Error syncing pod, skipping: Timeout waiting for volume state

@justinsb
Member

@Morriz can you post your pod description please? I don't think the out-of-the-box grafana pod uses an EBS volume...

kubectl describe pod --namespace=kube-system grafana-3n68f_kube-system -ojson

@Morriz

Morriz commented Feb 27, 2016

Sorry, I am in a constant state of flux with the stack... trying to find the magic fit between components and their versions... so no backtracking there for now. Maybe later.

@stepanstipl

@justinsb What is interesting is that even if you delete and recreate the rc, and delete and recreate the actual EBS volume (so you end up with both a different pod and a different EBS volume ID), it will still keep failing. I should be able to post a description of an actual failing pod here tomorrow or on Monday.

In case it's useful in the meantime: the only difference compared to the out-of-the-box grafana is in the rc definition, where the one I'm using adds:

        volumeMounts:
        - name: secret-volume
          readOnly: true
          mountPath: /secrets
        - name: grafana-persistent-storage
          mountPath: /var/lib/grafana
      volumes:
      - name: secret-volume
        secret:
          secretName: monitoring-grafana
      - name: grafana-persistent-storage
        awsElasticBlockStore:
          volumeID: "aws://$AWS_ZONE/$AWS_EBS_ID"
          fsType: ext4

But although it coincidentally also happens with a grafana pod in this case, I have had this issue with other pods using EBS volumes and different images - for example influxdb or elasticsearch.

@Morriz

Morriz commented Feb 28, 2016

@stepanstipl: that grafana EBS volume ID has the old format; it is now just the volume ID (vol-1234567). Hope that was not causing the mount failure :)

@Morriz

Morriz commented Feb 28, 2016

@justinsb: I had only changed the default emptyDir storage to an EBS volume. Now trying to see under what conditions the mounts fail...

@james-thimont-bcgdv

@justinsb I have the same problem on v1.1.8.
Let me know if you need any more info

    Image ID:       
    QoS Tier:
      cpu:  Burstable
    Limits:
      cpu:  250m
    Requests:
      cpu:      150m
    State:      Waiting
      Reason:       ContainerCreating
    Ready:      False
    Restart Count:  0
    Environment Variables:
Conditions:
  Type      Status
  Ready     False 
Volumes:
  jenkins:
    Type:   AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://eu-west-1b/vol-3ca8ddf2
    FSType: ext4
    Partition:  0
    ReadOnly:   false
  ssl:
    Type:   Secret (a secret that should populate this volume)
    SecretName: jenkins-ssl
  default-token-bd5uj:
    Type:   Secret (a secret that should populate this volume)
    SecretName: default-token-bd5uj
Events:
  FirstSeen LastSeen    Count   From                            SubobjectPath   Reason      Message
  ─────────   ────────    ───── ────                            ───────────── ──────      ───────
  11m       11m     1   {scheduler }                                Scheduled   Successfully assigned jenkins-leader-i9kk1 to ip-10-0-0-94.eu-west-1.compute.internal
  9m        27s     10  {kubelet ip-10-0-0-94.eu-west-1.compute.internal}           FailedMount Unable to mount volumes for pod "jenkins-leader-i9kk1_default": Timeout waiting for volume state
  9m        27s     10  {kubelet ip-10-0-0-94.eu-west-1.compute.internal}           FailedSync  Error syncing pod, skipping: Timeout waiting for volume state

@Morriz

Morriz commented Mar 24, 2016

James, you are using the old volume ID notation, which was superseded by just the AWS notation ('vol-1234abcd'), as is reflected in the Kubernetes documentation as well.


shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ry-pick-of-#14493-upstream-release-1.1

Automated cherry pick of kubernetes#14493
shouhong pushed a commit to shouhong/kubernetes that referenced this pull request Feb 14, 2017
…ry-pick-of-#14493-upstream-release-1.1

Automated cherry pick of kubernetes#14493