Fix for Invalidation of DeviceMapping Cache when Detaching Volumes #14493
Conversation
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
Labelling this PR as size/S
Ouch - I think you are 100% right @BugRoger. It would be great to have an integration test that detects this, but I think that getting this into 1.1 is more important. (And likely cherry-picking to 1.0 also). I think there's no real reason for it to be missing other than my mistake. Sorry. The handling of the case where there is a timeout is tricky. It might be that the volume does eventually detach, or it might be that it never detaches. I think it would also be cleaner if we could reuse releaseMountDevice. But I think getting this fix in is more important. If you have time I would like to see:
If you don't have time, then no problem, I think this is bad enough that we should merge as is (the code looks great other than the above nits), and I can add those 3 nits later. Thank you so much for tracking this down & providing a fix @BugRoger (@smarterclayton - fine to assign this one to me if you'd prefer).
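To make the timeout ambiguity concrete, here is a minimal Go sketch of a timeout-bounded detach wait. It is not the actual AWS cloud provider code; volumeDetached and the 5s polling interval are illustrative stand-ins, and only the 60s bound comes from the discussion below.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// volumeDetached is a hypothetical probe; the real code would ask the cloud
// provider (e.g. EC2 DescribeVolumes) whether the volume is still attached.
func volumeDetached(volumeID string) bool {
	return true // stubbed out for the sketch
}

// waitForDetach polls until the volume is detached or the timeout elapses.
// On a timeout the volume may still detach later, or it may never detach;
// that ambiguity is what makes the cache handling tricky.
func waitForDetach(volumeID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if volumeDetached(volumeID) {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return errors.New("timed out waiting for " + volumeID + " to detach")
}

func main() {
	err := waitForDetach("vol-1234abcd", 60*time.Second)
	// However the wait ends, the deviceMapping entry still has to be
	// released (e.g. via releaseMountDevice), or a later attach of the
	// same volume is silently skipped.
	fmt.Println("detach wait finished, err:", err)
}
```

However the wait ends, the deviceMapping entry still has to be released (for example via releaseMountDevice), because on a timeout the volume may detach later or may never detach at all.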
Alright, I addressed your comments. In my case (private EC2 compatible cloud) the timeout of 60s is almost always too short and the detach operation times out. Fortunately, it does recover. When the volume gets attached again while it's still in |
Force-pushed from 09cb8dc to d90925b.
Going to close & reopen to try to get shippable to re-run tests.
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
ok to test
Unit, integration and GCE e2e build/test failed for commit d90925bdf0fdac8df798a0e65e7cdc906e6bf7ec.
// At this point we are waiting for the volume being detached. This
// releases the volume and invalidates the cache even when there is a timeout.
//
// TODO: A timeout leaves the cache in an inconsitent state. The volume is still
Typo nit: s/inconsitent/inconsistent
I don't know what is wrong with shippable. I suggested a typo nit, which if you fix will have the nice side effect of getting shippable to try again :-)
Unit, integration and GCE e2e test build/test passed for commit a383b843358b5f91bb0c8372e70d6a9390b4f215.
Unfortunately I found another invalidation bug. The cache is a map from single char to
During the invalidation the whole mount path is passed in:
This never works and, just in case, I even implemented that TODO. That also didn't fix the problem completely. To be quite honest I don't know where it still goes wrong, so I invalidate the whole cache upon timeouts in my fork:

To fix this properly I feel like we need (better) tests. Then refactor and model the state carefully, maybe with periodic reconciliation. That can go into a separate PR, I hope.
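To make the mismatch concrete, here is a simplified illustration (the real cache type and call sites differ): a map keyed by the bare mountpoint letter can never be invalidated by passing the full device path.

```go
package main

import "fmt"

func main() {
	// Simplified stand-in for the deviceMapping cache: the key is the bare
	// mountpoint letter ("f" for /dev/xvdf), the value is the EBS volume ID.
	deviceMappings := map[string]string{"f": "vol-1234abcd"}

	// Invalidating with the whole mount path never matches the "f" key,
	// so the stale entry survives the detach.
	delete(deviceMappings, "/dev/xvdf")
	fmt.Println(deviceMappings) // map[f:vol-1234abcd] - entry is still there

	// The blunt workaround mentioned above: drop the whole cache on timeout.
	deviceMappings = map[string]string{}
	fmt.Println(deviceMappings) // map[]
}
```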
Unit, integration and GCE e2e test build/test passed for commit 860321c34cd3ad90f457732172157f3cc8e6cc40.
This LGTM; going to close & reopen to kick shippable.
Shippable has been incredibly slow today - I'm going to take @justinsb's last comment as an LGTM and merge. I assume that this should also be cherry-picked into 1.1?
Before merging, would you mind squashing your commits, @BugRoger?
@FestivalBobcats - it did not make the 1.1.1 release, as it was never cherry picked into the release branch.
I just created a cherry pick of this PR into the 1.1 release branch (#18559). Once that gets merged, this should be resolved with the next patch release (1.1.4).
If I get a pod/node into this state, is there an easy workaround I can use until 1.1.4 drops?
Restart the Kubelet 😦
…-#14493-upstream-release-1.1 Automated cherry pick of #14493
Just upgraded to 1.1.4 and still getting this issue - had a look in the release and it looks like the PR didn't make it in?
It should have. Did you build from the 1.1 release branch or are you using the beta?
Thanks for the quick reply. I looked at the source in the 1.1.4 release bundle here https://github.com/kubernetes/kubernetes/releases/tag/v1.1.4, and I'm using the binaries hosted here: https://storage.googleapis.com/kubernetes-release/release/v1.1.4/bin/
I am also still seeing this issue on 1.1.4. It seems I can't ever re-attach the same volume to a node. My team had been using 1.2.0-alpha3, and we had a frequent "/dev/xvdf is already in use" error. We thought this thread was relevant, but after building and deploying 1.1.4, we're now thinking that was a different error entirely. On 1.1.4, we can't restart a pod with attached EBS without getting:
We're currently snapshotting and re-creating volumes as a workaround. EDIT: Actually, it turns out we compiled without #18559. Testing that out now.
Correction: I'm completely wrong. Didn't build the right tag.
(k8s 1.1.4)
After restarting kubelet:
@antoineco This wasn't cherry picked in time to catch the 1.1.4 release. You should see it fixed in the (pending) 1.1.7 release.
I still seem to have this (or a very similar) issue in 1.1.7. When a pod with EBS-backed block storage is recreated, it never succeeds in attaching the EBS volume:
I have tested this and it is still an issue all the way up to 1.2.0-alpha.8. Can we get some more eyeballs on this?
I should be more specific: I don't get more than the timeout message:
@Morriz can you post your pod description please? I don't think the out-of-the-box grafana pod uses an EBS volume...
sry, i am in a constant state of flux with the stack... trying to find the magic fit between components and their versions... so no backtracking there for now... maybe later
@justinsb What is interesting is that even if you delete and recreate the rc, and delete and recreate the actual EBS volume (so you end up with both a different pod and a different EBS volume id), it still keeps failing. I should be able to post a description of an actual failing pod here tomorrow or on Monday. In case this is useful in the meantime, the only difference compared to the out-of-the-box grafana is this in the rc definition, where the one I'm using adds:
But although coincidentally in this case it's also happening with the grafana pod, I have had this issue on other pods with different images using EBS volumes - for example with influxdb or elasticsearch images.
@stepanstipl: that grafana ebs volume id has the old format, it is now just the volume-id (vol-1234567). Hope that was not causing the mount failure :)
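For anyone comparing their pod spec against this, a hypothetical helper contrasting the two notations (the real parsing lives in the AWS cloud provider and may behave differently):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVolumeID is a hypothetical helper contrasting the two notations:
// old style  "aws://us-east-1b/vol-1234abcd"  (zone-qualified URL form)
// new style  "vol-1234abcd"                   (plain AWS volume ID)
func normalizeVolumeID(id string) string {
	if strings.HasPrefix(id, "aws://") {
		parts := strings.Split(id, "/")
		return parts[len(parts)-1]
	}
	return id
}

func main() {
	fmt.Println(normalizeVolumeID("aws://us-east-1b/vol-1234abcd")) // vol-1234abcd
	fmt.Println(normalizeVolumeID("vol-1234abcd"))                  // vol-1234abcd
}
```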
@justinsb: I had only changed the default emptyDir storage to an ebs volume. Now trying to see under what conditions the mounts fail...
@justinsb I have the same problem on v1.1.8.
James, you are using the old volume id notation, which was superseded by just the aws notation ('vol-1234abcd'), and which is reflected in the kubernetes documentation as well....
…ry-pick-of-#14493-upstream-release-1.1 Automated cherry pick of kubernetes#14493
There is a `deviceMapping` cache that keeps the state of the block device mappings in local memory. When a device is detached the cache does not get invalidated. A subsequent attachment of the same volume then falsely assumes that the device is already attached. It skips the actual API call to attach the volume and gets stuck in an endless loop waiting for the attachment to finish. It eventually times out and immediately starts to wait again. This only resolves once the kubelet is restarted.

This fix releases the device from the `deviceMapping` cache when a volume is detached.

With this fix volume attach/detach operations work perfectly for me. I'm a bit unsure, though, how such a fundamental bug wasn't noticed before. Did this ever work, and is this a regression? It could be that previous tests had the pods scheduled to a different kubelet with a cold cache. Maybe @justinsb has some insight.
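For readers who don't want to dig through the AWS cloud provider code, the shape of the bug and of the fix looks roughly like this. It is a heavily simplified sketch; the struct, fields, and device-letter handling are stand-ins rather than the real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// instanceState is a simplified stand-in for the per-instance state kept by
// the kubelet's AWS cloud provider.
type instanceState struct {
	mu sync.Mutex
	// deviceMappings maps a mountpoint letter ("f" for /dev/xvdf) to the
	// EBS volume ID currently (believed to be) attached there.
	deviceMappings map[string]string
}

// assignMountDevice picks a device letter for the volume, or reports that the
// volume already appears to be attached.
func (i *instanceState) assignMountDevice(volumeID string) (letter string, alreadyAttached bool) {
	i.mu.Lock()
	defer i.mu.Unlock()
	for l, v := range i.deviceMappings {
		if v == volumeID {
			// Without the fix, a stale entry left behind by a detach lands
			// here: the attach API call is skipped and the caller waits
			// forever for an attachment that was never requested.
			return l, true
		}
	}
	for _, l := range []string{"f", "g", "h", "i"} {
		if _, used := i.deviceMappings[l]; !used {
			i.deviceMappings[l] = volumeID
			return l, false
		}
	}
	return "", false
}

// releaseMountDevice is the essence of the fix: drop the cache entry when the
// volume is detached so the next attach goes through the real API call again.
func (i *instanceState) releaseMountDevice(volumeID, letter string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if i.deviceMappings[letter] == volumeID {
		delete(i.deviceMappings, letter)
	}
}

func main() {
	inst := &instanceState{deviceMappings: map[string]string{}}

	letter, _ := inst.assignMountDevice("vol-1234abcd")
	fmt.Println("attached on /dev/xvd" + letter)

	// Detach the volume and, crucially, release the cache entry.
	inst.releaseMountDevice("vol-1234abcd", letter)

	// With the fix the second attach is not mistaken for an existing one.
	_, already := inst.assignMountDevice("vol-1234abcd")
	fmt.Println("falsely reported as already attached:", already) // false
}
```

The actual change is essentially just the release step on the detach path; everything else above is scaffolding for the illustration.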