AWS EBS volume data is deleted in certain cases #11012
Comments
/cc @brendandburns
Ouch; if this is correct, this is both bad and indicates a fairly important missing e2e test! Looking into it.
Issue kubernetes#11012 reported that disk contents may be lost (sometimes). We should have an e2e test to verify this is not happening. (If it is happening, we should then fix it!)
We should have an e2e test for this; I'm working on one here: #11128. So far I haven't been able to reproduce the problem.

The events in `kubectl describe` aren't necessarily a problem (but it is very helpful to see them, so thank you!). It takes a while to detach the volume from the old host, and Kubernetes will likely reschedule the pod before that is complete. We would expect the pod to fail to start until the volume has been released, so it makes sense that we would see those events. This assumes that we don't actually start the pod until the volume is correctly mounted. If (as you suggested) we are somehow starting the pod without the volume being mounted, we would see this problem. Another hypothetical: if we are forcibly detaching the volume or shutting down the pod without it flushing to EBS, we might see data loss there. But so far I have yet to find these problems.

Similarly, as long as we return an error from AttachVolume, I would expect the pod to fail to start. Kubernetes should retry until AttachVolume succeeds; I'm verifying that is indeed the case. But (as long as everything works as I understand it) there shouldn't be a problem, provided we don't silently ignore errors in AttachVolume.

@erulabs Can you provide some more information about what files are on `/var/nginx`, and what behaviour you see when the data is there vs. when it is not?
@justinsb I apologize - I was traveling right after reporting this bug and haven't gotten back to it. I'm going to try to recreate it on the newest release (this bug was on r19). It did occur for me twice: once with a Jenkins installation, and once with just a few test text files (the `/var/nginx` example). It's possible I was just getting unlucky with AWS or the host boxes, and this time I'll do a deeper dive into the host nodes themselves if it happens again. Thanks for the test :D I'll close this issue if that's OK, and re-open if I can recreate it reliably on the newest release. After reading a bunch of your AWS/EBS code, I'm starting to lean toward blaming cloud infrastructure rather than the handling code.
Hello!
Kubernetes sometimes deletes all data from my EBS backed volume. Given a controller like so:
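(The original controller spec was not captured above. A minimal hypothetical sketch of such a controller, with the names, image, and volume ID invented as placeholders:)

```yaml
# Hypothetical sketch only: a replication controller mounting an
# EBS-backed volume at /var/nginx. The volume ID and names below
# are placeholders, not values from the original report.
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: nginx-data
          mountPath: /var/nginx
      volumes:
      - name: nginx-data
        awsElasticBlockStore:
          volumeID: vol-0123456789abcdef0
          fsType: ext4
```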
One would expect data in "/var/nginx" to persist; however, it does not appear to persist when the pod moves between hosts.
Running `kubectl stop pod nginx-ID` works most of the time, but when the pod moves between hosts I see an event like so:

The interesting part is that the EBS volume does re-attach to the system the pod is moving to, and it does mount properly - it just has no data in it when it is mounted (as if it's been `rm -rf`ed).

I am learning Go and eager to grok the Kubernetes codebase a bit, so I wanted to take a stab (more of a wild guess). My suspicion is this code: https://github.com/GoogleCloudPlatform/kubernetes/blob/530bff315ff034d0c9098004f35f01d26ab40aaa/pkg/cloudprovider/aws/aws.go#L1316-L1322

It seems to me that if we receive an error message we `return`, which prevents us from reaching `waitForAttachmentStatus`. I get the feeling this causes a race wherein the new pod boots up while the volume is still being attached, which would account for my seeing this fairly rarely. It seems to me Kubernetes should block the entire pod creation process while it waits for the disk to be mounted on the correct node. Instead, it sees the AWS failure and continues on its way.

Any thoughts? Thanks for your time!
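The suspected shape of the code can be sketched as follows. This is a self-contained toy, not the actual `aws.go` code: `fakeAttach`, `attachAndWait`, and the stubbed `waitForAttachmentStatus` are all invented here to illustrate how an early `return` on error skips the wait step, so a caller that doesn't retry could proceed before the volume is really attached.

```go
package main

import (
	"errors"
	"fmt"
)

// calls counts attach attempts so fakeAttach can fail once, then succeed,
// simulating a volume that is still detaching from its old host.
var calls = 0

// fakeAttach stands in for the EC2 AttachVolume call (hypothetical stub).
func fakeAttach() error {
	calls++
	if calls == 1 {
		return errors.New("VolumeInUse: volume is still attached to another instance")
	}
	return nil
}

// waitForAttachmentStatus stands in for polling EC2 until the volume
// reports "attached"; here it succeeds immediately.
func waitForAttachmentStatus() error {
	return nil
}

// attachAndWait mirrors the suspected pattern: an error from the attach
// call returns immediately, so waitForAttachmentStatus is never reached
// on that path.
func attachAndWait() error {
	if err := fakeAttach(); err != nil {
		return err // early return: we never wait for attachment status
	}
	return waitForAttachmentStatus()
}

func main() {
	if err := attachAndWait(); err != nil {
		fmt.Println("first attempt failed:", err)
	}
	if err := attachAndWait(); err == nil {
		fmt.Println("retry succeeded")
	}
}
```

The key question in the thread is what the caller does with that returned error: if Kubernetes retries until `attachAndWait` succeeds (as the second call above does), the pod just starts late; if the error is silently ignored, the pod could start against an unattached or empty device.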