
AWS EBS volume data is deleted in certain cases #11012

Closed
erulabs opened this issue Jul 9, 2015 · 4 comments

Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

erulabs (Contributor) commented Jul 9, 2015

Hello!

Kubernetes sometimes deletes all data from my EBS-backed volume. Given a controller like so:

apiVersion: v1
kind: ReplicationController
metadata:
  labels:
    name: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        volumeMounts:
        - name: nginx-store
          mountPath: "/var/nginx"
      volumes:
      - name: nginx-store
        awsElasticBlockStore:
          volumeID: aws://REGION/VOLUME_ID
          fsType: ext4

One would expect data in "/var/nginx" to persist; however, it does not appear to persist when the pod moves between hosts.

Running kubectl stop pod nginx-ID works most of the time, but when the pod moves between hosts I see events like these:

Thu, 09 Jul 2015 20:02:57 +0000   Thu, 09 Jul 2015 20:02:57 +0000   1         nginx-f26t2   Pod                                                         failedSync         {kubelet REDACTED}   Error syncing pod, skipping: Error attaching EBS volume: VolumeInUse: VOLUME_ID is already attached to an instance
                                  status code: 400, request id: []
Thu, 09 Jul 2015 20:02:57 +0000   Thu, 09 Jul 2015 20:02:57 +0000   1         nginx-f26t2   Pod                 failedMount   {kubelet REDACTED}   Unable to mount volumes for pod "nginx-f26t2_default": Error attaching EBS volume: VolumeInUse: VOLUME_ID is already attached to an instance

The interesting part is that the EBS volume does re-attach to the host the pod is moving to, and it does mount properly - it just has no data in it when it is mounted (as if it had been rm -rfed).

I am learning Go and am eager to grok the Kubernetes codebase a bit, so I wanted to take a stab at this (more of a wild guess). My suspicion is this code: https://github.com/GoogleCloudPlatform/kubernetes/blob/530bff315ff034d0c9098004f35f01d26ab40aaa/pkg/cloudprovider/aws/aws.go#L1316-L1322

It seems to me that if we receive an error message we return, which prevents us from ever reaching waitForAttachmentStatus. I get the feeling this causes a race in which the new pod boots up while the volume is still being attached, which would account for my seeing this only rarely. Kubernetes should block the entire pod creation process while it waits for the disk to be attached and mounted on the correct node; instead, it sees the AWS failure and continues on its way.
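
To make that guess a bit more concrete, here is a rough sketch of the flow I have in mind, with made-up names standing in for the real pkg/cloudprovider/aws functions (this is not the actual implementation):

// Sketch only: attachVolume and waitForAttachmentStatus are illustrative
// stand-ins, not the real cloudprovider code.
package main

import (
    "errors"
    "fmt"
)

// attachVolume stands in for the EC2 AttachVolume call.
func attachVolume(volumeID string) error {
    return errors.New("VolumeInUse: " + volumeID + " is already attached to an instance")
}

// waitForAttachmentStatus stands in for polling until the attachment reaches
// the desired state ("attached").
func waitForAttachmentStatus(volumeID, status string) error {
    return nil
}

// attachDisk illustrates the suspicion: on a VolumeInUse error we return
// immediately, so we never block on waitForAttachmentStatus, and nothing
// holds the pod back while the volume is still moving between instances.
func attachDisk(volumeID string) error {
    if err := attachVolume(volumeID); err != nil {
        return fmt.Errorf("Error attaching EBS volume: %v", err)
    }
    // This is where I would expect pod creation to block until the disk is
    // really attached to the new node.
    return waitForAttachmentStatus(volumeID, "attached")
}

func main() {
    fmt.Println(attachDisk("vol-123"))
}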

Any thoughts? Thanks for your time!

roberthbailey (Contributor) commented:

/cc @brendandburns

justinsb (Member) commented Jul 9, 2015

Ouch; if this is correct, it is both bad and indicates a fairly important missing e2e test! Looking into it.

vmarmol added the priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) and area/platform/aws labels Jul 10, 2015
justinsb added a commit to justinsb/kubernetes that referenced this issue Jul 12, 2015:
Issue kubernetes#11012 reported that disk contents may be lost (sometimes).

We should have an e2e test to verify this is not happening.

(If it is happening, we should then fix it!)
justinsb (Member) commented:
We should have an e2e test for this; I'm working on one here: #11128. So far I haven't been able to reproduce the problem.
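
Roughly, the check I want the test to make is the following (a sketch with hypothetical helper names, not the actual code in #11128):

// Hypothetical outline of the persistence check, not the real e2e test.
package main

import "fmt"

// runPodWithEBSVolume stands in for creating a pod that mounts the volume and
// runs a single command, returning the command's output. In the real test this
// would go through the API server; here it is a placeholder so the outline compiles.
func runPodWithEBSVolume(podName, volumeID, command string) (string, error) {
    fmt.Printf("pod %s mounts %s and runs: %s\n", podName, volumeID, command)
    return "hello\n", nil
}

// deletePod stands in for deleting the pod and waiting for the volume to detach.
func deletePod(podName string) error {
    return nil
}

// checkEBSPersistence writes a marker file from one pod, deletes that pod, then
// reads the file back from a second pod using the same EBS volume. If the read
// fails or the contents differ, we have reproduced the reported data loss.
func checkEBSPersistence(volumeID string) error {
    if _, err := runPodWithEBSVolume("writer", volumeID, "echo hello > /mnt/test/marker"); err != nil {
        return err
    }
    if err := deletePod("writer"); err != nil {
        return err
    }
    out, err := runPodWithEBSVolume("reader", volumeID, "cat /mnt/test/marker")
    if err != nil {
        return err
    }
    if out != "hello\n" {
        return fmt.Errorf("EBS data did not survive pod rescheduling: got %q", out)
    }
    return nil
}

func main() {
    fmt.Println("persistence check:", checkEBSPersistence("vol-123"))
}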

The events in kubectl describe aren't necessarily a problem (but it is very helpful to see them, so thank you!). It takes a while to detach the volume from the old host, and Kubernetes will likely reschedule the pod before that is complete. We would expect the pod to fail to start until the volume has been released, so it makes sense that we would see those events.

This assumes that we don't actually start the pod until the volume is correctly mounted. If (as you suggested) we are somehow starting the pod without the volume being mounted, we would see this problem. Another hypothetical: if we are forcibly detaching or shutting down the pod and it isn't flushing to EBS, we might see data loss there. But so far I have yet to find these problems.

Similarly, as long as we return an error from AttachVolume, I would expect the pod to fail to start; Kubernetes should retry until AttachVolume succeeds. I'm verifying that is indeed the case. But, assuming everything works as I understand it, there shouldn't be a problem unless we silently ignore errors in AttachVolume.
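
To spell out the ordering I'm relying on, here is a sketch with invented names (not the real kubelet sync code); the point is that a failed attach or mount only produces events and a later retry, and containers never start until the mount succeeds:

// Sketch only: none of these names are the real kubelet functions.
package main

import (
    "errors"
    "fmt"
    "time"
)

// attachAndMount stands in for attaching the EBS volume and mounting it into
// the pod; it fails with VolumeInUse until the old instance releases the volume.
var syncAttempts int

func attachAndMount(volumeID string) error {
    syncAttempts++
    if syncAttempts < 3 {
        return errors.New("VolumeInUse: " + volumeID + " is already attached to an instance")
    }
    return nil
}

func startContainers() {
    fmt.Println("containers started")
}

// syncPod shows the expected ordering: if attach/mount fails we record
// failedMount/failedSync (the events pasted above) and give up on this sync;
// containers are never started until the mount has succeeded.
func syncPod(volumeID string) error {
    if err := attachAndMount(volumeID); err != nil {
        fmt.Println("failedMount:", err)
        return err
    }
    startContainers()
    return nil
}

func main() {
    // Each failed sync is retried later; the pod only starts once the volume
    // has actually been attached and mounted.
    for syncPod("vol-456") != nil {
        time.Sleep(100 * time.Millisecond)
    }
}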

@erulabs Can you provide some more information about what files are in '/var/nginx', and what behaviour you see when the data is there vs. when it is not?

erulabs (Contributor, Author) commented Jul 13, 2015

@justinsb I apologize - I was traveling right after reporting this bug and haven't gotten back to it - I'm going to try to recreate it on the newest release (this bug was on r19). It did occur twice (once with a Jenkins installation and once with just a few test text files, the /var/nginx example). It's possible I was just getting unlucky with AWS or the host boxes, and this time I'll do a deeper dive into the host nodes themselves if it happens again.

Thanks for the test :D I'll close this issue if that's OK, and re-open it if I can recreate the problem reliably on the newest release. After reading a bunch of your AWS/EBS code, I'm starting to lean toward blaming the cloud infrastructure rather than the handling code.

erulabs closed this as completed Jul 13, 2015
saad-ali pushed a commit to saad-ali/kubernetes that referenced this issue Jul 30, 2015