
AWS EBS volume data is deleted in certain cases #11012

Closed
erulabs opened this issue Jul 9, 2015 · 4 comments

Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

erulabs (Contributor) commented Jul 9, 2015

Hello!

Kubernetes sometimes deletes all data from my EBS-backed volume. Given a controller like so:

apiVersion: v1
kind: ReplicationController
metadata:
  labels:
    name: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        volumeMounts:
        - name: nginx-store
          mountPath: "/var/nginx"
      volumes:
      - name: nginx-store
        awsElasticBlockStore:
          volumeID: aws://REGION/VOLUME_ID
          fsType: ext4

One would expect data in "/var/nginx" to persist; however, it does not appear to persist when the pod moves between hosts.

Running kubectl stop pod nginx-ID works most of the time, but when the pod moves between hosts I see events like these:

Thu, 09 Jul 2015 20:02:57 +0000   Thu, 09 Jul 2015 20:02:57 +0000   1         nginx-f26t2   Pod                                                         failedSync         {kubelet REDACTED}   Error syncing pod, skipping: Error attaching EBS volume: VolumeInUse: VOLUME_ID is already attached to an instance
                                  status code: 400, request id: []
Thu, 09 Jul 2015 20:02:57 +0000   Thu, 09 Jul 2015 20:02:57 +0000   1         nginx-f26t2   Pod                 failedMount   {kubelet REDACTED}   Unable to mount volumes for pod "nginx-f26t2_default": Error attaching EBS volume: VolumeInUse: VOLUME_ID is already attached to an instance

The interesting part is that the EBS volume does re-attach to the host the pod is moving to, and it does mount properly - it just has no data in it when it is mounted (as if it had been rm -rfed).

I am learning Go and am eager to grok the Kubernetes codebase a bit, so I wanted to take a stab at this (more of a wild guess). My suspicion is this code: https://github.com/GoogleCloudPlatform/kubernetes/blob/530bff315ff034d0c9098004f35f01d26ab40aaa/pkg/cloudprovider/aws/aws.go#L1316-L1322

It seems to me that if we receive an error message we return, which prevents us from ever reaching waitForAttachmentStatus. I get the feeling this causes a race in which the new pod boots up while the volume is still being attached, which would account for my seeing this only rarely. Kubernetes should block the entire pod creation process while it waits for the disk to be attached and mounted on the correct node; instead, it sees the AWS failure and continues on its way.
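
To make that guess a bit more concrete, here is a rough sketch of the flow I have in mind, with made-up names standing in for the real pkg/cloudprovider/aws functions (this is not the actual implementation):

// Sketch only: attachVolume and waitForAttachmentStatus are illustrative
// stand-ins, not the real cloudprovider code.
package main

import (
    "errors"
    "fmt"
)

// attachVolume stands in for the EC2 AttachVolume call.
func attachVolume(volumeID string) error {
    return errors.New("VolumeInUse: " + volumeID + " is already attached to an instance")
}

// waitForAttachmentStatus stands in for polling until the attachment reaches
// the desired state ("attached").
func waitForAttachmentStatus(volumeID, status string) error {
    return nil
}

// attachDisk illustrates the suspicion: on a VolumeInUse error we return
// immediately, so we never block on waitForAttachmentStatus, and nothing
// holds the pod back while the volume is still moving between instances.
func attachDisk(volumeID string) error {
    if err := attachVolume(volumeID); err != nil {
        return fmt.Errorf("Error attaching EBS volume: %v", err)
    }
    // This is where I would expect pod creation to block until the disk is
    // really attached to the new node.
    return waitForAttachmentStatus(volumeID, "attached")
}

func main() {
    fmt.Println(attachDisk("vol-123"))
}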

Any thoughts? Thanks for your time!

roberthbailey (Contributor) commented:

/cc @brendandburns

justinsb (Member) commented Jul 9, 2015

Ouch; if this is correct, it is both bad and indicates a fairly important missing e2e test! Looking into it.

vmarmol added the priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) and area/platform/aws labels Jul 10, 2015
justinsb added a commit to justinsb/kubernetes that referenced this issue Jul 12, 2015:
Issue kubernetes#11012 reported that disk contents may be lost (sometimes).

We should have an e2e test to verify this is not happening.

(If it is happening, we should then fix it!)
justinsb (Member) commented:
We should have an e2e test for this; I'm working on one here: #11128. So far I haven't been able to reproduce the problem.
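
Roughly, the check I want the test to make is the following (a sketch with hypothetical helper names, not the actual code in #11128):

// Hypothetical outline of the persistence check, not the real e2e test.
package main

import "fmt"

// runPodWithEBSVolume stands in for creating a pod that mounts the volume and
// runs a single command, returning the command's output. In the real test this
// would go through the API server; here it is a placeholder so the outline compiles.
func runPodWithEBSVolume(podName, volumeID, command string) (string, error) {
    fmt.Printf("pod %s mounts %s and runs: %s\n", podName, volumeID, command)
    return "hello\n", nil
}

// deletePod stands in for deleting the pod and waiting for the volume to detach.
func deletePod(podName string) error {
    return nil
}

// checkEBSPersistence writes a marker file from one pod, deletes that pod, then
// reads the file back from a second pod using the same EBS volume. If the read
// fails or the contents differ, we have reproduced the reported data loss.
func checkEBSPersistence(volumeID string) error {
    if _, err := runPodWithEBSVolume("writer", volumeID, "echo hello > /mnt/test/marker"); err != nil {
        return err
    }
    if err := deletePod("writer"); err != nil {
        return err
    }
    out, err := runPodWithEBSVolume("reader", volumeID, "cat /mnt/test/marker")
    if err != nil {
        return err
    }
    if out != "hello\n" {
        return fmt.Errorf("EBS data did not survive pod rescheduling: got %q", out)
    }
    return nil
}

func main() {
    fmt.Println("persistence check:", checkEBSPersistence("vol-123"))
}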

The events in kubectl describe aren't necessarily a problem (but it is very helpful to see them, so thank you!). It takes a while to detach the volume from the old host, and Kubernetes will likely reschedule the pod before that is complete. We would expect the pod to fail to start until the volume has been released, so it makes sense that we would see those events.

This assumes that we don't actually start the pod until the volume is correctly mounted. If (as you suggested) we are somehow starting the pod without the volume being mounted, we would see this problem. Another hypothetical: if we are forcibly detaching or shutting down the pod and it isn't flushing to EBS, we might see data loss there. But so far I have yet to find these problems.

Similarly, as long as we return an error from AttachVolume, I would expect the pod to fail to start; Kubernetes should retry until AttachVolume succeeds. I'm verifying that is indeed the case. But, assuming everything works as I understand it, there shouldn't be a problem unless we silently ignore errors in AttachVolume.
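
To spell out the ordering I'm relying on, here is a sketch with invented names (not the real kubelet sync code); the point is that a failed attach or mount only produces events and a later retry, and containers never start until the mount succeeds:

// Sketch only: none of these names are the real kubelet functions.
package main

import (
    "errors"
    "fmt"
    "time"
)

// attachAndMount stands in for attaching the EBS volume and mounting it into
// the pod; it fails with VolumeInUse until the old instance releases the volume.
var syncAttempts int

func attachAndMount(volumeID string) error {
    syncAttempts++
    if syncAttempts < 3 {
        return errors.New("VolumeInUse: " + volumeID + " is already attached to an instance")
    }
    return nil
}

func startContainers() {
    fmt.Println("containers started")
}

// syncPod shows the expected ordering: if attach/mount fails we record
// failedMount/failedSync (the events pasted above) and give up on this sync;
// containers are never started until the mount has succeeded.
func syncPod(volumeID string) error {
    if err := attachAndMount(volumeID); err != nil {
        fmt.Println("failedMount:", err)
        return err
    }
    startContainers()
    return nil
}

func main() {
    // Each failed sync is retried later; the pod only starts once the volume
    // has actually been attached and mounted.
    for syncPod("vol-456") != nil {
        time.Sleep(100 * time.Millisecond)
    }
}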

@erulabs Can you provide some more information about what files are in '/var/nginx', and what behaviour you see when the data is there vs. when it is not?

erulabs (Contributor, Author) commented Jul 13, 2015

@justinsb I apologize - I was traveling right after reporting this bug and haven't gotten back to it - I'm going to try to recreate it on the newest release (this bug was on r19). It did occur twice (once with a Jenkins installation and once with just a few test text files, the /var/nginx example). It's possible I was just getting unlucky with AWS or the host boxes, and this time I'll do a deeper dive into the host nodes themselves if it happens again.

Thanks for the test :D I'll close this issue if that's OK, and re-open it if I can recreate the problem reliably on the newest release. After reading a bunch of your AWS/EBS code, I'm starting to lean toward blaming the cloud infrastructure rather than the handling code.

erulabs closed this as completed Jul 13, 2015
saad-ali pushed a commit to saad-ali/kubernetes that referenced this issue Jul 30, 2015