Possible data corruption on rbd storage #10462

Closed
farcaller opened this Issue Jun 28, 2015 · 13 comments

farcaller commented Jun 28, 2015

The etcd cluster backing the apiserver died spontaneously; after a restart the apiserver brought the pods up on different machines. The pods' storage was backed by ext4 over rbd, which ended up mounted on several nodes at once. That could have led to data corruption.

I guess the proper way to mount an rbd volume is to take a lock on it first, which will at least fail if it's locked somewhere else, requiring manual intervention but not destroying data. The lock should be optional, for cases where a proper cluster fs is used on top of the rbd image.

In this case there was clearly no need to access the data from several pods, so ext4 was a reasonable fs choice.
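
To make the suggestion concrete: the fencing I have in mind could be as simple as trying to take an exclusive advisory lock through the rbd CLI before mapping the image, and refusing to mount if the lock is already held. Here is a minimal sketch in Go (not the current volume plugin code); it assumes the ceph rbd CLI is present on the node, and the pool, image and lock-id values are just placeholders:

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// acquireImageLock tries to place an exclusive advisory lock on an rbd image.
// "rbd lock add" fails if another exclusive lock is already held, so a second
// node would fail here instead of mounting the same ext4 filesystem twice.
func acquireImageLock(pool, image, lockID string) error {
    out, err := exec.Command("rbd", "lock", "add", "--pool", pool, image, lockID).CombinedOutput()
    if err != nil {
        return fmt.Errorf("not mounting %s/%s: lock %q not acquired: %v (%s)",
            pool, image, lockID, err, strings.TrimSpace(string(out)))
    }
    return nil
}

func main() {
    // Placeholder values for illustration only.
    if err := acquireImageLock("kube-common", "prometheus-data", "kubelet.node-1"); err != nil {
        fmt.Println(err) // fail loudly and leave the data alone
        return
    }
    fmt.Println("lock acquired, safe to map and mount")
}

The point is only that the failure happens before any mount, so the worst case is a pod stuck pending rather than two writers on one ext4 filesystem.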

thockin commented Jun 28, 2015

@rootfs

rootfs commented Jun 29, 2015

@farcaller I want to get some detail. What does your pod-to-rbd mapping look like? Since an rbd image is a block device, it can be used by only one pod at a time.

I am not sure locking is the proper way to go - if the pod holds a lock but dies in the middle of mounting, the rbd image (or any other volume type) is then excluded from access. In that case, higher-level coordination at the apiserver is a better place to resolve this problem.
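
To spell out the manual-intervention cost: if the node that took the lock dies mid-mount, an operator has to find and break the lock by hand before anything can use the image again. A rough sketch of that cleanup step, again shelling out to the rbd CLI; the lock-id and locker values here are hypothetical and would be read off "rbd lock ls" output first:

package main

import (
    "fmt"
    "os/exec"
)

// breakStaleLock removes an advisory lock left behind by a node that died
// mid-mount. lockID and locker are whatever "rbd lock ls" reports for the
// image; the values used below are made up for illustration.
func breakStaleLock(pool, image, lockID, locker string) error {
    out, err := exec.Command("rbd", "lock", "rm", "--pool", pool, image, lockID, locker).CombinedOutput()
    if err != nil {
        return fmt.Errorf("rbd lock rm failed: %v: %s", err, out)
    }
    return nil
}

func main() {
    if err := breakStaleLock("some-pool", "some-image", "kubelet.node-1", "client.4235"); err != nil {
        fmt.Println(err)
    }
}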

farcaller commented Jun 29, 2015

Here's my rc:

apiVersion: v1beta3
kind: ReplicationController
metadata:
  name: prometheus
  labels:
    name: prometheus
spec:
  replicas: 1
  selector:
    name: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        name: prometheus
    spec:
      containers:
        - name: prometheus
          image: docker-registry.default.svc.kube.local/prometheus
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-data
              mountPath: /prometheus
      volumes:
        - name: prometheus-data
          rbd:
            monitors:
              - 10.200.0.1:6789
              - 10.201.0.1:6789
              - 10.202.0.1:6789
            pool: kube-common
            image: prometheus-data
            fsType: ext4
            user: admin
            keyring: /etc/ceph/ceph.client.admin.keyring

I guess it's better to have a lock that has to be cleaned up manually than to have a non-cluster fs mounted in several locations at once?

rootfs commented Jun 29, 2015

@farcaller thanks, this pod looks good to me.
@thockin what would cause multiple nodes to run the same pod?

farcaller commented Jun 29, 2015

I don't know the exact reason why kubelet didn't unmount the old fs, but after the apiserver and scheduler were restarted the pod got allocated on a different node.

rootfs commented Jun 29, 2015

It is a scary thing to see etcd go down; it is meant to be reliable...

farcaller commented Jun 29, 2015

Unfortunately I don't have long-term logging in the test cluster, so I have no idea why it actually died. But using kubedns as the primary node resolver was clearly a bad idea: once it died, the nodes were stuck with no resolver at all.

rootfs commented Jun 29, 2015

@timstclair your thoughts on HA?

timstclair commented Jun 29, 2015

Wrong Tim St. Clair? @timothysc

thockin commented Jun 29, 2015

@rootfs - you HAVE to be ready to handle network partitions - the old node might still be alive and even using the volume, but the apiserver can't see the node, so it deletes the pod and starts it elsewhere. If RBD corrupts itself in this case, it is not useful and we should strip it from the codebase.

rootfs commented Jun 29, 2015

@thockin that makes sense. rbd doesn't corrupt itself. The inconsistency happens when you have two pods mounting the same rbd image. How does GCE PD handle/prevent this case?

thockin commented Jun 29, 2015

It doesn't allow more than one read-write attachment. Multiple read-only attachments are OK. Better to fail than to allow corruption.
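
The policy is essentially "one writer or any number of readers, never both". A toy sketch of that attach check in Go, using a made-up in-memory table rather than the real GCE attach API, just to show where the second writer gets rejected:

package main

import (
    "fmt"
    "sync"
)

// attachTable is an invented in-memory stand-in for whatever tracks disk
// attachments; it exists only to illustrate the policy.
type attachTable struct {
    mu      sync.Mutex
    writer  map[string]string // volume -> node holding the read-write attachment
    readers map[string]int    // volume -> number of read-only attachments
}

func newAttachTable() *attachTable {
    return &attachTable{writer: map[string]string{}, readers: map[string]int{}}
}

// Attach enforces "one read-write attachment, or many read-only ones" and
// fails rather than risking a second writer.
func (t *attachTable) Attach(volume, node string, readOnly bool) error {
    t.mu.Lock()
    defer t.mu.Unlock()
    if holder, held := t.writer[volume]; held {
        return fmt.Errorf("%s is attached read-write to %s, refusing", volume, holder)
    }
    if readOnly {
        t.readers[volume]++
        return nil
    }
    if t.readers[volume] > 0 {
        return fmt.Errorf("%s has read-only attachments, refusing read-write attach", volume)
    }
    t.writer[volume] = node
    return nil
}

func main() {
    t := newAttachTable()
    fmt.Println(t.Attach("pd-1", "node-a", false)) // <nil>
    fmt.Println(t.Attach("pd-1", "node-b", false)) // second writer rejected
}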

rootfs commented Jun 29, 2015

That's neat. I have lock fencing for rbd, testing now.
