
rbd mount fencing #10563

Merged
mikedanese merged 1 commit into kubernetes:master from rootfs:rbd-fencing on Jul 24, 2015

Conversation

9 participants
rootfs (Member) commented Jun 30, 2015

This aims to fix #10462.

If multiple pods mount the same rbd volume in read-write mode, only one should succeed. This prevents multiple pods from writing to the rbd volume and corrupting its data.

rbd mount fencing uses rbd lock add|remove|list. The kubelet mounter checks whether the rbd image already has a lock; if a lock exists and is held by another node, the mount is refused and kubelet fails the pod. Read-only mounts still proceed without checking the lock.

I am still running tests to rule out regressions. Big thanks to the volunteers helping me test it.
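To make the flow concrete, here is a minimal sketch that shells out to the rbd CLI; the function name, lock-id convention, and the string-based check of the lock listing are illustrative assumptions only, not the code in this PR:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// fenceRBDImage sketches the fencing idea: list the advisory locks on the
// image and refuse to proceed if another holder is present; otherwise take
// the lock before mapping and mounting. The output parsing is deliberately
// simplistic and only meant to show the shape of the check.
func fenceRBDImage(pool, image, lockID string) error {
	out, err := exec.Command("rbd", "lock", "list", "--pool", pool, image).CombinedOutput()
	if err != nil {
		return fmt.Errorf("rbd lock list failed: %v: %s", err, out)
	}
	listing := string(out)
	if strings.Contains(listing, "client.") && !strings.Contains(listing, lockID) {
		// Another client holds a lock on this image: fail the mount.
		return fmt.Errorf("rbd image %s/%s is locked by other nodes", pool, image)
	}
	// No foreign lock found: claim the lock so other kubelets back off.
	if out, err := exec.Command("rbd", "lock", "add", "--pool", pool, image, lockID).CombinedOutput(); err != nil {
		return fmt.Errorf("rbd lock add failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := fenceRBDImage("kube", "pv1", "kubelet_lock_nodeA"); err != nil {
		fmt.Println(err)
	}
}
```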

googlebot added the cla: yes label Jun 30, 2015

k8s-bot commented Jun 30, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

rootfs (Member) commented Jun 30, 2015

@leseb @jsafrane I'd appreciate it if you could review this and verify it in your environments.

rootfs (Member) commented Jul 1, 2015

My test case is described here.

zmerlynn (Member) commented Jul 1, 2015

Assigning to @thockin for final review when the WIP comes off.

rootfs changed the title from "WIP: rbd mount fencing" to "rbd mount fencing" Jul 1, 2015

rootfs (Member) commented Jul 1, 2015

So far so good, no regression found. @thockin your review is appreciated.

id: id,
keyring: keyring,
secret: secret,
Mon: source.CephMonitors,

thockin (Member) commented Jul 4, 2015

Why are all of these being made non-private?

thockin (Member) commented Jul 4, 2015

I see it's for JSON?

thockin (Member) commented Jul 4, 2015

A comment would help a lot.

rootfs (Member) commented Jul 6, 2015

You are right, the fields are upper case for JSON serialization. Comments are coming.
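For reference, Go's encoding/json only serializes exported (upper-case) struct fields, which is why these fields had to become public. A minimal standalone illustration; the field names and json tags here are placeholders, not the actual rbd struct:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Only exported (upper-case) fields are visible to encoding/json; an
// unexported field such as "secret" is silently skipped when marshalling.
type rbdExample struct {
	Mon     []string `json:"mon"`
	ID      string   `json:"id"`
	Keyring string   `json:"keyring"`
	secret  string   // unexported: never written to rbd.json
}

func main() {
	b, err := json.Marshal(rbdExample{
		Mon:     []string{"10.16.154.78:6789"},
		ID:      "admin",
		Keyring: "/etc/ceph/keyring",
		secret:  "not serialized",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // {"mon":["10.16.154.78:6789"],"id":"admin","keyring":"/etc/ceph/keyring"}
}
```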

}

func (util *RBDUtil) persistRBD(rbd rbd, mnt string) error {
file := path.Join(mnt, "rbd.json")

thockin (Member) commented Jul 4, 2015

Won't this write to the RBD volume itself? Shouldn't it be stored in host.GetPodPluginDir() instead (see the secret plugin for an example)?

rootfs (Member) commented Jul 6, 2015

The JSON won't be written to the rbd volume, because mount is called after persist.

Call it poor man's data protection: I don't want the JSON to be exposed, so it gets protected. The best way to do that, IMHO, is to make it invisible. persistRBD() is called before mount, so the new mountpoint masks the content of the local directory.
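A minimal standalone sketch of that masking trick; the struct fields, paths, and function body are illustrative assumptions, not the code in this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path"
)

// rbdMeta stands in for the metadata kubelet needs later to find and
// release the lock (pool, image, id, ...).
type rbdMeta struct {
	Pool  string `json:"pool"`
	Image string `json:"image"`
	ID    string `json:"id"`
}

// persistRBD writes rbd.json into the still-empty mountpoint directory.
// Once the rbd device is mounted on mnt, the file is masked by the mount
// and invisible to the pod; it reappears on the host after unmount, which
// is exactly when it is needed to remove the lock.
func persistRBD(meta rbdMeta, mnt string) error {
	file := path.Join(mnt, "rbd.json")
	fp, err := os.Create(file)
	if err != nil {
		return fmt.Errorf("rbd: create %s err: %v", file, err)
	}
	defer fp.Close()
	return json.NewEncoder(fp).Encode(meta)
}

func main() {
	mnt := "/var/lib/kubelet/plugins/rbd/mounts/pv1" // placeholder path
	if err := persistRBD(rbdMeta{Pool: "kube", Image: "pv1", ID: "admin"}, mnt); err != nil {
		fmt.Println(err)
	}
	// ... map the rbd device and mount it on mnt here; the json file is now hidden ...
}
```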

rootfs (Member) commented Jul 6, 2015

added comments about this behavior.

rootfs force-pushed the rootfs:rbd-fencing branch from 127cdac to 6b3cfe1 Jul 6, 2015

thockin (Member) commented Jul 6, 2015

I see. It feels brittle but I can't find a reason it won't work off the top of my head.

LGTM, but I don't think this can go in before 1.0

rootfs (Member) commented Jul 6, 2015

Thanks. Let's revisit it post 1.0.

thockin (Member) commented Jul 6, 2015

Please see #10760 and add this as a known issue for 1.0.

thockin added the lgtm label Jul 6, 2015

rootfs (Member) commented Jul 6, 2015

NEW: If multiple Pods use the same RBD volume in read-write mode, data on the RBD volume could get corrupted. This problem has been seen in environments where both the apiserver and etcd rebooted and Pods were redistributed.

A workaround is to ensure that no other Ceph client is using the RBD volume before mapping the RBD image in read-write mode. For example, rados -p poolname listwatchers image_name.rbd lists the RBD clients that are mapping the image.
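A rough sketch of automating that check before a read-write mount; the command invocation, pool and image names, and the format-1 header-object name image_name.rbd are taken from the example above as assumptions, not from this PR:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hasWatchers runs the workaround command and reports whether any Ceph
// client is currently watching (i.e. has mapped) the image. Any non-empty
// listwatchers output is treated as "in use".
func hasWatchers(pool, image string) (bool, error) {
	out, err := exec.Command("rados", "-p", pool, "listwatchers", image+".rbd").CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("rados listwatchers failed: %v: %s", err, out)
	}
	return strings.TrimSpace(string(out)) != "", nil
}

func main() {
	busy, err := hasWatchers("poolname", "image_name")
	if err != nil {
		fmt.Println(err)
		return
	}
	if busy {
		fmt.Println("image_name.rbd has active watchers; do not map it read-write")
	}
}
```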

yujuhong added this to the v1.0-post milestone Jul 7, 2015

bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015

mikedanese (Member) commented Jul 24, 2015

@rootfs please rebase.

fencing off multiple rbd mount
Signed-off-by: Huamin Chen <hchen@redhat.com>

rootfs force-pushed the rootfs:rbd-fencing branch from 6b3cfe1 to fa8a2ef Jul 24, 2015

rootfs (Member) commented Jul 24, 2015

@mikedanese here you go, thanks!

mikedanese (Member) commented Jul 24, 2015

@k8s-bot ok to test

k8s-bot commented Jul 24, 2015

GCE e2e build/test passed for commit fa8a2ef.

mikedanese added a commit that referenced this pull request Jul 24, 2015

mikedanese merged commit 18466bf into kubernetes:master Jul 24, 2015

4 checks passed:

Jenkins GCE e2e: 115 tests run, 51 skipped, 0 failed.
Shippable: Shippable builds completed
cla/google: All necessary CLAs are signed
continuous-integration/travis-ci/pr: The Travis CI build passed
maklemenz commented Aug 6, 2015

@rootfs Who is supposed to unlock the volume once the node/minion dies?

My pod was scheduled to node A and was running. I have shut down node A to simulate a crash and the pod got rescheduled to node B. Node B complains that the volume is locked and won't start my pod: "Error syncing pod, skipping: rbd: image pv1 is locked by other nodes".

Having to manually unlock all Ceph volumes whenever a node dies would not be fun.

rootfs (Member) commented Aug 6, 2015

@maklemenz the node that locks the rbd volume should unlock it after the pod is deleted. In your situation the node went away and is thus unable to unlock the rbd. This is not ideal, but it at least prevents concurrent access to the same rbd and potential data corruption.

Discussions at #6084 may yield a blueprint for a generic solution.
