Add support for accessing Ceph RBD with the rbd-nbd client #32266

Closed
rzarzynski opened this Issue Sep 8, 2016 · 18 comments

Comments

@rzarzynski

Objective

Add support for the recently introduced, feature-rich rbd-nbd client to improve Ceph RBD support in Kubernetes.

Background

Kubernetes already offers Ceph RBD support based on the krbd kernel client. Unfortunately, for obvious reasons, krbd can't use the librbd user-space library, which receives most of the development focus. This has caused a feature gap compared to VM-based environments.

At the moment krbd lacks the exclusive-locking infrastructure, image-management facilities, and multi-DC support (remote backups!). A detailed list can be obtained by comparing the definitions of the RBD_FEATURES_ALL macro in librbd and in the kernel. While there are efforts to partially close the gap (patches waiting for merge), librbd continuously gains new features like consistency groups and fencing improvements.

Besides the feature gap, krbd has an additional drawback. Running entirely in kernel space hurts fault tolerance, as any kernel panic affects the whole node -- not only the single Pod using RBD storage. According to the rbd-nbd pull request, reliability, especially on non-x86 platforms, was one of the driving factors behind the new client.

Those issues can be addressed by employing rbd-nbd -- a thin adapter (858 lines of code in Jewel) between the Linux kernel's NBD subsystem and librbd.
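For illustration, here is a minimal sketch (not proposed plugin code) of how a node component could drive rbd-nbd from user space. It assumes the rbd-nbd binary is on the PATH and that `rbd-nbd map` prints the attached device path (e.g. /dev/nbd0); the pool and image names are placeholders.

```go
// Minimal sketch only: shelling out to the rbd-nbd CLI to attach and detach
// an RBD image as a local block device.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// mapImage attaches pool/image through rbd-nbd and returns the device path.
func mapImage(pool, image string) (string, error) {
	out, err := exec.Command("rbd-nbd", "map", pool+"/"+image).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("rbd-nbd map failed: %v: %s", err, out)
	}
	return strings.TrimSpace(string(out)), nil
}

// unmapDevice detaches a previously mapped /dev/nbdX device.
func unmapDevice(device string) error {
	out, err := exec.Command("rbd-nbd", "unmap", device).CombinedOutput()
	if err != nil {
		return fmt.Errorf("rbd-nbd unmap failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Pool and image names below are examples only.
	dev, err := mapImage("kube", "pv-0001")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("mapped at", dev)
	if err := unmapDevice(dev); err != nil {
		log.Fatal(err)
	}
}
```

In a real plugin, authentication options and idempotency checks (is the image already mapped?) would be layered on top of these two calls.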

The primary concern about rbd-nbd was performance. The authors claimed it is comparable to krbd. Together with Intel we verified those claims. Tests on typical hardware where Ceph is deployed (SSDs for journals, HDDs as main storage) showed that rbd-nbd delivers 70-80% of krbd performance. However, we don't consider raw performance a deciding factor; even the fastest RBD client wouldn't be the solution of choice for performance-critical use cases like RDBMS hosting.

Proposal

We would like to propose incorporating support for rbd-nbd. The integration can be accomplished in multiple ways.

  1. Extend the current in-tree RBD plugin to use krbd or rbd-nbd depending on a configuration parameter. This would ensure excellent code commonality. On the other hand, we see differences where this might become an issue.
    The krbd-based plugin employs the advisory locking mechanism to provide fencing at the cost of HA -- in case of a node crash, manual intervention is required. However, serving images marked with the exclusive-lock feature would result in employing the same underlying infrastructure twice. At the moment this isn't an issue because krbd simply doesn't support exclusive lock; however, once it does, RBD images created with Ceph's default feature set will be inaccessible in read-write mode. This is the reason a similar solution was rejected in QEMU.
    Another problem we see is additional dependencies. Once fencing capability is integrated directly into krbd, the dependency on the user-space /usr/bin/rbd CLI utility can be removed; the helper is not required for mapping RBD images, as the plugin may use the kernel interface directly.
  2. A dedicated in-tree plugin. Offers the greatest flexibility in adopting new features (like dynamic provisioning of Persistent Volumes).
  3. A dedicated FlexVolume plugin. Simple but limited; bringing support for e.g. dynamic provisioning would require extensive changes to FlexVolume's interfaces.

At the moment option 2 looks the most promising, though we remain entirely open to discussion.

Future work

Performance

rbd-nbd efficiency suffers from extra context switches and extensive memory copying across the kernel/user-space boundary. Although we can do nothing about the first factor, the second can be improved for large block transfers: the Linux kernel offers zero-copy, page-aligned memory transfers between the kernel and an application via the combination of the splice and vmsplice system calls.
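As a rough, hedged sketch of that direction (Go via golang.org/x/sys/unix; Linux-only, with page-alignment requirements, partial transfers, and error cleanup deliberately ignored), the pattern looks roughly like this:

```go
// Rough sketch only: hand a page-aligned user buffer to a destination fd
// (e.g. a socket) with vmsplice(2) + splice(2) instead of copying via write(2).
package zerocopy

import (
	"os"

	"golang.org/x/sys/unix"
)

func zeroCopyWrite(dst *os.File, buf []byte) error {
	// A pipe serves as the in-kernel staging buffer between the two calls.
	var p [2]int
	if err := unix.Pipe(p[:]); err != nil {
		return err
	}
	defer unix.Close(p[0])
	defer unix.Close(p[1])

	// vmsplice "gifts" the user pages to the pipe instead of copying them.
	iov := []unix.Iovec{{Base: &buf[0]}}
	iov[0].SetLen(len(buf))
	n, err := unix.Vmsplice(p[1], iov, unix.SPLICE_F_GIFT)
	if err != nil {
		return err
	}

	// splice then moves those pages from the pipe to the destination fd.
	_, err = unix.Splice(p[0], nil, int(dst.Fd()), nil, n, unix.SPLICE_F_MOVE)
	return err
}
```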

Fencing

Mirantis is currently working on bringing fencing to librbd on top of the exclusive-lock infrastructure. This would allow moving away from the flawed advisory lock mechanism currently used by the rbd storage plugin.

Dynamic provisioning

Currently RBD images must be created ahead of time by a cluster administrator. We see the possibility of implementing dynamic provisioning for RBD in the future.

jsafrane commented Sep 12, 2016

cc @kubernetes/sig-storage, especially @rootfs

rootfs commented Sep 29, 2016

  • rbd-nbd is new in Ceph 2; existing Ceph 1.x installations don't have it. krbd is the common denominator.
  • There are other bridging technologies to get librbd to present an RBD image as a block device, e.g. tcmu.
  • The drawbacks mentioned are being addressed, e.g. #33660 to resolve the fencing issue.
flyingcougar commented Sep 29, 2016

@rootfs: What do you mean by Ceph 1.x installation? rbd-nbd is part of the Jewel release, which is supported e.g. by ceph-ansible.

rootfs commented Sep 29, 2016

@flyingcougar I meant the client bits - this is where rbd-nbd is called to map the image.

rzarzynski commented Sep 29, 2016

The drawbacks mentioned are being addressed, e.g. #33660 to resolve the fencing issue.

@rootfs: #33660 is still based on the same advisory locking mechanism that was rejected in QEMU because of the conflict with the exclusive-lock flag of an RBD image (enabled by default in the Ceph Jewel LTS release). Please take a look at #33013; the Ceph community has already implemented a dedicated API for fencing.

In addition, I would say that fencing isn't the sole drawback. The feature gap is still there. For instance, what about remote backups and multi-DC support in general?


rzarzynski commented Oct 7, 2016

There are other bridging technologies to get librbd to present an RBD image as a block device, e.g. tcmu.

Agreed, a couple of options exist. I recently talked with @dillaman, and in the long-term perspective there is a plan to develop a TCMU-based bridge for librbd. Anyway, I would expect those bridges (rbd-nbd, rbd-tcmu, rbd-*) to be so similar (as they are based on the same library and have the same features, including fencing) that it makes sense to address them with a single plugin called e.g. rbd-userspace. The abstraction layer would be extremely thin, covering basically only map and unmap.
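To make the idea concrete, here is a hypothetical sketch of such a thin abstraction; the `Mapper` interface and type names are invented for illustration and are not an existing Kubernetes or Ceph API.

```go
// Illustrative only: a minimal user-space RBD bridge abstraction covering just
// map and unmap, as suggested above. All names here are hypothetical.
package rbduserspace

import "errors"

// Image identifies an RBD image within a pool.
type Image struct {
	Pool string
	Name string
}

// Mapper abstracts a user-space bridge (rbd-nbd today, rbd-tcmu later, ...).
type Mapper interface {
	// Map attaches the image and returns the local block device path.
	Map(img Image) (devicePath string, err error)
	// Unmap detaches a previously mapped block device.
	Unmap(devicePath string) error
}

// nbdMapper would shell out to rbd-nbd; a tcmuMapper would instead drive
// targetcli/tcmu-runner. Both would satisfy the same two-method interface,
// so the rest of the plugin stays bridge-agnostic.
type nbdMapper struct{}

func (nbdMapper) Map(img Image) (string, error) {
	return "", errors.New("sketch only: would exec `rbd-nbd map " + img.Pool + "/" + img.Name + "`")
}

func (nbdMapper) Unmap(dev string) error {
	return errors.New("sketch only: would exec `rbd-nbd unmap " + dev + "`")
}
```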


rzarzynski commented Oct 7, 2016

krbd is the common denominator.

I'm afraid this isn't entirely accurate, as the kernel version is a significant degree of freedom. Hitting a version-related limitation tends to be game over, since upgrading the whole kernel to get a newer krbd often isn't an option. Moreover, the problem may become much more visible soon, as Ilya finishes merging some new features into krbd.

A user-space solution wouldn't be tied to the kernel version to the same degree. Upgrading librbd and an adapter like rbd-nbd seems much more straightforward and isolated than a global kernel upgrade.


rootfs commented Oct 10, 2016

Not every bridge is the same. If librbd is implemented as a tcmu-runner handler, we need targetcli to do the setup. The semantics would be different from rbd-nbd.


rzarzynski commented Oct 10, 2016

Not every bridge is the same. If librbd is implemented as a tcmu-runner handler, we need targetcli to do the setup. The semantics would be different from rbd-nbd.

@rootfs: this is why I mentioned the need for the abstraction layer for mapping and unmapping an RBD image.


fire commented Jul 14, 2017

Ping! Are there any new developments on this?

@kfox1111

ping

elaron commented Nov 1, 2017

Ping! Is there any progress here?

kaoet commented Nov 14, 2017

I also want this feature

hdksky commented Jan 15, 2018

Ping! Any progress?

ianchakeres commented Jan 15, 2018

I've started working on this issue with another team member at Salesforce. We'll update this issue when we have a new PR leveraging rbd-nbd.

/assign @ianchakeres

dillaman commented Jan 15, 2018

As an FYI, I believe there is also an effort just getting underway to implement Container Storage Interface drivers for RBD and CephFS for use by k8s [1] to eventually replace the in-tree drivers.

[1] https://github.com/ceph/ceph-csi

jberkus commented Feb 26, 2018

@saad-ali @ianchakeres Code Freeze begins in 6 hours. Will this feature be completed by then? It needs to be status/approved-for-milestone and at least priority/important-soon to stay open after code freeze. Otherwise, please punt it to 1.11.

k8s-merge-robot added a commit that referenced this issue Feb 27, 2018

Merge pull request #58916 from ianchakeres/ceph-nbd
Automatic merge from submit-queue (batch tested with PRs 59674, 60059, 60220, 58916, 60336). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Support rbd-nbd for mounting operations on Ceph rbd volumes

**What this PR does / why we need it**: 
This PR improves Ceph RBD support by allowing the pkg/volume/rbd plugin to use the rbd-nbd client. This client is based on the widely adopted librbd library (built on librados), and it is being actively developed and maintained as part of the Ceph upstream code base, unlike krbd.

**Which issue(s) this PR Fixes**: 
Fixes #32266

**Special notes for your reviewer**:
rbd-nbd will be used for Ceph rbd volumes if rbd fails.

Some inspiration was pulled from these PRs #38936 & #55866.

**Test Description**: Tested against a k8s cluster with centos/7 as the host OS; rbd-nbd installed from the rbd-nbd-10.2.3 RPM package.

Tested:
1. Fall-through to the current rbd map/unmap when no rbd-nbd tools are found.
2. Map/unmap through rbd-nbd.
3. Detecting an image already mapped to an nbd device and skipping the additional mapping.
4. Detecting an image already mapped to an rbd device and skipping the additional mapping through nbd.
5. Unmap on hosts having mixed rbd and nbd devices (caused by fall-throughs for some images).
6. Map failure in rbd-nbd due to a missing image.
7. Map failure in rbd-nbd due to an unreachable mon.
8. Fall-through to the current rbd map when rbd-nbd map fails for any reason.


**Release note**:

```release-note
K8s supports rbd-nbd for Ceph rbd volume mounts.
```
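For readers skimming the thread, here is a hypothetical sketch (not the merged PR code) of one possible reading of test notes 1 and 8 above: attempt rbd-nbd when the tool is installed, otherwise (or on failure) fall back to the classic `rbd map`. The helper name is invented and credential handling is omitted.

```go
// Hypothetical illustration of a fall-through mapping strategy.
package rbdvolume

import (
	"fmt"
	"os/exec"
	"strings"
)

// mapWithFallThrough prefers rbd-nbd and falls back to the krbd path (`rbd map`).
func mapWithFallThrough(pool, image string) (string, error) {
	spec := pool + "/" + image

	// Try rbd-nbd only when the tool is actually installed on the node.
	if _, err := exec.LookPath("rbd-nbd"); err == nil {
		if out, err := exec.Command("rbd-nbd", "map", spec).CombinedOutput(); err == nil {
			return strings.TrimSpace(string(out)), nil // e.g. /dev/nbd0
		}
		// Any rbd-nbd failure falls through to the kernel client below.
	}

	out, err := exec.Command("rbd", "map", spec).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("rbd map failed: %v: %s", err, out)
	}
	return strings.TrimSpace(string(out)), nil // e.g. /dev/rbd0
}
```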

ianchakeres pushed a commit to ianchakeres/kubernetes that referenced this issue Apr 5, 2018

Merge pull request #62168 from piontec/hotfix/fix-rbd-nbd
Automatic merge from submit-queue (batch tested with PRs 62043, 62168). If you want to cherry-pick this change to another branch, please follow the instructions at https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

fix typo that redefines variable and breaks code

**What this PR does / why we need it**:
Fixes a bug in `rbd_util.go` where a variable is overridden (shadowed) in an `if` scope.

**Which issue(s) this PR fixes**
Fixes #32266

**Special notes for your reviewer**:
This is related to PR kubernetes#58916. That code cannot work as intended, because the variable `nbdToolsFound` is only changed in the local scope of the `if` block and is always `false` outside of it.

**Release note**:
```release-note
Fixes bug in rbd-nbd utility when rbd is used.
```
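For context on the bug class, here is an illustrative Go snippet (not the actual `rbd_util.go` code) showing how `:=` inside an `if` block shadows the outer variable, leaving it `false` afterwards:

```go
// Illustrative only: `:=` inside an `if` block creates a new variable that
// shadows the outer one, which is the class of bug fixed by this PR.
package main

import "fmt"

func main() {
	nbdToolsFound := false

	if true {
		// Bug: `:=` declares a new, inner nbdToolsFound; the outer variable
		// is never updated, so it stays false after the block.
		nbdToolsFound := true
		_ = nbdToolsFound
	}
	fmt.Println(nbdToolsFound) // prints false

	if true {
		nbdToolsFound = true // fix: plain `=` assigns to the outer variable
	}
	fmt.Println(nbdToolsFound) // prints true
}
```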