New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add definitions for RDMA controller/cgroup of Linux kernel 4.11 #942

Merged
merged 1 commit into from Mar 1, 2018

Conversation

Projects
None yet
4 participants
@paravmellanox
Contributor

paravmellanox commented Dec 7, 2017

This request is for the inclusion of RDMA controller/cgroup which is added in Linux kernel 4.11.

Background:
RDMA Networking adapters provide low latency, direct access to hardware (HCA) to user space applications to send/receive messages and data in a clustered systems.
With latest HCAs, application to application level messages delivery latency can be as low as < 1usec.
RDMA kernel subsystem achieves this using RDMA resources such as QPs, MRs CQs objects.

RDMA cgroup allows limiting resources of the such RDMA HCA networking adapters.

RDMA resource creations such as hca_handle and hca_objects must be controlled so that certain set of processes doesn't take away all hardware resources.

RDMA controller allows limiting such HCA resource on per HCA basis.

Example implementation of this controller is already available below.
paravmellanox/cgroups@469ac1c

@wking

Once this settles down, it would be good to get JSON Schema entries too.

@@ -169,7 +169,7 @@ In addition to any devices configured with this setting, the runtime MUST also s
## <a name="configLinuxControlGroups" />Control groups
Also known as cgroups, they are used to restrict resource usage for a container and handle device access.
cgroups provide controls (through controllers) to restrict cpu, memory, IO, pids and network for the container.
cgroups provide controls (through controllers) to restrict cpu, memory, IO, pids, network and rdma resources for the container.

This comment has been minimized.

@wking

wking Dec 7, 2017

Contributor

The kernel docs aren't particularly consistent about this, but since RDMA is an acronym for remote direct memory access, I think we want to use RDMA here (and in other places, except for the literal rdma).

This comment has been minimized.

@paravmellanox

paravmellanox Dec 8, 2017

Contributor

ok. yes, I will change rdma to RDMA.

* **`hca_device`** *(string, REQUIRED)* - specifies the device name whose resources limit to be configured
* **`hca_handles`** *(uint32, REQUIRED)* - specifies the maximum number of hca_objects in the cgroup for a specified device
* **`hca_objects`** *(uint32, REQUIRED)* - specifies the maximum number of hca_handles in the cgroup for a specified device

This comment has been minimized.

@wking

wking Dec 7, 2017

Contributor

You have “Default is "no limit".” in the Go comments, but REQUIRED here. Reading the kernel docs, I think hca_handles and hca_objects should both be OPTIONAL. If you like, you can add a requirement like this

You MUST specify at least one of hca_handles or hca_objects in a given entry, and MAY specify both.

The kernel docs also make a distinction between leaving those unset (no change from the previous limit) or setting them to max (clearing a possible previous limit). Since the spec supports joining and tweaking existing cgroups, I think we need a way to distinguish between the two cases. The current policy seems to be using signed integers and using -1 for “explicitly remove any limits”.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 8, 2017

Contributor

You are right. I initially had it optional but than made it required to simplify the -1. But I think its fine as it already exist.
I will make both of them as optional.
So when user doesn't want to configure any one of the value, it should set to -1. Correct?
Because at kernel level they are just optional KV pair, but at cgroups spec level it is a structure. So no_limit should be initialized as -1. Right?

This comment has been minimized.

@wking

wking Dec 8, 2017

Contributor

So when user doesn't want to configure any one of the value, it should set to -1. Correct?

No, leave it unset to not change the limit (nothing written to the control file for this parameter). Set to -1 to clear the limit (max written to the cgroup control file for this parameter).

This comment has been minimized.

@paravmellanox

paravmellanox Dec 8, 2017

Contributor

so when cgroups.New() is invokes, there are 4 valid values for hca_objects/hca_handles.

  1. 0 - no allocation possible
  2. +ve - valid positive value
  3. -1 to clear the limits
  4. do not touch the limits

If we make these two parameters optional, what should user set in specs.LinuxRdma.HcaObjects to convey - 'do not touch the limits'.
Setting 0 will disable it (which is valid value).

This comment has been minimized.

@wking

wking Dec 8, 2017

Contributor

... what should user set in specs.LinuxRdma.HcaObjects to convey - 'do not touch the limits'.

nil, which is why we want pointers.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 8, 2017

Contributor

ok. Sounds good. I will revise the spec.

// Maximum number of HCA handles that can be opened. Default is "no limit".
HcaHandles uint32 `json:"hca_handles"`
// Maximum number of HCA objects that can be created. Default is "no limit".
HcaObjects uint32 `json:"hca_objects"`

This comment has been minimized.

@wking

wking Dec 7, 2017

Contributor

Are zero values allowed (e.g. for “you can't use any HCA handles”)? If not, we should say so in the Markdown spec. If so, we should make these omitempty pointers so we can distinguish between “set to zero” and “unset”.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 7, 2017

Contributor

yes. setting zero is allowed.

@wking

Also on the “don't allow no-op noise” front would be requiring at least one of hca_handles or hca_objects to be set. If you add that contraint, no-op configs like "rdma": [{"hca_device": "m1x5_1"}] become illegal. I'm not strongly against no-op noise, but I am weakly against it to maintain consistency with the existing block I/O pattern.

The following parameters can be specified to set up the controller:
* **`hca_device`** *(string, OPTIONAL)* - specifies the device name whose resources limit to be configured

This comment has been minimized.

@wking

wking Dec 8, 2017

Contributor

I'd rather keep this REQUIRED, because there's no point in supporting "rdma": [{}]. It also lets you drop the later line about “You must specify a valid hca_device…”.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 9, 2017

Contributor

ok. so if I understand correctly, "rdma": [{}] can be treated as illegal config.
"rdma": [{"hca_device": "m1x5_1"}] is valid. And other two parameters are optional anyway.

So you are saying if end user passes hca_device only then we create the rdma controller?
Won't Docker or other engines create all empty cgroup controllers as "rdma": [{}]?

This comment has been minimized.

@wking

wking Dec 9, 2017

Contributor

so if I understand correctly, "rdma": [{}] can be treated as illegal config

That's what you'd get if you REQUIRED hca_device (which I think is a good idea for consistency with rlimits[].type ).

"rdma": [{"hca_device": "m1x5_1"}] is valid…

As you have it now. But if you requiring at least one of hca_handles or hca_objects to be set, then that would be invalid too (and we'd be consistent with the current block I/O approach).

So you are saying if end user passes hca_device only then we create the rdma controller? Won't Docker or other engines create all empty cgroup controllers as "rdma": [{}]?

If you have rdma unset or set to [{}] (or [{"hca_device": "m1x5_1"}], etc.), you're not adding new limits. The spec position on that is:

Runtimes MAY attach the container process to additional cgroup controllers beyond those necessary to fulfill the resources settings.

which means that, for a config that does not adjust limits via linux.resources.rdma, the runtime can attach the container process to the rdma controller at cgroupsPath or not as it sees fit. See #493, #834, and opencontainers/runc#861 for more background.

In case it helps clarify, I think this section of text should look like:

Each entry has the following structure:

* **`hca_device`** *(string, REQUIRED)* - specifies the device name whose resources limit to be configured
* **`hca_handles`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_objects in the cgroup for a specified device
* **`hca_objects`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_handles in the cgroup for a specified device

You MUST specify at least one of `hca_handles` or `hca_objects` in a given entry, and MAY specify both.

To close another no-op loophole we could follow this and require rdma to, when set, contain at least one entry. That would be new wording though, since the spec doesn't currently do that for any OPTIONAL property.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 9, 2017

Contributor

section of text that you described looks good to me. If we follow the text you described, it closes the no-op loophole too, isn't it for rdma controller?

This comment has been minimized.

@wking

wking Dec 9, 2017

Contributor

If we follow the text you described, it closes the no-op loophole too, isn't it for rdma controller?

You could still have "rdma": [] unless you REQUIRE the array to have at least one entry. We don't close that loophole for any other OTIONAL properties at the moment (e.g. "mounts": [] is ok), so we'd need maintainer input on whether it was worth setting a new precedent.

Another issue is the possibility of duplicate hca_device entries. My preferred approach to tgat would be to use an object:

"rdma": {
  "m1x5_1": {
    "hca_handles": 3
  }
}

My attempt to do that for rlimits was rejected (#583), but I'm not clear on why. It's possible that maintainers would support an object for this new property.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 11, 2017

Contributor

This is case (1), where runtimes MAY create the cgroup (if necessary) and attach the container process to the cgroup. Apparently containerd/cgroups decides to do that. But runtimes don't have to. And not creating the cgroup (and not attaching the container process to the non-existent cgroup) is the lazy-cgroups approach discussed in opencontainers/runc#861. If containerd/cgroups decided to implement the lazy-cgroups approach on nil, their Create would instead look like:

I agree to what you described.
However current scheme (1) of runc-spec and its implementation sounds reasonble to me and fits the requirements of rdma too without #861.
When opencontainers/runc#861 scheme is adopted, rdma could possibly follow that.

In order to make progress, I propose to move forward with currently defined scheme and implementation of (1), (2) and (3).
(2) and (3) are already matches generic runc-spec definition.

(1) is implemented by all cgroups (not just pids) in https://github.com/containerd/cgroups.
If (1) needs change in runc-spec definition or wording, it is going to be different PR which will affect other resources too.

This comment has been minimized.

@wking

wking Dec 11, 2017

Contributor

If (1) needs change in runc-spec definition or wording, it is going to be different PR which will affect other resources too.

That makes sense to me. I had been under the impression (based on this comment), that you intended to start requiring cgroup creation (if necessary) and joining in the "rmda": {} case, which is not a requirement that the current spec has for other controllers. Currently the spec, as I read it, allows runtimes to decide whether they want to create/join or not for both the “unset” (e.g. "resources": {}) and “set empty” cases (e.g. "resources": {"memory": {}}) because of this clause. If you're ok using that same logic (for now) for rmda, then I'm back to preferring the wording I floated here.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 11, 2017

Contributor

That makes sense to me. I had been under the impression (based on this comment), that you intended to start requiring cgroup creation (if necessary) and joining in the "rmda": {} case, which is not a requirement that the current spec has for other controllers. Currently the spec, as I read it, allows runtimes to decide whether they want to create/join or not for both the “unset” (e.g. "resources": {}) and “set empty” cases (e.g. "resources": {"memory": {}}) because of this clause. If you're ok using that same logic (for now) for rmda, then I'm back to preferring the wording I floated here.

Based on the clause you mention https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config-linux.md#L197 and based on implementation at https://github.com/containerd/cgroups, which follows the MAY part, -> "rdma": [{}] and "rdma": [{"hca_device": "m1x5_1"}] is also valid similar to priorities array of objects in Network cgroup as better example than Pids. For rdma instead of array of objects, as you suggested, we will have map as

"rdma": {
  "m1x5_1": {
    "hca_handles": 3
  }
}

Are you ok with that?

This comment has been minimized.

@wking

wking Dec 11, 2017

Contributor

For rdma instead of array of objects, as you suggested, we will have map as

"rdma": {
  "m1x5_1": {
    "hca_handles": 3
  }
}

Are you ok with that?

Ah, right, that would change the wording. I'm in favor of the object/map approach, yes.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 11, 2017

Contributor

ok. Great. So let me revise the spec to have treat them as optional and as map instead of array.

// LinuxRdma for Linux cgroup 'rdma' resource management (Linux 4.11)
type LinuxRdma struct {
// Hca device name whose resources to be restricted
HcaDevice string `json:"hca_device,omitempty"`

This comment has been minimized.

@wking

wking Dec 8, 2017

Contributor

If you go back to REQUIRing hca_device, you'll want to drop omitempty here.

@paravmellanox

Please review the updated changes. Reference implementation also updated to match these updates.
Once spec is approved, will send PR for https://github.com/paravmellanox/cgroups/blob/master/rdma.go

```json
"rdma": {
"limits": [
"mlx5_1": {

This comment has been minimized.

@wking

wking Dec 16, 2017

Contributor

This is not valid JSON. I think you want to drop limits and just make rdma an object with device-name keys and handles/objects-limit values.

type LinuxRdma struct {
// Limits are a set of key value pairs that define RDMA resource limits,
// where the key is device name and value is resource limits.
Limits map[string]LinuxRdmaLimit `json:"limits,omitempty"`

This comment has been minimized.

@wking

wking Dec 16, 2017

Contributor

This is whatI think you want for rdma; no need for a Limits. Just use:

type LinuxRdma map[string]LinuxRdmaLimit

Or skip LinuxRdma entirely and just inline map[string]LinuxRdmaLimitinLinuxResources`.

This comment has been minimized.

@paravmellanox

paravmellanox Dec 17, 2017

Contributor

Or skip LinuxRdma entirely and just inlinemap[string]LinuxRdmaLimitinLinuxResources`.
Simplification looks fine to me. Sending updates.

@paravmellanox

can you please approve the changes if they look good now, we need to get this going as users are requesting it.
Hi @wking , any more changes to be done or we are good to go. I believe we addressed all the comments.
We like to get this small part integrated so that we can make this available to users.

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Jan 23, 2018

@wking any updates? Do you need any more information for this PR?

@wking

This comment has been minimized.

Contributor

wking commented Jan 24, 2018

@wking

This comment has been minimized.

Contributor

wking commented Jan 24, 2018

With paravmellanox#1 merged, the new content looks good to me. You might want to squash this PR down to a single commit. And passing tests are blocked on #945. Any maintainers want to look this over?

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Jan 24, 2018

I am not sure that if I squash and force push the new squashed commit, will the discussions happened here will be lost or not.
I was hoping that If maintainer does squash merge, than discussion is preserved and final merged code also has single commit.
I little new in this area, so please guild me.

@wking

This comment has been minimized.

Contributor

wking commented Jan 24, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Jan 26, 2018

Hi @wking, @mrunalp, and other reviewers,
I squashed all the commits and now have single commit for this PR.
Can you please review and merge it?
Code will follow soon sometime next week once this is merged.

@wking

wking approved these changes Jan 26, 2018

846209f looks good to me. We'll see what the maintainers think ;).

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Jan 26, 2018

@wking , thanks a lot for the inputs, fixup and additional JSON definitions.

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 12, 2018

Hi,

@vishh @vbatts @philips @mrunalp @hqhq @dqminh @tianon @crosbymichael @caniszczyk

Can you please review this PR? @wking and I have went through several iterations. Feel that this is good start now and like to merge this PR at earliest to provide this feature to users waiting for it.

The name of the device to limit is the entry key.
Entry values are objects with the following properties:
* **`hca_handles`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_objects in the cgroup

This comment has been minimized.

@crosbymichael

crosbymichael Feb 12, 2018

Member

There should be no _ in the json properties

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Feb 12, 2018

Overall, this looks good to me once the _ are removed from the json.

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 12, 2018

Removed the underscore and updated the PR.

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Feb 12, 2018

Why did you close?

@paravmellanox paravmellanox reopened this Feb 12, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 12, 2018

Closed by mistake. I am sorry.

Hi @crosbymichael, would you please marked as reviewed/approved?
Also who else from the group should complete the review?

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Feb 16, 2018

Can you please rebase on master to pick up the changes that fix the CI?

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Feb 16, 2018

Sorry, did you merge or rebase?

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Feb 16, 2018

git fetch origin && git reset --hard origin/master
git cherry-pick 5f778472aa6a2f168964e9105ecd797fe1426372

Then just push again
;)

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 16, 2018

Got closed during rebase process, must be something wrong I did from the GUI interface. Reopening it again.

@paravmellanox paravmellanox reopened this Feb 16, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 16, 2018

After the rebase, travis-ci still seems to fail while executing:
make -C schema test

ERROR: invalid document-file path: stat test/config/good/invalid-json.json: no such file or directory

testing test/config/bad/linux-rdma.json
The document is valid
received unexpected validation success
make: *** [test] Error 1

@wking or @crosbymichael, how to check which out of the 2 check failed the build and where to see what caused the failure if it is due to error in linux-rdma.json

@wking

This comment has been minimized.

Contributor

wking commented Feb 16, 2018

After the rebase…

Looks like you haven't pushed the cherry-pick:

$ git fetch origin
$ git log -1 --oneline --decorate origin/pr/942
5d9bbad (origin/pr/942, origin/pr/936) config: Dedent root paragraphs, since they aren't a list entry
@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 16, 2018

I followed below steps, likely not optimal.

  1. Reset the head of paravmellanox/runtime-spec master branch to an older version.
  2. Rebase & merge from opencontainers/runtime-spec.
  3. git am <rdma-cgroup.patch>
  4. git push origin master

This updated the PR with 4 patches (patches after rebase and new patch).
I am not sure how to avoid these patches which came as rebase to not show up in old PR.

As a result of this push command, travis-ci build triggered.
It looked at test/config/good/invalid-json.json

schema/test/config/good directory contains only 3 files. minimal*.json and spec-example.json.
invalid-json.json file is not found. It is present in test/config/bad directory.

@wking

This comment has been minimized.

Contributor

wking commented Feb 16, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 20, 2018

Hi @wking, I did follow the steps as described. However I continue to see the error:

ERROR: invalid document-file path: stat test/config/good/invalid-json.json: no such file or directory

As I mentioned, this file doesn't exist in https://github.com/opencontainers/runtime-spec/tree/master/schema/test/config/good directory where Travis-CI is looking for a file.
It is instead located in https://github.com/opencontainers/runtime-spec/tree/master/schema/test/config/bad

Or may be the error is actually below one?

testing test/config/bad/linux-rdma.json
The document is valid
received unexpected validation success

@wking wking referenced this pull request Feb 20, 2018

Merged

schema/Makefile: fix test #947

@wking

This comment has been minimized.

Contributor

wking commented Feb 20, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 22, 2018

Hi @wking, @crosbymichael, Is there anything I can do to remove the bump of #947? I do not have much know-how of Travis-CI though, but I can attempt.

@wking

This comment has been minimized.

Contributor

wking commented Feb 22, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 22, 2018

Sure. @wking lets wait for another day to resolve #947.

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 25, 2018

In order to make progress to make use of this functionality, I updated PR to drop schema/test/config/bad/linux-rdma.json for now. @crosbymichael and others please continue to review and merge the PR.

#### Example
```json
"rdmaLimits": {

This comment has been minimized.

@crosbymichael

crosbymichael Feb 26, 2018

Member

is there a reason why you named this rdmaLimits instead of just rdma?

This comment has been minimized.

@wking

wking Feb 27, 2018

Contributor

is there a reason why you named this rdmaLimits instead of just rdma?

I like rdma better, although I don't care much. Previously, @paravmellanox cited hugepageLimits for precedent, although the spec obviously also contains limiting properties without a Limits suffix.

@wking

This comment has been minimized.

Contributor

wking commented Feb 27, 2018

In order to make progress to make use of this functionality, I updated PR to drop schema/test/config/bad/linux-rdma.json for now.

#947 landed today, so you can restore the bad example.

specs-go/config: Define RDMA cgroup
Linux kernel 4.11 adds support for RDMA cgroup resource controller.
This allows limiting maximum number of open hca_handle and maximum number of
hca_objects which can be created by processes.

config-linux: Add documentation for Linux RDMA cgroup

Add documentation, example and link to kernel documentation for
Linux RDMA cgroup.

additionalProperties is defined for the JSON Schema draft-04 in [1]
with clearer documentation in draft-07 [2]. It is supportd by
gojsonschema since xeipuuv/gojsonschema@0572d9d (added
additionalProperties with inner schema, 2013-06-21).

[1]: https://tools.ietf.org/html/draft-fge-json-schema-validation-00#section-5.4.4
[2]: https://tools.ietf.org/html/draft-handrews-json-schema-validation-00#section-6.5.6

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: W. Trevor King <wking@tremily.us>
@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Feb 27, 2018

both changes done. @wking , @crosbymichael
rdmaLimits -> rdma.
Added back bad/linux-rdma.json file.
Travis-CI build is passing now.

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Feb 27, 2018

LGTM

Approved with PullApprove

@wking

wking approved these changes Feb 27, 2018

@paravmellanox

This comment has been minimized.

Contributor

paravmellanox commented Mar 1, 2018

@crosbymichael, who else from the group should be reviewing these changes?

@crosbymichael

This comment has been minimized.

Member

crosbymichael commented Mar 1, 2018

@mrunalp

This comment has been minimized.

Contributor

mrunalp commented Mar 1, 2018

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit fa4b36a into opencontainers:master Mar 1, 2018

2 checks passed

code-review/pullapprove Approved by crosbymichael, mrunalp
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment