-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Isolation: added support for per-application cgroups. #734
Conversation
One open question is how to handle failures to create or add processes to cgroups, I see three options. a) Ignore failures and carry on I'm leaning towards c. |
If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK) (by fail hard I mean don't allow the reconfiguration and keep the old one). |
but do you have a diff to see what would mean fail hard vs log & continue? |
On Mon, 18 Jul 2022 15:39:29 -0700 Alejandro Colomar ***@***.***> wrote:
If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK).
Yes, start up and reconfigure is when cgroups could be
created/populated.
Actually I haven't noticed if there is any process monitoring code, so
say if an application process dies, unit would re-create it... then it
would become a runtime failure, but as I say I haven't noticed that
there is any such code.
|
On Mon, 18 Jul 2022 15:49:00 -0700 Alejandro Colomar ***@***.***> wrote:
@alejandro-colomar commented on this pull request.
> +static nxt_int_t
+nxt_cgroup_create(const nxt_task_t *task, nxt_process_t *process)
+{
+ char cgroot[PATH_MAX];
+ char cgpath[PATH_MAX];
+
+ if (task->thread->runtime->type != NXT_PROCESS_MAIN
+ || nxt_process_type(process) != NXT_PROCESS_PROTOTYPE) {
+ return NXT_OK;
+ }
+
+ if (process->isolation.cgroup.name == NULL) {
+ return NXT_OK;
+ }
+
+ nxt_make_cgpaths(process, cgpath, sizeof(cgpath), cgroot, sizeof(cgroot));
I have a longstanding concern about this kind of sizeof usage:
You know, I might just get rid of that function, I wasn't particularly
happy with it myself and only the cleanup function actually needs both
cgpath and cgroot.
|
This updated commit should address all the above feedback (except the error handling). Notable changes are
|
Apart from most the above, the biggest change here is to unconditionally include the cgroup_cleanup member in nxt_process_isolation_t, that way we can get rid of the #if/#endif from nxt_main_process_cleanup(). |
ba83e0c
to
226cb99
Compare
Rebased with master |
Changes :-
|
What do you think about this simplification? Since we know truncation is not possible, it makes sense to simply not check for it. Let's reconsider; we almost shipped a bug that was in code supposedly preventing bugs... Simple code has less bugs. And given that snprintf(3) is so misdesigned that it's hard to use it correctly, I prefer going for stpcpy(3). However, if we apply this patch, I'd write a clear note in the commit message that truncation is not possible, and detail that we already check for it in the config validation code (and BTW, it makes sense to me to validate only once). Cheers, diff --git a/src/nxt_cgroup.c b/src/nxt_cgroup.c
index 20497a45..0d1ec4f0 100644
--- a/src/nxt_cgroup.c
+++ b/src/nxt_cgroup.c
@@ -8,9 +8,9 @@
#include <nxt_cgroup.h>
-static int nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir,
+static char *nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir,
char *cgpath);
-static nxt_int_t nxt_mk_cgpath(nxt_task_t *task, const char *dir,
+static char *nxt_mk_cgpath(nxt_task_t *task, const char *dir,
char *cgpath);
@@ -18,6 +18,7 @@ nxt_int_t
nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
{
int len;
+ char *p;
char cgprocs[PATH_MAX];
FILE *fp;
nxt_int_t ret;
@@ -29,8 +30,8 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
return NXT_OK;
}
- ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgprocs);
- if (nxt_slow_path(ret == NXT_ERROR)) {
+ p = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgprocs);
+ if (nxt_slow_path(p == NULL)) {
return NXT_ERROR;
}
@@ -39,13 +40,7 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
return NXT_ERROR;
}
- len = strlen(cgprocs);
-
- len = snprintf(cgprocs + len, PATH_MAX - len, "/cgroup.procs");
- if (nxt_slow_path(len >= PATH_MAX - len)) {
- nxt_errno = ENAMETOOLONG;
- return NXT_ERROR;
- }
+ stpcpy(p, "/cgroup.procs");
fp = nxt_file_fopen(task, cgprocs, "we");
if (nxt_slow_path(fp == NULL)) {
@@ -67,17 +62,16 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
void
nxt_cgroup_cleanup(nxt_task_t *task, const nxt_process_t *process)
{
- char *ptr;
- char cgroot[PATH_MAX], cgpath[PATH_MAX];
- nxt_int_t ret;
+ char *ptr, *ret;
+ char cgroot[PATH_MAX], cgpath[PATH_MAX];
ret = nxt_mk_cgpath(task, "", cgroot);
- if (nxt_slow_path(ret == NXT_ERROR)) {
+ if (nxt_slow_path(ret == NULL)) {
return;
}
ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath);
- if (nxt_slow_path(ret == NXT_ERROR)) {
+ if (nxt_slow_path(ret == NULL)) {
return;
}
@@ -89,11 +83,11 @@ nxt_cgroup_cleanup(nxt_task_t *task, const nxt_process_t *process)
}
-static int
+static char *
nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
{
- int i, len;
- char *buf, *ptr;
+ int i;
+ char *buf, *ptr, *p;
FILE *fp;
size_t size;
ssize_t nread;
@@ -101,10 +95,10 @@ nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
fp = nxt_file_fopen(task, "/proc/self/cgroup", "re");
if (nxt_slow_path(fp == NULL)) {
- return -1;
+ return NULL;
}
- len = -1;
+ p = NULL;
buf = NULL;
found = 0;
while ((nread = getline(&buf, &size, fp)) != -1) {
@@ -133,21 +127,22 @@ nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
ptr++;
}
- len = snprintf(cgpath, PATH_MAX, NXT_CGROUP_ROOT "%s/%s", ptr, dir);
+ p = stpcpy(cgpath, NXT_CGROUP_ROOT);
+ p = stpcpy(p, ptr);
+ *p++ = '/';
+ p = stpcpy(p, dir);
out_free_buf:
nxt_free(buf);
- return len;
+ return p;
}
-static nxt_int_t
+static char *
nxt_mk_cgpath(nxt_task_t *task, const char *dir, char *cgpath)
{
- int len;
-
/*
* If the path from the config is relative, we need to make
* the cgroup path include the main unit processes cgroup. I.e
@@ -155,19 +150,8 @@ nxt_mk_cgpath(nxt_task_t *task, const char *dir, char *cgpath)
* NXT_CGROUP_ROOT/<main process cgroup>/<cgroup path>
*/
if (dir[0] != '/') {
- len = nxt_mk_cgpath_relative(task, dir, cgpath);
+ return nxt_mk_cgpath_relative(task, dir, cgpath);
} else {
- len = snprintf(cgpath, PATH_MAX, NXT_CGROUP_ROOT "%s", dir);
+ return stpcpy(stpcpy(cgpath, NXT_CGROUP_ROOT), dir);
}
-
- if (len == -1) {
- return NXT_ERROR;
- }
-
- if (len >= PATH_MAX) {
- nxt_errno = ENAMETOOLONG;
- return NXT_ERROR;
- }
-
- return NXT_OK;
} $ git diff --stat
src/nxt_cgroup.c | 62 +++++++++++++++++++++++---------------------------------------
1 file changed, 23 insertions(+), 39 deletions(-) |
While we do check at config time (you could argue it's superfluous, but for the absolute path case it's a quick sanity check before things go too far) for the absolute path case, we don't for the relative path case, so truncation checks are still required. I think it's best to leave it as is at this late stage. |
Yes, I have to agree. We can merge the patch as is, which looks good to me (ugly but good), and then improve it afterwards. Cheers, Alex |
Hi,
I suggest we finish the feature one time.
|
The initial cgroups support is done. There is no 1.30 version, extra features will likely be added but there is no time frame. As agreed this will be committed imminently. If you waited until things were perfect you'd never ship anything! |
My improvement is related to the bug that I reported last week about buffer overruns. I just found another case of a buffer overrun in another part of Unit. Since this case from Andrew was safe (for some particular meaning of safe) so far, I'm fine with merging it, and fixing the bigger issue separately in 1.30. The only remaining issue before releasing 1.29 is the isolation bug that I fixed, so I prefer focusing on that. I'll soon post my tests, and we can maybe develop some pytests for that. |
There are no issues as we discussed, maybe we can remove some unnecessary checks. |
I mean I found another issue in a different part of Unit. I'll share, since I found it's not exploitable remotely. To reproduce, you need to control the app name. |
Ok, welcome to create individual issues :) |
Firstly, this is not to be confused with CLONE_NEWCGROUP which unit
already supports and is related to namespaces. To re-cap, namespaces
allow processes to have different views of various parts of the system
such as filesystem mounts, networking, hostname etc.
Whereas cgroup0 is a Linux kernel facility for collecting a bunch of
processes together to perform some task on the group as a whole, for
example to implement resource limits.
There are two parts to cgroup, the core part of organising processes
into a hierarchy and the controllers which are responsible for enforcing
resource limits etc.
There are currently two versions of the cgroup sub-system, the original
cgroup and a version 21 introduced in 3.16 (August 2014) and marked
stable in 4.5 (March 2016).
This commit supports the cgroup V2 API and implements the ability to
place applications into their own cgroup on a per-application basis.
You can put them each into their cgroup or you can group some together.
The ability to set resource limits can easily be added in future.
The initial use case of this would be to aid in observability of unit
applications which becomes much easier if you can just monitor them on a
per cgroup basis.
One thing to note about cgroup, is that unlike namespaces which are
controlled via system calls such as clone(2) and unshare(2), cgroups are
setup and controlled through the cgroupfs pseudo-filesystem.
cgroup is Linux only and this support will only be enabled if configure
finds the cgroup2 filesystem mount, e.g
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)
To make use of this in unit a new "cgroup" section has been added to the
isolation configuration.
e.g
Will place the "python" application into its own cgroup under
CGROUP_ROOT/unit/python
If you know of an already existing cgroup where you's like to place it,
you can, e.g
"name": "system.slice/unit/python"
Where system.slice has already been created by systemd and may already
have some overall system limits applied which would also apply to unit.
Limits apply down the hierarchy and lower groups can't exceed the
previous group limits.
So what does this actually look like? Lets take the unit-calculator
application2 and have each of its applications placed into their own
cgroup. If we give each application a new section like
changing the name for each one, we can visualise the result with the
systemd-cgls command, e.g
We can see that the main unit processes are in the same cgroup as the
shell from where they were started, by default child process are placed
into the same cgroup as the parent.
Then we can see that each application has been placed into its own
cgroup.
The cgroups are removed on shutdown or as required on reconfiguration.