Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Isolation: added support for per-application cgroups. #734

Closed
wants to merge 0 commits into from

Conversation

ac000
Copy link
Member

@ac000 ac000 commented Jul 18, 2022

Firstly, this is not to be confused with CLONE_NEWCGROUP which unit
already supports and is related to namespaces. To re-cap, namespaces
allow processes to have different views of various parts of the system
such as filesystem mounts, networking, hostname etc.

Whereas cgroup0 is a Linux kernel facility for collecting a bunch of
processes together to perform some task on the group as a whole, for
example to implement resource limits.

There are two parts to cgroup, the core part of organising processes
into a hierarchy and the controllers which are responsible for enforcing
resource limits etc.

There are currently two versions of the cgroup sub-system, the original
cgroup and a version 21 introduced in 3.16 (August 2014) and marked
stable in 4.5 (March 2016).

This commit supports the cgroup V2 API and implements the ability to
place applications into their own cgroup on a per-application basis.
You can put them each into their cgroup or you can group some together.
The ability to set resource limits can easily be added in future.

The initial use case of this would be to aid in observability of unit
applications which becomes much easier if you can just monitor them on a
per cgroup basis.

One thing to note about cgroup, is that unlike namespaces which are
controlled via system calls such as clone(2) and unshare(2), cgroups are
setup and controlled through the cgroupfs pseudo-filesystem.

cgroup is Linux only and this support will only be enabled if configure
finds the cgroup2 filesystem mount, e.g

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)

To make use of this in unit a new "cgroup" section has been added to the
isolation configuration.

e.g

  "applications": {
      "python": {
          "type": "python",
          "processes": 5,
          "path": "/opt/unit/unit-cgroup-test/",
          "module": "app",

          "isolation": {
              "cgroup": {
                  "name": "unit/python"
              }
          }
      }
  }

Will place the "python" application into its own cgroup under
CGROUP_ROOT/unit/python

If you know of an already existing cgroup where you's like to place it,
you can, e.g

"name": "system.slice/unit/python"

Where system.slice has already been created by systemd and may already
have some overall system limits applied which would also apply to unit.
Limits apply down the hierarchy and lower groups can't exceed the
previous group limits.

So what does this actually look like? Lets take the unit-calculator
application2 and have each of its applications placed into their own
cgroup. If we give each application a new section like

  "isolation": {
      "cgroup": {
          "name": "unit/unit-calculator/add"
      }
  }

changing the name for each one, we can visualise the result with the
systemd-cgls command, e.g

  │   └─session-5.scope (#4561)
  │     ├─  6667 sshd: andrew [priv]
  │     ├─  6684 sshd: andrew@pts/0
  │     ├─  6685 -bash
  │     ├─ 12632 unit: main v1.28.0 [/opt/unit/sbin/unitd --control 127.0.0.1:808>
  │     ├─ 12634 unit: controller
  │     ├─ 12635 unit: router
  │     ├─ 13550 systemd-cgls
  │     └─ 13551 less
  ├─unit (#4759)
  │ └─unit-calculator (#5037)
  │   ├─subtract (#5069)
  │   │ ├─ 12650 unit: "subtract" prototype
  │   │ └─ 12651 unit: "subtract" application
  │   ├─multiply (#5085)
  │   │ ├─ 12653 unit: "multiply" prototype
  │   │ └─ 12654 unit: "multiply" application
  │   ├─divide (#5101)
  │   │ ├─ 12671 unit: "divide" prototype
  │   │ └─ 12672 node divide.js
  │   ├─sqroot (#5117)
  │   │ ├─ 12679 unit: "sqroot" prototype
  │   │ └─ 12680 /home/andrew/src/unit-calculator/sqroot/sqroot
  │   └─add (#5053)
  │     ├─ 12648 unit: "add" prototype
  │     └─ 12649 unit: "add" application

We can see that the main unit processes are in the same cgroup as the
shell from where they were started, by default child process are placed
into the same cgroup as the parent.

Then we can see that each application has been placed into its own
cgroup.

The cgroups are removed on shutdown or as required on reconfiguration.

@ac000
Copy link
Member Author

ac000 commented Jul 18, 2022

One open question is how to handle failures to create or add processes to cgroups, I see three options.

a) Ignore failures and carry on
b) Fail hard
c) Carry on but log failures

I'm leaning towards c.

@alejandro-colomar
Copy link
Contributor

alejandro-colomar commented Jul 18, 2022

If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK) (by fail hard I mean don't allow the reconfiguration and keep the old one).

src/nxt_cgroup.c Outdated Show resolved Hide resolved
auto/cgroup Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.h Outdated Show resolved Hide resolved
src/nxt_cgroup.h Outdated Show resolved Hide resolved
@alejandro-colomar
Copy link
Contributor

If a client can trigger the failure, we shouldn't fail, but I don't expect that to be possible. This is a [re]configuration-time failure, right? I'd just fail hard (and that's what we do in most cases, AFAIK).

but do you have a diff to see what would mean fail hard vs log & continue?

@ac000
Copy link
Member Author

ac000 commented Jul 18, 2022 via email

@ac000
Copy link
Member Author

ac000 commented Jul 19, 2022 via email

@ac000
Copy link
Member Author

ac000 commented Jul 21, 2022

This updated commit should address all the above feedback (except the error handling).

Notable changes are

  • The removal of nxt_make_cgpaths(), it was introduced for a problem that no longer exists.
  • The auto/cgroup script has been re-worked to make it POSIX compliant.

src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_process.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.h Outdated Show resolved Hide resolved
@ac000
Copy link
Member Author

ac000 commented Aug 13, 2022

Apart from most the above, the biggest change here is to unconditionally include the cgroup_cleanup member in nxt_process_isolation_t, that way we can get rid of the #if/#endif from nxt_main_process_cleanup().

src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
src/nxt_cgroup.c Outdated Show resolved Hide resolved
@ac000 ac000 force-pushed the cgroup branch 2 times, most recently from ba83e0c to 226cb99 Compare December 9, 2022 14:14
@ac000
Copy link
Member Author

ac000 commented Dec 9, 2022

Rebased with master

@ac000
Copy link
Member Author

ac000 commented Dec 9, 2022

Changes :-

  • PATH_MAX -> NXT_MAX_PATH_LEN
  • Better termination of the prototype processes in case of error

@alejandro-colomar
Copy link
Contributor

alejandro-colomar commented Dec 9, 2022

What do you think about this simplification? Since we know truncation is not possible, it makes sense to simply not check for it. Let's reconsider; we almost shipped a bug that was in code supposedly preventing bugs... Simple code has less bugs. And given that snprintf(3) is so misdesigned that it's hard to use it correctly, I prefer going for stpcpy(3).

However, if we apply this patch, I'd write a clear note in the commit message that truncation is not possible, and detail that we already check for it in the config validation code (and BTW, it makes sense to me to validate only once).

Cheers,
Alex

diff --git a/src/nxt_cgroup.c b/src/nxt_cgroup.c
index 20497a45..0d1ec4f0 100644
--- a/src/nxt_cgroup.c
+++ b/src/nxt_cgroup.c
@@ -8,9 +8,9 @@
 #include <nxt_cgroup.h>
 
 
-static int nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir,
+static char *nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir,
     char *cgpath);
-static nxt_int_t nxt_mk_cgpath(nxt_task_t *task, const char *dir,
+static char *nxt_mk_cgpath(nxt_task_t *task, const char *dir,
     char *cgpath);
 
 
@@ -18,6 +18,7 @@ nxt_int_t
 nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
 {
     int        len;
+    char       *p;
     char       cgprocs[PATH_MAX];
     FILE       *fp;
     nxt_int_t  ret;
@@ -29,8 +30,8 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
         return NXT_OK;
     }
 
-    ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgprocs);
-    if (nxt_slow_path(ret == NXT_ERROR)) {
+    p = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgprocs);
+    if (nxt_slow_path(p == NULL)) {
         return NXT_ERROR;
     }
 
@@ -39,13 +40,7 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
         return NXT_ERROR;
     }
 
-    len = strlen(cgprocs);
-
-    len = snprintf(cgprocs + len, PATH_MAX - len, "/cgroup.procs");
-    if (nxt_slow_path(len >= PATH_MAX - len)) {
-        nxt_errno = ENAMETOOLONG;
-        return NXT_ERROR;
-    }
+    stpcpy(p, "/cgroup.procs");
 
     fp = nxt_file_fopen(task, cgprocs, "we");
     if (nxt_slow_path(fp == NULL)) {
@@ -67,17 +62,16 @@ nxt_cgroup_proc_add(nxt_task_t *task, nxt_process_t *process)
 void
 nxt_cgroup_cleanup(nxt_task_t *task, const nxt_process_t *process)
 {
-    char       *ptr;
-    char       cgroot[PATH_MAX], cgpath[PATH_MAX];
-    nxt_int_t  ret;
+    char  *ptr, *ret;
+    char  cgroot[PATH_MAX], cgpath[PATH_MAX];
 
     ret = nxt_mk_cgpath(task, "", cgroot);
-    if (nxt_slow_path(ret == NXT_ERROR)) {
+    if (nxt_slow_path(ret == NULL)) {
         return;
     }
 
     ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath);
-    if (nxt_slow_path(ret == NXT_ERROR)) {
+    if (nxt_slow_path(ret == NULL)) {
         return;
     }
 
@@ -89,11 +83,11 @@ nxt_cgroup_cleanup(nxt_task_t *task, const nxt_process_t *process)
 }
 
 
-static int
+static char *
 nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
 {
-    int         i, len;
-    char        *buf, *ptr;
+    int         i;
+    char        *buf, *ptr, *p;
     FILE        *fp;
     size_t      size;
     ssize_t     nread;
@@ -101,10 +95,10 @@ nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
 
     fp = nxt_file_fopen(task, "/proc/self/cgroup", "re");
     if (nxt_slow_path(fp == NULL)) {
-        return -1;
+        return NULL;
     }
 
-    len = -1;
+    p = NULL;
     buf = NULL;
     found = 0;
     while ((nread = getline(&buf, &size, fp)) != -1) {
@@ -133,21 +127,22 @@ nxt_mk_cgpath_relative(nxt_task_t *task, const char *dir, char *cgpath)
         ptr++;
     }
 
-    len = snprintf(cgpath, PATH_MAX, NXT_CGROUP_ROOT "%s/%s", ptr, dir);
+    p = stpcpy(cgpath, NXT_CGROUP_ROOT);
+    p = stpcpy(p, ptr);
+    *p++ = '/';
+    p = stpcpy(p, dir);
 
 out_free_buf:
 
     nxt_free(buf);
 
-    return len;
+    return p;
 }
 
 
-static nxt_int_t
+static char *
 nxt_mk_cgpath(nxt_task_t *task, const char *dir, char *cgpath)
 {
-    int  len;
-
     /*
      * If the path from the config is relative, we need to make
      * the cgroup path include the main unit processes cgroup. I.e
@@ -155,19 +150,8 @@ nxt_mk_cgpath(nxt_task_t *task, const char *dir, char *cgpath)
      *   NXT_CGROUP_ROOT/<main process cgroup>/<cgroup path>
      */
     if (dir[0] != '/') {
-        len = nxt_mk_cgpath_relative(task, dir, cgpath);
+        return nxt_mk_cgpath_relative(task, dir, cgpath);
     } else {
-        len = snprintf(cgpath, PATH_MAX, NXT_CGROUP_ROOT "%s", dir);
+        return stpcpy(stpcpy(cgpath, NXT_CGROUP_ROOT), dir);
     }
-
-    if (len == -1) {
-        return NXT_ERROR;
-    }
-
-    if (len >= PATH_MAX) {
-        nxt_errno = ENAMETOOLONG;
-        return NXT_ERROR;
-    }
-
-    return NXT_OK;
 }
$ git diff --stat
 src/nxt_cgroup.c | 62 +++++++++++++++++++++++---------------------------------------
 1 file changed, 23 insertions(+), 39 deletions(-)

@ac000
Copy link
Member Author

ac000 commented Dec 9, 2022

While we do check at config time (you could argue it's superfluous, but for the absolute path case it's a quick sanity check before things go too far) for the absolute path case, we don't for the relative path case, so truncation checks are still required.

I think it's best to leave it as is at this late stage.

@alejandro-colomar
Copy link
Contributor

While we do check at config time (you could argue it's superfluous, but for the absolute path case it's a quick sanity check before things go too far) for the absolute path case, we don't for the relative path case, so truncation checks are still required.

I think it's best to leave it as is at this late stage.

Yes, I have to agree. We can merge the patch as is, which looks good to me (ugly but good), and then improve it afterwards.

Cheers,

Alex

@hongzhidao
Copy link
Contributor

hongzhidao commented Dec 9, 2022

Hi,

If we don't test here, that's fine, but we'll need to know that we need to truncate later. So, my previous simplification proposal would be invalid, and we'd need to add stpecpy() at some point. We can do that after the next release.

Yes, I have to agree. We can merge the patch as is, which looks good to me (ugly but good), and then improve it afterwards.

I suggest we finish the feature one time.

  1. The feature is clear now and the same for its logic.
  2. I believe the 1.30 version is more challenging, if a feature is released, we should move on and focus on something new.
  3. It seems a bit strange to do the same feature again, I understand maybe it's an improvement, but it's better to finish it together.

@ac000
Copy link
Member Author

ac000 commented Dec 10, 2022

The initial cgroups support is done. There is no 1.30 version, extra features will likely be added but there is no time frame.

As agreed this will be committed imminently. If you waited until things were perfect you'd never ship anything!

@alejandro-colomar
Copy link
Contributor

@hongzhidao

My improvement is related to the bug that I reported last week about buffer overruns. I just found another case of a buffer overrun in another part of Unit. Since this case from Andrew was safe (for some particular meaning of safe) so far, I'm fine with merging it, and fixing the bigger issue separately in 1.30. The only remaining issue before releasing 1.29 is the isolation bug that I fixed, so I prefer focusing on that. I'll soon post my tests, and we can maybe develop some pytests for that.

@hongzhidao
Copy link
Contributor

My improvement is related to the bug that I reported last week about buffer overruns.

There are no issues as we discussed, maybe we can remove some unnecessary checks.
Do you mean there are some other issues except those reported last week?

@alejandro-colomar
Copy link
Contributor

I mean I found another issue in a different part of Unit. I'll share, since I found it's not exploitable remotely. To reproduce, you need to control the app name.

@hongzhidao
Copy link
Contributor

Ok, welcome to create individual issues :)

@ac000 ac000 deleted the cgroup branch December 13, 2022 18:33
@lcrilly lcrilly moved this from In Development to Released in NGINX Unit roadmap-tracker Jan 13, 2023
@tippexs tippexs moved this from 🚀 Released to 🚀 Released1 in NGINX Unit roadmap-tracker Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🚀 Released
NGINX Unit roadmap-tracker
  
🚀 Released
Development

Successfully merging this pull request may close these issues.

None yet

4 participants