Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: add PidFD, CgroupFD, and UseCgroupFD options for Linux clone to SysProcAttr #51246

Closed
kolyshkin opened this issue Feb 17, 2022 · 48 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal Proposal-Accepted
Milestone

Comments

@kolyshkin
Copy link
Contributor

Linux 5.3 added a new system call, clone3(), which provides a superset of the functionality of the older clone(). In particular, it adds a way for a child to be spawned in a different cgroup, which solves a number of issues (see CLONE_INTO_CGROUP flag in clone3(2) man page for details).

It would be great for some software written in Go to benefit from this new functionality. Obviously, calling clone3() syscall directly won't work since Go runtime needs to perform specific steps around fork and exec.

So, the support for clone3 needs to be in the syscall package, where clone() is called.

Looks like this can be implemented by amending linux SysProcAttr structure with a pointer to CloneArgs (a new structure that mirrors that of C.clone_args). Then, forkAndExecInChild would use clone3 instead of clone if this pointer is non-nil. This can be used from e.g. os/exec can set cmd.SysProcArgs.CloneArgs.

@gopherbot gopherbot added this to the Proposal milestone Feb 17, 2022
@cuiweixie
Copy link
Contributor

i think this kernel version is too new, the minimal kernel version is still 2.6.23。

@kolyshkin
Copy link
Contributor Author

think this kernel version is too new, the minimal kernel version is still 2.6.23。

This is not a proposal to switch to the new syscall, this is a proposal to make its functionality available to userspace.

@OneOfOne
Copy link
Contributor

i think this kernel version is too new, the minimal kernel version is still 2.6.23。

Just to add perspective, 5.3 came out in Sept 2019.

@rsc rsc changed the title proposal: syscall: support clone3 / clone_args on Linux proposal: syscall: add CloneArgs struct, add SysProcAttr.CloneArgs field, use clone3 on Linux Feb 23, 2022
@rsc
Copy link
Contributor

rsc commented Feb 23, 2022

Retitled. There are three parts to this:

  1. Add CloneArgs struct
  2. Add CloneArgs *CloneArgs field to SysProcAttr
  3. When SysProcAttr.CloneArgs is set, use clone3 during syscall.ForkExec.

All of this seems unobjectionable.

@rsc
Copy link
Contributor

rsc commented Feb 23, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@prattmic
Copy link
Member

Is the intention for SysProcAttr.CloneArgs to be the entirety of the full C clone_args structure?

struct clone_args {
  u64 flags;        /* Flags bit mask */
  u64 pidfd;        /* Where to store PID file descriptor (int *) */
  u64 child_tid;    /* Where to store child TID, in child's memory (pid_t *) */
  u64 parent_tid;   /* Where to store child TID, in parent's memory (pid_t *) */
  u64 exit_signal;  /* Signal to deliver to parent on child termination */
  u64 stack;        /* Pointer to lowest byte of stack */
  u64 stack_size;   /* Size of stack */
  u64 tls;          /* Location of new TLS */
  u64 set_tid;      /* Pointer to a pid_t array (since Linux 5.5) */
  u64 set_tid_size; /* Number of elements in set_tid (since Linux 5.5) */
  u64 cgroup;       /* File descriptor for target cgroup of child (since Linux 5.7) */
};

This is a superset of the arguments to clone(2), and it seems like a significant departure from the existing SysProcAttr to allow unfettered access, rather than providing specific desired features.

Many of these arguments cannot be safely set by Go code. e.g., setting stack, tls, or certain flags is likely to cause the child (or parent!) to crash before it can complete syscall.ForkExec.

As an alternative, we could add specific useful features enabled by this syscall. e.g.,

  • SysProcAttr.PidFD *int which is assigned to the pidfd of the child.
  • SysProcAttr.CgroupFD *int which the fd for target cgroup of child.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Feb 23, 2022

Is the intention for SysProcAttr.CloneArgs to be the entirety of the full C clone_args structure?

Definitely not.

Indeed, this proposal can be reduced to adding PidFD and CgroupFD, and making use of clone3 CgroupFD is set (note that PidFD can be obtained via clone(2) already, so it's not clear if the code should switch to using clone3 in this case).

I fully agree with your analysis, and should have proposed what you had.

(Not really related to the proposal, but as far as the implementation goes, my biggest roadblock is obsoleted code for generating per-arch syscall numbers in src/syscall. The files that are supposed to be auto-generated are now changed manually so I am not exactly sure how to add a new syscall number.)

@prattmic prattmic changed the title proposal: syscall: add CloneArgs struct, add SysProcAttr.CloneArgs field, use clone3 on Linux proposal: syscall: add pidFD and cgroupFD options from Linux clone3 to SysProcAttr Feb 24, 2022
@prattmic prattmic changed the title proposal: syscall: add pidFD and cgroupFD options from Linux clone3 to SysProcAttr proposal: syscall: add pidFD and cgroupFD options from Linux clone to SysProcAttr Feb 24, 2022
@prattmic
Copy link
Member

(Not really related to the proposal, but as far as the implementation goes, my biggest roadblock is obsoleted code for generating per-arch syscall numbers in src/syscall. The files that are supposed to be auto-generated are now changed manually so I am not exactly sure how to add a new syscall number.)

Indeed, our system call code is a mess. We hope to clean it up (see #51087 and #15282), but right now it is a pain to change.

@prattmic
Copy link
Member

Indeed, this proposal can be reduced to adding PidFD and CgroupFD.

Thanks. In some sense this is really two proposals, I'm not sure if we should split it up.

Following the API we came to in #47049 (comment), I'll adjust the proposed API to adding the following to SysProcAttr:

PidFD *int // set to pid FD of child if not nil
UseCgroup bool
Cgroup int // fd of cgroup to place child in

@prattmic
Copy link
Member

prattmic commented Feb 24, 2022

Support for the cgroup FD seems fine and uncontroversial to me.

The pid FD case is a bit more interesting. I certainly think it should be possible to get a pid FD. In fact, I think we should use a pid FD whenever possible. I wonder if instead of providing only a low-level SysProcAttr field, we should automatically make os.StartProcess return an os.Process backed by a pid FD rather than the raw pid (Wait, Kill, etc perform pid FD operations). A new os.Process.Fd() could return the pid FD for additional direct use. os.Process closes the pid FD in os.Process.Release (which is also called with a finalizer).

Down a level, syscall.StartProcess could return the pid FD in the handle return slot, which is currently unused on Linux. Some extra argument or SysProcAttr field would be needed to request this, as callers other than os.StartProcess wouldn't know to close the FD by default. (syscall.ForkExec doesn't have this return value and would need a new variant or SysProcAttr field).

One possibility to merge the limitations above would be to add PidFD *int to SysProcAttr, but also make os.StartProcess automatically set this field if not set by callers.

@kolyshkin
Copy link
Contributor Author

PidFD *int // set to pid FD of child if not nil

So, I played with this a bit, and it seems it's easier to have something like this:

       // PidFD is set to a PID file descriptor referring to the child process,
       // if CLONE_PIDFD flag is set in Cloneflags. Available since Linux 5.2.
       PidFD int

The only thing about it is a slightly unusual way of returning a value. The gist of implementation, using existing clone() call, is this:

@@ -213,12 +218,12 @@ func forkAndExecInChild1(argv0 *byte, argv, envv []*byte, chroot, dir *byte, att
        runtime_BeforeFork()
        locked = true
        switch {
-       case sys.Cloneflags&CLONE_NEWUSER == 0 && sys.Unshareflags&CLONE_NEWUSER == 0:
+       case sys.Cloneflags&(CLONE_NEWUSER|CLONE_PIDFD) == 0 && sys.Unshareflags&CLONE_NEWUSER == 0:
                r1, err1 = rawVforkSyscall(SYS_CLONE, uintptr(SIGCHLD|CLONE_VFORK|CLONE_VM)|sys.Cloneflags)
        case runtime.GOARCH == "s390x":
-               r1, _, err1 = RawSyscall6(SYS_CLONE, 0, uintptr(SIGCHLD)|sys.Cloneflags, 0, 0, 0, 0)
+               r1, _, err1 = RawSyscall6(SYS_CLONE, 0, uintptr(SIGCHLD)|sys.Cloneflags, 0, uintptr(unsafe.Point
er(&sys.PidFD)), 0, 0)
        default:
-               r1, _, err1 = RawSyscall6(SYS_CLONE, uintptr(SIGCHLD)|sys.Cloneflags, 0, 0, 0, 0, 0)
+               r1, _, err1 = RawSyscall6(SYS_CLONE, uintptr(SIGCHLD)|sys.Cloneflags, 0, uintptr(unsafe.Pointer(
&sys.PidFD)), 0, 0, 0)
        }
        if err1 != 0 || r1 != 0 {
                // If we're in the parent, we must return immediately

We can certainly do it with a pointer, too, if that is preferred.

The only problem is adding new constants; at this point, I guess, I'd rather patch all the needed files in place, since the generator is outright broken and fixing this is very out of scope for this.

@ianlancetaylor
Copy link
Contributor

Currently SysProcAttr is not changed by StartProcess. I would be reluctant to introduce such a case. So I think that a pointer is better. Also, using a pointer makes clear when the caller expects the new descriptor; we don't want to create it if the caller doesn't want it.

@rsc rsc changed the title proposal: syscall: add pidFD and cgroupFD options from Linux clone to SysProcAttr proposal: syscall: add PidFD, UseCgroup, and Cgroup options from Linux clone to SysProcAttr Mar 2, 2022
@rsc
Copy link
Contributor

rsc commented Mar 2, 2022

Retitled per discussion above. The current proposal is #51246 (comment), to add PidFD, Cgroup, Cgroup.

Certainly you would expect "Pid *int" to be a process ID, not a process ID file descriptor. Hence "PidFD *int".
What about Cgroup? Are there integer IDs for cgroups already? Should it be CgroupFD?

(In contrast, we have "Ctty int" but I think most people would expect a tty int to be a file descriptor, not a tty number. The question is whether Cgroup is more like Ctty or more like Pid.)

@prattmic
Copy link
Member

prattmic commented Mar 2, 2022

Certainly you would expect "Pid *int" to be a process ID, not a process ID file descriptor. Hence "PidFD *int". What about Cgroup? Are there integer IDs for cgroups already? Should it be CgroupFD?

As far as I know there are no standard integer IDs for cgroups. cgroups are primarily managed as directories within a cgroupfs mount. Technically those directories would have integer inode IDs, but I've never heard of those being used for anything.

Even proc files identify cgroups with their string path within the cgroupfs. e.g., the last field below,

$ cat /proc/225775/cgroup 
13:freezer:/
12:memory:/user.slice/user-200669.slice/session-c16.scope
11:perf_event:/
10:pids:/user.slice/user-200669.slice/session-c16.scope
9:misc:/
8:rdma:/
7:hugetlb:/
6:devices:/user.slice
5:net_cls,net_prio:/
4:cpuset:/
3:cpu,cpuacct:/user.slice/user-200669.slice/session-c16.scope
2:blkio:/user.slice
1:name=systemd:/user.slice/user-200669.slice/session-c16.scope
0::/user.slice/user-200669.slice/session-c16.scope

The first field is the "hierarchy ID", which is a numeric description for the cgroup type:

$ cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  4       2       1
cpu     3       96      1
cpuacct 3       96      1
blkio   2       87      1
memory  12      124     1
devices 6       87      1
freezer 13      2       1
net_cls 5       2       1
perf_event      11      2       1
net_prio        5       2       1
hugetlb 7       2       1
pids    10      96      1
rdma    8       1       1
misc    9       1       1

I suppose that could be confusing, but again this describes a cgroup type, not a specific cgroup.

Despite all that, in hindsight I think CgroupFD is a better name. I suggested Cgroup because that is the name used in Linux's clone_args, but CgroupFD IMO only adds clarity.

Additionally, as far as I know, "cgroup FDs" are not a common concept [1]. In fact, I'm not certain if there is anything you can do with this FD besides pass it to clone. Most cgroup operations are performed on files in the cgroup directory. e.g., an existing process is added to a cgroup by opening the cgroup.procs file in the directory and writing a PID. So I think adding clarity in the name is helpful. (The documentation on the field should also explain how to get this FD).

[1] A "cgroup FD" in the context of clone is just open(O_RDONLY) of the cgroup directory.

@prattmic
Copy link
Member

prattmic commented Mar 2, 2022

In fact, I'm not certain if there is anything you can do with this FD besides pass it to clone.

Oh, of course, the other obvious thing you could do is use this with openat (e.g., openat(cgroupFD, "cgroup.procs")) to avoid needing to know the name of the cgroup (or even the cgroupfs mount point). But I'm not sure if that is something commonly done.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Mar 7, 2022

@prattmic what you look at is cgroup v1, which is a set of cgroups per controller, i.e. we have a forest. This is being superseded by cgroup v2, where we have a single unified cgroup directory for each process (IOW a tree, not a forest). Cgroup v2 is enabled by default in some modern distros, e.g. recent Fedora. For Ubuntu, you need a kernel options

The cgroup parameter to clone3 only supports cgroup v2, and it's a file description to cgroup directory.

You're right that there is not much use for cgroupfd outside of this clone3. One other use I can think of is openat2, but implementing openat2 is a separate unrelated topic. (Update: I missed your last comment, @prattmic, where you say the same about openat)

@rsc
Copy link
Contributor

rsc commented Mar 9, 2022

It sounds like there is enough confusion about cgroups that calling it CgroupFD would help clarify matters.

@rsc rsc changed the title proposal: syscall: add PidFD, UseCgroup, and Cgroup options from Linux clone to SysProcAttr proposal: syscall: add PidFD, CgroupFD, and UseCgroupFD options for Linux clone to SysProcAttr Mar 9, 2022
@rsc
Copy link
Contributor

rsc commented Mar 9, 2022

Retitled. Does anyone object to adding PidFD, CgroupFD, and UseCgroupFD to the Linux SysProcAttr?

@rsc
Copy link
Contributor

rsc commented Mar 9, 2022

@kolyshkin, we might want to reserve access to the pidfd to Go itself, so that we have the ability to use it for signal delivery on new enough Linux systems. If we add this option to SysProcAttr, we are essentially giving away the pidfd and giving up the ability to use it in the Go runtime or os package in the future. We might not want to do that.

What use for pidfd did you have? Perhaps we should figure out an API that would let you do what you need while the Go runtime or os package still owns the pidfd.

(Or maybe if we need two in the future we could use Dup.)

@rsc
Copy link
Contributor

rsc commented Mar 16, 2022

@kolyshkin We are still interested in whether the pidfd itself needs to be exposed, and for what use. Thanks!

@kolyshkin
Copy link
Contributor Author

Sorry for not replying earlier, was on vacay.

Support for pidfd (I mean full and transparent support, such as embedding it into os.Process and using from e.g. Wait (that would be waitid(P_PIDFD, ...)) and Kill methods) would be nice to have, since using it will protect from pid reuse, and I've seen a few projects that have their own semi-crappy ways of dealing with it (usually by storing and checking process start time).

Obviously, those projects need a way to see if pidfd is supported by the runtime/kernel.

Some low-level software, like runc, may find other uses for pidfd (I'm not sure what would be it exactly, perhaps pidfd_getfd(2), and I guess future kernels may add more ways to use pidfd), so having some method to get it (hidden under Sys() or the like) would be nice to have, too, and could also be used as a way to check if pidfd is supported.

gopherbot pushed a commit that referenced this issue Aug 19, 2022
The constants for these were auto-generated from the C includes
into zerrors_linux* files quite some time ago. The generator is
currently broken, but some new flags need to be added nevertheless.

As the flags won't change and the values are the same for all
architectures, we can just define them statically (as it's already
done in the runtime package):

 - remove the CLONE_* constants from zerrors_linux_*.go;
 - patch mkerrors.sh to not generate CLONE_ constants
   (in case it will be fixed and used in the future);
 - add the constants and some comments about them to exec_linux.go,
   using Linux v5.17 include/uapi/sched.h as the ultimate source.

This adds the following new flags:

 - CLONE_CLEAR_SIGHAND
 - CLONE_INTO_CGROUP
 - CLONE_NEWCGROUP
 - CLONE_NEWTIME
 - CLONE_PIDFD

For #51246.

Change-Id: I0c635723926218bd403d37e113ee4d62194463a8
Reviewed-on: https://go-review.googlesource.com/c/go/+/407574
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joedian Reid <joedian@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
gopherbot pushed a commit that referenced this issue Sep 9, 2022
Implement CLONE_INTO_CGROUP feature, allowing to put a child in a
specified cgroup in a clean and simple way. Note that the feature only
works for cgroup v2, and requires Linux kernel 5.7 or newer.

Using the feature requires a new syscall, clone3. Currently this is the
only reason to use clone3, but the code is structured in a way so that
other cases may be easily added in the future.

Add a test case.

While at it, try to simplify the syscall calling code in
forkAndExecInChild1, which became complicated over time because:

1. It was using either rawVforkSyscall or RawSyscall6 depending on
   whether CLONE_NEWUSER was set.

2. On Linux/s390, the first two arguments to clone(2) system call are
   swapped (which deserved a mention in Linux ABI hall of shame). It
   was worked around in rawVforkSyscall on s390, but had to be
   implemented via a switch/case when using RawSyscall6, making the code
   less clear.

Let's

 - modify rawVforkSyscall to have two arguments (which is also required
   for clone3);

 - remove the arguments workaround from s390 asm, instead implementing
   arguments swap in the caller (which still looks ugly but at least
   it's done once and is clearly documented now);

 - use rawVforkSyscall for all cases (since it is essentially similar to
   RawSyscall6, except for having less parameters, not returning r2, and
   saving/restoring the return address before/after syscall on 386 and
   amd64).

Updates #51246.

Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e
Reviewed-on: https://go-review.googlesource.com/c/go/+/417695
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@zhuangqh
Copy link

zhuangqh commented Aug 2, 2023

Can we backport this feature to go1.19? Because linux5.3 has been released in 2019 and go1.20 is too new for us. thanks
@rsc @kolyshkin

@kolyshkin
Copy link
Contributor Author

Can we backport this feature to go1.19? Because linux5.3 has been released in 2019 and go1.20 is too new for us. thanks @rsc @kolyshkin

I'm not a Go maintainer, but I do maintain a bunch of Go packages. From that perspective, I can suggest you to

  • move to Go 1.20 ASAP (since Go 1.19 won't be supported once Go 1.21 is released later this month);
  • in general, be proactive about supporting new Go versions (before they are out -- this is what betas and RCs are for);
  • do not add comments not directly related to the issue being discussed (doing a backport is clearly a separate issue)

zhuangqh pushed a commit to zhuangqh/go that referenced this issue Aug 2, 2023
The constants for these were auto-generated from the C includes
into zerrors_linux* files quite some time ago. The generator is
currently broken, but some new flags need to be added nevertheless.

As the flags won't change and the values are the same for all
architectures, we can just define them statically (as it's already
done in the runtime package):

 - remove the CLONE_* constants from zerrors_linux_*.go;
 - patch mkerrors.sh to not generate CLONE_ constants
   (in case it will be fixed and used in the future);
 - add the constants and some comments about them to exec_linux.go,
   using Linux v5.17 include/uapi/sched.h as the ultimate source.

This adds the following new flags:

 - CLONE_CLEAR_SIGHAND
 - CLONE_INTO_CGROUP
 - CLONE_NEWCGROUP
 - CLONE_NEWTIME
 - CLONE_PIDFD

For golang#51246.

Change-Id: I0c635723926218bd403d37e113ee4d62194463a8
Reviewed-on: https://go-review.googlesource.com/c/go/+/407574
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joedian Reid <joedian@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
zhuangqh pushed a commit to zhuangqh/go that referenced this issue Aug 2, 2023
Implement CLONE_INTO_CGROUP feature, allowing to put a child in a
specified cgroup in a clean and simple way. Note that the feature only
works for cgroup v2, and requires Linux kernel 5.7 or newer.

Using the feature requires a new syscall, clone3. Currently this is the
only reason to use clone3, but the code is structured in a way so that
other cases may be easily added in the future.

Add a test case.

While at it, try to simplify the syscall calling code in
forkAndExecInChild1, which became complicated over time because:

1. It was using either rawVforkSyscall or RawSyscall6 depending on
   whether CLONE_NEWUSER was set.

2. On Linux/s390, the first two arguments to clone(2) system call are
   swapped (which deserved a mention in Linux ABI hall of shame). It
   was worked around in rawVforkSyscall on s390, but had to be
   implemented via a switch/case when using RawSyscall6, making the code
   less clear.

Let's

 - modify rawVforkSyscall to have two arguments (which is also required
   for clone3);

 - remove the arguments workaround from s390 asm, instead implementing
   arguments swap in the caller (which still looks ugly but at least
   it's done once and is clearly documented now);

 - use rawVforkSyscall for all cases (since it is essentially similar to
   RawSyscall6, except for having less parameters, not returning r2, and
   saving/restoring the return address before/after syscall on 386 and
   amd64).

Updates golang#51246.

Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e
Reviewed-on: https://go-review.googlesource.com/c/go/+/417695
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@ianlancetaylor
Copy link
Contributor

@zhuangqh Sorry, we don't backport new features like this to older Go releases. See https://go.dev/wiki/MinorReleases for our backport policy.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/520266 mentions this issue: syscall: add PidFD support to SysProcAttr

gopherbot pushed a commit that referenced this issue Sep 7, 2023
Add PidFD support, so that if the PidFD pointer in SysProcAttr is not
nil, ForkExec (and thus all its users) obtains a pidfd from the kernel
during clone(), and writes the result (or -1, if the functionality
is not supported by the kernel) into *PidFD.

The functionality to get pidfd is implemented for both clone3 and clone.
For the latter, an extra argument to rawVforkSyscall is needed, thus the
change in asm files.

Add a trivial test case checking the obtained pidfd can be used to send
a signal to a process, using pidfd_send_signal. To test clone3 code path,
add a flag available to tests only.

Updates #51246.

Change-Id: I2212b69e1a657163c31b4a6245b076bc495777a3
Reviewed-on: https://go-review.googlesource.com/c/go/+/520266
Auto-Submit: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528436 mentions this issue: internal/syscall/unix: add PidFDSendSignal for linux

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528439 mentions this issue: os: use PidfdSendSignal on linux if possible

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528798 mentions this issue: os: make use of pidfd on linux

@kolyshkin
Copy link
Contributor Author

Support for the cgroup FD seems fine and uncontroversial to me.

The pid FD case is a bit more interesting. I certainly think it should be possible to get a pid FD. In fact, I think we should use a pid FD whenever possible. I wonder if instead of providing only a low-level SysProcAttr field, we should automatically make os.StartProcess return an os.Process backed by a pid FD rather than the raw pid (Wait, Kill, etc perform pid FD operations). A new os.Process.Fd() could return the pid FD for additional direct use. os.Process closes the pid FD in os.Process.Release (which is also called with a finalizer).

Down a level, syscall.StartProcess could return the pid FD in the handle return slot, which is currently unused on Linux. Some extra argument or SysProcAttr field would be needed to request this, as callers other than os.StartProcess wouldn't know to close the FD by default. (syscall.ForkExec doesn't have this return value and would need a new variant or SysProcAttr field).

One possibility to merge the limitations above would be to add PidFD *int to SysProcAttr, but also make os.StartProcess automatically set this field if not set by callers.

@prattmic Opened https://go.dev/issue/62654 based on this, and wrote a partial implementation in https://go-review.googlesource.com/c/go/+/528438 and https://go-review.googlesource.com/c/go/+/528798/2?usp=related-change

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/542698 mentions this issue: syscall: fix getting pidfd when using CLONE_NEWUSER

gopherbot pushed a commit that referenced this issue Nov 21, 2023
While working on CL 528798, I found out that sys.PidFD field (added
in CL 520266) is not filled in when CLONE_NEWUSER is used.

This happens because the code assumed that the parent and the child
run in the same memory space. This assumption is right only when
CLONE_VM is used for clone syscall, and the code only sets CLONE_VM
when CLONE_NEWUSER is not used.

Fix this, and add a test case (which fails before the fix).

Updates #51246.

Change-Id: I805203c1369cadd63d769568b132a9ffd92cc184
Reviewed-on: https://go-review.googlesource.com/c/go/+/542698
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/544318 mentions this issue: syscall: check SyscallIsNotSupported in TestPidFDWithUserNS

gopherbot pushed a commit that referenced this issue Nov 21, 2023
For #51246.

Change-Id: Ief2e2e14f039123a6580cb60be7ee74f4a20a649
Reviewed-on: https://go-review.googlesource.com/c/go/+/544318
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Nov 21, 2023
CL 520266 added pidfd_send_signal linux syscall numbers to the
syscall package for the sake of a unit test.

As pidfd_send_signal will be used from the os package, let's revert the
changes to syscall package, add the pidfd_send_signal syscall numbers
and the implementation to internal/syscall/unix, and change the above
test to use it.

Updates #51246.
For #62654.

Change-Id: I862174c3c1a64baf1080792bdb3a1c1d1b417bb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/528436
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/570681 mentions this issue: os: make FindProcess use pidfd on Linux

@kolyshkin
Copy link
Contributor Author

I think this could be closed now since the proposal is fully implemented as of Go 1.22. The continuation of pidfd use from golang runtime discussion happens in #62654.

gopherbot pushed a commit that referenced this issue May 21, 2024
This is a continuation of CL 570036.

Amend FindProcess to use pidfdFind, and make it return a special
Process with Pid of pidDone (-2) if the process is not found.

Amend Wait and Signal to return ErrProcessDone if pid == pidDone.

The alternative to the above would be to make FindProcess return
ErrProcessDone, but this is unexpected and incompatible API change,
as discussed in #65866 and #51246.

For #62654.

Rework of CL 542699 (which got reverted in CL 566476).

Change-Id: Ifb4cd3ad1433152fd72ee685d0b85d20377f8723
Reviewed-on: https://go-review.googlesource.com/c/go/+/570681
TryBot-Bypass: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
@dmitshur dmitshur modified the milestones: Backlog, Go1.22 May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal Proposal-Accepted
Projects
Status: Accepted
Development

No branches or pull requests

10 participants