syscall: add PidFD, CgroupFD, and UseCgroupFD options for Linux clone to SysProcAttr #51246
Comments
I think this kernel version is too new; the minimum supported kernel version is still 2.6.23. |
This is not a proposal to switch to the new syscall, this is a proposal to make its functionality available to userspace. |
Just to add perspective, 5.3 came out in Sept 2019. |
Retitled. There are three parts to this:
All of this seems unobjectionable. |
This proposal has been added to the active column of the proposals project |
Is the intention for
This is a superset of the arguments to Many of these arguments cannot be safely set by Go code. e.g., setting As an alternative, we could add specific useful features enabled by this syscall. e.g.,
|
Definitely not. Indeed, this proposal can be reduced to adding PidFD and CgroupFD, and making use of clone3 if CgroupFD is set (note that PidFD can be obtained via clone(2) already, so it's not clear if the code should switch to using clone3 in this case). I fully agree with your analysis, and should have proposed what you had. (Not really related to the proposal, but as far as the implementation goes, my biggest roadblock is the obsolete code for generating per-arch syscall numbers in src/syscall. The files that are supposed to be auto-generated are now changed manually, so I am not exactly sure how to add a new syscall number.) |
Indeed, our system call code is a mess. We hope to clean it up (see #51087 and #15282), but right now it is a pain to change. |
Thanks. In some sense this is really two proposals, I'm not sure if we should split it up. Following the API we came to in #47049 (comment), I'll adjust the proposed API to adding the following to
|
Support for the cgroup FD seems fine and uncontroversial to me. The pid FD case is a bit more interesting. I certainly think it should be possible to get a pid FD. In fact, I think we should use a pid FD whenever possible. I wonder if instead of providing only a low-level Down a level, One possibility to merge the limitations above would be to add |
So, I played with this a bit, and it seems it's easier to have something like this:

// PidFD is set to a PID file descriptor referring to the child process,
// if CLONE_PIDFD flag is set in Cloneflags. Available since Linux 5.2.
PidFD int

The only thing about it is a slightly unusual way of returning a value. The gist of the implementation, using the existing clone() call, is this:

@@ -213,12 +218,12 @@ func forkAndExecInChild1(argv0 *byte, argv, envv []*byte, chroot, dir *byte, att
runtime_BeforeFork()
locked = true
switch {
- case sys.Cloneflags&CLONE_NEWUSER == 0 && sys.Unshareflags&CLONE_NEWUSER == 0:
+ case sys.Cloneflags&(CLONE_NEWUSER|CLONE_PIDFD) == 0 && sys.Unshareflags&CLONE_NEWUSER == 0:
r1, err1 = rawVforkSyscall(SYS_CLONE, uintptr(SIGCHLD|CLONE_VFORK|CLONE_VM)|sys.Cloneflags)
case runtime.GOARCH == "s390x":
- r1, _, err1 = RawSyscall6(SYS_CLONE, 0, uintptr(SIGCHLD)|sys.Cloneflags, 0, 0, 0, 0)
+ r1, _, err1 = RawSyscall6(SYS_CLONE, 0, uintptr(SIGCHLD)|sys.Cloneflags, 0, uintptr(unsafe.Pointer(&sys.PidFD)), 0, 0)
default:
- r1, _, err1 = RawSyscall6(SYS_CLONE, uintptr(SIGCHLD)|sys.Cloneflags, 0, 0, 0, 0, 0)
+ r1, _, err1 = RawSyscall6(SYS_CLONE, uintptr(SIGCHLD)|sys.Cloneflags, 0, uintptr(unsafe.Pointer(&sys.PidFD)), 0, 0, 0)
}
if err1 != 0 || r1 != 0 {
// If we're in the parent, we must return immediately

We can certainly do it with a pointer, too, if that is preferred. The only problem is adding new constants; at this point, I guess, I'd rather patch all the needed files in place, since the generator is outright broken and fixing this is very out of scope for this. |
Currently |
Retitled per discussion above. The current proposal is #51246 (comment), to add PidFD, Cgroup, Cgroup. Certainly you would expect "Pid *int" to be a process ID, not a process ID file descriptor. Hence "PidFD *int". (In contrast, we have "Ctty int" but I think most people would expect a tty int to be a file descriptor, not a tty number. The question is whether Cgroup is more like Ctty or more like Pid.) |
As far as I know there are no standard integer IDs for cgroups. cgroups are primarily managed as directories within a Even proc files identify cgroups with their string path within the cgroupfs. e.g., the last field below,
The first field is the "hierarchy ID", which is a numeric description for the cgroup type:
I suppose that could be confusing, but again this describes a cgroup type, not a specific cgroup. Despite all that, in hindsight I think Additionally, as far as I know, "cgroup FDs" are not a common concept [1]. In fact, I'm not certain if there is anything you can do with this FD besides pass it to clone. Most cgroup operations are performed on files in the cgroup directory. e.g., an existing process is added to a cgroup by opening the [1] A "cgroup FD" in the context of clone is just open(O_RDONLY) of the cgroup directory. |
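To make the proc file layout discussed above concrete, here is a minimal sketch that splits one line of /proc/<pid>/cgroup into its three colon-separated fields (hierarchy ID, controller list, cgroup path). The cgroupEntry type, the parseCgroupLine helper, and the sample lines are hypothetical and exist only for illustration; they are not part of the proposal.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cgroupEntry is a hypothetical type holding one line of /proc/<pid>/cgroup.
type cgroupEntry struct {
	HierarchyID int      // numeric hierarchy ID (first field)
	Controllers []string // controllers bound to the hierarchy; empty for cgroup v2
	Path        string   // cgroup path relative to the cgroupfs mount (last field)
}

// parseCgroupLine splits a line of the form
// "hierarchy-ID:controller-list:cgroup-path".
func parseCgroupLine(line string) (cgroupEntry, error) {
	// Split into at most three parts so that a ':' inside the path survives.
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, fmt.Errorf("malformed cgroup line %q", line)
	}
	id, err := strconv.Atoi(parts[0])
	if err != nil {
		return cgroupEntry{}, fmt.Errorf("bad hierarchy ID in %q: %w", line, err)
	}
	var ctrls []string
	if parts[1] != "" {
		ctrls = strings.Split(parts[1], ",")
	}
	return cgroupEntry{HierarchyID: id, Controllers: ctrls, Path: parts[2]}, nil
}

func main() {
	// Made-up sample lines in the documented format, not real system output.
	samples := []string{
		"4:cpu,cpuacct:/user.slice",
		"0::/user.slice/session-1.scope",
	}
	for _, l := range samples {
		e, err := parseCgroupLine(l)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("id=%d controllers=%v path=%s\n", e.HierarchyID, e.Controllers, e.Path)
	}
}
```

Note that the parsed path only names a cgroup within its hierarchy; it still has to be joined with the cgroupfs mount point before it can be opened.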
Oh, of course, the other obvious thing you could do is use this with openat (e.g., |
@prattmic what you are looking at is cgroup v1, which is a set of cgroups per controller, i.e. we have a forest. This is being superseded by cgroup v2, where we have a single unified cgroup directory for each process (IOW a tree, not a forest). Cgroup v2 is enabled by default in some modern distros, e.g. recent Fedora. For Ubuntu, you need a kernel options The cgroup parameter to clone3 only supports cgroup v2, and it's a file descriptor referring to a cgroup directory. You're right that there is not much use for a cgroupfd outside of clone3. One other use I can think of is openat2, but implementing openat2 is a separate unrelated topic. (Update: I missed your last comment, @prattmic, where you say the same about openat) |
It sounds like there is enough confusion about cgroups that calling it CgroupFD would help clarify matters. |
Retitled. Does anyone object to adding PidFD, CgroupFD, and UseCgroupFD to the Linux SysProcAttr? |
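For readers wondering how these fields would be used, here is a rough sketch, assuming the fields have the shapes `UseCgroupFD bool` and `CgroupFD int` on the Linux syscall.SysProcAttr (as in the Go releases that shipped this proposal). The commandInCgroup helper and the /sys/fs/cgroup/example path are hypothetical; actually starting the child this way needs a cgroup v2 hierarchy, a new enough kernel, and permission to place processes in that cgroup.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// commandInCgroup is a hypothetical helper. It opens the cgroup v2 directory
// dir read-only and returns a command that asks to be started directly in
// that cgroup. The returned file must stay open until the command has been
// started; the caller is responsible for closing it.
func commandInCgroup(dir, name string, args ...string) (*exec.Cmd, *os.File, error) {
	f, err := os.Open(dir) // a "cgroup FD" is just a read-only open of the directory
	if err != nil {
		return nil, nil, err
	}
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,        // tells the syscall package that CgroupFD is meaningful
		CgroupFD:    int(f.Fd()), // descriptor of the target cgroup directory
	}
	return cmd, f, nil
}

func main() {
	// Hypothetical cgroup path; adjust to a cgroup that exists on your system.
	cmd, f, err := commandInCgroup("/sys/fs/cgroup/example", "true")
	if err != nil {
		fmt.Println("open cgroup dir:", err)
		return
	}
	defer f.Close()
	if err := cmd.Run(); err != nil {
		fmt.Println("run:", err)
	}
}
```

The separate boolean is needed because 0 is a valid file descriptor, so a zero CgroupFD alone cannot mean "unset".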
@kolyshkin, we might want to reserve access to the pidfd to Go itself, so that we have the ability to use it for signal delivery on new enough Linux systems. If we add this option to SysProcAttr, we are essentially giving away the pidfd and giving up the ability to use it in the Go runtime or os package in the future. We might not want to do that. What use for pidfd did you have? Perhaps we should figure out an API that would let you do what you need while the Go runtime or os package still owns the pidfd. (Or maybe if we need two in the future we could use Dup.) |
@kolyshkin We are still interested in whether the pidfd itself needs to be exposed, and for what use. Thanks! |
Sorry for not replying earlier, was on vacay. Support for pidfd (I mean full and transparent support, such as embedding it into Obviously, those projects need a way to see if pidfd is supported by the runtime/kernel. Some low-level software, like runc, may find other uses for pidfd (I'm not sure what would be it exactly, perhaps |
Note that the os package already uses I think the relevant question is: if we add an option to |
What I meant is not just waitid, but waitid with P_PIDFD. The current code is Line 32 in a682a5c
That process id may or may not refer to the original process by the time Complete support for pidfd should include switching to |
I am not very familiar with the code, but it seems Linux can reuse the Now,
If all this is out of scope for this proposal, please let me know if you want me to file a new issue regarding pidfd support, and we can limit this one to cgroupfd. |
For this proposal we don't have to discuss how the os package should use pidfd. What I'm asking is this: if we add It seems clearly safer to reserve the pidfd value for the os package. But maybe it's OK to also let the user code get the pidfd. I don't know. (And of course user code can always call |
That's a great but tough question. I thought about various scenarios and I do not foresee any issues. Exposing In addition, a user could do the following operations on
Most probably I have missed something, and might have made other mistakes -- feel free to correct/contribute. For item 1, though, making the field private and adding an accessor method (e.g. |
Ownership seems to be the biggest obvious problem. We need to define who is responsible for closing the FD. If Stepping back a bit, from #51246 (comment) it seems that the primary desire is use within |
OTOH we already have
Right.
I agree. The only thing is, apps would need a way to know if |
If we did want to let the os package use the pidfd and a user package wanted it too, is there any problem with just calling dup to get two fds and let them both close it when done? |
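The dup idea can be illustrated without pidfds at all. The sketch below uses a pipe as a stand-in for a pidfd: the descriptor is duplicated, the first owner closes its copy, and the second copy keeps working. The dupFile and roundTrip helpers are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// dupFile returns a new *os.File backed by a duplicate of f's descriptor.
// Each copy can be closed independently by its owner.
func dupFile(f *os.File) (*os.File, error) {
	fd, err := syscall.Dup(int(f.Fd()))
	if err != nil {
		return nil, err
	}
	// Dup does not set close-on-exec; set it so the copy does not leak into
	// children. (There is a small window before this call in which a
	// concurrent fork could inherit the descriptor.)
	syscall.CloseOnExec(fd)
	return os.NewFile(uintptr(fd), f.Name()), nil
}

// roundTrip shows that the duplicate stays usable after the original
// descriptor has been closed by its first owner.
func roundTrip(msg string) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer w.Close()

	r2, err := dupFile(r)
	if err != nil {
		r.Close()
		return "", err
	}
	defer r2.Close()

	r.Close() // the first owner is done with its copy

	if _, err := w.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(r2, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("still readable")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

Both descriptors refer to the same open file description, so closing one only drops that owner's reference; the underlying object goes away when the last copy is closed.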
For what it's worth, I have done a bit of playing around with pidfds recently at https://github.com/mdlayher/pidfd. Here's my use case: My intent with this package is to ultimately use
Since these processes are long-running, I don't expect to run into any races between spawning a process, having it die, and then accidentally getting a pidfd handle to a process which has reused the same PID as my previous one. As for spawning a new process from Go and getting a pidfd assigned immediately by If it turns out to be too tricky to expose today, I believe the approach I've highlighted above should work for all but the shortest-lived processes spawned from Go. |
It sounds like everyone agrees with adding CgroupFD and UseCgroupFD. |
For PidFD, it sounds like there should be no problem handing it out to users right now, and if the Go runtime wants a copy, we can call dup on it to get two in the future. So it sounds like we should add PidFD too. Do I have that right? |
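As a sketch of what user code might look like with the pointer form discussed earlier in the thread ("PidFD *int"), the example below asks for a pidfd when starting a child. It assumes a Go release whose Linux syscall.SysProcAttr has that field and a kernel with pidfd support; on older kernels the value is expected to stay negative. The pidFDAttr helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// pidFDAttr is a hypothetical helper that returns process attributes asking
// for the child's pidfd to be stored in *fd.
func pidFDAttr(fd *int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{PidFD: fd}
}

func main() {
	pidfd := -1
	cmd := exec.Command("true")
	cmd.SysProcAttr = pidFDAttr(&pidfd)
	if err := cmd.Start(); err != nil {
		fmt.Println("start:", err)
		return
	}
	if pidfd >= 0 {
		// The caller owns this descriptor and must close it.
		defer syscall.Close(pidfd)
		fmt.Println("got a pidfd for pid", cmd.Process.Pid)
	} else {
		fmt.Println("pidfd not available on this system")
	}
	_ = cmd.Wait()
}
```

Because the descriptor is handed to the caller, the ownership question raised above applies: user code closes its copy, and any other holder would need its own duplicate.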
Based on the discussion above, this proposal seems like a likely accept. |
No change in consensus, so accepted. |
Change https://go.dev/cl/407574 mentions this issue: |
Linux 5.3 added a new system call, clone3(), which provides a superset of the functionality of the older clone(). In particular, it adds a way for a child to be spawned in a different cgroup, which solves a number of issues (see the CLONE_INTO_CGROUP flag in the clone3(2) man page for details).

It would be great for some software written in Go to benefit from this new functionality. Obviously, calling the clone3() syscall directly won't work, since the Go runtime needs to perform specific steps around fork and exec.

So, the support for clone3 needs to be in the syscall package, where clone() is called.

Looks like this can be implemented by amending the Linux SysProcAttr structure with a pointer to CloneArgs (a new structure that mirrors that of C.clone_args). Then, forkAndExecInChild would use clone3 instead of clone if this pointer is non-nil. This can be used from e.g. os/exec, which can set cmd.SysProcAttr.CloneArgs.