runc worker: fix sigkill handling #3754

Merged: 1 commit merged into moby:master on Mar 31, 2023
Conversation

@coryb (Collaborator) commented Mar 28, 2023

This fixes the incorrect kill handling introduced in b76f8c0. We need to send the
SIGKILL to the in-container process, not the runc process. This patch adds an abstraction over the kill handling:

  • for runc run processes use runc kill
  • for runc exec processes, read pid (in host PID namespace) from pidfile created by runc exec, then send the signal directly to that process.

Also use the kill abstraction when we receive a SIGKILL over the signal channel for containers created by gateway NewContainer.

The pid returned on the started channel from runc.(Run|Exec) seems to be the pid of the runc monitor process, not the pid of the process run in the container. I have confirmed the pid written to the pidfile is the host pid of the process in the container, so we create temp pidfiles for runc to write to, letting us extract the correct pid to signal.

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

This fixes #3751.

Edit: thanks @kolyshkin for confirming I am just imagining things. The pids returned from go-runc are the pids of the runc process; the pidfiles written by the runc process contain the initial PID of the in-container process, in the host PID namespace.
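For illustration, here is a minimal sketch of the two strategies behind that abstraction. The names (killer, runcKiller, pidfileKiller) are hypothetical; the PR's actual code differs in detail:

```go
package killer

import (
	"context"
	"os"
	"strconv"
	"strings"
	"syscall"

	runc "github.com/containerd/go-runc"
)

// killer abstracts "terminate the in-container process".
type killer interface {
	Kill(ctx context.Context) error
}

// runcKiller covers `runc run` processes: `runc kill` signals the
// container's pid1 for us.
type runcKiller struct {
	runc *runc.Runc
	id   string
}

func (k runcKiller) Kill(ctx context.Context) error {
	return k.runc.Kill(ctx, k.id, int(syscall.SIGKILL), nil)
}

// pidfileKiller covers `runc exec` processes: read the host-namespace pid
// written by `runc exec --pid-file`, then signal that process directly.
type pidfileKiller struct {
	pidfile string
}

func (k pidfileKiller) Kill(ctx context.Context) error {
	data, err := os.ReadFile(k.pidfile)
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return err
	}
	return syscall.Kill(pid, syscall.SIGKILL)
}
```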

@coryb coryb force-pushed the issue-3751 branch 2 times, most recently from 26c29f3 to 393db6f on March 28, 2023 at 15:27
@coryb (Collaborator, Author) commented Mar 28, 2023

Ugh, this change is working for the test cases I was working with but causing other issues; will keep looking...

@coryb coryb force-pushed the issue-3751 branch 3 times, most recently from db5d937 to a2c70fa on March 28, 2023 at 21:30
@coryb (Collaborator, Author) left a review comment:

The basic issue is that the started channel passed into runc.(Run|Exec) returns the pid of the runc process, but generally we need to send signals to the process monitored by runc inside the container. We can get that process id via a pidfile that runc will write. Unfortunately the pidfile is created some time after the runc pid is returned on the started channel, so we have to loop waiting for the pidfile to have valid contents.

There is an exception: we need to send SIGWINCH signals to runc rather than to the process inside the container; I suspect that is because the runc process is controlling the tty.

This seems to be the only way we can manage processes launched via runc exec. With runc run we can use runc kill to signal the in-container process, but runc kill does not work with processes started via runc exec. We could abstract the code a bit to use runc kill for pid1 and the pidfile + process signal for exec'd processes, but I am not sure that adds any value here. Either way we still have to send SIGWINCH signals to the runc process (we can't use runc kill with SIGWINCH).

return errors.Errorf("context cancelled before runc wrote to pidfile")
}
return errors.Wrap(err, "context cancelled before we found valid runc pid")
case <-time.After(50 * time.Millisecond):
@coryb (Collaborator, Author):

Not sure what value to use here for the retry interval; in my experience it was taking ~30-40ms for the pidfile to be updated. We could shorten this duration; there is probably not much harm in making this loop faster, maybe 5-10ms?
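Consolidated, the retry loop under discussion looks roughly like this; a sketch only, with the helper name waitForPid assumed, using github.com/pkg/errors as the surrounding code does:

```go
import (
	"context"
	"os"
	"strconv"
	"strings"
	"time"

	"github.com/pkg/errors"
)

// waitForPid polls until runc has written a valid pid to the pidfile,
// bounded by the caller's context; 50ms matches the interval above.
func waitForPid(ctx context.Context, pidfile string) (int, error) {
	for {
		data, err := os.ReadFile(pidfile)
		if err != nil {
			if !os.IsNotExist(err) {
				return -1, errors.Wrap(err, "unable to read pidfile")
			}
			// pidfile not written yet: wait, then retry
			select {
			case <-ctx.Done():
				return -1, errors.New("context cancelled before runc wrote pidfile")
			case <-time.After(50 * time.Millisecond):
				continue
			}
		}
		pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
		if err != nil {
			return -1, errors.Wrap(err, "invalid pidfile contents")
		}
		return pid, nil
	}
}
```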

Comment on lines 559 to 563
pidData, err := os.ReadFile(p.pidfile)
if err != nil {
return errors.Wrap(err, "unable to read pidfile after runc process started")
}
pid, err = strconv.Atoi(string(pidData))
@coryb (Collaborator, Author):

In theory, runc could partially write the pidfile while we are reading it? Not sure if this is actually possible. Also not sure how we could know; perhaps we could just add the os.FindProcess call to the retry loop?

Contributor:

  1. runc creates the pidfile atomically (create + rename), so no partial read is possible.
  2. AFAIK os.FindProcess() is a no-op on Linux (it merely wraps a pid in a structure and returns it). Meaning, it doesn't actually check whether the process exists, and it never returns an error. This might change in the future, of course.
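As an aside (not code from this PR): since os.FindProcess never fails on Unix, the conventional way to actually probe for a live process on Linux is the signal-0 trick:

```go
// pidExists probes a pid with "signal 0": nothing is delivered, but existence
// and permission checks still happen. ESRCH means no such process; EPERM
// means it exists but we may not signal it.
func pidExists(pid int) bool {
	err := syscall.Kill(pid, syscall.Signal(0))
	return err == nil || err == syscall.EPERM
}
```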


go func() {
defer close(pid2Done)
// TODO why doesn't Exec allow for started channel?? Fake it for now
@coryb (Collaborator, Author):

I have a branch adding a started channel to exec; it was not too complex. I will follow up with another PR for that after this one is resolved.

@coryb coryb requested a review from tonistiigi March 28, 2023 22:36
@tonistiigi (Member) commented:

We could abstract the code a bit to use runc kill for pid1 and the pidfile + process signal for exec'd processes, but not sure that adds any value here.

I'm not sure I agree with this. I see no mention of signal codepath changes in the #3722 description. The new code looks quite a lot more complicated, and these retry loops are not really clean. I think it would make sense to keep the old codepath and, if it had problems with exec (which AFAICS does not have issues with capturing the pid), add fixes for that.

We are also dealing with a regression in the patch release. We can't take chances on complicated patches with a lot of new code, so the preference is for simpler solutions that have already proven stable in the past over maximum code sharing between the run and exec codepaths. If the latter is the only way to fix the regression, we need to consider reverting #3722 from v0.11 instead.

@thaJeztah (Member) commented:

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

@kolyshkin @AkihiroSuda any ideas if anything could've changed on that?

@coryb (Collaborator, Author) commented Mar 28, 2023

If the latter is the only way to fix the regression, we need to consider reverting #3722 instead from v0.11.

At this point, reverting probably makes the most sense; I will open another PR for that. I am unable to find any other way to get the pid of the in-container process via runc exec without the polling/retry logic, and I'm not really sure how else to proceed with a fix for handling cancels on the runc exec process.

Comment on lines 548 to 550
if err != nil {
return errors.Wrapf(err, "unable to find runc monitor process for pid %d", runcPid)
}
Contributor:

If this code is Linux-specific you can safely do just p.MonitorProcess, _ = os.FindProcess(runcPid). See below for an explanation.

@kolyshkin (Contributor) commented:

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

@kolyshkin @AkihiroSuda any ideas if anything could've changed on that ?

I seriously doubt runc ever returns the in-container PID (as, except for exec, this would always be 1).

The "pid returned on the channel" bit is not about runc, since we do not have such API.

I guess this is about the github.com/containerd/go-runc package. If that's correct, then it writes the PID of the runc run command into the opts.Started channel, if it is set. See func (r *Runc) Run. This is neither the container PID on the host nor the container PID inside the container; it is the PID of the runc run invocation.

The initial PID of an in-container process, in the host PID namespace, is written into opts.PidFile, if set.
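In code, the distinction looks roughly like this, assuming the go-runc v1 API (Run, CreateOpts.Started, CreateOpts.PidFile) and eliding imports and setup:

```go
// startAndResolvePids shows which pid comes from where: the Started channel
// yields the pid of the `runc run` invocation itself, while the pidfile runc
// writes contains the in-container init's pid in the host PID namespace.
func startAndResolvePids(ctx context.Context, r *runc.Runc, id, bundle, pidfile string, rio runc.IO) (runcPid, initPid int, err error) {
	started := make(chan int, 1)
	go func() {
		// Run blocks until the container's init exits; Started fires early.
		_, _ = r.Run(ctx, id, bundle, &runc.CreateOpts{IO: rio, PidFile: pidfile, Started: started})
	}()
	runcPid = <-started // pid of the runc monitor process, NOT the container init
	// The pidfile appears shortly after Started fires; a real implementation
	// needs the retry loop discussed above rather than a single read.
	data, err := os.ReadFile(pidfile)
	if err != nil {
		return 0, 0, err
	}
	initPid, err = strconv.Atoi(strings.TrimSpace(string(data)))
	return runcPid, initPid, err
}
```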

Comment on lines 56 to 61
pidfile, err := os.CreateTemp("", "runc.*.pid")
if err != nil {
return errors.Wrap(err, "failed to create runc pid file")
}
pidfile.Close()
defer os.Remove(pidfile.Name())
Contributor:

This create/remove dance is racy. Perhaps it's better to create a temp directory and use it to store a pid file.
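A sketch of the suggested non-racy variant (hypothetical helper name; imports elided):

```go
// createPidfilePath reserves a path for runc's pidfile inside a private temp
// directory: the directory is unique, so there is no create/remove race on a
// predictable filename, and runc itself writes the file atomically.
func createPidfilePath() (pidfile string, cleanup func(), err error) {
	dir, err := os.MkdirTemp("", "runc")
	if err != nil {
		return "", nil, errors.Wrap(err, "failed to create pidfile directory")
	}
	return filepath.Join(dir, "runc.pid"), func() { os.RemoveAll(dir) }, nil
}
```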

Comment on lines 49 to 54
pidfile, err := os.CreateTemp("", "runc.*.pid")
if err != nil {
return errors.Wrap(err, "failed to create runc pid file")
}
pidfile.Close()
defer os.Remove(pidfile.Name())
Contributor:

ditto

@kolyshkin (Contributor) commented:

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

  1. You probably mean go-runc, not runc.
  2. This functionality was added by you in "allow for optional started channel to receive runc pid" (containerd/go-runc#69) and apparently hasn't changed since.
  3. Perhaps it makes sense to document the Started field of runc.CreateOpts to make sure everyone understands that this is the pid of runc itself, not the container init PID. Same for the PidFile field.

@kolyshkin (Contributor) commented:

Overall, the logic in this PR seems legit, but the description needs to be fixed.


// the pid reported from the started channel is the pid of the runc
// monitor process, but we will need to send signals to the process
// running in the container, which we can read the pidfile.
Contributor:

nit: s/read the pidfile/read from the pidfile/

if os.IsNotExist(err) {
select {
case <-ctx.Done():
return errors.Errorf("context cancelled before runc wrote pidfile")
Contributor:

nit: can use errors.New instead.

Comment on lines 578 to 579
// SIGWINCH signals to it, if we send them to the process inside the
// container nothing happens.
Contributor:

I would change that to

		// SIGWINCH signals to it, as terminal resizing is done in runc.

@tonistiigi (Member) commented:

I'm still confused about the overall design of this. We are using the go-runc library for controlling runc containers, and that library has a Kill method (https://github.com/containerd/go-runc/blob/v1.0.0/runc.go#L321) for sending signals to the container. Why are we not using that method, and instead have a loop with sleeps in the container startup path (which runs for every container, even ones that never receive a signal)?

@coryb (Collaborator, Author) commented Mar 29, 2023

Why are we not using that method and instead have a loop

We can only use runc kill (via that go-runc function) to signal the pid1 process. For processes created via runc exec there is no runc built-in command that will allow us to signal them, so we need the pidfile/loop in that scenario.

I have updated the code to use runc kill for processes launched via runc run, and to use the pidfile/loop for processes launched via runc exec. This gives the optimal path for the pid1 case, which is clearly the most common one. I think I could refactor to lazy-load the pidfile/loop logic when we actually try to send a signal, which I would guess is going to be after the pidfile is written in nearly every case (see the sketch after this comment).

I have been trying to simplify this code, but pretty clearly have not succeeded. The runc run vs runc exec code paths are nearly identical, but not quite, and those differences seem to keep causing problems (mostly through bad assumptions and my lack of understanding). I really appreciate the comments here, for example the difference between SIGWINCH vs other signals has been a source of confusion for me for a while.
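As a sketch of that lazy-load idea (hypothetical names; waitForPid is the retry loop sketched earlier), containers that never receive a signal would never pay for the pidfile polling:

```go
// lazyPid defers the pidfile polling until a signal is actually sent.
type lazyPid struct {
	pidfile string
	once    sync.Once
	pid     int
	err     error
}

// Pid resolves the in-container host pid on first use and caches the result.
// Note: with sync.Once an error from a cancelled first call is cached too; a
// real implementation may want to retry instead.
func (l *lazyPid) Pid(ctx context.Context) (int, error) {
	l.once.Do(func() {
		l.pid, l.err = waitForPid(ctx, l.pidfile)
	})
	return l.pid, l.err
}
```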

@tonistiigi (Member) commented:

For processes created via runc exec there is no runc built-in command that will allow us to signal them, so we need the pidfile/loop in that scenario.

Doesn't the runc process forward the signals in this case? https://github.com/opencontainers/runc/blob/main/exec.go#L191 https://github.com/opencontainers/runc/blob/0d62b950e60f6980b54fe3bafd9a9c608dc1df17/utils_linux.go#L245 https://github.com/opencontainers/runc/blob/0d62b950e60f6980b54fe3bafd9a9c608dc1df17/signals.go#L31

@coryb (Collaborator, Author) commented Mar 29, 2023

Doesn't the runc process forward the signals in this case?

Yeah, you are correct. I keep getting confused because I naively expected there to be a single way to send all signals to processes in the container, but it seems we have three distinct use cases for signals in our runc workers, with slightly different requirements for runc run vs runc exec:

  1. user-generated signals (SIGTERM, SIGINT, etc.)
    1. for runc run:
      1. use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. send the signal to the runc process directly, which will propagate it to the in-container process
    2. for runc exec:
      1. cannot use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. send the signal to the runc process directly, which will propagate it to the in-container process
  2. SIGKILL to terminate the in-container process on context cancel
    1. for runc run:
      1. use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. cannot send the signal to the runc process; this results in an unreaped child process
    2. for runc exec:
      1. cannot use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. cannot send the signal to the runc process; this results in an unreaped child process
  3. SIGWINCH
    1. send SIGWINCH to the runc process for both runc run and runc exec processes
    2. cannot use runc kill
    3. cannot send to the process via the host PID namespace (read from pidfile)

I think I kept getting confused because generally signals are propagated (case 1), and our tests were mostly working when we used runc to propagate signals, except for case 2, where it is obviously not possible to propagate SIGKILL. Sending SIGKILL to the runc process allowed the test cases to exit, but unreaped processes were left behind.
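Put together, the taxonomy above suggests a dispatch along these lines; a sketch with hypothetical names, where killer is the abstraction from the PR description:

```go
// runcProcess groups the two signal targets discussed above.
type runcProcess struct {
	monitorProcess *os.Process // the runc invocation itself
	killer         killer      // `runc kill` for run, pidfile pid for exec
}

// signalProcess routes a signal according to the three cases above.
func (p *runcProcess) signalProcess(ctx context.Context, sig syscall.Signal) error {
	switch sig {
	case syscall.SIGWINCH:
		// case 3: runc controls the tty, so resize signals go to runc itself
		return p.monitorProcess.Signal(sig)
	case syscall.SIGKILL:
		// case 2: SIGKILL cannot be forwarded by runc; target the in-container
		// process instead (runc kill or pidfile pid)
		return p.killer.Kill(ctx)
	default:
		// case 1: runc forwards ordinary signals to the in-container process
		return p.monitorProcess.Signal(sig)
	}
}
```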

@tonistiigi (Member) commented:

@coryb Thanks for the explanation. It looks like the best option could be to always send to the runc pid directly, except for SIGKILL, which would need custom handling. The implementation should be such that it doesn't affect the performance of containers that never receive SIGKILL.

The alternative would be to use runc kill for the main process (except SIGWINCH) and a hand-made solution for exec (+ SIGWINCH).

I'm not sure it is completely impossible for runc to clean up the child processes itself after SIGKILL with a pdeathsig/subreaper config, but that would be a discussion for the upstream repo.

@coryb coryb force-pushed the issue-3751 branch 2 times, most recently from 8c6e182 to 758706f on March 29, 2023 at 22:27
@coryb (Collaborator, Author) commented Mar 29, 2023

It looks like the best option could be to always send to the runc pid directly, except for SIGKILL, which would need custom handling. The implementation should be such that it doesn't affect the performance of containers that never receive SIGKILL.

I have gone ahead and refactored to send all signals to the runc process except when we try to Kill on cancel. For that I have added a procKiller abstraction to either call runc kill or read from the pidfile and then send the signal to that pid.

I have also fixed an edge case where a user sends SIGKILL over the Signal channel for containers created by gateway NewContainer; previously this would have killed the runc process rather than the in-container process.

I have also cleaned up a few comments that were ambiguous about which process we were working with.

@coryb coryb changed the title runc worker: read pid from pidfile runc worker: fix sigkill handling Mar 29, 2023
}()

if k.pidfile == "" {
// for `runc runc` process we use `runc kill` to terminate the process
Collaborator:

Should this be runc run?

@sipsma (Collaborator) left a review comment:

Just some nitpicks, nothing blocking, LGTM. Thank you for fixing this @coryb! I confirmed that this commit fixes the problem impacting some Dagger users currently.

cc @tonistiigi if you have time would appreciate getting this merged into master soon if it looks good to you too. Dagger relies on some commits only in the master branch of buildkit right now and we're trying to avoid the overhead of creating a fork of buildkit with patches applied as much as possible 🙏

// never send SIGKILL directly to runc, it needs to go to the
// process in-container
if err := runcProcess.killer.Kill(ctx); err != nil {
bklog.G(ctx).Errorf("failed to kill the process: %+v", err)
Collaborator:

nit: I think this log will mostly be a duplicate of the log on line 480

@coryb (Collaborator, Author):

Thanks, fixed.

if err != nil {
if os.IsNotExist(err) {
select {
case <-ctx.Done():
Collaborator:

I feel like it might be more robust to use a context like waitCtx, cancel := context.WithTimeout(ctx, 10*time.Second) or similar that's created in this function.

Reason being that we're now relying on the provided context to always have timeouts/cancellations set up properly, and even if that's true currently, it seems like something that could be missed in future modifications. So adding one extra layer of protection derived from the parent context probably doesn't hurt, IMO.

This is a huge nitpick though, cases where this would matter are likely quite obscure.

@coryb (Collaborator, Author):

Maybe I am missing something, but I think this edge case is covered. The ctx that is passed into the Kill function is created on line 564 like:

killCtx, timeout := context.WithTimeout(context.Background(), 7*time.Second)

This is the kill that is triggered on the request context being canceled. There is also the client-requested kill, where the client sends a SIGKILL on the Signal channel; that one will be using the cancelable Background context created in runcProcessHandle on line 543. So it is possible for the client-requested SIGKILL via the Signal channel to block, but the client will have a request context to cancel, which would trigger us to call kill again with the 7s timeout.

Collaborator:

Yeah, I totally agree that at this point there wouldn't be a need for a timeout inside this function. I was just thinking that, as changes to the rest of the code happen in the future, it would be easy to miss something like this and accidentally provide a context that could potentially result in blocking.

But again, this is not important for functionality at this exact moment and a pretty obscure case to begin with, so no need to block merging on this point if you don't think it's worth it.

@coryb (Collaborator, Author):

I see what you are saying; seems fine to me, added it as a fail-safe.
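The fail-safe amounts to deriving a bounded context inside the function itself; a sketch, with the 10s value illustrative:

```go
// Even if a future caller passes a context with no deadline, this function
// can no longer block indefinitely waiting on the pidfile.
waitCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
pid, err := waitForPid(waitCtx, pidfile) // retry loop from the earlier sketch
```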

@@ -33,22 +34,34 @@ func (w *runcExecutor) run(ctx context.Context, id, bundle string, process execu
}

func (w *runcExecutor) exec(ctx context.Context, id, bundle string, specsProcess *specs.Process, process executor.ProcessInfo, started func()) error {
return w.callWithIO(ctx, id, bundle, process, started, func(ctx context.Context, started chan<- int, io runc.IO) error {
isExec := true
Member:

This is somewhat weird now as we are passing a callback that does exec but then we are also passing a boolean to call the related behavior. Should the callback return the killer implementation as well?

@coryb (Collaborator, Author):

Yeah, you are correct, I think I can pass in the correct killer now, will update.

@coryb (Collaborator, Author):

Updated.

This fixes the incorrect kill handling introduced in
b76f8c0.  We need to send the
SIGKILL to the in-container process, not the runc process.  This patch
adds an abstraction over the kill handling:
  * for `runc run` processes use `runc kill`
  * for `runc exec` processes, read pid (in host PID namespace) from
    pidfile created by `runc exec`, then send the signal directly to
    that process.
Also use the kill abstraction when we receive a SIGKILL over the
signal channel for containers created by gateway NewContainer

Signed-off-by: coryb <cbennett@netflix.com>
@coryb coryb merged commit 3187d2d into moby:master Mar 31, 2023
@thaJeztah (Member) commented:

@tonistiigi does this one need a cherry-pick label?

@tonistiigi (Member) commented:

I think we were more on the side that because this is a big patch, we will not pick it and instead revert the previous fix in v0.11.

Development

Successfully merging this pull request may close these issues.

v0.11.5: interrupting Solve() no longer interrupts container processes (#3751)