runc worker: fix sigkill handling #3754

Merged: 1 commit merged into moby:master on Mar 31, 2023
Conversation

@coryb (Collaborator) commented Mar 28, 2023

This fixes the incorrect kill handling introduced in b76f8c0. We need to send the
SIGKILL to the in-container process, not the runc process. This patch adds an abstraction over the kill handling:

  • for runc run processes use runc kill
  • for runc exec processes, read pid (in host PID namespace) from pidfile created by runc exec, then send the signal directly to that process.

Also use the kill abstraction when we receive a SIGKILL over the signal channel for containers created by gateway NewContainer.

The pid returned on the started channel from runc.(Run|Exec) seems to be the pid of the runc monitor process, not the pid of the process run in the container. I have confirmed the pid written to the pidfile is the host pid of the process in the container, so we create temp pidfiles for runc to write to, letting us extract the correct pid to signal.

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

This fixes #3751.

Edit: thanks @kolyshkin for confirming I am just imagining things. The pids returned from go-runc are the pids of the runc process; the pidfiles written by the runc process contain the initial PID of the in-container process, in the host PID namespace.
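For illustration, here is a minimal sketch of the two strategies behind that abstraction. The names (killer, runcKiller, pidfileKiller) are hypothetical; the PR's actual code differs in detail:

```go
package killer

import (
	"context"
	"os"
	"strconv"
	"strings"
	"syscall"

	runc "github.com/containerd/go-runc"
)

// killer abstracts "terminate the in-container process".
type killer interface {
	Kill(ctx context.Context) error
}

// runcKiller covers `runc run` processes: `runc kill` signals the
// container's pid1 for us.
type runcKiller struct {
	runc *runc.Runc
	id   string
}

func (k runcKiller) Kill(ctx context.Context) error {
	return k.runc.Kill(ctx, k.id, int(syscall.SIGKILL), nil)
}

// pidfileKiller covers `runc exec` processes: read the host-namespace pid
// written by `runc exec --pid-file`, then signal that process directly.
type pidfileKiller struct {
	pidfile string
}

func (k pidfileKiller) Kill(ctx context.Context) error {
	data, err := os.ReadFile(k.pidfile)
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return err
	}
	return syscall.Kill(pid, syscall.SIGKILL)
}
```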

@coryb coryb force-pushed the issue-3751 branch 2 times, most recently from 26c29f3 to 393db6f on March 28, 2023 at 15:27
@coryb (Collaborator, Author) commented Mar 28, 2023

Ugh, this change is working for the test cases I was working with but causing other issues; will keep looking...

@coryb coryb force-pushed the issue-3751 branch 3 times, most recently from db5d937 to a2c70fa on March 28, 2023 at 21:30
@coryb (Collaborator, Author) left a review comment:

The basic issue is that the started channel passed into runc.(Run|Exec) returns the pid of the runc process, but generally we need to send signals to the process monitored by runc inside the container. We can get that process id via a pidfile that runc will write. Unfortunately the pidfile is created some time after the runc pid is returned on the started channel, so we have to loop waiting for the pidfile to have valid contents.

There is an exception: we need to send SIGWINCH signals to runc rather than to the process inside the container; I suspect that is because the runc process is controlling the tty.

This seems to be the only way we can manage processes launched via runc exec. With runc run we can use runc kill to signal the in-container process, but runc kill does not work with processes started via runc exec. We could abstract the code a bit to use runc kill for pid1 and the pidfile + process signal for exec'd processes, but I am not sure that adds any value here. Either way we still have to send SIGWINCH signals to the runc process (we can't use runc kill with SIGWINCH).

return errors.Errorf("context cancelled before runc wrote to pidfile")
}
return errors.Wrap(err, "context cancelled before we found valid runc pid")
case <-time.After(50 * time.Millisecond):
@coryb (Collaborator, Author):

Not sure what value to use here for the retry interval; in my experience it was taking ~30-40ms for the pidfile to be updated. We could shorten this duration; there is probably not much harm in making this loop faster, maybe 5-10ms?
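Consolidated, the retry loop under discussion looks roughly like this; a sketch only, with the helper name waitForPid assumed, using github.com/pkg/errors as the surrounding code does:

```go
import (
	"context"
	"os"
	"strconv"
	"strings"
	"time"

	"github.com/pkg/errors"
)

// waitForPid polls until runc has written a valid pid to the pidfile,
// bounded by the caller's context; 50ms matches the interval above.
func waitForPid(ctx context.Context, pidfile string) (int, error) {
	for {
		data, err := os.ReadFile(pidfile)
		if err != nil {
			if !os.IsNotExist(err) {
				return -1, errors.Wrap(err, "unable to read pidfile")
			}
			// pidfile not written yet: wait, then retry
			select {
			case <-ctx.Done():
				return -1, errors.New("context cancelled before runc wrote pidfile")
			case <-time.After(50 * time.Millisecond):
				continue
			}
		}
		pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
		if err != nil {
			return -1, errors.Wrap(err, "invalid pidfile contents")
		}
		return pid, nil
	}
}
```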

Comment on lines 559 to 563
pidData, err := os.ReadFile(p.pidfile)
if err != nil {
return errors.Wrap(err, "unable to read pidfile after runc process started")
}
pid, err = strconv.Atoi(string(pidData))
@coryb (Collaborator, Author):

In theory, runc could partially write the pidfile while we are reading it? Not sure if this is actually possible. Also not sure how we could know; perhaps we could just add the os.FindProcess call to the retry loop?

Contributor:

  1. runc creates the pidfile atomically (create + rename), so no partial read is possible.
  2. AFAIK os.FindProcess() is a no-op on Linux (it merely wraps a pid in a structure and returns it). Meaning, it doesn't actually check whether the process exists, and it never returns an error. This might change in the future, of course.
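As an aside (not code from this PR): since os.FindProcess never fails on Unix, the conventional way to actually probe for a live process on Linux is the signal-0 trick:

```go
// pidExists probes a pid with "signal 0": nothing is delivered, but existence
// and permission checks still happen. ESRCH means no such process; EPERM
// means it exists but we may not signal it.
func pidExists(pid int) bool {
	err := syscall.Kill(pid, syscall.Signal(0))
	return err == nil || err == syscall.EPERM
}
```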


go func() {
defer close(pid2Done)
// TODO why doesn't Exec allow for started channel?? Fake it for now
@coryb (Collaborator, Author):

I have a branch adding a started channel to exec; it was not too complex. I will follow up with another PR for that after this one is resolved.

@coryb coryb requested a review from tonistiigi March 28, 2023 22:36
@tonistiigi (Member) commented:

We could abstract the code a bit to use runc kill for pid1 and the pidfile + process signal for exec'd processes, but not sure that adds any value here.

I'm not sure I agree with this. I see no mention of signal codepath changes in the #3722 description. The new code looks quite a lot more complicated, and these retry loops are not really clean. I think it would make sense to keep the old codepath and, if it had problems with exec (which AFAICS does not have issues with capturing the pid), add fixes for that.

We are also dealing with a regression in the patch release. We can't take chances on complicated patches with a lot of new code, so the preference is for simpler solutions that have already proven stable in the past over maximum code sharing between the run and exec codepaths. If the latter is the only way to fix the regression, we need to consider reverting #3722 from v0.11 instead.

@thaJeztah (Member) commented:

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

@kolyshkin @AkihiroSuda any ideas if anything could've changed on that?

@coryb (Collaborator, Author) commented Mar 28, 2023

If the latter is the only way to fix the regression, we need to consider reverting #3722 instead from v0.11.

At this point, reverting probably makes the most sense; I will open another PR for that. I am unable to find any other way to get the pid of the in-container process via runc exec without the polling/retry logic, and I'm not really sure how else to proceed with a fix for handling cancels on the runc exec process.

Comment on lines 548 to 550
if err != nil {
return errors.Wrapf(err, "unable to find runc monitor process for pid %d", runcPid)
}
Contributor:

If this code is Linux-specific you can safely do just p.MonitorProcess, _ = os.FindProcess(runcPid). See below for an explanation.

@kolyshkin (Contributor) commented:

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

@kolyshkin @AkihiroSuda any ideas if anything could've changed on that ?

I seriously doubt runc ever returns the in-container PID (as, except for exec, this would always be 1).

The "pid returned on the channel" bit is not about runc, since we do not have such API.

I guess this is about the github.com/containerd/go-runc package. If that's correct, then it writes the PID of the runc run command into the opts.Started channel, if it is set. See func (r *Runc) Run. This is neither the container PID on the host nor the container PID inside the container; it is the PID of the runc run invocation.

The initial PID of an in-container process, in the host PID namespace, is written into opts.PidFile, if set.
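In code, the distinction looks roughly like this, assuming the go-runc v1 API (Run, CreateOpts.Started, CreateOpts.PidFile) and eliding imports and setup:

```go
// startAndResolvePids shows which pid comes from where: the Started channel
// yields the pid of the `runc run` invocation itself, while the pidfile runc
// writes contains the in-container init's pid in the host PID namespace.
func startAndResolvePids(ctx context.Context, r *runc.Runc, id, bundle, pidfile string, rio runc.IO) (runcPid, initPid int, err error) {
	started := make(chan int, 1)
	go func() {
		// Run blocks until the container's init exits; Started fires early.
		_, _ = r.Run(ctx, id, bundle, &runc.CreateOpts{IO: rio, PidFile: pidfile, Started: started})
	}()
	runcPid = <-started // pid of the runc monitor process, NOT the container init
	// The pidfile appears shortly after Started fires; a real implementation
	// needs the retry loop discussed above rather than a single read.
	data, err := os.ReadFile(pidfile)
	if err != nil {
		return 0, 0, err
	}
	initPid, err = strconv.Atoi(strings.TrimSpace(string(data)))
	return runcPid, initPid, err
}
```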

Comment on lines 56 to 61
pidfile, err := os.CreateTemp("", "runc.*.pid")
if err != nil {
return errors.Wrap(err, "failed to create runc pid file")
}
pidfile.Close()
defer os.Remove(pidfile.Name())
Contributor:

This create/remove dance is racy. Perhaps it's better to create a temp directory and use it to store a pid file.
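A sketch of the suggested non-racy variant (hypothetical helper name; imports elided):

```go
// createPidfilePath reserves a path for runc's pidfile inside a private temp
// directory: the directory is unique, so there is no create/remove race on a
// predictable filename, and runc itself writes the file atomically.
func createPidfilePath() (pidfile string, cleanup func(), err error) {
	dir, err := os.MkdirTemp("", "runc")
	if err != nil {
		return "", nil, errors.Wrap(err, "failed to create pidfile directory")
	}
	return filepath.Join(dir, "runc.pid"), func() { os.RemoveAll(dir) }, nil
}
```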

Comment on lines 49 to 54
pidfile, err := os.CreateTemp("", "runc.*.pid")
if err != nil {
return errors.Wrap(err, "failed to create runc pid file")
}
pidfile.Close()
defer os.Remove(pidfile.Name())
Contributor:

ditto

@kolyshkin (Contributor) commented:

I would have sworn that the pid returned on the channel used to be for the in-container process, so I am not sure if runc changed or I am just imagining things.

  1. You probably mean go-runc, not runc.
  2. This functionality was added by you in "allow for optional started channel to receive runc pid" (containerd/go-runc#69) and apparently hasn't changed since.
  3. Perhaps it makes sense to document the Started field of runc.CreateOpts to make sure everyone understands that this is the pid of runc itself, not the container init PID. Same for the PidFile field.

@kolyshkin (Contributor) commented:

Overall, the logic in this PR seems legit, but the description needs to be fixed.


// the pid reported from the started channel is the pid of the runc
// monitor process, but we will need to send signals to the process
// running in the container, which we can read the pidfile.
Contributor:

nit: s/read the pidfile/read from the pidfile/

if os.IsNotExist(err) {
select {
case <-ctx.Done():
return errors.Errorf("context cancelled before runc wrote pidfile")
Contributor:

nit: can use errors.New instead.

Comment on lines 578 to 579
// SIGWINCH signals to it, if we send them to the process inside the
// container nothing happens.
Contributor:

I would change that to

		// SIGWINCH signals to it, as terminal resizing is done in runc.

@tonistiigi (Member) commented:

I'm still confused about the overall design of this. We are using the go-runc library for controlling runc containers, and that library has a Kill method (https://github.com/containerd/go-runc/blob/v1.0.0/runc.go#L321) for sending signals to the container. Why are we not using that method, and instead have a loop with sleeps in the container startup path (which runs for every container, even ones that never receive a signal)?

@coryb (Collaborator, Author) commented Mar 29, 2023

Why are we not using that method and instead have a loop

We can only use runc kill (via that go-runc function) to signal the pid1 process. For processes created via runc exec there is no runc built-in command that will allow us to signal them, so we need the pidfile/loop in that scenario.

I have updated the code to use runc kill for processes launched via runc run, and to use the pidfile/loop for processes launched via runc exec. This gives the optimal path for the pid1 case, which is clearly the most common one. I think I could refactor to lazy-load the pidfile/loop logic when we actually try to send a signal, which I would guess is going to be after the pidfile is written in nearly every case (see the sketch after this comment).

I have been trying to simplify this code, but pretty clearly have not succeeded. The runc run vs runc exec code paths are nearly identical, but not quite, and those differences seem to keep causing problems (mostly through bad assumptions and my lack of understanding). I really appreciate the comments here, for example the difference between SIGWINCH vs other signals has been a source of confusion for me for a while.
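As a sketch of that lazy-load idea (hypothetical names; waitForPid is the retry loop sketched earlier), containers that never receive a signal would never pay for the pidfile polling:

```go
// lazyPid defers the pidfile polling until a signal is actually sent.
type lazyPid struct {
	pidfile string
	once    sync.Once
	pid     int
	err     error
}

// Pid resolves the in-container host pid on first use and caches the result.
// Note: with sync.Once an error from a cancelled first call is cached too; a
// real implementation may want to retry instead.
func (l *lazyPid) Pid(ctx context.Context) (int, error) {
	l.once.Do(func() {
		l.pid, l.err = waitForPid(ctx, l.pidfile)
	})
	return l.pid, l.err
}
```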

@tonistiigi (Member) commented:

For processes created via runc exec there is no runc built-in command that will allow us to signal them, so we need the pidfile/loop in that scenario.

Doesn't the runc process forward the signals in this case? https://github.com/opencontainers/runc/blob/main/exec.go#L191 https://github.com/opencontainers/runc/blob/0d62b950e60f6980b54fe3bafd9a9c608dc1df17/utils_linux.go#L245 https://github.com/opencontainers/runc/blob/0d62b950e60f6980b54fe3bafd9a9c608dc1df17/signals.go#L31

@coryb (Collaborator, Author) commented Mar 29, 2023

Doesn't the runc process forward the signals in this case?

Yeah, you are correct. I keep getting confused because I naively expected there to be a single way to send all signals to processes in the container, but it seems we have three distinct use cases for signals in our runc workers, with slightly different requirements for runc run vs runc exec:

  1. user-generated signals (SIGTERM, SIGINT, etc.)
    1. for runc run:
      1. use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. send the signal to the runc process directly, which will propagate it to the in-container process
    2. for runc exec:
      1. cannot use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. send the signal to the runc process directly, which will propagate it to the in-container process
  2. SIGKILL to terminate the in-container process on context cancel
    1. for runc run:
      1. use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. cannot send the signal to the runc process; this results in an unreaped child process
    2. for runc exec:
      1. cannot use runc kill
      2. send the signal directly to the process via the host PID namespace (read from pidfile)
      3. cannot send the signal to the runc process; this results in an unreaped child process
  3. SIGWINCH
    1. send SIGWINCH to the runc process for both runc run and runc exec processes
    2. cannot use runc kill
    3. cannot send to the process via the host PID namespace (read from pidfile)

I think I kept getting confused because generally signals are propagated (case 1), and our tests were mostly working when we used runc to propagate signals, except for case 2, where it is obviously not possible to propagate SIGKILL. Sending SIGKILL to the runc process allowed the test cases to exit, but unreaped processes were left behind.
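Put together, the taxonomy above suggests a dispatch along these lines; a sketch with hypothetical names, where killer is the abstraction from the PR description:

```go
// runcProcess groups the two signal targets discussed above.
type runcProcess struct {
	monitorProcess *os.Process // the runc invocation itself
	killer         killer      // `runc kill` for run, pidfile pid for exec
}

// signalProcess routes a signal according to the three cases above.
func (p *runcProcess) signalProcess(ctx context.Context, sig syscall.Signal) error {
	switch sig {
	case syscall.SIGWINCH:
		// case 3: runc controls the tty, so resize signals go to runc itself
		return p.monitorProcess.Signal(sig)
	case syscall.SIGKILL:
		// case 2: SIGKILL cannot be forwarded by runc; target the in-container
		// process instead (runc kill or pidfile pid)
		return p.killer.Kill(ctx)
	default:
		// case 1: runc forwards ordinary signals to the in-container process
		return p.monitorProcess.Signal(sig)
	}
}
```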

@tonistiigi (Member) commented:

@coryb Thanks for the explanation. It looks like the best option could be to always send to the runc pid directly, except for SIGKILL, which would need custom handling. The implementation should be such that it doesn't affect the performance of containers that never receive SIGKILL.

The alternative would be to use runc kill for the main process (except SIGWINCH) and a hand-made solution for exec (+ SIGWINCH).

I'm not sure it is completely impossible for runc to clean up the child processes itself after SIGKILL with a pdeathsig/subreaper config, but that would be a discussion for the upstream repo.

@coryb coryb force-pushed the issue-3751 branch 2 times, most recently from 8c6e182 to 758706f on March 29, 2023 at 22:27
@coryb (Collaborator, Author) commented Mar 29, 2023

It looks like the best option could be to always send to the runc pid directly, except for SIGKILL, which would need custom handling. The implementation should be such that it doesn't affect the performance of containers that never receive SIGKILL.

I have gone ahead and refactored to send all signals to the runc process except when we try to Kill on cancel. For that I have added a procKiller abstraction to either call runc kill or read from the pidfile and then send the signal to that pid.

I have also fixed an edge case where a user sends SIGKILL over the Signal channel for containers created by gateway NewContainer; previously this would have killed the runc process rather than the in-container process.

I have also cleaned up a few comments that were ambiguous about which process we were working with.

@coryb coryb changed the title runc worker: read pid from pidfile runc worker: fix sigkill handling Mar 29, 2023
}()

if k.pidfile == "" {
// for `runc runc` process we use `runc kill` to terminate the process
Collaborator:

Should this be runc run?

@sipsma (Collaborator) left a review comment:

Just some nitpicks, nothing blocking, LGTM. Thank you for fixing this @coryb! I confirmed that this commit fixes the problem impacting some Dagger users currently.

cc @tonistiigi if you have time would appreciate getting this merged into master soon if it looks good to you too. Dagger relies on some commits only in the master branch of buildkit right now and we're trying to avoid the overhead of creating a fork of buildkit with patches applied as much as possible 🙏

// never send SIGKILL directly to runc, it needs to go to the
// process in-container
if err := runcProcess.killer.Kill(ctx); err != nil {
bklog.G(ctx).Errorf("failed to kill the process: %+v", err)
Collaborator:

nit: I think this log will mostly be a duplicate of the log on line 480

@coryb (Collaborator, Author):

Thanks, fixed.

if err != nil {
if os.IsNotExist(err) {
select {
case <-ctx.Done():
Collaborator:

I feel like it might be more robust to use a context like waitCtx, cancel := context.WithTimeout(ctx, 10*time.Second) or similar that's created in this function.

Reason being that we're now relying on the provided context to always have timeouts/cancellations set up properly, and even if that's true currently, it seems like something that could be missed in future modifications. So adding one extra layer of protection derived from the parent context probably doesn't hurt, IMO.

This is a huge nitpick though, cases where this would matter are likely quite obscure.

@coryb (Collaborator, Author):

Maybe I am missing something, but I think this edge case is covered. The ctx that is passed into the Kill function is created on line 564 like:

killCtx, timeout := context.WithTimeout(context.Background(), 7*time.Second)

This is the kill that is triggered on the request context being canceled. There is also the client-requested kill, where the client sends a SIGKILL on the Signal channel; that one will be using the cancelable Background context created in runcProcessHandle on line 543. So it is possible for the client-requested SIGKILL via the Signal channel to block, but the client will have a request context to cancel, which would trigger us to call kill again with the 7s timeout.

Collaborator:

Yeah, I totally agree that at this point there wouldn't be a need for a timeout inside this function. I was just thinking that, as changes to the rest of the code happen in the future, it would be easy to miss something like this and accidentally provide a context that could potentially result in blocking.

But again, this is not important for functionality at this exact moment and a pretty obscure case to begin with, so no need to block merging on this point if you don't think it's worth it.

@coryb (Collaborator, Author):

I see what you are saying; seems fine to me, added it as a fail-safe.
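The fail-safe amounts to deriving a bounded context inside the function itself; a sketch, with the 10s value illustrative:

```go
// Even if a future caller passes a context with no deadline, this function
// can no longer block indefinitely waiting on the pidfile.
waitCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
pid, err := waitForPid(waitCtx, pidfile) // retry loop from the earlier sketch
```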

@@ -33,22 +34,34 @@ func (w *runcExecutor) run(ctx context.Context, id, bundle string, process execu
}

func (w *runcExecutor) exec(ctx context.Context, id, bundle string, specsProcess *specs.Process, process executor.ProcessInfo, started func()) error {
return w.callWithIO(ctx, id, bundle, process, started, func(ctx context.Context, started chan<- int, io runc.IO) error {
isExec := true
Member:

This is somewhat weird now as we are passing a callback that does exec but then we are also passing a boolean to call the related behavior. Should the callback return the killer implementation as well?

@coryb (Collaborator, Author):

Yeah, you are correct, I think I can pass in the correct killer now, will update.

@coryb (Collaborator, Author):

Updated.

This fixes the incorrect kill handling introduced in
b76f8c0.  We need to send the
SIGKILL to the in-container process, not the runc process.  This patch
adds an abstraction over the kill handling:
  * for `runc run` processes use `runc kill`
  * for `runc exec` processes, read pid (in host PID namespace) from
    pidfile created by `runc exec`, then send the signal directly to
    that process.
Also use the kill abstraction when we receive a SIGKILL over the
signal channel for containers created by gateway NewContainer

Signed-off-by: coryb <cbennett@netflix.com>
@coryb coryb merged commit 3187d2d into moby:master Mar 31, 2023
@thaJeztah (Member) commented:

@tonistiigi does this one need a cherry-pick label?

@tonistiigi (Member) commented:

I think we were more on the side that because this is a big patch, we will not pick it and instead revert the previous fix in v0.11.

Development

Successfully merging this pull request may close these issues.

v0.11.5: interrupting Solve() no longer interrupts container processes (#3751)