
Refactor libcontainerd to minimize containerd RPCs #43564

Merged
merged 7 commits into from Aug 25, 2022

Conversation

@corhere (Contributor) commented May 4, 2022:

- What I did
While investigating a report of slow docker execs and failing health checks on heavily-loaded systems, I instrumented dockerd with OpenTelemetry tracing (https://github.com/corhere/moby/tree/otel-trace) and captured traces of various container operations. I was surprised to see RPCs such as containerd.services.containers.v1.Container/Get being called multiple times in a row for nearly every operation. In the common case of a local containerd process each RPC costs at least two context switches to complete, so we want to keep the number of RPCs per operation to a minimum to reduce the overhead of dockerd and containerd on the system. I overhauled libcontainerd and refactored how the daemon uses libcontainerd to do just that.
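
For context on how those traces were captured: tracing a gRPC client is typically wired in with interceptors. Below is a minimal sketch assuming the otelgrpc contrib package; it is not the exact instrumentation from the linked branch, just an illustration of how every containerd RPC becomes a visible span.

import (
	"context"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

// dialTraced opens a gRPC client connection whose RPCs are each recorded
// as a trace span, which is what makes runs of redundant
// containerd.services.containers.v1.Container/Get calls visible.
func dialTraced(ctx context.Context, target string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, target,
		grpc.WithInsecure(), // containerd listens on a local unix socket
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
		grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
	)
}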

I modified the behavior of the Container.State.OOMKilled flag to update immediately (rather than on container exit) to simplify the code and reduce the amount of state libcontainerd needs to track, with the side benefit of the new behavior being more intuitive to users.

I benchmarked these changes against the base commit (bb88ff4) and measured a reduction in wallclock time for docker exec of 6.8% in my test environment.

Benchmarking details
$ id=$(docker run -d alpine sleep infinity)
$ for x in {1..5}; do time for i in {1..500}; do docker exec $id true; done; done
          master    libcontainerd-overhaul
Run 1     45.768s   43.029s
Run 2     47.123s   42.953s
Run 3     46.916s   43.299s
Run 4     47.735s   43.245s
Run 5     43.926s   43.223s
Average   46.294s   43.150s
(43.150 - 46.294) / 46.294 = -6.79%

- How I did it
The containerd client library is very chatty, generally erring on the side of refreshing metadata over referencing a locally-cached copy. (The containerd gRPC API has no concurrency control so multiple clients could race to mutate the same resources. Aggressively refreshing metadata only shortens the race window; it does not eliminate it. Perhaps a future version of the containerd API could improve upon the situation so it can be less aggressive about refreshing. But I digress...) And further exacerbating the situation, the libcontainerd implementation is effectively forced to reload the containerd resources at the start of every operation because containers and processes are referenced by ID strings. To illustrate, consider the libcontainerd Start() method. Its implementation is, in broad strokes:

func (c *client) Start(ctx context.Context, id string, /* ... */) {
	ctr, _ := c.client.LoadContainer(ctx, id) // v1.Container/Get
	spec, _ := ctr.Spec(ctx) // v1.Container/Get
	labels, _ := ctr.Labels(ctx) // v1.Container/Get
	t, _ := ctr.NewTask(ctx, /* options derived from spec, labels */) // v1.Container/Get, v1.Task/Create
	_ = t.Start(ctx) // v1.Task/Start
}

That's four calls to containerd.services.containers.v1.Container/Get in a row, all to get the exact same metadata! Two of those redundant calls can be easily eliminated by reusing the container metadata initially loaded by c.client.LoadContainer. And the LoadContainer call could also be elided if the container object returned by c.client.NewContainer from when the container was created was retained. The number of RPCs to start a container can be cut in half without any changes to the containerd client! The containerd client resource objects need to be retained and reused to pull this off, which means more state for the Docker daemon to manage.
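
A hypothetical sketch of the leaner flow, assuming the containerd.Container object from creation is retained and its metadata is decoded locally (the helper name and task options are illustrative, not the PR's exact code):

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/typeurl"
)

func startOnce(ctx context.Context, ctr containerd.Container) error {
	info, err := ctr.Info(ctx) // v1.Container/Get -- the only explicit Get
	if err != nil {
		return err
	}
	spec, err := typeurl.UnmarshalAny(info.Spec) // decoded locally, no RPC
	if err != nil {
		return err
	}
	_, _ = spec, info.Labels // derive task options locally, no RPC
	t, err := ctr.NewTask(ctx, cio.NullIO) // v1.Task/Create (plus any Gets the client itself performs)
	if err != nil {
		return err
	}
	return t.Start(ctx) // v1.Task/Start
}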

The Docker daemon already keeps track of state for every container and exec, with mappings from ID strings to the respective state objects. Holding persistent state in the libcontainerd layer with mappings from ID strings to state objects would be redundant, so I overhauled the libcontainerd interfaces to allow for implementations which hold no shared mutable state. Not coincidentally, this shape mirrors the containerd client's interfaces. I delegated retaining references to the libcontainerd objects to the daemon's container state and exec config structs.
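
The resulting interface shape looks roughly like the following; names and signatures are illustrative, not the exact libcontainerd definitions:

import (
	"context"
	"syscall"
	"time"
)

// Consumers obtain a Container once and keep it; subsequent operations
// hang off the retained object instead of re-resolving an ID string.
type Container interface {
	NewTask(ctx context.Context /* stdio, checkpoint options */) (Task, error)
	Task(ctx context.Context) (Task, error)
	Delete(ctx context.Context) error
}

type Process interface {
	Pid() uint32
	Kill(ctx context.Context, signal syscall.Signal) error
	Delete(ctx context.Context) (exitCode uint32, exitedAt time.Time, err error)
}

type Task interface {
	Process
	Start(ctx context.Context) error
	Pause(ctx context.Context) error
	Resume(ctx context.Context) error
	Exec(ctx context.Context, id string /* spec, stdio */) (Process, error)
}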

The local Windows (HCSv1) libcontainerd implementation was a bit of a challenge to refactor to fit the new shape of the libcontainerd interface, but it also benefited from eliminating shared mutable state.

- How to verify it
CI ✅

- Description for the changelog

  • Reduced the number of RPCs required to perform operations with containerd-based runtimes
  • A container's State.OOMKilled flag is now set to true immediately upon any container process getting OOM-killed by the kernel, and cleared to false when the container is restarted

- A picture of a cute animal (not mandatory but encouraged)

Comment on lines -444 to -458
// Fix for https://github.com/moby/moby/issues/38719.
// If the init process failed to launch, we still need to reap the
// container to avoid leaking it.
//
// Note we use the explicit exit code of 127 which is the
// Linux shell equivalent of "command not found". Windows cannot
// know ahead of time whether or not the command exists, especially
// in the case of Hyper-V containers.
ctr.Unlock()
exitedAt := time.Now()
p := &process{
	id:  libcontainerdtypes.InitProcessName,
	pid: 0,
}
c.reapContainer(ctr, p, 127, exitedAt, nil, logger)
@corhere (Contributor, Author):
Note for reviewers: containers will not be leaked if creating the init process fails, despite the code being removed here. The daemon handles Start errors by cleaning up the container with (types.Container).Delete, and the local_windows implementation shuts down and closes the container, same as reapContainer did.

@corhere corhere marked this pull request as ready for review May 10, 2022 21:32
@corhere corhere requested a review from cpuguy83 as a code owner May 10, 2022 21:32
@thaJeztah thaJeztah added this to the v-next milestone May 12, 2022
@thaJeztah thaJeztah added this to In progress in containerd integration Jun 23, 2022
@thaJeztah (Member): needs a rebase 😅

@thaJeztah thaJeztah removed this from In progress in containerd integration Jul 11, 2022
@corhere corhere force-pushed the libcontainerd-overhaul branch 2 times, most recently from 666151d to 718120f Compare July 22, 2022 20:00
@thaJeztah (Member) left a comment:

will need to do another pass on this; left some comments / thoughts

Comment on lines +4608 to +4654
Whether a process within this container has been killed because it ran
out of memory since the container was last started.
@thaJeztah (Member):

TBH, I wasn't fully aware we also set this field if an exec or "any other process in the namespace" was OOM killed.

IIRC (but I'd have to dig history) the intent of this field was to learn "why did the container exit?" (which could be due to OOM), so (implicitly) this field was meant as indicator for the container's main process being OOM killed.

So, this change means a couple of things;

  • (already the case?) OOMKilled could mean "any process" was killed (some of which could've been "non-fatal")
  • 👍 setting this field immediately allows for such (non-fatal) occurrences to be observed while the container is running
  • ❓ but it may (potentially) be a red herring if the OOMKilled field is true but the OOM kill wasn't what ultimately caused the container to exit.
  • 👉 that said; if processes are being killed due to OOM in the container, it could still be useful information (the container exiting because one of its child processes was terminated, and the container running in a "degraded" state).

The above is neither "good", nor "bad", but "different". The only thing I'm wondering is; would there be reasons (would there be value) in being able to differentiate those? Thinking along the lines of;

  • A counter for OOM events (seeing the counter go up, but the container still running, means that it's potentially in degraded state and/or configured with insufficient resources).
  • OOMKilled to be reserved for the container's main process? (devil's advocate; the exit event itself may not be directly caused by OOM for the main process, but as a result of other processes)
  • ^^ possible solution to that would be to either deprecate OOMKilled (in favor of a counter?) or on exit; OOMKilled = len(OOMKilledCounter) > 0

@corhere (Contributor, Author):

OOMKilled always meant "any process" was killed. I have only changed when the flag is updated, and corrected the documentation.

Unfortunately, we can't reserve OOMKilled for the container's pid 1 because cgroups only keeps track of how many times the OOM killer has been invoked on processes in a group, not which processes in the group got killed. And even if we could, things could get really confusing with docker run --init and containers which use a fit-for-purpose init (e.g. s6-svscan) as pid 1. If we had the ability to know which processes got OOM-killed, I think it would be so much more useful to surface that information as discrete events associated with the container with timestamps, PIDs and /proc/<pid>/cmdlines.

Even without detailed information, I think surfacing OOMs as timestamped events would be far superior to a counter as it would allow the fatal OOM-kills to be differentiated from the non-fatal ones, even after the container exits. (Heuristic: how much time elapsed between the OOM-kill and the container stopping.) A close runner-up would be surfacing a counter along with the timestamp of the most recent OOM-killed event received by the daemon.


If only memory.oom.group were enabled in containers (AFAICT runC does not enable it). That would clear up any ambiguity quite nicely.
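
To make the new behavior concrete, here is a rough sketch of flipping the flag the moment containerd reports an OOM kill, using containerd's TaskOOM event type; ctrState and lookup are stand-ins for moby's real daemon state, not its actual code:

import (
	"sync"

	apievents "github.com/containerd/containerd/api/events"
)

// ctrState stands in for the daemon's per-container state object.
type ctrState struct {
	sync.Mutex
	OOMKilled bool
}

// onEvent sets OOMKilled as soon as any process in the container is
// OOM-killed; the flag is cleared again when the container restarts.
func onEvent(ev interface{}, lookup func(containerID string) *ctrState) {
	switch e := ev.(type) {
	case *apievents.TaskOOM:
		if s := lookup(e.ContainerID); s != nil {
			s.Lock()
			s.OOMKilled = true
			s.Unlock()
		}
	}
}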

// InitializeStdio is called by libcontainerd to connect the stdio.
func (c *Config) InitializeStdio(iop *cio.DirectIO) (cio.IO, error) {
func (c *ExecConfig) InitializeStdio(iop *cio.DirectIO) (cio.IO, error) {
@thaJeztah (Member):

Slightly wondering what the motivation was to move this into the container package (instead of keeping it as a smaller package). I'd probably need to check it locally to see if there are any issues w.r.t. non-exported fields (e.g.)?

(Also looking commit-by-commit, so perhaps there's more in the next commit(s) 😅)

@corhere (Contributor, Author):

I had to do it to avoid a circular import between the ./daemon/exec and ./container packages.

s.ctr = ctr
s.task = tsk
if tsk != nil {
s.Pid = int(tsk.Pid())
@thaJeztah (Member):

Wondering if we should consider changing our State.Pid to an uint32 to match containerd's type

@@ -57,8 +57,11 @@ func (daemon *Daemon) CheckpointCreate(name string, config types.CheckpointCreat
return err
}

if !container.IsRunning() {
return fmt.Errorf("Container %s not running", name)
container.Lock()
@thaJeztah (Member):

(Just thinking out loud) should we have an RWMutex instead of a Mutex?

@corhere (Contributor, Author):

Only if profiling reveals significant lock contention as RWMutex is slower than Mutex. golang/go#38813, golang/go#17973
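
For reference, a micro-benchmark sketch (not from the PR) of the cost difference the linked issues discuss; even uncontended, RLock/RUnlock does more bookkeeping than a plain Mutex:

import (
	"sync"
	"testing"
)

func BenchmarkMutex(b *testing.B) {
	var mu sync.Mutex
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

func BenchmarkRWMutexRLock(b *testing.B) {
	var mu sync.RWMutex
	for i := 0; i < b.N; i++ {
		mu.RLock()
		mu.RUnlock()
	}
}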

Comment on lines +370 to +369
switch status {
case containerd.Paused, containerd.Pausing:
// nothing to do
case containerd.Unknown, containerd.Stopped, "":
log.WithField("status", status).Error("unexpected status for paused container during restore")
@thaJeztah (Member):

Slightly wondering here; alive is set based on containerd state above, so would we ever be able to get into this situation?

Also, currently this code looks like;

switch {
	case SOME_CONDITION && alive:
	case OTHER_CONDITION && alive:
}
if !alive {
	// do other things
}

Instead of including the && alive in the switch condition, perhaps it'd be clearer to include it in a if/else for alive;

if alive {
	switch {
		case SOME_CONDITION:
		case OTHER_CONDITION:
	}
} else {
	// do other things
}

@corhere (Contributor, Author):

We could get into this situation if containerd's task state disagrees with docker's serialized container state.

perhaps it'd be clearer to...

func (*Daemon) restore() is in desperate need of a major refactor. It's nearly 400 lines long with a cyclomatic complexity to match! I would love to improve it, but this PR is already huge and tough enough to review as it is. I think it would be best for everyone's sanity to refactor restore() in a separate PR and include your suggested change there.

daemon/daemon_unix.go (resolved)
@tianon (Member) left a comment:

Generally looks OK to me - would love to get it into HEAD and have it receive more testing. 👀

It needs a rebase, but hopefully it's not too complicated (looks like maybe just some minor struct changes?)

daemon/exec.go Outdated
// Must use a new context since the current context is done.
ctx := context.TODO()
logrus.Debugf("Sending TERM signal to process %v in container %v", name, ec.Container.ID)
ec.Process.Kill(ctx, signal.SignalMap["TERM"])
@tianon (Member):

Relevant to #43739, this could be simpler if we were just SIGKILLing 👀 (this is the same bit of code, right?)

@corhere (Contributor, Author):

It is the same bit of code!

@cpuguy83 (Member): Looks like quite the rebase is needed 😢

@thaJeztah (Member):

> Looks like quite the rebase is needed 😢

I did a quick rebase in #43967 (pushed as a temporary PR to have a run of CI, but if it looks good, we can reset this branch to that branch)

@thaJeztah (Member): discussing in the maintainers meeting; let's get #43739 in first, so that we can backport that for 22.06, then merge this one 😅

@thaJeztah (Member): @corhere #43739 was merged, so this needs a rebase now; after that, I think it should be ready to go?

The daemon.containerd.Exec call does not access or mutate the
container's ExecCommands store in any way, and locking the exec config
is sufficient to synchronize with the event-processing loop. Locking
the ExecCommands store while starting the exec process only serves to
block unrelated operations on the container for an extended period of
time.

Convert the Store struct's mutex to an unexported field to prevent this
from regressing in the future.

Signed-off-by: Cory Snider <csnider@mirantis.com>
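
A sketch of the "unexported mutex" point from this commit message; the types are illustrative stand-ins, not moby's exact exec store:

import "sync"

// Config stands in for the daemon's exec config type.
type Config struct{ ID string }

// Store indexes exec configs by ID. With the mutex unexported, other
// packages can no longer lock the whole store around unrelated work.
type Store struct {
	mu   sync.Mutex
	byID map[string]*Config
}

func (s *Store) Get(id string) *Config {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.byID[id]
}
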
The containerd client is very chatty at the best of times. Because the
libcontainerd API is stateless and references containers and processes by
string ID for every method call, the implementation is essentially
forced to use the containerd client in a way which amplifies the number
of redundant RPCs invoked to perform any operation. The libcontainerd
remote implementation has to reload the containerd container, task
and/or process metadata for nearly every operation. This in turn
amplifies the number of context switches between dockerd and containerd
to perform any container operation or handle a containerd event,
increasing the load on the system which could otherwise be allocated to
workloads.

Overhaul the libcontainerd interface to reduce the impedance mismatch
with the containerd client so that the containerd client can be used
more efficiently. Split the API out into container, task and process
interfaces which the consumer is expected to retain so that
libcontainerd can retain state---especially the analogous containerd
client objects---without having to manage any state-store inside the
libcontainerd client.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The existing logic to handle container ID conflicts when attempting to
create a plugin container is not nearly as robust as the implementation
in daemon for user containers. Extract and refine the logic from daemon
and use it in the plugin executor.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Attempting to delete the directory while another goroutine is
concurrently executing a CheckpointTo() can fail on Windows due to file
locking. As all callers of CheckpointTo() are required to hold the
container lock, holding the lock while deleting the directory ensures
that there will be no interference.

Signed-off-by: Cory Snider <csnider@mirantis.com>
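
A minimal sketch of the locking described, assuming (as the commit message states) that every CheckpointTo caller holds the container lock; the type is a stand-in for the daemon's container:

import (
	"os"
	"sync"
)

// ctr stands in for the daemon's container type.
type ctr struct {
	sync.Mutex
	root string // per-container state directory
}

// removeRoot holds the container lock for the duration of the delete, so
// no concurrent CheckpointTo can have files open in the directory.
func removeRoot(c *ctr) error {
	c.Lock()
	defer c.Unlock()
	return os.RemoveAll(c.root)
}
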
Modifying the builtin Windows runtime to send the exited event
immediately upon the container's init process exiting, without first
waiting for the Compute System to shut down, perturbed the timings
enough to make TestWaitConditions flaky on that platform. Make
TestWaitConditions timing-independent by having the container wait
for input on STDIN before exiting.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Comment on lines +141 to +142
// Hold the container lock while deleting the container root directory
// so that other goroutines don't attempt to concurrently open files
@thaJeztah (Member):

Nice find 👍

@thaJeztah (Member) left a comment:

Went through this one a few more times. Lots of code, but I didn't spot any issues, and of course we'll now have the opportunity to give this some "burn in" time in CI. Looks like a great improvement of the existing code; thank you!

LGTM

@corhere (Contributor, Author) commented Aug 24, 2022:

=== FAIL: amd64.integration-cli TestDockerSuite/TestExecAPIStartValidCommand (0.96s)
    docker_api_exec_test.go:187: assertion failed: inspectJSON.ExecIDs is not nil
    --- FAIL: TestDockerSuite/TestExecAPIStartValidCommand (0.96s)

This test passed in the prior CI run so it's likely some kind of race condition. I'm looking into whether the test is just flaky or if my refactoring opened up a window of time where an invariant is violated.

We have integration tests which assert the invariant that a
GET /containers/{id}/json response lists only IDs of execs which are in
the Running state, according to GET /exec/{id}/json. The invariant could
be violated if those requests were to race the handling of the exec's
task-exit event. The coarse-grained locking of the container ExecStore
when starting an exec task was accidentally synchronizing
(*Daemon).ProcessEvent and (*Daemon).ContainerExecInspect to it just
enough to make it improbable for the integration tests to catch the
invariant violation on execs which exit immediately. Removing the
unnecessary locking made the underlying race condition more likely for
the tests to hit.

Maintain the invariant by deleting the exec from its container's
ExecCommands before clearing its Running flag. Additionally, fix other
potential data races with execs by ensuring that the ExecConfig lock is
held whenever a mutable field is read from or written to.

Signed-off-by: Cory Snider <csnider@mirantis.com>
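
In sketch form, the ordering that maintains the invariant (stand-in types, not moby's exact daemon code):

import "sync"

type ExecConfig struct {
	sync.Mutex
	ID       string
	Running  bool
	ExitCode *int
}

type ExecStore struct {
	mu    sync.Mutex
	execs map[string]*ExecConfig
}

func (s *ExecStore) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.execs, id)
}

// handleExit unlists the exec before marking it not-running, so a racing
// container inspect never lists an exec ID whose own state already reads
// "not running".
func handleExit(store *ExecStore, ec *ExecConfig, code int) {
	store.Delete(ec.ID)
	ec.Lock()
	ec.Running = false
	ec.ExitCode = &code
	ec.Unlock()
}
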
containerID := container.Run(ctx, t, cli, opts...)
poll.WaitOn(t, container.IsInState(ctx, cli, containerID, "running"), poll.WithTimeout(30*time.Second), poll.WithDelay(100*time.Millisecond))
opts := append([]func(*container.TestContainerConfig){
container.WithCmd("sh", "-c", "read -r; exit 99"),
@thaJeztah (Member):

😍 this is a much cleaner solution than sleep

@tianon tianon merged commit 0ec426a into moby:master Aug 25, 2022
@tianon (Member) commented Aug 25, 2022:

Bombs away!

@corhere corhere deleted the libcontainerd-overhaul branch August 25, 2022 20:15