docker driver OOM killed detection flaky on cgroups v2 #13119

Open
shoenig opened this issue May 25, 2022 · 5 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/waiting-on-upstream (This issue is waiting on an upstream PR review), theme/driver/docker, type/bug

Comments

@shoenig (Member) commented May 25, 2022

This test has started failing a lot; it may be related to Ubuntu 22.04.

=== RUN   TestDockerDriver_OOMKilled
    docker.go:36: Successfully connected to docker daemon running version 20.10.16+azure-2
    driver_test.go:2421: not killed by OOM killer: Docker container exited with non-zero exit code: 137
--- FAIL: TestDockerDriver_OOMKilled (2.41s)
@jrasell added the theme/testing, theme/flaky-tests, and stage/accepted labels on May 26, 2022
@tgross self-assigned this on Jul 27, 2022
@tgross (Member) commented Jul 28, 2022

I've been able to reproduce this locally on a VM running Ubuntu 22.04 (Jammy), which is configured with cgroups v2, but I've been unable to reproduce it with any cgroups v1 configuration. With the following patch:

$ git diff
diff --git a/drivers/docker/handle.go b/drivers/docker/handle.go
index 0783bd901..c0f787a44 100644
--- a/drivers/docker/handle.go
+++ b/drivers/docker/handle.go
@@ -11,6 +11,7 @@ import (
        "time"

        "github.com/armon/circbuf"
+       "github.com/davecgh/go-spew/spew"
        docker "github.com/fsouza/go-dockerclient"
        "github.com/hashicorp/consul-template/signals"
        hclog "github.com/hashicorp/go-hclog"
@@ -239,6 +240,11 @@ func (h *taskHandle) run() {
        container, ierr := h.waitClient.InspectContainerWithOptions(docker.InspectContainerOptions{
                ID: h.containerID,
        })
+
+       scs := spew.NewDefaultConfig()
+       scs.DisableMethods = true
+       h.logger.Warn("container.State", "state", scs.Sdump(container.State))
+
        oom := false
        if ierr != nil {
                h.logger.Error("failed to inspect container", "error", ierr)

I can see the OOMKilled flag simply isn't being set:

2022-07-28T14:13:57.596Z [WARN]  docker/handle.go:246: docker: container.State: container_id=aac5d6b4620001d04314fd18bea623f5f767615778b88292b98706e4ad998263
  state=
  | (docker.State) {
  |  Status: (string) (len=6) "exited",
  |  Running: (bool) false,
  |  Paused: (bool) false,
  |  Restarting: (bool) false,
  |  OOMKilled: (bool) false,
  |  RemovalInProgress: (bool) false,
  |  Dead: (bool) false,
  |  Pid: (int) 0,
  |  ExitCode: (int) 137,
  |  Error: (string) "",
  |  StartedAt: (time.Time) {
  |   wall: (uint64) 485069344,
  |   ext: (int64) 63794614435,
  |   loc: (*time.Location)(<nil>)
  |  },
  |  FinishedAt: (time.Time) {
  |   wall: (uint64) 515940237,
  |   ext: (int64) 63794614437,
  |   loc: (*time.Location)(<nil>)
  |  },
  |  Health: (docker.Health) {
  |   Status: (string) "",
  |   FailingStreak: (int) 0,
  |   Log: ([]docker.HealthCheck) <nil>
  |  }
  | }

I traced this through go-dockerclient to moby/moby and ended up finding this very similar test failure on the moby project: moby/moby#41929. From there I dug down to a PR in containerd, containerd/containerd#6323, where the field they're reading from was changed. That change was released in containerd v1.6.3 and v1.5.12. The moby project has an unusual vendoring setup, but it looks like even the most recent moby tag is pinned to containerd v1.5.0 plus security fixes.

So we can't really fix this even by updating our moby dependency until upstream has done the same. I'm going to add a flag to the test to disable it on cgroups v2, with a pointer to this issue, and then mark this issue as waiting on upstream. I'll also drop a note over in moby/moby#41929 pointing them to the containerd PR.
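
A minimal sketch (not the actual change in #13928) of how a test could skip itself on a cgroups v2 host: on the unified hierarchy the file /sys/fs/cgroup/cgroup.controllers exists, while on cgroups v1 it does not. The helper name usingCgroupsV2 is illustrative:

package docker

import (
	"os"
	"testing"
)

// usingCgroupsV2 reports whether the host is running the cgroups v2 unified
// hierarchy; cgroup.controllers only exists at the cgroup root on v2.
func usingCgroupsV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func TestDockerDriver_OOMKilled(t *testing.T) {
	if usingCgroupsV2() {
		t.Skip("docker does not report OOMKilled on cgroups v2; see #13119")
	}
	// ... the existing OOM-kill assertions would run here on cgroups v1 hosts ...
}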

@tgross added the stage/waiting-on-upstream label on Jul 28, 2022
@tgross changed the title from "buggy docker test for oom killed" to "docker driver OOM killed test flaky on cgroups v2" on Jul 28, 2022
@tgross changed the title from "docker driver OOM killed test flaky on cgroups v2" to "docker driver OOM killed detection flaky on cgroups v2" on Jul 28, 2022
@tgross (Member) commented Jul 28, 2022

#13928 will remove the test flake. I've updated the labels for this issue to make it clear this is a problem with the driver and not the test.

@shoenig (Member, Author) commented Aug 1, 2022

Wow, nice detective work @tgross, thanks!

@mr-karan (Contributor) commented:

We recently faced an issue in prod (on cgroups v2, Ubuntu 22.04) where the docker container restarted just a few seconds after a restart. We think a memory build-up by the app in such a short span of time is impossible, as the normal memory usage is ~1 GB while the limit is 24 GB.

Just wanted to confirm: is the issue we faced related to the one being discussed above?

[image attachment]

@shoenig (Member, Author) commented Apr 12, 2023

Hi @mr-karan, I believe a docker container can exit with 137 (128 + SIGKILL) whether the cause is:

  • the task itself consumed memory beyond its limit,
  • the kernel is experiencing memory pressure, activates the cgroup oversubscription oom-killer, and chooses to kill containers, or
  • the docker process decides the host has insufficient memory.

Usually at least one of docker logs or dmesg should be able to shed light on the underlying kill reason.

That said, I'm currently in the process of upgrading our docker libraries, which are a couple of years old; the newer versions may contain bug fixes / improvements in this area.
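
As a rough illustration not taken from this thread: on a cgroups v2 host the kernel also records per-cgroup OOM kills in the memory.events file, which can help distinguish a kill inside the container's cgroup from the other causes listed above. A minimal Go sketch, assuming a systemd cgroup driver layout (the path and helper name are illustrative):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// oomKillCount parses the "oom_kill" counter from a cgroups v2 memory.events file.
func oomKillCount(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, s.Err()
}

func main() {
	// Illustrative path; the exact layout depends on the cgroup driver
	// (systemd vs cgroupfs) and on the container ID.
	path := "/sys/fs/cgroup/system.slice/docker-CONTAINERID.scope/memory.events"
	n, err := oomKillCount(path)
	if err != nil {
		fmt.Println("could not read memory.events:", err)
		return
	}
	fmt.Printf("oom_kill events recorded for this cgroup: %d\n", n)
}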
