docker driver OOM killed detection flaky on cgroups v2 #13119

Open
shoenig opened this issue May 25, 2022 · 5 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/waiting-on-upstream (This issue is waiting on an upstream PR review), theme/driver/docker, type/bug

Comments

@shoenig (Member) commented May 25, 2022

This test has started failing a lot; it may be related to Ubuntu 22.04.

=== RUN   TestDockerDriver_OOMKilled
    docker.go:36: Successfully connected to docker daemon running version 20.10.16+azure-2
    driver_test.go:2421: not killed by OOM killer: Docker container exited with non-zero exit code: 137
--- FAIL: TestDockerDriver_OOMKilled (2.41s)
@jrasell added the theme/testing, theme/flaky-tests, and stage/accepted labels on May 26, 2022
@tgross self-assigned this on Jul 27, 2022
@tgross (Member) commented Jul 28, 2022

I've been able to reproduce this locally on a VM running Ubuntu 22.04 (Jammy), which is configured with cgroups v2, but I've been unable to reproduce it with any cgroups v1 configuration. With the following patch:

$ git diff
diff --git a/drivers/docker/handle.go b/drivers/docker/handle.go
index 0783bd901..c0f787a44 100644
--- a/drivers/docker/handle.go
+++ b/drivers/docker/handle.go
@@ -11,6 +11,7 @@ import (
        "time"

        "github.com/armon/circbuf"
+       "github.com/davecgh/go-spew/spew"
        docker "github.com/fsouza/go-dockerclient"
        "github.com/hashicorp/consul-template/signals"
        hclog "github.com/hashicorp/go-hclog"
@@ -239,6 +240,11 @@ func (h *taskHandle) run() {
        container, ierr := h.waitClient.InspectContainerWithOptions(docker.InspectContainerOptions{
                ID: h.containerID,
        })
+
+       scs := spew.NewDefaultConfig()
+       scs.DisableMethods = true
+       h.logger.Warn("container.State", "state", scs.Sdump(container.State))
+
        oom := false
        if ierr != nil {
                h.logger.Error("failed to inspect container", "error", ierr)

I can see the OOMKilled flag simply isn't being set:

2022-07-28T14:13:57.596Z [WARN]  docker/handle.go:246: docker: container.State: container_id=aac5d6b4620001d04314fd18bea623f5f767615778b88292b98706e4ad998263
  state=
  | (docker.State) {
  |  Status: (string) (len=6) "exited",
  |  Running: (bool) false,
  |  Paused: (bool) false,
  |  Restarting: (bool) false,
  |  OOMKilled: (bool) false,
  |  RemovalInProgress: (bool) false,
  |  Dead: (bool) false,
  |  Pid: (int) 0,
  |  ExitCode: (int) 137,
  |  Error: (string) "",
  |  StartedAt: (time.Time) {
  |   wall: (uint64) 485069344,
  |   ext: (int64) 63794614435,
  |   loc: (*time.Location)(<nil>)
  |  },
  |  FinishedAt: (time.Time) {
  |   wall: (uint64) 515940237,
  |   ext: (int64) 63794614437,
  |   loc: (*time.Location)(<nil>)
  |  },
  |  Health: (docker.Health) {
  |   Status: (string) "",
  |   FailingStreak: (int) 0,
  |   Log: ([]docker.HealthCheck) <nil>
  |  }
  | }

I traced this through go-dockerclient to moby/moby and ended up finding this very similar test failure on the moby project: moby/moby#41929. From there I dug down to a PR in containerd, containerd/containerd#6323, where the field they're reading from was changed. That change was released in containerd v1.6.3 and v1.5.12. The moby project has an unusual vendoring setup, but it looks like even the most recent moby tag is pinned to containerd v1.5.0 plus security fixes.

So we can't really fix this even by updating our moby dependency until upstream has done the same. I'm going to add a flag to the test to disable it on cgroups v2, with a pointer to this issue, and then mark this issue as waiting on upstream. I'll also drop a note over in moby/moby#41929 pointing them to the containerd PR.
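
A minimal sketch (not the actual change in #13928) of how a test could skip itself on a cgroups v2 host: on the unified hierarchy the file /sys/fs/cgroup/cgroup.controllers exists, while on cgroups v1 it does not. The helper name usingCgroupsV2 is illustrative:

package docker

import (
	"os"
	"testing"
)

// usingCgroupsV2 reports whether the host is running the cgroups v2 unified
// hierarchy; cgroup.controllers only exists at the cgroup root on v2.
func usingCgroupsV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func TestDockerDriver_OOMKilled(t *testing.T) {
	if usingCgroupsV2() {
		t.Skip("docker does not report OOMKilled on cgroups v2; see #13119")
	}
	// ... the existing OOM-kill assertions would run here on cgroups v1 hosts ...
}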

@tgross added the stage/waiting-on-upstream label on Jul 28, 2022
@tgross changed the title from "buggy docker test for oom killed" to "docker driver OOM killed test flaky on cgroups v2" on Jul 28, 2022
@tgross changed the title from "docker driver OOM killed test flaky on cgroups v2" to "docker driver OOM killed detection flaky on cgroups v2" on Jul 28, 2022
@tgross (Member) commented Jul 28, 2022

#13928 will remove the test flake. I've updated the labels for this issue to make it clear this is a problem with the driver and not the test.

@shoenig (Member, Author) commented Aug 1, 2022

Wow, nice detective work @tgross, thanks!

@mr-karan (Contributor) commented:

We recently faced an issue in prod (on cgroups v2, Ubuntu 22.04) where the docker container restarted just a few seconds after a restart. We think a memory build-up by the app in such a short span of time is impossible, as the normal memory usage is ~1 GB while the limit is 24 GB.

Just wanted to confirm: is the issue we faced related to the one being discussed above?

[image attachment]

@shoenig (Member, Author) commented Apr 12, 2023

Hi @mr-karan, I believe a docker container can exit with 137 (128 + SIGKILL) whether the cause is:

  • the task itself consumed memory beyond its limit,
  • the kernel is experiencing memory pressure, activates the cgroup oversubscription oom-killer, and chooses to kill containers, or
  • the docker process decides the host has insufficient memory.

Usually at least one of docker logs or dmesg should be able to shed light on the underlying kill reason.

That said, I'm currently in the process of upgrading our docker libraries, which are a couple of years old; the newer versions may contain bug fixes / improvements in this area.
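
As a rough illustration not taken from this thread: on a cgroups v2 host the kernel also records per-cgroup OOM kills in the memory.events file, which can help distinguish a kill inside the container's cgroup from the other causes listed above. A minimal Go sketch, assuming a systemd cgroup driver layout (the path and helper name are illustrative):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// oomKillCount parses the "oom_kill" counter from a cgroups v2 memory.events file.
func oomKillCount(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, s.Err()
}

func main() {
	// Illustrative path; the exact layout depends on the cgroup driver
	// (systemd vs cgroupfs) and on the container ID.
	path := "/sys/fs/cgroup/system.slice/docker-CONTAINERID.scope/memory.events"
	n, err := oomKillCount(path)
	if err != nil {
		fmt.Println("could not read memory.events:", err)
		return
	}
	fmt.Printf("oom_kill events recorded for this cgroup: %d\n", n)
}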
