Nomad not correctly recovering from container already exists error #22218

Closed
johnnyplaydrums opened this issue May 22, 2024 · 12 comments

@johnnyplaydrums

Nomad version

Output from nomad version
v1.7.3

Operating system and Environment details

Ubuntu 20.04.6 LTS

Issue

Hello folks! The issue we're seeing happens during container startup: Nomad tries to create a container and times out waiting for Docker to create it (context deadline exceeded (Client.Timeout exceeded while awaiting headers)), but the container does actually get created by Docker. Nomad then tries to create the container again a few times, hits the container already exists error on each attempt, and the job is marked as failed. A docker ps -a shows the container does indeed exist, in a Created state.

Nomad has logic to handle this exact failure mode, as we can see starting from this block, where Nomad detects that the container already exists, purges it, and recreates it. But there seems to be a bug in that block that prevents Nomad from ever reaching the purge-container step. As you can see from the debug logs below, Nomad never emits the purged container log line, so we know it isn't getting that far. I can't quite tell where the bug is or why the purge step is never reached, but I'm hoping a more experienced Go/Nomad dev can sleuth this one out.
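To make the flow concrete, here is a rough sketch of the retry loop as I understand it from that block. This is a simplified approximation, not the actual driver code: the client calls use the go-dockerclient API the driver is built on, containerByName stands in for the driver's lookup-by-name helper (not shown), and the error-string check just matches the message that shows up in the logs below.

import (
	"strings"
	"time"

	docker "github.com/fsouza/go-dockerclient"
)

// Simplified sketch of the create/purge/retry flow; containerByName is
// assumed to return (*docker.APIContainers, error) and is not shown here.
func createContainerWithRetry(client *docker.Client, opts docker.CreateContainerOptions) (*docker.Container, error) {
	attempted := 0
CREATE:
	// Ask dockerd to create the container. In our case this first call
	// times out client-side even though dockerd does create the container.
	container, err := client.CreateContainer(opts)
	if err == nil {
		return container, nil
	}

	if strings.Contains(err.Error(), "container already exists") {
		// Look the existing container up by name so it can be purged.
		existing, findErr := containerByName(client, opts.Name)
		if findErr != nil {
			return nil, findErr
		}
		if existing != nil {
			// This is the step we never see logged ("purged container").
			_ = client.RemoveContainer(docker.RemoveContainerOptions{
				ID:    existing.ID,
				Force: true,
			})
		}
		// If the lookup found nothing, nothing is purged and the loop
		// simply retries creation until the attempt budget runs out.
		if attempted < 5 {
			attempted++
			time.Sleep(time.Duration(attempted*100) * time.Millisecond)
			goto CREATE
		}
	}
	return nil, err
}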

Reproduction steps

I think the easiest way to reproduce this is to write a test case for the container already exists error. In a quick check of driver_test.go I didn't see a test case for this. I suspect that once a test is written, we'll find the bug quickly. We experience this about once a day in our environment, with 100s of batch jobs running daily, so it's very intermittent and might be hard to reproduce outside of a Go test case.

Expected Result

If Nomad tries to create a container and Docker times out, Nomad should purge the container and then try to create it again.

Actual Result

Nomad doesn't purge the existing container; it just repeatedly tries to create the already-existing container, and after 5 attempts it fails.

Nomad Client logs (if appropriate)

2024-05-22T16:54:36.903Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type=Received msg="Task received by client" failed=false
2024-05-22T16:54:36.903Z [DEBUG] client.alloc_runner.widmgr: retrieving and renewing workload identities: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 num_identities=0
2024-05-22T16:54:36.905Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference
2024-05-22T16:54:36.907Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type="Task Setup" msg="Building Task Directory" failed=false
2024-05-22T16:54:36.913Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference path=/usr/local/bin/nomad args=["/usr/local/bin/nomad", "logmon"]
2024-05-22T16:54:36.913Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference path=/usr/local/bin/nomad pid=1837990
2024-05-22T16:54:36.913Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference plugin=/usr/local/bin/nomad
2024-05-22T16:54:36.934Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon address=/tmp/plugin4179122346 network=unix timestamp=2024-05-22T16:54:36.934Z
2024-05-22T16:54:36.934Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference version=2
2024-05-22T16:54:36.935Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon path=/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/alloc/logs/.inference.stdout.fifo timestamp=2024-05-22T16:54:36.934Z
2024-05-22T16:54:36.935Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon path=/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/alloc/logs/.inference.stderr.fifo timestamp=2024-05-22T16:54:36.935Z
2024-05-22T16:54:36.960Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type=Driver msg="Downloading image" failed=false
2024-05-22T16:54:59.213Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=inference binds="[]string{\"/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/alloc:/alloc\", \"/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/inference/local:/local\", \"/opt/nomad/data/alloc/21aa43f7-31cd-db42-8708-ff467a9f8c96/inference/secrets:/secrets\"}"
2024-05-22T16:54:59.213Z [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=inference labels="map[com.hashicorp.nomad.alloc_id:21aa43f7-31cd-db42-8708-ff467a9f8c96 com.hashicorp.nomad.job_id:ultrasonic_flow/dispatch-1716396876-8795bc46 com.hashicorp.nomad.job_name:ultrasonic_flow/dispatch-1716396876-8795bc46 com.hashicorp.nomad.namespace:default com.hashicorp.nomad.node_id:df293b02-aeeb-b916-326a-d7d569a505a4 com.hashicorp.nomad.node_name:nomad-client-production-172-31-146-162 com.hashicorp.nomad.parent_job_id:ultrasonic_flow com.hashicorp.nomad.task_group_name:ultrasonic_flow com.hashicorp.nomad.task_name:inference]"
2024-05-22T16:54:59.213Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=inference container_name=<redacted_container_name>
2024-05-22T16:59:59.214Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=<redacted_container_name> image_name=<redacted_image_name> image_id=<redacted_image_name> attempt=1 error="Post \"http://unix.sock/containers/create?name=<redacted_container_name>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-05-22T16:59:59.316Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=<redacted_container_name> image_name=<redacted_image_name> image_id=<redacted_image_name> attempt=2 error="container already exists"
2024-05-22T16:59:59.525Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=<redacted_container_name> image_name=<redacted_image_name> image_id=<redacted_image_name> attempt=3 error="container already exists"
2024-05-22T16:59:59.933Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=<redacted_container_name> image_name=<redacted_image_name> image_id=<redacted_image_name> attempt=4 error="container already exists"
2024-05-22T17:00:00.742Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=<redacted_container_name> image_name=<redacted_image_name> image_id=<redacted_image_name> attempt=5 error="container already exists"
2024-05-22T17:00:02.350Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=<redacted_container_name> image_name=<redacted_image_name> image_id=<redacted_image_name> attempt=6 error="container already exists"
2024-05-22T17:00:02.357Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type="Driver Failure" msg="failed to create container: container already exists" failed=false
2024-05-22T17:00:02.414Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference error="failed to create container: container already exists"
2024-05-22T17:00:02.414Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference reason="Policy allows no restarts"
2024-05-22T17:00:02.414Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type="Not Restarting" msg="Policy allows no restarts" failed=true
2024-05-22T17:00:06.624Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon timestamp=2024-05-22T17:00:06.624Z
2024-05-22T17:00:06.624Z [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference @module=logmon timestamp=2024-05-22T17:00:06.624Z
2024-05-22T17:00:06.624Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference err="rpc error: code = Unavailable desc = error reading from server: EOF"
2024-05-22T17:00:06.626Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference plugin=/usr/local/bin/nomad id=1837990
2024-05-22T17:00:06.626Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference
2024-05-22T17:00:06.626Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference
2024-05-22T17:00:06.626Z [INFO]  client.gc: marking allocation for GC: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96
2024-05-22T17:10:40.840Z [INFO]  client.gc: garbage collecting allocation: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 reason="new allocations and over max (50)"
2024-05-22T17:10:40.842Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 task=inference type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
2024-05-22T17:10:40.843Z [INFO]  client.gc: marking allocation for GC: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96
2024-05-22T17:10:40.846Z [DEBUG] client.gc: alloc garbage collected: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96
2024-05-22T17:45:04.641Z [INFO]  client.gc: garbage collecting allocation: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96 reason="new allocations and over max (50)"
2024-05-22T17:45:04.641Z [DEBUG] client.gc: alloc garbage collected: alloc_id=21aa43f7-31cd-db42-8708-ff467a9f8c96
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 23, 2024
@pkazmierczak pkazmierczak self-assigned this May 24, 2024
@pkazmierczak pkazmierczak moved this from Needs Triage to Triaging in Nomad - Community Issues Triage May 24, 2024
@pkazmierczak pkazmierczak added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label May 24, 2024
@pkazmierczak
Contributor

Hi @johnnyplaydrums, thanks for reporting the issue. I'm looking into it and will get back to you.

@pkazmierczak
Contributor

Ok, so I tried reproducing the issue and so far no luck. A simple example unit test could go like this:

func TestDockerDriver_ContainerAlreadyExists(t *testing.T) {
	ci.Parallel(t)
	testutil.DockerCompatible(t)

	task, cfg, _ := dockerTask(t)
	must.NoError(t, task.EncodeConcreteDriverConfig(cfg))

	client := newTestDockerClient(t)
	driver := dockerDriverHarness(t, nil)
	cleanup := driver.MkAllocDir(task, true)
	defer cleanup()
	copyImage(t, task.TaskDir(), "busybox.tar")

	d, ok := driver.Impl().(*Driver)
	must.True(t, ok)

	_, err := d.createImage(task, cfg, client)
	must.NoError(t, err)

	containerCfg, err := d.createContainerConfig(task, cfg, cfg.Image)
	must.NoError(t, err)

	// create and start a container
	c, err := d.createContainer(client, containerCfg, cfg.Image)
	must.NoError(t, err)
	defer client.RemoveContainer(docker.RemoveContainerOptions{
		ID:    c.ID,
		Force: true,
	})

	// now that the container is running, start the task that uses it, and assert
	// that it doesn't end up in "container already exists" fail loop
	_, _, err = d.StartTask(task)
	must.NoError(t, err)

	defer d.DestroyTask(task.ID, true)
}

and that test passes. Looking in the debugger I can see that the logic works as expected: Nomad tries to create a container, finds that it already exists and isn't running, removes it, and increments the attempts counter. Once that's done and we GOTO the create step again, creation is successful.

The bug could be elsewhere. Perhaps it's Ubuntu 20.04-specific (I tested on Darwin), but that's odd, because then I'd expect the test to hit the error. Or perhaps it's a problem with a particular version of Docker? Which Docker version are you running?

@pkazmierczak pkazmierczak added stage/waiting-reply and removed stage/accepted Confirmed, and intend to work on. No timeline committment though. labels May 24, 2024
@pkazmierczak pkazmierczak moved this from Triaging to In Progress in Nomad - Community Issues Triage May 24, 2024
@johnnyplaydrums
Author

johnnyplaydrums commented May 24, 2024

@pkazmierczak thanks so much for digging into this! I was hoping a simple test case like the one you wrote would surface the bug 🤔 And as you mentioned, I'd expect our logs to either show the container being purged or show a purge error. We are seeing this on Docker version 26.1.3, build b72abbb. I'm taking another pass through this part of the Nomad code now to see if anything stands out.

@johnnyplaydrums
Author

johnnyplaydrums commented May 24, 2024

One possible way this could happen is if containerByName doesn't find the container. If the container is not found, it returns nil, nil, which skips all the blocks below, including the purge-container block, and just loops until it fails:

if !found {
    return nil, nil
}

There is some trace logging in the containerByName function that could be helpful, but to get those trace logs I'd have to use nomad monitor and catch this error live, right? Or set up a job/process to run nomad monitor continually... just trying to think through options.

@johnnyplaydrums
Author

johnnyplaydrums commented May 24, 2024

Currently running nomad monitor -log-level=trace > nomad.log on our 3 Nomad clients to see if I can catch this live. I was going to set up a job/process for this so it could keep monitoring for a few days, and got it running under systemd, but never mind, I'm silly and didn't realize the client can just log at trace level. I'm sure we'll catch one within a day or two.

My current theory is that containerByName is returning nil, nil. The trace logs could confirm that: we'd see a lot of listed container logs but never a Found container log. I'm a little suspicious of containerName := "/" + name, since the container names I see on the host don't have a prepended / 🤔.

containerByName isn't a widely used function; it's only used here and in network.go, so it's possible there's a bug there?
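For reference, here's a simplified sketch of roughly what containerByName does, using the go-dockerclient API; it's an approximation, not the real implementation. One note on the "/" prefix: the Docker list API reports names with a leading slash (a container created as "inference" is listed as "/inference"), and docker ps strips it, which is why you don't see it on the host.

import (
	docker "github.com/fsouza/go-dockerclient"
)

// containerByNameSketch approximates the driver's lookup: list all
// containers (including stopped/created ones) and match on name.
func containerByNameSketch(client *docker.Client, name string) (*docker.APIContainers, error) {
	containers, err := client.ListContainers(docker.ListContainersOptions{All: true})
	if err != nil {
		return nil, err
	}

	// Docker's API prefixes names with "/" in list results.
	containerName := "/" + name
	for i, c := range containers {
		for _, n := range c.Names {
			if n == containerName {
				// A "Found container" trace log would fire here.
				return &containers[i], nil
			}
		}
	}

	// Not found: the nil, nil path. If ListContainers is lagging behind a
	// create that dockerd has already accepted, we land here even though
	// dockerd just said the container exists, and the caller skips the
	// purge step entirely.
	return nil, nil
}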

p.s. just realized you can set agents with log_level = "trace" 🤦 the docs made it seem like this was not possible. My clients are now running with trace, and I will post the trace logs once we get the next container already exists error. We usually see it on weekdays due to the nature of our jobs, so it might be sometime next week. Stay tuned, and thanks again for your help!

@pkazmierczak
Contributor

My current theory is that containerByName is returning nil, nil. The trace logs could confirm that - we'd see a lot of listed container logs but never a Found container log.

If containerByName returns nil, nil, then there is indeed some issue with Docker. This method only gets called when Nomad gets a "container already exists" message from dockerd, and if that's indeed the bug (i.e., Docker reports that the container exists but Nomad can't find it), then there's little we can do. We can't purge the container, because we don't know which container to purge. The only reasonable option is to retry.

Or am I missing something?

@johnnyplaydrums
Author

johnnyplaydrums commented May 28, 2024

Made some great progress this weekend! After reviewing the trace logs from a few container already exists errors, I have a clearer sense of what is going on. It looks like some kind of race condition with Docker: Docker sometimes doesn't return the container in its list results before Nomad runs out of retries. The Nomad retries happen over the course of about 3-4 seconds, so if Docker is busy creating the container and slow to include it in the list results, Nomad will run out of retries before the container appears in the list. Maybe Docker is bogged down creating the container, which is why the creation times out in the first place, and that's also why it isn't yet in the list results. If we could extend the retries so that Nomad keeps trying for maybe 30 seconds or so, I bet that would just about squash this issue.

Here's an example where Nomad tries 3 times to list and find the container, and on the 3rd attempt Docker finally lists it. Nomad then successfully finds it, purges it, and is able to recreate the container and start the task. This is the happy path, where Nomad recovers from the container already exists error, albeit after 3 attempts:
success-after-3-tries.txt

Here's an example where Nomad tries to list the container 6 times, Docker never returns it, and the job fails:
failure-after-6-tries.txt

And, to further the race-condition point, here's a case where Nomad tries 6 times to list the container and actually finds and purges it on the 6th attempt, but since attempted isn't < 5, it doesn't get sent back to goto CREATE and just errors out. If Nomad had retried one more time, it would have succeeded:
almost-success-after-6-tries.txt

Sorry for all the logs 😅 I just want to make sure I provide enough data to make the point clearly. I think we could just about entirely avoid the issue by extending the retries so that Nomad keeps trying for maybe 30 seconds or so. What's the best way to do that? We could increase if attempted < 5 to some number that gets us closer to 30 seconds, or whatever you feel is appropriate.
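To illustrate the idea, here's a hypothetical sketch of a time-bounded retry, i.e. a wall-clock window instead of a fixed attempt count. It reuses the containerByNameSketch helper from the earlier sketch and is not the actual driver code, just one way the ~30 second window could be expressed:

import (
	"strings"
	"time"

	docker "github.com/fsouza/go-dockerclient"
)

// createWithDeadline keeps retrying the create/purge cycle until a
// wall-clock deadline passes (e.g. window = 30*time.Second) instead of
// giving up after 5 attempts, so a busy dockerd whose list results are
// lagging has time to catch up.
func createWithDeadline(client *docker.Client, opts docker.CreateContainerOptions, window time.Duration) (*docker.Container, error) {
	deadline := time.Now().Add(window)
	backoff := 100 * time.Millisecond

	for {
		container, err := client.CreateContainer(opts)
		if err == nil {
			return container, nil
		}
		if !strings.Contains(err.Error(), "container already exists") || time.Now().After(deadline) {
			return nil, err
		}

		// Try to find and purge the half-created container. If the list
		// API hasn't caught up yet, this is a no-op and we simply wait for
		// the next pass instead of burning one of a fixed set of attempts.
		if existing, _ := containerByNameSketch(client, opts.Name); existing != nil {
			_ = client.RemoveContainer(docker.RemoveContainerOptions{ID: existing.ID, Force: true})
		}

		time.Sleep(backoff)
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
}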

@johnnyplaydrums
Author

@pkazmierczak just want to say thanks for your help on this one so far! To sum things up concisely: Docker's ListContainers endpoint is eventually consistent. Nomad added retries to handle that eventual consistency, but the retries don't run long enough to cover a failure case we hit regularly.

@pkazmierczak
Contributor

@johnnyplaydrums Thanks for all the log messages and details. Here's what's going to happen now: I'm going to look into a possible race condition in how we remove containers. You seem to be creating a lot of batch jobs, so it could be that containers get removed outside the createContainer method and that's why we can't remove them. The second thing I want to look at is Docker upstream: are there reported data integrity issues? Because this kinda looks like one of them.

Changing the hardcoded if attempted < 5 retry is the tricky part, because ideally we'd want to make it a configurable field. It can of course be done, but I can't promise a timeline here.
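Purely as an illustration of the shape such a knob could take, it might be a field on the driver's plugin config, roughly like the following; the field name, tag, and placement here are hypothetical and may not match what actually ships:

// Hypothetical sketch only; the real field name, location, and default
// may differ from whatever the PR ends up doing.
type DriverConfig struct {
	// ... existing fields elided ...

	// ContainerExistsAttempts bounds how many times the driver retries the
	// create/purge cycle after a "container already exists" error.
	ContainerExistsAttempts int `codec:"container_exists_attempts"`
}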

@johnnyplaydrums
Author

@pkazmierczak that sounds good, and thank you for getting a PR up so quickly! Making that value configurable does sound like the best approach. Another (and/or additional) approach would be to make the CreateContainer timeout configurable, since that's the first issue Nomad hits (the timeout waiting for Docker to finish creating the container). Having either of those levers available to us is great, and I'm perfectly happy with the PR you have up.

No worries on timeline, I'm just glad to have your support on this one 🙇

@johnnyplaydrums
Author

I wouldn't mind closing this issue now that the PR is merged; I'll leave that up to you. I feel fairly confident this will solve our issue, and I can post back here once the PR makes it into a release and I'm able to test it in production.

Thank you again so much for your help here!!

@pkazmierczak
Contributor

Good stuff, @johnnyplaydrums. I'll close the issue for now; feel free to re-open it later if the issue persists.

Nomad - Community Issues Triage automation moved this from In Progress to Done May 31, 2024