Nomad not correctly recovering from `container already exists` error #22218
Hi @johnnyplaydrums, thanks for reporting the issue. I'm looking into it and will get back to you.
Ok, so I tried reproducing the issue and so far no luck. A simple example unit test could go like this:
and that test passes. Looking in the debugger I can see that the logic works as expected: Nomad tries to create a container, but it's already there and not running, so Nomad attempts to remove the container and then increments the attempts counter. Once that's done and we GOTO the create step again, creation is successful. The bug could be elsewhere; perhaps it's ubuntu 20.04 specific (I tested on darwin), but that's weird, because then I'd expect to hit the error. Or perhaps it's a problem with a particular version of docker? Which docker version are you running?
@pkazmierczak thanks so much for digging into this! I was hoping a simple test case like the one you wrote would show a bug 🤔 And as you mentioned, I'd expect our logs to either show the container purged or show the purge error. We are seeing this on
One possible way this could happen is if
There is some trace logging that could be helpful in the
My current theory is that
p.s. just realized you can set agents with
If Or am I missing something?
Made some great progress this weekend! After reviewing the trace logs from a few container exists errors, I have a clearer sense of what is going on. It looks to be some kind of race condition with docker, with docker sometimes not returning the container in the list results before Nomad runs out of retries. The Nomad retries happen over the course of about 3-4 seconds, so if for whatever reason docker is busy creating the container and slow to return the container name in the list results, Nomad will run out of retries before the container appears in the list. Maybe docker is bogged down in the process of creating the container, which is why the creation times out in the first place, but the container isn't yet in the list results. If we could extend the retries to cover maybe 30 seconds or so, I bet that would just about squash this issue.

Here's an example where Nomad tries 3 times to list and find the container, and on the 3rd attempt, docker finally lists the container. Nomad then successfully finds it, purges it, and is able to recreate the container and start the task successfully. This is a happy path, as nomad is able to successfully recover from the container already exists error, albeit after 3 attempts:

Here's an example where nomad tries to list the container 6 times, docker never returns it, so the job fails:

And, to further the race condition point, here's a time where nomad tries 6 times to list the container, and actually finds and purges it on the 6th time, but since

Sorry for all the logs 😅 just want to make sure I provide enough data to make the point clearly. I think we could just about entirely avoid the issue by extending the retries so that Nomad tries for maybe 30 seconds or so. What's the best way to do that? We could increase
@pkazmierczak just want to say thanks for your help on this one so far! To sum things up concisely: the
@johnnyplaydrums Thanks for all the log messages and details. Here's what's gonna happen now: I'm gonna look into a possible race condition in our removing of containers. You seem to be creating a lot of batch jobs, so it could be the case that containers get removed outside the

Changing the hardcoded
@pkazmierczak that sounds good and thank you for getting a PR up so quickly! Making that value configurable does sound like the best approach. The other and/or additional configuration approach we could take is to make the

No worries on timeline, I'm just glad to have your support on this one 🙇
I wouldn't mind closing this issue now that the PR is merged, I'll leave that up to you. I feel fairly confident this will solve our issue. And I can post back here once this PR gets into a release and I'm able to test it in production. Thank you again so much for your help here!!
Good stuff, @johnnyplaydrums. I'll close the issue for now, feel free to re-open later if the issue persists.
Nomad version
Output from `nomad version`: v1.7.3
Operating system and Environment details
Ubuntu 20.04.6 LTS
Issue
Hello folks! The issue we're seeing happens during container startup: nomad tries to create a container, times out waiting for docker to create it (`context deadline exceeded (Client.Timeout exceeded while awaiting headers)`), but the container does actually get created by docker. Then nomad tries to create the container again a few times but hits the `container already exists` error each time, and the job is marked as failed. A `docker ps -a` shows the container does indeed exist, in a `Created` state.

Nomad has logic to handle this exact failure mode, as we can see starting from this block, where nomad detects the container already exists, then purges the container and tries to recreate it. But there seems to be a bug in that block causing nomad not to make it to the purge container step. As you can see from the debug logs below, Nomad doesn't log the `purged container` log line, so we know it's not getting to that point. I can't quite tell where the bug is or why it's not reaching the purge container step, but I'm hoping a more experienced go/nomad dev can sleuth this one out.

Reproduction steps
I think the easiest way to reproduce this is to write a test case for the `container already exists` error. In a quick check of `driver_test.go` I didn't see a test case for this. I suspect once a test is written, we'll find the bug quickly. We experience this about once a day in our environment, with 100s of batch jobs being run daily, so it's very intermittent and might be hard to reproduce outside of a golang test case.
If Nomad tries to create a container and Docker times out, Nomad should purge the container, then try to create it.
Actual Result
Nomad doesn't purge the existing container; it just continually tries to create the already existing container, and after 5 attempts it fails.
Nomad Client logs (if appropriate)