-
Notifications
You must be signed in to change notification settings - Fork 256
[batch] Do not try to kill a non-existent container #10855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
batch/batch/worker/worker.py
Outdated
| if self.container_is_running(): | ||
| try: | ||
| log.info(f'{self} container is still running, killing crun process') | ||
| self.process.terminate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines still make me uneasy, because self.process.terminate() should be sending a SIGTERM. I am almost inclined to remove this line or put it as a later resort. So
- Try to ask crun to stop the container with
crun kill - If the above fails, try to
terminatethe crun process itself (it's unclear to me if we should do some sort of waiting to check whether the process responded to the terminate. I'm not sure what there is left to do at this point if it didn't other than log that we tried really hard and didn't succeed) - Finally set
process = None
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this seems reasonable, but I changed the signal for crun kill to SIGKILL as I think responding to a SIGTERM is optional. Maybe we call crun kill twice with two different signals.
batch/batch/worker/worker.py
Outdated
| except CalledProcessError: | ||
| log.info(f'container {self} cannot be killed because it does not exist', exc_info=True) | ||
| else: | ||
| await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGTERM') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are still not guaranteed that the container will exist at this point, as it could exit naturally, but it is probably worth checking to see how often that arises in practice.
daniel-goldstein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if my comment was a step backward or forward. I offered a counterpoint that questions whether calling state really tells us anything. If you think it would give us useful information, I'm fine with it as long as we self.process.terminate() in all failure modes.
batch/batch/worker/worker.py
Outdated
| try: | ||
| await check_exec_output('crun', 'state', self.container_name) | ||
| except CalledProcessError: | ||
| log.info(f'container {self} cannot be killed because it does not exist', exc_info=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we want this branch to get terminated too? Is there a possibility that the crun process is still setting up the container, so it doesn't exist yet, then we set self.process = None which leaves a zombie process?
With that in mind, it pretty much makes the except blocks here identical, and we log.info regardless in both cases. crun state erroring out is essentially telling us the same information as crun kill erroring out, it just perhaps feels kinder. I also think these nested try blocks are starting to get confusing.
I see this code as essentially the same, albeit with slightly less clean error messages:
try:
await check_exec_output('crun', 'state', self.container_name)
await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGKILL')
except CalledProcessError:
log.info(f'Could not delete container {self} because it does not exist', exc_info=True)
self.process.terminate()
finally:
self.process = NoneThe exc_info here could still tell us which one failed, but calling state starts to look not quite as useful as before. For the most part what we've done is change the order of self.process.terminate, which I think is much better now, and downgraded the error from before to info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still unsure of what I think the best approach is here, but I do think that as long as we're positive there are no leaks, we'll hit delete-nonexistent-container errors that we will have to ignore. I'm alright with trying out something like the above, though I'm unsure how to verify that the state of the system has improved and we're not just silencing errors.
We could have a monitoring loop that ensures the output of crun list approximately matches in length the number of containers we expect are running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also use something like an asyncio.wait_for with a reasonable timeout and wait for process to exit. If it doesn't do so within a few seconds then we log an exception. This might be the easiest approach to make sure that we see something if it's really borked, and what I am most worried about is self.process becoming a zombie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crun kill doesn't seem to provide any value over kill because already have the process id (self.process.pid). From the crun man page:
kill Send the specified signal to the container init process. If no signal is specified, SIGTERM is used.
What is wrong with this code:
if self.container_is_running():
try:
log.info(f'{self} container is still running, killing crun process')
self.process.terminate()
except ProcessLookupError:
pass
except CancelledError:
raise
except:
log.exception(f'while deleting container {self}')I check for ProcessLookupError because terminate is not always idempotent:
In [7]: import asyncio
...: async def test():
...: proc = await asyncio.create_subprocess_exec('/bin/bash', '-c', 'echo hello')
...: proc.terminate()
...: proc.terminate()
...: await asyncio.sleep(5)
...: proc.terminate()
...:
...: asyncio.get_event_loop().run_until_complete(test())
Traceback (most recent call last):
File "<ipython-input-7-498d2e69b74a>", line 9, in <module>
asyncio.get_event_loop().run_until_complete(test())
File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "<ipython-input-7-498d2e69b74a>", line 7, in test
proc.terminate()
File "/Users/dking/miniconda3/lib/python3.7/asyncio/subprocess.py", line 131, in terminate
self._transport.terminate()
File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 150, in terminate
self._check_proc()
File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 143, in _check_proc
raise ProcessLookupError()
ProcessLookupError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is because just sending a sigterm to the init process won't necessarily destroy all processes in the container, and hence the --all option. Why this is the default behavior I'm not sure. I believe I first just tried process.terminate and found some containers couldn't be deleted since they still had running processes.
batch/batch/worker/worker.py
Outdated
| try: | ||
| await check_exec_output('crun', 'state', self.container_name) | ||
| except CalledProcessError: | ||
| log.info(f'container {self} cannot be killed because it does not exist', exc_info=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crun kill doesn't seem to provide any value over kill because already have the process id (self.process.pid). From the crun man page:
kill Send the specified signal to the container init process. If no signal is specified, SIGTERM is used.
What is wrong with this code:
if self.container_is_running():
try:
log.info(f'{self} container is still running, killing crun process')
self.process.terminate()
except ProcessLookupError:
pass
except CancelledError:
raise
except:
log.exception(f'while deleting container {self}')I check for ProcessLookupError because terminate is not always idempotent:
In [7]: import asyncio
...: async def test():
...: proc = await asyncio.create_subprocess_exec('/bin/bash', '-c', 'echo hello')
...: proc.terminate()
...: proc.terminate()
...: await asyncio.sleep(5)
...: proc.terminate()
...:
...: asyncio.get_event_loop().run_until_complete(test())
Traceback (most recent call last):
File "<ipython-input-7-498d2e69b74a>", line 9, in <module>
asyncio.get_event_loop().run_until_complete(test())
File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "<ipython-input-7-498d2e69b74a>", line 7, in test
proc.terminate()
File "/Users/dking/miniconda3/lib/python3.7/asyncio/subprocess.py", line 131, in terminate
self._transport.terminate()
File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 150, in terminate
self._check_proc()
File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 143, in _check_proc
raise ProcessLookupError()
ProcessLookupError
|
Not sure this is better, but basically my logic is to kill the |
batch/batch/worker/worker.py
Outdated
| try: | ||
| await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGKILL') | ||
| except CalledProcessError: | ||
| log.info(f'could not delete container {self} because it does not exist', exc_info=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks both for the clarification.
Why do we kill the parent process first? It seems to me that we should kill all the processes within the container before trying to kill the parent. In fact, I would expect crun to always gracefully exit after the child processes finish, no?
Does this seem right?
try:
await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGKILL')
except CalledProcessError:
pass
try:
self.process.kill()
except ProcessLookupError:
passWe should determine (by testing) which exit code corresponds to "container no longer exists", and only pass when we get that exit code in the first try-except.
Aside: I'm partial to issuing a SIGTERM to both the in-container and controlling process and waiting 5 seconds for each to exit gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the container no longer exists error:
root@46dd36be3822:/# crun kill --all foo
2021-10-13T20:23:19.000811715Z: error opening file `/run/crun/foo/status`: No such file or directory
root@46dd36be3822:/# echo $?
1
|
I think this is pretty close. |
batch/batch/worker/worker.py
Outdated
| f'error opening file `/run/crun/{self.container_name}/status`: No such file or directory' in e.outerr): | ||
| log.exception(f'while deleting container {self}', exc_info=True) | ||
| except asyncio.CancelledError: | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm feeling good about this. We don't need the except here any more because we don't have any non-cancelled except in this try-finally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'll approve after this is removed.
danking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xx
|
Good work! |
There was a race condition where
crun runcould have been cancelled before the container was actually created. This change fixes this problem by checking whether the container exists before trying to delete it.