Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Do not try to kill a non-existent container #10855

Merged
merged 10 commits into from Oct 14, 2021

Conversation

jigold
Copy link
Collaborator

@jigold jigold commented Sep 8, 2021

There was a race condition where crun run could have been cancelled before the container was actually created. This change fixes this problem by checking whether the container exists before trying to delete it.

@@ -841,7 +841,12 @@ async def delete_container(self):
log.info(f'{self} container is still running, killing crun process')
self.process.terminate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines still make me uneasy, because self.process.terminate() should be sending a SIGTERM. I am almost inclined to remove this line or put it as a later resort. So

  1. Try to ask crun to stop the container with crun kill
  2. If the above fails, try to terminate the crun process itself (it's unclear to me if we should do some sort of waiting to check whether the process responded to the terminate. I'm not sure what there is left to do at this point if it didn't other than log that we tried really hard and didn't succeed)
  3. Finally set process = None

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this seems reasonable, but I changed the signal for crun kill to SIGKILL as I think responding to a SIGTERM is optional. Maybe we call crun kill twice with two different signals.

except CalledProcessError:
log.info(f'container {self} cannot be killed because it does not exist', exc_info=True)
else:
await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGTERM')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still not guaranteed that the container will exist at this point, as it could exit naturally, but it is probably worth checking to see how often that arises in practice.

Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if my comment was a step backward or forward. I offered a counterpoint that questions whether calling state really tells us anything. If you think it would give us useful information, I'm fine with it as long as we self.process.terminate() in all failure modes.

try:
await check_exec_output('crun', 'state', self.container_name)
except CalledProcessError:
log.info(f'container {self} cannot be killed because it does not exist', exc_info=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we want this branch to get terminated too? Is there a possibility that the crun process is still setting up the container, so it doesn't exist yet, then we set self.process = None which leaves a zombie process?

With that in mind, it pretty much makes the except blocks here identical, and we log.info regardless in both cases. crun state erroring out is essentially telling us the same information as crun kill erroring out, it just perhaps feels kinder. I also think these nested try blocks are starting to get confusing.

I see this code as essentially the same, albeit with slightly less clean error messages:

try:
    await check_exec_output('crun', 'state', self.container_name)
    await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGKILL')
except CalledProcessError:
    log.info(f'Could not delete container {self} because it does not exist', exc_info=True)
    self.process.terminate()
finally:
    self.process = None

The exc_info here could still tell us which one failed, but calling state starts to look not quite as useful as before. For the most part what we've done is change the order of self.process.terminate, which I think is much better now, and downgraded the error from before to info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still unsure of what I think the best approach is here, but I do think that as long as we're positive there are no leaks, we'll hit delete-nonexistent-container errors that we will have to ignore. I'm alright with trying out something like the above, though I'm unsure how to verify that the state of the system has improved and we're not just silencing errors.

We could have a monitoring loop that ensures the output of crun list approximately matches in length the number of containers we expect are running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also use something like an asyncio.wait_for with a reasonable timeout and wait for process to exit. If it doesn't do so within a few seconds then we log an exception. This might be the easiest approach to make sure that we see something if it's really borked, and what I am most worried about is self.process becoming a zombie.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crun kill doesn't seem to provide any value over kill because already have the process id (self.process.pid). From the crun man page:

kill Send the specified signal to the container init process. If no signal is specified, SIGTERM is used.

What is wrong with this code:

        if self.container_is_running():
             try:
                 log.info(f'{self} container is still running, killing crun process')
                 self.process.terminate()
             except ProcessLookupError:
                 pass
             except CancelledError:
                 raise
             except:
                 log.exception(f'while deleting container {self}')

I check for ProcessLookupError because terminate is not always idempotent:

In [7]: import asyncio 
   ...: async def test(): 
   ...:     proc = await asyncio.create_subprocess_exec('/bin/bash', '-c', 'echo hello') 
   ...:     proc.terminate() 
   ...:     proc.terminate() 
   ...:     await asyncio.sleep(5) 
   ...:     proc.terminate() 
   ...:  
   ...: asyncio.get_event_loop().run_until_complete(test())                                                                                                   
Traceback (most recent call last):
  File "<ipython-input-7-498d2e69b74a>", line 9, in <module>
    asyncio.get_event_loop().run_until_complete(test())
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "<ipython-input-7-498d2e69b74a>", line 7, in test
    proc.terminate()
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/subprocess.py", line 131, in terminate
    self._transport.terminate()
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 150, in terminate
    self._check_proc()
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 143, in _check_proc
    raise ProcessLookupError()
ProcessLookupError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is because just sending a sigterm to the init process won't necessarily destroy all processes in the container, and hence the --all option. Why this is the default behavior I'm not sure. I believe I first just tried process.terminate and found some containers couldn't be deleted since they still had running processes.

danking
danking previously requested changes Oct 7, 2021
try:
await check_exec_output('crun', 'state', self.container_name)
except CalledProcessError:
log.info(f'container {self} cannot be killed because it does not exist', exc_info=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crun kill doesn't seem to provide any value over kill because already have the process id (self.process.pid). From the crun man page:

kill Send the specified signal to the container init process. If no signal is specified, SIGTERM is used.

What is wrong with this code:

        if self.container_is_running():
             try:
                 log.info(f'{self} container is still running, killing crun process')
                 self.process.terminate()
             except ProcessLookupError:
                 pass
             except CancelledError:
                 raise
             except:
                 log.exception(f'while deleting container {self}')

I check for ProcessLookupError because terminate is not always idempotent:

In [7]: import asyncio 
   ...: async def test(): 
   ...:     proc = await asyncio.create_subprocess_exec('/bin/bash', '-c', 'echo hello') 
   ...:     proc.terminate() 
   ...:     proc.terminate() 
   ...:     await asyncio.sleep(5) 
   ...:     proc.terminate() 
   ...:  
   ...: asyncio.get_event_loop().run_until_complete(test())                                                                                                   
Traceback (most recent call last):
  File "<ipython-input-7-498d2e69b74a>", line 9, in <module>
    asyncio.get_event_loop().run_until_complete(test())
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "<ipython-input-7-498d2e69b74a>", line 7, in test
    proc.terminate()
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/subprocess.py", line 131, in terminate
    self._transport.terminate()
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 150, in terminate
    self._check_proc()
  File "/Users/dking/miniconda3/lib/python3.7/asyncio/base_subprocess.py", line 143, in _check_proc
    raise ProcessLookupError()
ProcessLookupError

@jigold
Copy link
Collaborator Author

jigold commented Oct 13, 2021

Not sure this is better, but basically my logic is to kill the crun run process first. Wait for it to be killed. Set the process = None. Then check if a container exists. If the container exists, then do a deep kill of all child subprocesses. I'm not sure I completely understand where the race condition could occur here.

@jigold jigold dismissed stale reviews from danking and daniel-goldstein October 13, 2021 19:00

done?

try:
await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGKILL')
except CalledProcessError:
log.info(f'could not delete container {self} because it does not exist', exc_info=True)
Copy link
Collaborator

@danking danking Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daniel-goldstein @jigold ,

OK, thanks both for the clarification.

Why do we kill the parent process first? It seems to me that we should kill all the processes within the container before trying to kill the parent. In fact, I would expect crun to always gracefully exit after the child processes finish, no?

Does this seem right?

try:
    await check_exec_output('crun', 'kill', '--all', self.container_name, 'SIGKILL')
except CalledProcessError:
    pass
try:
    self.process.kill()
except ProcessLookupError:
    pass

We should determine (by testing) which exit code corresponds to "container no longer exists", and only pass when we get that exit code in the first try-except.

Aside: I'm partial to issuing a SIGTERM to both the in-container and controlling process and waiting 5 seconds for each to exit gracefully.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the container no longer exists error:

root@46dd36be3822:/# crun kill --all foo
2021-10-13T20:23:19.000811715Z: error opening file `/run/crun/foo/status`: No such file or directory
root@46dd36be3822:/# echo $?
1

@jigold
Copy link
Collaborator Author

jigold commented Oct 13, 2021

I think this is pretty close.

except CalledProcessError as e:
if not (e.returncode == 1 and
f'error opening file `/run/crun/{self.container_name}/status`: No such file or directory' in e.outerr):
log.exception(f'while deleting container {self}', exc_info=True)
except asyncio.CancelledError:
raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm feeling good about this. We don't need the except here any more because we don't have any non-cancelled except in this try-finally.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'll approve after this is removed.

Copy link
Collaborator

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xx

@danking
Copy link
Collaborator

danking commented Oct 13, 2021

Good work!

@danking danking merged commit ac50704 into hail-is:main Oct 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants