Avoid indefinite blocking when tasks executed via ansible hang. #137

Open
dceara opened this issue Sep 26, 2022 · 2 comments
Comments

@dceara
Collaborator

dceara commented Sep 26, 2022

Tasks executed via ansible can hang. For example, there were a few cases of podman system prune -f blocking indefinitely, likely because of a bug in the container runtime. Regardless of the cause, do.sh should not hang indefinitely.

We should instead add timeouts to ansible tasks. One option is to set a global timeout:
https://docs.ansible.com/ansible/latest/reference_appendices/config.html#task-timeout

It's possible though that some tasks need a longer timeout than others. This needs to be investigated further.
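
As a rough sketch of how a global and a per-task limit could combine: the global option linked above corresponds to the `task_timeout` key in the `[defaults]` section of ansible.cfg (or the `ANSIBLE_TASK_TIMEOUT` environment variable), and an individual task can set its own limit with the `timeout` keyword. The play below is hypothetical; the host group, task name, and the 300-second value are assumptions for illustration, not taken from this repository's playbooks.

```yaml
# Hypothetical play; host group and timeout value are illustrative assumptions.
- name: Clean up container state
  hosts: all
  tasks:
    - name: Prune unused podman data
      ansible.builtin.command: podman system prune -f
      # Task-level limit in seconds; takes precedence over the global task_timeout.
      # If the limit is exceeded, Ansible interrupts the task and marks it failed.
      timeout: 300
```

A conservative global default with longer per-task overrides only where needed would keep do.sh from blocking forever while still accommodating slow tasks; the right values would need the investigation mentioned above.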

@dceara
Collaborator Author

dceara commented Sep 26, 2022

CC: @igsilya

@igsilya
Contributor

igsilya commented Sep 26, 2022

It happened again. What I can see is that we have a zombie process in the container and conmon doesn't reap it. While stopping the container we hit this error: "Error: cannot remove container d9f655ac066b5af3f4c2df875e897207d027e5a5db2c900d958fdf176b4ec4cf as it could not be stopped: given PIDs did not die within timeout". This is likely triggering a SIGKILL that kills the parents before they have a chance to wait for a child, essentially causing a PID 1 problem.

# pstree -t -l
systemd(1)-+-...
           |-conmon(2611836)-+-bash(2611847)
           |                 `-{conmon}(2611838)


# ps -aux | grep 2611847
root     2611847  0.4  0.0      0     0 ?        Zs   11:59   0:30 [bash] <defunct>

Not sure what is triggering the timeout in the first place, since the default of 10 seconds should generally be enough.
