segfault when reloading HAProxy if there are many persistent connections #2184
Comments
Indeed, it is not supposed to behave like this. Which exact versions did you test? Did you have problems like this in previous versions of HAProxy? It looks like a regression to me. I will investigate, thanks!
In ticket #2184, HAProxy is crashing in a BUG_ON() after a lot of reloads when the previous processes did not exit. Each worker has a socketpair, which is an FD in the master; when reloading, this FD still exists until the process exits. But the global.maxsock value is not incremented for each of these FDs. So when there are too many workers and the number of FDs reaches maxsock, the next FD inserted in the poller will crash the process. This patch fixes the issue by increasing maxsock for each remaining worker. Must be backported to every maintained version.
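A minimal sketch of the idea behind that fix, with names modeled on HAProxy's master-worker code (`mworker_proc`, the `PROC_O_*` flags); treat this as an illustration under those assumptions, not the committed patch:

```c
/* Assumed flag names, mirroring HAProxy's mworker code. */
#define PROC_O_TYPE_WORKER  0x01  /* entry describes a worker process    */
#define PROC_O_LEAVING      0x02  /* process left over from a reload     */

struct mworker_proc {
	int options;                  /* PROC_O_* flags                  */
	struct mworker_proc *next;
};

/* Each old worker that has not exited yet still holds one socketpair
 * FD in the master, so raise the master's FD accounting (maxsock) by
 * one per leaving worker before re-entering the poller. */
static void account_leaving_workers(struct mworker_proc *proc_list,
                                    int *maxsock)
{
	for (struct mworker_proc *child = proc_list; child; child = child->next)
		if ((child->options & PROC_O_TYPE_WORKER) &&
		    (child->options & PROC_O_LEAVING))
			(*maxsock)++;
}
```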
The patch above fixes the crash; however, you are reaching your system's limit with maxsock, so instead of crashing it will now fail to reload. You hit this bug because you have a low FD limit, so you should consider tweaking your environment. Maybe you are using Docker, or an environment with an FD limit set in a unit file, or something like that. Having that many workers can lead to memory problems, so you should probably set short timeouts if you intend to reload a lot. There are also keywords that can help: mworker-max-reloads reduces the maximum number of old workers, and hard-stop-after exits a worker after a timeout even if connections are still alive; see the sketch below.
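For example, a global section along these lines uses both keywords to cap how many stale workers can pile up (the specific values are illustrative, not recommendations):

```
global
    master-worker
    # kill any old worker that survives more than 100 reloads
    mworker-max-reloads 100
    # force old processes to finish their soft-stop within 30s,
    # even if connections are still alive
    hard-stop-after 30s
```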
Thank you for the quick reply and patch. As far as I can tell, the FD limit of my HAProxy master is very high (set with …).
My mistake, you can indeed reach this problem with the default global.maxsock value computed by HAProxy, which is low in the master process.
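One hedged way to compare the master's effective FD limit against the maxsock HAProxy computed (the pidfile and stats-socket paths here are assumptions; adjust them to your setup):

```sh
# effective FD limit of the running master process
grep 'open files' /proc/"$(cat /var/run/haproxy.pid)"/limits

# maxsock as computed by HAProxy, reported by the runtime API
echo "show info" | socat stdio UNIX-CONNECT:/var/run/haproxy.sock | grep Maxsock
```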
Aurelien Darragon found a case of leak when working on ticket #2184. When a reexec_on_failure() happens *BEFORE* protocol_bind_all(), the worker is not forked and the mworker_proc struct is still there with its 2 socketpairs. The socketpair that is supposed to be in the master is already closed in mworker_cleanup_proc(); the one for the worker was supposed to be cleaned up in mworker_cleanlisteners(). However, since the FD is not bound during this failure, it is never closed. This patch fixes the problem by setting the FD to -1 in the mworker_proc after the fork, so we ensure it won't be closed again if everything was done right, and then we try to close it in mworker_cleanup_proc() when it's not set to -1. This could be triggered with the script in ticket #2184 and a `ulimit -H -n 300`, which will fail before protocol_bind_all() when trying to increase the nofile setrlimit. In recent versions of haproxy, there is a BUG_ON() in fd_insert() that could be triggered by this bug because of the global.maxsock check. Must be backported as far as 2.6. The problem could exist in previous versions, but the code is different and it won't be triggered easily without other consequences in the master.
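The fix is the classic invalidate-after-handoff idiom; here is a minimal sketch of it, with the struct shape and function names modeled on the commit message (assumptions, not the exact patch):

```c
#include <unistd.h>

/* Assumed shape: each worker entry keeps its socketpair in ipc_fd[2];
 * ipc_fd[0] is the master's end, ipc_fd[1] the worker's end. */
struct mworker_proc {
	int ipc_fd[2];
};

/* After a successful fork, the worker's end belongs to the child, so
 * mark it handed off: cleanup must not close it a second time. */
static void mark_handed_off(struct mworker_proc *proc)
{
	proc->ipc_fd[1] = -1;
}

/* On the failure path (e.g. reexec_on_failure() before the bind), the
 * FD was never handed off, so it is still ours to close. */
static void cleanup_proc(struct mworker_proc *proc)
{
	if (proc->ipc_fd[1] > -1) {
		close(proc->ipc_fd[1]);
		proc->ipc_fd[1] = -1;
	}
}
```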
Detailed Description of the Problem
In master-worker mode, the HAProxy master seems to reliably segfault if I send it many reloads (~250) with SIGUSR2 while creating a fresh long-lived connection between reloads. I've reproduced it on a couple of different systems (from small VMs to bigger hardware).
The bug is the BUG_ON() crash in fd_insert() described in the commit messages above.
Expected Behavior
I know the reloads are costly, so while I was not expecting HAProxy to be happy, I was not expecting it to segfault either.
For my use case, I have an automated update system that updates configs and reloads HAProxy. In between those reloads, new long-lived connections are created by normal client usage. I came across this segfault in a test environment (not production) while trying to estimate how much reload "load" HAProxy could tolerate.
Steps to Reproduce the Behavior
Run a server to accept connections (HAProxy will proxy to this).
Run a client in a loop that creates a long-lived connection to the server (via HAProxy), and then shortly after creating the connection, reloads HAProxy.
The reproducer below reliably crashes HAProxy after roughly 250 reloads/connections.
commands:
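The original reproducer scripts were not captured in this text; the following is a hypothetical sketch of the same steps (the ports, pidfile path, and iteration count are assumptions):

```sh
# Terminal 1: a server that accepts and holds connections open
ncat -lk 8000 >/dev/null &

# Terminal 2: repeatedly open a long-lived connection through HAProxy
# (assumed to listen on 127.0.0.1:8080), then reload the master via SIGUSR2
for i in $(seq 1 300); do
    sleep 1000 | ncat 127.0.0.1 8080 &   # persistent client connection
    sleep 0.1
    kill -USR2 "$(cat /var/run/haproxy.pid)"
    sleep 0.2
done
```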
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of `haproxy -vv`
Last Outputs and Backtraces
No response
Additional Information
I've tested on 2.6, 2.7, and 2.8, and can reproduce the bug on all of them. I also tried running as root, with no limits, and with a large maxconn.