Unix domain socket not cleaned up if the router process is not terminated cleanly #1448
Description
Hello,
We're running into an issue with nginx-unit that is mostly triggered by the OOM killer. Unit runs in a Docker container with fairly strict memory and CPU limits, which we don't want to remove. If a process in the container tries to allocate more memory than the cgroup limits allow, the OOM killer steps in and sends SIGKILL to a process in the container/cgroup (possibly a random one; we haven't confirmed). If it kills the "router" process, Unit is unable to recover: the socket file is left behind, and when the router starts up again, binding fails with the bind(\"unix:/tmp/app-listener.unit.sock\") failed (98: Address already in use) error (previously discussed in #669 and a few other issues).
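To illustrate the mechanism outside of Unit, here is a minimal sketch in plain C. It assumes nothing about Unit's internals: the helper names `bind_unix` and `rebind_errno_after_unclean_exit` are made up for this example. It binds a Unix domain socket, closes it without calling unlink() (roughly what is left behind after a SIGKILL), and then tries to bind the same path again.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind a stream socket to `path`.
 * Returns the descriptor, or -1 with errno set. */
static int
bind_unix(const char *path)
{
    int                 fd, err;
    struct sockaddr_un  addr;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd == -1) {
        return -1;
    }

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        err = errno;
        close(fd);
        errno = err;
        return -1;
    }

    return fd;
}

/* Simulate an unclean exit: bind, close without unlink(), bind again.
 * Returns the errno of the second bind(), 0 if the second bind()
 * unexpectedly succeeded, or -1 if the first bind() failed. */
int
rebind_errno_after_unclean_exit(const char *path)
{
    int  fd, err;

    unlink(path);               /* start from a clean state */

    fd = bind_unix(path);
    if (fd == -1) {
        return -1;
    }

    close(fd);                  /* the socket file stays on disk */

    fd = bind_unix(path);
    if (fd != -1) {
        close(fd);
        unlink(path);
        return 0;
    }

    err = errno;
    unlink(path);               /* clean up the leftover file */

    return err;
}
```

On Linux, calling `rebind_errno_after_unclean_exit("/tmp/demo.unit.sock")` should return EADDRINUSE (98), the same error code that appears in the Unit log.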
It'd be great if unit was able to recover gracefully from failures like this. We're currently testing the following patch, which removes the socket file if it already exists before binding to it. It works, but we're not sure it's a good idea:
diff --git a/src/nxt_main_process.c b/src/nxt_main_process.c
index 060ead41..a59d5703 100644
--- a/src/nxt_main_process.c
+++ b/src/nxt_main_process.c
@@ -1184,6 +1184,16 @@ nxt_main_listening_socket(nxt_sockaddr_t *sa, nxt_listening_socket_t *ls)
     const socklen_t length = sizeof(int);
     static const int enable = 1;
+    if (sa != NULL && sa->u.sockaddr.sa_family == AF_UNIX && sa->u.sockaddr_un.sun_path[0] != '\0') {
+        char *filename;
+        filename = sa->u.sockaddr_un.sun_path;
+
+        struct stat buffer;
+        if (stat(filename, &buffer) == 0) {
+            unlink(filename);
+        }
+    }
+
     s = socket(sa->u.sockaddr.sa_family, sa->type, 0);
     if (nxt_slow_path(s == -1)) {
Reproduction steps/example (it's also reproducible on 1.33.0):
# docker top app
UID PID PPID C STIME TTY TIME CMD
root 90925 90904 0 13:25 ? 00:00:00 unit: main v1.32.1 [/usr/sbin/unitd --no-daemon --control unix:/nginx-unit/control.unit.sock]
systemd+ 90981 90925 0 13:25 ? 00:00:00 unit: controller
systemd+ 90982 90925 0 13:25 ? 00:00:00 unit: router
1000009+ 91380 90925 0 13:26 ? 00:00:00 unit: "app-test-app" prototype
1000009+ 91381 91380 31 13:26 ? 00:00:00 unit: "app-test-app" application
# kill -9 90982
# docker top app
UID PID PPID C STIME TTY TIME CMD
root 90925 90904 0 13:25 ? 00:00:00 unit: main v1.32.1 [/usr/sbin/unitd --no-daemon --control unix:/nginx-unit/control.unit.sock]
systemd+ 90981 90925 0 13:25 ? 00:00:00 unit: controller
systemd+ 91397 90925 0 13:26 ? 00:00:00 unit: router
# docker logs app 2>&1 | grep alert
2024/10/01 13:26:33 [alert] 1#1 process 36 exited on signal 9
2024/10/01 13:26:33 [alert] 1#1 sendmsg(10, -1, -1, 1) failed (32: Broken pipe)
2024/10/01 13:26:33 [alert] 1#1 bind(\"unix:/tmp/app-listener.unit.sock\") failed (98: Address already in use)
2024/10/01 13:26:33 [alert] 43#43 failed to apply new conf
2024/10/01 13:26:33 [alert] 35#35 failed to apply previous configuration
I'm wondering if there's a better workaround for this issue and/or if this is a bug that you're open to addressing in the future?