Unix domain socket not cleaned up if the router process is not terminated cleanly

Hello,

We're running into an issue with nginx-unit that is mostly triggered by the OOM killer. Unit runs in a Docker container with fairly strict memory and CPU limits, which we don't want to remove. If a process in the container tries to allocate more memory than the cgroup limit allows, the OOM killer steps in and sends SIGKILL to a process in the container/cgroup (possibly a random one; we haven't confirmed). If it kills the "router" process, Unit is unable to recover and reports the `bind(\"unix:/tmp/app-listener.unit.sock\") failed (98: Address already in use)` error when the router is restarted (previously discussed in #669 and a few other issues).
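
For anyone who wants to see the underlying behaviour outside of Unit, here is a minimal sketch in plain POSIX C (not Unit code; `bind_and_abandon` is a made-up helper name). It binds an `AF_UNIX` socket and closes the descriptor without unlinking the path, which is roughly the state a SIGKILLed process leaves behind. A second bind to the same path then fails with `EADDRINUSE` even though nothing is listening.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a fresh AF_UNIX stream socket to `path`,
 * then close the descriptor WITHOUT unlinking the socket file.
 * Returns 0 on success, or the errno value of the failing call.
 */
static int
bind_and_abandon(const char *path)
{
    int                 fd, err;
    struct sockaddr_un  addr;

    assert(path != NULL);

    if (strlen(path) >= sizeof(addr.sun_path)) {
        return ENAMETOOLONG;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd == -1) {
        return errno;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    err = 0;

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        err = errno;
    }

    /* close() releases the descriptor but leaves the file on disk. */
    close(fd);

    return err;
}
```

Calling `bind_and_abandon()` twice with the same unused path should return 0 the first time and `EADDRINUSE` (98 on Linux) the second time, which matches the error in the log.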

It'd be great if Unit could recover gracefully from failures like this. We're currently testing the following patch, which removes the socket file if it already exists before binding to it. It works, but we're not sure whether it's a good idea:
```
diff --git a/src/nxt_main_process.c b/src/nxt_main_process.c
index 060ead41..a59d5703 100644
--- a/src/nxt_main_process.c
+++ b/src/nxt_main_process.c
@@ -1184,6 +1184,16 @@ nxt_main_listening_socket(nxt_sockaddr_t *sa, nxt_listening_socket_t *ls)
     const socklen_t   length = sizeof(int);
     static const int  enable = 1;
 
+    if (sa != NULL && sa->u.sockaddr.sa_family == AF_UNIX && sa->u.sockaddr_un.sun_path[0] != '\0') {
+        char *filename;
+        filename = sa->u.sockaddr_un.sun_path;
+
+        struct stat buffer;
+        if (stat(filename, &buffer) == 0) {
+            unlink(filename);
+        }
+    }
+
     s = socket(sa->u.sockaddr.sa_family, sa->type, 0);
 
     if (nxt_slow_path(s == -1)) {
```

Reproduction steps/example (it's also reproducible on 1.33.0):
```
# docker top app
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                90925               90904               0                   13:25               ?                   00:00:00            unit: main v1.32.1 [/usr/sbin/unitd --no-daemon --control unix:/nginx-unit/control.unit.sock]
systemd+            90981               90925               0                   13:25               ?                   00:00:00            unit: controller
systemd+            90982               90925               0                   13:25               ?                   00:00:00            unit: router
1000009+            91380               90925               0                   13:26               ?                   00:00:00            unit: "app-test-app" prototype
1000009+            91381               91380               31                  13:26               ?                   00:00:00            unit: "app-test-app" application

# kill -9 90982

# docker top app
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                90925               90904               0                   13:25               ?                   00:00:00            unit: main v1.32.1 [/usr/sbin/unitd --no-daemon --control unix:/nginx-unit/control.unit.sock]
systemd+            90981               90925               0                   13:25               ?                   00:00:00            unit: controller
systemd+            91397               90925               0                   13:26               ?                   00:00:00            unit: router

# docker logs app 2>&1 | grep alert
2024/10/01 13:26:33 [alert] 1#1 process 36 exited on signal 9
2024/10/01 13:26:33 [alert] 1#1 sendmsg(10, -1, -1, 1) failed (32: Broken pipe)
2024/10/01 13:26:33 [alert] 1#1 bind(\"unix:/tmp/app-listener.unit.sock\") failed (98: Address already in use)
2024/10/01 13:26:33 [alert] 43#43 failed to apply new conf
2024/10/01 13:26:33 [alert] 35#35 failed to apply previous configuration
```

Is there a better workaround for this issue, and/or is this a bug that you're open to addressing in the future?

Unix domain socket not cleaned up if the router process is not terminated cleanly #1448
