This repository was archived by the owner on Oct 8, 2025. It is now read-only.

Unix domain socket not cleaned up if the router process is not terminated cleanly #1448

@mtrop-godaddy

Description


Hello,

We're running into an issue with nginx-unit that is mostly triggered by the OOM killer. Unit runs in a Docker container with fairly strict memory and CPU constraints, which we don't want to remove. If a process in the container tries to allocate more memory than the cgroup limits allow, the OOM killer steps in and sends SIGKILL to a (possibly random, we haven't confirmed) process in the container/cgroup. If it kills the "router" process, Unit is unable to recover: when the router starts up again it fails with bind("unix:/tmp/app-listener.unit.sock") failed (98: Address already in use) (previously discussed in #669 and a few other issues).
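
For illustration, here's a minimal standalone program (not Unit code; the path is made up) showing the underlying behaviour: the filesystem entry for an AF_UNIX socket outlives the process that created it, so the next bind() to the same path fails with EADDRINUSE unless the file is unlinked first.

/* Not Unit code: minimal demonstration that an AF_UNIX socket file
 * survives the death of its owner.  The path is hypothetical. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un  sun;
    int                 s1, s2;

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, "/tmp/demo-listener.sock", sizeof(sun.sun_path) - 1);
    unlink(sun.sun_path);                       /* start from a clean state */

    s1 = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(s1, (struct sockaddr *) &sun, sizeof(sun));     /* creates the file */

    close(s1);    /* like a SIGKILL'd router: the fd is gone, the file isn't */

    s2 = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(s2, (struct sockaddr *) &sun, sizeof(sun)) == -1) {
        perror("bind");                 /* bind: Address already in use (98) */
    }

    close(s2);
    unlink(sun.sun_path);               /* the only way to free the path */
    return 0;
}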

It'd be great if Unit were able to recover gracefully from failures like this. We're currently testing the following patch, which removes the socket file if it already exists, before binding to it. This does work, but we're not sure it's a good idea (a more conservative variant is sketched after the patch):

diff --git a/src/nxt_main_process.c b/src/nxt_main_process.c
index 060ead41..a59d5703 100644
--- a/src/nxt_main_process.c
+++ b/src/nxt_main_process.c
@@ -1184,6 +1184,16 @@ nxt_main_listening_socket(nxt_sockaddr_t *sa, nxt_listening_socket_t *ls)
     const socklen_t   length = sizeof(int);
     static const int  enable = 1;
 
+    if (sa != NULL && sa->u.sockaddr.sa_family == AF_UNIX && sa->u.sockaddr_un.sun_path[0] != '\0') {
+        char *filename;
+        filename = sa->u.sockaddr_un.sun_path;
+
+        struct stat buffer;
+        if (stat(filename, &buffer) == 0) {
+            unlink(filename);
+        }
+    }
+
     s = socket(sa->u.sockaddr.sa_family, sa->type, 0);
 
     if (nxt_slow_path(s == -1)) {
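
A more conservative variant of the same idea would only unlink the path after confirming that nothing is actually listening on it, by attempting a connect() first. Below is a standalone sketch of that pattern, not Unit code; the helper name is hypothetical and the listener path is the one from our config.

/* Sketch: remove a Unix socket path only if it looks stale, i.e. the file
 * exists but no process is accepting connections on it. */
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

static int
cleanup_stale_unix_socket(const char *path)
{
    int                 s, ret;
    struct stat         st;
    struct sockaddr_un  sun;

    /* Nothing to do if the path does not exist or is not a socket. */
    if (stat(path, &st) != 0 || !S_ISSOCK(st.st_mode)) {
        return 0;
    }

    s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s == -1) {
        return -1;
    }

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);

    ret = connect(s, (struct sockaddr *) &sun, sizeof(sun));
    close(s);

    if (ret == 0) {
        /* Someone is listening: the socket is live, leave it alone. */
        return -1;
    }

    if (errno == ECONNREFUSED) {
        /* File exists but nobody accepts: stale leftover, remove it. */
        return unlink(path);
    }

    return -1;
}

int main(void)
{
    return cleanup_stale_unix_socket("/tmp/app-listener.unit.sock");
}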

Reproduction steps (also reproducible on 1.33.0):

# docker top app
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                90925               90904               0                   13:25               ?                   00:00:00            unit: main v1.32.1 [/usr/sbin/unitd --no-daemon --control unix:/nginx-unit/control.unit.sock]
systemd+            90981               90925               0                   13:25               ?                   00:00:00            unit: controller
systemd+            90982               90925               0                   13:25               ?                   00:00:00            unit: router
1000009+            91380               90925               0                   13:26               ?                   00:00:00            unit: "app-test-app" prototype
1000009+            91381               91380               31                  13:26               ?                   00:00:00            unit: "app-test-app" application

# kill -9 90982

# docker top app
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                90925               90904               0                   13:25               ?                   00:00:00            unit: main v1.32.1 [/usr/sbin/unitd --no-daemon --control unix:/nginx-unit/control.unit.sock]
systemd+            90981               90925               0                   13:25               ?                   00:00:00            unit: controller
systemd+            91397               90925               0                   13:26               ?                   00:00:00            unit: router

# docker logs app 2>&1 | grep alert
2024/10/01 13:26:33 [alert] 1#1 process 36 exited on signal 9
2024/10/01 13:26:33 [alert] 1#1 sendmsg(10, -1, -1, 1) failed (32: Broken pipe)
2024/10/01 13:26:33 [alert] 1#1 bind("unix:/tmp/app-listener.unit.sock") failed (98: Address already in use)
2024/10/01 13:26:33 [alert] 43#43 failed to apply new conf
2024/10/01 13:26:33 [alert] 35#35 failed to apply previous configuration

Is there a better workaround for this issue, and/or is this a bug that you'd be open to addressing in the future?
