Frequent warning, until process-crash: send aio error Out of files #1546
Replies: 13 comments 21 replies
-
This is an error returned by the kernel when trying to send a packet to a socket. Check NNG_ENOFILES. It basically means there is no file descriptor left. In this case, if there are still new connections coming in, the broker will take a backoff strategy: sleep and wait until a new FD is released.
-
Found the respective docs about the exception, but well - it's pretty much undocumented ;) So, I thought a while about a possible reason, and it might be related to the following: the Ubuntu VM is running on a Hyper-V server with a thin-provisioned disk. I remember years ago this caused IO issues the moment the thin-provisioned virtual disk was at the edge of expanding. Will see if the problem appears again during the next day, or if this was the root cause and is therefore gone for quite a time.
-
Ok, apparently it's not disk related. Currently the server is still running, but producing about 20 "out of files" log messages per second. Will test the recommendation with ulimit -n, probably in the next days.
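For reference, the per-process fd limit can be inspected and raised from the launching shell like this (the exact values are system-dependent):

```shell
# Current soft and hard fd limits of this shell (defaults are often 1024 soft)
ulimit -Sn
ulimit -Hn

# Raise the soft limit up to the hard limit before launching the broker
# (a fixed value like "ulimit -n 25000" works too, if below the hard limit)
ulimit -Sn "$(ulimit -Hn)"
ulimit -Sn
```

Note this only affects processes started from that shell, which is why a permanent fix belongs in the service definition.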
-
I've just checked the actually running threads - and noted that two of the nng:aio:expire threads are stuck at ~9.3% CPU usage. Not sure if related at all, just noting it.
(And while finishing this comment, the server finally crashed / got killed)
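Per-thread CPU usage like this can be checked with ps (using this shell's PID as a placeholder - substitute the broker's PID):

```shell
# List the process's threads with CPU usage; thread names such as
# nng:aio:expire show up in the COMMAND column.
PID=${PID:-$$}   # placeholder -- use "$(pidof nanomq)" for the broker
ps -L -p "$PID" -o tid,pcpu,comm
```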
-
Found some minutes to take the required actions. I've created a systemd service for nanomq, so I don't have to worry about starting it on each reboot. systemd allows specifying the file-descriptor limit in the respective service file.
To verify it applies, check the limits of the running process.
Will report back during the next days whether the problem is resolved - OR whether raising the limit just delays issues that shouldn't occur in the first place ;)
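A sketch of such a unit file - the install path, config path, and the limit value of 25000 are assumptions, adjust to your setup:

```ini
# /etc/systemd/system/nanomq.service
[Unit]
Description=NanoMQ MQTT broker
After=network.target

[Service]
ExecStart=/usr/local/bin/nanomq start --conf /etc/nanomq/nanomq.conf
LimitNOFILE=25000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload && systemctl restart nanomq`, the effective limit can be verified with `grep 'open files' /proc/$(pidof nanomq)/limits`.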
-
So far, no crash until today. I did some checks and found out that there is actually one device (Shelly H&T sensor) that keeps a lot of open sockets, even though it only wakes up every once in a while to report data (triggered by temperature and/or humidity changes). Right now, nanomq has 245 open sockets, of which about 200 belong to this device. So, with the default limit of 1024, there is a potential chance that too frequent temperature / humidity changes caused a rising number of left-over sockets, leading to the error.
and
Since this has been stable at about 250 for the past days, it seems there is some sort of timeout after which these sockets get recycled? (The Shelly itself is configured to connect with a clean session and a keepalive of 10 seconds - so apparently a bug, maybe related to its deep sleep between measurements)
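Per-client socket counts like the ones above can be gathered with something along these lines (port 1883, the MQTT default, is an assumption - adjust to your listener):

```shell
# Established connections to the MQTT listener, grouped by peer address,
# most connections first. A single chatty device stands out immediately.
ss -tn state established '( sport = :1883 )' 2>/dev/null \
  | awk 'NR>1 {print $4}' \
  | cut -d: -f1 | sort | uniq -c | sort -rn
```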
-
So, yeah, just happened right now ;)
checking: So it seems that even if Linux is closing sockets after a while, nanoMQ doesn't - or doesn't release the fd handle in some way? I could upload both dumps (proc / sockstat) if that's helpful?
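The two views being compared here can be snapshotted like this (using the current shell's PID as a stand-in - substitute the broker's PID):

```shell
# Kernel-wide socket accounting (what sockstat shows)...
cat /proc/net/sockstat

# ...versus the process's own fd table, total count and grouped by fd type.
PID=${PID:-$$}   # placeholder -- use "$(pidof nanomq)" for the broker
ls "/proc/$PID/fd" | wc -l
ls -l "/proc/$PID/fd" | awk 'NR>1 {print $NF}' \
  | sed 's/\[[0-9]*\]//' | sort | uniq -c | sort -rn
```

If the fd count keeps growing while sockstat stays flat, the leaked descriptors are not live TCP sockets.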
-
I've just checked the logs, and (this time) there are hundreds of thousands of these lines:
(heavy log-flooding) and these
(light log-flooding) Not sure if these errors tell you the kind of connection error that's happening.
nanoMQ may not be the root cause - but as a software developer myself I'd say:
-
I've now enabled trace logging, plus a short script to track the fd count per minute. Maybe this makes it easier to identify what is going on "the moment it starts to build up massive numbers of fd handles".
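A minimal sketch of such a per-minute tracker (hypothetical, not the actual script used here; shortened to three one-second iterations for demonstration):

```shell
# Hypothetical fd tracker. For real use: PID=$(pidof nanomq),
# sleep 60, and loop forever (e.g. while true; do ... done).
PID=${PID:-$$}
LOG=/tmp/nanomq-fd.log
for i in 1 2 3; do
    printf '%s %s\n' "$(date '+%F %T')" "$(ls "/proc/$PID/fd" | wc -l)" >> "$LOG"
    sleep 1
done
tail -n 3 "$LOG"
```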
-
Everything has been running for 3 days without any surprise so far: 250 fd handles, +/-. Today I checked the fd-handle-count log and noted that nanoMQ has started to build up handles.
At this rate, it'll probably hit the 25k limit in about 3 more days, I'd say. Yet, at this point, sockstat just reports ~240-251 sockets being open:
At exactly 14:00 I even - so, whatever triggers the increase in fd handles does not seem to be related to external socket connections, else sockstat should report them? I've attached a trace log from around 13:59 - 14:01 of today, in case you want to have a look at it. Since I don't know the messages, I could only look for repeated stuff, but probably just encountered traces that are meant to repeat over and over without any indicator of an error.

I've checked the logs a bit - every full hour, there are about 280 connections & publishes from a customer's remote servers (disk-space check). They are done through a PowerShell client, over a VPN connection (the clients are in the subnet 192.168.136.0). This matches the hourly fd-handle increase quite well - the big question would be: why does this become an issue only after 3 days of nanoMQ uptime, and not right away?

May it be possible that NAT has something to do with it? Maybe nanoMQ associates all 280 connections by their "next hop" rather than their actual address, and hence somehow just "closes" 279 handles? (But again, why does this begin only after a certain uptime? These servers are publishing 24/7)
-
I have now encountered the issue three more times - and added another layer of "tracing" to it. Finally, I was able to find what I was hoping for: an error logged, appearing ONCE, with fd-handle releasing confirmed to work before it - but stopping after it. So, what I did:
So, derived from this and the syslog of the system, this is what I was able to figure out:
I will now clear the logs, restart everything, and see if this happens exactly like this again - or if the message and failure time were a coincidence. I've checked the code in nmq_mqtt.c, but I'm not into C.
Maybe whatever should happen there is STILL disrupting the "state machine"? ;-) From a logical point of view: maybe the calls to "set msg = NULL" and "free(msg)" should be swapped?
like this?
(Just a wild guess)
-
A potential case looks similar to the scenario described here: see issue #1619.
-
@realdognose This turns out to be a bug that only occurs with TLS transport - my fault.
-
I'm using nanoMQ 0.20.8 with Ubuntu 22.04.
After one day of uptime, the logs are being spammed with the following error, until the process dies:
Any hint what this message means? Google did not yield any valuable results.