Frequent warning, until process-crash: send aio error Out of files #1546
Replies: 13 comments 21 replies
-
This is an error returned by the kernel when trying to send a packet to a socket. Check NNG_ENOFILES. It basically means there is no file descriptor left. In this case, if there are still new connections coming in, the broker will take a backoff strategy: sleep and wait until a new FD is released.
-
Found the respective docs about the exception, but well - it's pretty much undocumented ;) So, I thought a while about a possible reason, and it might be related to the following: the Ubuntu VM is running on a Hyper-V server with a thin-provisioned disk. I remember years ago this caused IO issues the moment the thin-provisioned virtual disk was at the edge of expanding. Will see if the problem appears again during the next day, or if this was the root cause and is therefore gone for quite a time.
-
Ok, apparently it's not disk related. Currently the server is still running, but producing about 20 "out of files" log messages per second. Will test the recommendation with ulimit -n, probably in the next days.
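For reference, the per-process fd limit can be inspected and raised from the launching shell like this (the exact values are system-dependent):

```shell
# Current soft and hard fd limits of this shell (defaults are often 1024 soft)
ulimit -Sn
ulimit -Hn

# Raise the soft limit up to the hard limit before launching the broker
# (a fixed value like "ulimit -n 25000" works too, if below the hard limit)
ulimit -Sn "$(ulimit -Hn)"
ulimit -Sn
```

Note this only affects processes started from that shell, which is why a permanent fix belongs in the service definition.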
-
I've just checked the actually running threads - and noted that two of the nng:aio:expire threads are stuck at ~9.3% CPU usage. Not sure if related at all, just noting it.
(And while finishing this comment, the server finally crashed / got killed)
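Per-thread CPU usage like this can be checked with ps (using this shell's PID as a placeholder - substitute the broker's PID):

```shell
# List the process's threads with CPU usage; thread names such as
# nng:aio:expire show up in the COMMAND column.
PID=${PID:-$$}   # placeholder -- use "$(pidof nanomq)" for the broker
ps -L -p "$PID" -o tid,pcpu,comm
```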
-
Found some minutes to take the required actions. I've created a systemd service for nanomq, so I don't have to worry about starting it on each reboot. systemd allows specifying the file-descriptor limit in the respective service file.
To verify it applies, check the limits of the running process.
Will report back during the next days whether the problem is resolved - OR whether raising the limit just delays issues that shouldn't occur in the first place ;)
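A sketch of such a unit file - the install path, config path, and the limit value of 25000 are assumptions, adjust to your setup:

```ini
# /etc/systemd/system/nanomq.service
[Unit]
Description=NanoMQ MQTT broker
After=network.target

[Service]
ExecStart=/usr/local/bin/nanomq start --conf /etc/nanomq/nanomq.conf
LimitNOFILE=25000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload && systemctl restart nanomq`, the effective limit can be verified with `grep 'open files' /proc/$(pidof nanomq)/limits`.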
-
So far, no crash until today. I did some checks and found out that there is actually one device (Shelly H&T sensor) that keeps a lot of open sockets, even though it only wakes up every once in a while to report data (triggered by temperature and/or humidity changes). Right now, nanomq has 245 open sockets, of which about 200 belong to this device. So, with the default limit of 1024, there is a potential chance that too frequent temperature / humidity changes caused a rising number of left-over sockets, leading to the error.
and
Since this has been stable at about 250 for the past days, it seems there is some sort of timeout after which these sockets get recycled? (The Shelly itself is configured to connect with a clean session and a keepalive of 10 seconds - so apparently a bug, maybe related to its deep sleep between measurements)
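Per-client socket counts like the ones above can be gathered with something along these lines (port 1883, the MQTT default, is an assumption - adjust to your listener):

```shell
# Established connections to the MQTT listener, grouped by peer address,
# most connections first. A single chatty device stands out immediately.
ss -tn state established '( sport = :1883 )' 2>/dev/null \
  | awk 'NR>1 {print $4}' \
  | cut -d: -f1 | sort | uniq -c | sort -rn
```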
-
So, yeah, just happened right now ;)
checking: So it seems that even if Linux is closing sockets after a while, nanoMQ doesn't - or doesn't release the fd handle in some way? I could upload both dumps (proc / sockstat) if that's helpful?
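The two views being compared here can be snapshotted like this (using the current shell's PID as a stand-in - substitute the broker's PID):

```shell
# Kernel-wide socket accounting (what sockstat shows)...
cat /proc/net/sockstat

# ...versus the process's own fd table, total count and grouped by fd type.
PID=${PID:-$$}   # placeholder -- use "$(pidof nanomq)" for the broker
ls "/proc/$PID/fd" | wc -l
ls -l "/proc/$PID/fd" | awk 'NR>1 {print $NF}' \
  | sed 's/\[[0-9]*\]//' | sort | uniq -c | sort -rn
```

If the fd count keeps growing while sockstat stays flat, the leaked descriptors are not live TCP sockets.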
-
I've just checked the logs, and (this time) there are hundreds of thousands of these lines:
(heavy log-flooding) and these
(light log-flooding) Not sure if these errors tell you the kind of connection error that's happening.
nanoMQ may not be the root cause - but as a software developer myself I'd say:
-
I've now enabled trace logging, plus a short script to track the fd count per minute. Maybe this makes it easier to identify what is going on "the moment it starts to build up massive numbers of fd handles".
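A minimal sketch of such a per-minute tracker (hypothetical, not the actual script used here; shortened to three one-second iterations for demonstration):

```shell
# Hypothetical fd tracker. For real use: PID=$(pidof nanomq),
# sleep 60, and loop forever (e.g. while true; do ... done).
PID=${PID:-$$}
LOG=/tmp/nanomq-fd.log
for i in 1 2 3; do
    printf '%s %s\n' "$(date '+%F %T')" "$(ls "/proc/$PID/fd" | wc -l)" >> "$LOG"
    sleep 1
done
tail -n 3 "$LOG"
```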
-
Everything has been running for 3 days without any surprise so far: 250 fd handles, +/-. Today I checked the fd-handle-count log and noted that nanoMQ has started to build up handles.
At this rate, it'll probably hit the 25k limit in about 3 more days, I'd say. Yet, at this point, sockstat just reports ~240-251 sockets being open:
At exactly 14:00 I even - so, whatever triggers the increase in fd handles does not seem to be related to external socket connections, else sockstat should report them? I've attached a trace log from around 13:59 - 14:01 of today, in case you want to have a look at it. Since I don't know the messages, I could only look for repeated stuff, but probably just encountered traces that are meant to repeat over and over without any indicator of an error.

I've checked the logs a bit - every full hour, there are about 280 connections & publishes from a customer's remote servers (disk-space check). They are done through a PowerShell client, over a VPN connection (the clients are in the subnet 192.168.136.0). This matches the hourly fd-handle increase quite well - the big question would be: why does this become an issue only after 3 days of nanoMQ uptime, and not right away?

May it be possible that NAT has something to do with it? Maybe nanoMQ associates all 280 connections by their "next hop" rather than their actual address, and hence somehow just "closes" 279 handles? (But again, why does this begin only after a certain uptime? These servers are publishing 24/7)
-
I have now encountered the issue three more times - and added another layer of "tracing" to it. Finally, I was able to find what I was hoping for: an error logged, appearing ONCE, with fd-handle releasing confirmed to work before it - but stopping after it. So, what I did:
So, derived from this and the syslog of the system, this is what I was able to figure out:
I will now clear the logs, restart everything, and see if this happens exactly like this again - or if the message and failure time were a coincidence. I've checked the code in nmq_mqtt.c, but I'm not into C.
Maybe whatever should happen there is STILL disrupting the "state machine"? ;-) From a logical point of view: maybe the calls to "set msg = NULL" and "free(msg)" should be swapped?
like this?
(Just a wild guess)
-
A potential case looks similar to the scenario described here: see issue #1619.
-
@realdognose This turns out to be a bug that only occurs with TLS transport - my fault.
-
I'm using nanoMQ 0.20.8 with Ubuntu 22.04.
After one day of uptime, the logs are being spammed with the following error, until the process dies:
Any hint what this message means? Google did not yield any valuable results.