WebGUI: can't bind to socket: 127.0.0.1:443: Address already in use #6351
Comments
HAProxy or Nginx perhaps? I don’t think that happens in a default install. Cheers, |
I think that would be easy to spot, but I don't have anything running that I would suspect of using :443. Additional packages I have installed:
|
Interfaces : Diagnostics : Netstat, go for Ports or Protocol and search for 443 |
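For anyone who would rather check this from a script than from the GUI, here is a minimal Python sketch. It is not part of OPNsense, and the function name `port_in_use` is made up for illustration. It only reports whether a bind to the address fails with "Address already in use"; it does not tell you which process owns the port, so the netstat view is still needed for that.

```python
import errno
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # e.g. EACCES for ports below 1024 without privileges
        return False


if __name__ == "__main__":
    # Demo on an ephemeral port so that no privileges are needed.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as holder:
        holder.bind(("127.0.0.1", 0))
        holder.listen()
        demo_port = holder.getsockname()[1]
        print(port_in_use("127.0.0.1", demo_port))
```

On the firewall itself the call of interest would be `port_in_use("127.0.0.1", 443)`, run as root, since that is the address from the error message.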
However, this probably only makes sense after lighttpd has actually failed to restart? |
Does the acme client also handle the web GUI certificate? Or, if there is still a process, it probably hangs doing "something", although that seems new to me. Selecting specific interfaces also doesn't help in general, but is probably not causing it. Cheers, |
It seems to use Best check the process list first
Best regards, Ad |
...but the acme lighty instance does not use the standard port by default and does not use SSL |
I am not sure if it is acme. I checked the logs. The lighttpd error from above occurs at 5am daily and 7am on Mondays.
acme, on the other hand, runs at midnight daily. Around the lighttpd error at 5am there are multiple other errors, but it is hard to tell which are relevant. siproxd (expected as network is reconnecting): prefixes.sh: radvd: ntpd: mount: then there is lighttpd: At 7am, there is only the lighttpd error. I can also provide a full log, if that helps you. |
@chris42 sorry, just to clarify, so you can reproduce this issue on every "Periodic interface reset" cron job? |
Yes and no. I can reproduce the lighttpd error in every "Periodic interface reset" cron job and the "Automatic firmware update" one. However it does not lead to the "503 - service unavailable" error right away. It seems that it takes a while until lighttpd is broken. |
@chris42 thanks! one more question: is Web Server Log enabled? (System: Settings: Logging) |
D'OH sorry for not thinking of lighttpd logs as well. So here is what I have in there:
Then coinciding with the last time I saw the 503 error I have the following on Tuesday in the log:
FYI: The 18:08 entry should be me trying to access opnsense. After that I ran "killall lighttpd" from the console and then restarted all services to fix the 503 error. The two log pieces are everything that is there for 20.-21.02. |
Ok, this is getting weirder by the minute.
Being on the WebGUI, it shows me there, that the WebGUI is not running? |
Sorry, I don't have a suitable test setup handy for the "Periodic interface reset" test with a PPPoE interface. Do I understand correctly that this job will trigger core/src/etc/inc/plugins.inc.d/webgui.inc Line 139 in 09f40f0
of the lighty that just started and cannot start a new instance (can't bind to socket)? (I managed to reproduce this, though not on the first try, by calling configctl webgui restart almost simultaneously from two parallel SSH sessions.)
|
@chris42 |
Thanks a lot! Did a reboot before applying, so any silly state is hopefully cleared up. Applied the patch (I guess reboot won't be necessary to activate) |
When the problem occurs, who owns the sockets in /tmp and who has them open? If lighttpd.conf is configured with If something else is auto-cleaning /tmp, I recommend moving the sockets to a different, protected path, such as |
So far webGUI is still alive and kicking. lighttpd logs are clean with only server stopping/starting messages. Within system logs I found the following on yesterday's reconnect:
On today's reconnect (5am) and firmware check (7am) I get the following:
Does that help? |
@chris42 yes, thanks! It may help (it resembles some kind of 'race condition'). I will comment on the log a bit later |
I'm a little bit confused. There should be no "working instance" if lighttpd is being started. If there is, then whatever privileged process tried to start lighttpd failed to first check if lighttpd was already running, or failed to shut lighttpd down and wait for lighttpd to exit, and for the lighttpd instance shutting down to remove its pid file.
After changing lighttpd configuration, it is recommended that the config be tested before lighttpd is restarted:
However, if the lighttpd executable has changed (e.g. package upgrade), then a stop and start of lighttpd is needed. SIGINT will stop lighttpd gracefully, handling existing connections for a few more seconds (configurable). SIGTERM will stop lighttpd very quickly, without continuing to handle connections in progress.
Now, I don't know the specifics of the webgui behavior. If a SIGTERM is sent to lighttpd, does the webgui do it, or does the webgui ask the init script to restart lighttpd? The init script should be used. If the webgui is sending the signal, there is a possibility that someone is attempting to start lighttpd before the older process has exited, and so the failure to bind might occur.
I don't have a machine in front of me, so this is all theoretical scenarios, depending on what the webgui is actually doing. |
Let's try 3610606 with SIGINT then. It could be a race, but the start/stop is sequential so it's unlikely an issue with the general flow.
Cheers, |
Looking at 3610606, there is definitely a race between sending SIGTERM (now SIGINT) and exec'ing lighttpd, because nothing checks that lighttpd has exited before trying to start lighttpd back up.
While changing the signal to SIGINT is more friendly to existing web requests in progress, lighttpd will by default wait for 8 seconds for those in-progress connections to finish before lighttpd closes those connections. The time to wait is configurable, but just about any value will result in the lighttpd bind failure at startup, since the older lighttpd will still be running (unless there are no open connections and the older lighttpd manages to kill the backend PHP, wait for it to exit, and then exit itself before webgui.inc tries to start up a new instance of lighttpd).
Is there a service which restarts lighttpd if lighttpd exits? Can the webgui ask that service to restart lighttpd? If not, can the webgui wait until the lighttpd pid file is removed or |
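To make the "wait before starting" idea concrete, here is a minimal Python sketch of a bounded polling loop. It is an illustration only, not the code in webgui.inc; the names `wait_until` and `pidfile_gone` are invented for this example. It uses a monotonic clock so that wall-clock adjustments cannot distort the timeout.

```python
import os
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float,
               interval: float = 0.2) -> bool:
    """Poll condition() until it returns True or timeout seconds pass.

    time.monotonic() is used so that wall-clock jumps (forwards or
    backwards) can neither shorten nor extend the wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def pidfile_gone(path: str) -> bool:
    """True once the old instance has removed its pid file."""
    return not os.path.exists(path)
```

A caller would do something like `wait_until(lambda: pidfile_gone(path), timeout=10.0)` and only start a new lighttpd when that returns True. The 10 second value is an assumption chosen to sit a little above the 8 second default grace period mentioned above.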
The third parameter is true which means it monitors PID file to wait for exit. |
FYI: these are now the defaults in lighttpd since lighttpd 1.4.68, so this can be omitted.
and I question the correctness of |
As @fichtner notes, There is a different race condition in
That loops and every 200ms reads the pidfile. In the meantime, something else might delete the pidfile and something else might recreate the pidfile, and the loop would be none-the-wiser. The pidfile should be read and saved. Then the loop can operate on the numeric pid value. The pidfile should be read into memory before sending lighttpd SIGINT, and the signal should be sent to the pid, not to the |
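A rough Python sketch of that suggestion follows, assuming a pid file that contains a single decimal pid. The helper names (`read_pid`, `pid_alive`, `stop_and_wait`) are hypothetical and this is not the PHP that OPNsense ships. The point is only the ordering: read the file once, keep the number, then signal and poll that number.

```python
import os
import signal
import time


def read_pid(pidfile: str) -> int:
    """Read the pid file exactly once and return the numeric pid."""
    with open(pidfile, "r", encoding="ascii") as handle:
        return int(handle.read().strip())


def pid_alive(pid: int) -> bool:
    """Signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_and_wait(pidfile: str, timeout: float = 10.0,
                  interval: float = 0.2) -> bool:
    """Send SIGINT to the pid read up front, then poll that same pid."""
    try:
        pid = read_pid(pidfile)
    except (FileNotFoundError, ValueError):
        return True  # no usable pid file, nothing to stop
    try:
        os.kill(pid, signal.SIGINT)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + timeout
    while pid_alive(pid):
        if time.monotonic() >= deadline:
            return False  # still running; do not start a second instance
        time.sleep(interval)
    return True
```

Two caveats: if the caller is the parent of the process, an exited child remains visible as a zombie until it is reaped, so `pid_alive` would keep returning True; and a numeric pid can in principle be reused by an unrelated process, although that is unlikely within a few seconds.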
Yep, let me fix that tomorrow. Not sure if related as this was the case for years, but we will see. Thanks! |
I disagree to some degree and am pulling myself out of this issue now because it’s going nowhere. |
@gstrauss I just think that if (correct me if I'm wrong) the pid file is deleted before |
I have the same issue happening. I don't understand half of this thread, but what I understand is that two processes are fighting over some stuff, which is causing this. However, how can I fix this now? Is it just waiting on the next update, or is there a patch I can (try to) install? |
@Melantrix It's highly likely solved if you keep the OPNsense defaults, System->Settings->Administration/"Listen Interfaces" : "All (Recommended)" |
Okay, but I don't want to listen on all interfaces. I have disabled my untrusted networks in that list because, well, they are untrusted. I understand I can block the traffic with firewall rules, which I have. But why listen on untrusted interfaces when they are untrusted? Is there anything else to fix this? |
Bind to loopback with a port forward? (and make sure you don't lock yourself out while doing so) Most issues are related to people using some sort of non static interface to bind to (which theoretically can't work). Offering a selection was likely one of the worst choices we made in the past years (after a user pushed for it and we warned from the beginning this is risky when surroundings change frequently)..... |
@AdSchellevis Thanks for responding this quickly! What would a non-static interface be? I have them only enabled on my trusted VLANs that are on my LAN interface, but disabled on e.g. my WAN interface or Guest network. I don't understand what the problem with this could be, to be honest. What causes this issue exactly? If I understand the thread correctly, there are two processes fighting over port 443 on an interface to listen on? How is that connected to which interface the GUI listens to? As far as I know there is no other web GUI installed on the firewall that should also listen on a port? (I'm just trying to understand the issue, so I can do a proper risk assessment on which path to take to fix this.) Maybe I'm just thinking about this in the wrong way..? |
To the OPNsense members subscribed: having systems-level functionality managed by PHP written by non-systems-developers is not a good look for OPNsense. I contributed code with a working solution above. If time jumping backwards were a concern, my proposed solution is not wrong, but could be improved to handle that case, which is what @kulikov-a added. However, there is nobody on the OPNsense team who has posted here committing either to reviewing those proposed solutions, or suggesting other next steps. |
yes indeed, that's the most common cause of issues.
In most cases that should work, but I don't know the exact circumstances.
If I understand the ticket correctly (but I have not been involved), there's a locking issue with the webserver, for which @fichtner proposed a simple solution, 226c133. There were other ventures in the same thread, but as they add a lot of complexity (and add risk), at the moment they're not in the OPNsense codebase. Quite some of this is old (yet cleaned-up) code, which is vital for the system from a user perspective, so we're careful what to merge there. In our experience adding a lot of complexity has the risk of quite some backlash, which might be worse than the original issue, which doesn't seem to affect many users. If you're only using static interfaces, I would certainly try whether 226c133 improves the situation; other improvements might take longer to integrate. If it does, just ping @fichtner so he has another datapoint. |
Will this be fixed via c95c444 in one of the future (minor) releases? |
They will be in 23.1.4, but they don’t fix 500 error. Honestly, just disable listening interface selection and avoid this one. |
Looks like improvements in 23.1.4 made the problem worse, indicating that the direction this is going is a bit doubtful. Instead, funnel the restart through configd to reach some state of serialization similar to what filter_configure() is doing. While here move the service definition to the correct file.
New patch on top of 23.1.4 is 33ad504
If it works I'd like to close this ticket finally. Cheers, |
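For readers wondering what "serialization" of restarts means in practice, here is a small Python sketch using an exclusive file lock, so that two restart requests arriving at the same time run one after the other instead of overlapping. This is only an illustration of the general idea; it is not how configd or the patch above implements it, and the lock path and `restart_webgui` placeholder are invented for the example.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def exclusive_lock(path: str, blocking: bool = True) -> Iterator[None]:
    """Hold an exclusive flock() on path for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # With LOCK_NB this raises BlockingIOError if the lock is held.
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def restart_webgui() -> None:
    """Placeholder for the real stop, wait, start sequence."""
    print("restarting")


if __name__ == "__main__":
    # Hypothetical lock file, used only for this demo.
    lock_path = os.path.join(tempfile.gettempdir(), "webgui-restart-demo.lock")
    with exclusive_lock(lock_path):
        restart_webgui()
```

With the blocking default, a second caller simply waits until the first restart has finished, which avoids the situation reproduced earlier in the thread with two parallel `configctl webgui restart` calls.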
So I would roll back kulikov-a patch from the beginning and add the one you mentioned? |
The idea is simply no more 500, but it may be that configd in detach mode will not serialise correctly. It's a one line fix then. If you have other patches move to 23.1.4 and run:
|
Ok... trap is set... now lets wait. |
So far the WebGUI seems to work. Errors haven't shown up. I only have the following in the system log at PPPoE reconnect:
|
Looks promising. I added 019ea52 on top ready for 23.1.6 then. |
23.1.6 was released today. |
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
Describe the bug
After some time running opnsense I'll get a "503 - service unavailable" error when trying to access the WebGUI. Opnsense itself seems to still do its job, however the WebGUI is gone.
Killing lighttpd and restarting services via console solves it (as mentioned here: https://forum.opnsense.org/index.php?topic=32540.0)
I noticed this after the last upgrade.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Would expect regular WebGUI operation and login window.
Describe alternatives you considered
Relevant log files
As mentioned in the forum post, I found the part below in the logs as well. I could not spot any other error related to the WebGUI.
2023-02-20T07:20:07 Error opnsense /usr/local/etc/rc.restart_webgui: The command '/usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf' returned exit code '255', the output was '2023-02-20 07:20:07: (network.c.537) can't bind to socket: 127.0.0.1:443: Address already in use'
Environment
OPNsense 23.1.1_2-amd64
FreeBSD 13.1-RELEASE-p6
OpenSSL 1.1.1t 7 Feb 2023
AMD GX-412TC SOC