Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebGUI: can't bind to socket: 127.0.0.1:443: Address already in use #6351

Closed
2 tasks done
chris42 opened this issue Feb 21, 2023 · 86 comments
Closed
2 tasks done

WebGUI: can't bind to socket: 127.0.0.1:443: Address already in use #6351

chris42 opened this issue Feb 21, 2023 · 86 comments
Assignees
Labels
cleanup Low impact changes
Milestone

Comments

@chris42
Copy link

chris42 commented Feb 21, 2023

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

After some time running opnsense I'll get a "503 - service unavailable" error when trying to access the WebGUI. Opnsense itself seems to still do its job, however the WebGUI is gone.
Killing lighttpd and restarting services via console solves it (as mentioned here: https://forum.opnsense.org/index.php?topic=32540.0)

I noticed this after the last upgrade.

To Reproduce

Steps to reproduce the behavior:

  1. Run below mentioned version
  2. Wait a few days
  3. Try to access WbGUI
  4. See error

Expected behavior

Would expect regular WebGUI operation and login window.

Describe alternatives you considered

Relevant log files

As in Forum post mentioned, I found the below part in the logs as well. Could not spot any other error related to webgui.
2023-02-20T07:20:07 Error opnsense /usr/local/etc/rc.restart_webgui: The command '/usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf' returned exit code '255', the output was '2023-02-20 07:20:07: (network.c.537) can't bind to socket: 127.0.0.1:443: Address already in use'

Environment

OPNsense 23.1.1_2-amd64
FreeBSD 13.1-RELEASE-p6
OpenSSL 1.1.1t 7 Feb 2023
AMD GX-412TC SOC

@fichtner
Copy link
Member

HAProxy or Nginx perhaps? I don’t think that happens in a default install.

Cheers,
Franco

@fichtner fichtner added the support Community support label Feb 21, 2023
@chris42
Copy link
Author

chris42 commented Feb 21, 2023

I think that would be easy to spot, but I don't have anything running that I would suspect to use :443.

Additional packets I have installed:

  • os-acme-client
  • os-ddclient
  • os-git-backup
  • os-siproxd
  • os-wireguard

@mimugmail
Copy link
Member

Interfaces : Diagnostics : Netstat, go for Ports or Protocol and search for 443

@chris42
Copy link
Author

chris42 commented Feb 21, 2023

sockstat -l | grep :443
will give (ipv6 removed)

root     lighttpd   29089 6  tcp4   127.0.0.1:443         *:*
root     lighttpd   29089 7  tcp6   ::1:443               *:*
root     lighttpd   29089 8  tcp4   192.168.2.1:443       *:*
root     lighttpd   29089 9  tcp6   <xxx>:443 *:*

However this probably only makes sense, after lighttpd actually can't restart?

@fichtner
Copy link
Member

Does acme client also handle the web GUI certificate? Or if there is still a process it probably hangs doing "something' although it seems new to me. Selecting specific interfaces also doesn't help in general, but is probably not causing it.

Cheers,
Franco

@AdSchellevis
Copy link
Member

It seems to use lighttpd

https://github.com/opnsense/plugins/blob/df1b28335d17f4fd2b4060b3b8df731bd2c567ca/security/acme-client/src/etc/rc.d/acme_http_challenge#L27

Best check the proceslist first

ps fax | grep lighttpd

Best regards,

Ad

@kulikov-a
Copy link
Member

..but acme lighty instance not using standard port by default and not using SSL
could it be just gui reload on 'newwanip' and lighty just did not terminated gracefully?
then there should be some errors in the Web GUI log before the mentioned general log error

@chris42
Copy link
Author

chris42 commented Feb 22, 2023

I am not sure if it is acme. I checked the logs. The lighttpd error from above occours at 5am daily and 7am on Mondays.

  • 5am is my pppoe daily reset for DSL
  • 7am is my weekly firmware update check

acme on the other hand is at midnight daily.

Around that lighttpd error at 5am there is multiple other errors, but hard to figure what is relevant.

siproxd (expected as network is reconnecting):
2023-02-20T05:00:01+01:00 opnsense.<xxx>.de siproxd 92666 - [meta sequenceId="1"] sock.c:445 ERROR:sendto() [<xxx>:5060 size=1082] call failed: Network is unreachable

prefixes.sh:
2023-02-20T05:00:12+01:00 opnsense.<xxx>.de opnsense 24088 - [meta sequenceId="44"] /usr/local/etc/rc.newwanipv6: The command '/usr/sbin/daemon -f -p '/var/run/dhcpleases6.pid' '/usr/local/opnsense/scripts/dhcp/prefixes.sh'' returned exit code '3', the output was 'daemon: process already running, pid: 17697'

radvd:
2023-02-20T05:00:12+01:00 opnsense.<xxx>.de opnsense 75150 - [meta sequenceId="45"] /usr/local/etc/rc.newwanipv6: The command '/usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog' returned exit code '255', the output was ''

ntpd:
2023-02-20T05:00:30+01:00 opnsense.<xxx>.de opnsense 75150 - [meta sequenceId="151"] /usr/local/etc/rc.newwanipv6: The command '/usr/local/sbin/ntpd -g -c '/var/etc/ntpd.conf'' returned exit code '1', the output was ''

mount:
2023-02-20T05:00:31+01:00 opnsense.<xxx>.de opnsense 75150 - [meta sequenceId="156"] /usr/local/etc/rc.newwanipv6: The command '/sbin/mount -t devfs devfs '/var/unbound/dev'' returned exit code '1', the output was 'mount: devfs: Device busy'
2023-02-20T05:00:31+01:00 opnsense.<xxx>.de opnsense 40336 - [meta sequenceId="157"] /usr/local/etc/rc.newwanipv6: The command '/sbin/mount -r -t nullfs '/usr/local/lib/python3.9' '/var/unbound/usr/local/lib/python3.9'' returned exit code '1', the output was 'mount_nullfs: /var/unbound/usr/local/lib/python3.9: Device busy'

then there is lighttpd:
2023-02-20T05:00:36+01:00 opnsense.<xxx>.de opnsense 40336 - [meta sequenceId="174"] /usr/local/etc/rc.newwanipv6: The command '/usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf' returned exit code '255', the output was '2023-02-20 05:00:36: (network.c.537) can't bind to socket: 127.0.0.1:443: Address already in use'

at 7am, there is only the lighttpd error.

I can also provide a full log, if that helps you.

@kulikov-a
Copy link
Member

@chris42 sorry, just to clarify, so you can reproduce this issue on every "Periodic interface reset" cron job?

@chris42
Copy link
Author

chris42 commented Feb 22, 2023

Yes and no. I can reproduce the lighttpd error in every "Periodic interface reset" cron job and the "Automatic firmware update" one. However it does not lead to the "503 - service unavailable" error right away. It seems that it takes a while until lighttpd is broken.

@kulikov-a
Copy link
Member

kulikov-a commented Feb 23, 2023

@chris42 thanks! one more question: is Web Server Log enabled? (System: Settings: Logging)
would be interesting to see the lighty logs.
imho can't bind to socket error itself cannot cause an 503 response. it feels like something is killing fastcgi while lighty is running .. I haven’t figured out how this is possible yet, but I managed to reproduce it just once using the "Automatic firmware update" job: instead of "server stopped" and "server started" messages, lighty throws (gw_backend.c.369) child signalled: 1 then tries to reestablish fastcgi connection ((gw_backend.c.274) establishing connection failed: socket: unix:/tmp/php-fastcgi.socket-0: Connection refused) and finaly gives 503 responses in gui

@chris42
Copy link
Author

chris42 commented Feb 23, 2023

D'OH sorry for not thinking of lighttpd logs as well.

So here is what I have in there:
At 5am (pppoe reset), I normally get "server stopped", "server started". However on Monday this week I get:

2023-02-20T05:00:36+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="1"] (gw_backend.c.369) child signalled: 1
2023-02-20T05:00:36+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="2"] (gw_backend.c.369) child signalled: 1

Then coinciding with the last time I saw the 503 error I have the following on Tuesday in the log:

2023-02-21T18:08:29+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="1"] (gw_backend.c.274) establishing connection failed: socket: unix:/tmp/php-fastcgi.socket-1: Connection refused
2023-02-21T18:08:29+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="2"] (gw_backend.c.274) establishing connection failed: socket: unix:/tmp/php-fastcgi.socket-0: Connection refused
2023-02-21T18:08:29+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="3"] (gw_backend.c.960) all handlers for /index.php? on .php are down.
2023-02-21T18:08:31+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="4"] (gw_backend.c.351) gw-server re-enabled: unix:/tmp/php-fastcgi.socket-1  0 /tmp/php-fastcgi.socket
2023-02-21T18:08:31+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="5"] (gw_backend.c.351) gw-server re-enabled: unix:/tmp/php-fastcgi.socket-0  0 /tmp/php-fastcgi.socket
2023-02-21T18:29:23+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="1"] (gw_backend.c.274) establishing connection failed: socket: unix:/tmp/php-fastcgi.socket-1: Connection refused
2023-02-21T18:29:23+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="2"] (gw_backend.c.274) establishing connection failed: socket: unix:/tmp/php-fastcgi.socket-0: Connection refused
2023-02-21T18:29:23+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="3"] (gw_backend.c.960) all handlers for /index.php? on .php are down.
2023-02-21T18:29:26+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="4"] (gw_backend.c.351) gw-server re-enabled: unix:/tmp/php-fastcgi.socket-1  0 /tmp/php-fastcgi.socket
2023-02-21T18:29:26+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="5"] (gw_backend.c.351) gw-server re-enabled: unix:/tmp/php-fastcgi.socket-0  0 /tmp/php-fastcgi.socket
2023-02-21T18:32:37+01:00 opnsense.<redacted> lighttpd 70749 - [meta sequenceId="1"] (server.c.2078) server stopped by UID = 0 PID = 12116
2023-02-21T18:32:37+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="2"] (server.c.2078) server stopped by UID = 0 PID = 12116
2023-02-21T18:32:37+01:00 opnsense.<redacted> lighttpd 58826 - [meta sequenceId="3"] (server.c.441) unlink failed for: /var/run/lighty-webConfigurator.pid: No such file or directory
2023-02-21T18:33:18+01:00 opnsense.<redacted> lighttpd 48897 - [meta sequenceId="4"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-21T18:33:18+01:00 opnsense.<redacted> lighttpd 48897 - [meta sequenceId="5"] (gw_backend.c.510) connect /tmp/php-fastcgi.socket-0: Connection refused
2023-02-21T18:33:18+01:00 opnsense.<redacted> lighttpd 48897 - [meta sequenceId="6"] (gw_backend.c.510) connect /tmp/php-fastcgi.socket-1: Connection refused
2023-02-21T18:33:27+01:00 opnsense.<redacted> lighttpd 44790 - [meta sequenceId="7"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-21T18:33:27+01:00 opnsense.<redacted> lighttpd 44790 - [meta sequenceId="8"] (mod_deflate.c.733) DEPRECATED: compress.filetype replaced with deflate.mimetypes
2023-02-21T18:33:27+01:00 opnsense.<redacted> lighttpd 44790 - [meta sequenceId="9"] (mod_deflate.c.796) DEPRECATED: compress.cache-dir replaced with deflate.cache-dir
2023-02-21T18:33:27+01:00 opnsense.<redacted> lighttpd 44790 - [meta sequenceId="10"] (configfile.c.1289) WARNING: unknown config-key: server.dir-listing (ignored)
2023-02-21T18:33:44+01:00 opnsense.<redacted> lighttpd 48897 - [meta sequenceId="11"] (server.c.2078) server stopped by UID = 0 PID = 30177
2023-02-21T18:33:47+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="12"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-21T18:33:47+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="13"] (gw_backend.c.369) child signalled: 1
2023-02-21T18:33:47+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="14"] (gw_backend.c.510) connect /tmp/php-fastcgi.socket-1: Connection refused
2023-02-21T18:33:47+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="15"] (gw_backend.c.369) child signalled: 1
2023-02-21T18:33:47+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="16"] (gw_backend.c.510) connect /tmp/php-fastcgi.socket-0: Connection refused
2023-02-21T18:33:48+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="17"] (server.c.2078) server stopped by UID = 0 PID = 77623
2023-02-21T18:33:48+01:00 opnsense.<redacted> lighttpd 57316 - [meta sequenceId="18"] (server.c.441) unlink failed for: /var/run/lighty-webConfigurator.pid: No such file or directory
2023-02-21T18:33:48+01:00 opnsense.<redacted> lighttpd 9787 - [meta sequenceId="19"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-21T18:34:03+01:00 opnsense.<redacted> lighttpd 9787 - [meta sequenceId="20"] (server.c.2078) server stopped by UID = 0 PID = 41182
2023-02-21T18:34:04+01:00 opnsense.<redacted> lighttpd 51570 - [meta sequenceId="21"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-21T18:34:18+01:00 opnsense.<redacted> lighttpd 51570 - [meta sequenceId="22"] (server.c.2078) server stopped by UID = 0 PID = 88442
2023-02-21T18:34:18+01:00 opnsense.<redacted> lighttpd 29089 - [meta sequenceId="23"] (server.c.1704) server started (lighttpd/1.4.68)

FYI: The 18:08 should be me trying to access opnsense. Following a "killall lighttpd" from the console and then restarted all services to fix the 503 error. The two log pieces are everything that is there for 20.-21.02.

@chris42
Copy link
Author

chris42 commented Feb 24, 2023

Ok, this is getting weirder by the minute.
Today I get the following in the logs:

2023-02-24T05:00:40+01:00 opnsense.<redacted> lighttpd 28235 - [meta sequenceId="1"] (server.c.2078) server stopped by UID = 0 PID = 83947
2023-02-24T05:00:41+01:00 opnsense.<redacted> lighttpd 42478 - [meta sequenceId="2"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-24T05:01:05+01:00 opnsense.<redacted> lighttpd 42478 - [meta sequenceId="3"] (server.c.2078) server stopped by UID = 0 PID = 76194
2023-02-24T05:01:06+01:00 opnsense.<redacted> lighttpd 87916 - [meta sequenceId="4"] (server.c.1704) server started (lighttpd/1.4.68)
2023-02-24T05:01:06+01:00 opnsense.<redacted> lighttpd 87916 - [meta sequenceId="5"] (gw_backend.c.369) child signalled: 1
2023-02-24T05:01:06+01:00 opnsense.<redacted> lighttpd 87916 - [meta sequenceId="6"] (gw_backend.c.369) child signalled: 1
2023-02-24T05:01:06+01:00 opnsense.<redacted> lighttpd 87916 - [meta sequenceId="7"] (gw_backend.c.510) connect /tmp/php-fastcgi.socket-1: Connection refused

Being on the WebGUI, it shows me there, that the WebGUI is not running?

@kulikov-a
Copy link
Member

sorry, i dont have a suitable stand handy for "Periodic interface reset" test with pppoe interface. Whether correctly I understand that this job will trigger newwanip (and so GUI reload) twice?
in this case, I can assume that two webgui restarts can occur with such a small delay that the webgui_configure_do will overlap and at the moment when lighty restarts the second time, the following happens: the script kills the php-cgi

killbyname('php-cgi', 'HUP');

of the lighty that just started and cannot start a new instance (can't bind to socket)
(I managed to reproduce this (not on a first try) by calling the configctl webgui restart almost simultaneously from two parallel ssh sessions)

@kulikov-a
Copy link
Member

@chris42
can you try with
opnsense-patch -a kulikov-a d2c26a3
please? (kulikov-a@d2c26a3)
slightly changes the actions order, removes one action and adds debugging info to the general log. perhaps this will help in cases of restarting the gui with a short interval

@chris42
Copy link
Author

chris42 commented Feb 25, 2023

Thanks a lot! Did a reboot before applying, so any silly state is hopefully cleared up. Applied the patch (I guess reboot won't be necessary to activate)
We probably will see tomorrow or on Monday how it works.

@gstrauss
Copy link

When the problem occurs, who owns the sockets in /tmp and who has them open? lsof, fuser, filel might help. Do the socket files exist in /tmp? Is lighttpd.conf configured to create those sockets? Or is PHP-FPM used? (PHP-FPM is recommended to manage PHP services.)

If lighttpd.conf is configured with "bin-path" to create the socket and run the backend, and if something else deletes the socket in /tmp, then lighttpd is managing a backend PHP process which is alive (has not exited) and listening on a socket that has been unlinked from the filesystem. New connections from lighttpd to the socket filesystem path will fail. If the backend PHP process exits, lighttpd will re-create the socket and restart the backend, but as long as the backend started by lighttpd is alive, that does not happen.

If something else is auto-cleaning /tmp, I recommend moving the sockets to a different, protected path, such as /run/lighttpd/...

@chris42
Copy link
Author

chris42 commented Feb 27, 2023

So far webGUI is still alive and kicking. lighttpd logs are clean with only server stopping/starting messages.

Within system logs I found the following on yesterday's reconnect:

2023-02-26T05:00:16+01:00 opnsense.<redacted> opnsense 65690 - [meta sequenceId="52"] /usr/local/etc/rc.newwanip: webgui_id_63fad950a227e: start webgui_configure_do
2023-02-26T05:00:38+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="136"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96634fba: start webgui_configure_do
2023-02-26T05:00:39+01:00 opnsense.<redacted> opnsense 24122 - [meta sequenceId="146"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad967269eb: start webgui_configure_do
2023-02-26T05:00:44+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="160"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96c315ea: start webgui_configure_do
2023-02-26T05:00:44+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="161"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96c315ea: starting generate_config
2023-02-26T05:00:44+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="162"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96c315ea: reload gui templates
2023-02-26T05:00:44+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="163"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96c315ea: php-cgi HUP skipped. realy need it? (lighty spawns FastCGI) (besides, it only kills one instance (and may kill captive portal?)?
2023-02-26T05:00:44+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="164"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96c315ea: config ready. TERM lighty
2023-02-26T05:00:44+01:00 opnsense.<redacted> opnsense 96761 - [meta sequenceId="165"] /usr/local/etc/rc.newwanipv6: webgui_id_63fad96c315ea: start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)

On today's reconnect (5am) and firmware check (7am) I get the following:

2023-02-27T05:00:20+01:00 opnsense.<redacted> opnsense 625 - [meta sequenceId="178"] /usr/local/etc/rc.newwanip: webgui_id_63fc2ad400c28: start webgui_configure_do
2023-02-27T05:00:37+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="209"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2ae55aa63: start webgui_configure_do
2023-02-27T05:00:37+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="219"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2ae5bfe1c: start webgui_configure_do
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="240"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af2439f4: start webgui_configure_do
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="242"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af2439f4: starting generate_config
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="243"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af2439f4: reload gui templates
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="245"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af25c4a9: start webgui_configure_do
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="246"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af25c4a9: starting generate_config
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="247"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af25c4a9: reload gui templates
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="248"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af25c4a9: php-cgi HUP skipped. realy need it? (lighty spawns FastCGI) (besides, it only kills one instance (and may kill captive portal?)?
2023-02-27T05:00:50+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="249"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af25c4a9: config ready. TERM lighty
2023-02-27T05:00:51+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="250"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af25c4a9: start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)
2023-02-27T05:00:51+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="252"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af2439f4: php-cgi HUP skipped. realy need it? (lighty spawns FastCGI) (besides, it only kills one instance (and may kill captive portal?)?
2023-02-27T05:00:51+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="253"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af2439f4: config ready. TERM lighty
2023-02-27T05:00:51+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="260"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af2439f4: start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)
2023-02-27T05:00:55+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="274"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2af7a123b: start webgui_configure_do
2023-02-27T05:01:09+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="288"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b05b1080: start webgui_configure_do
2023-02-27T05:01:09+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="289"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b05b1080: starting generate_config
2023-02-27T05:01:09+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="290"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b05b1080: reload gui templates
2023-02-27T05:01:10+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="291"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b05b1080: php-cgi HUP skipped. realy need it? (lighty spawns FastCGI) (besides, it only kills one instance (and may kill captive portal?)?
2023-02-27T05:01:10+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="292"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b05b1080: config ready. TERM lighty
2023-02-27T05:01:10+01:00 opnsense.<redacted> opnsense 55596 - [meta sequenceId="293"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b05b1080: start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)
2023-02-27T05:01:19+01:00 opnsense.<redacted> opnsense 6073 - [meta sequenceId="322"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b0fcdc3b: start webgui_configure_do
2023-02-27T05:01:19+01:00 opnsense.<redacted> opnsense 27197 - [meta sequenceId="325"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b0fe1e93: start webgui_configure_do
2023-02-27T05:01:19+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="326"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b0fe2abc: start webgui_configure_do
2023-02-27T05:01:24+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="340"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b14e553b: start webgui_configure_do
2023-02-27T05:01:24+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="341"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b14e553b: starting generate_config
2023-02-27T05:01:24+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="342"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b14e553b: reload gui templates
2023-02-27T05:01:25+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="343"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b14e553b: php-cgi HUP skipped. realy need it? (lighty spawns FastCGI) (besides, it only kills one instance (and may kill captive portal?)?
2023-02-27T05:01:25+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="344"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b14e553b: config ready. TERM lighty
2023-02-27T05:01:25+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="345"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b14e553b: start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)
2023-02-27T05:01:38+01:00 opnsense.<redacted> opnsense 73445 - [meta sequenceId="359"] /usr/local/etc/rc.newwanipv6: webgui_id_63fc2b2241aec: start webgui_configure_do
2023-02-27T07:06:15+01:00 opnsense.<redacted> opnsense 44368 - [meta sequenceId="1"] /usr/local/etc/rc.restart_webgui: webgui_id_63fc4857e4d35: start webgui_configure_do
2023-02-27T07:06:15+01:00 opnsense.<redacted> opnsense 44368 - [meta sequenceId="2"] /usr/local/etc/rc.restart_webgui: webgui_id_63fc4857e4d35: starting generate_config
2023-02-27T07:06:15+01:00 opnsense.<redacted> opnsense 44368 - [meta sequenceId="3"] /usr/local/etc/rc.restart_webgui: webgui_id_63fc4857e4d35: reload gui templates
2023-02-27T07:06:16+01:00 opnsense.<redacted> opnsense 44368 - [meta sequenceId="4"] /usr/local/etc/rc.restart_webgui: webgui_id_63fc4857e4d35: php-cgi HUP skipped. realy need it? (lighty spawns FastCGI) (besides, it only kills one instance (and may kill captive portal?)?
2023-02-27T07:06:16+01:00 opnsense.<redacted> opnsense 44368 - [meta sequenceId="5"] /usr/local/etc/rc.restart_webgui: webgui_id_63fc4857e4d35: config ready. TERM lighty
2023-02-27T07:06:16+01:00 opnsense.<redacted> opnsense 44368 - [meta sequenceId="6"] /usr/local/etc/rc.restart_webgui: webgui_id_63fc4857e4d35: start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)

Does that help?

@kulikov-a
Copy link
Member

kulikov-a commented Feb 27, 2023

@chris42 yes, thanks! may help (resembles a some 'race condition')
Only there is a feeling that the data is sorted by the Message column. I will try to change the sorting to 'by time and sequence ID' and then it will become clearer (i hope) <-sorry about that - I don't know what it was.

I will comment on the log a bit later

@gstrauss
Copy link

config ready. TERM lighty
start lighty (lighty will remove pid file of working instance if start fail because of 'can't bind'. thats not funny)

I'm a little bit confused. There should be no "working instance" if lighttpd is being started. If there is, then whatever privileged process tried to start lighttpd failed to first check if lighttpd was already running, or failed to shut lighttpd down and wait for lighttpd to exit, and for the lighttpd instance shutting down to remove its pid file.

After changing lighttpd configuration, it is recommended that the config be tested before lighttpd is restarted:
lighttpd -f /etc/lighttpd/lighttpd.conf -tt
No output and exit value of 0 is success.
Then, instead of sending lighttpd SIGTERM, sending SIGUSR1 will result in a graceful restart of lighttpd, reloading the config.

However, if the lighttpd executable has changed (e.g. package upgrade), then a stop and start of lighttpd is needed. SIGINT will stop lighttpd gracefully, handling existing connections for a few more seconds (configurable). SIGTERM will stop lighttpd very quickly, without continuing to handle connections in progress.

Now, I don't know the specifics of the webgui behavior. If a SIGTERM is sent to lighttpd, does the webgui do it, or does the webgui ask the init script to restart lighttpd? The init script should be used. If the webgui is sending the signal, there is a possibility that someone is attempting to start lighttpd before the older process has exited, and so the failure to bind might occur.

I don't have a machine in front of me, so this is all theoretical scenarios, depending on what the webgui is actually doing.

@fichtner
Copy link
Member

Let's try 3610606 with SIGINT then. It could be a race, but the start/stop is sequential so it's unlikely an issue with the general flow.

# opnsense-patch 3610606

Cheers,
Franco

@gstrauss
Copy link

Looking in 3610606, there is definitely a race after sending SIGTERM (now SIGINT) and exec'ing lighttpd because there is nothing that checks that lighttpd has exited before trying to start lighttpd back up.

While changing the signal to SIGINT is more friendly to existing web requests in progress, lighttpd will by default wait for 8 seconds for those in-progress connections to finish before lighttpd closes those connections. The time to wait is configurable, but just about any value will result in the lighttpd bind failure at startup since the older lighttpd will still be running (unless there are no open connections and the older lighttpd manages to kill the backend PHP, wait for it to exit, and then lighttpd exit before webgui.inc tries to start up a new instance of lighttpd).

Is there a service which restarts lighttpd if lighttpd exits? Can the webgui ask that service to restart lighttpd?

If not, can the webgui wait until the lighttpd pid file is removed or kill -0 $pid on the lighttpd pid is invalid?

@fichtner
Copy link
Member

The third parameter is true which means it monitors PID file to wait for exit.

@gstrauss
Copy link

FYI: these are now the defaults in lighttpd since lighttpd 1.4.68, so this can be omitted.

            /* harden TLS for PCI conformance */
            $lighty_config .= <<<EOD
ssl.openssl.ssl-conf-cmd = (
    "MinProtocol" => "TLSv1.2",
    "Options" => "-ServerPreference",
    "CipherString" => "EECDH+AESGCM:AES256+EECDH:CHACHA20:!SHA1:!SHA256:!SHA384"
)

EOD;

and I question the correctness of $lighty_config .= " \"MinProtocol\" => \"TLSv1\""; on line 404 in the else block below that. TLSv1 is deprecated and its use strongly discouraged on any modern system. Omit specifying "MinProtocol" to get the default TLS MinProtocol in lighttpd, which is currently "TLSv1.2".

@gstrauss
Copy link

As @fichtner notes, killbypid() waits for the process to exit. Given that, there should not be a race condition where lighttpd is still running when that continues executing webgui.inc. (And yet, here we are discussing in this issue why lighttpd might fail to start with bind address already in use.)

There is a different race condition in src/etc/inc/util.inc in killbypid()

    while (isvalidpid($pidfile)) {
        usleep(200 * 1000);
    }

That loops and every 200ms reads the pidfile. In the meantime, something else might delete the pidfile and something else might recreate the pidfile, and the loop would be none-the-wiser. The pidfile should be read and saved. Then the loop can operate on the numeric pid value. The pidfile should be read into memory before sending lighttpd SIGINT, and the signal should be sent to the pid, not to the pkill -F if $waitforit is set.

@fichtner
Copy link
Member

Yep, let me fix that tomorrow. Not sure if related as this was the case for years, but we will see. Thanks!

@fichtner fichtner self-assigned this Feb 27, 2023
@fichtner fichtner removed their assignment Mar 2, 2023
@fichtner fichtner removed the cleanup Low impact changes label Mar 2, 2023
@fichtner fichtner removed this from the 23.7 milestone Mar 2, 2023
@fichtner fichtner added the support Community support label Mar 2, 2023
@fichtner
Copy link
Member

fichtner commented Mar 2, 2023

I disagree to some degree and pull myself out this issue now because it’s going nowhere.

@kulikov-a
Copy link
Member

@gstrauss i just think that if (correct me if im wrong) pid-file deleted before graceful-shutdown-timeout wait this increases the likelihood of thread mixing since we use killbypid
@fichtner sorry to bother ) . just trying to understand: if a lock() is used in filter_configure_sync() and works perfect in kulikov-a@b802b41, why are you against using a lock for concurrency management in this case?

@Melantrix
Copy link

I have the same issue happening. I don't understand half of this thread but what i understand is that two processes are fighting over some stuff which is causing this.

However, how can i fix this now? Is it just waiting on the next update, or is there a patch i can (try to) install?

@AdSchellevis
Copy link
Member

@Melantrix It's highly likely solved if you keep the OPNsense defaults, System->Settings->Administration/"Listen Interfaces" : "All (Recommended)"

@Melantrix
Copy link

@Melantrix It's highly likely solved if you keep the OPNsense defaults, System->Settings->Administration/"Listen Interfaces" : "All (Recommended)"

Okay but I don't want to listen on all interfaces. I have disabled my untrusted networks I'm that list because, well, they are untrusted. I understand I can block the traffic with firewall rules which i have. But why listen on untrusted interfaces when they are untrusted?

Is there anything else to fix this?

@AdSchellevis
Copy link
Member

Bind to loopback with a port forward? (and make sure you don't lock yourself out while doing so) Most issues are related to people using some sort of non static interface to bind to (which theoretically can't work). Offering a selection was likely one of the worst choices we made in the past years (after a user pushed for it and we warned from the beginning this is risky when surroundings change frequently).....

@Melantrix
Copy link

@AdSchellevis Thanks for responding this quicly!

a non static interface would be?

I have them only enabled on my trusted vlans that are on my LAN interface but disabled on e.g. my WAN interface or Guest network. i don't understand what the problem of this could be too be honest.

What causes this issue exactly? If i understand the thread correctly, there are two processes fichting over port 443 on an interface to listen on? How is that connected to which interface the gui listens too? As far as i know there is no other web gui installed on the firewall that should also listen on a port?

(i'm just trying to understand the issue, so i can do a proper risk assesment on which path to take to fix this).

Maybe i'm just thinking about this in the wrong way..?

@gstrauss
Copy link

To the OPNsense members subscribed: having systems-level functionality managed by PHP written by non-systems-developers is not a good look for OPNsense.

I contributed code with a working solution above. If time jumping backwards were a concern, my proposed solution is not wrong, but could be improved to handle that case, which is what @kulikov-a added. However, there is nobody on the OPNsense team who has posted here committing either to reviewing those proposed solutions, or suggesting other next steps.

@AdSchellevis
Copy link
Member

@Melantrix

a non static interface would be?

yes indeed, that's the most common cause of issues.

I have them only enabled on my trusted vlans that are on my LAN interface but disabled on e.g. my WAN interface or Guest network. i don't understand what the problem of this could be too be honest.

In most cases that should work, but.I don't know the exact circumstances.

What causes this issue exactly? If i understand the thread correctly, there are two processes fichting over port 443 on an interface to listen on? How is that connected to which interface the gui listens too? As far as i know there is no other web gui installed on the firewall that should also listen on a port?

If I understand the ticket correctly (but I have not been involved), there's a locking issue with the webserver, for which @fichtner proposed a simple solution 226c133, there where other ventures in the same thread, but as they add a lot of complexity (and add risk), at the moment their not in the OPNsense codebase.

Quite some if this is old (yet cleaned-up) code, which is vital for the system from a user perspective, we're careful what to merge there. In our experience adding a lot of complexity has the risk of quite some backlash, which might be worse than the original issue that doesn't seem to affect many users.

If you're only using static interfaces, I would certainly try if 226c133 improves the situation, other improvements might take longer to integrate. If it does, just ping @fichtner so he has another datapoint.

@maxkratz
Copy link

Will this be fixed via c95c444 in one of the future (minor) releases?

@fichtner
Copy link
Member

They will be in 23.1.4, but they don’t fix 500 error. Honestly, just disable listening interface selection and avoid this one.

fichtner added a commit that referenced this issue Mar 22, 2023
Looks like improvements in 23.1.4 made the problem worse, indicating
that the direction this is going is a bit doubtful.  Instead, funnel
the restart through configd to reach some state of serialization
similar to what filter_configure() is doing.

While here move the service definition to the correct file.
@fichtner fichtner self-assigned this Mar 22, 2023
@fichtner fichtner added cleanup Low impact changes and removed support Community support labels Mar 22, 2023
@fichtner fichtner added this to the 23.7 milestone Mar 22, 2023
@fichtner
Copy link
Member

New patch on top of 23.1.4 is 33ad504

# opnsense-patch 33ad50456a1

If it works I'd like to close this ticket finally.

Cheers,
Franco

@chris42
Copy link
Author

chris42 commented Mar 22, 2023

So I would roll back kulikov-a patch from the beginning and add the one you mentioned?
What do I look out for to tell if this is working? no more 500? special log entries?

@fichtner
Copy link
Member

The idea is simply no more 500, but it may be that configd in detach mode will not serialise correctly. It's a one line fix then.

If you have other patches move to 23.1.4 and run:

# opnsense-revert opnsense && opnsense-patch 33ad50456a1

@chris42
Copy link
Author

chris42 commented Mar 22, 2023

Ok... trap is set... now lets wait.

fichtner added a commit that referenced this issue Mar 23, 2023
Same as filter_configure() now.
@chris42
Copy link
Author

chris42 commented Mar 30, 2023

So far the WebGUI seems to work. Errors haven't showed up. Only having the following in the system log at pppoe reconnect:

2023-03-30T05:00:51+02:00 opnsense.<redacted> opnsense 59790 - [meta sequenceId="259"] /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : webgui_configure_do(,lan))
2023-03-30T05:00:52+02:00 opnsense.<redacted> opnsense 86374 - [meta sequenceId="273"] /usr/local/etc/rc.restart_webgui: The command '/usr/local/bin/flock -ne /var/run/lighty-webConfigurator.pid /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf' returned exit code '1', the output was ''
2023-03-30T05:00:52+02:00 opnsense.<redacted> opnsense 86374 - [meta sequenceId="274"] /usr/local/etc/rc.restart_webgui: The command '/bin/kill -'TERM' '9643'' returned exit code '1', the output was 'kill: 9643: No such process'
2023-03-30T05:00:59+02:00 opnsense.<redacted> opnsense 82595 - [meta sequenceId="278"] /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : webgui_configure_do(,opt6))

@fichtner
Copy link
Member

Looks promising. I added 019ea52 on top ready for 23.1.6 then.

@fichtner
Copy link
Member

23.1.6 was released today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup Low impact changes
Development

No branches or pull requests

8 participants