Possible regression of Unbound since 18.7.6 #2894
Comments
|
It looks like an issue with Unbound 1.8. 1.7 was reported to work fine. Our integration didn't change in this regard. You could verify this theory by reverting to 1.7: Cheers, |
|
Thanks, with 1.7 I didn't face this issue. I will try to revert to that older Unbound version. After the rollback, is there anything I should take care of, given that this package will no longer be in sync with what is normally shipped with 18.7.6+? |
|
Reload Unbound manually to load the previous version. Code-wise both are compatible. On next OPNsense update, however, it will try to install 1.8 again so you can lock the package from System: Firmware: Packages to prevent such an update. It could be that the update will stall in this case. Unbound is intertwined in some other packages... but nothing bad happens as it just won't update in this case. This is just for confirmation. I do believe it needs to be reported to upstream developers that something is wrong with 1.8. |
|
Is there anything to check in the logs the next time it stops (was it a graceful stop by some OPNsense service, or did it die ungracefully on its own)? Or is this more a matter for the Unbound devs? PS: the Unbound log is located at /var/log/resolver.log in case someone wants to read it from the CLI (it took me some time to figure that out). |
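To pick the fatal errors out of a noisy resolver log, a small filter can help. This is a minimal sketch, not an OPNsense tool: the function name `fatal_lines` is made up for illustration, the path is the one mentioned above, and it assumes the log is readable as plain text (if the file is kept in a circular-log format, dump it to text first).

```python
from pathlib import Path

# Path taken from the comment above; adjust if your install logs elsewhere.
RESOLVER_LOG = Path("/var/log/resolver.log")


def fatal_lines(lines):
    """Return only the Unbound log lines that report a fatal error."""
    return [
        line.rstrip("\n")
        for line in lines
        if "unbound" in line and "fatal error" in line
    ]


if __name__ == "__main__":
    if RESOLVER_LOG.exists():
        # errors="replace" keeps the scan going if the file has odd bytes.
        with RESOLVER_LOG.open(errors="replace") as fh:
            for entry in fatal_lines(fh):
                print(entry)
    else:
        print(f"{RESOLVER_LOG} not found")
```

Because it only matches on "fatal error", it stays useful even with a higher log level, where every query is logged.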
|
It happened again a couple of minutes ago. I added some DHCP static mappings and applied the change. This seemed to trigger a restart of Unbound as well. In the log, the last line was the same error I reported last time: Nov 14 12:29:03 unbound: [58191:0] fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf It seems this fatal error blocked Unbound from starting successfully. Strangely, if I press service start in the GUI, it starts without issues. |
|
Yep, it's reproducible every single time I apply the config. |
|
@fichtner IMHO I would expect this to be a timing issue during generation of the Unbound configuration. Since it happens from time to time, the DHCP hostname auto-registration in Unbound is the likely trigger. Maybe in 1.8 they changed the timing / blocking mechanism during reload, or they just "hard bind" to the inode of the configuration file while we "replace" the file with a new one during generation, and the new Unbound version is more sensitive to that. I am pretty sure this is not an upstream bug, or if it is, it would be a FreeBSD-related upstream bug, and that sounds rather unlikely. I am currently waiting for arm32/arm64 alpine to get fixed on docker.hub so I can package 1.8 and try to reproduce this on https://github.com/EugenMayer/docker-unbound .. but I have to wait. I could though try this with amd64 since I already packaged |
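The inode part of that hypothesis can be illustrated in isolation. The sketch below only shows that replacing a file by writing a new one and renaming it over the old path gives the path a new inode, so a process still holding the old file would not see the new content. It does not show how OPNsense actually regenerates unbound.conf or how Unbound 1.8 opens it; the function name and the temporary file name are hypothetical.

```python
import os
import tempfile


def replace_and_compare_inodes(directory):
    """Replace a file via write-new-then-rename; return (inode_before, inode_after)."""
    target = os.path.join(directory, "unbound.conf")
    with open(target, "w") as fh:
        fh.write("# old config\n")
    inode_before = os.stat(target).st_ino

    # The new content is written to a temporary file first, then renamed
    # over the target, so the path ends up pointing at a different file.
    tmp = target + ".tmp"
    with open(tmp, "w") as fh:
        fh.write("# new config\n")
    os.replace(tmp, target)
    inode_after = os.stat(target).st_ino
    return inode_before, inode_after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        before, after = replace_and_compare_inodes(workdir)
        print("inode changed:", before != after)
```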
|
Guys, while you are investigating this issue: I checked System\Settings\Cron \ Add new job, but I can only find 8-10 predefined jobs, and none of them is a custom service restart. I don't want to mess with anything the non-GUI way unless someone can give me 2-3 lines of clear instructions. |
|
Forgot to say: I was afraid of rolling back to unbound 1.7, fearing what may break when the next opnsense upgrade tries to reinstall it / tries to skip that out-of-sync package. So yes, I am still on this buggy 1.8.1, and if you need any log to capture the stop-reason, let me know. |
|
the command to restart unbound is "/usr/local/sbin/pluginctl dns", just add it to /etc/crontab (the next firmware update will overwrite crontab, but if you're looking for a solution until unbound is updated that's probably it) |
|
Rather use monit to monitor and restart the service if port 53 is dead? Monit is part of the plugins, so this should be easy and far better than a cron-based restart, since it self-heals instantly. You can also just test-resolve with monit, but be aware that the exit codes of nslookup / dig are nonsense; use grep to make sense of the output. |
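As an alternative to grepping nslookup / dig output, a probe can send one DNS query itself and return a clear yes/no. This is a rough sketch under assumptions: the names `build_query`, `resolver_answers` and `main`, the test name example.com, and the 2-second timeout are all illustrative, and the monit wiring that would consume the result is left out.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary fixed ID, good enough for a single probe


def build_query(name, query_id=QUERY_ID):
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def resolver_answers(server="127.0.0.1", port=53, name="example.com", timeout=2.0):
    """Return True if the server replies with a matching ID and RCODE 0."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and "connection refused"
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return reply_id == QUERY_ID and is_response and rcode == 0


def main():
    """Return 0 when the resolver answers, 1 otherwise."""
    return 0 if resolver_answers() else 1
```

Used as a script, the entry point would be `raise SystemExit(main())`, so the exit status reflects whether the resolver actually answered, which is the property the nslookup / dig exit codes lack.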
|
@EugenMayer: thanks for the hint, monit should be the real solution, and I would definitely prefer that. Unfortunately I don't have the spare time to become an expert in the nuances of monit, and the next time Unbound stops, angry users will throw the firewall in the trash. So I would like something very basic and very reliable to restart that damn Unbound. |
|
Google is your friend :) Don't make your life harder :) |
|
Small update: it seems it is triggered on Saturday, around 10:36 am. That is the 3rd Saturday morning when it happens, always around this 10:36 am time. |
|
Just as I expected: it happened again today, exactly 1 week later. // The last line before the current issue is from Dec 1, when the previous disconnect-reconnect event happened, so between the 2 events there is exactly 1 week of silence. // At this exact time, Unbound is stopped and fails to restart correctly, with the same error message as in my initial post. |
|
You probably set explicit Unbound Interfaces? |
|
You are correct, I mentioned recently that I protect DNS by not even listening on WAN. But after you said it may silently break things, I reverted that config, so it has been listening on ALL for a couple of weeks. |
|
Hmmm, maybe your reconnect comes back with the same IP? it might be that it doesn't know that it must rerun configuration when the IP is back |
|
Nope, it gets a completely different IP after the successful reconnect Dec 1 10:36:54 FW01 ppp: [wan] IPCP: LayerUp |
|
Ok, I'm testing Unbound 1.8.2 just now, but here's another patch that will help recover from a crashed unbound in your particular case (listen "all"): 3d8fd00 |
|
Thanks, I applied the patch (successfully, it seems). Did a forced PPPoE disconnect -> Unbound stopped and could not start, with the same error message. |
|
TBH, if it says "Could not read config file: /unbound.conf" it really mixes up the chroot for no apparent reason. The file is always there but it may be missing that "/var/unbound" is actually "./" but we changed nothing in this regard as far as I'm aware. |
|
I see what you mean: this patch is a workaround, the real issue is why does unbound search for the config file at the wrong location? |
|
core/src/etc/inc/plugins.inc.d/unbound.inc Line 464 in 3d8fd00
That's the only spot we ever start Unbound and as you can see it has a full path to the file. The chroot directive is inside unbound.conf, so how Unbound can say "./unbound.conf" is not found while adhering to the information inside the file is beyond me. :D
core/src/etc/inc/plugins.inc.d/unbound.inc Line 294 in 3d8fd00
|
|
The only special thing I have is the following in the Custom options section: include: /usr/local/etc/unbound/unbound_advertisement_serverlist_2018.11.28.txt I did a quick test and removed these 2 lines, saved the change, applied it, then did a PPPoE disconnect-reconnect -> it seems Unbound started correctly this time |
|
you should place this file in /var/unbound/ with chown unbound:unbound to be sure... |
|
The only problem is that my /var is on tmpfs, so the next reboot wipes it. Yet another syshook script? |
|
symlink might work. it's your setup :) |
|
Nope, the symlink didn't work. custom options: ls /var/unbound ls /usr/local/etc/unbound/ Still the same error, and after 1 min the automated retry starts it successfully. If I copy the file directly under /var/unbound, it works though. So because /var is a tmpfs, I need to find a way to copy the file from permanent storage to /var right after the tmpfs is mounted, but before Unbound is started (otherwise Unbound fails with a missing include file and stops). So the only way to do this is a syshook script, am I right? |
|
The right time is whenever you update the file you push it to /var/unbound and then run "pluginctl dns". I'm sorry, there's nothing else we can try here in an open source scope. The discussed workaround goes into 18.7.10. |
|
@fichtner: can you tell me at which stage of the boot process / syshook script run you generate the unbound.conf file that gets copied to /var/unbound? I would need to hook into that step to add the extra file copy from the persistent filesystem to the /var/unbound tmpfs, so that the Unbound service finds the copied file when it is started. At least that's my understanding. |
|
Use a "start" syshook: by then /var/unbound will be populated and Unbound running. Copy the file, then run "pluginctl dns" to restart Unbound. |
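The copy-then-restart step could be sketched roughly as below. This is an illustration under assumptions, not OPNsense code: `stage_include` is a made-up helper, the paths and the restart command are the ones quoted in this thread, the chown follows the earlier "chown unbound:unbound" advice and needs root, and how the script gets hooked into the "start" syshook is left out.

```python
import shutil
import subprocess
from pathlib import Path

# Values quoted in this thread; treat them as assumptions for your own setup.
CHROOT_DIR = Path("/var/unbound")
RESTART_CMD = ["/usr/local/sbin/pluginctl", "dns"]


def stage_include(source, chroot_dir=CHROOT_DIR, owner="unbound",
                  restart_cmd=RESTART_CMD):
    """Copy an include file into the Unbound directory, then restart Unbound.

    Pass owner=None or restart_cmd=None to skip the chown or restart step.
    """
    source = Path(source)
    dest = Path(chroot_dir) / source.name
    shutil.copy2(source, dest)
    if owner is not None:
        # Equivalent of "chown unbound:unbound"; requires root privileges.
        shutil.chown(dest, user=owner, group=owner)
    if restart_cmd is not None:
        # check=True raises CalledProcessError if the restart command fails.
        subprocess.run(restart_cmd, check=True)
    return dest
```

The same helper can be run by hand whenever the include file is updated, which matches the advice to push the file to /var/unbound and then run "pluginctl dns".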
|
Cool, I will try to implement this and hope it will work :) |
|
it should. if not let me know :) |
|
I have to wait a couple of days (most probably until the weekend) for this. When is 18.7.10 planned, so I could combine the 2 actions? |
|
Preliminary target for 18.7.10 is Tuesday (Jan 8) |
|
Thanks, will wait for it then, before going for custom scripting. |
Hello devs,
since upgrading to 18.7.6, which introduced Unbound 1.8.1, I face "frequent" DNS resolver service shutdowns. Checking the Unbound log, I don't see any reason why the service suddenly stopped itself, bringing all name resolution to a halt.
All I could find in the Unbound log is this line:
Nov 10 10:35:53 | unbound: [67269:0] fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf
When this happens and I manually try to start the service again, it starts successfully and seems to work for a couple of days, then suddenly stops again.
I tried setting the logging level to 2 to get more logs, but in that case every single name resolution attempt is also logged, and it will be difficult to catch the event that has anything to do with the stop/crash.
Where is the Unbound log stored in the underlying system, so I could use some clever grep to extract meaningful entries?