FTL crashed (main_dnsmasq+0x1238) #770
Comments
Can you provide the corresponding lines from We have debugging information prepared here and it would be very helpful if you can get the data for us: https://docs.pi-hole.net/ftldns/debugging/ |
I checked that log first, but didn't see anything of interest (although maybe all these queries to the one domain are a symptom?). Here are the logs from just before it crashes:
May 18 08:35:48 dnsmasq[23890]: query[AAAA] hel-efz.office.com from 192.168.0.21 |
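Whether repeated queries to one domain are a symptom can be checked by counting queries per domain and record type in the window before the crash. The following is a minimal sketch, assuming the dnsmasq log uses the line format quoted above. The name `count_queries` is hypothetical, and only the first sample line comes from this report; the other two are made up for illustration.

```python
import re
from collections import Counter

# Matches dnsmasq query lines such as:
#   May 18 08:35:48 dnsmasq[23890]: query[AAAA] hel-efz.office.com from 192.168.0.21
QUERY_RE = re.compile(
    r"dnsmasq\[\d+\]: query\[(?P<type>[A-Z0-9]+)\] (?P<domain>\S+) from (?P<client>\S+)"
)

def count_queries(lines):
    """Count (domain, query type) pairs across dnsmasq query lines."""
    counts = Counter()
    for line in lines:
        m = QUERY_RE.search(line)
        if m:
            counts[(m["domain"], m["type"])] += 1
    return counts

# Illustrative sample: only the first line is taken from the report.
sample = [
    "May 18 08:35:48 dnsmasq[23890]: query[AAAA] hel-efz.office.com from 192.168.0.21",
    "May 18 08:35:48 dnsmasq[23890]: query[A] hel-efz.office.com from 192.168.0.21",
    "May 18 08:35:48 dnsmasq[23890]: query[AAAA] hel-efz.office.com from 192.168.0.21",
]

for (domain, qtype), n in count_queries(sample).most_common():
    print(domain, qtype, n)
```

Running this over the last few minutes of the real log before each crash would show whether the same domain dominates every time.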
This issue has been mentioned on Pi-hole Userspace. There might be relevant details there: |
Is there a workaround for this? It is extremely tedious! I have added a cron script ( |
Sorry for the inconvenience. A workaround can only come once someone affected by this sends me debugging details as described here: https://docs.pi-hole.net/ftldns/debugging/ Once I have them, I can work on a fix that truly solves the issue. Until then I cannot do anything, as I'm still unable to reproduce this myself. Given that only 5 users have reported this while (according to GitHub's statistics) there have been more than 40,000 successful Pi-hole updates, this really seems to be an edge case. Obviously, we will try our best to get this fixed as soon as possible. We just need your help for it. |
Hey man, I'm happy to do this if I can, but does that mean I need to have an SSH terminal running on another computer until this issue happens? It can be more than 24hrs between crashes, that might be difficult to do. |
No, it runs unattended and you can come back when it crashed. |
Gotcha, I didn't quite understand the stuff about screen, but now I do. I'm debugging now. I did get one error when starting gdb: [New LWP 726] I assume that is normal? |
Yes |
This issue has been mentioned on Pi-hole Userspace. There might be relevant details there: |
For those who cannot attach the debugger (which would still give more information about the crash source): Please run
This should (hopefully) generate logs with extended information for us to check. |
Is this to do with /etc/resolv.conf pointing to 127.0.0.1? I just commented the line out, did |
My log looks like this - 15:38:15 is when it started working again. It's like dnsmasq wasn't running in that time. During that time, I restarted Pihole using the admin interface,
Tried this:
|
Ok I've done some more testing, and I've disabled DNSSEC - since then I haven't noticed any problems. I will leave it today, and if I still have no issues, turn it back on and post back. |
I just posted on Reddit that I have this issue and set my cron to restart DNS every 20 minutes. The TS was kind enough to link this issue. I'm going to attach the debugger, turn off the cronjob and see what it gives us. It crashes quite often, so I hope it won't be too long before I can come back with some results. |
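Instead of restarting on a fixed 20-minute schedule, a cron job could restart only when the FTL log shows a crash after the most recent start. This is a minimal sketch, assuming the marker strings `FTL crashed!` and `FTL started!` appear in the log as they do in the crash report in this issue. The function name is hypothetical, and the restart itself is left to whatever command the existing cron job already runs.

```python
def crashed_since_last_start(log_lines):
    """Return True if a crash marker appears after the most recent start marker."""
    crashed = False
    for line in log_lines:
        if "FTL started!" in line:
            crashed = False      # a later start clears any earlier crash
        elif "FTL crashed!" in line:
            crashed = True
    return crashed

# The two marker lines as they appear in the crash report in this issue.
crash_line = "[2020-05-18 08:35:49.086 23890] ----------------------------> FTL crashed! <----------------------------"
start_line = "[2020-05-18 11:29:44.138 563] ########## FTL started! ##########"

print(crashed_since_last_start([crash_line]))              # crash, no restart yet
print(crashed_since_last_start([crash_line, start_line]))  # already restarted
```

A check like this would not help if FTL dies without writing the crash banner, so it complements rather than replaces checking that the process is still alive.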
I've been running gdb on pihole for the past 4 days, and it hasn't crashed. Normally it would crash at least once every 2 days. Not sure if gdb is stopping it from hitting an exception or if it's just (un)luckily been running fine for a few days longer than usual. |
Make the debugger a requirement 😝 I am now also using PiBar and it tells me that Pi-Hole is disconnected once the DNS stops, so that is an easy giveaway to know it has stopped. Debugger is running, going to keep an eye on it. |
The debugger cannot be the solution. It only attaches from the side and monitors what is going on; it does not influence the process it is attached to in any way. It is not even possible to determine from within the application whether a debugger is currently attached. |
This issue has been mentioned on Pi-hole Userspace. There might be relevant details there: |
Any update for me? |
I have not had any crash for about a week... why, I don't know. Before then it was once every 2 days, or more. On the two occasions where I checked the log files, the same domain was logged just before each crash. Could it have been some kind of issue with that domain on the block lists (which has now been resolved)? Does the gdb debugging give the same output as the addr2line branch? I might try switching to that to see if I can replicate. |
Similarly, the crashes in the other open issue tickets seem to have disappeared rather quickly. At this point we can only speculate. The only common thing I've seen so far is that many/(all?) use Cloudflare as upstream DNS. There is a chance that they did something violating the DNS specifications and, hence,
Looking at your backtrace, I don't think the blocklists are involved here, but we don't know for sure.
|
No update from me either. Ever since I turned on debugging, it hasn't stopped, whereas before that it stopped a few times a day. It is still running in screen, and I'm just letting it run since I have no clue when it may happen again. |
I'll keep the debugger running for another week. If it still doesn't crash, I'll consider it 'fixed'. |
@roland-d You're too fast, builds are still generating :) |
We have another crash
|
So I have another crash. The log output for the few minutes before the crash is quite a bit. How far back do you want me to go? It is about 2000 lines and about 170k. Here is the crash part of it:
|
@skinnayt I will probably need a lot, maybe even from the beginning of your last update/checkout. Can you maybe compress the file and send it via email to Could you (also the others in here like @roland-d), maybe run the following lines?
This will launch |
@DL6ER It crashed again (never too long a wait ;))
Here also the stderr.log, the stdout.log is empty |
@roland-d Sorry, the issue was likely that you started a new |
@DL6ER No worries. I killed FTL, restarted it and then ran the setcap and valgrind. It is running again now, let's see what it gives us :) |
Thanks a lot for your continued assistance! If it still does not work right (like
where |
@DL6ER Haven't had a crash yet but wanted to let you know that even with the new command the |
Submitted my log files for the previous crash I had. I am now running under valgrind and I am getting output in both the stdout.log and stderr.log files. |
I now have a crash while running under valgrind
This is the size of the uncompressed files. The stdout.log file is truncated so I don't know if I have the right information to make this useful. The stderr.log file looks to be okay.
|
@DL6ER Looks like I missed the dash in the email domain. Resent the first set now. |
@skinnayt Thanks, I received your two mails and checked the code. A certain picture is building up right now: Office is querying one and the same office.com subdomain multiple times at the same time, over IPv4 and IPv6 and over UDP and TCP. There is likely a race condition where both entries are added to the same cache entry at the same time and funny cache issues arise. Do you have a valgrind + log combination that comes from a crash? @roland-d Given what I wrote above, I don't think you'll ever see FTL crash because it does not spawn TCP children in debug mode. While this comes at only a rather small impact in performance, it unfortunately breaks compatibility with Netflix and a few other services, as users have reported in the past. |
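The kind of collision described here can be illustrated in general terms. The sketch below is not FTL's code (FTL is written in C and forks processes that share memory). It uses Python threads and a hypothetical `DnsCache` class to show the pattern where several handlers look up and create the same cache entry at once, and how a lock makes the check-then-insert step atomic.

```python
import threading

class DnsCache:
    """Hypothetical shared cache keyed by (domain, query type)."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()
        self.inserts = 0  # how many times a new entry was created

    def get_or_add(self, domain, qtype):
        key = (domain, qtype)
        # Without the lock, two handlers could both see "no entry yet"
        # and both create one for the same key.
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = {"domain": domain, "type": qtype, "hits": 0}
                self._entries[key] = entry
                self.inserts += 1
            entry["hits"] += 1
            return entry

cache = DnsCache()
threads = [
    threading.Thread(target=cache.get_or_add, args=("hel-efz.office.com", "AAAA"))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("entries created:", cache.inserts)
```

With the lock in place, the eight concurrent lookups create exactly one entry. Whether the actual fix in FTL takes this shape is not stated in this thread.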
@DL6ER I sent another email with the pihole logs for the period that the valgrind session captured. |
@skinnayt Thanks, I'm still evaluating the output, unfortunately (or well, actually this is a good thing!), I haven't found any invalid memory access reports in there. @skinnayt @roland-d
a try? We changed quite a bit since v5.0. I'm not convinced that it will resolve the issue for you; however, we will only know for sure once you have tried it. I have been very busy with traveling, which severely affected my debugging abilities over the past two weeks. However, I have reserved some time next week to try again to reproduce the bug locally with all the tools attached. And if that means I have to go and buy a (hopefully not too expensive) laptop with Office on it, then I'll likely even do that. I feel responsible for the code I wrote and promise to do my best to finally resolve this issue for everyone! |
@DL6ER Thank you for your continued support, I definitely appreciate it. I will check out the development branch and keep you updated on what is going on. Safe travels. |
@DL6ER I checked out that development branch. For some reason, it doesn't seem to stay running under valgrind. It is late here so I will try again in the morning. Seems to stay running under systemd though. I did do the |
Gave this another try under valgrind and it seems to stay running now. Not sure what I did wrong or different before |
@skinnayt As long as it runs uninterrupted in "normal" operation, everything will be fine. So far, the |
@DL6ER I wanted to say that since I switched to the development branch 5 days ago, I haven't had any more crashes. |
For anyone still being affected by this bug (last issue report seems to be > 5 days ago?): We have a potential fix for a rather severe Please run
and see if the crashes are resolved. As always, I very much appreciate testing as it is the only way to be sure we really got it fixed! |
Had a crash on the
|
I'd file a new issue. This looks like a db code bug, which is unrelated to what I fixed. |
Since updating to v5, there are a few times when it seems all my clients lose DHCP... they end up with a random IP address (within the DHCP range), and the DNS server pointing to some other random IP address within the range (i.e. not the Pihole).
I check the Pihole web interface and it says DNS and FTL are not running. A reboot of the pihole fixes it. It happened today again, and I checked the FTL logs and found this:
[2020-05-18 07:21:45.125 13079] Note: FTL forked to handle TCP requests
[2020-05-18 07:22:21.587 23890] Resizing "/FTL-dns-cache" from 335872 to 339968
[2020-05-18 07:25:48.744 13111] Note: FTL forked to handle TCP requests
[2020-05-18 07:59:00.307 23890] Notice: Database size is 187.80 MB, deleted 899 rows
[2020-05-18 08:15:48.690 13806] Note: FTL forked to handle TCP requests
[2020-05-18 08:35:49.086 23890] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-05-18 08:35:49.086 23890] ----------------------------> FTL crashed! <----------------------------
[2020-05-18 08:35:49.087 23890] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-05-18 08:35:49.087 23890] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-05-18 08:35:49.087 23890] and include in your report already the following details:
[2020-05-18 08:35:49.087 23890] FTL has been running for 249138 seconds
[2020-05-18 08:35:49.087 23890] FTL branch: master
[2020-05-18 08:35:49.088 23890] FTL version: v5.0
[2020-05-18 08:35:49.088 23890] FTL commit: 3d7c095
[2020-05-18 08:35:49.088 23890] FTL date: 2020-05-10 18:58:38 +0100
[2020-05-18 08:35:49.088 23890] FTL user: started as pihole, ended as pihole
[2020-05-18 08:35:49.089 23890] Compiled for arm (compiled on CI) using arm-linux-gnueabihf-gcc (crosstool-NG crosstool-ng-1.22.0-88-g8460611) 4.9.3
[2020-05-18 08:35:49.089 23890] Received signal: Segmentation fault
[2020-05-18 08:35:49.089 23890] at address: 0x7a66652d
[2020-05-18 08:35:49.089 23890] with code: SEGV_MAPERR (Address not mapped to object)
[2020-05-18 08:35:49.094 23890] Backtrace:
[2020-05-18 08:35:49.096 23890] B[0000]: 0x4e0704, /usr/bin/pihole-FTL(+0x2c704) [0x4e0704]
[2020-05-18 08:35:49.096 23890] B[0001]: 0xb6db0130, /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0xb6db0130]
[2020-05-18 08:35:49.096 23890] B[0002]: 0x516b2c, /usr/bin/pihole-FTL(+0x62b2c) [0x516b2c]
[2020-05-18 08:35:49.096 23890] B[0003]: 0x518cfc, /usr/bin/pihole-FTL(main_dnsmasq+0x1238) [0x518cfc]
[2020-05-18 08:35:49.097 23890] B[0004]: 0x4d2d04, /usr/bin/pihole-FTL(main+0xfc) [0x4d2d04]
[2020-05-18 08:35:49.097 23890] B[0005]: 0xb6d9a718, /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x10c) [0xb6d9a718]
[2020-05-18 08:35:49.097 23890] ------ Listing content of directory /dev/shm ------
[2020-05-18 08:35:49.097 23890] File Mode User:Group Filesize Filename
[2020-05-18 08:35:49.099 23890] rwxrwxrwx root:root 260 .
[2020-05-18 08:35:49.099 23890] rwxr-xr-x root:root 4K ..
[2020-05-18 08:35:49.100 23890] rw------- pihole:pihole 4K FTL-per-client-regex
[2020-05-18 08:35:49.101 23890] rw------- pihole:pihole 340K FTL-dns-cache
[2020-05-18 08:35:49.101 23890] rw------- pihole:pihole 29K FTL-overTime
[2020-05-18 08:35:49.102 23890] rw------- pihole:pihole 6M FTL-queries
[2020-05-18 08:35:49.102 23890] rw------- pihole:pihole 20K FTL-upstreams
[2020-05-18 08:35:49.103 23890] rw------- pihole:pihole 643K FTL-clients
[2020-05-18 08:35:49.104 23890] rw------- pihole:pihole 262K FTL-domains
[2020-05-18 08:35:49.104 23890] rw------- pihole:pihole 340K FTL-strings
[2020-05-18 08:35:49.105 23890] rw------- pihole:pihole 12 FTL-settings
[2020-05-18 08:35:49.105 23890] rw------- pihole:pihole 124 FTL-counters
[2020-05-18 08:35:49.106 23890] rw------- pihole:pihole 28 FTL-lock
[2020-05-18 08:35:49.106 23890] ---------------------------------------------------
[2020-05-18 08:35:49.107 23890] Thank you for helping us to improve our FTL engine!
[2020-05-18 08:35:49.107 23890] FTL terminated!
[2020-05-18 11:29:44.126 563] Using log file /var/log/pihole-FTL.log
[2020-05-18 11:29:44.138 563] ########## FTL started! ##########
(included a few lines before crash, and after reboot). This happened twice a few days ago (in the one day) and has happened maybe 5 times in total.
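The backtrace lines in the log above can be turned into file-relative offsets for a symbolizer such as addr2line. The following is a minimal sketch, assuming that frames printed as `(+0x...)` without a symbol name carry an offset relative to the object's load address, which is how glibc's `backtrace_symbols` normally formats them. The function names are hypothetical.

```python
import re

FRAME_RE = re.compile(
    r"B\[(?P<idx>\d+)\]: (?P<addr>0x[0-9a-f]+), (?P<obj>[^\s(]+)"
    r"\((?P<sym>\w*)\+(?P<off>0x[0-9a-f]+|\d+)\)"
)

def parse_frames(lines):
    """Extract index, address, object, symbol and offset from backtrace lines."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "idx": int(m["idx"]),
                "addr": int(m["addr"], 16),
                "obj": m["obj"],
                "sym": m["sym"] or None,   # None when no symbol name is printed
                "off": int(m["off"], 0),   # accepts both "0x1238" and "0"
            })
    return frames

def load_base(frames, obj):
    """Infer the load address of obj from its symbol-less frames (assumption above)."""
    bases = {f["addr"] - f["off"] for f in frames if f["obj"] == obj and f["sym"] is None}
    return bases.pop() if len(bases) == 1 else None

# Backtrace lines copied from the crash report above.
log = [
    "[2020-05-18 08:35:49.096 23890] B[0000]: 0x4e0704, /usr/bin/pihole-FTL(+0x2c704) [0x4e0704]",
    "[2020-05-18 08:35:49.096 23890] B[0001]: 0xb6db0130, /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0xb6db0130]",
    "[2020-05-18 08:35:49.096 23890] B[0002]: 0x516b2c, /usr/bin/pihole-FTL(+0x62b2c) [0x516b2c]",
    "[2020-05-18 08:35:49.096 23890] B[0003]: 0x518cfc, /usr/bin/pihole-FTL(main_dnsmasq+0x1238) [0x518cfc]",
    "[2020-05-18 08:35:49.097 23890] B[0004]: 0x4d2d04, /usr/bin/pihole-FTL(main+0xfc) [0x4d2d04]",
    "[2020-05-18 08:35:49.097 23890] B[0005]: 0xb6d9a718, /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x10c) [0xb6d9a718]",
]

frames = parse_frames(log)
base = load_base(frames, "/usr/bin/pihole-FTL")
if base is not None:
    for f in frames:
        if f["obj"] == "/usr/bin/pihole-FTL":
            print(f["idx"], f["sym"] or "?", hex(f["addr"] - base))
```

Both symbol-less frames agree on the same load address here, which is a useful sanity check before feeding the resulting offsets to a symbolizer together with the exact binary for commit 3d7c095.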
I hope this is of some help.
I am running Raspbian on a Pi Zero W. I am not familiar with the code base. I am familiar with C# development, but have never debugged on Linux.
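One detail of the report can be checked with simple arithmetic: the faulting address `0x7a66652d` consists entirely of printable ASCII bytes. The sketch below decodes it, assuming a 32-bit little-endian layout (which is what Raspbian on a Pi Zero W normally uses). This is only an observation on the numbers in the log, not a confirmed diagnosis.

```python
import struct

fault_addr = 0x7A66652D               # "at address" from the crash report
raw = struct.pack("<I", fault_addr)   # 32-bit little-endian byte order (assumed)
text = raw.decode("ascii")

domain = "hel-efz.office.com"         # domain queried just before the crash
print(repr(text))
print(text in domain)
```

The decoded bytes are a fragment of the domain that was queried immediately before the crash. That would be consistent with a pointer having been overwritten by string data, although it does not prove it.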