
Incompatibility with newer kernels #41

Closed · deathtrip opened this issue Jun 20, 2020 · 16 comments

Comments

@deathtrip commented Jun 20, 2020

Newer kernel versions have broken opensnitch for me.

I use Arch Linux with the hardened kernel, and everything was fine until version 5.6.16, IIRC. Since that version, every attempt by any program to access the network causes the entire system to freeze, even things like ping, or just starting Chromium. The vanilla 5.6 kernel still worked fine, though.

When I updated the vanilla kernel to 5.7.2, all network requests were blocked (ping, DNS, etc.).
Disabling the opensnitchd service stopped the crashes on the hardened kernel and restored network access on vanilla.
Since 5.7.4, even stopping the opensnitchd service won't restore network access, and I have to reboot to get it back.

I wonder if we have some people on new kernels who can check this out.

@gustavo-iniguez-goya (Owner)

Can you tell us more about your system? I need at least the following logs and information:

First of all, set the log level to DEBUG.

  • /var/log/opensnitchd.log*
  • opensnitch settings (I'm interested in the process monitor method you're using: /proc, ftrace or audit)
  • hardened kernel information (grsecurity, pax, LSM, ...) https://wiki.archlinux.org/index.php/Security#Kernel_hardening ?
  • restrictions applied via sysctl (/etc/sysctl*)
  • libnetfilter_queue and libnfnetlink versions
  • journalctl -ar > journalctl.txt
  • /var/log/syslog
  • /var/log/messages
  • the kernel panic/oops if you can gather it (maybe in dmesg)
  • did you notice any pattern that leads to the crash?

Please provide all this information, or as much of it as you can gather. You can email it to me if you'd rather not post it here. Let's see if we can find the problem, or a way to reproduce it.

@DragoonAethis

I'm running OpenSnitch on Arch w/ 5.7.4-1-ck-skylake kernel and it's working fine (nothing suspicious in logs, process monitor is /proc).

@deathtrip (Author)

By hardened kernel I mean the linux-hardened package from the repo.
libnfnetlink 1.0.1
libnetfilter_queue 1.0.5
Process monitor method is /proc.
I tried disabling the sysctl settings, but they had no effect; I also couldn't find anything suspicious with journalctl.

After further research, it looks like DNS requests are causing the crashes/hangups.
I run unbound as my local resolver, and I noticed that there were no more requests from the user "unbound" in the UI.
Ping reported "Temporary failure in name resolution".
Disabling unbound didn't solve the problem.
When I booted with opensnitchd disabled and enabled it afterwards, the already established connections persisted.

I got it working again by removing the following kernel command-line options:
slab_nomerge slub_debug=FZP vsyscall=none module.sig_enforce=1 page_alloc.shuffle=1

With these options set and opensnitchd running, any DNS request will cause the mentioned symptoms.
Here's what opensnitchd.log shows when it couldn't resolve anything.
First it was this:
[2020-06-21 00:04:46] IMP Starting opensnitch-daemon v1.0.0rc10
[2020-06-21 00:04:46] INF Loading rules from /etc/opensnitchd/rules ...
[2020-06-21 00:04:46] WAR Is opnensitchd already running?
[2020-06-21 00:04:46] !!! Error while creating queue #0: Error opening Queue handle: protocol not supported
and then:
[2020-06-21 07:20:59] IMP Starting opensnitch-daemon v1.0.0rc10
[2020-06-21 07:20:59] INF Loading rules from /etc/opensnitchd/rules ...
[2020-06-21 07:20:59] WAR Is opnensitchd already running?
[2020-06-21 07:20:59] !!! Error while creating queue #0: Error binding to queue: operation not permitted

So it's one (or more) of the kernel command-line options. I haven't had time yet to find out which one it is.
The linux-hardened package also enforces PTI, so that may have something to do with the system freezing under it, as I don't use PTI under the regular kernel. Let's see if anyone can reproduce it now.
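
[Editor's note: for context on where the two logged errors come from, here is a minimal, illustrative C sketch of the libnetfilter_queue setup sequence that a daemon like opensnitch performs (the real daemon drives this API from Go, so this is not its actual code). The mapping of each error string to a specific call is an assumption: "Error opening Queue handle" plausibly corresponds to nfq_open() failing, and "Error binding to queue" to nfq_create_queue() on a queue ID that is still bound.]

```c
/* Minimal NFQUEUE setup sketch (illustrative, not opensnitch's code).
 * Build: gcc nfq_sketch.c -lnetfilter_queue
 * Packets only arrive here if an iptables rule directs them, e.g.
 * one targeting NFQUEUE --queue-num 0. */
#include <stdio.h>
#include <stdint.h>
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/netfilter.h>            /* NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Accept every packet; a real firewall would inspect it first. */
static int on_packet(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                     struct nfq_data *nfa, void *data)
{
    uint32_t id = 0;
    struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
    if (ph)
        id = ntohl(ph->packet_id);
    return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}

int main(void)
{
    struct nfq_handle *h = nfq_open();
    if (!h) {
        /* Plausibly where "protocol not supported" would surface. */
        fprintf(stderr, "Error opening Queue handle: %s\n", strerror(errno));
        return 1;
    }

    /* Queue #0: fails with EPERM if queue 0 is still bound elsewhere,
     * e.g. left over from a daemon that never closed it. */
    struct nfq_q_handle *qh = nfq_create_queue(h, 0, &on_packet, NULL);
    if (!qh) {
        fprintf(stderr, "Error binding to queue: %s\n", strerror(errno));
        nfq_close(h);
        return 1;
    }
    nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);

    char buf[4096];
    int fd = nfq_fd(h), rv;
    while ((rv = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, rv);

    nfq_destroy_queue(qh);
    nfq_close(h);
    return 0;
}
```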

@gustavo-iniguez-goya (Owner)

Thank you very much for the information @deathtrip !

You're using libnetfilter_queue 1.0.5, and there have been a lot of changes from 1.0.3 to 1.0.5. I'm wondering if the problem also reproduces with libnetfilter_queue 1.0.3.

A few days ago someone on the original repo also reported a kernel panic with kernel 5.6.16, with this backtrace:

? nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]
 nfqnl_reinject+0x4a/0x70 [nfnetlink_queue]
 nfqnl_recv_verdict+0x30d/0x500 [nfnetlink_queue]
 nfnetlink_rcv_msg+0x166/0x2e0 [nfnetlink]
 ? nfnetlink_net_exit_batch+0x60/0x60 [nfnetlink]
 netlink_rcv_skb+0x75/0x140
 netlink_unicast+0x242/0x340
 netlink_sendmsg+0x243/0x480
 sock_sendmsg+0x5e/0x60
 ____sys_sendmsg+0x253/0x290
 ___sys_sendmsg+0x97/0xe0
 ? __lru_cache_add+0x75/0xa0
 __sys_sendmsg+0x81/0xd0
 do_syscall_64+0x49/0x90

If you could find the backtrace of your kernel panic, we could compare the two.

Also, if you have time to identify the problematic kernel command-line option, don't hesitate to update the issue. That's very valuable information.

Either way, I'm afraid I'm not much help here. In my opinion this could be a bug in libnetfilter_queue or in newer kernels (>= 5.6.16), and we're probably triggering it somehow.

@deathtrip (Author)

On Arch Linux, libnetfilter_queue was updated to 1.0.5 just a few days ago, while kernel 5.6.16 and newer have been around for a few weeks. So the problem started under libnetfilter_queue 1.0.3, but only after upgrading the kernel. It seems to be a kernel problem, then.

@deathtrip (Author)

The problematic kernel command-line option seems to be slub_debug=FZP.
When I booted with all my options except slab_nomerge, slub_debug=FZP and page_alloc.shuffle=1, everything worked fine.
When I added page_alloc.shuffle=1, I got problems after 5-6 hours.
Then I also added slab_nomerge, and got the DNS problems after approx. 1 hour.
Booting with all three results in no DNS access for me.

This time the UI also started freezing, so I restarted it from the terminal, but got no errors when it started freezing again.
When I get these DNS problems I can't restart the daemon, as it fails with: systemd[1]: opensnitchd.service: Main process exited, code=exited, status=1/FAILURE

@gustavo-iniguez-goya (Owner)

Thank you @deathtrip ! I'll try booting with those options and debug the DNS and daemon problems.

@gustavo-iniguez-goya (Owner)

On Debian with kernel 5.7.0, the cmdline parameters "kaslr pti=on slab_nomerge page_poison=1 slub_debug=FPZ nosmt" cause opensnitch to fail with the following error:
Error while creating queue #0: Error binding to queue: operation not permitted

Removing slub_debug=FPZ from the options solves the problem. I'm still trying to figure out how to make it work again with that parameter.

@deathtrip (Author)

Update to the current point release to see if it fixes anything. You should also check whether just one or two of the F, Z, P flags are responsible.
I wonder if you can reproduce the system freeze I had on Arch's hardened kernel, because it seems something else could be the problem there.

@gustavo-iniguez-goya (Owner)

Not the freeze, but a BUG, like the one reported with kernel 5.6.16. There are some bug reports related to this parameter (the ath5k kmod, ext4, IBM mmfs).

I started analyzing it with valgrind, and there seem to be some memory leaks. In any case, this parameter is also preventing opensnitch from running with the Operation not permitted error, so I'll investigate that first.

@gustavo-iniguez-goya (Owner)

The error Error binding to queue: operation not permitted seems to be caused by a queue that wasn't closed, or was leaked. If you launch the daemon with -queue-num 2, it'll run as expected.

This only occurs when stopping the daemon with service opensnitch stop. If you stop it by sending it a HUP signal, or by hitting CTRL+C, it doesn't occur. Investigating...

@gustavo-iniguez-goya (Owner)

OK, so we're not closing the queue on exit, and for some reason this problem has arisen with kernels >= 5.7.x. I'll fix it soon.
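
[Editor's note: the general shape of that missing cleanup, sketched against the same C API as above. This is illustrative only; the actual fix lives in opensnitch's Go netfilter wrapper (see the commit referenced at the end of the thread), and the signal names are an assumption based on the HUP/CTRL+C vs. service-stop behavior described above.]

```c
/* Sketch of the missing shutdown path: reuses the handles (h, qh) from
 * the earlier setup sketch. The point: release the queue binding before
 * exiting, or the next start finds queue #0 taken and fails with EPERM. */
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

static volatile sig_atomic_t running = 1;

static void stop_handler(int sig)
{
    (void)sig;
    running = 0;   /* recv() below returns with EINTR, ending the loop */
}

void serve(struct nfq_handle *h, struct nfq_q_handle *qh)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = stop_handler;    /* no SA_RESTART: let recv() abort */
    sigaction(SIGTERM, &sa, NULL);   /* what systemd typically sends on stop */
    sigaction(SIGINT,  &sa, NULL);   /* CTRL+C */

    char buf[4096];
    int fd = nfq_fd(h);
    int rv;
    while (running && (rv = recv(fd, buf, sizeof(buf), 0)) >= 0)
        nfq_handle_packet(h, buf, rv);

    /* The crucial, previously missing step: unbind the queue and close
     * the netlink handle so a restart can bind the same queue ID again. */
    nfq_destroy_queue(qh);
    nfq_close(h);
}
```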

@gustavo-iniguez-goya (Owner)

Still investigating this problem. Fortunately for me, the bug isn't freezing the PC: the daemon stops processing packets and a trace is dumped to dmesg.

@gustavo-iniguez-goya (Owner)

I've filed a bug on the netfilter bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1440

I think this is a problem in their library when allowing packets (ICMP in particular, I think).

@gustavo-iniguez-goya (Owner)

Pablo Neira posted a patch for this problem, and as far as I can tell it fixes the bug: https://bugzilla.netfilter.org/show_bug.cgi?id=1440#c1

evilsocket#297 safing/portmaster#82

@gustavo-iniguez-goya (Owner)

Some users have already confirmed that it's fixed by updating the Arch Linux kernel. Thank you for reporting it!

gustavo-iniguez-goya added a commit that referenced this issue Jul 16, 2020
When the daemon is stopped, we need to close opened netfilter resources.
Otherwise we can fall into a situation where we leave NFQUEUE queues
open, which causes opensnitch to not run anymore until a system restart
or manual intervention, because there's already an NFQUEUE queue created
with the same ID.

This is what was happening as a collateral effect of #41.