Hi, I have some questions about debugging random reboots caused by a kernel panic.
I get random reboots when running software that inserts PREROUTING iptables rules to redirect traffic. My device is an x86_64 router. I compiled my OpenWrt release myself from a forked OpenWrt source at https://github.com/coolsnowwolf/lede . Its kernel version is 4.19.108.
The software causing this problem is called OpenClash. It acts as a transparent proxy: it inserts PREROUTING rules that redirect all TCP traffic from computers on the LAN to its own listening port, then forwards that traffic through a proxy.
Whenever this software is running, I get random reboots about 1-2 times per day. The saved log files showed nothing abnormal, because the crash happened in the kernel and the router rebooted immediately. So I had to compile the netconsole kernel module to capture dmesg output at the moment of the crash. You can see the crash logs in crash_dmesg.txt.
The crash happens in the nf_xfrm_me_harder function. Disassembling the code around the crash gives crash_code.png (the highlighted line is the faulting instruction). The crash log reports an illegal memory access at 000000000000113c, and the disassembly shows that the kernel was accessing [rax+0x113c], so I think the problem is rax==0, which should not happen.
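To illustrate the reasoning above, here is a minimal userspace sketch of why a NULL base pointer produces a fault address equal to the field offset. `struct fake_net` and `field_address` are hypothetical names, and the padding is chosen only so that the field lands at 0x113c; it is not the real layout of `struct net`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel structure: only the offset matters. */
struct fake_net {
    char pad[0x113c];          /* everything that precedes the field */
    unsigned int policy_count; /* the field the faulting load reads */
};

/* Address the CPU dereferences for base->policy_count, i.e. [rax+0x113c]. */
static uintptr_t field_address(uintptr_t base)
{
    return base + offsetof(struct fake_net, policy_count);
}
```

With a base of 0, `field_address` returns 0x113c, which matches the address in the crash log. Any non-zero base would give a different, larger address.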
After the kernel patches from this tree are applied, the function nf_xfrm_me_harder looks like this (excerpt):
int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
    struct flowi fl;
    unsigned int hh_len;
    struct dst_entry *dst;
    struct sock *sk = skb->sk;
    if (skb->dev && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT]) /* <-- crash here */
    err = xfrm_decode_session(skb, &fl, family);
    if (err < 0)
This means that dev_net(skb->dev) is sometimes `NULL` even though skb->dev itself is not.
I'm not familiar with the networking internals of the Linux kernel, so I'm not sure how to find out why it is NULL. Is this something we can safely work around by checking the pointer first, like this?
if (skb->dev && dev_net(skb->dev) && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT])
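For discussion, here is a minimal userspace sketch of that guarded check. All `fake_*` types, `FAKE_POLICY_OUT`, `null_net_hits` and `can_skip_xfrm_lookup` are hypothetical stand-ins for the kernel structures, not kernel code. It assumes the patched branch is a fast path that skips the xfrm lookup when no outbound policy exists, and it counts how often the unexpected NULL is seen so the condition is recorded rather than silently hidden.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types involved in the check. */
enum { FAKE_POLICY_OUT = 1, FAKE_POLICY_MAX = 3 };

struct fake_net { unsigned int policy_count[FAKE_POLICY_MAX * 2]; };
struct fake_dev { struct fake_net *net; };
struct fake_skb { struct fake_dev *dev; };

/* Counts how often a device without a namespace pointer was seen. */
static unsigned long null_net_hits;

/* Returns 1 only when it is known that no outbound policy exists. */
static int can_skip_xfrm_lookup(const struct fake_skb *skb)
{
    const struct fake_net *net;

    if (!skb->dev)
        return 0;

    net = skb->dev->net;   /* read once, then test the same value */
    if (!net) {
        null_net_hits++;   /* unexpected state: record it, take the slow path */
        return 0;
    }
    return net->policy_count[FAKE_POLICY_OUT] == 0;
}
```

Reading the pointer once into a local avoids evaluating the lookup twice. Note that the guard only avoids the dereference: when the pointer is NULL the code falls through to the normal lookup path, so whatever left the device in that state is still unexplained.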
If not, can anyone give me some advice on how to debug this problem? I understand it may be difficult for developers to figure out what is happening from my description alone, especially since I'm not using trunk OpenWrt, so I would be happy to dig into it myself.
You can find more information about my router in dmesg.txt.
Thanks a lot.