Crash with latest v4.14.73 in netif_skb_features #285

Closed
matttbe opened this issue Oct 3, 2018 · 15 comments

@matttbe
Member

commented Oct 3, 2018

Hi @cpaasch

I saw that you recently updated the v0.94 branch, but it causes crashes on my side:

net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 87380 4194304
[    7.329803] HTB: quantum of class 10012 is big. Consider r2q change.
[    7.346317] HTB: quantum of class 10012 is big. Consider r2q change.
[    7.688074] htb: netem qdisc 8002: is non-work-conserving?
[    7.698082] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[    7.698696] IP: netif_skb_features+0x1f/0x230
[    7.699031] PGD 800000000bac9067 P4D 800000000bac9067 PUD 1914e067 PMD 0 
[    7.699581] Oops: 0000 [#1] SMP PTI
[    7.699842] Modules linked in:
[    7.700088] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.73-mptcp+ #4
[    7.700625] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
[    7.701107] task: ffffffff81c104c0 task.stack: ffffffff81c00000
[    7.701587] RIP: 0010:netif_skb_features+0x1f/0x230
[    7.701947] RSP: 0018:ffff88001fc03e68 EFLAGS: 00010286
[    7.702367] RAX: ffff88000c4816c0 RBX: ffff88000c40ca00 RCX: ffff88000c8c2c00
[    7.702922] RDX: ffff88000c481000 RSI: 0000000000000000 RDI: ffff88000c40ca00
[    7.703506] RBP: ffff88000c8c2c00 R08: ffff88000b14509c R09: 0000000000000001
[    7.704048] R10: 00000000de38e38e R11: 0000000000000003 R12: ffff88000ae36000
[    7.704600] R13: ffff88000ae36000 R14: ffff88000b14509c R15: ffff88000b145000
[    7.705123] FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[    7.705720] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.706141] CR2: 00000000000000d0 CR3: 000000000c9f6000 CR4: 00000000000006b0
[    7.706667] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.707190] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    7.707715] Call Trace:
[    7.707901]  <IRQ>
[    7.708057]  validate_xmit_skb+0x13/0x260
[    7.708387]  validate_xmit_skb_list+0x39/0x60
[    7.708713]  sch_direct_xmit+0xb0/0x170
[    7.708997]  __qdisc_run+0x11c/0x270
[    7.709291]  net_tx_action+0xd6/0xf0
[    7.709563]  __do_softirq+0xc3/0x1c8
[    7.709830]  irq_exit+0x65/0x70
[    7.710065]  smp_apic_timer_interrupt+0x5d/0x90
[    7.710399]  apic_timer_interrupt+0x7d/0x90
[    7.710713]  </IRQ>
[    7.710873] RIP: 0010:native_safe_halt+0x2/0x10
[    7.711207] RSP: 0018:ffffffff81c03ec8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[    7.711762] RAX: ffffffff814d4330 RBX: ffffffff81c104c0 RCX: 0000000000000000
[    7.712311] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    7.712835] RBP: ffffffff81c104c0 R08: 000000008705669d R09: ffff88001fc1dcd0
[    7.713355] R10: 0000000000000002 R11: 0000000000000001 R12: ffffffff81c104c0
[    7.713878] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    7.714398]  ? __sched_text_end+0x2/0x2
[    7.714687]  default_idle+0x5/0x10
[    7.714941]  do_idle+0x14f/0x180
[    7.715182]  cpu_startup_entry+0x14/0x20
[    7.715474]  start_kernel+0x4f7/0x502
[    7.715748]  ? set_init_arg+0x50/0x50
[    7.716021]  secondary_startup_64+0xa5/0xb0
[    7.716334] Code: ff ff 48 98 e9 64 ff ff ff 0f 1f 00 41 54 55 53 48 89 fb 48 83 ec 08 8b 87 e0 00 00 00 48 8b 97 e8 00 00 00 48 8b 77 10 48 01 d0 <48> 8b ae d0 00 00 00 66 83 78 04 00 74 61 0f b7 78 06 48 8b 8e 
[    7.717714] RIP: netif_skb_features+0x1f/0x230 RSP: ffff88001fc03e68
[    7.718178] CR2: 00000000000000d0
[    7.718424] ---[ end trace 9ee4da1efb289b2f ]---
[    7.718769] Kernel panic - not syncing: Fatal exception in interrupt
[    7.719288] Kernel Offset: disabled

I haven't started investigating yet; I mainly want to know whether you have also seen this kind of crash.

@cpaasch

Member

commented Oct 10, 2018

rb-tree processing has been backported to v4.14.73 :-/ It seems we will need to support that here as well, which means backporting a bunch of fixes that we already made in mptcp_trunk.

@cpaasch

Member

commented Oct 11, 2018

Well, actually, the TCP part of the rb-tree work (the sender-side retransmit queue) has not been backported, so that seems to be OK.

I think we need to dig a bit deeper into the issue here. Can you reproduce it? I don't see it panicking.

@webratz


commented Oct 16, 2018

We also seem to hit this from time to time, but sadly I can't reproduce it either.
Kernel: 4.14.72-68.55.amzn1.x86_64

@matttbe

Member Author

commented Oct 16, 2018

I talked to @cpaasch last week but forgot to share this info here. I was able to reproduce it with ApacheBench (ab), fetching files of 1 KB and 50 KB, e.g.

ab -v 1 -c 5 -n 500 http://<IP>/1KB
ab -c 20 -n 1000 -s 600 http://<IP>/1KB
ab -v 1 -c 5 -n 500 http://<IP>/50KB
etc.

@cpaasch

Member

commented Oct 16, 2018

I am hitting this issue as well now. But I'm not yet sure what introduced it. We should try bisecting things.

@matttbe

Member Author

commented Oct 17, 2018

Could it be linked to 6b92153 or 37c7cc8?

@cpaasch

Member

commented Oct 17, 2018

Very likely! :)

@cpaasch

Member

commented Oct 17, 2018

I can reliably reproduce the panic on a non-MPTCP upstream v4.14.71 and v4.14.73 kernel.

@cpaasch

Member

commented Oct 18, 2018

I am able to fix one of these with:

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2a2ab6bfe5d8..3d325b840802 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -624,6 +624,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 			skb->next = NULL;
 			skb->prev = NULL;
 			skb->tstamp = netem_skb_cb(skb)->tstamp_save;
+			/* skb->dev shares skb->rbnode area,
+			 * we need to restore its value.
+			 */
+			skb->dev = qdisc_dev(sch);
 
 #ifdef CONFIG_NET_CLS_ACT
 			/*

But this code path is only exercised when a netem rule is configured. Are you all doing that? :-)

Otherwise, there is yet another bug.
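
For anyone following along, here is a minimal user-space sketch of why the restore is needed. All `sketch_*` names are hypothetical stand-ins, not the kernel's definitions; the layout only mirrors the idea from the patch comment that `dev` sits in the same storage as `rbnode`, so linking the packet into a tree overwrites `dev`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel structures. */
struct sketch_dev { const char *name; };

struct sketch_rb_node {
	unsigned long parent_color;
	struct sketch_rb_node *right;
	struct sketch_rb_node *left;
};

struct sketch_skb {
	union {
		struct {
			struct sketch_skb *next;
			struct sketch_skb *prev;
			struct sketch_dev *dev;
		};
		struct sketch_rb_node rbnode;	/* shares storage with next/prev/dev */
	};
};

/* In this sketch, dev overlaps the last pointer of the tree node. */
static_assert(offsetof(struct sketch_skb, dev) ==
	      offsetof(struct sketch_skb, rbnode) +
	      offsetof(struct sketch_rb_node, left),
	      "dev and rbnode.left share the same bytes");

/* Enqueue into a tree: initialising the node clobbers next/prev/dev. */
static void sketch_enqueue(struct sketch_skb *skb)
{
	skb->rbnode.parent_color = 0;
	skb->rbnode.right = NULL;
	skb->rbnode.left = NULL;	/* this also wipes skb->dev */
}

/* Dequeue: reset the list pointers and, optionally, restore dev. */
static void sketch_dequeue(struct sketch_skb *skb,
			   struct sketch_dev *owner, int restore_dev)
{
	skb->next = NULL;
	skb->prev = NULL;
	if (restore_dev)
		skb->dev = owner;	/* the step the patch adds */
}

/* Returns 1 if skb.dev is still valid after an enqueue/dequeue round trip. */
int sketch_dev_survives(int restore_dev)
{
	struct sketch_dev dev = { "eth0" };
	struct sketch_skb skb;

	skb.next = NULL;
	skb.prev = NULL;
	skb.dev = &dev;

	sketch_enqueue(&skb);
	sketch_dequeue(&skb, &dev, restore_dev);

	return skb.dev == &dev;
}
```

Calling `sketch_dev_survives(0)` from a small `main()` returns 0 because `dev` comes back as NULL, which is the kind of pointer that a later `dev->...` access would fault on. `sketch_dev_survives(1)` returns 1.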

@webratz


commented Oct 18, 2018

We are not using netem here.

@matttbe

Member Author

commented Oct 18, 2018

@cpaasch Good catch! I was indeed using netem! I will launch some tests with this patch!

@cpaasch

Member

commented Oct 18, 2018

@webratz: The patch I posted seems to fix the issue for me. Can you try the latest mptcp_v0.94 branch with the above patch applied? (You mentioned above that you are on v4.14.72, which I guess must be a custom merge that you did.)

@matttbe

Member Author

commented Oct 18, 2018

@cpaasch Thanks again for the patch, that seems to fix all my crashes!

I got one WARN with another test but it seems totally unrelated!

[   24.685234] Call Trace:
[   24.685428]  [<ffffffff8143ba4e>] dump_stack+0x19/0x1b
[   24.685819]  [<ffffffff8103f8d2>] __warn+0xc0/0xdb
[   24.686190]  [<ffffffff8103f995>] warn_slowpath_null+0x18/0x1a
[   24.686633]  [<ffffffff810b20d7>] __alloc_pages_nodemask+0x1c0/0x776
[   24.687149]  [<ffffffff81430f17>] ? p9pdu_writef+0x39/0x3b
[   24.687564]  [<ffffffff814310ba>] ? p9pdu_readf+0x1a1/0x9a9
[   24.687990]  [<ffffffff814310ba>] ? p9pdu_readf+0x1a1/0x9a9
[   24.688411]  [<ffffffff810e4caa>] ? __kmalloc+0xf2/0xfe
[   24.688806]  [<ffffffff8106821f>] ? update_curr+0xb3/0x119
[   24.689226]  [<ffffffff810b269f>] __get_free_pages+0x12/0x6f
[   24.689654]  [<ffffffff810e4bf2>] __kmalloc+0x3a/0xfe
[   24.690042]  [<ffffffff814316c9>] p9pdu_readf+0x7b0/0x9a9
[   24.690450]  [<ffffffff8142de87>] ? p9_parse_header+0x5a/0xa8
[   24.694910]  [<ffffffff8142e6ed>] ? p9_client_rpc+0x289/0x37d
[   24.700030]  [<ffffffff8142f9d0>] p9_client_walk+0xb3/0x162
[   24.702082]  [<ffffffff81156d43>] v9fs_fid_clone+0x22/0x24
[   24.705526]  [<ffffffff811555dd>] v9fs_file_open+0x98/0x1a7
[   24.708970]  [<ffffffff81155545>] ? v9fs_write_begin+0x90/0x90
[   24.709412]  [<ffffffff810f1acc>] do_dentry_open.isra.19+0x177/0x225
[   24.709957]  [<ffffffff810f1c1e>] vfs_open+0xa4/0xa9
[   24.710407]  [<ffffffff8110025e>] do_last.isra.62+0x862/0xa93
[   24.710861]  [<ffffffff811006cb>] path_openat+0x23c/0x582
[   24.711285]  [<ffffffff81100a46>] do_filp_open+0x35/0x7a
[   24.711699]  [<ffffffff8110b3c3>] ? __alloc_fd+0x145/0x155
[   24.712136]  [<ffffffff810f29c9>] do_sys_open+0x143/0x1d0
[   24.712556]  [<ffffffff81446bce>] ? system_call_after_swapgs+0x5b/0xf3
[   24.713128]  [<ffffffff810f2a6f>] SyS_open+0x19/0x1b
[   24.713524]  [<ffffffff81446c88>] system_call_fastpath+0x22/0x27
[   24.714013]  [<ffffffff81446bce>] ? system_call_after_swapgs+0x5b/0xf3
[   24.714522] ---[ end trace a0ddfa743f856ebe ]---

Do you plan to send this patch upstream? :-)

matttbe added a commit that referenced this issue Oct 18, 2018

sch_netem: restore skb->dev after dequeuing from the rbtree
Upstream commit bffa72c ("net: sk_buff rbnode reorg") got
backported as commit 6b92153 ("net: sk_buff rbnode reorg").

However, the backport does not include the changes in sch_netem.c

We need this change to restore skb->dev.

Github-issue: #285

Fixes: 6b92153 ("net: sk_buff rbnode reorg")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>

@cpaasch

Member

commented Oct 18, 2018

@matttbe - Yes, I am submitting it upstream.

Indeed, the WARN you get looks totally unrelated to MPTCP IMO.

@cpaasch

Member

commented Oct 18, 2018

Closing this here. @webratz if you still see the issue, please open a new issue.

@cpaasch cpaasch closed this Oct 18, 2018
