Crash in kernel mode with 10 Gbps IXGBE driver on amd64 #136

Closed
cffs opened this issue Sep 18, 2014 · 4 comments

Comments

@cffs
Contributor

cffs commented Sep 18, 2014

Hello,

I tried to use Click in linuxmodule mode with various kernels between 2.6.32 and 3.16 and 10 Gbps IXGBE cards, and it always crashes with a page fault in ixgbe_xmit_frame_ring() after sending a few packets (only 144 in my last test, at just 2 packets per second). Our server is an Intel-based amd64 platform running Debian jessie (except for the 3.16 kernel attempt, which used sid).

The relevant part of the kernel panic stack trace is as follows:

panic
oops_end
no_context
__do_page_fault
KernelErrorHandler::buffer_store
KernelErrorHandler::log_line
click_lfree
KernelErrorHandler::emit
page_fault
ixgbe_xmit_frame_ring
ToDevice::queue_packet
ToDevice::run_task

After investigating ToDevice::queue_packet() by adding click_chatter() calls and early returns, I think the problem occurs in the call to dev->netdev_ops->ndo_start_xmit(skb1, dev).

I can reproduce the problem even with the trimmed-down configuration that follows:

// Echoes back any packet arriving on eth2 without any processing
FromDevice(eth2, PROMISC 1) -> Queue() -> ToDevice(eth2)

For completeness, I had to take the following steps to compile kernel-mode Click (most of them suggested in the other issues).

  • Merge /usr/src/linux-headers-VERSION-amd64 and /usr/src/linux-headers-VERSION-common.
  • Copy /boot/config-VERSION and /boot/System.map-VERSION to the source directory as .config and System.map, respectively.
  • Symlink /usr/src/linux-headers-VERSION-merged/include/generated/autoconf.h to /usr/src/linux-headers-VERSION-merged/include/linux/autoconf.h.
  • Add an #undef DEPRECATED in include/click/handler.hh (as it is defined in linux/printk.h, which is apparently included by handler.hh).
  • ./configure --disable-userlevel --enable-linux-module --with-linux=/usr/src/linux-headers-VERSION-merged (I tried first with other options like multithread but the problem also occurs with this simpler configuration).
  • Fix minor header issues, depending on the kernel version.
@tbarbette
Collaborator

I have more info about this problem (I work with @devmusings).

It works fine if I use the latest ixgbe driver from Intel at http://sourceforge.net/projects/e1000/files/ixgbe%20stable/3.22.3/ .
We run the standard Debian 3.16 kernel, and according to dmesg the in-kernel driver is 3.19.1-k, a 3.19.1 variant. How the in-kernel ixgbe differs from Intel's is not clear.
Kernel 3.18 also seems to ship that same driver version, so I didn't test it; it would probably hit the same bug.

The problem is always in ixgbe_xmit_frame_ring.

So to conclude, the problem is ixgbe-related. It is fixed in the latest version from Intel but is still present in the most recent kernel. A bug report should probably be filed, since the bug could come from one of the modifications in the kernel version of the driver, but the problem is hard to describe because I couldn't find its source. For now, if someone hits the same problem, just use Intel's driver.

Also, it appears only with some packet generators. With a loop configuration using Netmap, all goes fine, but with packets generated by a Tilera it fails. The generated packets are the same, produced by nearly the same program.

Here are the last kernel messages recovered with kdump. The first two lines are the output of the last packet passing through Print() and of a click_chatter I added. The last packet seems fine.

[ 1838.399421] chatter: 60 | 90e2ba46 f2e067c6 697351ff 08004510 002e0000 40004011
[ 1838.401131] chatter: Packet 0xffff8803d2821500, length 60, txq 0xffff8800369d0000 (dev = 0xffff8800369c0000, state = 0), netdev 0xffff8800369c0000 (name = eth2, state = 3, id=0, port=0)
[ 1838.404559] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[ 1838.406268] IP: [] ixgbe_xmit_frame_ring+0x79/0xc70 [ixgbe]
[ 1838.407955] PGD 0
[ 1838.409610] Oops: 0000 [#1] SMP
[ 1838.411264] Modules linked in: click(O) proclikefs(O) ipt_REJECT xt_LOG xt_limit xt_multiport iptable_filter ip_tables x_tables bnep bluetooth 6lowpan_iphc binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc nls_utf8 nls_cp437 vfat fat fuse snd_hda_codec_hdmi joydev x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel aesni_intel eeepc_wmi aes_x86_64 asus_wmi lrw sparse_keymap gf128mul snd_hda_intel rfkill glue_helper ablk_helper cryptd video nvidia(PO) snd_hda_controller iTCO_wdt psmouse iTCO_vendor_support pcspkr serio_raw mxm_wmi snd_hda_codec evdev snd_hwdep sb_edac edac_core snd_pcm lpc_ich snd_timer snd mfd_core i2c_i801 soundcore drm shpchp tpm_infineon tpm_tis processor tpm wmi thermal_sys
[ 1838.419652] mei_me mei button ext4 crc16 mbcache jbd2 dm_mod raid0 hid_generic usbhid hid md_mod sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci ehci_pci libata xhci_hcd ehci_hcd igb firewire_ohci i2c_algo_bit scsi_mod firewire_core usbcore crc_itu_t i2c_core ixgbe usb_common dca ptp pps_core mdio
[ 1838.424613] CPU: 0 PID: 12072 Comm: kclick Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt2-1
[ 1838.426230] Hardware name: PRIMINFO UNLOCK INSTALL/P9X79-E WS, BIOS 1501 01/15/2014
[ 1838.427832] task: ffff88041d744ca0 ti: ffff8803cd37c000 task.ti: ffff8803cd37c000
[ 1838.429380] RIP: 0010:[] [] ixgbe_xmit_frame_ring+0x79/0xc70 [ixgbe]
[ 1838.430932] RSP: 0018:ffff8803cd37fcf0 EFLAGS: 00010246
[ 1838.432458] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000001
[ 1838.433967] RDX: 0000000000000000 RSI: ffff8800369c08c0 RDI: ffff8803d2821500
[ 1838.435478] RBP: 0000000000000000 R08: ffff8803cd2aa440 R09: 0000000000000000
[ 1838.436973] R10: ffff8803d1438000 R11: 0000000000005000 R12: ffff8803d2821500
[ 1838.438453] R13: 0000000000000008 R14: ffff8800369c08c0 R15: ffff8803d2821500
[ 1838.439928] FS: 0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
[ 1838.441389] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1838.442829] CR2: 0000000000000058 CR3: 0000000001813000 CR4: 00000000001407f0
[ 1838.444270] Stack:
[ 1838.445690] ffff88041a40314c 0000000000000003 ffff88041a403140 ffff880419e3d30c
[ 1838.447099] 00000000000000a5 ffff88041a496a00 0000000000000000 ffff8800369c0000
[ 1838.448512] ffff8803d2821500 ffff8800369d0000 ffff8803d2821500 ffffffffa11ad487
[ 1838.449907] Call Trace:
[ 1838.451291] [] ? _ZN8ToDevice12queue_packetEP6PacketP12netdev_queue+0xc7/0x200 [click]
[ 1838.452672] [] ? _ZN8ToDevice8run_taskEP4Task+0xf8/0x480 [click]
[ 1838.454030] [] ? _ZN12RouterThread6driverEv+0x42d/0x5f0 [click]
[ 1838.455368] [] ? _ZL11click_schedPv+0x158/0x320 [click]
[ 1838.456688] [] ? __schedule+0x2b1/0x710
[ 1838.458004] [] ? _Z19click_cleanup_schedv+0x160/0x160 [click]
[ 1838.459280] [] ? kthread+0xbd/0xe0
[ 1838.460548] [] ? kthread_create_on_node+0x180/0x180
[ 1838.461795] [] ? ret_from_fork+0x7c/0xb0
[ 1838.463013] [] ? kthread_create_on_node+0x180/0x180
[ 1838.464227] Code: 83 e9 01 31 c0 45 0f b7 c9 49 83 c1 01 49 c1 e1 04 90 41 8b 7c 00 3c 48 83 c0 10 8d 97 ff 3f 00 00 c1 ea 0e 01 d1 4c 39 c8 75 e7 <0f> b7 43 58 0f b7 73 5a 83 c1 03 31 d2 66 39 f0 66 0f 43 53 54
[ 1838.466871] RIP [] ixgbe_xmit_frame_ring+0x79/0xc70 [ixgbe]
[ 1838.468126] RSP
[ 1838.469356] CR2: 0000000000000058
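
The byte marked `<0f>` in the Code: line is where the faulting instruction starts, so the access that produced CR2 can be read off by hand. The following is a minimal sketch of that decoding, not a general disassembler: it assumes the instruction is a `movzx r32, word [base + disp8]` with no REX prefix and no SIB byte, which is what the bytes `0f b7 43 58` encode. The names `bytes_from_marker`, `decode_movzx_disp8` and `FaultingLoad` are made up for this illustration; the kernel tree's scripts/decodecode is the proper tool for real use.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: base register and displacement of the faulting load.
struct FaultingLoad {
    bool ok;           // true if the bytes matched the one encoding handled here
    std::string base;  // name of the base register
    int disp;          // signed 8-bit displacement
};

// Return the instruction bytes starting at the token marked <xx> in a "Code:" line.
std::vector<uint8_t> bytes_from_marker(const std::string& code_line) {
    std::vector<uint8_t> out;
    std::istringstream in(code_line);
    std::string tok;
    bool seen = false;
    while (in >> tok) {
        if (tok.size() == 4 && tok.front() == '<' && tok.back() == '>') {
            tok = tok.substr(1, 2);  // strip the angle brackets
            seen = true;
        }
        if (seen)
            out.push_back(static_cast<uint8_t>(std::stoul(tok, nullptr, 16)));
    }
    return out;
}

// Decode "0f b7 /r disp8", i.e. movzx r32, word [base + disp8].
// Assumption: no REX prefix and no SIB byte; anything else is rejected.
FaultingLoad decode_movzx_disp8(const std::vector<uint8_t>& b) {
    static const char* const regs[8] = {"rax", "rcx", "rdx", "rbx",
                                        "rsp", "rbp", "rsi", "rdi"};
    FaultingLoad r{false, "", 0};
    if (b.size() < 4 || b[0] != 0x0f || b[1] != 0xb7)
        return r;                      // not the MOVZX opcode handled here
    const uint8_t mod = b[2] >> 6;     // ModR/M bits 7..6
    const uint8_t rm = b[2] & 7;       // ModR/M bits 2..0
    if (mod != 1 || rm == 4)
        return r;                      // only [reg + disp8], no SIB
    r.ok = true;
    r.base = regs[rm];
    r.disp = static_cast<int8_t>(b[3]);
    return r;
}
```

Fed the tail of the Code: line above, this yields base register rbx and displacement 0x58. The register dump shows RBX = 0 and CR2 = 0x58, so the load went through a NULL base pointer at offset 0x58. Which driver structure that pointer was supposed to refer to is not established by the trace alone.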

@pallas
Contributor

pallas commented Dec 19, 2014

If this makes you feel any better (or worse) my team had similar experiences with that driver and we always use the Intel version now. Sorry I didn't see this issue earlier since I probably could have saved you some time.

@tbarbette
Collaborator

I spoke too quickly. It may be a different problem, but it doesn't work if I use multiqueue.

This seems to be because, even with single-threaded Click, packet_notifier_hook() in fromdevice.cc can be called concurrently, as there are multiple interrupts coming from the card on multiple CPUs. Adding a big lock fixes the problem. I will double-check that, think about a better solution (I'd say an atomic increment on the queue head), and come back with a patch.
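
To illustrate the lock-free idea in user space, here is a minimal sketch of a ring whose head is claimed atomically by concurrent producers and drained by a single consumer. It is not Click's FromDevice code and not the change that eventually fixed this issue. It also swaps the plain atomic increment for a compare-and-swap on the head combined with per-slot sequence numbers (a well-known bounded-queue pattern), because a bare increment cannot tell when the ring is full. `SlotRing` and its members are hypothetical names, and kernel code would use the kernel's own atomic primitives rather than `std::atomic`.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a receive queue. N must be a power of two.
template <typename T, std::size_t N>
class SlotRing {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

    struct Slot {
        std::atomic<std::size_t> seq;  // tells producers/consumer whose turn it is
        T value;
    };

    std::array<Slot, N> slots_;
    std::atomic<std::size_t> head_{0};  // next position producers will claim
    std::size_t tail_ = 0;              // touched by the single consumer only

public:
    SlotRing() {
        for (std::size_t i = 0; i < N; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Safe to call from several contexts at once. Returns false if the ring is full.
    bool push(const T& v) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & (N - 1)];
            const std::size_t seq = s.seq.load(std::memory_order_acquire);
            const std::intptr_t diff = static_cast<std::intptr_t>(seq) -
                                       static_cast<std::intptr_t>(pos);
            if (diff == 0) {
                // Slot is free for this lap: try to claim position pos.
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    s.value = v;
                    s.seq.store(pos + 1, std::memory_order_release);  // publish
                    return true;
                }
                // CAS failed: pos now holds the current head, so retry.
            } else if (diff < 0) {
                return false;  // consumer has not freed this slot yet: full
            } else {
                pos = head_.load(std::memory_order_relaxed);  // lost a race
            }
        }
    }

    // Single consumer only. Returns false if nothing has been published.
    bool pop(T& out) {
        Slot& s = slots_[tail_ & (N - 1)];
        if (s.seq.load(std::memory_order_acquire) != tail_ + 1)
            return false;
        out = s.value;
        s.seq.store(tail_ + N, std::memory_order_release);  // free for next lap
        ++tail_;
        return true;
    }
};
```

Each producer owns the slot it claimed until it publishes the sequence number, so two hooks running on different CPUs never write the same entry, and no lock is held while the value is copied in.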

@tbarbette
Collaborator

This was solved by #182

@cffs cffs closed this as completed Jul 13, 2015