A strange core-dump issue #209

Open
iyuvalk opened this issue Jun 18, 2019 · 18 comments

@iyuvalk

commented Jun 18, 2019

Hi,

I'm trying to use netsniff-ng as a kind of virtual tap for Docker containers running in Kubernetes on AWS EC2 servers. It copies traffic from one interface to a virtual interface created by OpenVPN, so that I can analyze the traffic that arrives at or leaves my servers on a single AWS EC2 instance. My environment is small: only 15 servers and not much traffic. I sometimes see netsniff-ng crash with a core dump. I fully understand that, from your POV, that's not much to go on, so it would probably be best if you could tell me which data you'd like me to collect. Also, I'm guessing that such core dumps are pretty rare, so if you can tell me what usually triggered them in the past, I can check whether my case is similar.

Thanks a lot in advance,
Yuval

@tklauser

Contributor

commented Jun 18, 2019

Hi @iyuvalk

Thanks for your report. It looks like there is no easy way for me to reproduce this without setting up quite some infrastructure :)

Did you try looking at the generated core dump using gdb to check where netsniff-ng crashed? It would be helpful to know at which code position netsniff-ng core dumps, and maybe we could figure it out together from there if you're willing to patch your netsniff-ng sources and recompile.

@iyuvalk

Author

commented Jun 19, 2019

Hi,

I have managed to generate a core dump of the error (zipped & attached to this message) and I also tried to use gdb to analyze it and here's the result:

[New LWP 14373]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/sbin/netsniff-ng --ring-size 64MiB -s -i eth0 -o tap0 not port 1194'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:371
371	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.

I have to add that I don't have any experience with gdb, so this is my first time. I tried some of the tips listed here, and this is the result:

(gdb) thread apply all bt full

Thread 1 (Thread 0x7f3bc21eab80 (LWP 14373)):
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:371
No locals.
#1  0x00005650de1f9976 in receive_to_xmit ()
No symbol table info available.
#2  0x00005650de1f23ab in main ()
No symbol table info available.
(gdb) 

Despite the fact that I lack some experience here (although I do have over 20 years of experience in software development and debugging in various languages and technologies) I'll be more than happy to help you help me here... (-;

Anything you need that I can do to help - just name it and it will be done.
Also, last but not least... I forgot to mention that the version of netsniff-ng that I'm using is the one on github (I compiled it from the sources here)

netsniff-ng.coredump.zip

Thanks in advance,
Yuval.

@tklauser

Contributor

commented Jun 19, 2019

Hi Yuval

Thanks a lot for the information. Could you try building netsniff-ng with debug symbols enabled (i.e. build with make DEBUG=1), so we can get a better view of the backtrace and inspect variables? This will disable optimizations, though, and it looks like the crash happens in an optimized memmove provided by glibc. So if the debug build does not crash anymore, could you try temporarily replacing the -O0 with -O3 here:

CFLAGS_DEF += -O0

and then build again with make DEBUG=1?

Thanks
Tobias

@iyuvalk

Author

commented Jun 19, 2019

Hi,

I followed your instructions to the letter and here's the result: a zip file that contains two crash dumps and the executable that I got after compiling the sources with the debug flag. Can you take a look now?

netsniff-ng_DEBUG_CRASH_AND_EXECUTABLE.zip

@iyuvalk

Author

commented Jun 19, 2019

Just to save you some time, here are my results from gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/sbin/netsniff-ng --ring-size 64MiB -s -i eth0 -o tap0 not port 1194'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:371
371	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (Thread 0x7fac51931b80 (LWP 16876)):
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:371
No locals.
#1  0x000055cb88c4b1d3 in receive_to_xmit (ctx=0x7ffeeabbb190) at netsniff-ng.c:521
        ifflags = 4419
        in = 0x7fac4bb00fff <error: Cannot access memory at address 0x7fac4bb00fff>
        out = 0x7fac47af1020 <error: Cannot access memory at address 0x7fac47af1020>
        rx_sock = 3
        ifindex_in = 2
        ifindex_out = 284
        ret = 1
        size_in = 67108864
        size_out = 67108864
        it_in = 0
        it_out = 0
        hdr_in = 0x7fac4baf1000
        hdr_out = 0x7fac47af1000
        tx_ring = {frames = 0x7fac47a70080, 
          mm_space = 0x7fac47af1000 <error: Cannot access memory at address 0x7fac47af1000>, mm_len = 67108864, s_ll = {
            sll_family = 17, sll_protocol = 0, sll_ifindex = 284, sll_hatype = 0, sll_pkttype = 0 '\000', 
            sll_halen = 0 '\000', sll_addr = "\000\000\000\000\000\000\000"}, {layout = {tp_block_size = 16384, 
              tp_block_nr = 4096, tp_frame_size = 2048, tp_frame_nr = 32768}, layout3 = {tp_block_size = 16384, 
              tp_block_nr = 4096, tp_frame_size = 2048, tp_frame_nr = 32768, tp_retire_blk_tov = 0, tp_sizeof_priv = 0, 
              tp_feature_req_word = 0}, raw = 0 '\000'}}
        rx_ring = {frames = 0x7fac51751080, 
          mm_space = 0x7fac4baf1000 <error: Cannot access memory at address 0x7fac4baf1000>, mm_len = 67108864, s_ll = {
            sll_family = 17, sll_protocol = 768, sll_ifindex = 2, sll_hatype = 0, sll_pkttype = 0 '\000', 
            sll_halen = 0 '\000', sll_addr = "\000\000\000\000\000\000\000"}, {layout = {tp_block_size = 16384, 
              tp_block_nr = 4096, tp_frame_size = 2048, tp_frame_nr = 32768}, layout3 = {tp_block_size = 16384, 
              tp_block_nr = 4096, tp_frame_size = 2048, tp_frame_nr = 32768, tp_retire_blk_tov = 0, tp_sizeof_priv = 0, 
              tp_feature_req_word = 0}, raw = 0 '\000'}}
        rx_poll = {fd = 3, events = 73, revents = 65}
        bpf_ops = {len = 24, filter = 0x55cb8a48e040}
#2  0x000055cb88c4e7c1 in main (argc=9, argv=0x7ffeeabbb378) at netsniff-ng.c:1657
        ptr = 0x7ffeeabbc693 "MiB"
        c = -1
        i = 3
        j = 5
        cpu_tmp = 1496408823
        ops_touched = 0
        vals = {104857600, 4194304, 104857600, 4194304}
        prio_high = false
        setsockmem = true
        main_loop = 0x55cb88c4acf2 <receive_to_xmit>
        ctx = {device_in = 0x55cb8a480c30 "eth0", device_out = 0x55cb8a480c50 "tap0", device_trans = 0x0, 
          filter = 0x55cb8a480c70 "not port 1194", prefix = 0x55cb8a480c90 "dump-", cpu = -1, rfraw = 0, dump = 0, 
          print_mode = 5, dump_dir = 0, packet_type = -1, lo_ifindex = 0, kpull = 0, dump_interval = 60, tx_bytes = 0, 
          tx_packets = 0, reserve_size = 67108864, randomize = false, promiscuous = true, enforce = false, jumbo = false, 
          dump_bpf = false, hwtimestamp = true, verbose = false, pcap = PCAP_OPS_SG, dump_mode = DUMP_INTERVAL_TIME, uid = 0, 
          gid = 0, link_type = 1, magic = 2712847316, fanout_group = 0, fanout_type = 3, pkts_seen = 131073, pkts_recvd = 0, 
          pkts_drops = 0, pkts_recvd_last = 0, pkts_drops_last = 0, pkts_skipd_last = 0, overwrite_interval = 0, 
          file_number = 0}
        __PRETTY_FUNCTION__ = "main"
(gdb) 
@iyuvalk

Author

commented Jun 20, 2019

Hi,

I just wanted to make sure... Is there anything else that I can do to help you find the problem and fix it?

@tklauser

Contributor

commented Jun 20, 2019

Thanks for the coredump. It indeed looks like the crash happens in glibc's optimized memmove from the memcpy call here:

memcpy(out, in, hdr_in->tp_h.tp_len);

The memory areas are the tx/rx ring buffers and the length comes directly from the kernel. Thus, I'd suspect some issue in either the kernel version (or more precisely with the tun/tap interface driver in combination with tpacket) or the glibc version you're using. Which kernel/glibc versions do you currently use?
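
As a side note, the pointer values in the debug backtrace can be checked by hand. The following is a minimal sketch that uses only numbers copied from that gdb output (`hdr_in`, `in`, `hdr_out`, `out`, `tp_frame_size`); the helper names `offset_in_frame` and `print_offsets` are made up for illustration and are not part of netsniff-ng.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Values copied from the gdb backtrace posted in this issue. */
static const uint64_t HDR_IN     = 0x7fac4baf1000ULL; /* hdr_in  */
static const uint64_t IN         = 0x7fac4bb00fffULL; /* in      */
static const uint64_t HDR_OUT    = 0x7fac47af1000ULL; /* hdr_out */
static const uint64_t OUT        = 0x7fac47af1020ULL; /* out     */
static const uint64_t FRAME_SIZE = 2048;              /* tp_frame_size */

/* Hypothetical helper: distance in bytes from a frame header to a data pointer. */
uint64_t offset_in_frame(uint64_t data_ptr, uint64_t hdr_ptr)
{
	assert(data_ptr >= hdr_ptr);
	return data_ptr - hdr_ptr;
}

/* Hypothetical helper: print both offsets next to the frame size. */
void print_offsets(void)
{
	printf("in  - hdr_in  = %" PRIu64 " bytes (frame size %" PRIu64 ")\n",
	       offset_in_frame(IN, HDR_IN), FRAME_SIZE);
	printf("out - hdr_out = %" PRIu64 " bytes (frame size %" PRIu64 ")\n",
	       offset_in_frame(OUT, HDR_OUT), FRAME_SIZE);
}
```

Calling `print_offsets()` from a `main` reports offsets of 65535 and 32 bytes. So `in` sits 65535 (0xFFFF) bytes past `hdr_in`, far beyond the 2048-byte frame, while `out` sits only 32 bytes past `hdr_out`. If `in` is computed from an offset stored in the frame header (an assumption; gdb cannot read the ring memory from this core, so the header itself cannot be inspected), that offset looks suspicious and may be worth checking alongside the length.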

@iyuvalk

Author

commented Jun 20, 2019

Hmm... How can I tell? Is there a command that I can run to get that information? All I know is that it runs, as I mentioned earlier, inside a Docker container that is part of a Kubernetes cluster, which itself runs on an Amazon EC2 cluster of CoreOS VMs. The Docker image is based on Ubuntu 18.04.

@tklauser

Contributor

commented Jun 20, 2019

Kernel version: cat /proc/version
glibc version: /lib/x86_64-linux-gnu/libc.so.6 (executing the glibc .so file will print version information)
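
The same two values can also be read from inside a program, which may be handy for logging them at startup. This is a minimal sketch, assuming a Linux system with glibc; `format_versions` is a hypothetical helper name, while `uname()` and `gnu_get_libc_version()` are the standard calls it relies on.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <gnu/libc-version.h>

/* Hypothetical helper: writes "kernel <release>, glibc <version>" into buf.
 * Returns 0 on success, -1 if uname() fails or the buffer is too small. */
int format_versions(char *buf, size_t len)
{
	struct utsname u;
	int n;

	assert(buf != NULL && len > 0);

	if (uname(&u) != 0)
		return -1;

	n = snprintf(buf, len, "kernel %s, glibc %s",
		     u.release, gnu_get_libc_version());

	/* snprintf reports the length it wanted; treat truncation as failure. */
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Note that inside a container the kernel release is that of the host, because containers share the host kernel, whereas the glibc version is that of the container image.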

@iyuvalk

Author

commented Jun 20, 2019

Here they are:

root@ip-xx-xx-xx-xx:/netsniff-ng# cat /proc/version
Linux version 4.19.25-coreos (jenkins@ip-10-7-32-103) (gcc version 7.3.0 (Gentoo Hardened 7.3.0-r3 p1.4)) #1 SMP Sat Mar 9 01:05:06 -00 2019
root@ip-xx-xx-xx-xx:/netsniff-ng# 
root@ip-xx-xx-xx-xx:/netsniff-ng# /lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Ubuntu GLIBC 2.27-3ubuntu1) stable release version 2.27.
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 7.3.0.
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.
root@ip-xx-xx-xx-xx:/netsniff-ng# 

BTW: Even if it is a problem with the kernel or glibc, wouldn't it be better to just drop a few packets and log this instead of going kaboom with a core dump? (-;

@tklauser

Contributor

commented Jun 20, 2019

BTW: Even if it is a problem with the kernel or glibc, wouldn't it be better to just drop a few packets and log this instead of going kaboom with a core dump? (-;

Well, if the bug is not within netsniff-ng, we cannot really prevent it from going kaboom. We cannot work around every possible bug.

@iyuvalk

Author

commented Jun 20, 2019

I'm hoping that I'm not talking absolute rubbish here... but I have developed in C# and Python quite a lot, there we had the option of using try-catch (C#) or try-except (Python) to wrap potentially problematic code. Isn't that an option?

@tklauser

Contributor

commented Jun 20, 2019

No, C doesn't offer try-catch, and even if it did, it wouldn't help in this case because something is accessing memory it shouldn't be accessing. Moreover, the memory for the ring buffers is accessed both by userspace (i.e. netsniff-ng) and kernel space. This unfortunately makes this issue very hard to debug/circumvent. I currently have no idea how to proceed further :(
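
For readers coming from C# or Python: the closest C equivalent to "catch and drop" is to validate values before using them and to return an error instead of copying. The following is a minimal sketch under assumptions, not netsniff-ng code: `frame_copy_is_safe` and `copy_or_drop` are hypothetical names, and the offsets and length stand in for whatever the frame header reports. Whether such a check would fix or merely mask the problem in this issue is unknown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical check: does [data_off, data_off + len) fit inside one frame?
 * Written so that the comparison cannot overflow. */
bool frame_copy_is_safe(size_t frame_size, size_t data_off, size_t len)
{
	return data_off <= frame_size && len <= frame_size - data_off;
}

/* Hypothetical wrapper: copy only if both the source and the destination
 * range stay within their frames; otherwise skip the packet. */
bool copy_or_drop(uint8_t *out_frame, size_t out_off,
		  const uint8_t *in_frame, size_t in_off,
		  size_t len, size_t frame_size)
{
	assert(out_frame != NULL && in_frame != NULL);

	if (!frame_copy_is_safe(frame_size, in_off, len) ||
	    !frame_copy_is_safe(frame_size, out_off, len))
		return false; /* caller can log this and count a dropped packet */

	memcpy(out_frame + out_off, in_frame + in_off, len);
	return true;
}
```

A caller would treat a `false` return as "drop and log" and move on to the next frame. This only guards against out-of-range values; it cannot help if the ring memory itself has become inaccessible.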

@iyuvalk

Author

commented Jun 20, 2019

Wow... Did the information about the kernel/glibc help? Do they need an upgrade or something?

@tklauser

Contributor

commented Jun 20, 2019

If that's an option, you could try updating them to the latest version, yes.

Or maybe @borkmann has an idea about how we could debug this?

@iyuvalk

Author

commented Jun 20, 2019

Also, based on the location of the error in your code, do you think that the segmentation fault will happen more frequently on servers with a lot of traffic and less frequently on servers with little traffic? The reason I'm asking is that it happened more frequently on some servers than on others in our environment, even though the OS and all the parameters I could think of were essentially the same...

@tklauser

Contributor

commented Jun 20, 2019

Yes, as the segfault happens in the main receive/transmit loop, I suspect it would happen more frequently on systems with a higher amount of traffic.

@iyuvalk

Author

commented Jun 30, 2019

Hi again @tklauser,

Do you or @borkmann have any ideas/insights on how we can solve/work around this issue?

....)-:

T.I.A,
Yuval
