Segfault in _check_interface_entry_for_updates #107

Closed
episodeiv opened this issue Apr 29, 2020 · 35 comments

Comments

@episodeiv

Hi all,

for a while I've had problems with snmpd crashing irregularly with a segfault. After finally catching a core dump and analyzing it I've found the following:

snmpd crashes at mibgroup/if-mib/ifTable/ifTable_data_access.c:317. The rowreq_ctx->data object looks like this:

(gdb) print rowreq_ctx->data
$2 = {ifLinkUpDownTrapEnable = 0, ifAlias = '\000' <repeats 63 times>, ifAlias_len = 0, ifCounterDiscontinuityTime = 0, ifentry = 0x0}

So there is no ifLastChange member. The system has an assortment of VLAN interfaces but nothing my other machines don't have.

Is this something anyone has seen before?

Regards,
Dennis

@bvanassche
Contributor

With which Net-SNMP version has this been observed and on which platform has this been observed (Linux, FreeBSD, ...)?

@episodeiv
Author

Sorry, it's a Gentoo Linux system running Net-SNMP 5.8.
Gentoo applies a few patches on top of 5.8 (see https://gitweb.gentoo.org/repo/gentoo.git/tree/net-analyzer/net-snmp/files) but as far as I can tell nothing that touches the mentioned code.

@bvanassche
Contributor

I took a close look at the IF-MIB implementation and I don't see how the rowreq_ctx pointer could be NULL other than as the result of memory corruption. As one can see in _add_new_interface(), CONTAINER_INSERT() is only called if rowreq_ctx != NULL. How about trying to reproduce this issue with snmpd running under Valgrind?

@episodeiv
Author

The pointer isn't NULL, it's just the ifLastChange field. As far as I can tell, that one gets set in _add_new_interface() from netsnmp_get_agent_uptime(). Perhaps that can fail in some way?

@bvanassche
Contributor

I think that the missing ifLastChange field indicates that there is something wrong with the debug information that was used by the debugger to print rowreq_ctx->data.

Anyway, it would be appreciated if the same workload could be applied to the latest version of the v5.8 branch while running it under Valgrind. BTW, two memory corruption fixes were checked in earlier today on that branch. I'm not sure, however, whether these fixes are related to what has been reported above.

@sgarcialaguna-mms

We've encountered the same issue with the same crash dump at work. I probably wouldn't even have noticed if our IT department hadn't pinged me. We keep restarting the daemon, but it then crashes again at the same location some hours later.

This is on an Ubuntu 20.04 machine running Net-SNMP 5.8

Wild guess: The machine in question is a CI machine where we keep spinning up and discarding Docker containers throughout the day. Possibly a race condition when Docker adds and/or removes network interfaces?

@bvanassche
Contributor

Net-SNMP v5.8 is no longer supported. Is this reproducible with Net-SNMP v5.9?

@episodeiv
Author

Docker seems to be a likely culprit - our affected machine is running it as well.

5.9 is still affected by this.

@sgarcialaguna-mms

I've installed Net-SNMP v5.9 on our machine. No crashes so far, I'll let you know if that changes.

@sgarcialaguna-mms

It crashed again.

@singh4jitendra

Any update on a fix for this?

@po5857
Contributor

po5857 commented Jul 23, 2021

This is happening to me as well in 5.9, but only on VMs running Docker containers that are spun up and down regularly, similar to @sgarcialaguna-mms's comment.

@bvanassche
Contributor

@po5857, @singh4jitendra and @sgarcialaguna-mms , what made you decide that you ran into the same crash as the original reporter of this bug? Please provide more information about the crashes that you observed. Which Net-SNMP version are you using? Has Net-SNMP been compiled from source or has a binary package been installed that was provided by a Linux distributor? In the latter case, which Linux distribution are you using? Did the crash happen while snmpd was running or during shutdown? Is there anything unusual about your setup, e.g. a large number of disks or a large number of network interfaces? Is snmpd running inside Docker or not?

@po5857
Contributor

po5857 commented Jul 24, 2021

@bvanassche, because it's crashing at the same exact place as the original reporter:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f0d1ecc3cc3 in _check_interface_entry_for_updates (rowreq_ctx=0x557e69269450, cdc=0x7fffca8656d0)
    at mibgroup/if-mib/ifTable/ifTable_data_access.c:317
317         int lastchanged = rowreq_ctx->data.ifLastChange;

and more info from gdb:

(gdb) print *rowreq_ctx
$3 = {oid_idx = {len = 139694326086400, oids = 0x557e69215930}, oid_tmp = {0}, tbl_idx = {ifIndex = 0}, data = {ifLinkUpDownTrapEnable = 0, ifAlias = '\000' <repeats 63 times>, ifAlias_len = 0, ifCounterDiscontinuityTime = 0, ifentry = 0x0}, undo = 0x0, column_set_flags = 0, rowreq_flags = 0, known_missing = 1 '\001', undo_ref_count = 0 '\000', ifTable_data_list = 0x0}
(gdb) print rowreq_ctx->data
$4 = {ifLinkUpDownTrapEnable = 0, ifAlias = '\000' <repeats 63 times>, ifAlias_len = 0, ifCounterDiscontinuityTime = 0, ifentry = 0x0}
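
One possible reading of the gdb output above, offered as an assumption rather than a confirmed reading of the Net-SNMP headers: line 317 compiles even though gdb lists no ifLastChange member, which would be the case if ifLastChange is a preprocessor macro that expands to a field reached through the ifentry pointer. With ifentry = 0x0, that expansion would dereference NULL. The sketch below uses hypothetical, simplified types and names (demo_ifentry, demo_data, demo_read_last_change) to show the pattern together with a defensive check; it is not the actual ifTable code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-interface entry. */
struct demo_ifentry {
    unsigned long lastchange;
};

/* Hypothetical stand-in for the row data: there is no ifLastChange member. */
struct demo_data {
    int ifLinkUpDownTrapEnable;
    struct demo_ifentry *ifentry;
};

/* Assumed macro: data.ifLastChange expands to data.ifentry->lastchange. */
#define ifLastChange ifentry->lastchange

/* Returns 0 and stores the value, or -1 if the row has no entry attached. */
static int demo_read_last_change(const struct demo_data *data,
                                 unsigned long *out)
{
    if (data == NULL || data->ifentry == NULL)
        return -1;              /* unguarded, this would dereference NULL */
    *out = data->ifLastChange;  /* expands to data->ifentry->lastchange */
    return 0;
}
```

A debugger prints only real struct members, so a macro-mapped field would not show up in `print rowreq_ctx->data` even with correct debug information.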

In my case I'm on Fedora 34, which has net-snmp-5.9-9 rpm. Snmpd is not running inside the docker containers - only on the host. It crashes regularly - today's crashes for instance:

Jul 24 02:40
Jul 24 06:34
Jul 24 09:10

I've had to tweak the systemd service file to add "Restart=on-failure" to avoid constant manual restarts.

I just compiled a new rpm with optflags -O0 so I can avoid optimized out vars. Aside from that, let me know if there's something else I can do to assist in tracking this down.

@po5857
Contributor

po5857 commented Jul 25, 2021

Disabling optimizations led to a different backtrace unfortunately.

#0  0x00007f0a85fec789 in ?? () from /lib64/libc.so.6
#1  0x00007f0a862e2f13 in netsnmp_access_interface_entry_free (entry=0x56248f374470) at mibgroup/if-mib/data_access/interface.c:346
#2  0x00007f0a862f711f in ifTable_rowreq_ctx_cleanup (rowreq_ctx=rowreq_ctx@entry=0x56248f377ef0) at mibgroup/if-mib/ifTable/ifTable.c:241
#3  0x00007f0a862f721a in ifTable_release_rowreq_ctx (rowreq_ctx=0x56248f377ef0) at mibgroup/if-mib/ifTable/ifTable_interface.c:626
#4  0x00007f0a861ee028 in _ssll_for_each (c=<optimized out>, f=0x7f0a862f7780 <_delete_missing_interface>, context=0x56248f233020)
    at /usr/src/debug/net-snmp-5.9-667.IB.fc34.x86_64/snmplib/container_list_ssll.c:284
#5  0x00007f0a862f889f in ifTable_container_load (container=0x56248f233020) at mibgroup/if-mib/ifTable/ifTable_data_access.c:643
#6  0x00007f0a8652d9e4 in _cache_load (cache=0x56248f232fb0) at helpers/cache_handler.c:735
#7  0x00007f0a861cebb6 in run_alarms () at /usr/src/debug/net-snmp-5.9-667.IB.fc34.x86_64/snmplib/snmp_alarm.c:214
#8  0x000056248da837e9 in receive () at /usr/src/debug/net-snmp-5.9-667.IB.fc34.x86_64/agent/snmpd.c:1345
#9  0x000056248da83115 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/net-snmp-5.9-667.IB.fc34.x86_64/agent/snmpd.c:1088

It is now crashing in netsnmp_access_interface_entry_free here:

    if (NULL != entry->old_stats)
        free(entry->old_stats);

old_stats is referencing junk:

(gdb) inspect *entry
$3 = {oid_index = {len = 94715021680608, oids = 0x56248f38b960}, ns_flags = 2786431, index = 49, name = 0x56248f36fde0 "\357\313p\355!V", 
  descr = 0x56248f36eb00 "\276\016~\355!V", type = 6, speed = 4294967295, speed_high = 10000, paddr = 0x56248f341630 "\261\031~\355!V", 
  paddr_len = 6, mtu = 1500, retransmit_v4 = 0, retransmit_v6 = 0, reachable_time = 0, lastchange = 0, discontinuity = 0, reasm_max_v4 = 65535, 
  reasm_max_v6 = 65535, admin_status = 1 '\001', oper_status = 1 '\001', promiscuous = 0 '\000', connector_present = 1 '\001', 
  forwarding_v6 = 0 '\000', v6_if_id_len = 8 '\b', v6_if_id = "́l\377\376\063O!", os_flags = 4163, stats = {ibytes = {high = 0, low = 0}, 
    iall = {high = 0, low = 0}, iucast = {high = 0, low = 0}, imcast = {high = 0, low = 0}, ibcast = {high = 0, low = 0}, ierrors = 0, 
    idiscards = 0, iunknown_protos = 0, inucast = 0, obytes = {high = 0, low = 294929}, oucast = {high = 0, low = 2014}, omcast = {high = 0, 
      low = 0}, obcast = {high = 0, low = 0}, oerrors = 0, odiscards = 0, oqlen = 0, collisions = 0, onucast = 0}, old_stats = 0x150}

I can only guess that the change in optimization level moved the (apparent) race condition around.

@bvanassche
Contributor

Or maybe memory corruption is involved. Does the list of network interfaces on your setup change after snmpd has been started (ls /sys/class/net)?

@po5857
Contributor

po5857 commented Jul 26, 2021

Yes, docker creates a "veth" interface for each container. Containers are spinning up/down all the time, thus the list is constantly changing. One hackish workaround for this would be to allow users to ignore specific interfaces via config file. For instance, if we could ignore "veth*" (which we do not care about monitoring anyway) I suspect the crashing would stop.

# ls /sys/class/net/
docker0      veth0d13ee2  veth2ea919c  veth4f46f3e  veth61fd265  veth786f7c4  veth9b15f95  vethc9ba9bd  vethdf0bf2e  vethf57dced
eth0         veth1add6e9  veth3cc3718  veth52303fe  veth6d03cdb  veth800a3c8  vethb1b52b7  vethcd22503  vethe9461f6
lo           veth1c2b336  veth4261bc0  veth5677e4a  veth745b794  veth882ffa7  vethb452cb6  vethdb2f97d  vethee14b1a
veth0170ffd  veth2c74ab5  veth45413d5  veth5985973  veth76860fe  veth8d94d56  vethb4ce705  vethde3f9df  vethf46cb28
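
The "ignore interfaces by name" idea above could be sketched as a simple prefix match. This is a minimal illustration with hypothetical names (demo_has_prefix, demo_should_ignore) and a hard-coded ignore list; it is not an existing Net-SNMP function, and it only illustrates the filtering idea, not a fix for the underlying memory error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if name starts with prefix, 0 otherwise. */
static int demo_has_prefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Returns 1 if the interface name matches any ignored prefix.
 * The list below is an assumption made for this example. */
static int demo_should_ignore(const char *name)
{
    static const char *const ignored[] = { "veth", "docker" };
    size_t i;

    for (i = 0; i < sizeof(ignored) / sizeof(ignored[0]); i++) {
        if (demo_has_prefix(name, ignored[i]))
            return 1;
    }
    return 0;
}
```

With this list, names such as veth0d13ee2 and docker0 would be skipped while eth0 and lo would still be reported.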

@po5857
Contributor

po5857 commented Jul 26, 2021

I guess "include_ifmib_iface_prefix eth" would be the hackish workaround I'm looking for.

@po5857
Contributor

po5857 commented Jul 27, 2021

Well, "include_ifmib_iface_prefix eth" did not help...it still crashes.

@bvanassche
Contributor

I was asking for this information because I wanted to try to reproduce the issue. So far without success unfortunately. Can you run snmpd under Valgrind and reproduce the issue? Running snmpd under Valgrind is easier from the command line than from a service file and can be done e.g. as follows: sudo valgrind /usr/sbin/snmpd -f -Lo |& tee snmpd-valgrind-log.txt. Please share the valgrind output if any "Invalid read", "Conditional jump or move depends on uninitialized value" or other errors are reported. An (old) example is available here: https://gist.github.com/sonots/103546646a98fbba6c80

@po5857
Contributor

po5857 commented Jul 27, 2021

So far 3 of these:

==2500341== Invalid read of size 8
==2500341==    at 0x4C51FAB: netsnmp_compare_netsnmp_index (container.c:600)
==2500341==    by 0x4C4A3E7: binary_search (container_binary_array.c:184)
==2500341==    by 0x4C4A630: UnknownInlinedFun (container_binary_array.c:411)
==2500341==    by 0x4C4A630: netsnmp_binary_array_remove (container_binary_array.c:388)
==2500341==    by 0x4C49F44: CONTAINER_REMOVE (container.c:381)
==2500341==    by 0x498B7AB: _delete_missing_interface (ifTable_data_access.c:542)
==2500341==    by 0x4C47027: _ssll_for_each (container_list_ssll.c:284)
==2500341==    by 0x498C89E: ifTable_container_load (ifTable_data_access.c:643)
==2500341==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==2500341==    by 0x4C27BB5: run_alarms (snmp_alarm.c:214)
==2500341==    by 0x10D7E8: receive (snmpd.c:1345)
==2500341==    by 0x10D114: main (snmpd.c:1088)
==2500341==  Address 0x6230a88 is 8 bytes inside a block of size 160 free'd
==2500341==    at 0x48430E4: free (vg_replace_malloc.c:755)
==2500341==    by 0x4C47027: _ssll_for_each (container_list_ssll.c:284)
==2500341==    by 0x498C89E: ifTable_container_load (ifTable_data_access.c:643)
==2500341==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==2500341==    by 0x4C27BB5: run_alarms (snmp_alarm.c:214)
==2500341==    by 0x10D7E8: receive (snmpd.c:1345)
==2500341==    by 0x10D114: main (snmpd.c:1088)
==2500341==  Block was alloc'd at
==2500341==    at 0x4845464: calloc (vg_replace_malloc.c:1117)
==2500341==    by 0x498B5A3: ifTable_allocate_rowreq_ctx (ifTable_interface.c:585)
==2500341==    by 0x498C53C: _add_new_interface (ifTable_data_access.c:496)
==2500341==    by 0x4C473A1: UnknownInlinedFun (container_binary_array.c:429)
==2500341==    by 0x4C473A1: _ba_for_each (container_binary_array.c:705)
==2500341==    by 0x498C8BF: ifTable_container_load (ifTable_data_access.c:652)
==2500341==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==2500341==    by 0x4C27BB5: run_alarms (snmp_alarm.c:214)
==2500341==    by 0x10D7E8: receive (snmpd.c:1345)
==2500341==    by 0x10D114: main (snmpd.c:1088)
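
Reading the trace above: the 160-byte block allocated by ifTable_allocate_rowreq_ctx was freed from within _ssll_for_each, and a later CONTAINER_REMOVE then read it during its binary search. That suggests a freed row was still referenced by a container. The sketch below is generic C with hypothetical names (demo_table, demo_add, demo_find, demo_remove), not Net-SNMP code, and uses a linear lookup instead of a binary search. It shows the ordering that avoids this class of error: unlink the row from the container first, free it afterwards.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_ROWS 8

struct demo_row {
    int index;
};

/* Hypothetical container: a small array of owned row pointers. */
struct demo_table {
    struct demo_row *rows[DEMO_MAX_ROWS];
    size_t count;
};

/* Allocates a row and stores it; returns NULL if full or out of memory. */
static struct demo_row *demo_add(struct demo_table *t, int index)
{
    struct demo_row *r;

    if (t->count >= DEMO_MAX_ROWS)
        return NULL;
    r = calloc(1, sizeof(*r));
    if (r == NULL)
        return NULL;
    r->index = index;
    t->rows[t->count++] = r;
    return r;
}

/* Linear lookup; every pointer it touches must still be live. */
static struct demo_row *demo_find(const struct demo_table *t, int index)
{
    size_t i;

    for (i = 0; i < t->count; i++) {
        if (t->rows[i]->index == index)
            return t->rows[i];
    }
    return NULL;
}

/* Unlinks the row from the table BEFORE freeing it, so no later lookup
 * can compare against freed memory. Returns 0 on success, -1 if absent. */
static int demo_remove(struct demo_table *t, int index)
{
    size_t i;

    for (i = 0; i < t->count; i++) {
        if (t->rows[i]->index == index) {
            struct demo_row *victim = t->rows[i];

            t->rows[i] = t->rows[t->count - 1]; /* fill the hole */
            t->count--;
            free(victim);                       /* free only after unlinking */
            return 0;
        }
    }
    return -1;
}
```

If a row lives in more than one container, the same rule would apply to each of them before the final free.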

@bvanassche
Contributor

So far I haven't been able to reproduce this issue. But I took considerable time to review the source code of the ifTable implementation. I fixed multiple bugs, but I'm not sure whether the reported segmentation fault has been fixed by the changes I made. Would it be possible to run the latest version of the v5.9.1 branch on your setup under Valgrind? The latest version of the v5.9.1 branch can be downloaded from https://github.com/net-snmp/net-snmp/archive/refs/heads/V5-9-patches.zip.

@po5857
Contributor

po5857 commented Aug 2, 2021

I've been running this code for a few hours. So far got the below output:

==662376== Memcheck, a memory error detector
==662376== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==662376== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==662376== Command: /usr/sbin/snmpd -f -Lo
==662376== 
==662376== Warning: invalid file descriptor 1030 in syscall close()
==662376== Warning: invalid file descriptor 1029 in syscall close()
==662376== Warning: invalid file descriptor 1028 in syscall close()
==662376== Warning: invalid file descriptor 1027 in syscall close()
==662376==    Use --log-fd=<number> to select an alternative log fd.
==662376== Warning: invalid file descriptor 1026 in syscall close()
==662376== Warning: invalid file descriptor 1025 in syscall close()
==662376== Warning: invalid file descriptor 1024 in syscall close()
Can't find directory of RPM packageserror finding row index in _ifXTable_container_row_restore
error finding row index in _ifXTable_container_row_restore
[the line above repeated 70 more times]
NET-SNMP version 5.9.1
ioctl 35123 returned -1
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
ioctl 35123 returned -1
ioctl 35111 returned -1
ioctl 35091 returned -1
ioctl 35105 returned -1
ioctl 35123 returned -1
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
ioctl 35123 returned -1
ioctl 35111 returned -1
ioctl 35091 returned -1
ioctl 35105 returned -1
Name of an interface changed. Such interfaces will keep its old name in IF-MIB.
ioctl 35123 returned -1
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
ioctl 35123 returned -1
ioctl 35111 returned -1
ioctl 35091 returned -1
ioctl 35105 returned -1
ioctl 35123 returned -1
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
ioctl 35123 returned -1
ioctl 35111 returned -1
ioctl 35091 returned -1
ioctl 35105 returned -1
ioctl 35123 returned -1
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
ioctl 35123 returned -1
ioctl 35111 returned -1
ioctl 35091 returned -1
ioctl 35105 returned -1
ioctl 35123 returned -1
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
ioctl 35123 returned -1
ioctl 35111 returned -1
ioctl 35091 returned -1
ioctl 35105 returned -1
==662376== Invalid read of size 8
==662376==    at 0x4C514EB: netsnmp_compare_netsnmp_index (container.c:610)
==662376==    by 0x4C49977: binary_search (container_binary_array.c:185)
==662376==    by 0x4C49BC0: UnknownInlinedFun (container_binary_array.c:412)
==662376==    by 0x4C49BC0: netsnmp_binary_array_remove (container_binary_array.c:389)
==662376==    by 0x4C494D4: CONTAINER_REMOVE (container.c:390)
==662376==    by 0x498A89B: UnknownInlinedFun (ifTable_data_access.c:551)
==662376==    by 0x498A89B: __delete_missing_interface (ifTable_data_access.c:558)
==662376==    by 0x4C465E7: _ssll_for_each (container_list_ssll.c:285)
==662376==    by 0x498C54E: ifTable_container_load (ifTable_data_access.c:651)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376==  Address 0x6152788 is 8 bytes inside a block of size 160 free'd
==662376==    at 0x48430E4: free (vg_replace_malloc.c:755)
==662376==    by 0x4C465E7: _ssll_for_each (container_list_ssll.c:285)
==662376==    by 0x498C54E: ifTable_container_load (ifTable_data_access.c:651)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376==  Block was alloc'd at
==662376==    at 0x4845464: calloc (vg_replace_malloc.c:1117)
==662376==    by 0x4983493: ifTable_allocate_rowreq_ctx (ifTable_interface.c:548)
==662376==    by 0x498C19C: UnknownInlinedFun (ifTable_data_access.c:500)
==662376==    by 0x498C19C: __add_new_interface (ifTable_data_access.c:538)
==662376==    by 0x4C46961: UnknownInlinedFun (container_binary_array.c:430)
==662376==    by 0x4C46961: _ba_for_each (container_binary_array.c:706)
==662376==    by 0x498C56F: ifTable_container_load (ifTable_data_access.c:658)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376== 
==662376== Invalid read of size 8
==662376==    at 0x4C514F3: netsnmp_compare_netsnmp_index (container.c:610)
==662376==    by 0x4C49977: binary_search (container_binary_array.c:185)
==662376==    by 0x4C49BC0: UnknownInlinedFun (container_binary_array.c:412)
==662376==    by 0x4C49BC0: netsnmp_binary_array_remove (container_binary_array.c:389)
==662376==    by 0x4C494D4: CONTAINER_REMOVE (container.c:390)
==662376==    by 0x498A89B: UnknownInlinedFun (ifTable_data_access.c:551)
==662376==    by 0x498A89B: __delete_missing_interface (ifTable_data_access.c:558)
==662376==    by 0x4C465E7: _ssll_for_each (container_list_ssll.c:285)
==662376==    by 0x498C54E: ifTable_container_load (ifTable_data_access.c:651)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376==  Address 0x6152780 is 0 bytes inside a block of size 160 free'd
==662376==    at 0x48430E4: free (vg_replace_malloc.c:755)
==662376==    by 0x4C465E7: _ssll_for_each (container_list_ssll.c:285)
==662376==    by 0x498C54E: ifTable_container_load (ifTable_data_access.c:651)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376==  Block was alloc'd at
==662376==    at 0x4845464: calloc (vg_replace_malloc.c:1117)
==662376==    by 0x4983493: ifTable_allocate_rowreq_ctx (ifTable_interface.c:548)
==662376==    by 0x498C19C: UnknownInlinedFun (ifTable_data_access.c:500)
==662376==    by 0x498C19C: __add_new_interface (ifTable_data_access.c:538)
==662376==    by 0x4C46961: UnknownInlinedFun (container_binary_array.c:430)
==662376==    by 0x4C46961: _ba_for_each (container_binary_array.c:706)
==662376==    by 0x498C56F: ifTable_container_load (ifTable_data_access.c:658)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376== 
==662376== Invalid read of size 8
==662376==    at 0x4BF1153: snmp_oid_compare (snmp_api.c:6985)
==662376==    by 0x4C514FA: netsnmp_compare_netsnmp_index (container.c:610)
==662376==    by 0x4C49977: binary_search (container_binary_array.c:185)
==662376==    by 0x4C49BC0: UnknownInlinedFun (container_binary_array.c:412)
==662376==    by 0x4C49BC0: netsnmp_binary_array_remove (container_binary_array.c:389)
==662376==    by 0x4C494D4: CONTAINER_REMOVE (container.c:390)
==662376==    by 0x498A89B: UnknownInlinedFun (ifTable_data_access.c:551)
==662376==    by 0x498A89B: __delete_missing_interface (ifTable_data_access.c:558)
==662376==    by 0x4C465E7: _ssll_for_each (container_list_ssll.c:285)
==662376==    by 0x498C54E: ifTable_container_load (ifTable_data_access.c:651)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376==  Address 0x6152790 is 16 bytes inside a block of size 160 free'd
==662376==    at 0x48430E4: free (vg_replace_malloc.c:755)
==662376==    by 0x4C465E7: _ssll_for_each (container_list_ssll.c:285)
==662376==    by 0x498C54E: ifTable_container_load (ifTable_data_access.c:651)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376==  Block was alloc'd at
==662376==    at 0x4845464: calloc (vg_replace_malloc.c:1117)
==662376==    by 0x4983493: ifTable_allocate_rowreq_ctx (ifTable_interface.c:548)
==662376==    by 0x498C19C: UnknownInlinedFun (ifTable_data_access.c:500)
==662376==    by 0x498C19C: __add_new_interface (ifTable_data_access.c:538)
==662376==    by 0x4C46961: UnknownInlinedFun (container_binary_array.c:430)
==662376==    by 0x4C46961: _ba_for_each (container_binary_array.c:706)
==662376==    by 0x498C56F: ifTable_container_load (ifTable_data_access.c:658)
==662376==    by 0x486D9E3: _cache_load (cache_handler.c:735)
==662376==    by 0x4C27165: run_alarms (snmp_alarm.c:214)
==662376==    by 0x10D808: receive (snmpd.c:1352)
==662376==    by 0x10D12C: main (snmpd.c:1084)
==662376== 

@po5857
Contributor

po5857 commented Aug 3, 2021

I enabled some various debug options, and got this around your newly added assert:

9:access:ifcontainer: processing 'vethf99433e:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0'
access:interface:ioctl: ifindex_get
ioctl 35123 returned -1
access:interface:ioctl: ifindex_get error on inerface 'vethf99433e'
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
access:interface:entry: create
access:interface:find: index
access:interface:ioctl: ifindex_get
ioctl 35123 returned -1
access:interface:ioctl: ifindex_get error on inerface 'vethf99433e'
access:interface:ifIndex: saved ifIndex 0 for vethf99433e
access:interface:ioctl: physaddr_get
ioctl 35111 returned -1
access:interface:ioctl: flags_get
ioctl 35091 returned -1
access:interface:ioctl: mtu_get
ioctl 35105 returned -1
9:access:ifcontainer: processing 'veth08c2ec8:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0'
access:interface:ioctl: ifindex_get
access:interface:entry: create
access:interface:ifIndex: saved ifIndex 2968299 for veth08c2ec8
access:interface:ioctl: physaddr_get
access:interface:ioctl: flags_get
access:interface:ioctl: mtu_get

And here's another

9:access:ifcontainer: processing 'veth5babc22:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0'
access:interface:ioctl: ifindex_get
ioctl 35123 returned -1
access:interface:ioctl: ifindex_get error on inerface 'veth5babc22'
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
access:interface:entry: create
access:interface:find: index
access:interface:ioctl: ifindex_get
ioctl 35123 returned -1
access:interface:ioctl: ifindex_get error on inerface 'veth5babc22'
access:interface:ifIndex: saved ifIndex 0 for veth5babc22
access:interface:ioctl: physaddr_get
ioctl 35111 returned -1
access:interface:ioctl: flags_get
ioctl 35091 returned -1
access:interface:ioctl: mtu_get
ioctl 35105 returned -1
9:access:ifcontainer: processing 'vethf891085:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0'
access:interface:ioctl: ifindex_get
ioctl 35123 returned -1
access:interface:ioctl: ifindex_get error on inerface 'vethf891085'
netsnmp_assert if_index != 0 failed mibgroup/if-mib/data_access/interface_linux.c:737 netsnmp_arch_interface_container_load()
access:interface:entry: create
access:interface:find: index
access:interface:ioctl: ifindex_get
ioctl 35123 returned -1
access:interface:ioctl: ifindex_get error on inerface 'vethf891085'
access:interface:ifIndex: saved ifIndex 0 for vethf891085
access:interface:ioctl: physaddr_get
ioctl 35111 returned -1
access:interface:ioctl: flags_get
ioctl 35091 returned -1
access:interface:ioctl: mtu_get
ioctl 35105 returned -1
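
In the log above, ioctl 35123 (0x8933, which is SIOCGIFINDEX on Linux) fails for an interface that appears to have vanished between being listed and being queried, and an entry is still created with ifIndex 0. The sketch below shows one way to treat a failed index lookup as "interface gone" and skip it. It uses the POSIX function if_nametoindex(), which returns 0 on failure; the wrapper name demo_lookup_ifindex is hypothetical, and this is not how Net-SNMP itself performs the lookup.

```c
#include <assert.h>
#include <net/if.h>

/* Looks up the kernel index for an interface name.
 * Returns 0 and stores the index on success, or -1 if the interface
 * does not exist (for example, because it was just removed).
 * On failure *out is left untouched so that 0 is never stored. */
static int demo_lookup_ifindex(const char *name, unsigned int *out)
{
    unsigned int idx = if_nametoindex(name);

    if (idx == 0)
        return -1;
    *out = idx;
    return 0;
}
```

A caller iterating over interface names could skip any name for which this returns -1 instead of creating a table row for it.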

@bvanassche
Contributor

Thank you for having provided the entire snmpd output and not only the Valgrind complaints. The assertion failure was fixed yesterday on the v5.9 and master branches. I had not yet reported this because I'm not sure whether the changes I made yesterday fix the Valgrind complaint. Anyway, if you have the time to retest the latest version of the v5.9 branch under Valgrind, that would be welcome.

@po5857
Contributor

po5857 commented Aug 3, 2021

Thanks, I'm running that version now. Will advise if it complains.

Completely unrelated: for a few years I've been carrying a patch that adds a feature we use, and I'd like to get it merged upstream. What's the best way to submit it for review? Mailing list?

@bvanassche
Contributor

Please submit a pull request to https://github.com/net-snmp/net-snmp/. If that would be inconvenient, please create a patch with git format-patch and mail it with git send-email to net-snmp-coders@lists.sourceforge.net.

@po5857
Contributor

po5857 commented Aug 3, 2021

Thanks, will do.

In other news, valgrind is still complaining, including some new traces I haven't seen before. It's a long list, so attached as a file.

snmp.txt

@bvanassche
Contributor

I'm wondering whether the reported behavior could have been caused by a bug in the Net-SNMP binary array implementation (snmplib/container_binary_array.c). I have fixed several small bugs in that code and also added two consistency checks. Please retest the latest version of the v5.9 branch under Valgrind. If Valgrind reports a memory error, please share the entire snmpd log up to and including the first Valgrind complaint.

@po5857
Contributor

po5857 commented Aug 5, 2021

I ran this new version for the past 3 hours or so. Attached is all the valgrind output.

snmpdlog2.txt

bvanassche added a commit that referenced this issue Aug 6, 2021
Disable the consistency checks added by the previous commit since the test these were intended for has completed. See also #107.

@bvanassche
Contributor

Thank you for having run another test and for having shared the Valgrind output. That output made it clear that I introduced a regression on August 2 (600c541). A fix has been checked in (d4b58c6). Would it be possible to rerun the test once more?

@po5857
Contributor

po5857 commented Aug 6, 2021

So far I've been running this on 2 systems for the past 8 hours and no valgrind complaints. Will leave it running over the weekend and let you know, thanks!

@po5857
Contributor

po5857 commented Aug 9, 2021

Two things:

  1. I screwed up the rpmbuild on the prior round, and wasn't testing your latest fixes - apologies for the noise.
  2. This round definitely looks good - no complaints from valgrind all weekend on either system. Only some random messages from snmpd itself:
NET-SNMP version 5.9.1
IfIndex of an interface changed. Such interfaces will appear multiple times in IF-MIB.
Cannot statfs /var/disk/docker/overlay2/66c4294ecf7df9cc273ac163133d56f0878c6b41b874146e9c3a9aa9c962044a/merged: No such file or directory
error on subcontainer 'interface container' insert (-1)
Encountered interface with index 2990891 twice: veth6005475 <> veth6005475error on subcontainer 'interface container' insert (-1)
error on subcontainer 'interface container' insert (-1)

The statfs error is for similar reasons as the interfaces - docker is adding/removing mounts regularly.

@po5857
Contributor

po5857 commented Aug 13, 2021

Been running for a week now and aside from the occasional "error on subcontainer 'interface container' insert (-1)" error, it all looks good. Thanks!

@bvanassche
Contributor

Thanks for the help with testing and also for confirming that this issue has been solved. Let's close this bug report.
