ShellUTest errors sometimes with Debug build #5

Open
ferrouswheel opened this issue Aug 2, 2018 · 22 comments

@ferrouswheel
Contributor

I get sporadic failures with ShellUTest, but have only noticed them with a Debug build.

Compiler is the standard one for Ubuntu 16.04: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

Maybe related to opencog/opencog#2542? Though a cause was found and fixed in that issue.

The output is:

Start  5: ShellUTest
*** Error in `/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest': corrupted size vs. prev_size: 0x00007f7b38000a60 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f7b5db067e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7f7b5db0d9dc]
/lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7f7b5db10cde]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f7b5db13184]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_doallocate+0x55)[0x7f7b5dafc1d5]
/lib/x86_64-linux-gnu/libc.so.6(_IO_doallocbuf+0x34)[0x7f7b5db0a594]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_underflow+0x1fc)[0x7f7b5db0969c]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_uflow+0xe)[0x7f7b5db0a60e]
/lib/x86_64-linux-gnu/libc.so.6(_IO_getline_info+0xaa)[0x7f7b5dafdc6a]
/lib/x86_64-linux-gnu/libc.so.6(fgets+0xad)[0x7f7b5dafcb7d]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_Z8cmd_execB5cxx11PKc+0x100)[0x4200a4]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_Z11gargleblastii+0x9f)[0x4201e3]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_ZNSt12_Bind_simpleIFPFviiEiiEE9_M_invokeIILm0ELm1EEEEvSt12_Index_tupleIIXspT_EEE+0x5a)[0x42faea]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_ZNSt12_Bind_simpleIFPFviiEiiEEclEv+0x1d)[0x42f945]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_ZNSt6thread5_ImplISt12_Bind_simpleIFPFviiEiiEEE6_M_runEv+0x1c)[0x42f842]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f7b5e344c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f7b5de606ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f7b5db9641d]
======= Memory map: ========
<snip>
@linas
Member

linas commented Aug 2, 2018

That backtrace is a typical "double free or malloc pool corruption" bug. It happens because the dependencies between the various shared libs in the atomspace (and/or in opencog) are messed up. What's happening is that various shlib dtors are run twice (instead of once) because ... something somewhere gets confused. For example, deleting the cogserver might sometimes cause this. Deleting the attentionbank sometimes causes this.

The shared library dtors that run twice are usually the ones that create the atom types ... something causes the atom types to be deleted twice. You can see this directly by putting a printf in the dtor and watching it get printed twice, or by setting a gdb breakpoint. You can view both stack traces, and say to yourself that they both look perfectly valid, perfectly legit ways to get to the dtor, and so you have to poke around to see why ... and it's usually something goofy like deleting the cogserver causes the atomspace to be deleted, which causes the attentionbank to be deleted, which causes the atomspace to be deleted a second time ... stuff like that. It's done by the dtors because there are global singleton shared pointers that hold these objects, which means they cannot get deleted until the shared library is unloaded ... but by the time the shared library is unloaded, the methods on the C++ objects are already gone (because some other needed shared lib has been unloaded) ... circular dependency hell.

It's kind of tedious to track these down, and the (ugly) work-around is to just not delete things, and let glibc do the cleanup itself on process exit. In short: problems with the shared lib plumbing. Best handled by someone with a lot of spare time on their hands and lots of linker/loader experience (or who wants to learn how the linker/loader works, which is actually a rather large, hairy OS subsystem -- so not for the weak of heart).
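
As a concrete illustration of the printf-in-the-dtor trick described above, here is a minimal sketch; the class and variable names are hypothetical, not the actual atom-type registration code. A counter in the destructor makes the second pass immediately visible and gives gdb an obvious place to break.

    // Hypothetical sketch of the "printf in the dtor" debugging trick;
    // the names are made up, not the real opencog code.
    #include <cstdio>

    class AtomTypeRegistrationSketch
    {
    public:
        ~AtomTypeRegistrationSketch()
        {
            static int passes = 0;
            ++passes;
            fprintf(stderr, "atom-type dtor, pass #%d\n", passes);
            if (passes > 1)
                fprintf(stderr, "BUG: dtor ran twice; set a gdb breakpoint here\n");
        }
    };

    // One global instance; its dtor runs when the shared library is unloaded.
    static AtomTypeRegistrationSketch registration_sentinel;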

@linas
Member

linas commented Aug 2, 2018

Why it's intermittent, I'm not sure. It's possible (for example) that the cogserver gets one more command over the network, and so it thinks it has one more thing to do, but the atomspace is already gone. I thought I'd fixed all of these, by making sure that all network connections are closed, all command buffers flushed, and all threads done and exited, before allowing the cogserver to shut down. But I dunno. Maybe I missed a spot. Again -- it's really hairy code.

@linas
Member

linas commented Aug 2, 2018

Things like reworking the agent-runner/system activity table could trigger this, e.g. because you are holding a pointer to a system activity table when the singleton has already been deleted. Stuff like that. Did you ever see this bug before reworking the system activity table code?

@ngeiswei
Member

ngeiswei commented Oct 3, 2018

I have so far observed 3 types of failures:

  1. From testMultiStream

     terminate called after throwing an instance of 'concurrent_queue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>::Canceled'
       what():  std::exception

  2. From testDrain

     ShellUTest.cxxtest:78: Error: Expected (281 <= reso.size()), found (281 > 280)

  3. From testControlC

     ShellUTest.cxxtest:265: Error: Expected (end - start < 2), found (42 >= 2)

I've only observed that last one from CircleCI, never locally, but it's the most frequent cause of failure there unfortunately.

@linas
Member

linas commented Oct 3, 2018

Error 1 happens if some thread attempts to use the queue after it has been canceled. The queue is canceled when the network connection is being closed. Guess: maybe there is some race, with more traffic coming across the network even though it's being shut down.

Error 2: given the sizes, it would seem that the final carriage-return was not delivered. Why? Dunno. Another race.

Error 3: The control-C message should be two bytes long. Why is it 42 bytes long? Dunno .. that's a lot of bytes that should not be there. Maybe they are an error message? Bytes that should have gone to someone else? Printing them out would provide a clue.
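
One possible way to act on the "printing them out" suggestion is sketched below; the helper name and the way the reply is captured are assumptions, not the actual test code. Dumping the reply byte by byte would show whether the stray 40-odd bytes are an error message or data meant for another socket.

    // Hypothetical helper: dump whatever the control-C test actually received,
    // byte by byte, so the unexpected 42 bytes can be identified.
    #include <cstdio>
    #include <string>

    static void dump_reply(const std::string& reply)
    {
        fprintf(stderr, "got %zu bytes:", reply.size());
        for (unsigned char c : reply)
            fprintf(stderr, " %02x", c);
        fprintf(stderr, "\n");
    }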

@linas
Member

linas commented Oct 3, 2018

You might think that someone out there would have written a generic shell server that is fast, bullet-proof, supports login passwords and ssh, etc. Then we could replace the cogserver with that. But I was never able to find such a thing.

@ngeiswei
Member

ngeiswei commented Oct 8, 2018

Not sure it's related, but I noticed that ServerSocket::SetCloseAndDelete() doesn't always have time to complete, because the CogServer shuts down (leading the unit test to halt) before it does.

@ngeiswei
Member

ngeiswei commented Oct 8, 2018

OK, I understand why ServerSocket::SetCloseAndDelete() doesn't always complete: the handler thread is detached from the NetworkServer at its creation, see NetworkServer::listen():

        std::thread(&ConsoleSocket::handle_connection, ss).detach();

then, when the connection stops, the while loop in ServerSocket::handle_connection() breaks, and the socket eventually calls its own destruction:

    delete this;

So if the program ends before that line has completed, the socket never gets fully closed.

I have no idea, however, whether that could be a problem, nor whether it is related to this issue.

I've added a sleep(1) in ShellUTest::tearDown(), and obviously ServerSocket::SetCloseAndDelete() now always completes; however, I don't know whether that does anything to fix the bug (I ran the test 27 times without a crash, but that could be just luck, since the crash occurs rarely).
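
For what it's worth, the race being described can be sketched outside the cogserver entirely. The stand-alone program below is not the real ConsoleSocket/ServerSocket code, just an assumed shape of it; it shows why a detached, self-deleting handler can still be running when the process begins to shut down, and why a sleep in tearDown() only shrinks the window rather than closing it.

    // Stand-alone sketch of the detach + "delete this" race; all names are
    // hypothetical stand-ins for the actual socket classes.
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct SessionSketch
    {
        void handle_connection()
        {
            // ... the read/write loop would live here ...
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            printf("session cleanup\n");
            delete this;   // self-destruction, as in ServerSocket::handle_connection()
        }
    };

    int main()
    {
        SessionSketch* ss = new SessionSketch();
        std::thread(&SessionSketch::handle_connection, ss).detach();

        // If main() returns here, the detached thread may still be inside
        // handle_connection() while libraries are being torn down.  A sleep
        // before exiting makes that unlikely, but does not prevent it.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        return 0;
    }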

@linas
Member

linas commented Oct 8, 2018

Thanks @ngeiswei !! That's a very plausible explanation. The code was carefully crafted to try to avoid exactly this kind of race, but it is sufficiently complicated that perhaps a spot was missed. There are locks and flags and semaphores that get set and double-checked, to prevent out-of-order shutdown.

The only way I know of debugging this is decidedly old-school: put prints or loggers or traces at all the places that should be passed through during an orderly shutdown, and then decode what went out of order when the shutdown isn't orderly.
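
One possible shape for those prints, as a sketch (the macro name is made up, not an existing opencog logger call): a one-liner dropped at each step of the shutdown path, so the log shows which step ran out of order and on which thread.

    // Hypothetical trace macro for instrumenting the shutdown path.
    #include <cstdio>
    #include <pthread.h>

    #define SHUTDOWN_TRACE() \
        fprintf(stderr, "shutdown: %s:%d %s() tid=%#lx\n", \
                __FILE__, __LINE__, __func__, (unsigned long) pthread_self())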

@linas
Member

linas commented Oct 8, 2018

The cogserver destructor is supposed to wait for the network server destructor to finish. The network server destructor is supposed to wait for all threads to join. The threads are not supposed to join until they are all done. So, in principle, everything should have waited for ServerSocket to finish. ...
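
That intended ordering can be summarized with a small sketch; the class below is hypothetical, not the actual NetworkServer. The owner keeps its worker threads joinable and its destructor blocks until every handler has finished, so nothing further down can be torn down early.

    // Hypothetical sketch of the intended shutdown order: join, don't detach.
    #include <functional>
    #include <thread>
    #include <vector>

    class NetworkServerSketch
    {
        std::vector<std::thread> _workers;
    public:
        void spawn(std::function<void()> handler)
        {
            _workers.emplace_back(std::move(handler));
        }
        ~NetworkServerSketch()
        {
            // Wait for every connection handler before allowing the
            // cogserver (and the atomspace behind it) to be destroyed.
            for (std::thread& t : _workers)
                if (t.joinable()) t.join();
        }
    };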

@linas
Member

linas commented Dec 1, 2019

Testing: I've got zero failures on 30 repeats over 90 minutes. This is with boost-1.62.

@linas
Member

linas commented Dec 1, 2019

I'm on Debian stable: gcc --version reports gcc (Debian 8.3.0-6) 8.3.0, ldd --version reports ldd (Debian GLIBC 2.28-10) 2.28, and the boost version is 1.62.

I'm wondering if it's a glibc bug or a boost::asio bug.

@linas
Member

linas commented Dec 1, 2019

5 failures out of 22 repeats with boost-1.67 -- so I guess it's a boost bug. I'll double-check with boost-1.62 ... crap. Debian no longer has boost 1.62 ... Can't test.

@linas
Member

linas commented Dec 1, 2019

The boost-1.67 crashes include at least 4 different kinds, among them corrupted size vs. prev_size, Segmentation fault, and double free detected in tcache.

Seems that Debian ONLY has boost-1.67, and without some easy way of getting back to a working version of boost, I'm not sure what to do ...

@linas
Member

linas commented Dec 1, 2019

Presumably, it's a boost::asio bug ... boost::asio has been a headache for 5+ years ... the easiest solution would be to re-write the code to just not use it ... which maybe isn't hard ...?

@linas
Member

linas commented Dec 2, 2019

OK, as a boost asio replacement, there is just plain-old asio: in Debian, libasio-dev; otherwise https://think-async.com/Asio/ ... but it seems like boost asio is just a copy/fork of this. Given how buggy and unstable asio has been over the past 5+ years, this feels like the wrong direction to go in...

@linas
Member

linas commented Dec 2, 2019

If we instead assume that asio is not buggy, and that it is throwing some new kind of error that we are not yet handling, then network/ServerSocket.cc is the place to look. It already handles nine different kinds of asio errors; maybe there's a new, tenth kind of error ...
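
A hedged sketch of that idea is below; this is not the actual ServerSocket.cc code, and the list of "known" errors is illustrative only. The point is to compare the error_code against the already-handled cases and log anything unrecognized, so a hypothetical tenth error at least shows up in the test output.

    // Sketch: surface any boost::asio error that isn't one of the expected
    // shutdown-path errors.  The set of "known" errors here is illustrative,
    // not the actual list handled in network/ServerSocket.cc.
    #include <boost/asio.hpp>
    #include <cstdio>

    static void report_unexpected_error(const boost::system::error_code& ec)
    {
        namespace aerr = boost::asio::error;
        if (!ec) return;
        if (ec == aerr::eof ||
            ec == aerr::connection_reset ||
            ec == aerr::operation_aborted ||
            ec == aerr::broken_pipe)
            return;   // expected when the peer or the server shuts down

        fprintf(stderr, "unexpected asio error %d: %s\n",
                ec.value(), ec.message().c_str());
    }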

linas added a commit to linas/atomspace that referenced this issue Dec 10, 2019
Disable, until opencog/cogserver#5 is fixed. The failing ShellUTest makes circleci report false negatives.
@vsbogd
Collaborator

vsbogd commented Jun 15, 2020

Is this issue still relevant?

@linas
Member

linas commented Jun 15, 2020

Well, boost asio is certainly still as buggy as ever. Debian-stable is still on boost-1.67 ... I'd like to see all uses of boost in opencog removed, but it's a long, hard task...

@vsbogd
Collaborator

vsbogd commented Jun 15, 2020

I mean, does this issue still occur? I haven't seen it for a long time.

@linas
Member

linas commented Jun 15, 2020

Well, in an earlier comment above I put it into a loop and saw 5 crashes out of 22 runs. Nothing has changed since then ... I can try the test again. Here's my shell script:

#! /bin/bash

date
cnt=0
for i in `seq 1 1000000`;
do
    ./tests/shell/ShellUTest   >>/dev/null
    # ./tests/scm/MultiThreadUTest      >> /dev/null
    # ./tests/rule-engine/backwardchainer/BackwardChainerUTest  >> /dev/null
    rc=$?
    if [ $rc != 0 ]; then
        cnt=$((cnt+1))
        date
        # Halt on the first crash, save the logfile
        # exit 1;
    fi
    #if [ $(($i % 10)) == 0 ]; then
       echo duude iter $i result ---------  rc = $rc  ncrashes = $cnt
    #fi
done    

I re-ran this, and am getting 8 crashes out of 48 runs, a failure rate of about 1 in 6. This is Debian stable, boost 1.67, gcc 8.3.0, and glibc 2.28.

@linas
Member

linas commented Jun 15, 2020

Other intermittent test failures include the pattern-miner SurprisingnessUTest ... between that and the rule-engine, I have to hit the circle-ci "re-run unit tests" button two to four times before I can get a run that passes all tests. The intermittent failures are very annoying ...

linas pushed a commit that referenced this issue Jul 4, 2021