ShellUTest errors sometimes with Debug build #5

Open
ferrouswheel opened this issue Aug 2, 2018 · 22 comments

@ferrouswheel
Contributor

I get sporadic failures with ShellUTest, but have only noticed them with a Debug build.

Compiler is the standard one for Ubuntu 16.04: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

Maybe related to opencog/opencog#2542? Though a cause was found and fixed in that issue.

The output is:

Start  5: ShellUTest
*** Error in `/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest': corrupted size vs. prev_size: 0x00007f7b38000a60 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f7b5db067e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7f7b5db0d9dc]
/lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7f7b5db10cde]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f7b5db13184]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_doallocate+0x55)[0x7f7b5dafc1d5]
/lib/x86_64-linux-gnu/libc.so.6(_IO_doallocbuf+0x34)[0x7f7b5db0a594]
/lib/x86_64-linux-gnu/libc.so.6(_IO_file_underflow+0x1fc)[0x7f7b5db0969c]
/lib/x86_64-linux-gnu/libc.so.6(_IO_default_uflow+0xe)[0x7f7b5db0a60e]
/lib/x86_64-linux-gnu/libc.so.6(_IO_getline_info+0xaa)[0x7f7b5dafdc6a]
/lib/x86_64-linux-gnu/libc.so.6(fgets+0xad)[0x7f7b5dafcb7d]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_Z8cmd_execB5cxx11PKc+0x100)[0x4200a4]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_Z11gargleblastii+0x9f)[0x4201e3]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_ZNSt12_Bind_simpleIFPFviiEiiEE9_M_invokeIILm0ELm1EEEEvSt12_Index_tupleIIXspT_EEE+0x5a)[0x42faea]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_ZNSt12_Bind_simpleIFPFviiEiiEEclEv+0x1d)[0x42f945]
/home/joel/work/opencog/opencog/build_debug/tests/server/ShellUTest(_ZNSt6thread5_ImplISt12_Bind_simpleIFPFviiEiiEEE6_M_runEv+0x1c)[0x42f842]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f7b5e344c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f7b5de606ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f7b5db9641d]
======= Memory map: ========
<snip>
@linas
Member

linas commented Aug 2, 2018

That backtrace is a typical "double free or malloc pool corruption" bug. It happens because the dependencies between the various shared libs in the atomspace (and/or in opencog) are messed up. What's happening is that various shlib dtors are run twice (instead of once) because ... something somewhere gets confused. For example, deleting the cogserver might sometimes cause this. Deleting the attentionbank sometimes causes this.

The shared library dtors that run twice are usually the ones that create the atom types ... something causes the atom types to be deleted twice. You can see this directly by putting a printf in the dtor and watching it get printed twice, or by setting a gdb breakpoint. You can view both stack traces, and say to yourself that they both look perfectly valid, perfectly legit ways to get to the dtor, and so you have to poke around to see why ... and it's usually something goofy like deleting the cogserver causes the atomspace to be deleted, which causes the attentionbank to be deleted, which causes the atomspace to be deleted a second time ... stuff like that. It's done by the dtors because there are global singleton shared pointers that hold these objects, which means they cannot get deleted until the shared library is unloaded ... but by the time the shared library is unloaded, the methods on the C++ objects are already gone (because some other needed shared lib has been unloaded) ... circular dependency hell.

It's kind of tedious to track these down, and the (ugly) work-around is to just not delete things, and let glibc do the cleanup itself on process exit. In short: problems with the shared lib plumbing. Best handled by someone with a lot of spare time on their hands and lots of linker/loader experience (or who wants to learn how the linker/loader works, which is actually a rather large, hairy OS subsystem -- so not for the weak of heart).
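
As a concrete illustration of the printf-in-the-dtor trick described above, here is a minimal sketch; the class and variable names are hypothetical, not the actual atom-type registration code. A counter in the destructor makes the second pass immediately visible and gives gdb an obvious place to break.

    // Hypothetical sketch of the "printf in the dtor" debugging trick;
    // the names are made up, not the real opencog code.
    #include <cstdio>

    class AtomTypeRegistrationSketch
    {
    public:
        ~AtomTypeRegistrationSketch()
        {
            static int passes = 0;
            ++passes;
            fprintf(stderr, "atom-type dtor, pass #%d\n", passes);
            if (passes > 1)
                fprintf(stderr, "BUG: dtor ran twice; set a gdb breakpoint here\n");
        }
    };

    // One global instance; its dtor runs when the shared library is unloaded.
    static AtomTypeRegistrationSketch registration_sentinel;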

@linas
Member

linas commented Aug 2, 2018

Why it's intermittent, I'm not sure. It's possible (for example) that the cogserver gets one more command over the network, and so it thinks it has one more thing to do, but the atomspace is already gone. I thought I'd fixed all of these, by making sure that all network connections are closed, all command buffers flushed, and all threads done and exited, before allowing the cogserver to shut down. But I dunno. Maybe I missed a spot. Again -- it's really hairy code.

@linas
Member

linas commented Aug 2, 2018

Things like reworking the agent-runner/system activity table could trigger this, e.g. because you are holding a pointer to a system activity table when the singleton has already been deleted. Stuff like that. Did you ever see this bug before reworking the system activity table code?

@ngeiswei
Member

ngeiswei commented Oct 3, 2018

I have so far observed 3 types of failures:

  1. From testMultiStream

     terminate called after throwing an instance of 'concurrent_queue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>::Canceled'
       what():  std::exception

  2. From testDrain

     ShellUTest.cxxtest:78: Error: Expected (281 <= reso.size()), found (281 > 280)

  3. From testControlC

     ShellUTest.cxxtest:265: Error: Expected (end - start < 2), found (42 >= 2)

I've only observed that last one from CircleCI, never locally, but it's the most frequent cause of failure there unfortunately.

@linas
Member

linas commented Oct 3, 2018

Error 1 happens if some thread attempts to use the queue after it has been canceled. The queue is canceled when the network connection is being closed. Guess: maybe there is some race, with more traffic coming across the network even though it's being shut down.

Error 2: given the sizes, it would seem that the final carriage-return was not delivered. Why? Dunno. Another race.

Error 3: The control-C message should be two bytes long. Why is it 42 bytes long? Dunno .. that's a lot of bytes that should not be there. Maybe they are an error message? Bytes that should have gone to someone else? Printing them out would provide a clue.
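
One possible way to act on the "printing them out" suggestion is sketched below; the helper name and the way the reply is captured are assumptions, not the actual test code. Dumping the reply byte by byte would show whether the stray 40-odd bytes are an error message or data meant for another socket.

    // Hypothetical helper: dump whatever the control-C test actually received,
    // byte by byte, so the unexpected 42 bytes can be identified.
    #include <cstdio>
    #include <string>

    static void dump_reply(const std::string& reply)
    {
        fprintf(stderr, "got %zu bytes:", reply.size());
        for (unsigned char c : reply)
            fprintf(stderr, " %02x", c);
        fprintf(stderr, "\n");
    }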

@linas
Member

linas commented Oct 3, 2018

You might think that someone out there would have written a generic shell server that is fast, bullet-proof, supports login passwords and ssh, etc. Then we could replace the cogserver with that. But I was never able to find such a thing.

@ngeiswei
Member

ngeiswei commented Oct 8, 2018

Not sure it's related, but I noticed that ServerSocket::SetCloseAndDelete() doesn't always have time to complete, because the CogServer shuts down (leading the unit test to halt) before it does.

@ngeiswei
Member

ngeiswei commented Oct 8, 2018

OK, I understand why ServerSocket::SetCloseAndDelete() doesn't always complete: the handler thread is detached from the NetworkServer at its creation, see NetworkServer::listen():

        std::thread(&ConsoleSocket::handle_connection, ss).detach();

then, when the connection stops, the while loop in ServerSocket::handle_connection() breaks, and the socket eventually calls its own destruction:

    delete this;

So if the program ends before that line has completed, the socket never gets fully closed.

I have no idea, however, whether that could be a problem, nor whether it is related to this issue.

I've added a sleep(1) in ShellUTest::tearDown(), and obviously ServerSocket::SetCloseAndDelete() now always completes; however, I don't know whether that does anything to fix the bug (I ran the test 27 times without a crash, but that could be just luck, since the crash occurs rarely).
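
For what it's worth, the race being described can be sketched outside the cogserver entirely. The stand-alone program below is not the real ConsoleSocket/ServerSocket code, just an assumed shape of it; it shows why a detached, self-deleting handler can still be running when the process begins to shut down, and why a sleep in tearDown() only shrinks the window rather than closing it.

    // Stand-alone sketch of the detach + "delete this" race; all names are
    // hypothetical stand-ins for the actual socket classes.
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct SessionSketch
    {
        void handle_connection()
        {
            // ... the read/write loop would live here ...
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            printf("session cleanup\n");
            delete this;   // self-destruction, as in ServerSocket::handle_connection()
        }
    };

    int main()
    {
        SessionSketch* ss = new SessionSketch();
        std::thread(&SessionSketch::handle_connection, ss).detach();

        // If main() returns here, the detached thread may still be inside
        // handle_connection() while libraries are being torn down.  A sleep
        // before exiting makes that unlikely, but does not prevent it.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        return 0;
    }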

@linas
Member

linas commented Oct 8, 2018

Thanks @ngeiswei !! That's a very plausible explanation. The code was carefully crafted to try to avoid exactly this kind of race, but it is sufficiently complicated that perhaps a spot was missed. There are locks and flags and semaphores that get set and double-checked, to prevent out-of-order shutdown.

The only way I know of debugging this is decidedly old-school: put prints or loggers or traces at all the places that should be passed through during an orderly shutdown, and then decode what went out of order when the shutdown isn't orderly.
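
One possible shape for those prints, as a sketch (the macro name is made up, not an existing opencog logger call): a one-liner dropped at each step of the shutdown path, so the log shows which step ran out of order and on which thread.

    // Hypothetical trace macro for instrumenting the shutdown path.
    #include <cstdio>
    #include <pthread.h>

    #define SHUTDOWN_TRACE() \
        fprintf(stderr, "shutdown: %s:%d %s() tid=%#lx\n", \
                __FILE__, __LINE__, __func__, (unsigned long) pthread_self())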

@linas
Member

linas commented Oct 8, 2018

The cogserver destructor is supposed to wait for the network server destructor to finish. The network server destructor is supposed to wait for all threads to join. The threads are not supposed to join until they are all done. So, in principle, everything should have waited for ServerSocket to finish. ...
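
That intended ordering can be summarized with a small sketch; the class below is hypothetical, not the actual NetworkServer. The owner keeps its worker threads joinable and its destructor blocks until every handler has finished, so nothing further down can be torn down early.

    // Hypothetical sketch of the intended shutdown order: join, don't detach.
    #include <functional>
    #include <thread>
    #include <vector>

    class NetworkServerSketch
    {
        std::vector<std::thread> _workers;
    public:
        void spawn(std::function<void()> handler)
        {
            _workers.emplace_back(std::move(handler));
        }
        ~NetworkServerSketch()
        {
            // Wait for every connection handler before allowing the
            // cogserver (and the atomspace behind it) to be destroyed.
            for (std::thread& t : _workers)
                if (t.joinable()) t.join();
        }
    };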

@linas
Member

linas commented Dec 1, 2019

Testing: I've got zero failures on 30 repeats over 90 minutes. This is with boost-1.62.

@linas
Member

linas commented Dec 1, 2019

I'm on Debian stable: gcc --version reports gcc (Debian 8.3.0-6) 8.3.0, ldd --version reports ldd (Debian GLIBC 2.28-10) 2.28, and the boost version is 1.62.

I'm wondering if it's a glibc bug or a boost::asio bug.

@linas
Member

linas commented Dec 1, 2019

5 failures out of 22 repeats with boost-1.67 -- so I guess it's a boost bug. I'll double-check with boost-1.62 ... crap. Debian no longer has boost 1.62 ... Can't test.

@linas
Member

linas commented Dec 1, 2019

The boost-1.67 crashes include at least 4 different kinds, among them corrupted size vs. prev_size, Segmentation fault, and double free detected in tcache.

Seems that Debian ONLY has boost-1.67, and without some easy way of getting back to a working version of boost, I'm not sure what to do ...

@linas
Member

linas commented Dec 1, 2019

Presumably, it's a boost::asio bug ... boost::asio has been a headache for 5+ years ... the easiest solution would be to re-write the code to just not use it ... which maybe isn't hard ...?

@linas
Member

linas commented Dec 2, 2019

OK, as a boost asio replacement, there is just plain-old asio: in Debian, libasio-dev; otherwise https://think-async.com/Asio/ ... but it seems like boost asio is just a copy/fork of this. Given how buggy and unstable asio has been over the past 5+ years, this feels like the wrong direction to go in...

@linas
Member

linas commented Dec 2, 2019

If we instead assume that asio is not buggy, and that it is throwing some new kind of error that we are not yet handling, then network/ServerSocket.cc is the place to look. It already handles nine different kinds of asio errors; maybe there's a new, tenth kind of error ...
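
A hedged sketch of that idea is below; this is not the actual ServerSocket.cc code, and the list of "known" errors is illustrative only. The point is to compare the error_code against the already-handled cases and log anything unrecognized, so a hypothetical tenth error at least shows up in the test output.

    // Sketch: surface any boost::asio error that isn't one of the expected
    // shutdown-path errors.  The set of "known" errors here is illustrative,
    // not the actual list handled in network/ServerSocket.cc.
    #include <boost/asio.hpp>
    #include <cstdio>

    static void report_unexpected_error(const boost::system::error_code& ec)
    {
        namespace aerr = boost::asio::error;
        if (!ec) return;
        if (ec == aerr::eof ||
            ec == aerr::connection_reset ||
            ec == aerr::operation_aborted ||
            ec == aerr::broken_pipe)
            return;   // expected when the peer or the server shuts down

        fprintf(stderr, "unexpected asio error %d: %s\n",
                ec.value(), ec.message().c_str());
    }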

linas added a commit to linas/atomspace that referenced this issue Dec 10, 2019
Disable, until opencog/cogserver#5 is fixed. The failing ShellUTest makes circleci report false negatives.
@vsbogd
Collaborator

vsbogd commented Jun 15, 2020

Is this issue still relevant?

@linas
Member

linas commented Jun 15, 2020

Well, boost asio is certainly still as buggy as ever. Debian-stable is still on boost-1.67 ... I'd like to see all uses of boost in opencog removed, but it's a long, hard task...

@vsbogd
Collaborator

vsbogd commented Jun 15, 2020

I mean, does this issue still occur? I haven't seen it for a long time.

@linas
Member

linas commented Jun 15, 2020

Well, in an earlier comment above I put it into a loop and saw 5 crashes out of 22 runs. Nothing has changed since then ... I can try the test again. Here's my shell script:

#! /bin/bash

date
cnt=0
for i in `seq 1 1000000`;
do
    ./tests/shell/ShellUTest   >>/dev/null
    # ./tests/scm/MultiThreadUTest      >> /dev/null
    # ./tests/rule-engine/backwardchainer/BackwardChainerUTest  >> /dev/null
    rc=$?
    if [ $rc != 0 ]; then
        cnt=$((cnt+1))
        date
        # Halt on the first crash, save the logfile
        # exit 1;
    fi
    #if [ $(($i % 10)) == 0 ]; then
       echo duude iter $i result ---------  rc = $rc  ncrashes = $cnt
    #fi
done    

I re-ran this, and am getting 8 crashes out of 48 runs, a failure rate of about 1 in 6. This is Debian stable, boost 1.67, gcc 8.3.0, and glibc 2.28.

@linas
Member

linas commented Jun 15, 2020

Other intermittent test failures include the pattern-miner SurprisingnessUTest ... between that and the rule-engine, I have to hit the circle-ci "re-run unit tests" button two to four times before I can get a run that passes all tests. The intermittent failures are very annoying ...

linas pushed a commit that referenced this issue Jul 4, 2021