crash at LogCabin::RPC::MessageSocket::read (RPC/MessageSocket.cc:292) #66

Closed
nhardt opened this Issue Dec 3, 2014 · 8 comments

@nhardt
Contributor

nhardt commented Dec 3, 2014

on commit 6077f82 (Set default port for clients to 61023)

Steps to reproduce:

configure a 3-node cluster
-- /etc/hosts on all nodes looks like this:
[root@scale-192-168-43-25 ~ ]# cat /etc/hosts
...
192.168.43.25 scale-192-168-43-25 logcabin
192.168.43.27 scale-192-168-43-27 logcabin
192.168.43.26 scale-192-168-43-26 logcabin

LogCabin --bootstrap --id 1
(onAll) LogCabin --id X
Reconfigure '192.168.43.25:61023;192.168.43.26:61023;192.168.43.27:61023'

-- while Reconfigure is running, the nodes it is talking to crash with the following:

(gdb) bt
#0  0x00000031fc432635 in raise () from /lib64/libc.so.6
#1  0x00000031fc433e15 in abort () from /lib64/libc.so.6
#2  0x00000000004a9989 in LogCabin::RPC::MessageSocket::read (this=0x7fffd40026e0, buf=0x7fffd40026f8, maxBytes=12) at build/RPC/MessageSocket.cc:292
#3  0x00000000004a9657 in LogCabin::RPC::MessageSocket::readable (this=0x7fffd40026e0) at build/RPC/MessageSocket.cc:229
#4  0x00000000004a8ecf in LogCabin::RPC::MessageSocket::ReceiveSocket::handleFileEvent (this=0x7fffd40027b8, events=16) at build/RPC/MessageSocket.cc:102
#5  0x00000000004b93ab in LogCabin::Event::Loop::runForever (this=0x7fffffffe2b0) at build/Event/Loop.cc:152
#6  0x0000000000440d96 in LogCabin::Server::Globals::run (this=0x7fffffffe270) at build/Server/Globals.cc:136
#7  0x000000000041b724 in main (argc=5, argv=0x7fffffffe588) at build/Server/Main.cc:128

(errno was 107, ENOTCONN)

@ongardie ongardie added the bug label Dec 3, 2014

@ongardie

Member

ongardie commented Dec 9, 2014

I don't think I can repro this the same way on the current master branch due to 83733fe. Still, I don't think the underlying problem has been fixed, so it's worth going back to 6077f82 and tracking this one down.

@ongardie ongardie self-assigned this Dec 9, 2014

@ongardie

Member

ongardie commented Dec 9, 2014

Still wasn't able to reproduce. Tried 3 servers on localhost on:

The leader (server 1) would happily reconfigure to something that looked like:

1418092989.730796 Server/RaftConsensus.cc:508 in setConfiguration() NOTICE[1:LeaderDisk]: Activating configuration 4:
prev_configuration {
  servers {
    server_id: 1
    address: "192.168.2.1:61023;192.168.2.3:61023;192.168.2.2:61023"
  }
}

The other servers would get stuck on:

1418092986.080155 Server/RaftConsensus.cc:833 in init() NOTICE[3:evloop]: No configuration, waiting to receive one.

And Reconfigure would exit with status 0:

~/logcabin:issue-66$ ./build/Examples/Reconfigure '192.168.2.1:61023;192.168.2.3:61023;192.168.2.2:61023'
Configuration 1:
- 1: 192.168.2.1:61023

Reconfiguration OK
Configuration 4:
- 1: 192.168.2.1:61023;192.168.2.3:61023;192.168.2.2:61023

Maybe it takes running this on distinct servers to get the crash? Or maybe I'm just doing something differently.

@ongardie

Member

ongardie commented Dec 11, 2014

ok, got this to repro on a cluster with @nhardt. It crashes the non-leader servers, with the log looking like:

1418255976.203230 Server/RaftConsensus.cc:833 in init() NOTICE[2:evloop]: No configuration, waiting to receive one.
1418256260.690257 Server/RaftConsensus.cc:1020 in handleAppendEntries() NOTICE[2:RaftService(8)]: All hail leader 1 for term 2
1418256260.692106 Server/RaftConsensus.cc:1180 in handleAppendSnapshotChunk() NOTICE[2:RaftService(8)]: Loading in new snapshot from leader
1418256260.788653 Server/RaftConsensus.cc:2007 in readSnapshot() NOTICE[2:RaftService(8)]: Reading snapshot which covers log entries 1 through 9 (inclusive)
1418256260.907996 Server/RaftConsensus.cc:508 in setConfiguration() NOTICE[2:RaftService(8)]: Activating configuration 9:
prev_configuration {
  servers {
    server_id: 1
    address: "192.168.51.30:61023;192.168.51.31:61023;192.168.51.32:61023"
  }
}
next_configuration {
  servers {
    server_id: 1
    address: "192.168.51.30:61023 192.168.51.31:61023 192.168.51.32:61023"
  }
}

1418256260.908193 Server/StateMachine.cc:191 in applyThreadMain() NOTICE[2:StateMachine]: Loading snapshot through entry 9 into state machine
1418256260.979215 Server/RaftConsensus.cc:2145 in startNewElection() NOTICE[2:startNewElection]: Running for election in term 3
1418256261.050477 RPC/Address.cc:188 in refresh() WARNING[2:Peer(1)]: Unknown error from getaddrinfo("192.168.51.30:61023 192.168.51.31:61023 192.168.51.32", "61023"): Temporary failure in name resolution
1418256261.050639 RPC/MessageSocket.cc:292 in read() ERROR[2:evloop]: Error while reading from socket: Transport endpoint is not connected Exiting...

That was after passing a quoted, space-delimited address list to Reconfigure:

$ Reconfigure '192.168.51.30:61023 192.168.51.31:61023 192.168.51.32:61023'

So it seems that the wrong path is taken once RPC::Address fails to resolve the address string.
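The resolution failure can be reproduced in isolation: handing the whole space-delimited cluster string to getaddrinfo() as one host name cannot resolve. A minimal illustration (not LogCabin code; RPC::Address does its own host/port splitting around getaddrinfo(), which is why the log shows the string truncated at the last colon):

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void) {
    /* The entire space-delimited list ends up treated as a single host
       name, as in the warning from RPC/Address.cc:188 above. */
    const char *bogus_host =
        "192.168.51.30:61023 192.168.51.31:61023 192.168.51.32";
    struct addrinfo hints;
    struct addrinfo *result = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    int rc = getaddrinfo(bogus_host, "61023", &hints, &result);
    printf("getaddrinfo: %s\n", rc == 0 ? "resolved" : gai_strerror(rc));
    assert(rc != 0); /* a "host name" containing spaces cannot resolve */
    if (result != NULL)
        freeaddrinfo(result);
    return 0;
}
```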

@ongardie

Member

ongardie commented Dec 11, 2014

Backtrace during the name resolution warning:

#0  LogCabin::RPC::Address::refresh (this=0x7ffff62369a0) at build/RPC/Address.cc:190
#1  0x00000000004a2b52 in LogCabin::RPC::Address::Address (this=0x7ffff62369a0, str=
    "192.168.51.30:61023 192.168.51.31:61023 192.168.51.32:61023", defaultPort=61023) at build/RPC/Address.cc:60
#2  0x000000000041c6cb in LogCabin::Server::RaftConsensusInternal::Peer::getSession (this=0x74b5b0, lockGuard=...)
    at build/Server/RaftConsensus.cc:295
#3  0x000000000041c0fb in LogCabin::Server::RaftConsensusInternal::Peer::callRPC (this=0x74b5b0, 
    opCode=LogCabin::Protocol::Raft::REQUEST_VOTE, request=..., response=..., lockGuard=...) at build/Server/RaftConsensus.cc:249
#4  0x0000000000425f49 in LogCabin::Server::RaftConsensusInternal::RaftConsensus::requestVote (this=0x7467d0, lockGuard=..., peer=...)
    at build/Server/RaftConsensus.cc:2091
#5  0x0000000000423186 in LogCabin::Server::RaftConsensusInternal::RaftConsensus::peerThreadMain (this=0x7467d0, 
    peer=std::shared_ptr (count 6) 0x74b5b0) at build/Server/RaftConsensus.cc:1545
#6  0x000000000043a4e5 in std::_Mem_fn<void (LogCabin::Server::RaftConsensusInternal::RaftConsensus::*)(std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)>::operator() (this=0x74e898, __object=0x7467d0, __args#0=std::shared_ptr (count 6) 0x74b5b0)
    at /usr/include/c++/4.4.7/tr1_impl/functional:552
#7  0x000000000043a35b in std::_Bind<std::_Mem_fn<void (LogCabin::Server::RaftConsensusInternal::RaftConsensus::*)(std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)>(LogCabin::Server::RaftConsensusInternal::RaftConsensus*, std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)>::__call<, 0, 1>(const std::tuple<> &, std::_Index_tuple<0, 1>) (this=0x74e898, __args=Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 262, in children
    return self._iterator (self.val)
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 220, in __init__
    raise "Top of tuple tree does not consist of a single node."
TypeError: exceptions must be old-style classes or derived from BaseException, not str
std::tuple containing) at /usr/include/c++/4.4.7/tr1_impl/functional:1137
#8  0x000000000043a267 in std::_Bind<std::_Mem_fn<void (LogCabin::Server::RaftConsensusInternal::RaftConsensus::*)(std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)>(LogCabin::Server::RaftConsensusInternal::RaftConsensus*, std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)>::operator()<>(void) (this=0x74e898) at /usr/include/c++/4.4.7/tr1_impl/functional:1191
#9  0x000000000043a0e4 in _Impl<std::_Bind<std::_Mem_fn<void (LogCabin::Server::RaftConsensusInternal::RaftConsensus::*)(std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)>(LogCabin::Server::RaftConsensusInternal::RaftConsensus*, std::shared_ptr<LogCabin::Server::RaftConsensusInternal::Peer>)> >::_M_run(void) (this=0x74e880) at /usr/include/c++/4.4.7/thread:114
#10 0x00007ffff6d32470 in ?? () from /usr/lib64/libstdc++.so.6
#11 0x00007ffff7bc79d1 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff653686d in clone () from /lib64/libc.so.6

@ongardie

Member

ongardie commented Dec 11, 2014

Ah, and these things are actually happening concurrently:

Peer thread:

Thread 2 (Thread 0x7ffff6237700 (LWP 21570)):
#0  LogCabin::Server::RaftConsensusInternal::Peer::getSession (this=0x74b5b0, lockGuard=...) at build/Server/RaftConsensus.cc:297
#1  0x000000000041c0fb in LogCabin::Server::RaftConsensusInternal::Peer::callRPC (this=0x74b5b0, 
    opCode=LogCabin::Protocol::Raft::REQUEST_VOTE, request=..., response=..., lockGuard=...) at build/Server/RaftConsensus.cc:249
#2  0x0000000000425f49 in LogCabin::Server::RaftConsensusInternal::RaftConsensus::requestVote (this=0x7467d0, lockGuard=..., peer=...)
    at build/Server/RaftConsensus.cc:2091
#3  0x0000000000423186 in LogCabin::Server::RaftConsensusInternal::RaftConsensus::peerThreadMain (this=0x7467d0, 
    peer=std::shared_ptr (count 6) 0x74b5b0) at build/Server/RaftConsensus.cc:1545

Event loop thread:

Thread 1 (Thread 0x7ffff7fec760 (LWP 21567)):
#0  0x00007ffff6480635 in raise () from /lib64/libc.so.6
#1  0x00007ffff6481e15 in abort () from /lib64/libc.so.6
#2  0x00000000004a9989 in LogCabin::RPC::MessageSocket::read (this=0x7fffe4002670, buf=0x7fffe4002688, maxBytes=12)
    at build/RPC/MessageSocket.cc:292
#3  0x00000000004a9657 in LogCabin::RPC::MessageSocket::readable (this=0x7fffe4002670) at build/RPC/MessageSocket.cc:229
#4  0x00000000004a8ecf in LogCabin::RPC::MessageSocket::ReceiveSocket::handleFileEvent (this=0x7fffe4002748, events=16)
    at build/RPC/MessageSocket.cc:102
#5  0x00000000004b93ab in LogCabin::Event::Loop::runForever (this=0x7fffffffe2f0) at build/Event/Loop.cc:152
#6  0x0000000000440d96 in LogCabin::Server::Globals::run (this=0x7fffffffe2b0) at build/Server/Globals.cc:136
#7  0x000000000041b724 in main (argc=5, argv=0x7fffffffe5c8) at build/Server/Main.cc:128

@ongardie

Member

ongardie commented Dec 11, 2014

Seems that the Address comes back invalid, but the ClientSession goes ahead and uses it anyway, calling connect() with a 0-length sockaddr. That should return -1 with EINVAL, and all should be OK.

Strangely, that's the behavior I see when running under strace/valgrind (with infinite retries), but I get the "Transport endpoint is not connected" error without strace/valgrind.

If I'm not mistaken, connect() is returning 0 under gdb, so we go on to create the ClientMessageSocket, resulting in a crash. Next step is probably to try connect(fd, NULL, 0) in a C program and see what happens.

@ongardie

Member

ongardie commented Dec 11, 2014

Yeah, connect() to a 0-length sockaddr returns 0 when not run through strace on this box:

[root@scale-192-168-51-31 ~ ]# cat connect.c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        fprintf(stderr,  "Failed to create socket: %s\n", strerror(errno));
        return 1;
    }

    int r = connect(fd, NULL, 0);
    if (r < 0) {
        fprintf(stderr,  "Failed to connect: %s\n", strerror(errno));
        return 1;
    } else {
        fprintf(stderr,  "Connect succeeded?\n");
    }

    return 0;
}
[root@scale-192-168-51-31 ~ ]# gcc -Wall connect.c
[root@scale-192-168-51-31 ~ ]# ./a.out
Connect succeeded?
[root@scale-192-168-51-31 ~ ]# strace ./a.out
execve("./a.out", ["./a.out"], [/* 25 vars */]) = 0
brk(0)                                  = 0x15f6000
[...snip...]
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, NULL, 0)                     = -1 EINVAL (Invalid argument)
write(2, "Failed to connect: Invalid argum"..., 36Failed to connect: Invalid argument
) = 36
exit_group(1)                           = ?
[root@scale-192-168-51-31 ~ ]# 

On my laptop, OTOH, I always get invalid argument.

@ongardie

Member

ongardie commented Dec 11, 2014

I can't find any reference to this bug or change around the interwebs. For the record, the server with the problem is running glibc 2.12, kernel 2.6.32 (something like RHEL 6.4 or 6.5). My laptop without the problem is running glibc 2.19, kernel 3.16 (Debian).

I'll add an if statement with a big comment in the code.

@ongardie ongardie closed this in 8011bf4 Dec 13, 2014
