
Server can't join cluster #300

Closed
elcct opened this issue Mar 5, 2014 · 3 comments

@elcct
Contributor

elcct commented Mar 5, 2014

Not sure if I did this right, but I wanted to test the scenario where one of the servers in a 2-node cluster dies.

  1. Stopped server2
  2. Deleted /data and /tmp
  3. Started server2

server2 crashes immediately with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x10 pc=0x594e51]

goroutine 494 [running]:
runtime.panic(0x872e60, 0xfde6a8)
    /home/vagrant/bin/go/src/pkg/runtime/panic.c:266 +0xb6
engine.(*QueryEngine).runAggregates(0xc2106a5280)
    /home/vagrant/influxdb/src/engine/engine.go:463 +0x251
engine.(*QueryEngine).Close(0xc2106a5280)
    /home/vagrant/influxdb/src/engine/engine.go:220 +0x1a1
cluster.(*ShardData).Query(0xc21015e000, 0xc21010fbd0, 0xc210b01000, 0x2, 0x2)
    /home/vagrant/influxdb/src/cluster/shard.go:213 +0x27a
created by coordinator.(*CoordinatorImpl).runQuerySpec
    /home/vagrant/influxdb/src/coordinator/coordinator.go:257 +0x34d

http://pastebin.com/4hrdM0JS
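The trace shows `Close` calling `runAggregates` on a `QueryEngine` that dereferences a nil field, presumably because the engine never finished initializing when the fresh server2 rejoined. A minimal sketch of that failure mode and a defensive guard (type and field names here are illustrative assumptions, not the actual influxdb code):

```go
package main

import "fmt"

// QueryEngine is a hypothetical stand-in for engine.QueryEngine.
// aggregators stays nil if the engine never received a shard response,
// which is the situation a freshly wiped node can end up in.
type QueryEngine struct {
	aggregators []func()
}

func (e *QueryEngine) runAggregates() {
	// Ranging over a nil slice is safe in Go; the original panic
	// implies a deeper field was nil. The guard in Close below
	// avoids reaching this point on an uninitialized engine.
	for _, agg := range e.aggregators {
		agg()
	}
}

// Close flushes pending aggregates, but first rejects an engine
// that was never set up rather than panicking on a nil dereference.
func (e *QueryEngine) Close() error {
	if e == nil || e.aggregators == nil {
		return fmt.Errorf("query engine not initialized")
	}
	e.runAggregates()
	return nil
}

func main() {
	var e *QueryEngine // nil, as on a node with wiped state
	fmt.Println(e.Close())
}
```

Calling a pointer-receiver method on a nil receiver is legal in Go, so the `e == nil` check inside `Close` is enough to turn the crash into a recoverable error.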

Every time I start server2 (db01 in this case), it gets added again to the list of servers in the cluster:

[screenshot: cluster server list from the admin UI]

When I start server2, I see this in the log of server1:

[2014/03/05 16:12:25 UTC] [EROR] (coordinator.(*ProtobufServer).handleConnection:83) Error reading from connection (10.0.0.6:46192): EOF
[2014/03/05 16:12:26 UTC] [EROR] (coordinator.(*ProtobufClient).reconnect:199) failed to connect to db01:8099

After I stopped server1 and server2, deleted /data and /tmp on both servers, and started them again, both servers connected and worked fine.

I was able to reproduce it a second time.

@pauldix pauldix added this to the 0.5.0 milestone Mar 5, 2014
@pauldix
Member

pauldix commented Mar 7, 2014

I think this is because when you brought down server 2, you deleted its data (including the Raft log). So when you brought it back up, server 1 started sending the data that server 2 had missed while it was down, but server 2 no longer had any cluster config and didn't know about any of it.

The right way would be to down server 2, kill all the data, then remove the server from the cluster in the admin UI (which you can't currently do). Then when you bring it up it should join the cluster successfully.

If you kill all the data on a server, it won't just automatically get everything copied over. Issue #67 will add functionality for you to do this.

@elcct
Contributor Author

elcct commented Mar 7, 2014

Thank you. That makes sense.

@pauldix
Member

pauldix commented Mar 10, 2014

Just opened #322 to add the ability to remove a server from the cluster. That, paired with #67, should handle everything here.
