
Server can't join cluster #300

Closed
elcct opened this issue Mar 5, 2014 · 3 comments

@elcct
Contributor

elcct commented Mar 5, 2014

Not sure if I did this right, but I wanted to test the scenario where one of the servers in a 2-node cluster dies.

  1. Stopped server2
  2. Deleted /data and /tmp
  3. Started server2

server2 crashes immediately with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x10 pc=0x594e51]

goroutine 494 [running]:
runtime.panic(0x872e60, 0xfde6a8)
    /home/vagrant/bin/go/src/pkg/runtime/panic.c:266 +0xb6
engine.(*QueryEngine).runAggregates(0xc2106a5280)
    /home/vagrant/influxdb/src/engine/engine.go:463 +0x251
engine.(*QueryEngine).Close(0xc2106a5280)
    /home/vagrant/influxdb/src/engine/engine.go:220 +0x1a1
cluster.(*ShardData).Query(0xc21015e000, 0xc21010fbd0, 0xc210b01000, 0x2, 0x2)
    /home/vagrant/influxdb/src/cluster/shard.go:213 +0x27a
created by coordinator.(*CoordinatorImpl).runQuerySpec
    /home/vagrant/influxdb/src/coordinator/coordinator.go:257 +0x34d

http://pastebin.com/4hrdM0JS
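The trace shows `Close` calling `runAggregates` on a `QueryEngine` that dereferences a nil field, presumably because the engine never finished initializing when the fresh server2 rejoined. A minimal sketch of that failure mode and a defensive guard (type and field names here are illustrative assumptions, not the actual influxdb code):

```go
package main

import "fmt"

// QueryEngine is a hypothetical stand-in for engine.QueryEngine.
// aggregators stays nil if the engine never received a shard response,
// which is the situation a freshly wiped node can end up in.
type QueryEngine struct {
	aggregators []func()
}

func (e *QueryEngine) runAggregates() {
	// Ranging over a nil slice is safe in Go; the original panic
	// implies a deeper field was nil. The guard in Close below
	// avoids reaching this point on an uninitialized engine.
	for _, agg := range e.aggregators {
		agg()
	}
}

// Close flushes pending aggregates, but first rejects an engine
// that was never set up rather than panicking on a nil dereference.
func (e *QueryEngine) Close() error {
	if e == nil || e.aggregators == nil {
		return fmt.Errorf("query engine not initialized")
	}
	e.runAggregates()
	return nil
}

func main() {
	var e *QueryEngine // nil, as on a node with wiped state
	fmt.Println(e.Close())
}
```

Calling a pointer-receiver method on a nil receiver is legal in Go, so the `e == nil` check inside `Close` is enough to turn the crash into a recoverable error.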

Every time I start server2 (db01 in this case), it gets added again to the list of servers in the cluster:

[screenshot: cluster server list from the admin UI]

When I start server2, I see this in the log of server1:

[2014/03/05 16:12:25 UTC] [EROR] (coordinator.(*ProtobufServer).handleConnection:83) Error reading from connection (10.0.0.6:46192): EOF
[2014/03/05 16:12:26 UTC] [EROR] (coordinator.(*ProtobufClient).reconnect:199) failed to connect to db01:8099

After I stopped server1 and server2, deleted /data and /tmp on both servers, and started them again, both servers connected and worked fine.

I was able to reproduce it a second time.

@pauldix pauldix added this to the 0.5.0 milestone Mar 5, 2014
@pauldix
Member

pauldix commented Mar 7, 2014

I think this is because when you brought down server 2, you deleted its data (including the Raft log). So when you brought it back up, server 1 started sending the data that server 2 had missed while it was down, but server 2 no longer had any cluster config and didn't know about any of it.

The right way would be to down server 2, kill all the data, then remove the server from the cluster in the admin UI (which you can't currently do). Then when you bring it up it should join the cluster successfully.

If you kill all the data on a server, it won't just automatically get everything copied over. Issue #67 will add functionality for you to do this.

@elcct
Contributor Author

elcct commented Mar 7, 2014

Thank you. That makes sense.

@pauldix
Member

pauldix commented Mar 10, 2014

Just opened #322 to add the ability to remove a server from the cluster. That, paired with #67, should handle everything here.
