Fix cluster-wide restart issue. #2283

benbjohnson · 2015-04-14T19:45:38Z

Overview

This pull request fixes some error checking and adds a startup delay so that the brokers have time to elect a leader when the whole cluster is restarted.
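To illustrate the idea of giving the brokers time to elect a leader, here is a minimal sketch of a bounded retry-with-delay loop. It is not the code from this pull request: `waitForLeader`, `errNoLeader`, and the attempt and delay values are hypothetical names and numbers chosen for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoLeader is a hypothetical sentinel meaning "no leader elected yet".
var errNoLeader = errors.New("no leader elected")

// waitForLeader calls check up to attempts times (attempts should be >= 1),
// sleeping for delay between failed tries. It returns nil as soon as check
// succeeds; otherwise it returns the last error wrapped with the attempt count.
func waitForLeader(attempts int, delay time.Duration, check func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = check(); err == nil {
			return nil
		}
		// No point sleeping after the final failed attempt.
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated check: the leader only becomes visible on the third poll.
	check := func() error {
		calls++
		if calls < 3 {
			return errNoLeader
		}
		return nil
	}
	err := waitForLeader(5, 10*time.Millisecond, check)
	fmt.Println(err, calls) // prints "<nil> 3"
}
```

A bounded loop like this fails loudly if no leader ever appears, whereas a single fixed sleep either waits longer than necessary or not long enough.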

otoolep · 2015-04-14T19:52:48Z

cmd/influxd/run.go

 		if len(joinURLs) > 0 {
 			joinServer(s, cmd.config.ClusterURL(), joinURLs)
 			return s
 		}

 		if err := s.Initialize(cmd.config.ClusterURL()); err != nil {
-			log.Fatalf("server initialization error: %s", err)
+			log.Fatalf("server initialization error(0): %s", err)


You can make better log messages than this. :-) E.g.

server initialization error (server ID 0) ?

jwilder · 2015-04-14

I think the 0 is to delineate the two places where Initialize could have returned an error: here and L606. I don't think it's the server ID, but I could be wrong.

otoolep · 2015-04-14

OK, you could be right. It would still be better to see clearer log messages to help the next guy. E.g. server initialization error during ..... -- I presume the software is in a different state at each of those points.

benbjohnson · 2015-04-14

This whole file needs some rework, so I just wanted to differentiate between this error message and another with the same text below (as @jwilder mentioned).
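One way to act on the suggestion above is to wrap the error with a short description of the phase in which it occurred, rather than a bare index. The sketch below is illustrative only: `initError`, `errExample`, and both phase strings are hypothetical, since the thread does not say what distinguishes the two Initialize call sites.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample stands in for whatever error Initialize might return.
var errExample = errors.New("boom")

// initError wraps err with the phase in which initialization failed, so that
// two call sites produce distinguishable log messages. The phase text is
// supplied by the caller; the values used in main are made up.
func initError(phase string, err error) error {
	return fmt.Errorf("server initialization error (%s): %w", phase, err)
}

func main() {
	// Hypothetical phases; the real call sites may differ in other ways.
	fmt.Println(initError("first call site", errExample))
	fmt.Println(initError("second call site", errExample))
	// The original error is still reachable through the wrapper.
	fmt.Println(errors.Is(initError("first call site", errExample), errExample))
}
```

Because the wrapper uses `%w`, callers can still match the underlying error with `errors.Is` while the log line says where the failure happened.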

otoolep · 2015-04-14T20:39:25Z

+1 from me. CHANGELOG needs to be updated too.

Issue opened regarding possible re-work of delay: #2285

benbjohnson · 2015-04-14T20:44:37Z

CHANGELOG updated, merging.

otoolep · 2015-04-14T20:51:15Z

Adding 1 final note here for posterity.

This issue is specific to multi-node restarts. At restart time the cluster already exists as a multi-node cluster, so leader election requires more than one node to vote. This is not the case when a multi-node cluster is created for the first time, because it always starts as a single node and leader election takes place synchronously with the start of the first broker.

Single-node clusters would not suffer from this restart problem either.
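The reasoning above can be sketched with a little arithmetic, assuming a majority-vote election (as in Raft-style consensus); the thread itself only states that more than one node must vote. The helper names `quorum` and `canElect` are hypothetical.

```go
package main

import "fmt"

// quorum returns the number of votes that form a strict majority of n members.
func quorum(n int) int {
	return n/2 + 1
}

// canElect reports whether up running nodes can elect a leader in a cluster
// whose membership is total nodes.
func canElect(total, up int) bool {
	return up >= quorum(total)
}

func main() {
	// Single-node cluster: the lone node is its own majority.
	fmt.Println(canElect(1, 1)) // true
	// Restarted three-node cluster with only the first broker back up.
	fmt.Println(canElect(3, 1)) // false
	// Same cluster once a second broker has started.
	fmt.Println(canElect(3, 2)) // true
}
```

Under that assumption, the first broker to come back in a restarted multi-node cluster cannot elect a leader by itself, which is why some form of waiting or retrying is needed.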

Fix cluster-wide restart issue.

c5bdb5a

benbjohnson added the 2 - Working label Apr 14, 2015

otoolep reviewed Apr 14, 2015

otoolep mentioned this pull request Apr 14, 2015

Server (data node) should retry joining leader #2285

Closed

CHANGELOG

47be5fe

benbjohnson added a commit that referenced this pull request Apr 14, 2015

Merge pull request #2283 from influxdb/join-fix

03d5ffc

Fix cluster-wide restart issue.

benbjohnson merged commit 03d5ffc into master Apr 14, 2015

benbjohnson removed the 2 - Working label Apr 14, 2015

benbjohnson deleted the join-fix branch April 14, 2015 20:44
