On dgraph 5b93fb4 (v1.0.5-dev, 2018-05-03 06:17:34 -0700), roughly one test in 20 winds up with a node stuck in the cluster join process for over a minute, refusing to serve requests. In 20180507T151028.000-0500.zip, alpha on n2 gets stuck calling JoinCluster:
...
2018/05/07 13:10:53 draft.go:180: Node ID: 2 with GroupID: 1
2018/05/07 13:10:53 node.go:240: Group 1 found 0 entries
2018/05/07 13:10:53 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:54 pool.go:108: == CONNECT ==> Setting n3:5080
2018/05/07 13:10:54 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:55 draft.go:895: Calling IsPeer
2018/05/07 13:10:55 draft.go:900: Done with IsPeer call
2018/05/07 13:10:55 draft.go:947: New Node for group: 1
2018/05/07 13:10:55 draft.go:952: Retrieving snapshot.
2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster
... where other nodes (e.g. n4) concurrently make it through JoinCluster, or don't seem to call JoinCluster at all. I haven't seen this cluster recover yet, but my automation gives up after a little over a minute, so this might just be a slow (60s?) timeout or something.
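A rough sketch of how automation could flag this state from an alpha log, rather than waiting for requests to time out. It is a heuristic, not anything dgraph provides: `stuckInJoin` is a hypothetical helper, and it assumes that a node which gets past JoinCluster logs at least one more line afterwards, so a log whose last non-empty line is `Calling JoinCluster` is treated as stuck.

```go
package main

import (
	"fmt"
	"strings"
)

// stuckInJoin reports whether the last non-empty line of an alpha log is the
// "Calling JoinCluster" message. Heuristic only: it assumes a node that gets
// past JoinCluster writes at least one more log line afterwards.
func stuckInJoin(log string) bool {
	lines := strings.Split(log, "\n")
	// Walk backwards to find the last non-empty line.
	for i := len(lines) - 1; i >= 0; i-- {
		line := strings.TrimSpace(lines[i])
		if line == "" {
			continue
		}
		return strings.Contains(line, "Calling JoinCluster")
	}
	// Empty log: nothing to conclude.
	return false
}

func main() {
	// Tail of the n2 log quoted above.
	log := `2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster
`
	fmt.Println(stuckInJoin(log))
}
```

This only says where the log stops. It cannot distinguish a deadlock from a slow timeout, so it would need to be combined with a wait longer than the suspected 60s.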
aphyr changed the title from "Another deadlock in cluster join" to "Another deadlock in cluster join?" on May 7, 2018