Another deadlock in cluster join? #2376

Closed
aphyr opened this issue May 7, 2018 · 3 comments
Labels
investigate Requires further investigation

Comments

@aphyr
aphyr commented May 7, 2018

On dgraph 5b93fb4 (v1.0.5-dev, 2018-05-03 06:17:34 -0700), roughly one test in 20 winds up with a node stuck in the cluster join process for over a minute, refusing to serve requests. In 20180507T151028.000-0500.zip, alpha on n2 gets stuck calling JoinCluster:

...
2018/05/07 13:10:53 draft.go:180: Node ID: 2 with GroupID: 1
2018/05/07 13:10:53 node.go:240: Group 1 found 0 entries
2018/05/07 13:10:53 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:54 pool.go:108: == CONNECT ==> Setting n3:5080
2018/05/07 13:10:54 draft.go:930: Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
2018/05/07 13:10:55 draft.go:895: Calling IsPeer
2018/05/07 13:10:55 draft.go:900: Done with IsPeer call
2018/05/07 13:10:55 draft.go:947: New Node for group: 1
2018/05/07 13:10:55 draft.go:952: Retrieving snapshot.
2018/05/07 13:10:55 draft.go:955: Trying to join peers.
2018/05/07 13:10:55 draft.go:878: Calling JoinCluster

... where other nodes (e.g. n4) concurrently make it through JoinCluster, or don't seem to call JoinCluster at all. I haven't seen this cluster recover yet, but my automation gives up after a little over a minute, so this might just be a slow (60s?) timeout or something.
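For illustration, a node in this state can be flagged mechanically by checking whether its log ends at the JoinCluster call. The Go sketch below is a heuristic, not part of Dgraph or the test harness: `lastStep` and `stuckInJoin` are hypothetical helper names, and the code assumes the `date time file.go:line: message` layout seen in the excerpt above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lastStep returns the message part of the last well-formed log line,
// i.e. the text after the "date time file.go:NNN: " prefix.
// Lines that do not have that shape (blank lines, "...") are skipped.
func lastStep(text string) string {
	last := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		// Fields: date, time, "file.go:NNN:", then the message.
		parts := strings.SplitN(line, " ", 4)
		if len(parts) == 4 {
			last = parts[3]
		}
	}
	return last
}

// stuckInJoin reports whether the log stops right at the JoinCluster call.
// Assumption: a node that finished joining would log something afterwards.
func stuckInJoin(text string) bool {
	return lastStep(text) == "Calling JoinCluster"
}

func main() {
	text := "2018/05/07 13:10:55 draft.go:955: Trying to join peers.\n" +
		"2018/05/07 13:10:55 draft.go:878: Calling JoinCluster\n"
	fmt.Println(stuckInJoin(text))
}
```

Run against the n2 excerpt above, this would report the node as stuck. It says nothing about whether the call would eventually return after a longer timeout.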

@aphyr aphyr changed the title Another deadlock in cluster join Another deadlock in cluster join? May 7, 2018
@manishrjain manishrjain added the investigate Requires further investigation label Jun 14, 2018
@manishrjain manishrjain self-assigned this Jun 14, 2018
@manishrjain manishrjain removed their assignment Aug 14, 2018
@mkcp
mkcp commented Aug 25, 2018

Looks like it's resolved in 1.0.8-rc1! We can close this out.

@manishrjain
Contributor

Thanks for confirming, @mkcp !

@manishrjain
Contributor

If this was not already fixed, commit 8779066 fixed it.
