Description
On the build @manishrjain submitted for testing on Friday, May 18, 2018, single-record set tests can stall indefinitely during reading. Although the cluster is stable, all nodes are running, and the network is totally connected, and all test-initiated predicate migrations appear to have completed, every read request will time out. This condition can last for at least an hour.
For instance, see 20180522T125649.000-0500.zip, where in the middle of the final reads, transactions just... start timing out.
2018-05-22 13:04:41,203{GMT} INFO [jepsen worker 7] jepsen.dgraph.set: Forcing conflict by inserting 7517
2018-05-22 13:04:41,297{GMT} INFO [jepsen worker 6] jepsen.dgraph.set: Forcing conflict by inserting 5551
2018-05-22 13:04:41,297{GMT} INFO [jepsen worker 5] jepsen.dgraph.set: Forcing conflict by inserting 6854
2018-05-22 13:04:41,297{GMT} INFO [jepsen worker 0] jepsen.dgraph.set: Forcing conflict by inserting 7368
2018-05-22 13:04:46,198{GMT} INFO [jepsen worker 4] jepsen.util: 4 :info :read nil :timeout
2018-05-22 13:04:46,202{GMT} INFO [jepsen worker 9] jepsen.util: 9 :info :read nil :timeout
2018-05-22 13:04:46,230{GMT} INFO [jepsen worker 7] jepsen.util: 7 :info :read nil :timeout
2018-05-22 13:04:46,234{GMT} INFO [jepsen worker 2] jepsen.util: 2 :info :read nil :timeout
2018-05-22 13:04:46,271{GMT} INFO [jepsen worker 1] jepsen.util: 1 :info :read nil :timeout
Although each process goes on to retry the read process, no subsequent query ever returns:
8-05-22 13:52:27,172{GMT} INFO [jepsen worker 9] jepsen.util: 839 :invoke :read nil
2018-05-22 13:52:28,607{GMT} INFO [jepsen worker 8] jepsen.util: 798 :info :read nil :timeout
2018-05-22 13:52:32,179{GMT} INFO [jepsen worker 9] jepsen.util: 839 :info :read nil :timeout
2018-05-22 13:52:38,642{GMT} INFO [jepsen worker 2] jepsen.util: 892 :invoke :read nil
2018-05-22 13:52:40,683{GMT} INFO [jepsen worker 3] jepsen.util: 793 :invoke :read nil
2018-05-22 13:52:43,650{GMT} INFO [jepsen worker 2] jepsen.util: 892 :info :read nil :timeout
2018-05-22 13:52:45,689{GMT} INFO [jepsen worker 3] jepsen.util: 793 :info :read nil :timeout
About ten seconds after operations start timing out, Zero on n1 logs:
2018/05/22 11:04:57 tablet.go:179:
Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]
2018/05/22 11:04:57 tablet.go:184: size_diff 2218
2018/05/22 11:04:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:24:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:24:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:24:57 tablet.go:179:
Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]
2018/05/22 11:24:57 tablet.go:184: size_diff 2218
2018/05/22 11:24:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:44:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:44:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:44:57 tablet.go:179:
Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]
2018/05/22 11:44:57 tablet.go:184: size_diff 2218
2018/05/22 11:44:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
And n4 logs corresponding predicate moves:
2018/05/22 11:04:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:04:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:05:11 draft.go:825: Writing snapshot at index: 13819, applied mark: 26530
2018/05/22 11:24:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:24:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:44:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:44:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
No other node logs anything after 11:04. Is it possible that an automatic predicate migration started, then got stuck somehow? The 20-minute intervals between migrations suggest that it's retrying the migration, at least, but timing out every request looks... odd.
You can reproduce this with Jepsen b0b458d32e43c072f257b75ea786431ea0d0c7a5 by running:
lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing client --upsert-schema --time-limit 300 --concurrency 2n --nemesis-interval 5 --workload uid-set --retry-db-setup --test-count 10