Skip to content

All reads time out indefinitely, possibly due to stalled predicate migration? #2405

Closed
@aphyr

Description

@aphyr

On the build @manishrjain submitted for testing on Friday, May 18, 2018, single-record set tests can stall indefinitely during reading. Although the cluster is stable, all nodes are running, and the network is totally connected, and all test-initiated predicate migrations appear to have completed, every read request will time out. This condition can last for at least an hour.

For instance, see 20180522T125649.000-0500.zip, where in the middle of the final reads, transactions just... start timing out.

2018-05-22 13:04:41,203{GMT}	INFO	[jepsen worker 7] jepsen.dgraph.set: Forcing conflict by inserting 7517
2018-05-22 13:04:41,297{GMT}	INFO	[jepsen worker 6] jepsen.dgraph.set: Forcing conflict by inserting 5551
2018-05-22 13:04:41,297{GMT}	INFO	[jepsen worker 5] jepsen.dgraph.set: Forcing conflict by inserting 6854
2018-05-22 13:04:41,297{GMT}	INFO	[jepsen worker 0] jepsen.dgraph.set: Forcing conflict by inserting 7368
2018-05-22 13:04:46,198{GMT}	INFO	[jepsen worker 4] jepsen.util: 4	:info	:read	nil	:timeout
2018-05-22 13:04:46,202{GMT}	INFO	[jepsen worker 9] jepsen.util: 9	:info	:read	nil	:timeout
2018-05-22 13:04:46,230{GMT}	INFO	[jepsen worker 7] jepsen.util: 7	:info	:read	nil	:timeout
2018-05-22 13:04:46,234{GMT}	INFO	[jepsen worker 2] jepsen.util: 2	:info	:read	nil	:timeout
2018-05-22 13:04:46,271{GMT}	INFO	[jepsen worker 1] jepsen.util: 1	:info	:read	nil	:timeout

Although each process goes on to retry the read process, no subsequent query ever returns:

8-05-22 13:52:27,172{GMT}	INFO	[jepsen worker 9] jepsen.util: 839	:invoke	:read	nil
2018-05-22 13:52:28,607{GMT}	INFO	[jepsen worker 8] jepsen.util: 798	:info	:read	nil	:timeout
2018-05-22 13:52:32,179{GMT}	INFO	[jepsen worker 9] jepsen.util: 839	:info	:read	nil	:timeout
2018-05-22 13:52:38,642{GMT}	INFO	[jepsen worker 2] jepsen.util: 892	:invoke	:read	nil
2018-05-22 13:52:40,683{GMT}	INFO	[jepsen worker 3] jepsen.util: 793	:invoke	:read	nil
2018-05-22 13:52:43,650{GMT}	INFO	[jepsen worker 2] jepsen.util: 892	:info	:read	nil	:timeout
2018-05-22 13:52:45,689{GMT}	INFO	[jepsen worker 3] jepsen.util: 793	:info	:read	nil	:timeout

About ten seconds after operations start timing out, Zero on n1 logs:

2018/05/22 11:04:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:04:57 tablet.go:184: size_diff 2218
2018/05/22 11:04:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:24:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:24:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2018/05/22 11:24:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:24:57 tablet.go:184: size_diff 2218
2018/05/22 11:24:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1
2018/05/22 11:44:57 tablet.go:221: Got error during move: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/05/22 11:44:57 tablet.go:60: Error while trying to move predicate value from 2 to 1: While calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded

2018/05/22 11:44:57 tablet.go:179: 

Groups sorted by size: [{gid:1 size:17} {gid:2 size:2235}]

2018/05/22 11:44:57 tablet.go:184: size_diff 2218
2018/05/22 11:44:57 tablet.go:73: Going to move predicate: [value], size: [1.1 kB] from group 2 to 1

And n4 logs corresponding predicate moves:

2018/05/22 11:04:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:04:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:05:11 draft.go:825: Writing snapshot at index: 13819, applied mark: 26530
2018/05/22 11:24:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:24:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0
2018/05/22 11:44:57 predicate_move.go:301: Move predicate request for pred: [value], src: [2], dst: [1]
2018/05/22 11:44:57 predicate_move.go:311: Trying to abort pending mutations. Loop: 0

No other node logs anything after 11:04. Is it possible that an automatic predicate migration started, then got stuck somehow? The 20-minute intervals between migrations suggest that it's retrying the migration, at least, but timing out every request looks... odd.

You can reproduce this with Jepsen b0b458d32e43c072f257b75ea786431ea0d0c7a5 by running:

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing client --upsert-schema --time-limit 300 --concurrency 2n --nemesis-interval 5 --workload uid-set --retry-db-setup --test-count 10

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions