Closed
Description
In the build @manishrjain provided for testing on 2018-05-16 (sorry, don't have a SHA!), with predicate moves roughly every 10 seconds, Dgraph went down: 20180516T194304.000-0500.zip
In particular, it looks like a predicate move caused a single node's Alpha to crash with the following message:
2018/05/16 17:47:21 predicate_move.go:300: Move predicate request for pred: [amount_0], src: [2], dst: [1]
2018/05/16 17:47:21 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:21 predicate_move.go:176: Sent 2 keys for predicate amount_0
2018/05/16 17:47:21 predicate_move.go:244: Got ReceivePredicate. Group: 2. Am leader: true
2018/05/16 17:47:21 predicate_move.go:193: Receiving predicate. Batching and proposing key values
2018/05/16 17:47:21 predicate_move.go:213: Predicate being received: type_0
2018/05/16 17:47:21 index.go:934: Dropping predicate: [type_0]
2018/05/16 17:47:21 schema.go:63: Deleting schema for predicate: [type_0]
2018/05/16 17:47:22 predicate_move.go:43: Writing 3 keys
2018/05/16 17:47:22 schema.go:252: Setting schema for attr type_0: string, tokenizer: [exact], directive: INDEX, count: false
2018/05/16 17:47:22 predicate_move.go:278: Proposed 3 keys. Error: <nil>
2018/05/16 17:47:22 predicate_move.go:300: Move predicate request for pred: [type_1], src: [2], dst: [1]
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 1
2018/05/16 17:47:22 unexpected meta: 0
github.com/dgraph-io/dgraph/x.Fatalf
/home/mrjn/go/src/github.com/dgraph-io/dgraph/x/error.go:100
github.com/dgraph-io/dgraph/posting.ReadPostingList
/home/mrjn/go/src/github.com/dgraph-io/dgraph/posting/mvcc.go:424
github.com/dgraph-io/dgraph/worker.movePredicateHelper
/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:124
github.com/dgraph-io/dgraph/worker.(*grpcWorker).MovePredicate
/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:321
github.com/dgraph-io/dgraph/protos/intern._Worker_MovePredicate_Handler
/home/mrjn/go/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2772
google.golang.org/grpc.(*Server).processUnaryRPC
/home/mrjn/go/src/google.golang.org/grpc/server.go:1011
google.golang.org/grpc.(*Server).handleStream
/home/mrjn/go/src/google.golang.org/grpc/server.go:1249
google.golang.org/grpc.(*Server).serveStreams.func1.1
/home/mrjn/go/src/google.golang.org/grpc/server.go:680
runtime.goexit
/usr/lib/go-1.10/src/runtime/asm_amd64.s:2361
Oddly, the cluster was never able to recover from this state, even though five Zero nodes and four Alpha nodes were running and connected. All alpha nodes logged
2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused
... indefinitely, and Zero nodes logged the same thing:
2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
Every request either timed out (on n3), or returned UNAVAILABLE. Attempts to perform further predicate moves all timed out as well, though Zero would answer /state
queries:
{:counter "1079",
:groups
{1
{:members
{1 {:id "1", :groupId 1, :addr "n1:7080"},
2
{:id "2",
:groupId 1,
:addr "n2:7080",
:leader true,
:lastUpdate "1526517895"},
4 {:id "4", :groupId 1, :addr "n4:7080"}},
:tablets
{:amount_0 {:groupId 1, :predicate "amount_0", :force true},
:key_2 {:groupId 1, :predicate "key_2", :force true},
:type_2 {:groupId 1, :predicate "type_2", :force true}}},
2
{:members
{3
{:id "3",
:groupId 2,
:addr "n3:7080",
:leader true,
:lastUpdate "1526517898"},
5 {:id "5", :groupId 2, :addr "n5:7080"}},
:tablets
{:_predicate_ {:groupId 2, :predicate "_predicate_", :force true},
:amount_1 {:groupId 2, :predicate "amount_1", :force true},
:amount_2 {:groupId 2, :predicate "amount_2", :force true},
:key_0 {:groupId 2, :predicate "key_0", :force true},
:key_1 {:groupId 2, :predicate "key_1", :force true},
:type_0 {:groupId 2, :predicate "type_0", :force true},
:type_1 {:groupId 2, :predicate "type_1", :force true}}}},
:zeros
{1 {:id "1", :addr "n1:5080"},
2 {:id "2", :addr "n2:5080"},
3 {:id "3", :addr "n3:5080", :leader true},
4 {:id "4", :addr "n4:5080"},
5 {:id "5", :addr "n5:5080"}},
:maxLeaseId "20000",
:maxTxnTs "20000"}