Skip to content

Predicate moves can take down Dgraph indefinitely #2397

Closed
@aphyr

Description

@aphyr

In the build @manishrjain provided for testing on 2018-05-16 (sorry, don't have a SHA!), with predicate moves roughly every 10 seconds, Dgraph went down: 20180516T194304.000-0500.zip

latency-raw

In particular, it looks like a predicate move caused a single node's Alpha to crash with the following message:

2018/05/16 17:47:21 predicate_move.go:300: Move predicate request for pred: [amount_0], src: [2], dst: [1]
2018/05/16 17:47:21 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:21 predicate_move.go:176: Sent 2 keys for predicate amount_0
2018/05/16 17:47:21 predicate_move.go:244: Got ReceivePredicate. Group: 2. Am leader: true
2018/05/16 17:47:21 predicate_move.go:193: Receiving predicate. Batching and proposing key values
2018/05/16 17:47:21 predicate_move.go:213: Predicate being received: type_0
2018/05/16 17:47:21 index.go:934: Dropping predicate: [type_0]
2018/05/16 17:47:21 schema.go:63: Deleting schema for predicate: [type_0]
2018/05/16 17:47:22 predicate_move.go:43: Writing 3 keys
2018/05/16 17:47:22 schema.go:252: Setting schema for attr type_0: string, tokenizer: [exact], directive: INDEX, count: false
2018/05/16 17:47:22 predicate_move.go:278: Proposed 3 keys. Error: <nil>
2018/05/16 17:47:22 predicate_move.go:300: Move predicate request for pred: [type_1], src: [2], dst: [1]
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 0
2018/05/16 17:47:22 predicate_move.go:308: Trying to abort pending mutations. Loop: 1
2018/05/16 17:47:22 unexpected meta: 0
github.com/dgraph-io/dgraph/x.Fatalf
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/x/error.go:100
github.com/dgraph-io/dgraph/posting.ReadPostingList
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/posting/mvcc.go:424
github.com/dgraph-io/dgraph/worker.movePredicateHelper
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:124
github.com/dgraph-io/dgraph/worker.(*grpcWorker).MovePredicate
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/worker/predicate_move.go:321
github.com/dgraph-io/dgraph/protos/intern._Worker_MovePredicate_Handler
	/home/mrjn/go/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2772
google.golang.org/grpc.(*Server).processUnaryRPC
	/home/mrjn/go/src/google.golang.org/grpc/server.go:1011
google.golang.org/grpc.(*Server).handleStream
	/home/mrjn/go/src/google.golang.org/grpc/server.go:1249
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/home/mrjn/go/src/google.golang.org/grpc/server.go:680
runtime.goexit
	/usr/lib/go-1.10/src/runtime/asm_amd64.s:2361

Oddly, the cluster was never able to recover from this state, even though five Zero nodes and four Alpha nodes were running and connected. All alpha nodes logged

2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused

... indefinitely, and Zero nodes logged the same thing:

2018/05/16 17:47:24 oracle.go:425: Error while fetching minTs from group 2, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"
2018/05/16 17:47:28 pool.go:158: Echo error from n3:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.122.13:7080: connect: connection refused"

Every request either timed out (on n3), or returned UNAVAILABLE. Attempts to perform further predicate moves all timed out as well, though Zero would answer /state queries:

{:counter "1079",
 :groups
 {1
  {:members
   {1 {:id "1", :groupId 1, :addr "n1:7080"},
    2
    {:id "2",
     :groupId 1,
     :addr "n2:7080",
     :leader true,
     :lastUpdate "1526517895"},
    4 {:id "4", :groupId 1, :addr "n4:7080"}},
   :tablets
   {:amount_0 {:groupId 1, :predicate "amount_0", :force true},
    :key_2 {:groupId 1, :predicate "key_2", :force true},
    :type_2 {:groupId 1, :predicate "type_2", :force true}}},
  2
  {:members
   {3
    {:id "3",
     :groupId 2,
     :addr "n3:7080",
     :leader true,
     :lastUpdate "1526517898"},
    5 {:id "5", :groupId 2, :addr "n5:7080"}},
   :tablets
   {:_predicate_ {:groupId 2, :predicate "_predicate_", :force true},
    :amount_1 {:groupId 2, :predicate "amount_1", :force true},
    :amount_2 {:groupId 2, :predicate "amount_2", :force true},
    :key_0 {:groupId 2, :predicate "key_0", :force true},
    :key_1 {:groupId 2, :predicate "key_1", :force true},
    :type_0 {:groupId 2, :predicate "type_0", :force true},
    :type_1 {:groupId 2, :predicate "type_1", :force true}}}},
 :zeros
 {1 {:id "1", :addr "n1:5080"},
  2 {:id "2", :addr "n2:5080"},
  3 {:id "3", :addr "n3:5080", :leader true},
  4 {:id "4", :addr "n4:5080"},
  5 {:id "5", :addr "n5:5080"}},
 :maxLeaseId "20000",
 :maxTxnTs "20000"}

Metadata

Metadata

Assignees

Labels

kind/bugSomething is broken.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions