
[disaster recovery] Cluster unable to recover after crash during intensive writing operations #5836

@christian-roggia

Description

What version of Dgraph are you using?

This issue has been consistently observed in all the following versions:

  • v20.03
  • v20.07.0-beta.Jun15
  • v20.07.0-beta.Jun22

Have you tried reproducing the issue with the latest release?

Yes.

What is the hardware spec (RAM, OS)?

  • Google Kubernetes Engine (e2-highmem-4)
  • Helm chart (3 replicas of alpha and zero)
  • 4 CPUs
  • 32 GB of RAM
  • SSD storage
  • Preemptible

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Launch dgraph on a preemptible node where a sudden restart of the node, with no graceful shutdown, might happen at any time.
  2. Start heavy writing operations (a sketch of such a load generator follows this list).
  3. Wait for a node to be restarted.
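For reference, the kind of heavy-write workload in step 2 can be generated with a minimal dgo client along the lines of the sketch below. This is an illustration only, not the exact workload we ran: the endpoint (localhost:9080), the predicate, and the worker/iteration counts are placeholder assumptions, and the dgo v200 module is assumed (it matches Dgraph v20.03+).

package main

import (
    "context"
    "fmt"
    "log"
    "sync"

    "github.com/dgraph-io/dgo/v200"
    "github.com/dgraph-io/dgo/v200/protos/api"
    "google.golang.org/grpc"
)

func main() {
    // Placeholder endpoint: the gRPC port of a single alpha.
    conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

    // Hammer the cluster with many small committed mutations from several
    // goroutines, so that a node preemption is likely to interrupt in-flight writes.
    var wg sync.WaitGroup
    for w := 0; w < 8; w++ {
        wg.Add(1)
        go func(worker int) {
            defer wg.Done()
            for i := 0; i < 100000; i++ {
                mu := &api.Mutation{
                    SetNquads: []byte(fmt.Sprintf(`_:n <name> "worker-%d-item-%d" .`, worker, i)),
                    CommitNow: true,
                }
                if _, err := dg.NewTxn().Mutate(context.Background(), mu); err != nil {
                    log.Printf("mutation failed: %v", err)
                }
            }
        }(w)
    }
    wg.Wait()
}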

Expected behavior and actual result.

The expected result is that dgraph is able to recover after a sudden restart.
Partial data loss is also expected, since dgraph was not shut down gracefully.

NOTE: Preemptible machines are not intended to host databases, but they are a very good simulation of a disaster (e.g. an outage at a cloud provider, a network failure, or an unexpected crash of dgraph).

The actual behavior of dgraph is instead a complete failure of the cluster: it is unable to recover and enters a state where it is no longer reachable.

Received first state update from Zero: counter:3111932 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593993800 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-alpha-1.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593986968 > > members:<key:3 value:<id:3 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593951535 > > tablets:<......> > snapshot_ts:5445471 checksum:17334002985932778878 > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero-headless.parser.svc.cluster.local:5080" > > zeros:<key:2 value:<id:2 addr:"dgraph-zero-1.dgraph-zero-headless.parser.svc.cluster.local:5080" leader:true > > zeros:<key:3 value:<id:3 addr:"dgraph-zero-2.dgraph-zero-headless.parser.svc.cluster.local:5080" > > maxLeaseId:5550000 maxTxnTs:6690000 maxRaftId:3 cid:"27dc9254-8fd4-4c73-bf77-659f319d0830" license:<maxNodes:18446744073709551615 expiryTs:1596482528 enabled:true > 
Sending Snapshot Created batch of size: 175 MB in 12.405286971s.
Sending Snapshot Time elapsed: 01m18s, bytes sent: 588 MB, speed: 7.5 MB/sec
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255449
Sending Snapshot Created batch of size: 255 MB in 15.353450423s.
Sending Snapshot Time elapsed: 01m37s, bytes sent: 843 MB, speed: 8.7 MB/sec
Sending Snapshot Created batch of size: 243 MB in 6.770549492s.
Sending Snapshot Time elapsed: 01m45s, bytes sent: 1.1 GB, speed: 10 MB/sec
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255448
Sending Snapshot Created batch of size: 110 MB in 43.377727473s.
Sending Snapshot Created batch of size: 101 MB in 46.845357675s.
Sending Snapshot Sent 20942742 keys
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
Streaming done. Sent 20942742 entries. Waiting for ACK...
Received ACK with done: true
Stream snapshot: OK
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529

What is important to notice here are these two recurring lines:

1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7

and

Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529

Attempted solutions to the issue

We tried the following steps to force the recovery of the database (a programmatic sketch of the scaling steps is shown after the list):

  • Scale the cluster back to 0 replicas (gracefully) and scale back up to 3 replicas
  • Restart 2 out of 3 replicas with clean persistent storage (no data inside)
  • Scale the cluster to a single replica
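The scaling steps above amount to resizing the alpha and zero StatefulSets (e.g. with kubectl scale). Purely as an illustration, a programmatic equivalent using client-go might look like the sketch below; the StatefulSet names (dgraph-alpha, dgraph-zero) and the parser namespace are taken from the hostnames in the logs, while the kubeconfig location and client-go version (>= v0.18 for the context-aware API) are assumptions.

package main

import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// scaleStatefulSet sets the replica count of a StatefulSet, like "kubectl scale".
func scaleStatefulSet(ctx context.Context, cs *kubernetes.Clientset, ns, name string, replicas int32) error {
    sts := cs.AppsV1().StatefulSets(ns)
    scale, err := sts.GetScale(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    scale.Spec.Replicas = replicas
    _, err = sts.UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
    return err
}

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    ctx := context.Background()
    // Scale both StatefulSets down to 0 for a graceful stop ...
    for _, name := range []string{"dgraph-alpha", "dgraph-zero"} {
        if err := scaleStatefulSet(ctx, cs, "parser", name, 0); err != nil {
            log.Fatal(err)
        }
    }
    // ... then scale back up to 3 (in practice we waited for all pods to
    // terminate in between; that wait is omitted from this sketch).
    for _, name := range []string{"dgraph-alpha", "dgraph-zero"} {
        if err := scaleStatefulSet(ctx, cs, "parser", name, 3); err != nil {
            log.Fatal(err)
        }
    }
}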

After scaling the entire cluster down to 0 replicas and back up to 3 replicas for a clean restart, the following logs appear:

I0706 05:37:06.533661      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:38:06.551211      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:39:06.445989      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:40:06.478813      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:41:06.475061      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:42:06.509597      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:43:06.479442      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:44:06.513435      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:45:06.462318      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:46:06.454675      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:47:06.557153      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:48:06.572650      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:49:06.539283      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:50:06.562204      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529

Additional notes

We suspect that the issue is caused by dgraph-alpha, which accepts and continues to process new entries even when the cluster is not ready (i.e. while the majority of replicas are still offline). The error might also be caused by dgraph-zero being unavailable, or by the sudden crash of the entire cluster, which is comparable to a disaster.

We have just enabled liveness and readiness probes on our deployments. We urge the dgraph team to enable such probes by default, as a Helm chart shipping with probes disabled is not what users expect. We also suspect that having probes disabled contributes significantly to this issue, since writing operations resume too early, while the cluster is not yet ready.
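For context, dgraph alpha exposes a /health endpoint on its HTTP port (8080 by default), which is what such probes typically query. The sketch below shows the equivalent check as a standalone Go program; the address and the exit-code convention are our own assumptions for illustration, not the probe configuration we actually deployed.

package main

import (
    "fmt"
    "net/http"
    "os"
    "time"
)

func main() {
    // Query the alpha's /health endpoint, as a readiness probe would.
    client := &http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get("http://localhost:8080/health")
    if err != nil {
        fmt.Fprintln(os.Stderr, "alpha not reachable:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        fmt.Fprintln(os.Stderr, "alpha unhealthy, status:", resp.Status)
        os.Exit(1)
    }
    fmt.Println("alpha ready")
}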

This behavior was also observed when the crash was only partial, with the majority of replicas going offline and restarting (2 out of 3 replicas unavailable).

We strongly invite the dgraph team to simulate disasters through preemptible machines, as this is a valuable experiment for verifying dgraph's disaster recovery capabilities. If a scenario like the one described here were to happen in a production environment, where the cluster experiences a sudden disruption of service and is no longer able to recover, it would likely have a tremendous impact on normal operations as well as on the business, with a long downtime for any service that relies on dgraph.

    Labels

    area/performance (Performance related issues), status/accepted (We accept to investigate/work on it)
