Skip to content

Fix move tablet Jepsen test failure #4496

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 9, 2020

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented Jan 3, 2020

Jepsen bank tests with "move-tablet" nemesis were failing due to:

  • The destination group was serving txns with start ts < move ts, before it had received the tablet.
  • The source group was serving txns with starts ts > move ts, after it had deleted the tablet.

For transactions which are doing writes, this wasn't an issue because commits require consultation with Zero. During commit, Zero ensures that the group which served the data is also the current owner of that predicate. Otherwise, it aborts the transaction. So, this was only an issue for read-only transactions where Zero isn't consulted.

This PR fixes the issue by doing two things:

  • It sets the move timestamp in Tablet and membership state, which the destination group uses to reject any transactions with start ts < move ts.
  • After the move, the group checksum changes. When sending tablet deletion command to the source group, this checksum is included in the request. The source group uses this info to block until it has the latest membership information from Zero. That way, it knows not to serve the predicate before deleting it.

With these changes, Jepsen tests with "move-tablet" nemesis are passing on my machine.


This change is Reviewable

@manishrjain manishrjain marked this pull request as ready for review January 3, 2020 03:56
@manishrjain manishrjain requested a review from a team January 3, 2020 03:56
Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor comments but it :lgtm:

Reviewed 8 of 8 files at r1.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @manishrjain)


protos/pb.proto, line 28 at r1 (raw file):

import "api.proto";
import "github.com/dgraph-io/badger/pb/pb.proto";

Are you fine making this change? Usually I revert it back because that's what Daniel suggested. I am fine either way since the protobuf compiler still doesn't understand this type of paths.


worker/groups.go, line 398 at r1 (raw file):

fmt

minor: use errors.Errorf for consistency with the rest of the codebase.


worker/groups.go, line 426 at r1 (raw file):

	g.tablets[key] = out
	if out != nil && ts > 0 && ts < out.MoveTs {
		return 0, fmt.Errorf("StartTs: %d is from before MoveTs: %d for pred: %q",

minor: use errors.Errorf for consistency with the rest of the codebase.


worker/predicate_move.go, line 196 at r1 (raw file):

Quoted 4 lines of code…
	// This loop ensures that we have seen the latest membership update, where this predicate now
	// belongs to another group. So, we won't serve any transaction asking for this predicate.
  // Without ensuring that we have seen this update, if we delete the tablet below, we'd end up
  // serving wrong data and cause Jepsen failures.

This sentence was a bit confusing at first. Maybe something like.

This loop ensures that we have seen a membership update following the move where this predicate now belongs to another group. Without this check, this group could end up serving transactions asking for this predicate, even after
this tablet has been deleted. This issue is known to have caused Jepsen failures.


worker/predicate_move.go, line 214 at r1 (raw file):

Quoted 5 lines of code…
	if in.DestGid == 0 {
		glog.Infof("Was instructed to delete tablet: %v", in.Predicate)
		p := &pb.Proposal{CleanPredicate: in.Predicate}
		return &emptyPayload, groups().Node.proposeAndWait(ctx, p)
	}

Can this go before the part you added? I think it can since deletion implies we don't have to wait for another group to complete the move.

Copy link
Member

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @manishrjain)


protos/pb.proto, line 191 at r1 (raw file):

	bool remove      = 8;
  bool read_only   = 9; // If true, do not ask zero to serve any tablets.
	uint64 move_ts   = 10;

some weird indentation issue here.

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comments.

Reviewed 8 of 8 files at r1.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @danielmai, @mangalaman93, and @martinmr)


protos/pb.proto, line 28 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Are you fine making this change? Usually I revert it back because that's what Daniel suggested. I am fine either way since the protobuf compiler still doesn't understand this type of paths.

I talked to Daniel, he didn't see any objections to keeping it without the v2. He also was going to work on changing the regenerate script to do these fix ups in the generated Go code. CC: @danielmai Please do that before we forget.


protos/pb.proto, line 191 at r1 (raw file):

Previously, mangalaman93 (Aman Mangal) wrote…

some weird indentation issue here.

Fixed it. read_only had spaces.


worker/groups.go, line 398 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
fmt

minor: use errors.Errorf for consistency with the rest of the codebase.

Done.


worker/groups.go, line 426 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

minor: use errors.Errorf for consistency with the rest of the codebase.

Done.


worker/predicate_move.go, line 196 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
	// This loop ensures that we have seen the latest membership update, where this predicate now
	// belongs to another group. So, we won't serve any transaction asking for this predicate.
  // Without ensuring that we have seen this update, if we delete the tablet below, we'd end up
  // serving wrong data and cause Jepsen failures.

This sentence was a bit confusing at first. Maybe something like.

This loop ensures that we have seen a membership update following the move where this predicate now belongs to another group. Without this check, this group could end up serving transactions asking for this predicate, even after
this tablet has been deleted. This issue is known to have caused Jepsen failures.

Done.


worker/predicate_move.go, line 214 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
	if in.DestGid == 0 {
		glog.Infof("Was instructed to delete tablet: %v", in.Predicate)
		p := &pb.Proposal{CleanPredicate: in.Predicate}
		return &emptyPayload, groups().Node.proposeAndWait(ctx, p)
	}

Can this go before the part you added? I think it can since deletion implies we don't have to wait for another group to complete the move.

If we delete before we have seen the update, then we would have the failure. The race cond would be: we'd delete the data first. A txn would serve the (non-existent) data. Then we'll get the membership update, saying don't serve the data.

@danielmai danielmai merged commit ec44550 into master Jan 9, 2020
@danielmai danielmai deleted the mrjn/jepsen-move-tablet-readonly-txn branch January 9, 2020 00:24
danielmai added a commit that referenced this pull request Jan 12, 2020
* Track move ts of a pred move, so we can reject any reads which happen before the move.

* Try with a timestamp jitter to update MaxAssigned.

* Added an expected checksum, so Alpha group would only delete predicate after it has latest membership info from Zero.

* Regenerate protos.

Co-authored-by: Daniel Mai <daniel@dgraph.io>
(cherry picked from commit ec44550)
@aphyr
Copy link

aphyr commented Jan 14, 2020

I'm a bit confused here--@manish, you said that this problem only impacted read-only transactions, but in #4543, we saw permanent effects, which suggests a transaction encountered read skew and managed to promote it back into a write. Any idea why we saw persistent changes in account totals?

@aphyr
Copy link

aphyr commented Jan 14, 2020

It also sounds like this might explain why we saw spurious null values, because the tablets aren't present at all, but... why would we see read skew where an actual value was present?

@manishrjain
Copy link
Contributor Author

but in #4543, we saw permanent effects, which suggests a transaction encountered read skew and managed to promote it back into a write

My guess is this:

  1. Race cond between read and tablet having just moved from src group to dst group.
  2. Dgraph Alpha does not yet have the latest membership info, and services the wrong read (because the server thought it still was serving the predicate, after having deleted the predicate. This is the fix this PR makes).
  3. Jepsen after doing the reads, does the writes in a separate network call, which goes to the dst group (membership info has been propagated by then), so can be written.
  4. During commit, Zero gets info about conflict keys and the map of which group served which predicate for the writes, Zero checks that the write was done to the correct group (i.e. dst group) and lets it be committed. Dgraph does NOT keep track of reads -- graph systems can read a lot of data.

I think not servicing such reads would fix the problem and so far the testing has been supportive of that. This PR fixed majority of the failures with bank/move-tablet. I'm chasing some of the edge cases around this fix right now (ensuring followers do the same actions as the leader in terms of waiting for membership update before deleting their data).

@manishrjain
Copy link
Contributor Author

It also sounds like this might explain why we saw spurious null values, because the tablets aren't present at all, but... why would we see read skew where an actual value was present?

That might be the different issue with posting list splits. In my branch, I've turned off the splits, and I haven't seen invalid values being served (only nil values, consistent with my above theory).

manishrjain added a commit that referenced this pull request Jan 17, 2020
In #4496, we introduced the concept of waiting for latest membership status, before the source Alpha proposes deletion of the tablet (/predicate). However, that fix only applied that check on the source Alpha group's leader. It should have been applied on each member of the Alpha group.

This PR fixes that by making the expected membership checksum part of the `CleanPredicate` proposal. That way, each Alpha can block until they see the latest membership update which says that they no longer serve that tablet, before deleting it. This fixes a bunch of `bank/move-tablet` failures, but we still see very rare `bank/move-tablet` violations.

Another fix this PR contains is to avoid doing posting list splits, which were causing `uid-set/move-tablet` failures, fixing #4538.

Changes:
* Martin's Split code has a bug. Let's not split posting lists.
* Better messaging around turning off split
* Make ExpectedChecksum part of the CleanPredicate proposal, so all followers also block deletion of the predicate until they have seen the latest membership update.
manishrjain added a commit that referenced this pull request Jan 17, 2020
In #4496, we introduced the concept of waiting for latest membership status, before the source Alpha proposes deletion of the tablet (/predicate). However, that fix only applied that check on the source Alpha group's leader. It should have been applied on each member of the Alpha group.

This PR fixes that by making the expected membership checksum part of the `CleanPredicate` proposal. That way, each Alpha can block until they see the latest membership update which says that they no longer serve that tablet, before deleting it. This fixes a bunch of `bank/move-tablet` failures, but we still see very rare `bank/move-tablet` violations.

Another fix this PR contains is to avoid doing posting list splits, which were causing `uid-set/move-tablet` failures, fixing #4538.

Changes:
* Martin's Split code has a bug. Let's not split posting lists.
* Better messaging around turning off split
* Make ExpectedChecksum part of the CleanPredicate proposal, so all followers also block deletion of the predicate until they have seen the latest membership update.
danielmai pushed a commit that referenced this pull request Jan 28, 2020
In #4496, we introduced the concept of waiting for latest membership status, before the source Alpha proposes deletion of the tablet (/predicate). However, that fix only applied that check on the source Alpha group's leader. It should have been applied on each member of the Alpha group.

This PR fixes that by making the expected membership checksum part of the `CleanPredicate` proposal. That way, each Alpha can block until they see the latest membership update which says that they no longer serve that tablet, before deleting it. This fixes a bunch of `bank/move-tablet` failures, but we still see very rare `bank/move-tablet` violations.

Another fix this PR contains is to avoid doing posting list splits, which were causing `uid-set/move-tablet` failures, fixing #4538.

Changes:
* Martin's Split code has a bug. Let's not split posting lists.
* Better messaging around turning off split
* Make ExpectedChecksum part of the CleanPredicate proposal, so all followers also block deletion of the predicate until they have seen the latest membership update.

(cherry picked from commit 2540a93)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants