Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UpdateDuties improvement 1 #5662

Closed
wants to merge 11 commits into from
Closed

UpdateDuties improvement 1 #5662

wants to merge 11 commits into from

Conversation

mcdee
Copy link
Contributor

@mcdee mcdee commented Apr 28, 2020

What type of PR is this?

Bug fix

What does this PR do? Why is it needed?

This PR stops some non-attesting calculation from being in the critical path for attesting.

At current UpdateDuties() calculates attestation subnets and aggregation function for current and next epoch before returning. It is possible for this to take a very long time if there is a high number of validators or slow CPU, due to the requirement to carry out two signatures per validator. At worst case it goes on for multiple slots, meaning the validator is not validating when it should.

This PR separates the subnet and aggregation piece in to a separate goroutine, to move it out of the critical path.

Which issues(s) does this PR fix?

There are a number of issues raised that show poor attestation numbers. This may or may not fix those individual issues, but it definitely fixes the situation where if the subnet calculation code takes too long to run it would result in missed proposals/attestations.

Other notes for review

This is part 1 of a three-part series. The three parts are:

  1. move calculation for subnets out of the critical path
  2. optimise calculation for subnets
  3. cache 'next epoch' subnet results and reuse them if appropriate

@mcdee mcdee requested a review from a team as a code owner April 28, 2020 13:21
@codecov
Copy link

codecov bot commented Apr 28, 2020

Codecov Report

Merging #5662 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #5662   +/-   ##
=======================================
  Coverage   59.87%   59.87%           
=======================================
  Files         312      312           
  Lines       26786    26786           
=======================================
  Hits        16039    16039           
  Misses       8604     8604           
  Partials     2143     2143           

Copy link
Member

@terencechain terencechain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you say that current_epoch is still a critical part? I think we all can agree next_epoch is not as critical

@mcdee
Copy link
Contributor Author

mcdee commented Apr 28, 2020

I wouldn't call it critical. I looked at it this way:

Proposals:

  • if UpdateDuties() doesn't finish until after the end of the first slot, any proposal will be missed
  • if the goroutine doesn't finish until after the end of the first slot, no worries

Attestations:

  • if UpdateDuties() doesn't finish until after the end of the first slot, any attestation will be late, and possibly not included depending on how late
  • if the goroutine doesn't finish until after the end of the first slot, the attestation will still be made and broadcast but relies on there being either a pre-existing subnet or another node one hop away that is subscribed to the subnet for it to make its way to the relevant nodes for aggregation

The consideration for proposals is easy. For attestations, I figure it is better to create the attestation and then potentially wait on the delivery of it than definitely wait on the creation (and even then have no guarantee that the delivery will be any better, given it can take time for the node to become a member of the subnet).

@terencechain
Copy link
Member

I wouldn't call it critical. I looked at it this way:

Proposals:

  • if UpdateDuties() doesn't finish until after the end of the first slot, any proposal will be missed
  • if the goroutine doesn't finish until after the end of the first slot, no worries

Attestations:

  • if UpdateDuties() doesn't finish until after the end of the first slot, any attestation will be late, and possibly not included depending on how late
  • if the goroutine doesn't finish until after the end of the first slot, the attestation will still be made and broadcast but relies on there being either a pre-existing subnet or another node one hop away that is subscribed to the subnet for it to make its way to the relevant nodes for aggregation

The consideration for proposals is easy. For attestations, I figure it is better to create the attestation and then potentially wait on the delivery of it than definitely wait on the creation (and even then have no guarantee that the delivery will be any better, given it can take time for the node to become a member of the subnet).

I agree with your reasoning. One last question, have you tested this change with attestant's validators?
This is something that can pass canary but fail run time in production

@mcdee
Copy link
Contributor Author

mcdee commented Apr 28, 2020

Great question. I've been running with a variant of this since day 2 of the new testnet, so I'm happy with the general idea. What I haven't done is run the code in this PR in isolation from some other changes (pretty much part 2 of the above). I'll set that up now and let it tick along for a day or so and report back.

@terencechain
Copy link
Member

Great question. I've been running with a variant of this since day 2 of the new testnet, so I'm happy with the general idea. What I haven't done is run the code in this PR in isolation from some other changes (pretty much part 2 of the above). I'll set that up now and let it tick along for a day or so and report back.

Sounds good, thanks! Ping us and we'll merge this

@terencechain terencechain added the Blocked Blocked by research or external factors label Apr 28, 2020
@terencechain
Copy link
Member

I also worked on steaming validator duty, see:
#5306
https://hackmd.io/KWBj0PCDQUuPKpLQnIiZbw?both

Unfortunately I just haven't had the bandwidth to drive it to completion so anyone is more than welcome to jump in, just let me know which part so we dont duplicate work here

CommitteeIds: subscribeCommitteeIDs,
IsAggregator: subscribeIsAggregator,
})
_, err = v.validatorClient.SubscribeCommitteeSubnets(ctx, &ethpb.CommitteeSubnetsSubscribeRequest{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please handle this error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do; thanks for the spot.

attesterSlot := duty.AttesterSlot
committeeIndex := duty.CommitteeIndex
go func() {
// Create a new context as the existing one will be canceled when the parent function exits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you break all of this out into another method? (All of the stuff in this anonymous go func

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do; if I break it out simply the function call will look a little messy, i.e. v.subscribeToSubnets(parentCtx, slot, validatingKeys, req) I was planning on a proper rewrite of this code to functions for the next part (with the removal of logging to a separate call there are two pieces of code that are virtually identical). Three options:

  1. leave the code inline; deal with it in the next PR
  2. use the function call as outlined above; tidy up in the next PR
  3. bring forward the changes for the second phase in to this PR

Happy with any of these; let me know which you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prestonvanloon any thoughts on the above?

@nisdas
Copy link
Member

nisdas commented May 12, 2020

@prestonvanloon can you take a look at this again ?

@rauljordan
Copy link
Contributor

Hi @mcdee, we have modified our duties implementation to be a server-side stream #5867, which the client listens to in the background. This new streaming approach solves the same problem as this PR, so we can close. Apologies for the long feedback cycle on this and thank you for the contribution nonetheless.

@rauljordan
Copy link
Contributor

These are still crucial:

1. optimise calculation for subnets
2. cache 'next epoch' subnet results and reuse them if appropriate

we won't be touching the duties code after the stream duties functionality, so those additions would be very much welcome. For now, closing this PR

@rauljordan rauljordan closed this May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocked Blocked by research or external factors
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants