-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estates General: Rework DKG into a clearer state machine structure #104
Conversation
This interface only provides access to a string MemberID, but allows this id to be accessible on all member types.
There is currently only one such state, which immediately ends the state machine. However, this sets the groundwork for adding the remaining states back. The state machine itself enforces an active block period. If a state does not complete within that period, the key generation process fails and an error is returned. When a state is final, indicated by returning a nil next state, it is expected to have a valid thresholdgroup.Member. If it does not, the key generation process fails and an error is returned. State transitions are expected to occur as a result of a network message. Finally, incoming messages are serialized on an unbuffered channel so states don't have to manage synchronization overhead.
These states transition between each other and allow the completion of a full distributed key generation process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like reviewing Mozart's work. So good. 🙌 Only minor suggestions to possibly improve readability.
) | ||
|
||
stateTransition := func() error { | ||
fmt.Printf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using log.Printf
because fmt.Printf
seems to be more for output, whereas this is a server app and we're just logging output. Plus, not that we need it here, but log has more features. (For all instances of fmt.Printf or fmt.Errorf below.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the following lines (38-39) can be removed:
// FIXME Need a way to time out in a given stage, especially the waiting
// ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding tests. For example, 1) testing with a few different variations of Threshold and GroupSize for "Did we get it?" 2) testing for proper state transitions: Initialization > join > commitment > sharing > accusing > justifying > keyed 3) testing signing messages using the shared keys 4) etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch re: the FIXME (I noticed this when I was talking to y'all about the code yesterday)!
Logging---we still need to figure out, but absolutely we want to be doing that instead of printing in the long run, excellent point.
Lastly, tests: my plan is to do a separate PR with tests. There's both opportunity for unit testing (of the states) and integration testing (of the whole flow) here, but this PR was so long and the testing work has enough open questions that I want it reviewed separately. I've been mulling approaches the past couple of days, hope to get something out in that dept sometime this week.
pkg/beacon/relay/dkg/states.go
Outdated
initiate() error | ||
groupMember() thresholdgroup.BaseMember | ||
// activePeriod is the period during which this state is active, in blocks. | ||
activePeriod() int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming to numBlocks
or numBlocksPerState
.
The following is a little more intuitive to me:
blockWaiter = blockCounter.BlockWaiter(currentState.numBlocksPerState())
than
blockWaiter = blockCounter.BlockWaiter(currentState.activePeriod())
Similarly, in the comments, I prefer this:
// initializationState is the starting state of key generation; it waits for
// numBlocksPerState and then enters joinState. No messages are valid in this state.
type initializationState struct {
channel net.BroadcastChannel
member *thresholdgroup.LocalMember
}
rather than
// initializationState is the starting state of key generation; it waits for
// activePeriod and then enters joinState. No messages are valid in this state.
type initializationState struct {
channel net.BroadcastChannel
member *thresholdgroup.LocalMember
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I threw this around a little when I was writing the code. The problem is that numBlocks
is unclear on what the number of blocks refers to, and numBlocksPerState
as a property of a single state doesn't sound right.
A hybrid, perhaps: activeBlocks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also worth noting… At some point we'll need to deal with the fact that the active number of blocks may vary between chains, for example… But that's for another day.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
activeBlocks
👍 I like it.
pkg/beacon/relay/dkg/states.go
Outdated
return is.member | ||
} | ||
|
||
func (is *initializationState) activePeriod() int { return 15 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this value 15
be read from the environment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably log a message explaining to the user why a state appears to be hung due to an activePeriod being too low. Something like, "Your activePeriod has been exceeded. Consider increasing it and try again."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's ready from an environment, it will be the chain. It needs to be a system constant, as this is a synchronization mechanism across the network. Increasing it is not an option for any individual client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, also, there is no “hanging” due to low active period here. When the active period expires, if we're not ready to go to the next state, ExecuteDKG
returns an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I just woke up in a panic. While sleeping it occurred to me that the exceeding the active period does, indeed, return an error rather than hang.
Maybe we should add a hint to possibly help the user resolve the error?
. . .
[member:index 231] Failed to run DKG: [failed to complete state [*dkg.joinState] inside active period [1]].
panic: Failed to reach group size during DKG, aborting.
hint: Increase your Threshold value and try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no things under the user's control that can fix this. They don't control threshold, the system does. They don't control the active period, the system does. If they change the active period locally, they will prevent DKG from completing correctly across the group because they won't be synchronized with other group members.
Remember this code is currently all running in one process, but in practice it will be running across many machines.
pkg/beacon/relay/dkg/states.go
Outdated
|
||
type keyGenerationState interface { | ||
initiate() error | ||
groupMember() thresholdgroup.BaseMember |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving groupMember()
and activePeriod()
below nextState()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other methods (initiate, receive, nextState) are more significant in that they do things, rather than groupMember and activePeriod which are property getters. I tend to put the more important things above others. Just a preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importance is relatively vague, IMO, but I see what you're getting at. I prefer properties above actions, but agree that here we mix the two and that's not necessarily ideal. I'll adjust accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil | ||
} | ||
|
||
return fmt.Errorf("unexpected message for join state: [%#v]", msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using log.Errorf
and capitalizing "Unexpected
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm intentionally not printing or logging these errors (side-effects), but rather returning them as proper error
values for handling by the caller. That's also why this isn't a complete sentence---error
strings shouldn't be, as they are typically logged alongside other content.
EDIT: see https://github.com/golang/go/wiki/CodeReviewComments#error-strings re: error strings. Note that our linter should be catching this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the code review reference link!
pkg/beacon/relay/dkg/states.go
Outdated
} | ||
|
||
func (js *joinState) nextState() keyGenerationState { | ||
if js.member.ReadyForSharing() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming js.member.ReadyForSharing()
to js.member.JoinGroupComplete()
to be consistent with others: CommitmentsComplete()
and SharesComplete()
make(map[bls.ID]struct{}), | ||
as.expectedAccusationCount, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding new (blank) line above the return statement.
func (js *justifyingState) nextState() keyGenerationState { | ||
if len(js.seenJustifications) == js.expectedJustificationCount { | ||
return &keyedState{js.member.FinalizeMember()} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding new (blank) line above the return statement.
Pushed a few tweaks from the above review; have a few follow-up questions above that will trigger another tweak or two. |
ReadyForSharing wasn't particularly descriptive on what made the member ready for sharing.
To clarify that the active period is in blocks, we rename activePeriod to activeBlocks for all states.
Existing style used full-word method receivers and often used value receivers instead of pointer receivers. The switch to short receivers and pointer receivers gives us some gains in efficiency, consistency, and brevity that are aligned with our style guidelines.
Added a few more stylistic tweaks that I'd had backed up in my brain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is really clean and clear.
Adding some instrumentation to the state machineIt is unusually difficult to test long running state machines. I have built In dkg.go add ( something similar to ):
Change
To
You would want to have the 'ctx' passed as the 1st parameter to each of the state You will need the ability for a message handler to return a nextState that is the Examples of useful new messages include:
|
Nice! A few thoughts:
Some more philosophical constraints:
I think the core goal you're presenting here actually applies to the overall system, and DKG is just a small subpart of it. And it's really part and parcel of the conversation on how we manage metrics and logging in general. I need to spin up an issue on that so we can discuss it a bit more as we start to pull it in. Let's not put that in this particular PR, but I will spin up that issue and we can continue the conversation there. |
Let's merge this - the code is excellent. |
keyGenerationState
is a new interface that represents one state in keygeneration. Each state knows the max block wait, how to initialize the state,
the next state at any given moment, and how to process a single message
from the broadcast channel.
dkg.go
is now a relatively simple loop that steps through the states untileither a block timeout occurs, a message handler returns an error, or the
next state is
nil
. When the next state isnil
, the state machine is consideredfinished, and we attempt to extract a
thresholdgroup.Member
from thefinal state. If that doesn't work, we fail key generation.
Logging (still as printlns at the moment) is done by the state machine rather
than the individual states, for now.
This is a pretty massive PR, but it's hopefully more bark than bite. There's a
decent amount of method declaration boilerplate in
states.go
, and you canignore all deleted code in
dkg.go
sinceExecuteDKG
has been completelyrewritten.
See #8.