
Estates General: Rework DKG into a clearer state machine structure #104

Merged · 13 commits into master · May 17, 2018

Conversation

@Shadowfiend (Contributor) commented May 10, 2018

In France under the Old Regime, the Estates General (French: États généraux) or States-General was a legislative and consultative assembly (see The Estates) of the different classes (or estates) of French subjects. It had a separate assembly for each of the three estates, which were called and dismissed by the king. It had no true power in its own right—unlike the English parliament it was not required to approve royal taxation or legislation—instead it functioned as an advisory body to the king, primarily by presenting petitions from the various estates and consulting on fiscal policy. The Estates General met intermittently until 1614 and only once afterwards, but was not definitively dissolved until after the French Revolution.

keyGenerationState is a new interface that represents one state in key
generation. Each state knows the max block wait, how to initialize the state,
the next state at any given moment, and how to process a single message
from the broadcast channel.
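Roughly, the interface is shaped like this (a sketch for orientation; the receive parameter type here is illustrative, see states.go for the real declarations):

	type keyGenerationState interface {
		// initiate kicks the state off, sending any messages it needs to send.
		initiate() error
		// receive processes a single message from the broadcast channel.
		receive(msg net.Message) error
		// nextState returns the state to move to, or nil once the machine is done.
		nextState() keyGenerationState
		// groupMember returns this state's underlying member.
		groupMember() thresholdgroup.BaseMember
		// activePeriod is the maximum number of blocks this state may wait.
		activePeriod() int
	}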

dkg.go is now a relatively simple loop that steps through the states until
either a block timeout occurs, a message handler returns an error, or the
next state is nil. When the next state is nil, the state machine is considered
finished, and we attempt to extract a thresholdgroup.Member from the
final state. If that doesn't work, we fail key generation.
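In shape, the loop is something like this (a sketch, not the merged code: it assumes the interface above, a BlockWaiter that hands back a bare channel, and illustrative error messages and member extraction):

	currentState := initialState
	if err := currentState.initiate(); err != nil {
		return nil, err
	}
	blockWaiter := blockCounter.BlockWaiter(currentState.activePeriod())
	for {
		select {
		case msg := <-recvChan:
			// A failing message handler fails key generation.
			if err := currentState.receive(msg); err != nil {
				return nil, err
			}
			if next := currentState.nextState(); next == nil {
				// nil next state: the machine is finished; extract the final
				// member or fail key generation.
				if member, ok := currentState.groupMember().(*thresholdgroup.Member); ok {
					return member, nil
				}
				return nil, fmt.Errorf("final state [%T] had no final member", currentState)
			} else if next != currentState {
				currentState = next
				if err := currentState.initiate(); err != nil {
					return nil, err
				}
				blockWaiter = blockCounter.BlockWaiter(currentState.activePeriod())
			}
		case <-blockWaiter:
			// Block timeout: the state failed to complete in its active period.
			return nil, fmt.Errorf(
				"failed to complete state [%T] inside active period [%v]",
				currentState, currentState.activePeriod(),
			)
		}
	}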

Logging (still done via printlns) is handled by the state machine rather
than by the individual states, for now.

This is a pretty massive PR, but it's hopefully more bark than bite. There's a
decent amount of method declaration boilerplate in states.go, and you can
ignore all deleted code in dkg.go since ExecuteDKG has been completely
rewritten.

See #8.

This interface only provides access to a string MemberID, but allows
this ID to be accessible on all member types.

There is currently only one such state, which immediately ends the state
machine. However, this sets the groundwork for adding the remaining
states back.

The state machine itself enforces an active block period. If a state
does not complete within that period, the key generation process fails
and an error is returned. When a state is final, indicated by returning
a nil next state, it is expected to have a valid thresholdgroup.Member.
If it does not, the key generation process fails and an error is
returned.

State transitions are expected to occur as a result of a network
message.

Finally, incoming messages are serialized on an unbuffered channel so
states don't have to manage synchronization overhead.

These states transition between each other and allow the completion of a
full distributed key generation process.
@Shadowfiend requested a review from a team on May 10, 2018.
@l3x (Contributor) previously approved these changes May 16, 2018 and left a comment:

This is like reviewing Mozart's work. So good. 🙌 Only minor suggestions to possibly improve readability.

)

stateTransition := func() error {
	fmt.Printf(

@l3x (Contributor):

Consider using log.Printf because fmt.Printf seems to be more for output, whereas this is a server app and we're just logging output. Plus, not that we need it here, but log has more features. (For all instances of fmt.Printf or fmt.Errorf below.)
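For example (identifiers illustrative):

	log.Printf("[member:%v] transitioning to state [%T]", memberID, nextState)

log also writes to stderr with a timestamp prefix by default, which suits a server app.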

@l3x (Contributor):

Also, the following lines (38-39) can be removed:

	// FIXME Need a way to time out in a given stage, especially the waiting
	//       ones.

@l3x (Contributor):

Consider adding tests. For example:

1. Testing with a few different variations of Threshold and GroupSize for "Did we get it?"
2. Testing for proper state transitions: Initialization > join > commitment > sharing > accusing > justifying > keyed.
3. Testing signing messages using the shared keys.
4. Etc.
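A transition test might be shaped like this (the fixture helper is hypothetical; the real states will need channel and member setup):

	func TestJoinStateTransition(t *testing.T) {
		// newReadyJoinState is a hypothetical fixture: a joinState whose
		// member reports ReadyForSharing() == true.
		js := newReadyJoinState(t)

		next := js.nextState()
		if _, stillJoining := next.(*joinState); stillJoining {
			t.Errorf("expected to leave joinState once ready for sharing")
		}
	}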

@Shadowfiend (Contributor, Author):

Good catch re: the FIXME (I noticed this when I was talking to y'all about the code yesterday)!

Logging is something we still need to figure out, but we absolutely want to be doing that instead of printing in the long run. Excellent point.

Lastly, tests: my plan is to do a separate PR with tests. There's both opportunity for unit testing (of the states) and integration testing (of the whole flow) here, but this PR was so long and the testing work has enough open questions that I want it reviewed separately. I've been mulling approaches the past couple of days, hope to get something out in that dept sometime this week.

	initiate() error
	groupMember() thresholdgroup.BaseMember
	// activePeriod is the period during which this state is active, in blocks.
	activePeriod() int

@l3x (Contributor):

Consider renaming to numBlocks or numBlocksPerState.

The following is a little more intuitive to me:

blockWaiter = blockCounter.BlockWaiter(currentState.numBlocksPerState())

than

blockWaiter = blockCounter.BlockWaiter(currentState.activePeriod())

Similarly, in the comments, I prefer this:

// initializationState is the starting state of key generation; it waits for
// numBlocksPerState and then enters joinState. No messages are valid in this state.
type initializationState struct {
	channel net.BroadcastChannel
	member  *thresholdgroup.LocalMember
}

rather than

// initializationState is the starting state of key generation; it waits for
// activePeriod and then enters joinState. No messages are valid in this state.
type initializationState struct {
	channel net.BroadcastChannel
	member  *thresholdgroup.LocalMember
}

@Shadowfiend (Contributor, Author):

I threw this around a little when I was writing the code. The problem is that numBlocks is unclear on what the number of blocks refers to, and numBlocksPerState as a property of a single state doesn't sound right.

A hybrid, perhaps: activeBlocks?

@Shadowfiend (Contributor, Author):

(Also worth noting… At some point we'll need to deal with the fact that the active number of blocks may vary between chains, for example… But that's for another day.)

@l3x (Contributor):

activeBlocks 👍 I like it.

	return is.member
}

func (is *initializationState) activePeriod() int { return 15 }

@l3x (Contributor):

Should this value 15 be read from the environment?

@l3x (Contributor) commented May 16, 2018:

We should probably log a message explaining to the user why a state appears to be hung due to an activePeriod being too low. Something like, "Your activePeriod has been exceeded. Consider increasing it and try again."

@Shadowfiend (Contributor, Author):

If it's read from an environment, that environment will be the chain. It needs to be a system constant, as this is a synchronization mechanism across the network. Increasing it is not an option for any individual client.
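If anything, we could name the constant to make that explicit, something like (illustrative, not part of this PR):

	// initializationActiveBlocks is a network-wide constant: every client
	// must use the same value, or DKG desynchronizes across the group.
	const initializationActiveBlocks = 15

	func (is *initializationState) activePeriod() int {
		return initializationActiveBlocks
	}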

@Shadowfiend (Contributor, Author):

Oh, also, there is no “hanging” due to low active period here. When the active period expires, if we're not ready to go to the next state, ExecuteDKG returns an error.

@l3x (Contributor):

That's why I just woke up in a panic. While sleeping it occurred to me that exceeding the active period does, indeed, return an error rather than hang.

Maybe we should add a hint to possibly help the user resolve the error?

. . .
[member:index 231] Failed to run DKG: [failed to complete state [*dkg.joinState] inside active period [1]].
panic: Failed to reach group size during DKG, aborting.
hint:  Increase your Threshold value and try again.

@Shadowfiend (Contributor, Author):

There are no things under the user's control that can fix this. They don't control threshold, the system does. They don't control the active period, the system does. If they change the active period locally, they will prevent DKG from completing correctly across the group because they won't be synchronized with other group members.

Remember this code is currently all running in one process, but in practice it will be running across many machines.


type keyGenerationState interface {
	initiate() error
	groupMember() thresholdgroup.BaseMember

@l3x (Contributor):

Consider moving groupMember() and activePeriod() below nextState()

@Shadowfiend (Contributor, Author):

Any particular reason?

@l3x (Contributor):

The other methods (initiate, receive, nextState) are more significant in that they do things, rather than groupMember and activePeriod which are property getters. I tend to put the more important things above others. Just a preference.

@Shadowfiend (Contributor, Author):

Importance is relatively vague, IMO, but I see what you're getting at. I prefer properties above actions, but agree that here we mix the two and that's not necessarily ideal. I'll adjust accordingly.
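Concretely, something like this (actions first, getters after; receive signature as elsewhere in this thread):

	type keyGenerationState interface {
		initiate() error
		receive(msg net.Message) error
		nextState() keyGenerationState

		groupMember() thresholdgroup.BaseMember
		activePeriod() int
	}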

@l3x (Contributor) commented May 16, 2018:

Thank you for considering my preference. Another reason why I like the getters at the bottom is because in my IDE (GoLand), the Structure view of a source file puts fields below methods:

[screenshot: GoLand's Structure view]

	return nil
}

return fmt.Errorf("unexpected message for join state: [%#v]", msg)

@l3x (Contributor):

Consider using log.Errorf and capitalizing "Unexpected ...

@Shadowfiend (Contributor, Author):

I'm intentionally not printing or logging these errors (side effects), but rather returning them as proper error values for handling by the caller. That's also why this isn't a complete sentence: error strings shouldn't be, as they are typically logged alongside other content.

EDIT: see https://github.com/golang/go/wiki/CodeReviewComments#error-strings re: error strings. Note that our linter should be catching this.

@l3x (Contributor):

Thank you for the code review reference link!

}

func (js *joinState) nextState() keyGenerationState {
	if js.member.ReadyForSharing() {

@l3x (Contributor):

Consider renaming js.member.ReadyForSharing() to js.member.JoinGroupComplete() to be consistent with others: CommitmentsComplete() and SharesComplete()

		make(map[bls.ID]struct{}),
		as.expectedAccusationCount,
	}
}

@l3x (Contributor):

Consider adding new (blank) line above the return statement.

func (js *justifyingState) nextState() keyGenerationState {
	if len(js.seenJustifications) == js.expectedJustificationCount {
		return &keyedState{js.member.FinalizeMember()}
	}

@l3x (Contributor):

Consider adding new (blank) line above the return statement.

@Shadowfiend (Contributor, Author):

Pushed a few tweaks from the above review; have a few follow-up questions above that will trigger another tweak or two.

To clarify that the active period is in blocks, we rename activePeriod
to activeBlocks for all states.

Existing style used full-word method receivers and often used value
receivers instead of pointer receivers. The switch to short receivers
and pointer receivers gives us some gains in efficiency, consistency,
and brevity that are aligned with our style guidelines.

@Shadowfiend (Contributor, Author):

Added a few more stylistic tweaks that I'd had backed up in my brain.

@pschlump (Contributor) left a comment:

The code is really clean and clear.

@pschlump (Contributor):

Adding some instrumentation to the state machine

It is unusually difficult to test long-running state machines. I have built
a number of them that had to run at remote locations without any local
maintenance. The biggest takeaway from all of this is the necessity of
having a message system where you can send a "context"-changing
message to the machine and have it either report back or log information
while it is running.

In dkg.go, add (something similar to):


type StateContext struct {
	lockContext sync.RWMutex           // guards ContextInfo
	ContextInfo map[string]interface{} // place to store arbitrary, usage-defined context information
}
...
ctx := StateContext{ContextInfo: make(map[string]interface{})}

Change

	  case msg := <-recvChan:
		...
		err := currentState.receive(msg)
		...
		nextState := currentState.nextState()

To

	  case msg := <-recvChan:
		...
		err := currentState.receive(ctx, msg)
		...
		nextState := currentState.nextState(ctx)

You would want to have the 'ctx' passed as the 1st parameter to each of the state
handling functions.

You will need the ability for a message handler to return a nextState that is the
current state; this would be a no-op state transition. You will also want a tool
to inject specific messages into the system. Message injection can then be used
for testing: this is how we can test a bad message, or how we can simulate an
unusually high message volume. It is also useful for testing out-of-order
messages and how the machine behaves.

Examples of useful new messages include:

  1. Create a message that can turn on/off logging.
  2. Create a message that will log the current state that the machine is in and how long it has been in that state.
  3. Track the amount of time that a state is using.
  4. Change the destination that logging information is going to (log rotation or send to a named pipe).
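To make one of these concrete (names illustrative):

	// reportStateMessage asks the running machine which state it is in and
	// how long it has been there; the reply channel keeps the exchange
	// synchronous.
	type reportStateMessage struct {
		replyCh chan<- string
	}

	// Handled inside the select loop (ctlChan and enteredAt illustrative):
	case ctl := <-ctlChan:
		switch m := ctl.(type) {
		case reportStateMessage:
			m.replyCh <- fmt.Sprintf(
				"in state [%T] for [%v]", currentState, time.Since(enteredAt),
			)
		}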

@Shadowfiend (Contributor, Author):

Nice! A few thoughts:

  • This state machine isn't actually very long-running. It should typically complete within, in the current configuration, ~90 blocks. In practice it'll probably be fewer, since Ethereum blocks take longer to happen. There is a possibility that multiple instances of the state machine will run concurrently, however.
  • Our states are guaranteed to terminate after the given number of blocks, unless the block counter itself gets buggy. We definitely need to make sure we stay aware of whether the block counter falls over (and may even want to panic if that happens, depending on how important it is that it be functioning).
  • We currently do all our logging at the state machine level (i.e., not inside the actual states). This does have some shortcomings from a debugging perspective, but is related to the mutable state point below.
  • Re: the mutex, the state machine currently runs (intentionally) single-threaded. The higher up we can deal with threads and mutexes, the simpler our lower-down code can be (though of course we potentially compromise performance).

Some more philosophical constraints:

  • I don't like injecting mutable state and side effects throughout our system. Our challenge will be finding a good way to percolate any information we want up to a manager who deals with the side effects (in this case, the ExecuteDKG function). Logging is a side-effect, mutable contexts are a side-effect, etc.
  • I would rather avoid interface{} if we can. Better to (for example) define the messages we want, and give them a common parent.
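For example (a sketch; message names illustrative):

	// dkgMessage is a common parent for the messages we define, rather than
	// passing interface{} around.
	type dkgMessage interface {
		isDKGMessage()
	}

	type joinMessage struct{ id bls.ID }

	func (joinMessage) isDKGMessage() {}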

I think the core goal you're presenting here actually applies to the overall system, and DKG is just a small subpart of it. And it's really part and parcel of the conversation on how we manage metrics and logging in general. I need to spin up an issue on that so we can discuss it a bit more as we start to pull it in.

Let's not put that in this particular PR, but I will spin up that issue and we can continue the conversation there.

@pschlump (Contributor):

Let's merge this - the code is excellent.

@pschlump merged commit 8bc8089 into master on May 17, 2018.
@Shadowfiend deleted the estates-general branch on May 17, 2018.
@Shadowfiend added this to the Relay Milestone 1 milestone on May 17, 2018.