Estates General: Rework DKG into a clearer state machine structure #104

Shadowfiend · 2018-05-10T14:31:01Z

In France under the Old Regime, the Estates General (French: États généraux) or States-General was a legislative and consultative assembly (see The Estates) of the different classes (or estates) of French subjects. It had a separate assembly for each of the three estates, which were called and dismissed by the king. It had no true power in its own right—unlike the English parliament it was not required to approve royal taxation or legislation—instead it functioned as an advisory body to the king, primarily by presenting petitions from the various estates and consulting on fiscal policy. The Estates General met intermittently until 1614 and only once afterwards, but was not definitively dissolved until after the French Revolution.

keyGenerationState is a new interface that represents one state in key
generation. Each state knows the max block wait, how to initialize the state,
the next state at any given moment, and how to process a single message
from the broadcast channel.
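
For orientation, the interface has roughly the following shape. This is a sketch pieced together from the diff excerpts quoted in the review below, not the code in states.go: `Message`, `BaseMember`, and its `MemberID` method are stand-ins for the real net and thresholdgroup types, the `receive` signature and the "return yourself to stay put" convention are assumptions, and `finalState` is a hypothetical state added only to show an implementation.

```go
package main

import "errors"

// Message stands in for whatever the broadcast channel delivers
// (an assumption; the real type is not shown in this PR excerpt).
type Message interface{}

// BaseMember stands in for thresholdgroup.BaseMember, which only
// exposes a string member id. The method name is an assumption.
type BaseMember interface {
	MemberID() string
}

// keyGenerationState represents one state in key generation.
type keyGenerationState interface {
	// activePeriod is the period during which this state is active, in blocks.
	activePeriod() int
	// initiate performs any work needed when the state is entered.
	initiate() error
	// receive processes a single message from the broadcast channel.
	receive(msg Message) error
	// nextState returns the state to move to. In this sketch a state
	// returns itself to stay put and nil when the machine is finished.
	nextState() keyGenerationState
	// groupMember returns the member held by this state.
	groupMember() BaseMember
}

// localMember is a hypothetical member type used only for illustration.
type localMember struct{ id string }

func (lm *localMember) MemberID() string { return lm.id }

// finalState is a hypothetical terminal state: it accepts no messages
// and has no next state.
type finalState struct{ member *localMember }

func (fs *finalState) activePeriod() int { return 0 }
func (fs *finalState) initiate() error   { return nil }
func (fs *finalState) receive(msg Message) error {
	return errors.New("unexpected message for final state")
}
func (fs *finalState) nextState() keyGenerationState { return nil }
func (fs *finalState) groupMember() BaseMember       { return fs.member }
```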

dkg.go is now a relatively simple loop that steps through the states until
either a block timeout occurs, a message handler returns an error, or the
next state is nil. When the next state is nil, the state machine is considered
finished, and we attempt to extract a thresholdgroup.Member from the
final state. If that doesn't work, we fail key generation.
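
The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the rewritten ExecuteDKG: `message`, `state`, `runStates`, and the `waitBlocks` callback are hypothetical names, a member is modelled as a plain string id, and the two toy states exist only to exercise the loop.

```go
package main

import (
	"errors"
	"fmt"
)

// message stands in for a broadcast channel message (hypothetical type).
type message string

// state mirrors the shape of keyGenerationState; names are illustrative.
type state interface {
	activeBlocks() int
	initiate() error
	receive(msg message) error
	nextState() state // itself to stay put, nil when finished
	memberID() string // empty if the state holds no usable member
}

// runStates steps through states until a block timeout fires, a message
// handler returns an error, or the next state is nil.
func runStates(
	start state,
	recv <-chan message,
	waitBlocks func(blocks int) <-chan struct{}, // hypothetical block waiter
) (string, error) {
	current := start
	if err := current.initiate(); err != nil {
		return "", err
	}
	timeout := waitBlocks(current.activeBlocks())

	for {
		next := current.nextState()
		if next == nil {
			// Finished: the final state must yield a member.
			if id := current.memberID(); id != "" {
				return id, nil
			}
			return "", errors.New("final state has no group member")
		}
		if next != current {
			// Transition: enter the new state and restart the block timer.
			if err := next.initiate(); err != nil {
				return "", err
			}
			timeout = waitBlocks(next.activeBlocks())
			current = next
			continue
		}

		select {
		case <-timeout:
			return "", fmt.Errorf(
				"failed to complete state [%T] inside active period [%v]",
				current, current.activeBlocks(),
			)
		case msg := <-recv:
			if err := current.receive(msg); err != nil {
				return "", err
			}
		}
	}
}

// joinState is a toy state that waits for a fixed number of messages.
type joinState struct{ seen, want int }

func (js *joinState) activeBlocks() int { return 15 }
func (js *joinState) initiate() error   { return nil }
func (js *joinState) receive(_ message) error {
	js.seen++
	return nil
}
func (js *joinState) nextState() state {
	if js.seen >= js.want {
		return &keyedState{id: "member-1"}
	}
	return js
}
func (js *joinState) memberID() string { return "" }

// keyedState is a toy final state holding the finished member id.
type keyedState struct{ id string }

func (ks *keyedState) activeBlocks() int { return 0 }
func (ks *keyedState) initiate() error   { return nil }
func (ks *keyedState) receive(msg message) error {
	return fmt.Errorf("unexpected message for keyed state: [%#v]", msg)
}
func (ks *keyedState) nextState() state { return nil }
func (ks *keyedState) memberID() string { return ks.id }
```

Checking for a transition at the top of each iteration means a state that is final as soon as it is entered ends the machine without waiting for another message.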

Logging (still as printlns at the moment) is done by the state machine rather
than the individual states, for now.

This is a pretty massive PR, but it's hopefully more bark than bite. There's a
decent amount of method declaration boilerplate in states.go, and you can
ignore all deleted code in dkg.go since ExecuteDKG has been completely
rewritten.

See #8.

This interface only provides access to a string MemberID, but allows this id to be accessible on all member types.

There is currently only one such state, which immediately ends the state machine. However, this sets the groundwork for adding the remaining states back. The state machine itself enforces an active block period. If a state does not complete within that period, the key generation process fails and an error is returned. When a state is final, indicated by returning a nil next state, it is expected to have a valid thresholdgroup.Member. If it does not, the key generation process fails and an error is returned. State transitions are expected to occur as a result of a network message. Finally, incoming messages are serialized on an unbuffered channel so states don't have to manage synchronization overhead.
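
A small sketch of why an unbuffered channel removes the need for locking inside the states: several senders feed one channel, and a single goroutine handles every message, so the handler's state is only ever touched from one place. `countMessages` and its parameters are hypothetical and unrelated to the real broadcast channel API.

```go
package main

import "sync"

// countMessages fans messages from several sender goroutines into one
// unbuffered channel and handles them on a single goroutine, so the
// handler state (seen) needs no mutex.
func countMessages(senders, perSender int) int {
	recv := make(chan string) // unbuffered: a send blocks until it is received

	var wg sync.WaitGroup
	for i := 0; i < senders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perSender; j++ {
				recv <- "msg"
			}
		}()
	}

	// Close the channel once every sender is done so the range below ends.
	go func() {
		wg.Wait()
		close(recv)
	}()

	seen := 0 // touched only by the receiving goroutine
	for range recv {
		seen++
	}
	return seen
}
```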

These states transition between each other and allow the completion of a full distributed key generation process.

l3x

This is like reviewing Mozart's work. So good. 🙌 Only minor suggestions to possibly improve readability.

l3x · 2018-05-16T02:47:39Z

pkg/beacon/relay/dkg/dkg.go

+	)
+
+	stateTransition := func() error {
+		fmt.Printf(


Consider using log.Printf because fmt.Printf seems to be more for output, whereas this is a server app and we're just logging output. Plus, not that we need it here, but log has more features. (For all instances of fmt.Printf or fmt.Errorf below.)

Also, the following lines (38-39) can be removed:

// FIXME Need a way to time out in a given stage, especially the waiting
// ones.

Consider adding tests. For example:
1) testing with a few different variations of Threshold and GroupSize for "Did we get it?"
2) testing for proper state transitions: Initialization > join > commitment > sharing > accusing > justifying > keyed
3) testing signing messages using the shared keys
4) etc.

Good catch re: the FIXME (I noticed this when I was talking to y'all about the code yesterday)!

Logging---we still need to figure out, but absolutely we want to be doing that instead of printing in the long run, excellent point.

Lastly, tests: my plan is to do a separate PR with tests. There's both opportunity for unit testing (of the states) and integration testing (of the whole flow) here, but this PR was so long and the testing work has enough open questions that I want it reviewed separately. I've been mulling approaches the past couple of days, hope to get something out in that dept sometime this week.

l3x · 2018-05-16T03:00:23Z

pkg/beacon/relay/dkg/states.go

+	initiate() error
+	groupMember() thresholdgroup.BaseMember
+	// activePeriod is the period during which this state is active, in blocks.
+	activePeriod() int


Consider renaming to numBlocks or numBlocksPerState.

The following is a little more intuitive to me:

blockWaiter = blockCounter.BlockWaiter(currentState.numBlocksPerState())

than

blockWaiter = blockCounter.BlockWaiter(currentState.activePeriod())

Similarly, in the comments, I prefer this:

// initializationState is the starting state of key generation; it waits for
// numBlocksPerState and then enters joinState. No messages are valid in this state.
type initializationState struct {
	channel net.BroadcastChannel
	member  *thresholdgroup.LocalMember
}

rather than

// initializationState is the starting state of key generation; it waits for
// activePeriod and then enters joinState. No messages are valid in this state.
type initializationState struct {
	channel net.BroadcastChannel
	member  *thresholdgroup.LocalMember
}

I threw this around a little when I was writing the code. The problem is that numBlocks is unclear on what the number of blocks refers to, and numBlocksPerState as a property of a single state doesn't sound right.

A hybrid, perhaps: activeBlocks?

(Also worth noting… At some point we'll need to deal with the fact that the active number of blocks may vary between chains, for example… But that's for another day.)

activeBlocks 👍 I like it.

l3x · 2018-05-16T03:01:18Z

pkg/beacon/relay/dkg/states.go

+	return is.member
+}
+
+func (is *initializationState) activePeriod() int { return 15 }


Should this value 15 be read from the environment?

We should probably log a message explaining to the user why a state appears to be hung due to an activePeriod being too low. Something like, "Your activePeriod has been exceeded. Consider increasing it and try again."

If it's read from an environment, it will be the chain. It needs to be a system constant, as this is a synchronization mechanism across the network. Increasing it is not an option for any individual client.

Oh, also, there is no “hanging” due to low active period here. When the active period expires, if we're not ready to go to the next state, ExecuteDKG returns an error.

That's why I just woke up in a panic. While sleeping it occurred to me that exceeding the active period does, indeed, return an error rather than hang.

Maybe we should add a hint to possibly help the user resolve the error?

. . .
[member:index 231] Failed to run DKG: [failed to complete state [*dkg.joinState] inside active period [1]].
panic: Failed to reach group size during DKG, aborting.
hint: Increase your Threshold value and try again.

There are no things under the user's control that can fix this. They don't control threshold, the system does. They don't control the active period, the system does. If they change the active period locally, they will prevent DKG from completing correctly across the group because they won't be synchronized with other group members.

Remember this code is currently all running in one process, but in practice it will be running across many machines.

l3x · 2018-05-16T03:02:09Z

pkg/beacon/relay/dkg/states.go

+
+type keyGenerationState interface {
+	initiate() error
+	groupMember() thresholdgroup.BaseMember


Consider moving groupMember() and activePeriod() below nextState()

Any particular reason?

The other methods (initiate, receive, nextState) are more significant in that they do things, rather than groupMember and activePeriod which are property getters. I tend to put the more important things above others. Just a preference.

Importance is relatively vague, IMO, but I see what you're getting at. I prefer properties above actions, but agree that here we mix the two and that's not necessarily ideal. I'll adjust accordingly.

Thank you for considering my preference. Another reason why I like the getters at the bottom is b/c in my IDE (Goland) the Structure view of a source file puts fields below methods.

l3x · 2018-05-16T03:05:36Z

pkg/beacon/relay/dkg/states.go

+		return nil
+	}
+
+	return fmt.Errorf("unexpected message for join state: [%#v]", msg)


Consider using log.Errorf and capitalizing "Unexpected ...".

I'm intentionally not printing or logging these errors (side-effects), but rather returning them as proper error values for handling by the caller. That's also why this isn't a complete sentence---error strings shouldn't be, as they are typically logged alongside other content.

EDIT: see https://github.com/golang/go/wiki/CodeReviewComments#error-strings re: error strings. Note that our linter should be catching this.
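
As an illustration of the convention being described: the handler returns a lowercase, unpunctuated error value, and the caller decides how to wrap or log it. This is a sketch, not the code in states.go; the `interface{}` parameter, the `joinMessage` type, and the `handle` caller are assumptions made so the example is self-contained.

```go
package main

import "fmt"

// joinMessage is a hypothetical message type accepted by the join state.
type joinMessage struct{ id string }

// receive returns an error value instead of printing or logging. The
// string is lower case with no trailing punctuation so it reads well
// when a caller embeds it in a longer message.
func receive(msg interface{}) error {
	if _, ok := msg.(*joinMessage); ok {
		return nil
	}

	return fmt.Errorf("unexpected message for join state: [%#v]", msg)
}

// handle is a hypothetical caller that adds its own context to the
// error; any logging side effect would live at this level or above.
func handle(msg interface{}) error {
	if err := receive(msg); err != nil {
		return fmt.Errorf("failed to run DKG: [%v]", err)
	}

	return nil
}
```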

Thank you for the code review reference link!

l3x · 2018-05-16T03:09:32Z

pkg/beacon/relay/dkg/states.go

+}
+
+func (js *joinState) nextState() keyGenerationState {
+	if js.member.ReadyForSharing() {


Consider renaming js.member.ReadyForSharing() to js.member.JoinGroupComplete() to be consistent with others: CommitmentsComplete() and SharesComplete()

l3x · 2018-05-16T03:19:09Z

pkg/beacon/relay/dkg/states.go

+			make(map[bls.ID]struct{}),
+			as.expectedAccusationCount,
+		}
+	}


Consider adding new (blank) line above the return statement.

l3x · 2018-05-16T03:20:03Z

pkg/beacon/relay/dkg/states.go

+func (js *justifyingState) nextState() keyGenerationState {
+	if len(js.seenJustifications) == js.expectedJustificationCount {
+		return &keyedState{js.member.FinalizeMember()}
+	}


Consider adding new (blank) line above the return statement.

Shadowfiend · 2018-05-16T13:50:45Z

Pushed a few tweaks from the above review; have a few follow-up questions above that will trigger another tweak or two.

ReadyForSharing wasn't particularly descriptive on what made the member ready for sharing.

To clarify that the active period is in blocks, we rename activePeriod to activeBlocks for all states.

Existing style used full-word method receivers and often used value receivers instead of pointer receivers. The switch to short receivers and pointer receivers gives us some gains in efficiency, consistency, and brevity that are aligned with our style guidelines.
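
A minimal sketch of the difference between the two receiver styles, using a hypothetical `messageCounter` type that does not appear in this PR:

```go
package main

// messageCounter is a hypothetical type used only to contrast receivers.
type messageCounter struct{ seen int }

// incByValue uses a full-word value receiver: it increments a copy, so
// the caller's messageCounter is left unchanged.
func (counter messageCounter) incByValue() { counter.seen++ }

// inc uses a short pointer receiver: it increments the caller's
// messageCounter and avoids copying the struct on each call.
func (mc *messageCounter) inc() { mc.seen++ }
```

With a value receiver the mutation is silently lost, which is the kind of inconsistency a uniform pointer-receiver style avoids.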

Shadowfiend · 2018-05-16T16:02:19Z

Added a few more stylistic tweaks that I'd had backed up in my brain.

pschlump

The code is really clean and clear.

pschlump · 2018-05-16T16:31:49Z

Adding some instrumentation to the state machine

It is unusually difficult to test long running state machines. I have built
a number of them that had to run at remote locations without any local
maintenance. The biggest takeaway from all of this is the necessity
for having a message system where you can send a "context" changing
message to the machine and have it either report back or log information
while it is running.

In dkg.go add ( something similar to ):


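// requires: import "sync"
type StateContext struct {
	lockContext sync.RWMutex           // presumably guards ContextInfo
	ContextInfo map[string]interface{} // place to store arbitrary usage-defined context information
}
...
// Use a pointer so the RWMutex is never copied when ctx is passed around.
ctx := &StateContext{ContextInfo: make(map[string]interface{})}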

Change

	  case msg := <-recvChan:
		...
		err := currentState.receive(msg)
		...
		nextState := currentState.nextState()

To

	  case msg := <-recvChan:
		...
		err := currentState.receive(ctx,msg)
		...
		nextState := currentState.nextState(ctx)

You would want to have the 'ctx' passed as the 1st parameter to each of the state
handling functions.

You will need the ability for a message handler to return a nextState that is the
current state. This would be a no-op state transition. Also a tool to inject
specific messages into the system. The message injection can then be used for
testing - this is how we can test a bad message - or how we can simulate an
unusually high message volume. This is also useful for testing out-of-order
messages and how the machine behaves.

Examples of useful new messages include:

Create a message that can turn on/off logging.
Create a message that will log the current state that the machine is in and how long it has been in that state.
Track the amount of time that a state is using.
Change the destination that logging information is going to (log rotation or send to a named pipe).

Shadowfiend · 2018-05-16T16:46:30Z

Nice! A few thoughts:

This state machine isn't actually very long-running. It should typically complete within, in the current configuration, ~90 blocks. In practice it'll probably be fewer, since Ethereum blocks take longer to happen. There is a possibility that multiple instances of the state machine will run concurrently, however.
Our states are guaranteed to terminate after the given number of blocks, unless the block counter itself gets buggy. We definitely need to make sure we stay aware of whether the block counter falls over (and may even want to panic if that happens, depending on how important it is that it be functioning).
We currently do all our logging at the state machine level (i.e., not inside the actual states). This does have some shortcomings from a debugging perspective, but is related to the mutable state point below.
Re: the mutex, the state machine currently runs (intentionally) single-threaded. The higher up we can deal with threads and mutexes, the simpler our lower-down code can be (though of course we potentially compromise performance).

Some more philosophical constraints:

I don't like injecting mutable state and side effects throughout our system. Our challenge will be finding a good way to percolate any information we want up to a manager who deals with the side effects (in this case, the ExecuteDKG function). Logging is a side-effect, mutable contexts are a side-effect, etc.
I would rather avoid interface{} if we can. Better to (for example) define the messages we want, and give them a common parent.
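
One way to read "define the messages we want, and give them a common parent" is an interface with an unexported marker method, sketched below. All names here (`dkgMessage`, `joinMessage`, `accusationMessage`, `describe`) are hypothetical and chosen for illustration only.

```go
package main

import "fmt"

// dkgMessage is a hypothetical common parent for protocol messages. The
// unexported marker method means only types in this package can be one.
type dkgMessage interface{ isDKGMessage() }

type joinMessage struct{ memberID string }
type accusationMessage struct{ accuser, accused string }

func (*joinMessage) isDKGMessage()       {}
func (*accusationMessage) isDKGMessage() {}

// describe accepts only dkgMessage values, so passing an arbitrary
// value is a compile-time error rather than a runtime surprise.
func describe(msg dkgMessage) string {
	switch m := msg.(type) {
	case *joinMessage:
		return fmt.Sprintf("join from %s", m.memberID)
	case *accusationMessage:
		return fmt.Sprintf("%s accuses %s", m.accuser, m.accused)
	default:
		return "unknown message"
	}
}
```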

I think the core goal you're presenting here actually applies to the overall system, and DKG is just a small subpart of it. And it's really part and parcel of the conversation on how we manage metrics and logging in general. I need to spin up an issue on that so we can discuss it a bit more as we start to pull it in.

Let's not put that in this particular PR, but I will spin up that issue and we can continue the conversation there.

pschlump · 2018-05-17T00:27:05Z

Let's merge this - the code is excellent.

Shadowfiend added 4 commits May 10, 2018 09:59

Implement a common thresholdgroup.BaseMember interface

1042619

This interface only provides access to a string MemberID, but allows this id to be accessible on all member types.

Add remaining DKG states

0f4ebb5

These states transition between each other and allow the completion of a full distributed key generation process.

Clarify and wrap local chain.BlockCounter godocs

763cc1f

Shadowfiend requested a review from a team May 10, 2018 14:31

Shadowfiend mentioned this pull request May 15, 2018

Group joining and distributed key generation #8

Closed

4 tasks

Merge branch 'master' into estates-general

29da192

l3x previously approved these changes May 16, 2018


Shadowfiend added 4 commits May 16, 2018 09:51

Drop a fixed FIXME!

a42e4af

LocalMember.ReadyForSharing -> MemberListComplete

fd6b824

ReadyForSharing wasn't particularly descriptive on what made the member ready for sharing.

Formatting tweak, extra blank lines before returns

ff5e0e8

Drop an extraneous print

7431319

Shadowfiend dismissed l3x’s stale review via 7431319 May 16, 2018 13:51

Shadowfiend added 2 commits May 16, 2018 11:34

keyGenerationState.activePeriod -> activeBlocks

a43e9df

To clarify that the active period is in blocks, we rename activePeriod to activeBlocks for all states.

pschlump approved these changes May 16, 2018


mhluongo mentioned this pull request May 16, 2018

Threshold ECDSA: Research/prototype implementation #115

Closed

pschlump added 2 commits May 17, 2018 09:35

Merge branch 'master' into estates-general

1f9ae18

Merge branch 'master' into estates-general

463efe2

pschlump merged commit 8bc8089 into master May 17, 2018

Shadowfiend deleted the estates-general branch May 17, 2018 14:38

Shadowfiend added this to the Relay Milestone 1 milestone May 17, 2018
