
Dynamic Honey Badger sender queue #226

Closed

vkomenda opened this issue Sep 10, 2018 · 11 comments

@vkomenda (Contributor)

Issue #43 needs an extra bit of design for queueing and dequeueing Honey Badger (HB) messages. An instance of Dynamic Honey Badger (DHB) contains an embedded instance of HB. This embedded HB instance is restarted on DKG change or completion events. When the HB instance is restarted, any messages still queued for it should be stored in the DHB instance for later delivery.
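A minimal sketch of that containment and restart behaviour, with hypothetical names standing in for the real hbbft types:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real hbbft types, reduced to what the
// queueing problem needs.
struct HoneyBadger {
    epoch: u64,
}

struct Message; // placeholder for an HB message payload

struct DynamicHoneyBadger {
    // DHB epoch, bumped on every DKG change or completion event.
    start_epoch: u64,
    // The embedded HB instance, replaced on restart.
    honey_badger: HoneyBadger,
    // Messages still queued in the old HB instance at restart time, keyed
    // by (DHB start_epoch, HB epoch) for later delivery.
    postponed: BTreeMap<(u64, u64), Vec<Message>>,
}

impl DynamicHoneyBadger {
    // On a DKG change or completion event: stash the old instance's queued
    // messages at the DHB level, then start a fresh HB instance.
    fn restart_honey_badger(&mut self, queued: Vec<(u64, Message)>) {
        for (hb_epoch, msg) in queued {
            self.postponed
                .entry((self.start_epoch, hb_epoch))
                .or_insert_with(Vec::new)
                .push(msg);
        }
        self.start_epoch += 1;
        self.honey_badger = HoneyBadger { epoch: 0 };
    }
}
```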

@vkomenda (Contributor) commented Sep 11, 2018

In the case of the DHB sender queue, the extra complexity arises from managing the embedded HB instance. In particular, the messages sent by HB should be post-processed by the containing DHB. Post-processing makes it possible to queue HB messages beyond the restart of the embedded HB instance.

Two kinds of post-processing are required.

  1. Any EpochStarted(epoch) message from HB is annotated with the DHB epoch number start_epoch -- call it DynamicEpochStarted(start_epoch, epoch) -- and broadcast to all remote nodes. Remote nodes at DHB epoch start_epoch pass a message EpochStarted(epoch) to the embedded HB instance. Remote nodes at earlier DHB epochs drop the message. Remote nodes at later DHB epochs dequeue any queued outgoing messages for (start_epoch, epoch).
  2. The HB instance keeps a queue of messages for remote nodes with the same DHB start_epoch, so any HB message other than EpochStarted can be sent without delay. If, however, the queue is non-empty when the HB instance is restarted, it is appended to the DHB queue of postponed HB messages. This queue can be implemented as an outgoing DHB queue searchable by keys containing both a DHB start_epoch and an HB epoch; see the sketch after this list.
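A minimal sketch of such a searchable outgoing queue and the dequeue step from item 1, assuming a BTreeMap keyed by peer and epoch pair (all names hypothetical):

```rust
use std::collections::BTreeMap;

type NodeId = usize; // hypothetical peer identifier
struct Message; // placeholder for an HB message payload

// Outgoing DHB queue, searchable by peer and (DHB start_epoch, HB epoch),
// as item 2 suggests.
struct SenderQueue {
    outgoing: BTreeMap<(NodeId, u64, u64), Vec<Message>>,
}

impl SenderQueue {
    // Item 1's dequeue step: peer `id` has announced
    // DynamicEpochStarted(start_epoch, epoch), so release everything queued
    // for exactly that epoch pair.
    fn on_dynamic_epoch_started(
        &mut self,
        id: NodeId,
        start_epoch: u64,
        epoch: u64,
    ) -> Vec<Message> {
        self.outgoing
            .remove(&(id, start_epoch, epoch))
            .unwrap_or_default()
    }
}
```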

@afck (Collaborator) commented Sep 16, 2018

> Remote nodes at earlier DHB epochs drop the message.

I don't think they should drop it: If I receive DynamicEpochStarted(6, 8) from you, and I'm still at start_epoch = 2 and epoch = 5, then I know that:

  • My current HB instance won't need to send anything to you ever again. (Basically we could make it handle an EpochStarted(u64::MAX) or something… although if we literally do that we'll need to be careful to avoid + and use saturating_add everywhere on epochs.)
  • My next HB instance will need to handle the EpochStarted(8) message as soon as I create it.
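In code form, the dispatch for an incoming DynamicEpochStarted(start_epoch, epoch) might look like the following sketch (hypothetical names, not the hbbft API), combining the original design with the correction above:

```rust
// What to do with a peer's DynamicEpochStarted(start_epoch, epoch)
// announcement, depending on how their DHB epoch compares to ours.
enum Action {
    // Same DHB epoch: pass EpochStarted(epoch) to the embedded HB instance.
    ForwardToCurrentHb(u64),
    // The peer is ahead of us: our current HB instance need never send to
    // them again, and our *next* HB instance must handle
    // EpochStarted(epoch) as soon as it is created.
    DeferToNextHb(u64),
    // The peer is behind us: release any outgoing messages queued for this
    // epoch pair.
    DequeueOutgoing(u64, u64),
}

fn on_dynamic_epoch_started(
    our_start_epoch: u64,
    their_start_epoch: u64,
    their_epoch: u64,
) -> Action {
    if their_start_epoch == our_start_epoch {
        Action::ForwardToCurrentHb(their_epoch)
    } else if their_start_epoch > our_start_epoch {
        // Not dropped: remembered for the next HB instance.
        Action::DeferToNextHb(their_epoch)
    } else {
        Action::DequeueOutgoing(their_start_epoch, their_epoch)
    }
}
```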

@vkomenda (Contributor) commented Sep 17, 2018

> I don't think they should drop it

I agree. In fact, in the code, later epoch announcements are already taken into account. The initial design changes accordingly.

The epoch jump is a great optimisation. Or is it more than an optimisation in your view? I didn't add any epoch skips in this design because I thought it would be conceptually simpler to go through all queued messages. On the other hand, skipping the epochs in which there was no HB output makes perfect sense.

> Basically we could make it handle an EpochStarted(u64::MAX) or something…

Sure, I'd just use an enum with variants Epoch(u64) and Terminated. Or Option, although that would be less self-explanatory.
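For illustration, a sketch of such an enum (hypothetical name). Derived ordering on Rust enums follows declaration order, so Terminated compares greater than every Epoch(n), which gives the EpochStarted(u64::MAX) behaviour with no overflow or saturating_add concerns:

```rust
// Terminated means this HB instance will never request messages again; it
// orders above every concrete epoch by virtue of being declared last.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum HbEpoch {
    Epoch(u64),
    Terminated,
}

fn main() {
    // "Have you moved past epoch n?" checks keep working without any
    // u64::MAX arithmetic.
    assert!(HbEpoch::Terminated > HbEpoch::Epoch(u64::MAX));
    assert!(HbEpoch::Epoch(5) < HbEpoch::Epoch(8));
}
```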

@afck (Collaborator) commented Sep 17, 2018

I agree with using an enum.
Regarding epoch jumps: From the user perspective we shouldn't skip any epochs. What I meant was that if I receive an EpochStarted from you which tells me that you've already moved past epoch 5, I don't need to send you any messages for epoch 5 anymore.

@vkomenda (Contributor)

Agreed! By an epoch skip I meant starting in an HB epoch other than 0.

@vkomenda (Contributor) commented Sep 18, 2018

Target::All messages which are not epoch announcements may be split into individual Target::Node(id) messages if not all validators have reached the message epoch yet. That affects what observers receive, because in the current simulations observers get Target::All messages and nothing else; splitting as described above therefore hinders observers' progress. To circumvent that, a Target::All message should be split as follows:

  1. Create a Target::Node(id) message for any node id with an earlier epoch.

  2. Send the original Target::All message to all nodes, including observer nodes. Nodes at earlier or later epochs will discard that message outright. However, the message will later be delivered individually to the nodes currently at earlier epochs, as soon as they announce an update to the epoch of the message in question.
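A minimal sketch of that two-step splitting rule, assuming a map of peer epochs (all names hypothetical):

```rust
use std::collections::BTreeMap;

type NodeId = usize; // hypothetical peer identifier

#[derive(Clone)]
struct Message {
    epoch: u64,
}

enum Target {
    All,
    Node(NodeId),
}

fn split_target_all(
    msg: &Message,
    peer_epochs: &BTreeMap<NodeId, u64>,
) -> Vec<(Target, Message)> {
    let mut out = Vec::new();
    // Step 1: queue an individual copy for every peer still at an earlier
    // epoch; it is released once they announce the message's epoch.
    for (&id, &epoch) in peer_epochs {
        if epoch < msg.epoch {
            out.push((Target::Node(id), msg.clone()));
        }
    }
    // Step 2: send the original Target::All as well. Nodes at a
    // non-matching epoch discard it; nodes at the matching epoch
    // (including observers) handle it immediately.
    out.push((Target::All, msg.clone()));
    out
}
```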

A problem still remains: what if the observer node is not accepting the epoch of the Target::All message? In that case the observer node will discard the message. Possibly the simulation should take care of tracking the observer epoch, since it cannot be tracked from within the algorithm?

@vkomenda (Contributor)

To sum up the observer problem: letting Target::All messages through will tell the simulation which messages to forward to the observer. (As a side note, another option might be labelling messages whose payload should be sent to the observer by the simulation; if a message is split into Target::Node messages, pick one of those messages and label it.) However, letting Target::All messages through won't keep observer nodes up to date, because of the problem I outlined where the observer node is in a non-matching epoch. So this brings us to a contract with the simulation: it maintains an observer node, tracks the observer epoch, and sends the messages that require observer attention (labelled explicitly, or otherwise the payloads of Target::All messages) when the epochs match.

@afck (Collaborator) commented Sep 24, 2018

After some discussion I now think we do need to keep a map of observers in NetworkInfo, and add a mechanism to formally add observers. Or not do any of this and move all the responsibility to the user. (Or implement a separate layer for it, on top of DHB.)

I also realized that, regarding observers, we already make it the user's responsibility to keep an outgoing queue: the API requires that the user make sure the observer is sent all messages starting at a specific epoch, regardless of when they actually manage to establish a connection. That's inconsistent: either the user should have all the responsibility to manage the queues, or none.

Option 1: Observers need to register, user doesn't need to queue.

This proposal would remove the need to even queue outgoing messages for observers. (Or would it? What is the user supposed to do if a peer disconnects?)

Let's add a separate map with public keys to NetworkInfo that keeps track of all the observers, let DHB and HB keep track of the observers' epochs too, and treat them just like validators for the purpose of outgoing message queues. Also, for each node (observers and validators), let's keep track of an interval of epochs (from_epoch, to_epoch): (u64, u64), where from_epoch is the earliest epoch they are still interested in (all older messages to them can be discarded) and to_epoch is the earliest one they can't handle yet (all messages with epoch >= to_epoch need to be queued).
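Under that interval semantics, the per-node send/queue/discard decision would reduce to something like this sketch (hypothetical names):

```rust
enum Disposition {
    Discard, // msg_epoch < from_epoch: the node has moved past it
    Send,    // from_epoch <= msg_epoch < to_epoch: deliverable now
    Queue,   // msg_epoch >= to_epoch: hold until the node catches up
}

fn classify(msg_epoch: u64, from_epoch: u64, to_epoch: u64) -> Disposition {
    if msg_epoch < from_epoch {
        Disposition::Discard
    } else if msg_epoch < to_epoch {
        Disposition::Send
    } else {
        Disposition::Queue
    }
}
```

Note that setting a freshly joined observer's interval to (epoch, epoch) makes every message for them fall into the Queue case, which is exactly the intended behaviour.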
Observers also need to formally join the network:

  • Add AddObserver and RemoveObserver votes to DHB.
  • Whenever one of these gets the majority, don't start DKG; just add or remove the observer to/from the NetworkInfo in the next epoch, and set their current epoch to (epoch, epoch), i.e. start queueing all messages for them starting from the epoch in which they joined. Also return the JoinPlan for them.
  • Signal that event to the user so that they can start establishing the connection (if they haven't already done so before casting the vote) and send the JoinPlan to the new observer.
  • Add a DynamicHoneyBadger::connect method that returns the current EpochStarted message: this needs to be sent to a peer whenever we establish a connection. That way, once we connect, we will receive the observer's EpochStarted (and vice-versa) and start sending messages to them.

We would probably even get rid of Target::All entirely, since we'd know all the recipients now. (Maybe we should use Rcs or annotate the messages with a set of recipients to avoid too much cloning.)

With this proposal, we could later implement the optimization where we replace a whole batch of queued messages to a lagging node with just a single message, containing the proof of that epoch's outcome.

Option 2: Make it the user's responsibility to do all of the above.

Expose all information about a message's epoch (both DHB and HB) and an instance's current epoch(s) to the user, so they can implement the queue. The AddObserver/RemoveObserver logic could also be handled on the application level, as long as it is guaranteed that all nodes agree on which messages to send to the new observer (as the current code requires).

Calling handle_message with a message from a future epoch that we're not ready for would be an error.
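As a sketch of that contract (hypothetical error type): since the user owns the queues under Option 2, DHB would simply reject such messages instead of queueing them internally:

```rust
#[derive(Debug)]
enum HandleError {
    // The message belongs to an epoch we have not reached yet; the user
    // should hold it back and retry once our epoch catches up.
    FutureEpoch { msg_epoch: u64, our_epoch: u64 },
}

fn check_message_epoch(msg_epoch: u64, our_epoch: u64) -> Result<(), HandleError> {
    if msg_epoch > our_epoch {
        return Err(HandleError::FutureEpoch { msg_epoch, our_epoch });
    }
    Ok(())
}
```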

@vkomenda (Contributor) commented Sep 24, 2018

It turns out that observers are not essential for making changes in the set of validator nodes. DHB can manage an internal set of nodes for which there is an ongoing vote for Change::Add. If we don't need observers for some other purpose, I suggest decommissioning them in the new test framework.

@c0gent (Contributor) commented Sep 24, 2018

I'm not entirely sure but I'd lean towards deprecating the concept of observers as well, leaving it to a higher layer.

We should probably implement the functionality described in Option 1 in a separate module in either hbbft or hydrabadger, structured in such a way that it can be integrated into Parity, or at least be used as a reference implementation.

@c0gent (Contributor) commented Sep 24, 2018

Actually if it could be added to hydrabadger experimentally for now that would be great. Keeping it separate might also help clarify the API required.
