Conversation

@halseth (Contributor) commented Nov 18, 2017

This PR adds AnnounceSignatures state to the gossiper, making it retry sending the message to the remote peer if the initial attempt fails. This state is also persisted across restarts, so the gossiper can resume retrying if need be.

This should help in the case where a peer goes offline before receiving the announcement signatures, which previously left us unable to craft the final channel proof even after the peer came back online.

@halseth halseth force-pushed the announce-signatures-resend branch 3 times, most recently from ec7264c to c53110a Compare November 23, 2017 08:47
server.go Outdated
Member

Note that it isn't the case that we always want an error to be returned when sending a message (or we may just not care about the error). One example is when we're broadcasting a set of messages: we don't really care if they are reliably sent; rather, it's mostly send and forget.

Member

Missing godoc comments on the three above fields.

Member

Each of the keys should have its own godoc comment explaining the key and its usage.

Member

indicated -> indicates

Member

Is it necessary to store the peer's ID at all? Seems that with just the channel ID, we have everything we need to properly dispatch the messages?

Contributor Author

Yeah, almost. By fetching the channel info for that particular channel ID we have the public keys of the channel's two peers, but I couldn't find a clever way of determining which of the two is the remote node and which is the local one.

If there's an elegant way of fetching this missing information, this could be simplified, and we wouldn't need to store the peer ID in the database.
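
For illustration, one possible approach, sketched under the assumption that the gossiper has access to its own node key (the selfKey field name is an assumption, not the actual code):

isFirstNode := bytes.Equal(
    d.selfKey.SerializeCompressed(),
    chanInfo.NodeKey1.SerializeCompressed(),
)

var remotePeer *btcec.PublicKey
if isFirstNode {
    remotePeer = chanInfo.NodeKey2
} else {
    remotePeer = chanInfo.NodeKey1
}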

Member

Should have a more descriptive name. Suggestion: getProof.

Member

Feels like this fragment can be simplified a good bit. I believe it's equivalent to:

if isFirstNode {
    remotePeer = chanInfo.NodeKey2
} else {
    remotePeer = chanInfo.NodeKey1
}

Or is there something I'm missing here?

Member

Shouldn't this be done first, or in tandem with adding the proof above? Otherwise, we could go down between writes, and then have no way to determine who originally sent the message.

Member

Can't parse this sentence currently. It should be re-written.

Member

It seems like there's a possible race between this and the goroutine launched below. This has the potential to cause the waiting proofs to stick around on disk longer than is strictly required. If we crash before we get to this point, then we never mark the proof as completed. As a result, it will never be deleted from disk.

A remedy would be to scan all waiting proofs on startup and delete each one iff, when we read the channel edge from disk, its announcement is already set. I think this works?

server.go Outdated
Contributor

Failed queueing*, or failed to queue, which is equivalength 😎

server.go Outdated
Contributor

This should utilize ErrServerShuttingDown :)

server.go Outdated
Contributor

I believe adding error handling in this way nerfs the potential for queuing/batching messages within the peer. Since this now requires an ACK through the error channel, another message will not be queued until the previous one in the batch has been successfully written out to the network. The server lock is held throughout the duration of this process, so it seems this would make all SendToPeer calls serial and blocking.

Contributor

@Roasbeef has suggested that we get around this by interpreting all SendToPeer calls as "critical", while all Broadcast calls can be treated as non-critical. Within the current design, the wait group wg will be non-nil in the event of a broadcast, and nil for direct sends. If we detect the send is a broadcast, it should be fine to just queue messages with no errChan and forget.
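
A rough sketch of how sendPeerMessages could implement this split; the wg-based broadcast detection comes from the comment above, while the remaining names and the queueMsg signature are assumptions:

func (s *server) sendPeerMessages(targetPeer *peer,
    msgs []lnwire.Message, wg *sync.WaitGroup) []chan error {

    // A non-nil wait group means this call was launched as part of
    // a broadcast, so signal completion when done.
    isBroadcast := wg != nil
    if isBroadcast {
        defer wg.Done()
    }

    var errChans []chan error
    for _, msg := range msgs {
        var errChan chan error
        if !isBroadcast {
            // Critical (direct) send: attach a buffered error
            // channel so the caller can observe failures, while
            // the writer never blocks on the send.
            errChan = make(chan error, 1)
            errChans = append(errChans, errChan)
        }

        // Broadcasts pass a nil errChan: queue and forget.
        targetPeer.queueMsg(msg, errChan)
    }

    return errChans
}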

@cfromknecht (Contributor) Nov 28, 2017

There are more optimizations possible in terms of batching up critical sends that contain multiple messages and minimizing the amount of time we hold the server's mutex. This has to do with first queueing the messages, and then reading back all the errors that were produced, if any, before returning to the caller. Note that the lock can be released after having queued the messages, but the invocation of SendToPeer should return only after receiving all potential errors.

It's a little less clear in this case what the caller should do in response to such an error; sending everything again seems to be the only safe option. I think these optimizations are beyond the scope of this PR, but figured I'd make a note of it in case we decide that we require more pipelining in this area.

peer.go Outdated
Contributor

Perhaps this should return an error directly that indicates whether or not we were able to queue the message. This would inform us as to whether we even need to receive on the errChan. Since this should only fail if the peer is currently shutting down, it would also allow us to short-circuit batched sends from the caller's side.
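
A sketch of that suggestion; the outgoingMsg struct and the queue internals here are assumptions, not the actual peer implementation:

func (p *peer) queueMsg(msg lnwire.Message, errChan chan error) error {
    select {
    case p.outgoingQueue <- outgoingMsg{msg: msg, errChan: errChan}:
        return nil

    case <-p.quit:
        // The message never entered the queue, so the caller knows
        // errChan will never fire and can abort the rest of the
        // batch immediately.
        return fmt.Errorf("peer shutting down")
    }
}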

Contributor

Is it safe to persist this if the counterparty hasn't ACK'd anything? Even if SendToPeer returns no error, that doesn't necessarily mean the other party has persisted its state regarding the sent message, e.g. it could crash before it writes SetCompleted. Whose responsibility is it to restart this sequence?

Contributor Author

This was written with the underlying assumption that when SendToPeer returns success, the message has been delivered to the peer. I'm not sure how we can actually be sure it is delivered, and introducing an ACK would deviate from the protocol? I think delivery+ACK is the responsibility of the transport in this case, and we should make changes there (server+peer+brontide) to make sure the message is actually delivered.

Contributor

It seems like these should always be retransmitted on startup, unless the channel is already being advertised, as per Laolu's comment. Hopefully this ends up being simpler, as we won't need to persistently track Sent and Completed status!

Contributor

Do we ever read this variable? AFAICT it only ever gets assigned.

@halseth halseth force-pushed the announce-signatures-resend branch from 89c2f77 to a1f4711 Compare December 4, 2017 19:53
@halseth (Contributor Author) commented Dec 4, 2017

I ended up rewriting this quite a bit, reverting the changes done to the WaitingProofStore, and instead storing the result of sending the announcement message in a separate database within the gossiper. This led to a much smaller diff and simpler logic, since the two sets of state (Sent and Completed) are now handled separately. This should also get rid of the race mentioned above.

In addition, I added the suggested changes to the peer and server (see the fixup! commits), making the queueing of messages non-blocking, and added an explanation to the godoc of how to use the method to broadcast messages.

d.wg.Add(1)
go d.networkHandler()

// In case we had an AnnounceSignatures ready to be sent when the
Contributor

This is a great refactor; I really like the simplicity of the new design! Would it be possible to move this into its own function? I think it would help to keep the Start method from getting too large. I would also recommend executing this bit before launching the network handler, as it's typically good practice to complete any recovery procedures before accepting additional state changes.
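
A sketch of the suggested reordering; resendAnnounceSignatures is a hypothetical name for the extracted function:

func (d *AuthenticatedGossiper) Start() error {
    // ...

    // Complete the recovery procedure before accepting additional
    // state changes from the network handler.
    if err := d.resendAnnounceSignatures(); err != nil {
        return err
    }

    d.wg.Add(1)
    go d.networkHandler()

    return nil
}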


// Since the message was successfully sent to the peer, we
// can safely delete it from the database.
err := d.cfg.DB.Update(func(tx *bolt.Tx) error {
Contributor

It seems that this deletion should only occur after we have persistently received some sort of acknowledgment from the remote peer; otherwise the remote peer can crash in the middle of persisting the proof we just sent. I believe Laolu had a solution to this, which deletes the waiting proof only after its completed channel announcement has been added to the graph. The only other caveat is that on startup, before sending existing waiting proofs, we have to delete any waiting proofs that have already been added to the graph, which covers the case where the local peer goes down between the addition to the graph and the removal of the waiting proof.

Contributor

Even with the above changes, I believe there might still be an issue if the message is sent successfully, the remote peer crashes, and then comes back online, as we won't attempt to resend the message. Restarting lnd would fix this, but that's obviously not ideal.

The simplest solution that comes to mind is to not break the loop after a successful send, but to sleep for some amount of time and then check the on-disk waiting proof store. If the entry is still in the waiting proof store, we can continue and just resend as if it were the first attempt, only breaking if it has been removed due to having received the full announcement, as in my comment above.

The behavior of this loop would then be: periodically send while the peer is online, blocking on the server's notifications during intermittent disconnects, until the waiting proof entry has been removed.
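
An outline of that loop; WaitForPeerOnline, isInProofStore, and retryInterval are assumed names rather than existing APIs:

func (d *AuthenticatedGossiper) retrySendAnnSig(remotePeer *btcec.PublicKey,
    msg *lnwire.AnnounceSignatures, shortChanID lnwire.ShortChannelID) {

    for {
        // Block until the server reports the peer as connected again.
        d.cfg.WaitForPeerOnline(remotePeer)

        // If the entry is gone from the waiting proof store, the full
        // announcement has been assembled and we're done.
        if !d.isInProofStore(shortChanID) {
            return
        }

        // Resend as if it were the first attempt, then wait before
        // re-checking the on-disk store.
        if err := d.cfg.SendToPeer(remotePeer, msg); err != nil {
            log.Debugf("Unable to resend to peer: %v", err)
        }

        select {
        case <-time.After(retryInterval):
        case <-d.quit:
            return
        }
    }
}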

Member

See my comment in the start method; that should take care of this case. With that we'll check all our outgoing channels and re-send the ann sig (using this func) for those that don't have a full chan auth proof (and aren't private). This simple bit of logic meets our goals: we'll keep resending the ann sig until we either get one back, or the other side broadcasts the full chan ann message.

return
peerLog.Debugf("Peer shutting down, could not enqueue msg.")
err := fmt.Errorf("peer shutting down")
if errChan != nil {
Contributor

If we're returning the error directly, do we need to add this to the channel? Since the error indicates that the message wasn't queued at all, a non-nil return would signify that reading from the errChan isn't necessary.

Member

Yeh, I don't think we need to return the error. Instead we can just send over the channel, as that lets the caller process the error in an async manner. Either one is needed, not both, IMO.

server.go Outdated
errChans = append(errChans, errChan)
}

// Return an error if any of the messages failed being sent.
@cfromknecht (Contributor) Dec 6, 2017

Nice job here as well! This has much better interleaving properties :)

For both sends and broadcasts, this will still hold the server's mutex until all messages have been ACK'd or a failure occurs. With a few small changes, we can make this block only during queueing, and allow the receiver to conditionally handle errors outside the mutex. This should help keep large sends or broadcasts from limiting our throughput, and minimize contention in the server.

  1. Have sendPeerMessages and sendToPeer return ([]chan error, error)
  2. Move the remainder of this section into a separate method such as s.handleErrChans([]chan error) error
  3. Modify SendToPeer to handle error chans, such as:
s.Lock()
errChans, err := s.sendToPeer(target, msgs)
s.Unlock()
if err != nil {
    return err
}

return s.handleErrChans(errChans)

We won't need to modify broadcasting at all, because we don't care about the errors there, and this actually restores the desired send-and-forget behavior. I suspect the compiler might even be able to optimize out the process of appending errChans during broadcasts, since the return variable can't be captured when run as a goroutine.
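
For concreteness, a possible shape of the helper from step 2 (an assumed sketch, not the actual implementation):

func (s *server) handleErrChans(errChans []chan error) error {
    // Wait on each channel in turn; with one buffered error per
    // message, this drains in the order the messages were queued.
    for _, errChan := range errChans {
        select {
        case err := <-errChan:
            if err != nil {
                return err
            }
        case <-s.quit:
            return ErrServerShuttingDown
        }
    }

    return nil
}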

Contributor

If the compiler optimization doesn't happen, we can always shave the extra memory allocations by using the nil-ness of the wait group to decide whether or not to instantiate error chans and append to the slice.

Contributor Author

Good idea. If we do it this way, I would argue that we might just make the methods return only []chan error. Since a broadcast does not distinguish between failing to add a message to the queue and failing to send at a later point, and a "reliable send" will also treat these two error scenarios similarly, we might as well only check the error channels when we want to.

The only downside I see with doing this is that if we send a batch of messages and the first one cannot be queued, we will continue trying to queue the rest regardless. I don't think this will usually be a problem (they will all just fail quickly), but in the unlikely scenario that one succeeds (while a previous one failed) we end up with messages out of order. Not even sure this is possible 😛

We can make sendPeerMessages return only []chan error as mentioned above, but still keep the error return from queueMsg. In the case of failing to queue a message, we can send this error directly on the remaining error chans, avoiding the problem I mentioned above (see the sketch below).

Thoughts?
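
A sketch of that last variant (names assumed): queueMsg's return error is propagated onto the error channels of every message that was never queued, so callers draining the channels see a consistent, ordered result.

func (s *server) sendPeerMessages(targetPeer *peer,
    msgs []lnwire.Message) []chan error {

    errChans := make([]chan error, 0, len(msgs))
    for i, msg := range msgs {
        errChan := make(chan error, 1)
        errChans = append(errChans, errChan)

        if err := targetPeer.queueMsg(msg, errChan); err != nil {
            // This message never entered the queue. Fail it and all
            // remaining messages without queueing them, so no later
            // message can be sent after an earlier failure.
            errChan <- err
            for range msgs[i+1:] {
                ch := make(chan error, 1)
                ch <- err
                errChans = append(errChans, ch)
            }
            break
        }
    }

    return errChans
}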

// Fetch all the AnnounceSignatures messages that were added
// to the database, but not sent to the peer.
var msgsResend []msgTuple
err = d.cfg.DB.View(func(tx *bolt.Tx) error {
Member

Ideally we don't do a direct database access like this, as it makes moving to a different database in the future more difficult and also requires the entire database to be reconstructed for unit tests. Instead, we can place this functionality behind an interface, creating a concrete implementation that uses bolt itself. This isn't blocking on this PR though, so it can be done in a follow up PR if you wish.

if err != nil {
return err
}

Member

Still getting through the new version, but note that we only need to re-send the signatures iff we check our own channel and it's set to public but doesn't have a channel announcement proof.

As a result, we'll need logic to scan our set of outgoing channels for those that don't yet have a channel announcement proof in the database.
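
A rough sketch of that scan; ForEachOutgoingChannel and resendAnnSig are assumed names, not necessarily the real router or gossiper API:

func (d *AuthenticatedGossiper) resendLocalProofs() error {
    return d.cfg.Router.ForEachOutgoingChannel(func(
        chanInfo *channeldb.ChannelEdgeInfo) error {

        // Channels that already carry the full authentication
        // proof need no resend.
        if chanInfo.AuthProof != nil {
            return nil
        }

        // Otherwise, re-send the AnnounceSignatures message for
        // this channel to the remote peer.
        return d.resendAnnSig(chanInfo)
    })
}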

// method returns after adding the message to persistent storage, such
// that the caller knows that the message will be delivered at some point.
func (d *AuthenticatedGossiper) sendAnnSigReliably(
msg *lnwire.AnnounceSignatures, remotePeer *btcec.PublicKey) error {
Member

Nit: missing a new line above this.

// channel opening in progress to the same peer.
var key [41]byte
copy(key[:33], remotePeer.SerializeCompressed())
binary.BigEndian.PutUint64(key[33:], msg.ShortChannelID.ToUint64())
Member

Similar comment here (not blocking) about abstracting out the logic behind an interface.



// Test that we retry sending the AnnounceSignatures to the remote peer,
// even though we have been able to receive both parts of the proof.
func TestSignatureAnnouncementRetryAfterBroadcast(t *testing.T) {
Member

Doesn't seem like this is needed. If we have the full proof (a fully signed chan ann message), then we don't need to send it to the remote peer, as upon reconnect, we'll just send them the full proof (which is what we're trying to create anyway).

peer.go Outdated
// queueMsg queues a new lnwire.Message to be eventually sent out on the
// wire.
func (p *peer) queueMsg(msg lnwire.Message, doneChan chan struct{}) {
// wire. It returns an error if we failed queing the message. An error
Member

"failed queing" -> "failed to queue"

server.go Outdated
for _, msg := range msgs {
targetPeer.queueMsg(msg, nil)
errChan := make(chan error, 1)
if err := targetPeer.queueMsg(msg, errChan); err != nil {
Member

If we modify queueMsg to just send an error, or return one (not both), then we don't also need to check the error here. Otherwise, it may always block.

server.go Outdated
// Return an error if any of the messages failed being sent.
for _, errChan := range errChans {
select {
case err := <-errChan:
Member

As was pointed out before, we don't always want this method to block until completion. Instead, it can take the error channel, and only block if it's nil.

@halseth (Contributor Author) commented Dec 12, 2017

Another day, another revamp! The notable change this time is that the resending logic is significantly simplified:

  1. At startup we check all local proofs in the message store, and send them if we don't have the full proof. If we do, we just delete the local proof from the message store (see the sketch after this list).
  2. When sending the local proof, we add it to the message store first, but we don't delete it after a successful send. This is because we cannot really know whether the remote was able to store it. We will delete it at the first gossiper startup after we have the full proof stored.
  3. If we see the remote sending us its remote proof while we have already assembled the full proof, we send the full proof to the remote.
  4. To make (3) possible, an additional check is added when attempting to add a ChannelAnnouncement to the router: if AddEdge fails with ErrIgnored/ErrOutdated, we check whether the edge we have in the graph already contains the channel proof; if not, we add it at this point.
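
For illustration, a sketch of the startup pass in (1); the message-store method names (FetchLocalProofs, DeleteMsg), the msgTuple fields, and the router lookup are assumptions:

func (d *AuthenticatedGossiper) resendAnnounceSignatures() error {
    for _, t := range d.msgStore.FetchLocalProofs() {
        chanInfo, _, _, err := d.cfg.Router.GetChannelByID(
            t.msg.ShortChannelID,
        )
        if err != nil {
            return err
        }

        // The full proof is already assembled, so the stored local
        // proof has served its purpose and can be deleted.
        if chanInfo.AuthProof != nil {
            if err := d.msgStore.DeleteMsg(t); err != nil {
                return err
            }
            continue
        }

        // Otherwise, keep trying to deliver it to the remote peer.
        go d.sendAnnSigReliably(t.msg, t.peer)
    }

    return nil
}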

Roasbeef previously approved these changes Dec 14, 2017

@Roasbeef (Member) left a comment

Dig the latest set of changes, nice work!

Tested this locally a good bit and didn't run into any major issues. You can go ahead and squash down these commits. I think we'll merge the 6-conf PR first, as that'll make it easier to test this, since sending the funding locked message will actually be decoupled from sending the chan ann messages.

LGTM 🐯

Member

Onwards! 🤺

Member

Seems like deleteMsg can be re-used here?

Contributor Author

Yep, that was the reason I created the method in the first place 🤦‍♂️

Member

Nice!

However, the indentation is a bit much. Would recommend rolling this into its own function. This isn't a blocker on the PR though.

Contributor Author

Added TODO

Member

Nit: extra space here.

Roasbeef previously approved these changes Dec 18, 2017

@Roasbeef (Member) left a comment

LGTM 🏄🏾

This commit adds a return error to sendPeerMessages,
making it possible to observe whether a message sent
to a peer fails or succeeds.
This commit makes the gossiper track the state of a local
AnnounceSignatures message, such that it can retry sending
it to the remote peer if needed. It will also persist this
state in the WaitingProofStore, such that it can resume
from this state at startup.
@halseth halseth force-pushed the announce-signatures-resend branch from 68c2d61 to 110a888 Compare December 18, 2017 09:29
@Roasbeef Roasbeef merged commit 2f926f1 into lightningnetwork:master Dec 19, 2017
@halseth halseth deleted the announce-signatures-resend branch July 12, 2018 13:39