AnnounceSignatures resend in gossiper. #432
Conversation
ec7264c to c53110a
server.go
Outdated
Note that it isn't the case that we always want an error to be returned when sending a message (or we just don't care about the error). One case is when we're broadcasting a set of messages: we don't really care if they are reliably sent, rather it's mostly set-and-forget.
channeldb/waitingproof.go
Outdated
Missing godoc comments on the three above fields.
channeldb/waitingproof.go
Outdated
Each of the keys should have their own godoc comments explaining the keys and their usage.
channeldb/waitingproof.go
Outdated
indicated -> indicates
channeldb/waitingproof.go
Outdated
Is it necessary to store the peer's ID at all? Seems that with just the channel ID, we have everything we need to properly dispatch the messages?
Yeah, almost enough. By fetching the channel info for that particular channel ID we have the public keys of the two peers of the channel, but I couldn't figure out a clever way of determining which of the two is the remote node and which is the local node.
If there's an elegant way of fetching this missing information, this could be simplified, and we wouldn't need to store the peer id in the database.
channeldb/waitingproof.go
Outdated
Should have a more descriptive name. Suggestion: getProof.
discovery/gossiper.go
Outdated
Feels like this fragment can be simplified a good bit. I believe this fragment is equivalent to:
if isFirstNode {
    remotePeer = chanInfo.NodeKey2
} else {
    remotePeer = chanInfo.NodeKey1
}
Or is there something I'm missing here?
discovery/gossiper.go
Outdated
Shouldn't this be done first, or in tandem when adding the proof above? Otherwise, we could go down between writes, then have no way to determine who originally sent the message.
discovery/gossiper.go
Outdated
Can't parse this sentence currently. It should be re-written.
discovery/gossiper.go
Outdated
It seems like there's a possible race between this and the goroutine launched below. This has the potential to cause the waiting proofs to stick around on disk longer than is strictly required. If we crash before we get to this point, then we never mark the proof as completed. As a result, it will never be deleted from disk.
A remedy to this would be to scan all waiting proofs on startup and delete them iff, when we read the channel edge from disk, its announcement is already set. I think this works?
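For illustration, a minimal Go sketch of that startup cleanup; the store/graph interfaces and their method names (ForEach, Delete, HasAuthProof) are hypothetical stand-ins, not the actual channeldb API:

package discovery // hypothetical sketch, not lnd's actual code

// waitingProof and the two interfaces below are simplified stand-ins for
// whatever channeldb ends up exposing; the names are illustrative only.
type waitingProof struct {
    shortChanID uint64
}

type proofStore interface {
    ForEach(func(waitingProof) error) error
    Delete(waitingProof) error
}

type channelGraph interface {
    // HasAuthProof reports whether the channel edge already has its
    // announcement (auth) proof set.
    HasAuthProof(shortChanID uint64) (bool, error)
}

// cleanupWaitingProofs walks every waiting proof on disk at startup and
// deletes those whose channel announcement is already present in the
// graph, so stale proofs don't linger after a crash.
func cleanupWaitingProofs(store proofStore, graph channelGraph) error {
    return store.ForEach(func(p waitingProof) error {
        hasProof, err := graph.HasAuthProof(p.shortChanID)
        if err != nil {
            return err
        }
        if hasProof {
            return store.Delete(p)
        }
        return nil
    })
}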
server.go
Outdated
Failed queueing*, or "failed to queue", which is equivalent 😎
server.go
Outdated
This should utilize ErrServerShuttingDown :)
server.go
Outdated
I believe adding error handling in this way nerfs the potential for queueing/batching messages within the peer. Since this now requires an ACK through the error channel, another message will not be queued until the previous one in the batch has been successfully written out to the network. The server lock is held throughout the duration of this process, so it seems that this would make all SendToPeer calls serial and blocking.
@Roasbeef has suggested that we get around this by interpreting all SendToPeer calls as "critical", while all Broadcast calls can be treated as non-critical. Within the current design, the wait group wg will be non-nil in the event of a broadcast, and nil for direct sends. If we detect the send is a broadcast, it should be fine to just queue messages with no errChan and forget.
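A minimal sketch of that split, assuming a non-nil wait group marks a broadcast; the message and msgQueuer types are simplified stand-ins rather than lnd's actual server/peer types:

package server // hypothetical sketch of the "critical vs. broadcast" split

import "sync"

// message and msgQueuer stand in for lnwire.Message and the peer's
// queueing method.
type message interface{}

type msgQueuer interface {
    queueMsg(msg message, errChan chan error)
}

// sendPeerMessages treats a non-nil wait group as a broadcast (queue and
// forget, nil errChan) and a nil wait group as a critical direct send,
// creating one error channel per message so the caller can wait on the
// results.
func sendPeerMessages(target msgQueuer, msgs []message, wg *sync.WaitGroup) []chan error {
    if wg != nil {
        // Broadcast path: launched as a goroutine, fire and forget.
        defer wg.Done()
        for _, msg := range msgs {
            target.queueMsg(msg, nil)
        }
        return nil
    }

    // Critical direct send: collect one error channel per message.
    errChans := make([]chan error, 0, len(msgs))
    for _, msg := range msgs {
        errChan := make(chan error, 1)
        target.queueMsg(msg, errChan)
        errChans = append(errChans, errChan)
    }
    return errChans
}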
There are more optimizations possible in terms of batching up critical sends that contain multiple messages and minimizing the amount of time we hold the server's mutex. This has to do with first queueing the messages, and then reading back all the errors that were produced, if any, before returning to the caller. Note that the lock can be released after having queued the messages, but the invocation of SendToPeer should return only after receiving all potential errors.
It's a little less clear in this case what the caller should do in response to such an error—sending everything again seems to be the only safe option. I think these optimizations are beyond the scope of this PR, but figured I'd make a note of it if we decide that we require more pipelining in this area.
peer.go
Outdated
Perhaps this should return an error directly that indicates whether or not we were able to queue the message. This will inform us as to whether we even need to receive on the errChan. Since this should only fail if the peer is currently shutting down, it would also allow us to short-circuit batched sends from the caller's side.
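A rough sketch of what that signature change could look like; the types here are simplified stand-ins for the real peer implementation:

package peer // hypothetical sketch of the suggested queueMsg signature

import "errors"

type message interface{}

// outgoingMsg pairs a message with an optional channel the write loop
// reports the send result on.
type outgoingMsg struct {
    msg     message
    errChan chan error
}

type peer struct {
    outgoingQueue chan outgoingMsg
    quit          chan struct{}
}

var errPeerShuttingDown = errors.New("peer shutting down")

// queueMsg returns an error immediately if the message could not be
// enqueued because the peer is shutting down; callers then know they
// never need to read from errChan and can short-circuit a batched send.
func (p *peer) queueMsg(msg message, errChan chan error) error {
    select {
    case p.outgoingQueue <- outgoingMsg{msg: msg, errChan: errChan}:
        return nil
    case <-p.quit:
        return errPeerShuttingDown
    }
}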
discovery/gossiper.go
Outdated
Is it safe to persist this if the counterparty hasn't ACK'd anything? Even if SendToPeer returns no error, that doesn't necessarily mean the other party has persisted its state regarding the sent message, e.g. it could crash before it writes SetCompleted. Whose responsibility is it to restart this sequence?
This was written with the underlying assumption that when SendToPeer returns a success, the message is delivered to the peer. Not sure how we can actually be sure it is delivered, and introducing an ACK would deviate from the protocol? I think delivery+ack is the responsibility of the transport in this case, and we should make changes there (server+peer+brontide) to make sure it is actually delivered.
discovery/gossiper.go
Outdated
It seems like these should always be retransmitted on startup, unless the channel is already being advertised as per laolu's comment. Hopefully this ends up being simpler, as we won't need to persistently track Sent and Completed status!
channeldb/waitingproof.go
Outdated
Do we ever read this variable? AFAICT it only ever gets assigned
89c2f77 to a1f4711
I ended up rewriting this quite a bit, reverting the changes done to the . In addition, I added the suggested changes to the peer and server (see the fixup! commits), making the queueing of messages non-blocking, and added to the godoc an explanation of how to use the method to broadcast messages.
d.wg.Add(1)
go d.networkHandler()

// In case we had an AnnounceSignatures ready to be sent when the
This is a great refactor, I really like the simplicity of the new design! Would it be possible to move this into its own function? I think it would help to keep the Start method from getting too large. I would also recommend executing this bit before launching the network handler, as it's typically good practice to complete any recovery procedures before accepting additional state changes.
discovery/gossiper.go
Outdated
// Since the message was successfully sent to the peer, we
// can safely delete it from the database.
err := d.cfg.DB.Update(func(tx *bolt.Tx) error {
It seems that this deletion should only occur after we have persistently received some sort of acknowledgment from the remote peer, otherwise the remote peer can crash in the middle of persisting the proof we just sent. I believe Laolu had a solution to this, which deletes the waiting proof only after its completed channel announcement has been added to the graph. The only other caveat is that on startup + before sending existing waiting proofs, we have to delete any waiting proofs that have already been added to the graph, which covers the case where the local peer goes down between the addition to the graph and the removal of the waiting proof.
Even with the above changes, I believe there might still be an issue if the message is sent successfully, the remote peer crashes, and then comes back online as we won't attempt to resend the message. Restarting lnd would fix this, but obviously not ideal.
The simplest solution that comes to mind is to not break the loop after a successful send, but to sleep for some amount of time and then check the on-disk waiting proof store. If the entry is still in the waiting proof store, we can continue and just resend as if it was the first attempt, only breaking if it has been removed due to having received the full announcement as in my comment above.
The behavior of this loop would then be: periodically send while peer is online, blocking using server's notifications during intermittent disconnects, until the waiting proof entry has been removed.
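A rough sketch of that loop; retryInterval and both callbacks are illustrative assumptions, not the actual gossiper/server plumbing:

package discovery // hypothetical sketch of the retry loop described above

import "time"

// retryInterval is an assumed resend period; the real value would need tuning.
const retryInterval = time.Minute

// resendUntilDone periodically resends the AnnounceSignatures message and
// stops only once the on-disk waiting proof entry has been removed (i.e.
// the full channel announcement was assembled).
func resendUntilDone(sendToPeer func() error, hasWaitingProof func() (bool, error)) error {
    for {
        if err := sendToPeer(); err != nil {
            return err
        }

        // Give the exchange some time to complete before re-checking
        // the waiting proof store.
        time.Sleep(retryInterval)

        stillWaiting, err := hasWaitingProof()
        if err != nil {
            return err
        }
        if !stillWaiting {
            return nil
        }
        // Entry still present: resend as if it were the first attempt.
    }
}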
See my comment in the Start method, that should take care of this case. With that we'll check all our outgoing channels, and re-send the ann sig (using this func) for those that we don't have a full chan auth proof for (and that aren't private). This simple bit of logic meets our goals: we'll keep resending the ann-sig until we either get one back, or the other side broadcasts the full chan ann message.
return
peerLog.Debugf("Peer shutting down, could not enqueue msg.")
err := fmt.Errorf("peer shutting down")
if errChan != nil {
If we're returning the error directly, do we need to add this to the channel? Since the error indicates that the message wasn't queued at all, a non-nil return would signify that reading from the errChan isn't necessary.
Yeah, I don't think we need to return the error. Instead we can just send over the channel, as that lets the caller process the error in an async manner. Only one of the two is needed, not both, IMO.
server.go
Outdated
    errChans = append(errChans, errChan)
}

// Return an error if any of the messages failed being sent.
Nice job here as well! This has much better interleaving properties :)
For both sends and broadcasts, this will still hold the server's mutex until all messages have been ACK'd or a failure occurs. With a few small changes, we can make this only block during queueing, and allow the receiver to conditionally handle errors outside the mutex. This should help keep large sends or broadcasts from limiting our throughput and minimize contention in the server.
- Have sendPeerMessages and sendToPeer return ([]chan error, error)
- Move the remainder of this section into a separate method such as s.handleErrChans([]chan error) error
- Modify SendToPeer to handle error chans, such as:
s.Lock()
errChans, err := s.sendToPeer(target, msgs)
s.Unlock()
if err != nil {
    return err
}
return s.handleErrChans(errChans)
We won't need to modify broadcasting at all because we don't care about the errors, and this actually returns it to the desired send-and-forget behavior. I suspect the compiler might even be able to optimize out the process of appending errChans during broadcasts, since the return variable can't be captured when run as a goroutine.
If the compiler optimization doesn't happen, we can always shave the extra memory allocations by using the nil-ness of the wait group to instruct whether or not to instantiate error chans and appending to the slice.
Good idea. If doing it this way, I would argue that we might just make the methods return only []chan error. Since a broadcast does not distinguish between failing to add a message to the queue and failing to send at a later point, and a "reliable send" will also treat these two error scenarios similarly, we might as well only check the error channels if we want to.
The only downside I see by doing this is that if we send a batch of messages, and the first one cannot be queued, we will continue trying to queue the rest regardless. I don't think this will usually be a problem (they will all just fail quickly), but in the unlikely scenario that one succeeds (while a previous one failed) we end up with messages out of order. Not even sure this is possible 😛
We can make sendPeerMessages return only []chan error as mentioned above, but still keep the return error from queueMsg. In the case of failing to queue a message, we can send this error directly on the remaining error chans, avoiding the problem I mentioned above.
Thoughts?
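A minimal sketch of that variant, under the assumption that queueMsg keeps its return error; the names and types below are illustrative only:

package server // hypothetical sketch of the design discussed above

type message interface{}

// queuer abstracts queueMsg; it returns an error iff the message could
// not be enqueued at all.
type queuer interface {
    queueMsg(msg message, errChan chan error) error
}

// sendPeerMessages returns one error channel per message. If queueing
// fails part-way through a batch, the queueing error is delivered on the
// remaining channels (and no further messages are queued), so waiting
// callers still see the failure and messages can't go out of order.
func sendPeerMessages(target queuer, msgs ...message) []chan error {
    errChans := make([]chan error, 0, len(msgs))
    var queueErr error
    for _, msg := range msgs {
        errChan := make(chan error, 1)
        errChans = append(errChans, errChan)

        if queueErr != nil {
            errChan <- queueErr
            continue
        }
        if err := target.queueMsg(msg, errChan); err != nil {
            queueErr = err
            errChan <- err
        }
    }
    return errChans
}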
discovery/gossiper.go
Outdated
// Fetch all the AnnounceSignatures messages that were added
// to the database, but not sent to the peer.
var msgsResend []msgTuple
err = d.cfg.DB.View(func(tx *bolt.Tx) error {
Ideally we don't do a direct database access like this, as it makes moving to a different database in the future more difficult and also requires the entire database to be reconstructed for unit tests. Instead, we can place this functionality behind an interface, creating a concrete implementation that uses bolt itself. This isn't blocking on this PR though, so it can be done in a follow up PR if you wish.
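One possible shape for such an interface (a sketch only; the method names and signatures are assumptions, not an existing API):

package discovery // hypothetical sketch of the suggested abstraction

// messageStore is one possible interface the gossiper could depend on
// instead of issuing raw bolt transactions. A bolt-backed implementation
// would live alongside channeldb, and unit tests could supply an
// in-memory one.
type messageStore interface {
    // AddMessage persists an encoded AnnounceSignatures destined for the
    // given peer so that it survives restarts.
    AddMessage(peerPubKey [33]byte, shortChanID uint64, msg []byte) error

    // Messages returns all persisted messages that still need to be
    // (re)sent, keyed by peer and short channel ID.
    Messages() (map[[33]byte]map[uint64][]byte, error)

    // DeleteMessage removes a message once it is no longer needed.
    DeleteMessage(peerPubKey [33]byte, shortChanID uint64) error
}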
if err != nil {
    return err
}
Still getting through the new version, but note that we only need to re-send the signatures for our own channels that are set to public but don't yet have a channel announcement proof.
As a result, we'll need logic to scan our set of outgoing channels for those that don't yet have a channel announcement proof in the database.
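A minimal sketch of that scan, with channelEdge as a hypothetical stand-in for the channel info channeldb actually stores:

package discovery // hypothetical sketch of the startup scan described above

// channelEdge is a simplified stand-in for the info channeldb holds about
// one of our own channels.
type channelEdge struct {
    shortChanID uint64
    isPublic    bool
    hasProof    bool // full channel announcement (auth) proof present?
}

// chansNeedingAnnSig returns the channels we should (re)send an
// AnnounceSignatures for: our own public channels that don't yet have a
// full channel announcement proof.
func chansNeedingAnnSig(ourChans []channelEdge) []uint64 {
    var need []uint64
    for _, c := range ourChans {
        if c.isPublic && !c.hasProof {
            need = append(need, c.shortChanID)
        }
    }
    return need
}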
// method returns after adding the message to persistent storage, such
// that the caller knows that the message will be delivered at one point.
func (d *AuthenticatedGossiper) sendAnnSigReliably(
    msg *lnwire.AnnounceSignatures, remotePeer *btcec.PublicKey) error {
Nit: missing a new line above this.
// channel opening in progress to the same peer.
var key [41]byte
copy(key[:33], remotePeer.SerializeCompressed())
binary.BigEndian.PutUint64(key[33:], msg.ShortChannelID.ToUint64())
Similar comment here (not blocking) about abstracting out the logic behind an interface.
discovery/gossiper.go
Outdated
// Since the message was successfully sent to the peer, we
// can safely delete it from the database.
err := d.cfg.DB.Update(func(tx *bolt.Tx) error {
See my comment in the Start method, that should take care of this case. With that we'll check all our outgoing channels, and re-send the ann sig (using this func) for those that we don't have a full chan auth proof for (and that aren't private). This simple bit of logic meets our goals: we'll keep resending the ann-sig until we either get one back, or the other side broadcasts the full chan ann message.
discovery/gossiper_test.go
Outdated
// Test that we retry sending the AnnounceSignatures to the remote peer,
// even though we have been able to receive both parts of the proof.
func TestSignatureAnnouncementRetryAfterBroadcast(t *testing.T) {
Doesn't seem like this is needed. If we have the full proof (a fully signed chan ann message), then we don't need to send it to the remote peer, as upon reconnect, we'll just send them the full proof (which is what we're trying to create anyway).
peer.go
Outdated
// queueMsg queues a new lnwire.Message to be eventually sent out on the
// wire.
func (p *peer) queueMsg(msg lnwire.Message, doneChan chan struct{}) {
// wire. It returns an error if we failed queing the message. An error
queing -> failed to queue
return
peerLog.Debugf("Peer shutting down, could not enqueue msg.")
err := fmt.Errorf("peer shutting down")
if errChan != nil {
Yeah, I don't think we need to return the error. Instead we can just send over the channel, as that lets the caller process the error in an async manner. Only one of the two is needed, not both, IMO.
server.go
Outdated
for _, msg := range msgs {
    targetPeer.queueMsg(msg, nil)
    errChan := make(chan error, 1)
    if err := targetPeer.queueMsg(msg, errChan); err != nil {
If we modify queueMsg to just send an error, or return one (not both), then we don't also need to check the error here, as otherwise it may always block.
server.go
Outdated
// Return an error if any of the messages failed being sent.
for _, errChan := range errChans {
    select {
    case err := <-errChan:
As was pointed out before, we don't always want this method to block until completion. Instead, it can take the error channel, and only block if it's nil.
Another day, another revamp! The notable change this time is that the resending logic is significantly simplified.
Dig the latest set of changes, nice work!
Tested this locally a good bit, and didn't run into any major issues. You can go ahead and squash down these commits. I think we'll merge the 6-conf PR first, as that'll make it easier to test this, since sending the funding locked message will actually be decoupled from sending the chan ann messages.
LGTM 🐯
discovery/gossiper.go
Outdated
Onwards! 🤺
discovery/gossiper.go
Outdated
Seems like deleteMsg can be re-used here?
Yep, that was the reason I created the method in the first place 🤦♂️
discovery/gossiper.go
Outdated
Nice!
However, the indentation is a bit much. Would recommend rolling this into its own function. This isn't a blocker on the PR though.
Added TODO
discovery/gossiper.go
Outdated
Nit: extra space here.
ea9136a to 68c2d61
LGTM 🏄🏾
This commit adds a return error to sendPeerMessages, making it possible to observe if a message sent to a peer fails or succeeds.
This commit makes the gossiper track the state of a local AnnounceSignatures message, such that it can retry sending it to the remote peer if needed. It will also persist this state in the WaitingProofStore, such that it can resume from this state at startup.
68c2d61 to 110a888
This PR adds tracking of AnnounceSignatures state to the gossiper, making it retry sending the message to the remote peer if the first attempt fails. This state is also persisted across restarts, so the message is re-sent after a restart if need be.
This should help the case where a peer has gone offline before receiving the announcement signatures, leaving us unable to craft the final channel proof even after the peer comes back online.