Conversation

@halseth (Contributor) commented Nov 18, 2017

This PR adds AnnounceSignatures state to the gossiper, making it retry sending the message to the remote peer if the initial attempt fails. This state is also persisted across restarts, so the gossiper can resume retrying if need be.

This should help in the case where a peer goes offline before receiving the announcement signatures, which previously left us unable to craft the final channel proof even after the peer came back online.

@halseth halseth force-pushed the announce-signatures-resend branch 3 times, most recently from ec7264c to c53110a Compare November 23, 2017 08:47
server.go Outdated
Member

Note that it isn't the case that we always want an error to be returned when sending a message (or we may just not care about the error). One example is when we're broadcasting a set of messages: we don't really care if they are reliably sent; rather, it's mostly send and forget.

Member

Missing godoc comments on the three above fields.

Member

Each of the keys should have its own godoc comment explaining the key and its usage.

Member

indicated -> indicates

Member

Is it necessary to store the peer's ID at all? Seems that with just the channel ID, we have everything we need to properly dispatch the messages?

Contributor Author

Yeah, almost. By fetching the channel info for that particular channel ID we have the public keys of the channel's two peers, but I couldn't find a clever way of determining which of the two is the remote node and which is the local one.

If there's an elegant way of fetching this missing information, this could be simplified, and we wouldn't need to store the peer ID in the database.
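
For illustration, one possible approach, sketched under the assumption that the gossiper has access to its own node key (the selfKey field name is an assumption, not the actual code):

isFirstNode := bytes.Equal(
    d.selfKey.SerializeCompressed(),
    chanInfo.NodeKey1.SerializeCompressed(),
)

var remotePeer *btcec.PublicKey
if isFirstNode {
    remotePeer = chanInfo.NodeKey2
} else {
    remotePeer = chanInfo.NodeKey1
}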

Member

Should have a more descriptive name. Suggestion: getProof.

Member

Feels like this fragment can be simplified a good bit. I believe it's equivalent to:

if isFirstNode {
    remotePeer = chanInfo.NodeKey2
} else {
    remotePeer = chanInfo.NodeKey1
}

Or is there something I'm missing here?

Member

Shouldn't this be done first, or in tandem with adding the proof above? Otherwise, we could go down between writes, and then have no way to determine who originally sent the message.

Member

Can't parse this sentence currently. It should be re-written.

Member

It seems like there's a possible race between this and the goroutine launched below. This has the potential to cause the waiting proofs to stick around on disk longer than is strictly required. If we crash before we get to this point, then we never mark the proof as completed. As a result, it will never be deleted from disk.

A remedy would be to scan all waiting proofs on startup and delete each one iff, when we read the channel edge from disk, its announcement is already set. I think this works?

server.go Outdated
Contributor

Failed queueing*, or failed to queue, which is equivalength 😎

server.go Outdated
Contributor

This should utilize ErrServerShuttingDown :)

server.go Outdated
Contributor

I believe adding error handling in this way nerfs the potential for queuing/batching messages within the peer. Since this now requires an ACK through the error channel, another message will not be queued until the previous one in the batch has been successfully written out to the network. The server lock is held throughout the duration of this process, so it seems this would make all SendToPeer calls serial and blocking.

Contributor

@Roasbeef has suggested that we get around this by interpreting all SendToPeer calls as "critical", while all Broadcast calls can be treated as non-critical. Within the current design, the wait group wg will be non-nil in the event of a broadcast, and nil for direct sends. If we detect the send is a broadcast, it should be fine to just queue messages with no errChan and forget.
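
A rough sketch of how sendPeerMessages could implement this split; the wg-based broadcast detection comes from the comment above, while the remaining names and the queueMsg signature are assumptions:

func (s *server) sendPeerMessages(targetPeer *peer,
    msgs []lnwire.Message, wg *sync.WaitGroup) []chan error {

    // A non-nil wait group means this call was launched as part of
    // a broadcast, so signal completion when done.
    isBroadcast := wg != nil
    if isBroadcast {
        defer wg.Done()
    }

    var errChans []chan error
    for _, msg := range msgs {
        var errChan chan error
        if !isBroadcast {
            // Critical (direct) send: attach a buffered error
            // channel so the caller can observe failures, while
            // the writer never blocks on the send.
            errChan = make(chan error, 1)
            errChans = append(errChans, errChan)
        }

        // Broadcasts pass a nil errChan: queue and forget.
        targetPeer.queueMsg(msg, errChan)
    }

    return errChans
}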

@cfromknecht (Contributor) Nov 28, 2017

There are more optimizations possible in terms of batching up critical sends that contain multiple messages and minimizing the amount of time we hold the server's mutex. This has to do with first queueing the messages, and then reading back all the errors that were produced, if any, before returning to the caller. Note that the lock can be released after having queued the messages, but the invocation of SendToPeer should return only after receiving all potential errors.

It's a little less clear in this case what the caller should do in response to such an error; sending everything again seems to be the only safe option. I think these optimizations are beyond the scope of this PR, but figured I'd make a note of it in case we decide that we require more pipelining in this area.

peer.go Outdated
Contributor

Perhaps this should return an error directly that indicates whether or not we were able to queue the message. This would inform us as to whether we even need to receive on the errChan. Since this should only fail if the peer is currently shutting down, it would also allow us to short-circuit batched sends from the caller's side.
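
A sketch of that suggestion; the outgoingMsg struct and the queue internals here are assumptions, not the actual peer implementation:

func (p *peer) queueMsg(msg lnwire.Message, errChan chan error) error {
    select {
    case p.outgoingQueue <- outgoingMsg{msg: msg, errChan: errChan}:
        return nil

    case <-p.quit:
        // The message never entered the queue, so the caller knows
        // errChan will never fire and can abort the rest of the
        // batch immediately.
        return fmt.Errorf("peer shutting down")
    }
}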

Contributor

Is it safe to persist this if the counterparty hasn't ACK'd anything? Even if SendToPeer returns no error, that doesn't necessarily mean the other party has persisted its state regarding the sent message, e.g. it could crash before it writes SetCompleted. Whose responsibility is it to restart this sequence?

Contributor Author

This was written with the underlying assumption that when SendToPeer returns success, the message has been delivered to the peer. I'm not sure how we can actually be sure it is delivered, and introducing an ACK would deviate from the protocol? I think delivery+ACK is the responsibility of the transport in this case, and we should make changes there (server+peer+brontide) to make sure the message is actually delivered.

Contributor

It seems like these should always be retransmitted on startup, unless the channel is already being advertised, as per Laolu's comment. Hopefully this ends up being simpler, as we won't need to persistently track Sent and Completed status!

Contributor

Do we ever read this variable? AFAICT it only ever gets assigned.

@halseth halseth force-pushed the announce-signatures-resend branch from 89c2f77 to a1f4711 Compare December 4, 2017 19:53
@halseth (Contributor Author) commented Dec 4, 2017

I ended up rewriting this quite a bit, reverting the changes done to the WaitingProofStore, and instead storing the result of sending the announcement message in a separate database within the gossiper. This led to a much smaller diff and simpler logic, since the two sets of state (Sent and Completed) are now handled separately. This should also get rid of the race mentioned above.

In addition, I added the suggested changes to the peer and server (see the fixup! commits), making the queueing of messages non-blocking, and added an explanation to the godoc of how to use the method to broadcast messages.

d.wg.Add(1)
go d.networkHandler()

// In case we had an AnnounceSignatures ready to be sent when the
Contributor

This is a great refactor; I really like the simplicity of the new design! Would it be possible to move this into its own function? I think it would help to keep the Start method from getting too large. I would also recommend executing this bit before launching the network handler, as it's typically good practice to complete any recovery procedures before accepting additional state changes.
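
A sketch of the suggested reordering; resendAnnounceSignatures is a hypothetical name for the extracted function:

func (d *AuthenticatedGossiper) Start() error {
    // ...

    // Complete the recovery procedure before accepting additional
    // state changes from the network handler.
    if err := d.resendAnnounceSignatures(); err != nil {
        return err
    }

    d.wg.Add(1)
    go d.networkHandler()

    return nil
}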


// Since the message was successfully sent to the peer, we
// can safely delete it from the database.
err := d.cfg.DB.Update(func(tx *bolt.Tx) error {
Contributor

It seems that this deletion should only occur after we have persistently received some sort of acknowledgment from the remote peer; otherwise the remote peer can crash in the middle of persisting the proof we just sent. I believe Laolu had a solution to this, which deletes the waiting proof only after its completed channel announcement has been added to the graph. The only other caveat is that on startup, before sending existing waiting proofs, we have to delete any waiting proofs that have already been added to the graph, which covers the case where the local peer goes down between the addition to the graph and the removal of the waiting proof.

Contributor

Even with the above changes, I believe there might still be an issue if the message is sent successfully, the remote peer crashes, and then comes back online, as we won't attempt to resend the message. Restarting lnd would fix this, but that's obviously not ideal.

The simplest solution that comes to mind is to not break the loop after a successful send, but to sleep for some amount of time and then check the on-disk waiting proof store. If the entry is still in the waiting proof store, we can continue and just resend as if it were the first attempt, only breaking if it has been removed due to having received the full announcement, as in my comment above.

The behavior of this loop would then be: periodically send while the peer is online, blocking on the server's notifications during intermittent disconnects, until the waiting proof entry has been removed.
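
An outline of that loop; WaitForPeerOnline, isInProofStore, and retryInterval are assumed names rather than existing APIs:

func (d *AuthenticatedGossiper) retrySendAnnSig(remotePeer *btcec.PublicKey,
    msg *lnwire.AnnounceSignatures, shortChanID lnwire.ShortChannelID) {

    for {
        // Block until the server reports the peer as connected again.
        d.cfg.WaitForPeerOnline(remotePeer)

        // If the entry is gone from the waiting proof store, the full
        // announcement has been assembled and we're done.
        if !d.isInProofStore(shortChanID) {
            return
        }

        // Resend as if it were the first attempt, then wait before
        // re-checking the on-disk store.
        if err := d.cfg.SendToPeer(remotePeer, msg); err != nil {
            log.Debugf("Unable to resend to peer: %v", err)
        }

        select {
        case <-time.After(retryInterval):
        case <-d.quit:
            return
        }
    }
}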

Member

See my comment in the start method; that should take care of this case. With that we'll check all our outgoing channels and re-send the ann sig (using this func) for those that don't have a full chan auth proof (and aren't private). This simple bit of logic meets our goals: we'll keep resending the ann sig until we either get one back, or the other side broadcasts the full chan ann message.

return
peerLog.Debugf("Peer shutting down, could not enqueue msg.")
err := fmt.Errorf("peer shutting down")
if errChan != nil {
Contributor

If we're returning the error directly, do we need to add this to the channel? Since the error indicates that the message wasn't queued at all, a non-nil return would signify that reading from the errChan isn't necessary.

Member

Yeh, I don't think we need to return the error. Instead we can just send over the channel, as that lets the caller process the error in an async manner. Either one is needed, not both, IMO.

server.go Outdated
errChans = append(errChans, errChan)
}

// Return an error if any of the messages failed being sent.
@cfromknecht (Contributor) Dec 6, 2017

Nice job here as well! This has much better interleaving properties :)

For both sends and broadcasts, this will still hold the server's mutex until all messages have been ACK'd or a failure occurs. With a few small changes, we can make this block only during queueing, and allow the receiver to conditionally handle errors outside the mutex. This should help keep large sends or broadcasts from limiting our throughput, and minimize contention in the server.

  1. Have sendPeerMessages and sendToPeer return ([]chan error, error)
  2. Move the remainder of this section into a separate method such as s.handleErrChans([]chan error) error
  3. Modify SendToPeer to handle error chans, such as:
s.Lock()
errChans, err := s.sendToPeer(target, msgs)
s.Unlock()
if err != nil {
    return err
}

return s.handleErrChans(errChans)

We won't need to modify broadcasting at all, because we don't care about the errors there, and this actually restores the desired send-and-forget behavior. I suspect the compiler might even be able to optimize out the process of appending errChans during broadcasts, since the return variable can't be captured when run as a goroutine.
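
For concreteness, a possible shape of the helper from step 2 (an assumed sketch, not the actual implementation):

func (s *server) handleErrChans(errChans []chan error) error {
    // Wait on each channel in turn; with one buffered error per
    // message, this drains in the order the messages were queued.
    for _, errChan := range errChans {
        select {
        case err := <-errChan:
            if err != nil {
                return err
            }
        case <-s.quit:
            return ErrServerShuttingDown
        }
    }

    return nil
}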

Contributor

If the compiler optimization doesn't happen, we can always shave the extra memory allocations by using the nil-ness of the wait group to decide whether or not to instantiate error chans and append to the slice.

Contributor Author

Good idea. If we do it this way, I would argue that we might just make the methods return only []chan error. Since a broadcast does not distinguish between failing to add a message to the queue and failing to send at a later point, and a "reliable send" will also treat these two error scenarios similarly, we might as well only check the error channels when we want to.

The only downside I see with doing this is that if we send a batch of messages and the first one cannot be queued, we will continue trying to queue the rest regardless. I don't think this will usually be a problem (they will all just fail quickly), but in the unlikely scenario that one succeeds (while a previous one failed) we end up with messages out of order. Not even sure this is possible 😛

We can make sendPeerMessages return only []chan error as mentioned above, but still keep the error return from queueMsg. In the case of failing to queue a message, we can send this error directly on the remaining error chans, avoiding the problem I mentioned above (see the sketch below).

Thoughts?
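
A sketch of that last variant (names assumed): queueMsg's return error is propagated onto the error channels of every message that was never queued, so callers draining the channels see a consistent, ordered result.

func (s *server) sendPeerMessages(targetPeer *peer,
    msgs []lnwire.Message) []chan error {

    errChans := make([]chan error, 0, len(msgs))
    for i, msg := range msgs {
        errChan := make(chan error, 1)
        errChans = append(errChans, errChan)

        if err := targetPeer.queueMsg(msg, errChan); err != nil {
            // This message never entered the queue. Fail it and all
            // remaining messages without queueing them, so no later
            // message can be sent after an earlier failure.
            errChan <- err
            for range msgs[i+1:] {
                ch := make(chan error, 1)
                ch <- err
                errChans = append(errChans, ch)
            }
            break
        }
    }

    return errChans
}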

// Fetch all the AnnounceSignatures messages that were added
// to the database, but not sent to the peer.
var msgsResend []msgTuple
err = d.cfg.DB.View(func(tx *bolt.Tx) error {
Member

Ideally we don't do a direct database access like this, as it makes moving to a different database in the future more difficult and also requires the entire database to be reconstructed for unit tests. Instead, we can place this functionality behind an interface, creating a concrete implementation that uses bolt itself. This isn't blocking on this PR though, so it can be done in a follow up PR if you wish.

if err != nil {
return err
}

Member

Still getting through the new version, but note that we only need to re-send the signatures iff we check our own channel and it's set to public but doesn't have a channel announcement proof.

As a result, we'll need logic to scan our set of outgoing channels for those that don't yet have a channel announcement proof in the database.
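
A rough sketch of that scan; ForEachOutgoingChannel and resendAnnSig are assumed names, not necessarily the real router or gossiper API:

func (d *AuthenticatedGossiper) resendLocalProofs() error {
    return d.cfg.Router.ForEachOutgoingChannel(func(
        chanInfo *channeldb.ChannelEdgeInfo) error {

        // Channels that already carry the full authentication
        // proof need no resend.
        if chanInfo.AuthProof != nil {
            return nil
        }

        // Otherwise, re-send the AnnounceSignatures message for
        // this channel to the remote peer.
        return d.resendAnnSig(chanInfo)
    })
}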

// method returns after adding the message to persistent storage, such
// that the caller knows that the message will be delivered at some point.
func (d *AuthenticatedGossiper) sendAnnSigReliably(
msg *lnwire.AnnounceSignatures, remotePeer *btcec.PublicKey) error {
Member

Nit: missing a new line above this.

// channel opening in progress to the same peer.
var key [41]byte
copy(key[:33], remotePeer.SerializeCompressed())
binary.BigEndian.PutUint64(key[33:], msg.ShortChannelID.ToUint64())
Member

Similar comment here (not blocking) about abstracting out the logic behind an interface.



// Test that we retry sending the AnnounceSignatures to the remote peer,
// even though we have been able to receive both parts of the proof.
func TestSignatureAnnouncementRetryAfterBroadcast(t *testing.T) {
Member

Doesn't seem like this is needed. If we have the full proof (a fully signed chan ann message), then we don't need to send it to the remote peer, as upon reconnect, we'll just send them the full proof (which is what we're trying to create anyway).

peer.go Outdated
// queueMsg queues a new lnwire.Message to be eventually sent out on the
// wire.
func (p *peer) queueMsg(msg lnwire.Message, doneChan chan struct{}) {
// wire. It returns an error if we failed queing the message. An error
Member

"failed queing" -> "failed to queue"

server.go Outdated
for _, msg := range msgs {
targetPeer.queueMsg(msg, nil)
errChan := make(chan error, 1)
if err := targetPeer.queueMsg(msg, errChan); err != nil {
Member

If we modify queueMsg to just send an error, or return one (not both), then we don't also need to check the error here. Otherwise, it may always block.

server.go Outdated
// Return an error if any of the messages failed being sent.
for _, errChan := range errChans {
select {
case err := <-errChan:
Member

As was pointed out before, we don't always want this method to block until completion. Instead, it can take the error channel, and only block if it's nil.

@halseth (Contributor Author) commented Dec 12, 2017

Another day, another revamp! The notable change this time is that the resending logic is significantly simplified:

  1. At startup we check all local proofs in the message store, and send them if we don't have the full proof. If we do, we just delete the local proof from the message store (see the sketch after this list).
  2. When sending the local proof, we add it to the message store first, but we don't delete it after a successful send. This is because we cannot really know whether the remote was able to store it. We will delete it at the first gossiper startup after we have the full proof stored.
  3. If we see the remote sending us its remote proof while we have already assembled the full proof, we send the full proof to the remote.
  4. To make (3) possible, an additional check is added when attempting to add a ChannelAnnouncement to the router: if AddEdge fails with ErrIgnored/ErrOutdated, we check whether the edge we have in the graph already contains the channel proof; if not, we add it at this point.
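
For illustration, a sketch of the startup pass in (1); the message-store method names (FetchLocalProofs, DeleteMsg), the msgTuple fields, and the router lookup are assumptions:

func (d *AuthenticatedGossiper) resendAnnounceSignatures() error {
    for _, t := range d.msgStore.FetchLocalProofs() {
        chanInfo, _, _, err := d.cfg.Router.GetChannelByID(
            t.msg.ShortChannelID,
        )
        if err != nil {
            return err
        }

        // The full proof is already assembled, so the stored local
        // proof has served its purpose and can be deleted.
        if chanInfo.AuthProof != nil {
            if err := d.msgStore.DeleteMsg(t); err != nil {
                return err
            }
            continue
        }

        // Otherwise, keep trying to deliver it to the remote peer.
        go d.sendAnnSigReliably(t.msg, t.peer)
    }

    return nil
}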

Roasbeef previously approved these changes Dec 14, 2017

@Roasbeef (Member) left a comment

Dig the latest set of changes, nice work!

Tested this locally a good bit and didn't run into any major issues. You can go ahead and squash down these commits. I think we'll merge the 6-conf PR first, as that'll make it easier to test this, since sending the funding locked message will actually be decoupled from sending the chan ann messages.

LGTM 🐯

Member

Onwards! 🤺

Member

Seems like deleteMsg can be re-used here?

Contributor Author

Yep, that was the reason I created the method in the first place 🤦‍♂️

Member

Nice!

However, the indentation is a bit much. Would recommend rolling this into its own function. This isn't a blocker on the PR though.

Contributor Author

Added TODO

Member

Nit: extra space here.

Roasbeef previously approved these changes Dec 18, 2017

@Roasbeef (Member) left a comment

LGTM 🏄🏾

This commit adds a return error to sendPeerMessages,
making it possible to observe whether a message sent
to a peer fails or succeeds.
This commit makes the gossiper track the state of a local
AnnounceSignatures message, such that it can retry sending
it to the remote peer if needed. It will also persist this
state in the WaitingProofStore, such that it can resume
from this state at startup.
@halseth halseth force-pushed the announce-signatures-resend branch from 68c2d61 to 110a888 Compare December 18, 2017 09:29
@Roasbeef Roasbeef merged commit 2f926f1 into lightningnetwork:master Dec 19, 2017
@halseth halseth deleted the announce-signatures-resend branch July 12, 2018 13:39