New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send ChannelUpdate with Disabled bit if channel inactive for 20 minutes #1687

Merged
merged 12 commits into from Aug 18, 2018

Conversation

Projects
None yet
5 participants
@halseth
Collaborator

halseth commented Aug 6, 2018

Fixes #1605

Note:

server is not very unit testable at the moment. Should consider extracting some of the logic behind interfaces to make it easier to test additions like these.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch 3 times, most recently from b13da73 to 781783b Aug 7, 2018

server.go Outdated
&c.FundingOutpoint)
active := false
if s.htlcSwitch.HasActiveLink(chanID) {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

equivalent to:

active := s.htlcSwitch.HasActiveLink(
        chanID,
)

This comment has been minimized.

@halseth

halseth Aug 8, 2018

Collaborator

lol that's embarrassing

server.go Outdated
active: active,
time: time.Now(),
}
continue

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

unnecessary continue?

server.go Outdated
}
// For each open channel, update the status.
for _, c := range channels {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

should we filter out private channels?

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

maybe pending as well? HasActiveLink will always return false for those, and we'll send out disable msgs

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Private channel updates are filtered out by the gossiper, same for pending channels (as they don't have an announcement proof).

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Hm, thinking about it we might send them directly to our channel peer, so maybe better to filter out here to be sure we don't disable channels just waiting for funding confirmation.

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

alright, FetchAllOpenChannels actually filter out pending channels, so we should be okay :)

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

FetchAllOpenChannels actually filter out pending channels

Ah right, rtfm :)

For private ones though, is there a reason not to spare the gossiper the extra work?

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

It would involve doing an extra DB lookup here, feels like it's nice to let the gossiper have that responsibility :)

server.go Outdated
// map of link statuses, to make sure
// we don't send out another update on
// the next tick.
delete(status, op)

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

it seems like deleting the status will cause us to resend an update live links every 40 minutes, since we'll cycle through the five steps:

  1. adding to the map
  2. waiting 20 mins
  3. sending update
  4. delete from map
  5. waiting 20 mins

i think this can be solved by setting the lastEnabledTime/lastDisabledTime appropriately

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Handled by the sendDisabled map.

server.go Outdated
// This is done by sending a new channel update across the network with the
// disabled flag set accordingly. The result of disabling the channel is it not
// being able to forward payments.
func (s *server) disableChannel(op wire.OutPoint, disable bool) error {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

annChannelStatus might be a better name now that this handles enable/disable

Would also be useful to have an AnnChannelStatus that sends the requests to the watchChannelStatus. It will probably be easier to track exactly when we last sent enable/disable if they all go through one place. My thinking is that the public version could replace the current places where disableChannel is configured in other peers/subsystems. Lmk what you think

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Keeping unexported for now, as it is only used in the server.

peer.go Outdated
// As a final measure we ensure the channel is not currently
// disabled, as that will make us skip it during path finding.
err = p.server.disableChannel(*chanPoint, false)

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

may want to defer sending these until after all channels have been successfully started, o/w the peer can be torn down after partially reactivating the channels

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Fixed.

// time we saw a link changing to the current active status.
type activeStatus struct {
active bool
time time.Time

This comment has been minimized.

@cfromknecht

cfromknecht Aug 8, 2018

Collaborator

would also be useful to track lastEnabledTime and lastDisabledTime here. this will allow us to determine what the last message we sent was, even if the active status flip flops

for example, if we want to disable a channel, we would also check
st.lastDisabledTime.Before(st.lastEnabledTime)
for enabled:
st.lastEnabledTime.Before(st.lastDisabledTime)

Othwerise we may flip-flop, and then send out an update after 20min even though that was what we last announced

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Now handled by the sentDisabled in announceChanStatus

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

nice, the single mutex approach is much cleaner

@Roasbeef Roasbeef added this to the 0.5 milestone Aug 8, 2018

@halseth halseth force-pushed the halseth:disable-inactive-channels branch 2 times, most recently from 959096d to 013cf10 Aug 8, 2018

@@ -226,7 +227,8 @@ type config struct {
NoEncryptWallet bool `long:"noencryptwallet" description:"If set, wallet will be encrypted using the default passphrase."`
TrickleDelay int `long:"trickledelay" description:"Time in milliseconds between each release of announcements to the network"`
TrickleDelay int `long:"trickledelay" description:"Time in milliseconds between each release of announcements to the network"`
InactiveChanTimeout time.Duration `long:"inactivechantimeout" description:"If a channel has been inactive for the set time, send a ChannelUpdate disabling it."`

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

👍

server.go Outdated
// disabled flag set accordingly. The result of disabling the channel is it not
// being able to forward payments.
func (s *server) announceChanStatus(op wire.OutPoint, disabled bool) error {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

nit: extra new line

server.go Outdated
}
// For each open channel, update the status.
for _, c := range channels {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

maybe pending as well? HasActiveLink will always return false for those, and we'll send out disable msgs

server.go Outdated
s.mu.RLock()
alreadyDisabled, ok := s.sentDisabled[op]
s.mu.RUnlock()
if ok && alreadyDisabled == disabled {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

with this locking procedure, we may send out duplicate announcements w/ same active/inactive status if ok. not sure how critical that is tho

If !ok, it would be possible to send out conflicting inactive/active statuses, and the network would see which ever finished last

This comment has been minimized.

@halseth

halseth Aug 9, 2018

Collaborator

Changed to using own mutex.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from 013cf10 to 58530cb Aug 9, 2018

// marking each as disabled.
// Now that the channel close processes have been started, we should
// receive an update marking each as disabled.
expectedPolicy.Disabled = true

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

👍

@@ -555,6 +555,20 @@ func (n *NetworkHarness) RestartNode(node *HarnessNode, callback func() error) e
return node.start(n.lndErrorChan)
}
// SuspendNode stops the given node and returns a callback that can be used to
// start it again.
func (n *NetworkHarness) SuspendNode(node *HarnessNode) (func() error, error) {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

nifty 🤓

// We must now update the message's timestamp and generate a new
// signature.
chanUpdate.Timestamp = uint32(time.Now().Unix())
newTimestamp := uint32(time.Now().Unix())
if newTimestamp <= chanUpdate.Timestamp {

This comment has been minimized.

@cfromknecht

cfromknecht Aug 9, 2018

Collaborator

Good catch

This comment has been minimized.

@wpaulino

wpaulino Aug 15, 2018

Collaborator

Hmm would like to know how it's possible for the timestamp to be in the future.

This comment has been minimized.

@Roasbeef

Roasbeef Aug 15, 2018

Member

Not that it's in the future, but if two updates closely follow each other, if we don't bump the timestamp, then it won't propagate. For the <, older version of go don't enforce monotonic clocks.

@cfromknecht

LGTM! 🤓 Just needs squash

@Roasbeef

Nice work! Super eager to get this into our next release. This along should dramatically reduce routing errors in the network, as wholesale unreliable (or inactive) peers will have their channels disabled.

One possible follow up to would be to extend the ListChannels method itself to also return is a channel is disabled or not. Not a blocker on this though, I can follow up with an issue.

@@ -555,6 +555,18 @@ func (n *NetworkHarness) RestartNode(node *HarnessNode, callback func() error) e
return node.start(n.lndErrorChan)
}
func (n *NetworkHarness) SuspendNode(node *HarnessNode) (func() error, error) {

This comment has been minimized.

@Roasbeef

Roasbeef Aug 11, 2018

Member

Missing godoc comment.

lnd_test.go Outdated
// We create a new node Eve that has an inactive channel timeout of
// just 5 secnods (down from the default 20m). it will be used to test
// channel updates for channels going inactive.
eve, err := net.NewNode("Eve", []string{"--inactivechantimeout=5s"})

This comment has been minimized.

@Roasbeef

Roasbeef Aug 11, 2018

Member

Why not a lower timeout? Seems possible that on a fast machine, the assertion below will fail as it runs through the logic below fast enough.

This comment has been minimized.

@halseth

halseth Aug 11, 2018

Collaborator

It will wait up to 20 seconds for the update, so should be plenty of time. Did lower to 2s.

@@ -2924,22 +2924,6 @@ func extractChannelUpdate(info *channeldb.ChannelEdgeInfo,
// applyChannelUpdate applies the channel update to the different sub-systems of
// the server.
func (s *server) applyChannelUpdate(update *lnwire.ChannelUpdate) error {
newChannelPolicy := &channeldb.ChannelEdgePolicy{

This comment has been minimized.

@Roasbeef

Roasbeef Aug 11, 2018

Member

Nice catch!

This comment has been minimized.

@wpaulino

wpaulino Aug 15, 2018

Collaborator

Heh was meaning to remove this myself once I realized it wasn't needed 👍

peer.go Outdated
// during path finding.
go func() {
for _, chanPoint := range activeChans {
err := p.server.announceChanStatus(chanPoint, false)

This comment has been minimized.

@Roasbeef

Roasbeef Aug 11, 2018

Member

This'll cause us to continually announce a channel as enabled each time a peer reconnects. Something we've been trying to avoid. It would necessarily be each time (due to the trickle timeout), but it'll maximize the number of channel updates in flight for flappy peers.

This comment has been minimized.

@Roasbeef

Roasbeef Aug 11, 2018

Member

Ah nvm, I see now that the method it uses will noop if we already announced this state.

This comment has been minimized.

@wpaulino

wpaulino Aug 15, 2018

Collaborator

Would be nice to note that here in a comment.

This comment has been minimized.

@halseth

halseth Aug 16, 2018

Collaborator

Added comment.

server.go Outdated
@@ -2914,30 +2914,44 @@ func (s *server) announceChanStatus(op wire.OutPoint, disabled bool) error {
// fetchLastChanUpdateByOutPoint fetches the latest update for a channel from
// our point of view.
func (s *server) fetchLastChanUpdateByOutPoint(op wire.OutPoint) (*lnwire.ChannelUpdate, error) {
func (s *server) fetchLastChanUpdateByOutPoint(op wire.OutPoint) (

This comment has been minimized.

@Roasbeef

Roasbeef Aug 11, 2018

Member

Nit: the comment should read "either edge is not present". From my reading the prior code would possibly panic if one of the edges weren't advertised, while this will return with an error if we can't find our edge.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 11, 2018

Will start to run this on the faucet!

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 11, 2018

This doesn't build as is.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from a54f69b to f92f98d Aug 11, 2018

@halseth

This comment has been minimized.

Collaborator

halseth commented Aug 11, 2018

Compilation issue fixed, and squashed.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from f92f98d to 8dcfcd5 Aug 11, 2018

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 14, 2018

Needs a rebase, now has a switch conflict.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from 8dcfcd5 to 58d0a2d Aug 14, 2018

@halseth

This comment has been minimized.

Collaborator

halseth commented Aug 14, 2018

Rebased.

for op, st := range status {
disable := !st.active
if st.time.Before(limit) {

This comment has been minimized.

@Roasbeef

Roasbeef Aug 15, 2018

Member

This is ineffective in testing. As is right now, it'll continually attempt to re-send an update each interval. We should only iterate over the set of channels that have had their state changed since we last went thru the prior iteration.

This comment has been minimized.

@Roasbeef

Roasbeef Aug 16, 2018

Member

Actually this is OK, although we could prevent our selves from sending again. The issue is that we never actually send out the update since we can't find the edge, so sentDisabled isn't updated. As a result we continually try in announceChanStatus and fail each time.

This comment has been minimized.

@cfromknecht

cfromknecht Aug 16, 2018

Collaborator

The stricter checks would prevent us from doing unneccessary signing, fwiw

active bool
time time.Time
}
status := make(map[wire.OutPoint]activeStatus)

This comment has been minimized.

@Roasbeef

Roasbeef Aug 16, 2018

Member

We never clean out closed channels. So as is, this can grow unbounded, albeit the growth rate should be pretty small unless someone has like tens of thousands of channels.

This comment has been minimized.

@halseth

halseth Aug 16, 2018

Collaborator

Fixed by recent fixup. See comment below.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 16, 2018

Actually scratch that, for those channels they have a chan ID of zero, which means that they were somehow never inserted into the graph at all. Possibly an issue with the funding manager not reliably sending the announcements to the gossiper.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 16, 2018

We should def find the greater issue here, but one work around would be to have the funding manager check for channel open on start up, or which have recently transitioned state, and re-insert them into the graph if we find that they're unknown.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 16, 2018

Hit this panic while running the patch:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0xcf4c1a]

goroutine 2528 [running]:
main.(*server).fetchLastChanUpdateByOutPoint.func1(0x0, 0xc421183650, 0x21, 0x21)
        /root/gocode/src/github.com/lightningnetwork/lnd/server.go:2942 +0x2a
main.(*server).fetchLastChanUpdateByOutPoint(0xc42058c340, 0x56e8d11296769b2, 0x13753c248299d43f, 0xbc354e16e1bb81a4, 0x5b5ecbfcac30b18f, 0x0, 0xc425c93840, 0xc4202aa9a0, 0x4e)
        /root/gocode/src/github.com/lightningnetwork/lnd/server.go:2958 +0x14c
main.(*server).announceChanStatus(0xc42058c340, 0x56e8d11296769b2, 0x13753c248299d43f, 0xbc354e16e1bb81a4, 0x5b5ecbfcac30b18f, 0x100000000, 0x0, 0x0)
        /root/gocode/src/github.com/lightningnetwork/lnd/server.go:2877 +0x262
main.(*server).watchChannelStatus(0xc42058c340)
        /root/gocode/src/github.com/lightningnetwork/lnd/server.go:3079 +0x615
created by main.(*server).Start
        /root/gocode/src/github.com/lightningnetwork/lnd/server.go:999 +0x2dc
server.go Outdated
local = edge2
} else {
local = edge1
if bytes.Equal(ourPubKey, owner(edge1)) {

This comment has been minimized.

@Roasbeef

Roasbeef Aug 16, 2018

Member

Here and below should be prefixed with edge1 != nil as well as the other.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 16, 2018

I took another look at the gossiper code, and there're a few areas where forget to send an error back via the main channel. Generally I think we should discourage sending errors via channels back to callers within a method, in favor of instead returning the error, and having the caller pass back that error to its caller in a single place.

@joostjager

This comment has been minimized.

Collaborator

joostjager commented Aug 16, 2018

@Roasbeef I had a quick look at

[ERR] SRVR: unable to disable channel: edge not found

Could reproduce this on simnet. Open private channel and then force close.

I don't think it has to do with the db migration / unknown edge policies. The error message maybe somewhat misleading. In my test it originates here:

return ErrEdgeNotFound

Mapping from outpoint to channel id is unavailable.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from 58d0a2d to ead863e Aug 16, 2018

@halseth

This comment has been minimized.

Collaborator

halseth commented Aug 16, 2018

Turned out the reason we were trying to disable channels where edges were not found in the DB was because closed channels would be lingering around in the status map. Fixed by only copying the status of open channels on each tick.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 17, 2018

@joostjager thanks for that report! Led me to finding an existing bug which is now fixed in #1739.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 17, 2018

Alrighty, with the latest fixups, giving this another spin on the faucet. Will report back with any comments/reports, then likely green light it for inclusion, as this should dramatically reduce the number of routing failures on mainnet/testnet once folks update.

@Roasbeef

This comment has been minimized.

Member

Roasbeef commented Aug 17, 2018

I think the faucet (which is suuper old) is in a weird state itself (those channels have a chanID of zero). Will try this on a newer node. However, the minor issue of it continually re-trying to disable those channels when it's unable to actually read them is still present.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from d9cf545 to c5d8d50 Aug 17, 2018

halseth added some commits Aug 9, 2018

server: let gossiper hand policy update to router
ProcessLocalAnnouncement will attempt to call UpdateEdge with the new
policy. If we call it manually before handing it to the gossiper, that
call will fail with "Outdated" and the announcement won't propagate.
lnd_test: use dave to listen for ChannelUpdates in testSendUpdateDisa…
…bleChannel

This commit changes the test testSendUpdateDisableChannel to use Dave to
listen for channel updates, as was originally intended.

Since a ChannelUpdate won't propagate in the network if the channel is
already closed (if the closing transaction has been mined), we change
the closing procedure to initiate the closing process, wait for the
ChannelUpdates, then finally mining the closing transaction. This avoids
a propagation race between the mined block and the ChannelUpdate.
server+peer: rename disableChannel->announceChanStatus
Now takes a boolean whether to disable/activate, and keeps track of the
last update sent to avoid sending duplicates.
peer: call disableChannel with disabled=false after loading channels
This to ensure now active channel is available in the router, for path
finding, as it might have been disabled if the peer has been offline a
while.
server: add goroutine watchChannelStatus
This commit adds a goroutine watchChannelStatus to the server, which
will query the switch for the status of all open channels every
InactiveChanTimeout / 4. If a channel's status has remained unchanged
during the last InactiveChanTimeout it'll send out a ChannelUpdate
setting the disabled bit accordingly.

@halseth halseth force-pushed the halseth:disable-inactive-channels branch from c5d8d50 to 7b1ac0a Aug 17, 2018

@Roasbeef

LGTM 🔮

@@ -10925,7 +10925,7 @@ func testSendUpdateDisableChannel(net *lntest.NetworkHarness, t *harnessTest) {
t.Fatalf("unable to connect bob to dave: %v", err)
}
daveUpdates, quit := subscribeGraphNotifications(t, ctxb, net.Alice)
daveUpdates, quit := subscribeGraphNotifications(t, ctxb, dave)

This comment has been minimized.

@Roasbeef

Roasbeef Aug 18, 2018

Member

One of those..."how tf?".

@Roasbeef Roasbeef merged commit 1475960 into lightningnetwork:master Aug 18, 2018

1 of 2 checks passed

coverage/coveralls Coverage decreased (-0.1%) to 54.458%
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment