Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netann: channel status manager #2411

Merged
merged 10 commits into from Feb 19, 2019

Conversation

Projects
None yet
4 participants
@cfromknecht
Copy link
Collaborator

commented Jan 4, 2019

In this PR, we introduce a new subsystem called the netann.ChanStatusManager responsible for managing the announcement of channel updates which toggle a channel's disabled bit. Most of the logic has been moved into a new netann package, which offers greater unit testability of the subsystem's behavior.

The netann.ChanStatusManager exposes two methods:

func RequestEnable(outpoint wire.OutPoint) error {}
func RequestDisable(outpoint wire.OutPoint) error {}

Together, these allow other subsystems to request a toggle in a particular direction. The netann.ChanStatusManager then handles the task of dropping duplicate requests, and also ensuring that it's network behavior follows a well-defined state machine.

State Machine Description

At any point, the netann.ChanStatusManager categorizes known channels into three distinct ChanStatus states:

type ChanStatus uint8

const (
    // ChanStatusActive indicates that the channel's last announcement has
    // the disabled bit cleared.
    ChanStatusActive ChanStatus = iota

    // ChanStatusPendingInactive indicates that the channel's last
    // announcement has the disabled bit cleared, but that the channel was
    // detected in an inactive state. Channels in this state will have a
    // disabling announcement sent after the ChanInactiveTimeout expires
    // from the time of the first detection--unless the channel is explicitly
    // reenabled before the disabling occurs.
    ChanStatusPendingInactive

    // ChanStatusInactive indicates that the channel's last announcement has
    // the disabled bit set.
    ChanStatusInactive
)

Channels will always start in either ChanStatusActive or ChanStatusInactive on startup or after we detect a new outpoint (perhaps for a newly created channel) by examining the last channel update we have on disk.

One key distinction from the existing design is that the netann.ChanStatusManager only uses long polling to detect inactive channels, meaning that channels can ONLY be reenabled via an explict call to RequestEnable for an outpoint. This allows us to use a more accurate metric for enabling channels: connection duration.

Once a channel has been detected as inactive within the switch, presumably because of a disconnection, this kicks off a timed progression from ChanStatusActive -> ChanStatusPendingInactive -> ChanStatusInactive.

If RequestEnable is not received before the progression terminates, a new announcement setting the disable bit will be broadcast to the network.

Otherwise, any call to RequestEnable before the progression finishes will cancel the disable from being sent, and leaves the channel in the its still-enabled state on the network. This allows the netann.ChanStatusManager to tolerate short-lived reconnections, without causing the node to spam the network with channel updates.

Using the exposed configurations, users can tune this to a specific threshold, e.g. require ~95% uptime over a 20 minute interval, in order for the channel to remain enabled.

Supersedes #2080

Depends on:

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch from 91c8911 to 6b2de75 Jan 4, 2019

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch 3 times, most recently from 87828fd to 29f32c8 Jan 5, 2019

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch 2 times, most recently from 56edc99 to 0d5e1e4 Jan 5, 2019

@Roasbeef Roasbeef requested review from wpaulino and halseth Jan 8, 2019

@cfromknecht cfromknecht added the blocked label Jan 8, 2019

@halseth

This comment has been minimized.

Copy link
Collaborator

commented Jan 8, 2019

Apart from the addition of tests, I'm struggling to see what this adds that is not already covered by #2080. How is this logic different than taking what we already have in #2080, making sure we only send updates if they alter the disable bit already in the DB?

@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 9, 2019

Apart from the addition of tests, I'm struggling to see what this adds that is not already covered by #2080.

To me this is the most crucial part, as we don't have much testing on the server or insight into the existing behavior apart from the integration tests.

Apart from that, the primary difference is that the state machine only performs the passive enable -> disable transition, while the current one will perform the transition in both directions. My hunch is that this is the primary thing leading to the instability we see on the network, and would probably have strange interactions with #2080 if left unchanged.

Another difference is that all state changes are serialized via the ChanStatusManagers main event loop, and are kept in sync with the on-disk state. Right now external commands can modify the persistent view of the graph separately from watchChannelStatus's view of each channel's current state.

With those things in mind, the goals in my mind are to reduce the state space, consolidate the logic into one unit, and provide a way of throughly testing it. It's also a good chance to work on our server bloat :)

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch from c5b1c6d to 5b5c09b Jan 9, 2019

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch from 5b5c09b to 10a0050 Jan 10, 2019

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch 9 times, most recently from 245b180 to 405081e Jan 11, 2019

@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 12, 2019

Testing framework has been significantly reworked, and additional state machine tests for the ChanStatusManager have been finished. The rework now brings just over 80% coverage to the netann package :D

@cfromknecht cfromknecht removed the blocked label Jan 15, 2019

@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Feb 1, 2019

@halseth @wpaulino rebased on master and squashed prior fixups. also addressed @wpaulino's recent comments, PTAL

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch from b76ba69 to 2df928e Feb 1, 2019

@wpaulino

This comment has been minimized.

Copy link
Collaborator

commented Feb 1, 2019

@cfromknecht looks like the latest version introduced a test failure:

    --- FAIL: TestLightningNetworkDaemon/update_channel_policy (7.19s)
        lnd_test.go:75: Failed: (update channel policy): exited with error: 
            *errors.errorString unable to send payment: received payment error: unable to find a path to destination
            /home/travis/gopath/src/github.com/lightningnetwork/lnd/lnd_test.go:1489 (0xbe2523)
            	testUpdateChannelPolicy: t.Fatalf("unable to send payment: %v", err)
            /home/travis/gopath/src/github.com/lightningnetwork/lnd/lnd_test.go:100 (0xbdac66)
            	(*harnessTest).RunTestCase: testCase.test(net, h)
            /home/travis/gopath/src/github.com/lightningnetwork/lnd/lnd_test.go:13282 (0xc39f74)
            	TestLightningNetworkDaemon.func4: ht.RunTestCase(testCase, lndHarness)
            /home/travis/.gimme/versions/go1.11.5.linux.amd64/src/testing/testing.go:827 (0x4f8caf)
            	tRunner: fn(t)
            /home/travis/.gimme/versions/go1.11.5.linux.amd64/src/runtime/asm_amd64.s:1333 (0x45dd91)
            	goexit: BYTE	$0x90	// NOP
@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Feb 2, 2019

@wpaulino indeed still looking into it

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch from 2df928e to a3c2820 Feb 7, 2019

Show resolved Hide resolved netann/common.go Outdated
@Roasbeef
Copy link
Member

left a comment

Running latest version on the faucet now, will observe le logs to see if anything peculiar happens over the next day or two.

Show resolved Hide resolved netann/common.go Outdated
@halseth
Copy link
Collaborator

left a comment

Mostly just nits on code and commit structure at this point, looks really good 👮

Show resolved Hide resolved netann/common.go Outdated
Show resolved Hide resolved netann/common.go Outdated
Show resolved Hide resolved netann/common_test.go Outdated
Show resolved Hide resolved netann/chan_status_manager.go Outdated
Show resolved Hide resolved netann/chan_status_manager.go
Show resolved Hide resolved netann/chan_status_manager.go
Show resolved Hide resolved netann/chan_status_manager.go
Show resolved Hide resolved config.go Outdated
config.go Outdated
ChanActiveTimeout time.Duration `long:"chan-active-timeout" description:"The duration that a peer connection must be stable before attempting to send a channel announcement that enables or reenables a given channel on the network."`
ChanInactiveTimeout time.Duration `long:"chan-inactive-timeout" description:"The duration that must elapse after first detecting that an already active channel is actually inactive and the channel announcement disabling the channel is announced to the network."`

ChanStatusSampleInterval time.Duration `long:"chan-status-sample-interval" description:"The polling interval between attempts to detect if an active channel has become inactive with respect to the channel's ability to forward."`

This comment has been minimized.

Copy link
@halseth

halseth Feb 13, 2019

Collaborator

It just bloats the exposed config, and seems like all three could be derived from just setting the longest one, even during integration tests.

Show resolved Hide resolved test_utils.go

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch 3 times, most recently from 960c216 to d213d6d Feb 14, 2019

@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Feb 14, 2019

@wpaulino @halseth @Roasbeef comments addressed, commits restructured, and additional unit tests added. See this force-push for the majority of the changes, ptal

@wpaulino
Copy link
Collaborator

left a comment

Clean code with excellent test coverage. LGTM 💥

Show resolved Hide resolved netann/channel_update.go Outdated
Show resolved Hide resolved netann/channel_update.go
Show resolved Hide resolved netann/chan_status_manager.go Outdated

cfromknecht added some commits Feb 15, 2019

config+server+lnd_test: expose chan status manager config options
Exposes the three parameters that dictate
the behavior of the channel status manager:
 * --chan-enable-timeout
 * --chan-disable-timeout
 * --chan-status-sample-interval
test_utils: waitgroup manually spawned goroutines
Found that his can sometimes cause a panic with a
negative waitgroup counter.
peer+server+test_utlils: use new ChanStatusManager
This commit hooks up the new netann.ChanStatusManager,
replacing the prior method which used the
watchChannelStatus goroutine.

@cfromknecht cfromknecht force-pushed the cfromknecht:chan-status-manager branch from d213d6d to de28217 Feb 15, 2019

@cfromknecht

This comment has been minimized.

Copy link
Collaborator Author

commented Feb 15, 2019

@wpaulino comments addressed, @halseth @Roasbeef ptal

@Roasbeef
Copy link
Member

left a comment

LGTM 🦄

I think we'll be saving the entire network a good bit of bandwidth once this lands in 0.6 ;)

// in the map. If for some reason the channel isn't closed, the state
// will be repopulated on subsequent calls to RequestEnable or
// RequestDisable via a db lookup, or on startup.
delete(m.chanStates, outpoint)

This comment has been minimized.

Copy link
@Roasbeef

Roasbeef Feb 19, 2019

Member

👍

I think I ran into this issue in a prior diff of this PR. It appears to be fixed now.

for _, c := range allChannels {
// We'll skip any private channels, as they aren't used for
// routing within the network by other nodes.
if c.ChannelFlags&lnwire.FFAnnounceChannel == 0 {

This comment has been minimized.

Copy link
@Roasbeef

Roasbeef Feb 19, 2019

Member

Not a blocker, but this could have been hidden behind this new interface. With that path, then the ChanStatusManager doesn't need to know how to interpret the bits of the channel flag, or what an advertised channel even is.

@@ -12755,7 +12787,12 @@ func testSendUpdateDisableChannel(net *lntest.NetworkHarness, t *harnessTest) {
// We create a new node Eve that has an inactive channel timeout of
// just 2 seconds (down from the default 20m). It will be used to test
// channel updates for channels going inactive.
eve, err := net.NewNode("Eve", []string{"--chan-disable-timeout=2s"})
eve, err := net.NewNode("Eve", []string{
"--minbackoff=10s",

This comment has been minimized.

Copy link
@Roasbeef

Roasbeef Feb 19, 2019

Member

I have a feeling that we'll need to adjust these values due to Travis flakiness in the future...

@@ -1782,6 +1780,25 @@ out:
// relevant sub-systems and launching a goroutine to
// wait for close tx conf.
p.finalizeChanClosure(chanCloser)

// The channel reannounce delay has elapsed, broadcast the

This comment has been minimized.

Copy link
@Roasbeef

Roasbeef Feb 19, 2019

Member

Nice, I think this'll probably make the biggest difference in our current handling, so we don't immediately enable and span the network due to a flappy peer.

@Roasbeef Roasbeef merged commit c8b5e1f into lightningnetwork:master Feb 19, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls First build on chan-status-manager at 60.124%
Details

@cfromknecht cfromknecht deleted the cfromknecht:chan-status-manager branch Feb 19, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.