
Chanfitness: Track peer uptime #3332

Open · wants to merge 3 commits into master from carlaKC:chanfitness-trackchanneluptime

Conversation

@carlaKC (Contributor) commented Jul 22, 2019

This PR introduces a new package chanfitness which is used to track scores for a node's channels.
Scores are currently kept in memory, with the intention of persisting them once the set of metrics needed to score nodes is more clearly defined. It is related to #1253, although the issue is quite old.

This change implements tracking for peer uptime by maintaining a log of channel events for each channel. Peer online/offline events are monitored by a goroutine on a per-peer basis, and channel creation/close events are monitored by a single goroutine.

This change also includes adding a channel fitness rpc subserver, but that can be split out into another PR if the change is too big.
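To give a rough idea of the shape of the in-memory log, here is a minimal sketch (illustrative only; the actual definitions live in the diff and may differ):

```go
package chanfitness

import "time"

// eventType describes the kinds of per-channel events we track.
type eventType int

const (
	peerOnlineEvent eventType = iota
	peerOfflineEvent
)

// channelEvent records a single timestamped event for a channel.
type channelEvent struct {
	timestamp time.Time
	eventType eventType
}

// eventLog holds the time-ordered events for one channel.
type eventLog struct {
	// channelPoint uniquely identifies the channel.
	channelPoint string

	// peer is the remote peer's public key (hex encoded here for
	// simplicity).
	peer string

	// events is the in-memory, time-ordered event list.
	events []channelEvent

	// openedAt is when we began monitoring the channel.
	openedAt time.Time

	// closedAt is non-zero once the channel has been closed.
	closedAt time.Time
}
```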

PR Checklist

  • All changes are Go version 1.12 compliant (used go1.12.6)
  • Go fmt, commented and wrapped at 80 lines
  • Make check, lint and go vet ok, builds properly!
  • Code accompanied by tests

@carlaKC carlaKC requested review from cfromknecht and Roasbeef as code owners Jul 22, 2019

@carlaKC carlaKC force-pushed the carlaKC:chanfitness-trackchanneluptime branch from e04f971 to 1683f2a Jul 22, 2019

@halseth (Collaborator) left a comment

Very impressed by this first iteration of the package! The design can be streamlined a little bit, but all in all I think this has taken the right direction.

First of all, the size of this PR is quite large, so I would suggest breaking it into smaller parts. It seems natural to let the pure event tracking be its own PR; then we can follow up with PRs to compute scores, expose them on the RPC, etc.

Secondly, with the above comment in mind, I think we should attempt to consolidate monitoring of channel events and peer activity. The first step here would be to add a SubscribePeerEvents API similar to what already exists for channels.

"github.com/lightningnetwork/lnd/subscribe"
)

// ScoreStore maintains a set of scores for the node's channels. It is intended

@halseth (Collaborator) commented Jul 24, 2019

Instead of the notion of this storing a "score", should we change it to store "events"? The score can later be calculated on the basis of the events. We could name this ChannelEventStore?

score = &ChannelScore{
id: channelID,
peer: peer,
quit: make(chan struct{}, 1),

@halseth (Collaborator) commented Jul 24, 2019

Take a look at how quit channels are defined and used in other places in the codebase :)
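For reference, the prevailing pattern is an unbuffered channel that is closed on shutdown rather than sent on, so one close unblocks every goroutine selecting on it. A rough sketch (details illustrative):

```go
package chanfitness

import "sync"

type ChannelScore struct {
	// quit is closed (never sent on) to signal shutdown, so it
	// needs no buffer: closing it unblocks all receivers at once.
	quit chan struct{}
	wg   sync.WaitGroup
}

func (c *ChannelScore) monitor(events <-chan interface{}) {
	defer c.wg.Done()

	for {
		select {
		case e := <-events:
			_ = e // Process the event.

		case <-c.quit:
			return
		}
	}
}

func (c *ChannelScore) Stop() {
	close(c.quit)
	c.wg.Wait()
}
```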

@carlaKC (Contributor, Author) commented Jul 25, 2019

Thanks for the review @halseth 🙏
Agreed on monitoring all peers rather than peers on a per-channel basis; it makes a lot more sense.

What do you think about adding a peernotifier package that provides a subscribe.Client like channotifier currently does? That way, chanfitness can subscribe to peer events in the same way that it does channels?
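Roughly what I'm picturing for the package (just a sketch; it assumes the subscribe package's Server/Client API, and the event type names are placeholders):

```go
package peernotifier

import "github.com/lightningnetwork/lnd/subscribe"

// PeerOnlineEvent is sent to subscribers when a peer connects.
type PeerOnlineEvent struct {
	PubKey [33]byte
}

// PeerOfflineEvent is sent to subscribers when a peer disconnects.
type PeerOfflineEvent struct {
	PubKey [33]byte
}

// PeerNotifier streams peer online/offline events to subscribers.
type PeerNotifier struct {
	ntfnServer *subscribe.Server
}

// SubscribePeerEvents returns a subscribe.Client which receives a
// stream of peer events, mirroring the channel notifier's API.
func (p *PeerNotifier) SubscribePeerEvents() (*subscribe.Client, error) {
	return p.ntfnServer.Subscribe()
}

// NotifyPeerOnline is called by the server when a peer connects.
func (p *PeerNotifier) NotifyPeerOnline(pubKey [33]byte) {
	event := PeerOnlineEvent{PubKey: pubKey}
	if err := p.ntfnServer.SendUpdate(event); err != nil {
		_ = err // Log the error; omitted in this sketch.
	}
}
```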

With that approach, I'd break this (entirely too big - sorry 😅) PR down into:

  1. Add peernotifier package in a new PR
  2. Reduce this PR to just consume channel events for now (+ remove grpc)
  3. Add peer monitoring to chanfitness
  4. Add grpc subserver for querying chanfitness

Sound ok?

@carlaKC carlaKC force-pushed the carlaKC:chanfitness-trackchanneluptime branch from 1683f2a to 7551588 Jul 25, 2019

@halseth (Collaborator) commented Jul 29, 2019

> What do you think about adding a peernotifier package that provides a subscribe.Client like channotifier currently does? That way, chanfitness can subscribe to peer events in the same way that it does channels?

👍

> With that approach, I'd break this (entirely too big - sorry 😅) PR down into:
>
>   1. Add peernotifier package in a new PR
>   2. Reduce this PR to just consume channel events for now (+ remove grpc)
>   3. Add peer monitoring to chanfitness
>   4. Add grpc subserver for querying chanfitness
>
> Sound ok?

I would start by doing 1 and 2, then we can take a look at how big 2 becomes and what we could reasonably include there 😄

@carlaKC carlaKC referenced this pull request Jul 29, 2019

@Roasbeef Roasbeef removed their request for review Jul 29, 2019

@Roasbeef Roasbeef requested review from joostjager and removed request for cfromknecht Aug 7, 2019

@halseth (Collaborator) commented Aug 7, 2019

Needs rebase!

@halseth (Collaborator) left a comment

This is coming along nicely!

I think we could also incorporate peer event tracking in this PR, to see how it will interact with the channel event store. Alternatively, a new PR that builds on this one and adds peer tracking could be created, to see how the final result will look.

@joostjager (Collaborator) left a comment

I only did a high-level pass.

With regard to scope: I prefer a PR to have some meaningful impact (personal opinion). Add functionality, fix a bug, speed something up, etc. In general I think it is best if it also makes the changes end to end, so that one can verify (run the code) that the goal of the PR is met.

This PR is mostly a horizontal slice with code that isn't being called into, except for starting the main event loop. A technical layer. Imo that bears the risk that code gets merged that needs to be revised later when doing the work in the higher-level layers. It may take some creativity to define an end-to-end PR that is not too big, but it seems that in this new domain there must be some good opportunities.

Also, if the goal is to track channel opens/closes on chain: it may be that that info is already available and doesn't need to be tracked. The channel point and the closing tx id are persisted in the database. Maybe the peer (TCP) uptime metric would have been more suitable to kick off the fitness system.

@@ -118,6 +120,7 @@ func init() {
chanbackup.UseLogger(chbuLog)
monitoring.UseLogger(promLog)
wtclient.UseLogger(wtclLog)
chanfitness.UseLogger(chftLog)

@joostjager (Collaborator) commented Aug 7, 2019

You could choose to use the addSubLogger construction here and define the log prefix in the package itself.
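Something along these lines (a sketch; the exact helper signature in log.go may differ, and the CHFT tag is just an example):

```go
package chanfitness

import "github.com/btcsuite/btclog"

// Subsystem defines the logging prefix for this package, so log.go
// doesn't need to hard-code it.
const Subsystem = "CHFT"

// log is the package-level logger, injected via UseLogger.
var log btclog.Logger

// UseLogger is called by the root log.go, e.g. via
// addSubLogger(chanfitness.Subsystem, chanfitness.UseLogger).
func UseLogger(logger btclog.Logger) {
	log = logger
}
```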

@carlaKC (Contributor, Author) commented Aug 7, 2019

Thanks for taking a look @joostjager

> In general I think it is best if it also makes the changes end to end, so that one can verify (run the code) that the goal of the PR is met.

> Imo that bears the risk that code gets merged that needs to be revised later when doing the work in the higher-level layers. It may take some creativity to define an end-to-end PR that is not too big, but it seems that in this new domain there must be some good opportunities.

I definitely agree on having a more horizontal PR in terms of being able to test the changes and figuring out how it will be used by higher levels. My original pass at this PR had an RPC subserver for getting scores included, but that ended up being a pretty huge PR so I removed it. My thoughts on expanding this PR to be more than just kicking off a logical loop:

  1. Re-add an RPC server so the endpoint can be called, although I don't feel like this properly covers what you mean by figuring out how it will be used on higher levels?
  2. Add some opt-in channel-closing behaviour to autopilot. Since we're only going to have uptime metrics available (see below), I'd say that this would have to be done with some minimum channel lifetime param (e.g. must be at least x days old) and a flap-rate threshold beyond which a channel is closed, so that we don't start maniacally closing channels. I'll give that a go and see how it looks; it might end up being too much for one change.

> Also, if the goal is to track channel opens/closes on chain: it may be that that info is already available and doesn't need to be tracked. The channel point and the closing tx id are persisted in the database. Maybe the peer (TCP) uptime metric would have been more suitable to kick off the fitness system.

This PR has been waiting on #3354 (which was originally included in my way too big first attempt) so that uptime can be added to the events store - that is the main goal for this PR. When peer events were split out, I updated this to just have open/close events to get an idea of what the structure would look like. I'm going to add online/offline events now that #3354 is merged - I should have made that clearer, sorry!

In terms of having open/close events at all, I think it still makes sense to have them while events are being kept in memory? That way the package doesn't have to touch the DB at all for now (and does not need to query it every time a score is requested). If/when we start looking at persisting score data, we can probably just get channel open/close events from chanDB.

@joostjager (Collaborator) commented Aug 7, 2019

I actually meant creating a vertical PR, cutting through several layers. Ok, horizontal, vertical, it isn't very self-describing, but I think we mean the same thing.

I would at least choose either channel open/close or peer online/offline, not both, for this PR. If you choose peer online/offline, maybe you can just expose a single value, "average peer uptime %", over the rpc. Or, even more minimalistic, just log the average peer uptime value periodically. Then at least some business value is created and users can see that it works.

For channel open/close: I don't know if it is a requirement not to touch the database for now. If you think you'll query chandb later, it may not be worth writing the code to send channel events now and removing it again later. In general, when storing redundant information, things may get out of sync.

@halseth (Collaborator) commented Aug 7, 2019

I think only doing peer uptime for now sounds like a good middle ground. That was what this PR originally set out to do, and would provide information we don't currently have anywhere.

The question is whether we can get away without channel events completely. Since a peer's uptime might not be very interesting if we don't have an open channel with it, it makes sense to somehow relate it to channel events. This information can possibly be fetched from the DB, but I'm not sure if we have it all (we only have block heights, not timestamps, even though we can fetch these from block headers), and I can imagine this could get more complicated than the current approach, as it is quite self-contained. Storing open/close events is also not that big of a deal IMO, as it is not in any way critical if it gets out of sync.

For using this information, we could start with something as simple as returning peer uptime % during a listchannels call :)

@joostjager (Collaborator) commented Aug 8, 2019

One other question: how does the channel fitness subsystem relate to lndmon?

From lndmon:

> while lnd provides some information, lndmon by far does the heavy lifting with regards to metrics. With lndmon's data, you can track routing fees over time [..]

This seems to overlap with what is started in this PR. For example, iterating over time series to extract useful values.

@carlaKC (Contributor, Author) commented Aug 8, 2019

> One other question: how does the channel fitness subsystem relate to lndmon?

I haven't looked at lndmon in detail, but I think that chanfitness is a reduced, in-house version of lndmon that we need for querying a specific subset of data? There are some things (like peer uptime) that could be exported to lndmon - which actually makes a case for tracking peer online status independent of channels in peernotifier, so that we can query uptime over a period (which could be exported to lndmon and chanfitness). I think that may be overkill for now, but moving the logic in chanfitness to peerNotifier at a later time will be pretty trivial.

> This seems to overlap with what is started in this PR. For example, iterating over time series to extract useful values.

In terms of metrics that may be duplicated on both sides, I don't think we need to specifically track an event log for the payment-level data (fees, failed/succeeded amounts), so that would minimise the amount of time series processing we do. Uptime is the only "stateful" series that made more sense to track as a time series (I did implement a running counter for uptime, but it was very messy). If it comes to a point where there's straight-up duplication between the two, then we can look at chanfitness exporting metrics for lndmon?

@joostjager (Collaborator) commented Aug 9, 2019

We had a bit more discussion on the topic. One thing that was brought up is the case where tracking of data can trigger certain actions, like auto-closing of channels. With lndmon, a loop would be created where lnd exports to lndmon, lndmon does time series analysis etc., and then pushes the result back into lnd to take action. This may not be the architecture that we want.

I think the main thing is for us to be aware that the fitness subsystem and lndmon are related and make a deliberate decision on where to implement new functionality.

I am a bit worried, though, about the extent to which we can actually foresee what we need. But yes, you are right: the fitness subsystem could in that case export to lndmon to prevent duplication.

With regard to useful things to track as time series in the fitness subsystem (in the future), the metric "channel profitability" always comes to mind, to be able to auto-close non-profitable channels. For channel profitability, an analysis over time of local balance and forwarding fees is required.

@joostjager (Collaborator) commented Aug 9, 2019

Tagging @valentinewallace as she's been developing lndmon.

@carlaKC carlaKC force-pushed the carlaKC:chanfitness-trackchanneluptime branch from 7551588 to 9238cf0 Aug 12, 2019

@carlaKC (Contributor, Author) commented Aug 12, 2019

PR updated to add a few changes:

  1. Include uptime in lnrpc/Channel which is returned by ListChannels
  2. Record online/offline events on startup for each channel and on open channel events
  • When we restart, we need to add a peer online/offline event for existing channels otherwise we will have to wait until they go online/offline to start tracking
  • A channel open event is not a guarantee that a peer is online, because they can go offline while the funding tx confirms, so we need to know their starting state

There is an edge case for existing channels where peers are recorded as offline because we have not connected to them at restart time, and then online when we establish a connection, which means existing channels will have 99% uptime rather than 100%. This can be addressed by adding a wait before recording initial peer state for channels (eg sleep one minute then check status) if necessary.
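For illustration, the seeding could look something like this, reusing the sketch types from the PR description above (seedState is a placeholder name):

```go
// seedState records the peer's current online state at the moment we
// begin tracking the channel. A channel open event alone does not imply
// the peer is online, since it can disconnect while the funding tx
// confirms.
func (e *eventLog) seedState(peerOnline bool) {
	state := peerOfflineEvent
	if peerOnline {
		state = peerOnlineEvent
	}

	e.events = append(e.events, channelEvent{
		timestamp: time.Now(),
		eventType: state,
	})
}
```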

@valentinewallace (Contributor) commented Aug 13, 2019

> With lndmon, a loop would be created where lnd exports to lndmon, lndmon does time series analysis etc., and then pushes the result back into lnd to take action. This may not be the architecture that we want.

I agree that sounds like an architecture to avoid 😄

> There are some things (like peer uptime) that could be exported to lndmon

That sounds useful. Prometheus may be able to track peer uptime already since it does know when the set of peers changes 🤔 Let me know if I can contribute more lndmon insight.

@joostjager (Collaborator) left a comment

Definitely much nicer now that something is visible on the rpc interface 👍

I left some comments. My main concern is about the flow of events from subsystems to the fitness tracker.

}

for _, test := range tests {
t.Run(test.name, func(t *testing.T) {

@joostjager (Collaborator) commented Aug 14, 2019

Nice parametrized test


// PeerEvents provides a subscription client which provides a stream of
// peer online/offline events.
PeerEvents func() (*subscribe.Client, error)

@joostjager (Collaborator) commented Aug 14, 2019

If these are untyped anyway, and especially if this is extended to a large list in the future, can't the store just have a single channel over which it receives all events? And then have all event sources drop their typed events into that channel.

@carlaKC (Author, Contributor) commented Aug 14, 2019

So we'd construct a function which pipes all of the subscriptions into a channel which we create in server.go, and then pass that to Config? I think that constructing that kind of aggregation from the caller sometimes bubbles complexity up rather than down, because if we have a lot of event sources it's simpler to just provide a list of functions than to have some long aggregation of them all in the middle of the newServer call?

Having had my philosophical ramble, I think this may address some of the concerns you have about the unit tests so will give this a try :)

@joostjager (Collaborator) commented Aug 14, 2019

I was actually thinking something simpler. Just make the fitness system a dependency of the other subsystems. It exposes a single function SendEvent to those subsystems. In the implementation of SendEvent, the only thing that happens is that the (untyped) event is dropped into a single buffered channel or queue. The main loop of the fitness system is receiving on that channel and processing the events via a type switch.

But I could be missing something here of course.
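A rough sketch of what I mean (all names illustrative):

```go
package chanfitness

// Illustrative event types; real ones would carry richer data.
type peerOnline struct {
	peer [33]byte
}

type channelOpened struct {
	channelPoint string
}

// ChannelEventStore receives every event over one buffered channel.
type ChannelEventStore struct {
	events chan interface{}
	quit   chan struct{}
}

// SendEvent is the single entry point other subsystems call. It just
// drops the untyped event into the buffered channel.
func (s *ChannelEventStore) SendEvent(event interface{}) {
	select {
	case s.events <- event:
	case <-s.quit:
	}
}

// mainLoop demultiplexes all incoming events with a type switch.
func (s *ChannelEventStore) mainLoop() {
	for {
		select {
		case e := <-s.events:
			switch e := e.(type) {
			case peerOnline:
				_ = e.peer // Record an online event.

			case channelOpened:
				_ = e.channelPoint // Start tracking.
			}

		case <-s.quit:
			return
		}
	}
}
```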

@carlaKC (Author, Contributor) commented Aug 16, 2019

My main concern with a SendEvent call is that it feels a bit counterintuitive to be sending events when we have two subscriptions specifically intended for receiving events.

In the case where we don't have subscriptions, a SendEvent makes sense. And I suspect this will be the case for the remainder of the events we add to channel scores (ie htlc level events).

Since it probably (?) doesn't make sense to make subscriptions for htlc-level events, the options would be:

  1. Use SendEvent everywhere, accepting that it's a bit of an odd construction to send events in the cases where you have subscriptions available
  2. Use SendEvent for non-subscription events and still pass the Peer/Chan Subscriptions in, but pipe those into the SendEvent channel so that monitoring happens in one central place.

@joostjager (Collaborator) commented Aug 16, 2019

Isn't it that these subscriptions only exist to serve the event store? In server.go, the call s.peerNotifier.NotifyPeerOnline(pubKey) is made, which could be replaced by a call directly into the event store SendEvent?

I am looking for the simplest thing that works, but may miss something because I didn't follow the design decision around peer notifier.

@carlaKC (Author, Contributor) commented Aug 22, 2019

An events chan is looking a lot cleaner for tests, though. The only ugliness is that it has to be called from the channel open/close notification functions, which feels like doing the same thing twice. So it really is dependent on whether subscriptions actually have other use cases, which I'm unsure of.

@Roasbeef (Member) commented Aug 23, 2019

chanfitness internally manages a queue to not block reporting sub-systems. There only needs to be a single queue for all events.

It would more or less need to re-implement the current queue handling in the subscribe package to do this. Therefore we should re-use what we have rather than attempt to over-optimize for this use case, which isn't performance critical and is still nascent.

@Roasbeef (Member) commented Aug 23, 2019

I don't think we need to get caught up in this relatively minor difference. The original primary reviewer (Johan) was OK with the current approach. We shouldn't undo that prior lineage due to a new reviewer stepping in. For the sake of making progress, I don't think there's a fundamental principle that pulls us in one direction or the other. Therefore we can continue down the current path, which had already received review from the former primary reviewer.

If this was more critical code such as the channel state machine or revocation handling, then I would be willing to spend more time here, but it's a relatively independent sub-system that is non-critical.

@joostjager (Collaborator) commented Aug 26, 2019

Even though Johan was ok with the current approach, it could still be that he likes the alternative more. I don't think we've discussed that. But yes, both approaches lead to functioning, correct code. This is only about preventing boilerplate. I can put aside my concerns about that and see how it works out.

@halseth (Collaborator) commented Aug 26, 2019

I think the current approach is more general and would scale better with more subscription types, and more subscribers to the different types.

We had to take a similar decision when we initially added the NotifyPeerOnline API to the fundingmanager. We could have made the server call into the funding manager directly, but instead we made it a general subscription API that any subsystem could call. This has turned out useful to have access to also in other packages.

I also don't see the boilerplate we would avoid, since it is mostly contained in the subscribe package?

The percentage of the channel's lifetime that the remote peer has been
online.
*/
int64 uptime_percentage = 22 [json_name = "uptime_percentage"];

@joostjager (Collaborator) commented Aug 14, 2019

Maybe we can also add the channel lifetime here, as we have it available anyway. And I think we should also mark the fields as 'experimental' in the comment, so that we can make breaking changes later if we want to.

@carlaKC (Author, Contributor) commented Aug 14, 2019

Sounds good, it also makes it less likely that the uptime is misinterpreted by users as applying to the whole lifetime of the channel.

How about:
observed_lifetime: total seconds we have been monitoring the channel
observed_uptime: total seconds we have observed the peer online during monitoring

Then users can do their own calculations, and there's no rounding to be dealt with. There's probably a better var name than observed, because that makes it sound like we didn't know about the channel before that, but I'll scour synonyms.com later :)

@joostjager (Collaborator) commented Aug 14, 2019

Sounds good. Sounds better. I don't think the observed prefix is needed, is it?

@carlaKC (Author, Contributor) commented Aug 16, 2019

How should I mark fields as experimental? I can't see any examples in the code.
Thinking of something along these lines in the field comments?

    [experimental]: this field is experimental and may be subject to change.

@joostjager (Collaborator) commented Aug 22, 2019

We indeed don't have that yet. All our experimental fields are in sub-servers atm. Yes, something along those lines.

@halseth (Collaborator) commented Aug 26, 2019

IMO we don't have to mark them experimental. We can rather deprecate and add new fields later if we need to.


// ChannelEventStore maintains a set of event logs for the node's channels.
// It is intended to provide insight into the performance of channels.
type ChannelEventStore struct {

@joostjager (Collaborator) commented Aug 14, 2019

Main type in this file doesn't match file name

@halseth (Collaborator) commented Aug 26, 2019

+1 on renaming the file

@carlaKC (Author, Contributor) commented Aug 27, 2019

Should I rename the package? I think chanfitness (or channelscores) still works, because the package will eventually be revealing scores, but at the moment there won't be a file with the package name, which is weird.

@halseth (Collaborator) commented Aug 28, 2019

I would keep the package name, only change the file name :)

chanfitness: Add channel event log structure
This commit adds a chanfitness package which will be used to track
channel scores for autopilot. It adds a channel event structure
which will be used to track channel opens/closes and peer uptime.
The eventLog implements an uptime function which calculates uptime
over a given period, and a lifespan function which returns the time
when the log began monitoring the channel and, if the channel is
closed, the time when it stopped monitoring it.
carlaKC added 2 commits Aug 8, 2019
chanfitness: Add channel event store
This commit adds a channel event store to the channel fitness
package which is used to manage tracking of a node's channels.
It adds tracking for channel open/closed and peer online/offline
events for all channels that a node has open.

Events are consumed from channelNotifier and peerNotifier event
subscriptions. If either of these subscriptions is cancelled,
channel scoring stops, because both subscriptions are expected
to run until node shutdown.

Two functions are exposed to allow external callers to get uptime
information about a channel. GetLifespan returns the period over
which the channel has been monitored. GetUptime returns the channel's
uptime over a specified period. Callers can use these functions to
get the channel's remote peer uptime over its entire lifetime, or
a subset of that period.
lnrpc: Add experimental uptime and lifetime fields to channel struct
This commit adds the total observed lifetime of a channel and the
total uptime of its remote peer to the lnrpc channel struct. These
fields are marked as experimental because they are subject to change.
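For a concrete sense of the uptime calculation the first commit describes, here is a rough sketch over the illustrative event-log types from the top of this page (not the committed code):

```go
// uptime sums the time the peer was observed online within the range
// [start, end), walking the time-ordered event log.
func (e *eventLog) uptime(start, end time.Time) time.Duration {
	var (
		total       time.Duration
		onlineSince time.Time
		online      bool
	)

	for _, event := range e.events {
		switch event.eventType {
		case peerOnlineEvent:
			if !online {
				online = true
				onlineSince = event.timestamp
			}

		case peerOfflineEvent:
			if online {
				online = false
				total += overlap(
					onlineSince, event.timestamp,
					start, end,
				)
			}
		}
	}

	// If the peer is still online at the end of the log, count the
	// final period up to end.
	if online {
		total += overlap(onlineSince, end, start, end)
	}

	return total
}

// overlap returns the duration of the intersection of [from, to) and
// [start, end).
func overlap(from, to, start, end time.Time) time.Duration {
	if from.Before(start) {
		from = start
	}
	if to.After(end) {
		to = end
	}
	if !to.After(from) {
		return 0
	}

	return to.Sub(from)
}
```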

@carlaKC carlaKC force-pushed the carlaKC:chanfitness-trackchanneluptime branch from 9238cf0 to 8309a24 Aug 28, 2019

@carlaKC (Contributor, Author) commented Aug 28, 2019

Pushed a few small fixes. Still working on tests for chaneventstore.

Thinking ahead, I'm wondering whether it would be better to have a channelEvent interface in chanevent.go, since the next set of events (HTLCs) will probably have data that we want to store in the event log (such as amount/fees). Alternatively, this could just be done in a separate log with a specific type, since HTLC events and peer online/offline events are unlikely to interact?
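Roughly the shape I have in mind (placeholder names):

```go
package chanfitness

import "time"

// channelEvent would let one ordered log hold mixed event kinds: each
// event carries its own payload but exposes a common timestamp.
type channelEvent interface {
	// Timestamp returns when the event occurred.
	Timestamp() time.Time
}

// onlineEvent marks the peer coming online.
type onlineEvent struct {
	ts time.Time
}

func (o onlineEvent) Timestamp() time.Time { return o.ts }

// htlcEvent carries HTLC-level data such as amounts and fees.
type htlcEvent struct {
	ts      time.Time
	amtMsat uint64
	feeMsat uint64
}

func (h htlcEvent) Timestamp() time.Time { return h.ts }
```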

@halseth (Collaborator) commented Aug 29, 2019

> Thinking ahead, I'm wondering whether it would be better to have a channelEvent interface in chanevent.go

That indeed sounds like a good idea IMO, but that change can possibly be done when needed? :)

@Roasbeef (Member) commented Sep 19, 2019

OP ping!
