Chanfitness: Track peer uptime #3332

Merged

Collaborator

@carlaKC carlaKC commented Jul 22, 2019

This PR introduces a new package chanfitness which is used to track scores for a node's channels.
Scores are currently kept in memory, with the intention of persisting them once the set of metrics needed to score nodes is more clearly defined. It is related to #1253, although the issue is quite old.

This change implements tracking for peer uptime, by maintaining a log of channel events for each channel. Peer online/offline events are monitored by a goroutine on a per peer basis, and channel creation/close events are monitored by a single goroutine.

This change also includes adding a channel fitness rpc subserver, but that can be split out into another PR if the change is too big.

PR Checklist

  • All changes are Go version 1.12 compliant (used go1.12.6)
  • Go fmt, commented and wrapped at 80 lines
  • Make check, lint and go vet ok, builds properly!
  • Code accompanied by tests
@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from e04f971 to 1683f2a Jul 22, 2019
Collaborator

@halseth halseth left a comment

Very impressed by this first iteration of the package! The design can be streamlined a little bit, but all in all I think this has taken the right direction.

First of all, the size of this PR is quite large, so I would suggest breaking it into smaller parts. It seems natural to let the pure event tracking be its own PR; then we can follow up with PRs to get scores, expose them on the RPC, etc.

Secondly, with the above comment in mind, I think we should attempt to consolidate monitoring of channel events and peer activity. The first step here would be to add a SubscribePeerEvents API similar to what already exists for channels.
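
As a rough illustration of the subscription idea, here is a minimal fan-out notifier in Go. The names PeerNotifier, PeerEvent and notify are hypothetical and are not lnd's actual API; only the name SubscribePeerEvents comes from the suggestion above, and its signature here is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// PeerEvent is a hypothetical event type describing a peer's
// connection state change. It is not part of lnd.
type PeerEvent struct {
	PubKey string
	Online bool
}

// PeerNotifier is a hypothetical notifier that fans peer events out
// to every registered subscriber.
type PeerNotifier struct {
	mu   sync.Mutex
	subs []chan PeerEvent
}

// SubscribePeerEvents registers a new subscriber and returns a channel
// on which it will receive all subsequent peer events.
func (n *PeerNotifier) SubscribePeerEvents() <-chan PeerEvent {
	n.mu.Lock()
	defer n.mu.Unlock()

	// Buffered so that a slow subscriber does not immediately block
	// the notifier in this simplified sketch.
	ch := make(chan PeerEvent, 16)
	n.subs = append(n.subs, ch)

	return ch
}

// notify delivers an event to every current subscriber.
func (n *PeerNotifier) notify(e PeerEvent) {
	n.mu.Lock()
	defer n.mu.Unlock()

	for _, ch := range n.subs {
		ch <- e
	}
}

func main() {
	n := &PeerNotifier{}
	events := n.SubscribePeerEvents()

	n.notify(PeerEvent{PubKey: "peer-a", Online: true})
	n.notify(PeerEvent{PubKey: "peer-a", Online: false})

	for i := 0; i < 2; i++ {
		e := <-events
		fmt.Printf("%s online=%v\n", e.PubKey, e.Online)
	}
}
```

A real implementation would also need unsubscribe handling and an unbounded queue so that subscribers can never block the sender; both are left out here for brevity.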

"github.com/lightningnetwork/lnd/subscribe"
)

// ScoreStore maintains a set of scores for the node's channels. It is intended
Collaborator

@halseth halseth Jul 24, 2019

Instead of the notion of this storing a "score", should we change it to store "events"? The score can later be calculated on the basis of the events. We could name this ChannelEventStore.

score = &ChannelScore{
id: channelID,
peer: peer,
quit: make(chan struct{}, 1),
Collaborator

@halseth halseth Jul 24, 2019

take a look at how quit channels are defined and used other places in the codebase :)
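
For readers unfamiliar with the idiom being hinted at, a common Go pattern is an unbuffered quit channel that is closed (rather than sent on) to signal every goroutine at once, paired with a sync.WaitGroup. The sketch below is a generic illustration with a hypothetical worker type; it is an assumption about the general shape, not a copy of any lnd subsystem.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical subsystem used only for illustration.
type worker struct {
	wg   sync.WaitGroup
	quit chan struct{}
	once sync.Once
}

func newWorker() *worker {
	return &worker{
		// Unbuffered: the channel is never sent on, only closed.
		quit: make(chan struct{}),
	}
}

// Start launches a goroutine that handles events until quit is closed.
func (w *worker) Start(events <-chan int) {
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()

		for {
			select {
			case e := <-events:
				fmt.Println("event", e)

			case <-w.quit:
				return
			}
		}
	}()
}

// Stop closes quit exactly once and waits for the goroutine to exit.
func (w *worker) Stop() {
	w.once.Do(func() {
		close(w.quit)
	})
	w.wg.Wait()
}

func main() {
	events := make(chan int)
	w := newWorker()
	w.Start(events)

	events <- 1
	w.Stop()
	w.Stop() // A second call is safe because of sync.Once.
}
```

Closing the channel unblocks every receiver, which is why no buffer is needed; a buffered `make(chan struct{}, 1)` only wakes a single receiver per send.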

Collaborator Author

@carlaKC carlaKC commented Jul 25, 2019

Thanks for the review @halseth 🙏
Agreed on monitoring all peers rather than peers on a per channel basis, makes a lot more sense.

What do you think about adding a peernotifier package that provides a subscribe.Client like channotifier currently does? That way, chanfitness can subscribe to peer events in the same way that it does channels?

With that approach, I'd break this (entirely too big - sorry 😅) PR down into:

  1. Add peernotifier package in a new PR
  2. Reduce this PR to just consume channel events for now (+ remove grpc)
  3. Add peer monitoring to chanfitness
  4. Add grpc subserver for querying chanfitness

Sound ok?

@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from 1683f2a to 7551588 Jul 25, 2019
Collaborator

@halseth halseth commented Jul 29, 2019

What do you think about adding a peernotifier package that provides a subscribe.Client like channotifier currently does? That way, chanfitness can subscribe to peer events in the same way that it does channels?

👍

With that approach, I'd break this (entirely too big - sorry 😅) PR down into:

  1. Add peernotifier package in a new PR
  2. Reduce this PR to just consume channel events for now (+ remove grpc)
  3. Add peer monitoring to chanfitness
  4. Add grpc subserver for querying chanfitness

Sound ok?

I would start by doing 1 and 2, then we can take a look at how big 2 becomes and what we could reasonably include there 😄

@Roasbeef Roasbeef removed their request for review Jul 29, 2019
@Roasbeef Roasbeef requested review from joostjager and removed request for cfromknecht Aug 7, 2019
Collaborator

@halseth halseth commented Aug 7, 2019

Needs rebase!

Collaborator

@halseth halseth left a comment

This is coming along nicely!

I think we could also incorporate peer event tracking in this PR, to see how it will interact with the channel event store. Alternatively a new PR that builds on this one that adds peer tracking can be created, to see how the final result will look.

// id is the uInt64 of the short channel ID.
id uint64

peer [33]byte
Collaborator

@halseth halseth Aug 7, 2019

missing godoc

Collaborator

@joostjager joostjager Aug 14, 2019

for pubkeys, there is also the route.Vertex type

Collaborator

@joostjager joostjager Oct 15, 2019

No route.Vertex?

Collaborator

@halseth halseth Oct 23, 2019

I don't think we need to add a dependency on routing, so would prefer keeping as is. I'm fine with either approach.

Collaborator Author

@carlaKC carlaKC Oct 24, 2019

Can we leave it as is for now and then I'll update PeerNotifier and this code in a separate PR?
Discussed, will change in this PR then update PeerNotifier separately.

Collaborator

@joostjager joostjager left a comment

I only did a high level pass.

With regard to scope: I prefer a pr to have some meaningful impact (personal opinion). Add functionality, fix a bug, speed up something, etc. In general I think it is best if it also makes the changes end to end, so that one can verify (run the code) that the goal of the pr is met.

This pr is mostly a horizontal slice with code that isn't being called into except for starting the main event loop. A technical layer. Imo that bears the risk that code gets merged that needs to be revised later when doing the work in the higher level layers. It may take some creativity to define an end to end pr that is not too big, but it seems that in this new domain there must be some good opportunities.

Also, if the goal is to track channel open/closes on chain: it may be that that info is already available and doesn't need to be tracked. The channel point and the closing tx id are persisted in the database. Maybe the peer (tcp) up time metric would have been more suitable to kick off the fitness system.

log.go Outdated
@@ -118,6 +120,7 @@ func init() {
chanbackup.UseLogger(chbuLog)
monitoring.UseLogger(promLog)
wtclient.UseLogger(wtclLog)
chanfitness.UseLogger(chftLog)
Collaborator

@joostjager joostjager Aug 7, 2019

You could choose to use the addSubLogger construction here and define the log prefix in the package itself.

Collaborator Author

@carlaKC carlaKC commented Aug 7, 2019

Thanks for taking a look @joostjager

In general I think it is best if it also makes the changes end to end, so that one can verify (run the code) that the goal of the pr is met.

Imo that bears the risk that code gets merged that needs to be revised later when doing the work in the higher level layers. It may take some creativity to define an end to end pr that is not too big, but it seems that in this new domain there must be some good opportunities.

I definitely agree on having a more horizontal PR in terms of being able to test the changes and figuring out how it will be used by higher levels. My original pass at this PR had a rpcsubserver for getting scores included, but that ended up being a pretty huge PR so I removed it. My thoughts on expanding this PR to be more than just kicking off a logical loop:

  1. Re-add a RPC server so the endpoint can be called, although I don't feel like this properly covers what you mean by figuring out how it will be used on higher levels?
  2. Add some opt-in channel closing behaviour to autopilot. Since we're only going to have uptime metrics available (see below), I'd say that this would have to be done with some minimum channel lifetime param (eg must be at least x days old) and threshold flap rate which is not acceptable so that we don't start maniacally closing channels. I'll give that a go and see how it looks, might end up being too much for one change.

Also, if the goal is to track channel open/closes on chain: it may be that that info is already available and doesn't need to be tracked. The channel point and the closing tx id are persisted in the database. Maybe the peer (tcp) up time metric would have been more suitable to kick off the fitness system.

This PR has been waiting on #3354 (which was originally included in my way too big first attempt) so that uptime can be added to the events store - that is the main goal for this PR. When peer events were split out, I updated this to just have open/close events to get an idea of what the structure would look like. I'm going to add online/offline events now that #3354 is merged - I should have made that clearer, sorry!

In terms of having open/close events at all, I think it still makes sense to have them while events are being kept in memory? That way the package doesn't have to touch the DB at all for now (and does not need to query it every time a score is requested). If/when we start looking at persisting score data, we can probably just get channel open/close events from chanDB.

Collaborator

@joostjager joostjager commented Aug 7, 2019

I actually meant creating a vertical PR, cutting through several layers. Ok, horizontal, vertical, it isn't very self-describing, but I think we mean the same thing.

I would at least choose either channel open/close or peer online/offline, not both, for this pr. If you choose peer online/offline, maybe you can just expose a single value "average peer uptime %" over the rpc. Or even more minimalistic, just log the average peer uptime value periodically. Then at least, some business value is created and users can see that it works.

For channel open/close: I don't know if it is a requirement to not touch the database for now. If you think you'll query chandb later, it may not be worth writing the code to send channel events now and removing it again later. In general, when storing redundant information, things may get out of sync.

Collaborator

@halseth halseth commented Aug 7, 2019

I think only doing peer uptime for now sounds like a good middle ground. That was what this PR originally set out to do, and would provide information we don't currently have anywhere.

The question is whether we can get away with channel events completely. Since a peer's uptime might not be very interesting if we don't have an open channel with it, it makes sense to somehow relate it to channel events. This information can possibly be fetched from the DB, but I'm not sure if we have it all (we only have block heights, not timestamps, even though we can fetch these from block headers), and I can imagine this could get more complicated than the current approach, as it is quite self contained. Storing open/close events is also not that big of a deal IMO, as it is not in any way critical if it gets out of sync.

For using this information, we could start with something as simple as returning peer uptime % during a listchannels call :)

Collaborator

@joostjager joostjager commented Aug 8, 2019

One other question: how does the channel fitness subsystem relate to lndmon?

From lndmon:

while lnd provides some information, lndmon by far does the heavy lifting with regards to metrics. With lndmon's data, you can track routing fees over time [..]

This seems to overlap with what is started in this pr. For example iterating over time series to extract useful values.

Collaborator Author

@carlaKC carlaKC commented Aug 8, 2019

One other question: how does the channel fitness subsystem relate to lndmon?

I haven't looked at lndmon in detail, but I think that chanfitness is a reduced, in-house version of lndmon that we need for querying a specific subset of data? There are some things (like peer uptime) that could be exported to lndmon - which actually makes a case for tracking peer online status independent of channels in peernotifier so that we can query uptime over a period (which could be exported to lndmon and chanfitness). I think that may be overkill for now, but moving the logic in chanfitness to peerNotifier at a later time will be pretty trivial.

This seems to overlap with what is started in this pr. For example iterating over time series to extract useful values.

In terms of metrics that may be duplicated on both sides, I don't think we need to specifically track an event log for the payment level data (fees, failed/succeeded amounts) so that would minimise the amount of time series processing we do. Uptime is the only "stateful" series that made more sense to track as time series (I did implement a running counter for uptime but it was very messy). If it comes to a point where there's straight up duplication between the two, then we can look at chanfitness exporting metrics for lndmon?

Collaborator

@joostjager joostjager commented Aug 9, 2019

We had a bit more discussion on the topic. One thing that was brought up, is the case where tracking of data can trigger certain actions like auto-closing of channels. With lndmon, a loop would be created where lnd exports to lndmon, lndmon does time series/analysis/etc and then pushes the result back into lnd to take action. This may not be the architecture that we want.

I think the main thing is for us to be aware that the fitness subsystem and lndmon are related and make a deliberate decision on where to implement new functionality.

I am a bit worried though to which extent we can actually foresee what we need. But yes, you are right, the fitness subsystem could in that case export to lnd to prevent duplication.

With regard to useful things to track as time series in the fitness subsystem (in the future), the metric 'channel profitability' always comes to my mind. To be able to auto-close non-profitable channels. For channel profitability, an analysis over time of local balance and fwding fees is required.

Collaborator

@joostjager joostjager commented Aug 9, 2019

Tagging @valentinewallace as she's been developing lndmon.

@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from 7551588 to 9238cf0 Aug 12, 2019
Collaborator Author

@carlaKC carlaKC commented Aug 12, 2019

PR updated to add a few changes:

  1. Include uptime in lnrpc/Channel which is returned by ListChannels
  2. Record online/offline events on startup for each channel and on open channel events
      • When we restart, we need to add a peer online/offline event for existing channels otherwise we will have to wait until they go online/offline to start tracking
      • A channel open event is not a guarantee that a peer is online, because they can go offline while the funding tx confirms, so we need to know their starting state

There is an edge case for existing channels where peers are recorded as offline because we have not connected to them at restart time, and then online when we establish a connection, which means existing channels will have 99% uptime rather than 100%. This can be addressed by adding a wait before recording initial peer state for channels (eg sleep one minute then check status) if necessary.

Contributor

@valentinewallace valentinewallace commented Aug 13, 2019

With lndmon, a loop would be created where lnd exports to lndmon, lndmon does time series/analysis/etc and then pushes the result back into lnd to take action. This may not be the architecture that we want.

I agree that sounds like an architecture to avoid 😄

There are some things (like peer uptime) that could be exported to lndmon

That sounds useful. Prometheus may be able to track peer uptime already since it does know when the set of peers changes 🤔 Let me know if I can contribute more lndmon insight.

Collaborator

@joostjager joostjager left a comment

Definitely much nicer now that something is visible on the rpc interface 👍

I left some comments. My main concern is about the flow of events from subsystems to the fitness tracker.

// PeerEvents provides a subscription client which provides a stream of
// peer online/offline events.
PeerEvents func() (*subscribe.Client, error)
Collaborator

@joostjager joostjager Aug 14, 2019

If these are untyped anyway, and especially if this is extended to a large list in the future, can't the store just have a single channel over which it receives all events? And then have all event sources drop their typed events into that channel.

Collaborator Author

@carlaKC carlaKC Aug 14, 2019

So we'd construct a function which pipes all of the subscriptions into a channel which we create in server.go and then pass that to Config? I think that constructing that kind of aggregation from the caller bubbles complexity up rather than down sometimes, because if we have a lot of event sources it's simpler to just provide a list of functions than have some long aggregations of them all in the middle of the newServer call?

Having had my philosophical ramble, I think this may address some of the concerns you have about the unit tests so will give this a try :)

Collaborator

@joostjager joostjager Aug 14, 2019

I was actually thinking something simpler. Just make the fitness system a dependency of the other subsystems. It exposes a single function SendEvent to those subsystems. In the implementation of SendEvent, the only thing that happens is that the (untyped) event is dropped into a single buffered channel or queue. The main loop of the fitness system is receiving on that channel and processing the events via a type switch.

But I could be missing something here of course.
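
To make the proposal concrete, here is a minimal sketch of that shape: a single buffered channel fed by a SendEvent function and drained by a main loop that dispatches on a type switch. The event types (peerOnline, peerOffline, channelOpened) and the eventStore struct are hypothetical names chosen for this illustration; they are not the types used in the PR.

```go
package main

import "fmt"

// Hypothetical typed events that different subsystems might send.
type peerOnline struct{ peer string }
type peerOffline struct{ peer string }
type channelOpened struct{ chanID uint64 }

// eventStore is a hypothetical store with one queue for all events.
type eventStore struct {
	events chan interface{}
	done   chan struct{}

	online map[string]bool
	open   map[uint64]bool
}

func newEventStore() *eventStore {
	s := &eventStore{
		events: make(chan interface{}, 100),
		done:   make(chan struct{}),
		online: make(map[string]bool),
		open:   make(map[uint64]bool),
	}
	go s.run()

	return s
}

// SendEvent drops an untyped event into the single buffered queue.
func (s *eventStore) SendEvent(e interface{}) {
	s.events <- e
}

// run is the main loop: it receives events and handles each one
// according to its concrete type.
func (s *eventStore) run() {
	defer close(s.done)

	for e := range s.events {
		switch ev := e.(type) {
		case peerOnline:
			s.online[ev.peer] = true

		case peerOffline:
			s.online[ev.peer] = false

		case channelOpened:
			s.open[ev.chanID] = true

		default:
			fmt.Printf("unknown event type %T\n", ev)
		}
	}
}

// Stop closes the queue and waits for the loop to drain it.
func (s *eventStore) Stop() {
	close(s.events)
	<-s.done
}

func main() {
	s := newEventStore()

	s.SendEvent(peerOnline{peer: "peer-a"})
	s.SendEvent(channelOpened{chanID: 1})
	s.SendEvent(peerOffline{peer: "peer-a"})
	s.Stop()

	fmt.Println("peer-a online:", s.online["peer-a"])
	fmt.Println("channel 1 open:", s.open[1])
}
```

The trade-off discussed in this thread is visible here: producers need a direct dependency on the store, and a full buffer would block them, which is the queueing concern the subscribe package already handles.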

Collaborator Author

@carlaKC carlaKC Aug 16, 2019

My main concern with a SendEvent call is that it feels a bit counterintuitive to be sending events when we have two subscriptions specifically intended for receiving events.

In the case where we don't have subscriptions, a SendEvent makes sense. And I suspect this will be the case for the remainder of the events we add to channel scores (ie htlc level events).

Since it probably (?) doesn't make sense to make subscriptions for htlc level events the options would be:

  1. Use SendEvent everywhere, accepting that it's a bit of an off construction to send events in the cases where you have subscriptions available
  2. Use SendEvent for non-subscription events and still pass the Peer/Chan Subscriptions in, but pipe those into the SendEvent channel so that monitoring happens in one central place.

Collaborator

@joostjager joostjager Aug 16, 2019

Isn't it that these subscriptions only exist to serve the event store? In server.go, the call s.peerNotifier.NotifyPeerOnline(pubKey) is made, which could be replaced by a call directly into the event store SendEvent?

I am looking for the simplest thing that works, but may miss something because I didn't follow the design decision around peer notifier.

Collaborator Author

@carlaKC carlaKC Aug 22, 2019

Although an events chan is looking a lot cleaner for tests. The only ugliness is that it has to be called from the channel open/close notification functions, which feels like doing the same thing twice. So it really is dependent on whether subscriptions actually have other use cases, which I'm unsure of.

Member

@Roasbeef Roasbeef Aug 23, 2019

chanfitness internally manages a queue to not block reporting sub-systems. There only needs to be a single queue for all events.

It would more or less need to re-implement the current queue handling in the subscribe package to do this. Therefore we should re-use what we have rather than attempt to over optimize for this use case which isn't performance critical and still nascent.

Member

@Roasbeef Roasbeef Aug 23, 2019

I don't think we need to get caught up in this relatively minor difference. The original primary reviewer (johan) was OK with the current approach. We shouldn't undo that prior lineage due to a new reviewer stepping in. For the sake of making progress, I don't think there's a fundamental principle that tends us in one direction or the other. Therefore we can continue down the current path which had already received review by the former primary reviewer.

If this was more critical code such as the channel state machine or revocation handling, then I would be willing to spend more time here, but it's a relatively independent sub-system that is non-critical.

Collaborator

@joostjager joostjager Aug 26, 2019

Even though Johan was ok with the current approach, it could still be that he likes the alternative more. I don't think we've discussed that. But yes, both approaches lead to functioning, correct code. This is only about preventing boiler plate. I can put aside my concerns about that and see how it works out.

Collaborator

@halseth halseth Aug 26, 2019

I think the current approach is more general and would scale better with more subscription types, and more subscribers to the different types.

We had to take a similar decision when we initially added the NotifyPeerOnline API to the fundingmanager. We could have made the server call into the funding manager directly, but rather we made it a general subscription API that any subsystem could call. This has turned out useful to have access to also in other packages.

I also don't see the boiler plate we would avoid, since it is mostly contained in the subscribe package?

lnrpc/rpc.proto Outdated
The percentage of the channel's lifetime that the remote peer has been
online.
*/
int64 uptime_percentage = 22 [json_name = "uptime_percentage"];
Collaborator

@joostjager joostjager Aug 14, 2019

Maybe we can also add the channel life time here, as we have it available anyway. And I think we should also mark the fields as 'experimental' in the comment, so that we can make breaking changes later if we want to.

Collaborator Author

@carlaKC carlaKC Aug 14, 2019

Sounds good, it also makes it less likely that the uptime is misinterpreted by users as applying to the whole lifetime of the channel.

How about:
observed_lifetime: total seconds we have been monitoring the channel
observed_uptime: total seconds we have observed the peer online during monitoring

Then users can do their own calculations, and there's no rounding to be dealt with. There's probably a better var name than observed because that makes it sound like we didn't know about the channel before that, but I'll scour synonyms.com later :)
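
As an illustration of the "users can do their own calculations" point, a client receiving the two proposed second counts could derive a percentage as sketched below. The function name uptimePercent is hypothetical, and the inputs simply stand in for the proposed observed_lifetime and observed_uptime values.

```go
package main

import "fmt"

// uptimePercent returns uptime as a percentage of lifetime. Both
// arguments are durations in seconds. A non-positive lifetime yields
// 0 to avoid dividing by zero.
func uptimePercent(lifetime, uptime int64) float64 {
	if lifetime <= 0 {
		return 0
	}

	return 100 * float64(uptime) / float64(lifetime)
}

func main() {
	// A channel monitored for one day with the peer online for 18
	// hours of that day.
	fmt.Printf("%.1f%%\n", uptimePercent(86400, 64800))
}
```

Exposing the raw counts leaves the choice of precision and rounding to the caller, which an integer percentage field would not.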

Collaborator

@joostjager joostjager Aug 14, 2019

Sounds good. Sounds better. I don't think the observed prefix is needed, is it?

Collaborator Author

@carlaKC carlaKC Aug 16, 2019

How should I mark fields as experimental? I can't see any examples in the code.
Thinking of something along these lines in the field comments?

    [experimental]: this field is experimental and may be subject to change.

Collaborator

@joostjager joostjager Aug 22, 2019

We indeed don't have that yet. All our experimental fields are in sub-servers atm. Yes, something along those lines.

Collaborator

@halseth halseth Aug 26, 2019

IMO we don't have to mark them experimental. We can rather deprecate and add new fields later if we need to.


// ChannelEventStore maintains a set of event logs for the node's channels.
// It is intended to provide insight into the performance of channels.
type ChannelEventStore struct {
Collaborator

@joostjager joostjager Aug 14, 2019

Main type in this file doesn't match file name

Collaborator

@halseth halseth Aug 26, 2019

+1 on renaming the file

Collaborator Author

@carlaKC carlaKC Aug 27, 2019

Should I rename the package? I think chanfitness (or channelscores) still works because the package will eventually be revealing scores, but at the moment there won't be a file with the package name, which is weird.

Collaborator

@halseth halseth Aug 28, 2019

I would keep the package name, only change the file name :)

@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from 9238cf0 to 8309a24 Aug 28, 2019
@joostjager joostjager added this to the 0.9 milestone Oct 7, 2019
@carlaKC carlaKC requested review from halseth and joostjager Oct 7, 2019
func (e *chanEventLog) uptime(startTime, endTime time.Time) time.Duration {
// Sanity check the period provided.
if endTime.Before(startTime) {
return 0
Collaborator

@joostjager joostjager Oct 8, 2019

Return an error here, to help identify bugs?

// If event is before the period we are calculating uptime for, or we
// are currently in an online state, we do not need to increment uptime.
if event.timestamp.Before(startTime) {
continue
Collaborator

@joostjager joostjager Oct 8, 2019

Does this continue do anything?

close(events)

store.wg.Add(1)
store.monitorChannelEvents(events)
Collaborator

@joostjager joostjager Oct 8, 2019

This looks like it is using a lot of implementation knowledge of the store, making this test expensive to maintain. It needs to change if the internals of the store change. It is preferable to test it more as a black box.

Collaborator

@halseth halseth Oct 14, 2019

Agree that it's not ideal, but in this case it will be a bit hard to test without accessing this internal method since calling Start will start the subscriptions (which aren't that easy to mock). Also the test is called TestMonitorChannelEvents so maybe okay that it accesses this method directly?

I think this is good enough for now 😄

Collaborator

@joostjager joostjager Oct 15, 2019

I think the solution to prevent peeking inside the event lists (implementation specific), would be to create a mock event log. But I am ok leaving it as is.

@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch 3 times, most recently from ea5d140 to 8b50c8c Oct 14, 2019
@carlaKC carlaKC requested a review from joostjager Oct 14, 2019

response := make(chan uptimeResponse)

c.uptimeRequests <- uptimeRequest{
Collaborator

@halseth halseth Oct 14, 2019

must select on quit

@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from 8b50c8c to f74db87 Oct 14, 2019
// id is the uint64 of the short channel ID.
id uint64

peer [33]byte
Collaborator

@joostjager joostjager Oct 15, 2019

No route.Vertex?

rpcserver.go Outdated
// If the channel has not been closed yet, and it was found in the channel
// store set its endTime to now to calculate lifetime and uptime until the
// present.
if endTime.IsZero() && err != nil {
Collaborator

@joostjager joostjager Oct 15, 2019

Match on a specific error here?

rpcserver.go Outdated
chanID := dbChannel.ShortChannelID.ToUint64()

// Get the lifespan observed by the channel event store. If it is unknown,
// zero values will be returned (which will yield a zero lifetime).
Collaborator

@joostjager joostjager Oct 15, 2019

If the channel is unknown, shouldn't we then stop making additional calls?

Collaborator

@halseth halseth Oct 23, 2019

I think it will be handled correctly as it currently is?

@joostjager joostjager added this to WIP in v0.9.0-beta via automation Oct 23, 2019
@joostjager joostjager moved this from WIP to Needs Review in v0.9.0-beta Oct 23, 2019
Collaborator

@halseth halseth left a comment

Also needs a rebase!

GetChannels: func() (channels []*channeldb.OpenChannel, e error) {
return nil, errors.New("intentional test err")
},
expectStartErr: true,
Collaborator

@halseth halseth Oct 23, 2019

add a test case for no error?

Collaborator Author

@carlaKC carlaKC Oct 24, 2019

If we don't error, the start func will try to touch the subscribe client's updates.ChanOut; since it's unexported, it can't be mocked. Will remove that bool and update the comment to note that.

}

// Stop the store's go routine.
store.Stop()
Collaborator

@halseth halseth Oct 23, 2019

can defer stop this?

rpcserver.go Outdated
@@ -2685,6 +2710,8 @@ func createRPCOpenChannel(r *rpcServer, graph *channeldb.ChannelGraph,
LocalChanReserveSat: int64(dbChannel.LocalChanCfg.ChanReserve),
RemoteChanReserveSat: int64(dbChannel.RemoteChanCfg.ChanReserve),
StaticRemoteKey: dbChannel.ChanType.IsTweakless(),
Lifetime: int64(endTime.Sub(startTime)),
Collaborator

@halseth halseth Oct 23, 2019

Won't this result in Nanoseconds?

@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch 2 times, most recently from 16725a3 to ad5b65b Oct 24, 2019
carlaKC added 2 commits Oct 25, 2019
This commit adds a chanfitness package which will be used to track
channel health and performance metrics. It adds a channel event
structure which will be used to track channel opens/closes and peer
uptime.

The eventLog implements an uptime function which calculates uptime
over a given period and a lifespan function which returns the time
when the log began monitoring the channel and, if the channel is
closed, the time when it stopped monitoring it.

This commit adds a channel event store to the channel fitness
package which is used to manage tracking of a node's channels.
It adds tracking for channel open/closed and peer online/offline
events for all channels that a node has open.

Events are consumed from channelNotifier and peerNotifier event
subscriptions. If either of these subscriptions is cancelled,
channel scoring stops, because both subscriptions are expected
to run until node shutdown.

Two functions are exposed to allow external callers to get uptime
information about a channel. GetLifespan returns the period over
which the channel has been monitored. GetUptime returns the channel's
uptime over a specified period. Callers can use these functions to
get the channel's remote peer uptime over its entire lifetime, or
a subset of that period.
@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from ad5b65b to 7b12efa Oct 25, 2019
Collaborator

@halseth halseth left a comment

Great work, this will be useful for finding channels to close going forward! LGTM 💯

rpcserver.go Outdated
if endTime.IsZero() {
endTime = time.Now()
}
lifespan = startTime.Sub(endTime)
Collaborator

@halseth halseth Oct 25, 2019

reminder to reverse this

This commit adds the total observed lifetime of a channel and the
total uptime of its remote peer to the lnrpc channel struct. These
fields are marked as experimental because they are subject to
change.
@carlaKC carlaKC force-pushed the chanfitness-trackchanneluptime branch from 7b12efa to 31bf542 Oct 25, 2019
@carlaKC carlaKC requested a review from joostjager Oct 25, 2019
v0.9.0-beta automation moved this from Needs Review to Approved Oct 25, 2019
@halseth halseth merged commit 1a0ab53 into lightningnetwork:master Oct 25, 2019
2 checks passed
v0.9.0-beta automation moved this from Approved to Done Oct 25, 2019
@carlaKC carlaKC deleted the chanfitness-trackchanneluptime branch Feb 16, 2020