Chanfitness: Track peer uptime #3332
Conversation
Force-pushed from e04f971 to 1683f2a
Very impressed by this first iteration of the package! The design can be streamlined a little bit, but all in all I think this has taken the right direction.
First of all, the size of this PR is quite large, so I would suggest breaking it into smaller parts. It seems natural to let the pure event tracking be its own PR, then we can follow up with PRs to get scores, exposing them on the RPC, etc.
Secondly, with the above comment in mind, I think we should attempt to consolidate monitoring of channel events and peer activity. The first step here would be to add a SubscribePeerEvents API similar to what already exists for channels.
chanfitness/chanfitness.go
Outdated
"github.com/lightningnetwork/lnd/subscribe" | ||
) | ||
|
||
// ScoreStore maintains a set of scores for the node's channels. It is intended |
Instead of the notion of this storing a "score", should we change it to store "events"? The score can later be calculated on the basis of the events. We can name this ChannelEventStore?
chanfitness/chanfitness.go
Outdated
score = &ChannelScore{ | ||
id: channelID, | ||
peer: peer, | ||
quit: make(chan struct{}, 1), |
take a look at how quit channels are defined and used in other places in the codebase :)
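For reference, a minimal sketch of that conventional pattern: `quit` is an unbuffered `chan struct{}` that is only ever closed (never sent on), so every goroutine selecting on it unblocks at shutdown. The `Stop` method, `sync.Once`, and WaitGroup wiring below are illustrative additions, not code from this PR:

```go
package main

import (
	"sync"
	"time"
)

// ChannelScore mirrors the struct from the diff above; Stop, once and wg
// are hypothetical additions used only to show the pattern.
type ChannelScore struct {
	id   uint64
	peer [33]byte

	// quit is unbuffered and is closed exactly once on shutdown.
	quit chan struct{}
	once sync.Once
	wg   sync.WaitGroup
}

func newChannelScore(id uint64, peer [33]byte) *ChannelScore {
	return &ChannelScore{
		id:   id,
		peer: peer,
		quit: make(chan struct{}),
	}
}

func (s *ChannelScore) start() {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for {
			select {
			case <-time.After(time.Second):
				// Placeholder for real event handling.
			case <-s.quit:
				return
			}
		}
	}()
}

// Stop closes quit exactly once and waits for the goroutine to exit.
func (s *ChannelScore) Stop() {
	s.once.Do(func() { close(s.quit) })
	s.wg.Wait()
}

func main() {
	score := newChannelScore(123, [33]byte{})
	score.start()
	score.Stop()
}
```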
Thanks for the review @halseth 🙏 What do you think about adding a With that approach, I'd break this (entirely too big - sorry 😅) PR down into:
Force-pushed from 1683f2a to 7551588
👍
I would start by doing 1 and 2, then we can take a look at how big 2 becomes and what we could reasonably include there 😄
Needs rebase!
This is coming along nicely!
I think we could also incorporate peer event tracking in this PR, to see how it will interact with the channel event store. Alternatively a new PR that builds on this one that adds peer tracking can be created, to see how the final result will look.
chanfitness/chanevent.go
Outdated
// id is the uInt64 of the short channel ID.
id uint64

peer [33]byte
missing godoc
for pubkeys, there is also the route.Vertex type
No route.Vertex?
I don't think we need to add a dependency on routing, so would prefer keeping as is. I'm fine with either approach.
Can we leave it as is for now and then I'll update PeerNotifier and this code in a separate PR?
Discussed, will change in this PR then update PeerNotifier separately.
I only did a high level pass.
With regard to scope: I prefer a pr to have some meaningful impact (personal opinion). Add functionality, fix a bug, speed up something, etc. In general I think it is best if it also makes the changes end to end, so that one can verify (run the code) that the goal of the pr is met.
This pr is mostly a horizontal slice with code that isn't being called into except for starting the main event loop. A technical layer. Imo that bears the risk that code gets merged that needs to be revised later when doing the work in the higher level layers. It may take some creativity to define an end to end pr that is not too big, but it seems that in this new domain there must be some good opportunities.
Also, if the goal is to track channel open/closes on chain: it may be that that info is already available and doesn't need to be tracked. The channel point and the closing tx id are persisted in the database. Maybe the peer (tcp) uptime metric would have been more suitable to kick off the fitness system.
log.go
Outdated
@@ -118,6 +120,7 @@ func init() {
	chanbackup.UseLogger(chbuLog)
	monitoring.UseLogger(promLog)
	wtclient.UseLogger(wtclLog)
	chanfitness.UseLogger(chftLog)
You could choose to use the addSubLogger construction here and define the log prefix in the package itself.
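A rough sketch of what that per-package setup could look like, following the convention other lnd sub-packages use. The "CHFT" tag and the idea that the root log.go would register it via something like addSubLogger(Subsystem, UseLogger) are assumptions for illustration, not code from this PR:

```go
// chanfitness/log.go (sketch)
package chanfitness

import "github.com/btcsuite/btclog"

// Subsystem is the logging tag the root log.go could register for this
// package instead of hard-coding the prefix there.
const Subsystem = "CHFT"

// log is the package-level logger used throughout chanfitness.
var log btclog.Logger

// UseLogger injects the logger for this subsystem; the sub-logger
// registration in log.go would call this once at startup.
func UseLogger(logger btclog.Logger) {
	log = logger
}
```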
Thanks for taking a look @joostjager
I definitely agree on having a more horizontal PR in terms of being able to test the changes and figuring out how it will be used by higher levels. My original pass at this PR had an rpc subserver for getting scores included, but that ended up being a pretty huge PR so I removed it. My thoughts on expanding this PR to be more than just kicking off a logical loop:
This PR has been waiting on #3354 (which was originally included in my way too big first attempt) so that uptime can be added to the events store - that is the main goal for this PR. When peer events were split out, I updated this to just have open/close events to get an idea of what the structure would look like. I'm going to add online/offline events now that #3354 is merged - I should have made that clearer, sorry! In terms of having open/close events at all, I think it still makes sense to have them while events are being kept in memory? That way the package doesn't have to touch the DB at all for now (and does not need to query it every time a score is requested). If/when we start looking at persisting score data, we can probably just get channel open/close events from chanDB.
I actually meant creating a vertical PR, cutting through several layers. Ok, horizontal, vertical, it isn't very self-describing, but I think we mean the same thing. I would at least choose either channel open/close or peer online/offline, not both, for this pr. If you choose peer online/offline, maybe you can just expose a single value "average peer uptime %" over the rpc. Or even more minimalistic, just log the average peer uptime value periodically. Then at least, some business value is created and users can see that it works. For channel open/close: I don't know if it is a requirement to not touch the database for now. If you think you'll query chandb later, it may not be worth writing the code to send channel events now and remove that again later. In general, with storing redundant information things may get out of sync.
I think only doing peer uptime for now sounds like a good middle ground. That was what this PR originally set out to do, and would provide information we don't currently have anywhere. The question is whether we can get away with channel events completely. Since a peer's uptime might not be very interesting if we don't have an open channel with it, it makes sense to somehow relate it to channel events. This information can possibly be fetched from the DB, but I'm not sure if we have it all (we only have block heights, not timestamps, even though we can fetch these from block headers), and I can imagine this could get more complicated than the current approach, as it is quite self-contained. Storing open/close events is also not that big of a deal IMO, as it is not in any way critical if it gets out of sync. For using this information, we could start with something as simple as returning peer uptime % during a
One other question: how does the channel fitness subsystem relate to lndmon? This seems to overlap with what is started in this pr, for example iterating over time series to extract useful values.
I haven't looked at lndmon yet.
In terms of metrics that may be duplicated on both sides, I don't think we need to specifically track an event log for the payment-level data (fees, failed/succeeded amounts), so that would minimise the amount of time series processing we do. Uptime is the only "stateful" series that made more sense to track as a time series (I did implement a running counter for uptime but it was very messy). If it comes to a point where there's straight up duplication between the two, then we can look at chanfitness exporting metrics for lndmon.
We had a bit more discussion on the topic. One thing that was brought up is the case where tracking of data can trigger certain actions, like auto-closing of channels.
I think the main thing is for us to be aware of how the fitness subsystem and lndmon relate.
I am a bit worried though to which extent we can actually foresee what we need. But yes, you are right, the fitness subsystem could in that case export to lndmon.
With regard to useful things to track as time series in the fitness subsystem (in the future), the metric 'channel profitability' always comes to my mind. To be able to auto-close non-profitable channels. For channel profitability, an analysis over time of local balance and forwarding fees is required.
Tagging @valentinewallace as she's been developing lndmon.
Force-pushed from 7551588 to 9238cf0
PR updated to add a few changes:
There is an edge case for existing channels where peers are recorded as offline because we have not connected to them at restart time, and then online when we establish a connection, which means existing channels will have 99% uptime rather than 100%. This can be addressed by adding a wait before recording initial peer state for channels (e.g. sleep one minute then check status) if necessary.
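A rough sketch of that "wait, then record" workaround; the peerOnline and addEvent helpers are hypothetical stand-ins for the real wiring, not code from this PR:

```go
package chanfitness

import "time"

// recordInitialState delays the first online/offline sample for an existing
// channel so a peer that is still reconnecting after a restart is not
// immediately recorded as offline. peerOnline and addEvent are hypothetical
// stand-ins for the store's real dependencies.
func recordInitialState(peer [33]byte, gracePeriod time.Duration,
	peerOnline func([33]byte) bool, addEvent func(peer [33]byte, online bool)) {

	go func() {
		time.Sleep(gracePeriod)
		addEvent(peer, peerOnline(peer))
	}()
}
```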
I agree that sounds like an architecture to avoid 😄
That sounds useful. Prometheus may be able to track peer uptime already since it does know when the set of peers changes 🤔 Let me know if I can contribute more lndmon insight.
Definitely much nicer now that something is visible on the rpc interface 👍
I left some comments. My main concern is about the flow of events from subsystems to the fitness tracker.
chanfitness/chanevent.go
Outdated
// id is the uInt64 of the short channel ID.
id uint64

peer [33]byte
for pubkeys, there is also the route.Vertex type
chanfitness/chanfitness.go
Outdated
// PeerEvents provides a subscription client which provides a stream of
// peer online/offline events.
PeerEvents func() (*subscribe.Client, error)
If these are untyped anyway, and especially if this is extended to a large list in the future, can't the store just have a single channel over which it receives all events? And then have all event sources drop their typed events into that channel.
So we'd construct a function which pipes all of the subscriptions into a channel which we create in server.go and then pass that to Config? I think that constructing that kind of aggregation from the caller bubbles complexity up rather than down sometimes, because if we have a lot of event sources it's simpler to just provide a list of functions than have some long aggregation of them all in the middle of the newServer call?
Having had my philosophical ramble, I think this may address some of the concerns you have about the unit tests so will give this a try :)
I was actually thinking something simpler. Just make the fitness system a dependency of the other subsystems. It exposes a single function SendEvent to those subsystems. In the implementation of SendEvent, the only thing that happens is that the (untyped) event is dropped into a single buffered channel or queue. The main loop of the fitness system is receiving on that channel and processing the events via a type switch.
But I could be missing something here of course.
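For concreteness, a minimal sketch of the design being described here. The event type names are hypothetical; only the single-SendEvent-channel and type-switch shape is the point:

```go
package chanfitness

// peerOnlineEvent and channelOpenedEvent are hypothetical event types used
// only to illustrate the shape of the design.
type peerOnlineEvent struct{ peer [33]byte }
type channelOpenedEvent struct{ channelID uint64 }

// ChannelEventStore (simplified) owns a single buffered event channel.
type ChannelEventStore struct {
	events chan interface{}
	quit   chan struct{}
}

// SendEvent is the only entry point other subsystems would call; it drops
// the untyped event onto the shared channel without blocking past the
// buffer, and gives up cleanly at shutdown.
func (c *ChannelEventStore) SendEvent(event interface{}) {
	select {
	case c.events <- event:
	case <-c.quit:
	}
}

// mainLoop drains the shared channel and dispatches on the concrete type.
func (c *ChannelEventStore) mainLoop() {
	for {
		select {
		case event := <-c.events:
			switch e := event.(type) {
			case peerOnlineEvent:
				_ = e // record the peer coming online
			case channelOpenedEvent:
				_ = e // start tracking this channel
			}
		case <-c.quit:
			return
		}
	}
}
```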
My main concern with a SendEvent call is that it feels a bit counterintuitive to be sending events when we have two subscriptions specifically intended for receiving events.
In the case where we don't have subscriptions, a SendEvent makes sense. And I suspect this will be the case for the remainder of the events we add to channel scores (ie htlc level events).
Since it probably (?) doesn't make sense to make subscriptions for htlc level events, the options would be:
- Use SendEvent everywhere, accepting that it's a bit of an odd construction to send events in the cases where you have subscriptions available
- Use SendEvent for non-subscription events and still pass the Peer/Chan Subscriptions in, but pipe those into the SendEvent channel so that monitoring happens in one central place.
Isn't it that these subscriptions only exist to serve the event store? In server.go, the call s.peerNotifier.NotifyPeerOnline(pubKey) is made, which could be replaced by a call directly into the event store's SendEvent?
I am looking for the simplest thing that works, but may miss something because I didn't follow the design decision around peer notifier.
Although an events chan is looking a lot cleaner for tests, the only ugliness is that it has to be called from the channel open/close notification functions, which feels like doing the same thing twice. So it really is dependent on whether subscriptions actually have other use cases, which I'm unsure of.
chanfitness internally manages a queue to not block reporting sub-systems. There only needs to be a single queue for all events.
It would more or less need to re-implement the current queue handling in the subscribe package to do this. Therefore we should re-use what we have rather than attempt to over-optimize for this use case, which isn't performance critical and still nascent.
I don't think we need to get caught up in this relatively minor difference. The original primary reviewer (johan) was OK with the current approach. We shouldn't undo that prior lineage due to a new reviewer stepping in. For the sake of making progress, I don't think there's a fundamental principle that tends us in one direction or the other. Therefore we can continue down the current path which had already received review by the former primary reviewer.
If this was more critical code such as the channel state machine or revocation handling, then I would be willing to spend more time here, but it's a relatively independent sub-system that is non-critical.
Even though Johan was ok with the current approach, it could still be that he likes the alternative more. I don't think we've discussed that. But yes, both approaches lead to functioning, correct code. This is only about preventing boiler plate. I can put aside my concerns about that and see how it works out.
I think the current approach is more general and would scale better with more subscription types, and more subscribers to the different types.
We had to take a similar decision when we initially added the NotifyPeerOnline API to the fundingmanager. We could have made the server call into the funding manager directly, but rather we made it a general subscription API that any subsystem could call. This has turned out useful to have access to also in other packages.
I also don't see the boiler plate we would avoid, since it is mostly contained in the subscribe package?
lnrpc/rpc.proto
Outdated
The percentage of the channel's lifetime that the remote peer has been
online.
*/
int64 uptime_percentage = 22 [json_name = "uptime_percentage"];
Maybe we can also add the channel lifetime here, as we have it available anyway. And I think we should also mark the fields as 'experimental' in the comment, so that we can make breaking changes later if we want to.
Sounds good, it also makes it less likely that the uptime is misinterpreted by users as applying to the whole lifetime of the channel.
How about:
- observed_lifetime: total seconds we have been monitoring the channel
- observed_uptime: total seconds we have observed the peer online during monitoring

Then users can do their own calculations, and there's no rounding to be dealt with. There's probably a better var name than observed, because that makes it sound like we didn't know about the channel before that, but I'll scour synonyms.com later :)
Sounds good. Sounds better. I don't think the observed prefix is needed, is it?
How should I mark fields as experimental? I can't see any examples in the code.
Thinking of something along these lines in the field comments?
[experimental]: this field is experimental and may be subject to change.
We indeed don't have that yet. All our experimental fields are in sub-servers atm. Yes, something along those lines.
IMO we don't have to mark them experimental. We can rather deprecate and add new fields later if we need to.
chanfitness/chanfitness.go
Outdated
// ChannelEventStore maintains a set of event logs for the node's channels.
// It is intended to provide insight into the performance of channels.
type ChannelEventStore struct {
Main type in this file doesn't match file name
+1 on renaming the file
Should I rename the package? I think chanfitness (or channelscores) still works because the package will eventually be revealing scores, but at the moment there won't be a file with the package name, which is weird.
I would keep the package name, only change the file name :)
Force-pushed from 9238cf0 to 8309a24
chanfitness/chanevent.go
Outdated
func (e *chanEventLog) uptime(startTime, endTime time.Time) time.Duration {
	// Sanity check the period provided.
	if endTime.Before(startTime) {
		return 0
Return an error here, to help identify bugs?
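A sketch of what that suggestion could look like: the reversed range is surfaced as an error instead of silently reporting zero uptime. The sentinel name, the two-value return, and the empty chanEventLog stand-in are assumptions for illustration, not the PR's final code:

```go
package chanfitness

import (
	"errors"
	"time"
)

// errInvalidRange flags a caller bug: asking for uptime over a reversed period.
var errInvalidRange = errors.New("end time before start time")

// chanEventLog is a simplified stand-in for the type in this PR.
type chanEventLog struct{}

// uptime returns an error for a reversed range rather than returning 0,
// so the bug shows up at the call site instead of as a silent zero.
func (e *chanEventLog) uptime(startTime, endTime time.Time) (time.Duration, error) {
	if endTime.Before(startTime) {
		return 0, errInvalidRange
	}

	// ... real calculation over the event log would go here ...
	return 0, nil
}
```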
chanfitness/chanevent.go
Outdated
// If event is before the period we are calculating uptime for, or we
// are currently in an online state, we do not need to increment uptime.
if event.timestamp.Before(startTime) {
	continue
Does this continue do anything?
chanfitness/chaneventstore_test.go
Outdated
close(events)

store.wg.Add(1)
store.monitorChannelEvents(events)
This looks like it is using a lot of implementation knowledge of the store, making this test expensive to maintain. It needs to change if the internals of the store change. It is preferable to test it more as a black box.
Agree that it's not ideal, but in this case it will be a bit hard to test without accessing this internal method since calling Start will start the subscriptions (which aren't that easy to mock). Also the test is called TestMonitorChannelEvents so maybe okay that it accesses this method directly?
I think this is good enough for now 😄
I think the solution to prevent peeking inside the event lists (implementation specific), would be to create a mock event log. But I am ok leaving it as is.
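A rough sketch of the mock-event-log idea, assuming the store accessed its logs through a small interface. The interface and method names here are hypothetical, not this PR's types:

```go
package chanfitness

import "time"

// eventLog is a hypothetical interface the store could depend on instead of
// the concrete chanEventLog, so tests can swap in a mock.
type eventLog interface {
	add(online bool, ts time.Time)
	uptime(start, end time.Time) (time.Duration, error)
}

// mockEventLog records calls and returns canned values, letting a test
// assert on store behaviour without peeking at internal event slices.
type mockEventLog struct {
	added        []bool
	cannedUptime time.Duration
}

func (m *mockEventLog) add(online bool, ts time.Time) {
	m.added = append(m.added, online)
}

func (m *mockEventLog) uptime(start, end time.Time) (time.Duration, error) {
	return m.cannedUptime, nil
}
```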
Force-pushed from ea5d140 to 8b50c8c
chanfitness/chaneventstore.go
Outdated
response := make(chan uptimeResponse)

c.uptimeRequests <- uptimeRequest{
must select on quit
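A sketch of what selecting on quit looks like for this request/response round trip. The request fields, the errShuttingDown sentinel, and the GetUptime name are simplified assumptions based on the diff above, not the PR's exact code:

```go
package chanfitness

import (
	"errors"
	"time"
)

// Simplified stand-ins for the types used in the diff above.
type uptimeResponse struct {
	uptime time.Duration
	err    error
}

type uptimeRequest struct {
	channelID uint64
	response  chan uptimeResponse
}

type ChannelEventStore struct {
	uptimeRequests chan uptimeRequest
	quit           chan struct{}
}

var errShuttingDown = errors.New("event store shutting down")

// GetUptime selects on quit for both the request send and the response read,
// so a caller never blocks forever if the event loop has already exited.
// The response channel is buffered so the event loop's reply never blocks.
func (c *ChannelEventStore) GetUptime(channelID uint64) (time.Duration, error) {
	response := make(chan uptimeResponse, 1)

	select {
	case c.uptimeRequests <- uptimeRequest{channelID: channelID, response: response}:
	case <-c.quit:
		return 0, errShuttingDown
	}

	select {
	case resp := <-response:
		return resp.uptime, resp.err
	case <-c.quit:
		return 0, errShuttingDown
	}
}
```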
Force-pushed from 8b50c8c to f74db87
chanfitness/chanevent.go
Outdated
// id is the uInt64 of the short channel ID.
id uint64

peer [33]byte
No route.Vertex?
rpcserver.go
Outdated
// If the channel has not been closed yet, and it was found in the channel
// store set its endTime to now to calculate lifetime and uptime until the
// present.
if endTime.IsZero() && err != nil {
Match on a specific error here?
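One way this could look, using a hypothetical exported sentinel. ErrChannelNotFound, getLifespan, and the surrounding wiring are illustrative stand-ins, not the PR's actual identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrChannelNotFound is a hypothetical sentinel the event store could export,
// letting callers distinguish "unknown channel" from real failures.
var ErrChannelNotFound = errors.New("channel not found in event store")

// getLifespan is a stand-in for the store call made in rpcserver.go.
func getLifespan(chanID uint64) (start, end time.Time, err error) {
	if chanID == 0 {
		return time.Time{}, time.Time{}, ErrChannelNotFound
	}
	// A still-open channel: start an hour ago, zero end time.
	return time.Now().Add(-time.Hour), time.Time{}, nil
}

func main() {
	start, end, err := getLifespan(123)
	switch {
	case errors.Is(err, ErrChannelNotFound):
		// Unknown to the event store: report zero lifetime and uptime.
		fmt.Println("channel not tracked")
		return
	case err != nil:
		// Any other error is unexpected and should be surfaced.
		fmt.Println("lifespan lookup failed:", err)
		return
	}

	// If the channel has not been closed yet, measure up to the present.
	if end.IsZero() {
		end = time.Now()
	}
	fmt.Println("lifetime:", end.Sub(start))
}
```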
rpcserver.go
Outdated
chanID := dbChannel.ShortChannelID.ToUint64()

// Get the lifespan observed by the channel event store. It it is unknown,
// zero values will be returned (which wil yield a zero lifetime).
If the channel is unknown, shouldn't we then stop making additional calls?
I think it will be handled correctly as it currently is?
Also needs a rebase!
chanfitness/chaneventstore_test.go
Outdated
GetChannels: func() (channels []*channeldb.OpenChannel, e error) {
	return nil, errors.New("intentional test err")
},
expectStartErr: true,
add a test case for no error?
If we don't error, the start func will try to touch the subscribe client's updates.ChanOut; since it's unexported it can't be mocked. Will remove that bool and update the comment to note that.
chanfitness/chaneventstore_test.go
Outdated
}

// Stop the store's go routine.
store.Stop()
can defer stop this?
rpcserver.go
Outdated
chanID := dbChannel.ShortChannelID.ToUint64()

// Get the lifespan observed by the channel event store. It it is unknown,
// zero values will be returned (which wil yield a zero lifetime).
This PR introduces a new package, chanfitness, which is used to track scores for a node's channels. Scores are currently kept in memory, with the intention of persisting them once the set of metrics needed to score nodes is more clearly defined. It is related to #1253, although the issue is quite old.
This change implements tracking for peer uptime, by maintaining a log of channel events for each channel. Peer online/offline events are monitored by a goroutine on a per-peer basis, and channel creation/close events are monitored by a single goroutine.
This change also includes adding a channel fitness rpc subserver, but that can be split out into another PR if the change is too big.
PR Checklist