chanfitness: Rate limit in memory events based on peer flap rate #4440
Conversation
Nice code and commit structure, made it pleasant to review. I left some non-blocking comments and questions. The main things for me are:
- recordFlapCount should probably aggregate the transactions, as long as it doesn't drastically interfere with the WriteFlapCount function.
- Right now the flap count is aggregated per peer. Since I assume this data is going to be used elsewhere (ban manager?), it might be better to also store it under a per-channel sub-bucket of the peer bucket for some more precision. Non-blocking.
Quite a large PR, initial pass seems to be good, have a few comments though.
Force-pushed 80ebf77 to e36ace6
Force-pushed 19344b5 to dd18082
Looking good! My only remaining question is (the offline discussed) flap count vs subsequent online events "issue".
offline = false
// If our previous event is nil, we just set it and
// break out of the switch.
if lastEvent == nil {
We discussed this offline, just wanted to leave my question here too:
What if a peer reportedly comes online but is never reported to go offline in between? Will that hurt the flap rate?
Or can this even happen?
I’ve had another look at this, and I think that we may be ok in terms of ordering because we notify on quite a high level. We notify peer online events in addPeer and offline events in removePeer.
addPeer is only called when we get an inbound connection that we do not have an existing record of, or when we replace a stale connection with a fresh one (InboundPeerConnected / OutboundPeerConnected). When we are replacing stale connections, we remove all previous state, which will trigger an offline event before our new online event.
This happens at a much higher level than our connection retries, and has logic in place that’s aware of existing connections. This does make our online events less granular, but it happens at the same level for every peer (and is preexisting for this uptime measure).
However, it’s still difficult to say for sure whether we can rely on in-order events in reality. Perhaps @Crypt-iQ can weigh in on this? I still like the suggestion of counting both online and offline events, because it makes sure that our measurement isn’t biased (perhaps certain peers have odd connection habits, others don’t). Especially in this case, where the value itself matters less than the comparison of different peers’ values.
// Uptime is the total amount of time that the channel peer has been
// observed as online during the monitored lifespan.
Uptime time.Duration
Same question in a different form: if a peer reportedly comes online such that it’s always within the idle timeout, then what is the uptime?
So this would be the case where a peer connects to us, does not respond to any messages for the idle timeout, we disconnect and then they immediately reconnect but don't respond to any messages again?
With the current notifications, this peer would be up for that period of time, since we’ve defined online as having a connection open with that peer (so far). I think this case would likely be due to a mis-implemented peer. While somebody could set up their node to “game” our uptime metric, the flap count would reflect that they go down every 5 minutes, and I don’t think there’s much to gain, because they would have to respond to our messages to participate in forwards (the only incentive I can see for wanting high uptime: we choose to route through you).
We could look at moving notifications for uptime tracking to a lower level, but I think we will then start running into ordering issues.
chanfitness/chanevent.go (outdated)
@@ -95,14 +121,34 @@ func (p *peerLog) addOnlineEvent(online bool, time time.Time) {
eventType := peerOnlineEvent
if !online {
	eventType = peerOfflineEvent
	p.flapCount++
Adding to the previous comments: we could increase the flap count for subsequent online events too.
One thing we may need to account for if we increment for online events is that we will then do so on restart for every peer (not just those which are offline when we restart). Perhaps this is more fair, because the online peers may have flapped while our node was down, and the offline ones just have bad luck.
Force-pushed 7427bcf to 00a3a2c
Force-pushed 00a3a2c to 20c4c3f
I think the cooldown needs fixing, other than that, the PR looks good! 👍
// Likewise, if we have an online event, nothing beyond the online state
// of our peer log should change, but our flap count should change.
👍 I think it's a good idea to increase flaps regardless of order.
chanfitness/chaneventstore.go (outdated)
for peer, monitor := range c.peers {
	flapCount, lastFlap := monitor.getFlapCount()
	if lastFlap == nil {
		continue
	}

	flapCounts[peer] = &channeldb.FlapCount{
		Count: uint32(
			cooldownFlapCount(now, flapCount, *lastFlap),
IIUC this won't really cool down the flap count: we always get it from the peer monitor, where the flap count only ever increases, so the lowest value we could get is 0.9 * flapCount.
Moved this up and tested with some regtest nodes. Since we only need the cooldown for rate limiting, I decided to apply it where we need it, as you suggest.
It is the case that we may cooldown then increment, but that makes sense to me in the following case:
- peer with 100 flap count connects, time since flap > cooldown period
- cooldown to 95
- update to 96, since they have just flapped, also update their last flapped
Force-pushed ca19aba to d765b3e
Changes since last review here, other pushes are rebases.
Almost ready, just one question left on my end.
The original PR was written with 4 spaces instead of 8, so do a once-off fix here rather than fixing it bit-by-bit in the subsequent commits and cluttering them for review.
The current implementation of subscribe is difficult to mock because the queue that we send updates on is unexported, so you cannot create a subscribe.Client object and then add your own updates. While it is possible to run a subscribe server in tests, subscribe servers will shut down before dispatching their updates to all clients, which can be flaky (and is difficult to work around). In this commit, we add a subscription interface so that these testing struggles can be addressed with a mock.
As we add more elements to the chanfitness subsystem, we will require more complex testing. The current tests are built around the inability to mock subscriptions, which is remedied by the addition of our own mock. This change allows us to run the full store in a test, rather than having to manually spin up the main goroutine. Mocking our subscriptions is required so that we can block our subscribe updates on consumption; using the real package provides us with no guarantee that the client receives the update before shutdown, which produces test flakes. This change also makes a move towards separating the testing of our event store from testing the underlying event logs, to prepare for further refactoring.
To get our uptime, we first filter our event log to get online periods. This change updates this code to be tolerant of consecutive online or offline events in the log. This will be required for rate limiting, because we will not record every event for anti-DoS reasons: we could record an online event, ignore the subsequent offline event, and then record the next online event, leaving consecutive online events in the log. We could just ignore such duplicate events, but we will also need this tolerance once we persist uptime, because a peer's last event before restart can be an online event, and we record another online event when we come back up.
We currently query the store for uptime and lifespan individually. As we add more fields, we would need to add more queries with this design. This change combines requests into a single channel info request so that we do not need to add unnecessary boilerplate going forward.
When dealing with online events, we actually need to track our events by peer, not by channel. To track channels, all we need is a set of online events for a peer that at least contains those channels' events. This change refactors chanfitness to track by peer.
In preparation for storing our flap count on disk, we start tracking flap count per-peer.
To prevent flapping peers from endlessly dos-ing us with online and offline events, we rate limit the number of events we will store per period, using their flap rate to determine how often we will add their events to our in-memory list of online events. Since we are tracking online events, we need to track the aggregate change over the rate limited period, otherwise we will lose track of a peer's current state. For example, if we store an online event, then do not store the subsequent offline event, we will believe that the peer is online when they actually aren't. To address this, we "stage" a single event which keeps track of all the events that occurred while we were rate limiting the peer. At the end of the rate limiting period, we will store the last state for that peer, thereby ensuring that we maintain our record of their most recent state.
Since we will use peer flap rate to determine how we rate limit, we store this value on disk per peer, per channel. This allows us to restart with memory of our peers' past behaviour, so we don't give badly behaving peers a fresh start on restart. The last flap timestamp is stored with our flap count so that we can degrade this all-time flap count over time for peers that have not recently flapped.
Force-pushed d765b3e to 896804f
Since we store all-time flap count for a peer, we add a cooldown factor which will discount poor flap counts in the past. This is only applied to peers that have not flapped for at least a cooldown period, so that we do not downgrade our rate limiting for badly behaved peers.
Force-pushed 896804f to e2c0604
With the last changes, the PR LGTM, great work @carlaKC 💯
Awesome work! Really like the usage of mock clock + deterministic unit tests. Tested this out locally with a flapping peer and could see the listpeers output increment, and noticed no pprof usage for chanfitness. One thing is that both sides will have this flap count even though only one of them disconnected first; that could be addressed in a later PR, but I honestly don't know that it's necessary.
This PR adds rate limiting to peers that we are currently tracking online/offline events for. This change prevents constantly flapping peers from filling up our memory with an endless set of online/offline events. It is preparatory work for #4165, addressing our in-memory resource consumption before we proceed to write these events to disk. The rate limiting logic used is very simple: we just place peers in buckets based on the number of times they have flapped. This can easily be switched out in future, but we start simple. To add a tangible output to this PR, we display flap rate in listchannels.
A new bucket is added to track peer flap count across restarts so that we maintain our rate limiting when lnd is restarted.
This bucket will be extended to include uptime events.
This PR includes some refactoring to clean up spacing/testing in the original implementation, and to update the system to store events per peer. While working on this, it became clear that having all of our peer-level information stored in one place is a better design for our requirements (we do not need to double-store online events when we have two channels with a peer, and flap count is consistently stored). If we need to extend this system to channel-level granularity, we can still do so by having event logs for each channel within our peer log.