chanfitness: Rate limit in memory events based on peer flap rate #4440
Conversation
Nice code and commit structure, made it pleasant to review. I left some non-blocking comments and questions. The main things for me are:
- recordFlapCount should probably aggregate the transactions, as long as it doesn't drastically interfere with the WriteFlapCount function.
- Right now the flap count is aggregated per peer. Since I assume this data is going to be used elsewhere (ban manager?), it might be better to also store it under a per-channel sub-bucket of the peer bucket for some more precision. Non-blocking.
Quite a large PR, initial pass seems to be good, have a few comments though.
Force-pushed 80ebf77 to e36ace6
Force-pushed 19344b5 to dd18082
Looking good! My only remaining question is (the offline discussed) flap count vs subsequent online events "issue".
offline = false
// If our previous event is nil, we just set it and
// break out of the switch.
if lastEvent == nil {
We discussed this offline, just wanted to leave my question here too:
What if a peer reportedly comes online but is never reported to go offline in between? Will that hurt the flap rate?
Or can this even happen?
I’ve had another look at this, and I think that we may be ok in terms of ordering because we notify on quite a high level. We notify peer online events in addPeer and offline events in removePeer.
addPeer is only called when we get an inbound connection that we do not have an existing record of, or when we replace a stale connection with a fresh one (InboundPeerConnected / OutboundPeerConnected). When we are replacing stale connections, we remove all previous state, which will trigger an offline event before our new online event.
This happens at a much higher level than our connection retries, and has logic in place that’s aware of existing connections. This does make our online events less granular, but it happens at the same level for every peer (and is preexisting for this uptime measure).
However, it’s still difficult to say for sure whether we can rely on in-order events in reality. Perhaps @Crypt-iQ can weigh in on this? I still like the suggestion of counting both online and offline events, because it makes sure that our measurement isn’t biased (perhaps certain peers have odd connection habits, others don’t). Especially in this case, where the value itself matters less than the comparison of different peers’ values.
// Uptime is the total amount of time that the channel peer has been
// observed as online during the monitored lifespan.
Uptime time.Duration
Same question in a different form: if a peer reportedly comes online such that it’s always within the idle timeout, then what is the uptime?
So this would be the case where a peer connects to us, does not respond to any messages for the idle timeout, we disconnect and then they immediately reconnect but don't respond to any messages again?
With the current notifications, this peer would be up for that period of time, since we’ve defined online as having a connection open with that peer (so far). I think this case would likely be due to a mis-implemented peer. While somebody could set up their node to “game” our uptime metric, the flap count would reflect that they go down every 5 minutes, and I don’t think there’s much to gain, because they would have to respond to our messages to participate in forwards (the only incentive I can see for wanting high uptime: we choose to route through you).
We could look at moving notifications for uptime tracking to a lower level, but I think we will then start running into ordering issues.
chanfitness/chanevent.go (outdated)
@@ -95,14 +121,34 @@ func (p *peerLog) addOnlineEvent(online bool, time time.Time) {
eventType := peerOnlineEvent
if !online {
	eventType = peerOfflineEvent
	p.flapCount++
Adding to the previous comments: we could increase the flap count for subsequent online events too.
One thing we may need to account for if we increment for online events is that we will then do so on restart for every peer (not just those which are offline when we restart). Perhaps this is more fair, because the online peers may have flapped while our node was down, and the offline ones just have bad luck.
Force-pushed 7427bcf to 00a3a2c
Force-pushed 00a3a2c to 20c4c3f
I think the cooldown needs fixing, other than that, the PR looks good! 👍
// Likewise, if we have an online event, nothing beyond the online state
// of our peer log should change, but our flap count should change.
👍 I think it's a good idea to increase flaps regardless of order.
chanfitness/chaneventstore.go (outdated)
for peer, monitor := range c.peers {
	flapCount, lastFlap := monitor.getFlapCount()
	if lastFlap == nil {
		continue
	}

	flapCounts[peer] = &channeldb.FlapCount{
		Count: uint32(
			cooldownFlapCount(now, flapCount, *lastFlap),
IIUC this won't really cool down the flap count: we always get it from the peer monitor, where the flap count only ever increases, so the lowest value we could get is 0.9 * flapCount.
Moved this up and tested with some regtest nodes. Since we only need the cooldown for rate limiting, I decided to apply it where we need it, as you suggest.
It is the case that we may cooldown then increment, but that makes sense to me in the following case:
- peer with 100 flap count connects, time since flap > cooldown period
- cooldown to 95
- update to 96, since they have just flapped, also update their last flapped
Force-pushed ca19aba to d765b3e
Changes since last review here, other pushes are rebases.
Almost ready, just one question left on my end.
The original PR was written with 4 spaces instead of 8, so do a once-off fix here rather than fixing it bit-by-bit in the subsequent commits and cluttering them for review.
The current implementation of subscribe is difficult to mock because the queue that we send updates on is unexported, so you cannot create a subscribe.Client object and then add your own updates. While it is possible to run a subscribe server in tests, subscribe servers will shut down before dispatching their updates to all clients, which can be flaky (and is difficult to work around). In this commit, we add a subscription interface so that these testing struggles can be addressed with a mock.
As we add more elements to the chanfitness subsystem, we will require more complex testing. The current tests are built around the inability to mock subscriptions, which is remedied by the addition of our own mock. This change allows us to run the full store in a test, rather than having to manually spin up the main goroutine. Mocking our subscriptions is required so that we can block our subscribe updates on consumption; using the real package provides us with no guarantee that the client receives the update before shutdown, which produces test flakes. This change also makes a move towards separating the testing of our event store from testing the underlying event logs, to prepare for further refactoring.
To get our uptime, we first filter our event log to get online periods. This change updates this code to be tolerant of consecutive online or offline events in the log. This will be required for rate limiting, because we will not record every event for anti-DoS reasons: we could record an online event, ignore the subsequent offline event, and then record the next online event, leaving consecutive online events in the log. We could just ignore such duplicate events, but we will also need this tolerance once we persist uptime, because a peer's last event before restart can be an online event, and we record another online event when we come back up.
We currently query the store for uptime and lifespan individually. As we add more fields, we would need to add more queries with this design. This change combines requests into a single channel info request so that we do not need to add unnecessary boilerplate going forward.
When dealing with online events, we actually need to track our events by peer, not by channel. To track channels, all we need is a set of online events for a peer that at least contains those channels' events. This change refactors chanfitness to track by peer.
In preparation for storing our flap count on disk, we start tracking flap count per-peer.
To prevent flapping peers from endlessly dos-ing us with online and offline events, we rate limit the number of events we will store per period, using their flap rate to determine how often we will add their events to our in-memory list of online events. Since we are tracking online events, we need to track the aggregate change over the rate limited period, otherwise we will lose track of a peer's current state. For example, if we store an online event, then do not store the subsequent offline event, we will believe that the peer is online when they actually aren't. To address this, we "stage" a single event which keeps track of all the events that occurred while we were rate limiting the peer. At the end of the rate limiting period, we will store the last state for that peer, thereby ensuring that we maintain our record of their most recent state.
Since we will use peer flap rate to determine how we rate limit, we store this value on disk per peer, per channel. This allows us to restart with memory of our peers' past behaviour, so we don't give badly behaving peers a fresh start on restart. The last flap timestamp is stored with our flap count so that we can degrade this all-time flap count over time for peers that have not recently flapped.
Force-pushed d765b3e to 896804f
Since we store all-time flap count for a peer, we add a cooldown factor which will discount poor flap counts in the past. This is only applied to peers that have not flapped for at least a cooldown period, so that we do not downgrade our rate limiting for badly behaved peers.
Force-pushed 896804f to e2c0604
With the last changes, the PR LGTM, great work @carlaKC 💯
Awesome work! Really like the usage of mock clock + deterministic unit tests. Tested this out locally with a flapping peer and could see the listpeers output increment, and noticed no pprof usage for chanfitness. One thing is that both sides will have this flap count even though only one of them disconnected first; that could be addressed in a later PR, but I honestly don't know that it's necessary.
This PR adds rate limiting to peers that we are currently tracking online/offline events for. This change prevents constantly flapping peers from filling up our memory with an endless set of online/offline events. It is preparatory work for #4165, addressing our in-memory resource consumption before we proceed to write these events to disk. The rate limiting logic used is very simple: we just place peers in buckets based on the number of times they have flapped. This can easily be switched out in future, but we start simple. To add a tangible output to this PR, we display flap rate in listchannels.
A new bucket is added to track peer flap count across restarts so that we maintain our rate limiting when lnd is restarted.
This bucket will be extended to include uptime events.
This PR includes some refactoring to clean up spacing/testing in the original implementation, and to update the system to store events per peer. While working on this, it became clear that having all of our peer-level information stored in one place is a better design for our requirements (we do not need to double-store online events when we have two channels with a peer, and flap count is consistently stored). If we need to extend this system to channel-level granularity, we can still do so by having event logs for each channel within our peer log.