chanfitness: Rate limit in memory events based on peer flap rate #4440

Merged
merged 13 commits into lightningnetwork:master on Sep 14, 2020

Conversation

carlaKC
Collaborator

@carlaKC carlaKC commented Jul 5, 2020

This PR adds rate limiting to peers that we are currently tracking online/offline events for. This change prevents constantly flapping peers from filling up our memory with an endless set of online/offline events. It is preparatory work for #4165, addressing our in-memory resource consumption before we proceed to write these events to disk. The rate limiting logic used is very simple: we just place peers in buckets based on the number of times they have flapped. This can easily be switched out in the future, but we start simple. To add a tangible output to this PR, we display flap rate in listchannels.

A new bucket is added to track peer flap count across restarts so that we maintain our rate limiting when lnd is restarted.
This bucket will be extended to include uptime events.

[peers-bucket]
    [peer-pubkey]
        [flap-count-key]
       +[events-bucket]

This PR includes some refactoring to clean up spacing/testing in the original implementation, and to update the system to store events per peer. While working on this, it became clear that keeping all of our peer-level information in one place is a better design for our requirements (we do not need to double-store online events when we have two channels with a peer, and the flap count is stored consistently). If we need to extend this system to the channel level, we can still do so by keeping event logs for each channel within our peer log.
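
To make the bucketing idea concrete, here is a minimal sketch in Go; the bucket boundaries, durations, and the `getRateLimit` helper shown here are illustrative assumptions, not necessarily the exact values or API shipped in this PR:

```go
package chanfitness

import "time"

// rateLimits maps flap-count buckets to the minimum interval we are
// willing to wait between recording events for a peer. Peers that flap
// more often land in buckets with longer intervals. These durations are
// purely illustrative.
var rateLimits = []time.Duration{
	time.Second,
	time.Minute,
	time.Hour,
	time.Hour * 24,
}

// getRateLimit returns the rate limit to apply to a peer based on how
// many times it has flapped. Peers beyond the last boundary simply fall
// into the most heavily limited bucket.
func getRateLimit(flapCount int) time.Duration {
	for i, limit := range rateLimits {
		// Hypothetical boundaries: bucket i covers flap counts up
		// to (i+1)*10.
		if flapCount <= (i+1)*10 {
			return limit
		}
	}

	return rateLimits[len(rateLimits)-1]
}
```

Swapping out the rate limiting logic later then only means changing the bucket boundaries or the lookup in one place.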

@carlaKC carlaKC requested review from Crypt-iQ and removed request for Roasbeef July 5, 2020 09:50
@carlaKC carlaKC changed the title from "chanfitness: Rate in memory events based of peer flap rate" to "chanfitness: Rate limit in memory events based of peer flap rate" Jul 5, 2020
Collaborator

@Crypt-iQ Crypt-iQ left a comment


Nice code and commit structure; it made this pleasant to review. I left some non-blocking comments and questions. The main things for me are:

  • recordFlapCount should probably aggregate the transactions, as long as it doesn't drastically interfere with the WriteFlapCount function.
  • Right now the flap count is aggregated per peer. Since I assume this data is going to be used elsewhere (ban manager?), it might be better to also store it under a per-channel sub-bucket of the peer-bucket for more precision. Non-blocking.

@carlaKC carlaKC added the v0.12 label Jul 9, 2020
@carlaKC carlaKC requested a review from Crypt-iQ July 14, 2020 16:44
@Roasbeef Roasbeef added the channel management and optimization labels Jul 20, 2020
@Roasbeef Roasbeef added this to the 0.12.0 milestone Jul 22, 2020
@Roasbeef Roasbeef added this to To do in v0.12.0-beta Jul 22, 2020
@Roasbeef Roasbeef moved this from To do to Review in progress in v0.12.0-beta Jul 22, 2020
@Roasbeef Roasbeef requested review from bhandras and removed request for cfromknecht July 23, 2020 03:13
Collaborator

@bhandras bhandras left a comment


Quite a large PR. The initial pass seems good, though I have a few comments.

@carlaKC carlaKC force-pushed the 4165-ratelimit branch 2 times, most recently from 80ebf77 to e36ace6 on August 3, 2020 09:19
@carlaKC carlaKC changed the title from "chanfitness: Rate limit in memory events based of peer flap rate" to "chanfitness: Rate limit in memory events based on peer flap rate" Aug 5, 2020
@carlaKC carlaKC force-pushed the 4165-ratelimit branch 3 times, most recently from 19344b5 to dd18082 on August 10, 2020 12:15
@carlaKC carlaKC requested a review from bhandras August 10, 2020 12:20
Collaborator

@bhandras bhandras left a comment


Looking good! My only remaining question is the (offline-discussed) flap count vs. subsequent online events "issue".

offline = false
// If our previous event is nil, we just set it and
// break out of the switch.
if lastEvent == nil {
Collaborator


We discussed this offline; I just wanted to leave my question here too:
What if a peer reportedly comes online but is never reported to go offline in between? Will that hurt the flap rate?
Or can this even happen?

Collaborator Author


I've had another look at this, and I think that we may be ok in terms of ordering because we notify at quite a high level. We notify peer online events in addPeer and offline events in removePeer.

addPeer is only called when we get an inbound connection that we do not have an existing record of, or when we replace a stale connection with a fresh one (InboundPeerConnected/OutboundPeerConnected). When we are replacing stale connections, we remove all previous state, which will trigger an offline event before our new online event.

This happens at a much higher level than our connection retries, and has logic in place that's aware of existing connections. This does make our online events less granular, but it happens at the same level for every peer (and is preexisting for this uptime measure).

However, it's still difficult to say for sure whether we can rely on in-order events in reality. Perhaps @Crypt-iQ can weigh in on this? I still like the suggestion of counting online and offline events, because it makes sure that our measurement isn't biased (perhaps certain peers have odd connection habits, others don't), especially in this case, where the value itself matters less than the comparison between different peers' values.


// Uptime is the total amount of time that the channel peer has been
// observed as online during the monitored lifespan.
Uptime time.Duration
Collaborator


Same question in a different form: if a peer reportedly comes online such that it always hits the idle timeout, then what is the uptime?

Collaborator Author


So this would be the case where a peer connects to us, does not respond to any messages for the idle timeout, we disconnect and then they immediately reconnect but don't respond to any messages again?

With the current notifications, this peer would be up for that period of time, since we've defined online as having a connection open with that peer (so far). I think this case would likely be due to a mis-implemented peer. While somebody could set up their node to "game" our uptime metric, the flap count would reflect that they go down every 5 minutes, and I don't think there's much to gain, because they would have to respond to our messages to participate in forwards (the only incentive I can see for wanting high uptime is that we choose to route through you).

We could look at moving notifications for uptime tracking to a lower level, but I think we will then start running into ordering issues.

@@ -95,14 +121,34 @@ func (p *peerLog) addOnlineEvent(online bool, time time.Time) {
eventType := peerOnlineEvent
if !online {
eventType = peerOfflineEvent
p.flapCount++
Collaborator


Adding to the prev comments: we could increase flap count for subsequent online events too.

Collaborator Author


One thing we may need to account for if we increment for online events is that we will then do so on restart for every peer (not just those which are offline when we restart). Perhaps this is more fair, because the online peers may have flapped while our node was down, and the offline ones just have bad luck.

@carlaKC carlaKC force-pushed the 4165-ratelimit branch 2 times, most recently from 7427bcf to 00a3a2c on August 24, 2020 13:41
Collaborator

@bhandras bhandras left a comment


I think the cooldown needs fixing, other than that, the PR looks good! 👍


// Likewise, if we have an online event, nothing beyond the online state
// of our peer log should change.
// of our peer log should change, but our flap count should change.
Collaborator


👍 I think it's a good idea to increase flaps regardless of order.

for peer, monitor := range c.peers {
flapCount, lastFlap := monitor.getFlapCount()
if lastFlap == nil {
continue
}

flapCounts[peer] = &channeldb.FlapCount{
Count: uint32(flapCount),
Count: uint32(
cooldownFlapCount(now, flapCount, *lastFlap),
Collaborator


iiuc this won't really cool down the flap count, as we always get it from the peer monitor where the flap count only ever increases, so the max we could get is 0.9 * flapcount

Collaborator Author


Moved this up and tested with some regtest nodes. Since we only need the cooldown for rate limiting, I decided to apply it only where we need it, as you suggest.

It is the case that we may cool down and then increment, but that makes sense to me in the following case:

  • peer with 100 flap count connects, time since flap > cooldown period
  • cooldown to 95
  • update to 96; since they have just flapped, also update their last-flapped timestamp

@carlaKC carlaKC force-pushed the 4165-ratelimit branch 3 times, most recently from ca19aba to d765b3e on August 30, 2020 07:26
@carlaKC
Collaborator Author

carlaKC commented Aug 30, 2020

Changes since last review here, other pushes are rebases.

@carlaKC carlaKC requested a review from bhandras August 31, 2020 11:20
Collaborator

@bhandras bhandras left a comment


Almost ready, just one question left on my end.

The original PR was written with 4 spaces instead of 8, so do a once-off
fix here rather than fixing it bit by bit in the subsequent commits and
cluttering them for review.
The current implementation of subscribe is difficult to mock because
the queue that we send updates on is unexported, so you cannot create
a subscribe.Client object and then add your own updates. While it is
possible to run a subscribe server in tests, subscribe servers will
shut down before dispatching their updates to all clients, which can be
flaky (and is difficult to work around). In this commit, we add a
subscription interface so that these testing struggles can be addressed
with a mock.
As we add more elements to the chanfitness subsystem, we will require
more complex testing. The current tests are built around the inability
to mock subscriptions, which is remedied by the addition of our own mock.
This context allows us to run the full store in a test, rather than
having to manually spin up the main goroutine. Mocking our subscriptions
is required so that we can block our subscribe updates on consumption;
using the real package gives us no guarantee that the client receives
the update before shutdown, which produces test flakes.

This change also makes a move towards separating the testing of our
event store from testing the underlying event logs, to prepare for
further refactoring.
To get our uptime, we first filter our event log to get online periods.
This change updates this code to be tolerant of consecutive online or
offline events in the log. This will be required for rate limiting,
because we will not record every event for anti-DoS reasons, so we could
record an online event, ignore an offline event and then record another
online event. We could just ignore this duplicate event, but we will
also need this tolerance when we persist uptime, since our peers can
have their last event before restart be an online event and then record
another online event when we come back up.
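
For illustration, a rough sketch of online-period extraction that tolerates consecutive events of the same type; the `event` and `onlinePeriod` types here are simplified stand-ins, not the PR's actual types:

```go
package chanfitness

import "time"

type event struct {
	online bool
	ts     time.Time
}

type onlinePeriod struct {
	start, end time.Time
}

// getOnlinePeriods walks the event log and converts it into a set of
// online periods, skipping consecutive events of the same type rather
// than treating them as an error. A trailing open online period (peer
// still online at the end of the log) is ignored here for brevity.
func getOnlinePeriods(events []event) []onlinePeriod {
	var (
		periods     []onlinePeriod
		online      bool
		periodStart time.Time
	)

	for _, e := range events {
		// Tolerate duplicates: an event that matches our current
		// state is simply skipped.
		if e.online == online {
			continue
		}

		online = e.online
		if online {
			periodStart = e.ts
			continue
		}

		// Transition from online to offline closes a period.
		periods = append(periods, onlinePeriod{
			start: periodStart,
			end:   e.ts,
		})
	}

	return periods
}
```
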
We currently query the store for uptime and lifespan individually. As
we add more fields, we will need to add more queries with this design.
This change combines requests into a single channel info request so
that we do not need to add unnecessary boilerplate going forward.
When dealing with online events, we actually need to track our events
by peer, not by channel. All we need in order to track channels is a
set of online events for the peer which at least contains those events.
This change refactors chanfitness to track by peer.
In preparation for storing our flap count on disk, we start tracking
flap count per-peer.
To prevent flapping peers from endlessly dos-ing us with online and
offline events, we rate limit the number of events we will store per
period, using their flap rate to determine how often we will add their
events to our in-memory list of online events.

Since we are tracking online events, we need to track the aggregate
change over the rate-limited period, otherwise we will lose track of
a peer's current state. For example, if we store an online event, then
do not store the subsequent offline event, we will believe that the
peer is online when they actually aren't. To address this, we "stage"
a single event which keeps track of all the events that occurred while
we were rate limiting the peer. At the end of the rate limiting period,
we will store the last state for that peer, thereby ensuring that
we maintain our record of their most recent state.
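
A simplified sketch of this staging idea, reusing the hypothetical `event` type and `getRateLimit` helper from the earlier sketches; the field and method names are illustrative, not the PR's exact API:

```go
package chanfitness

import "time"

// peerLog tracks a single peer's online/offline history.
type peerLog struct {
	flapCount   int
	lastRecord  time.Time
	stagedEvent *event // latest state seen while rate limited
	events      []event
}

// addEvent records an online/offline event, rate limiting how often we
// append to the in-memory event list. Following the discussion above,
// both online and offline events count towards the flap count.
func (p *peerLog) addEvent(e event) {
	p.flapCount++

	// Still within the rate limit window: only overwrite the staged
	// event, so it always reflects the peer's most recent state.
	if time.Since(p.lastRecord) < getRateLimit(p.flapCount) {
		p.stagedEvent = &e
		return
	}

	// The window has passed: flush any staged event first so that we
	// never lose the peer's current state, then record this event.
	if p.stagedEvent != nil {
		p.events = append(p.events, *p.stagedEvent)
		p.stagedEvent = nil
	}

	p.events = append(p.events, e)
	p.lastRecord = e.ts
}
```

Note that flushing a staged event may produce two consecutive events of the same type, which is exactly the case the uptime filtering above is made tolerant of.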
Since we will use peer flap rate to determine how we rate limit, we
store this value on disk per peer per channel. This allows us to
restart with a memory of our peers' past behaviour, so that badly
behaving peers do not get a fresh start on restart. The last flap
timestamp is stored with our flap count so that we can degrade this
all-time flap count over time for peers that have not recently flapped.
Since we store all-time flap count for a peer, we add a cooldown factor
which will discount poor flap counts in the past. This is only applied
to peers that have not flapped for at least a cooldown period, so that
we do not downgrade our rate limiting for badly behaved peers.
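
A hedged sketch of this cooldown; the period and factor below are illustrative assumptions (the 0.95 factor matches the 100 → 95 example discussed above), not necessarily the values the PR ships with:

```go
package chanfitness

import "time"

const (
	// flapCountCooldownPeriod is how long a peer must go without
	// flapping before we discount its all-time flap count.
	flapCountCooldownPeriod = time.Hour * 8

	// flapCountCooldownFactor is the multiplier applied to the flap
	// count of peers that have not flapped for a cooldown period.
	flapCountCooldownFactor = 0.95
)

// cooldownFlapCount discounts the all-time flap count of peers that have
// not flapped for at least one cooldown period, so that bad behaviour in
// the distant past does not rate limit a peer forever. Recently flapping
// peers are returned unchanged.
func cooldownFlapCount(now time.Time, flapCount int, lastFlap time.Time) int {
	if now.Sub(lastFlap) < flapCountCooldownPeriod {
		return flapCount
	}

	return int(float64(flapCount) * flapCountCooldownFactor)
}
```
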
Collaborator

@bhandras bhandras left a comment


With the last changes, the PR LGTM, great work @carlaKC 💯

v0.12.0-beta automation moved this from Review in progress to Reviewer approved Sep 11, 2020
Collaborator

@Crypt-iQ Crypt-iQ left a comment


Awesome work! Really like the usage of the mock clock + deterministic unit tests. Tested this out locally with a flapping peer: I could see the listpeers output increment, and noticed no significant chanfitness usage in pprof. One thing is that both sides will have this flap count even though one of them disconnected first; that could be addressed in a later PR, but I honestly don't know that it's necessary.

@carlaKC carlaKC merged commit db0802b into lightningnetwork:master Sep 14, 2020
v0.12.0-beta automation moved this from Reviewer approved to Done Sep 14, 2020