p2p-governor: fix various P2P governor bugs (with tests) #2526
Conversation
As discussed in relation to other similar bugs, I'd like us to adjust the tests so that they can catch these bugs, and then fix them.
@dcoutts by the end of the day I will put here all the changes that are now in
Force-pushed fd3769d to dadfabd
Force-pushed b9ac698 to b8a8f19
Force-pushed b8a8f19 to 8a1cc19
2658: Network version: NodeToNodeV_4 r=coot a=coot

`NodeToNodeV_4` extends `NodeToNodeVersionData` with a `DiffusionMode` flag. This flag decides whether a node runs a responder or not, and is negotiated along with the other connection options using the `Handshake` protocol. It is also passed through `DiffusionArguments`: if `InitiatorOnlyDiffusionMode` is used, the diffusion layer will not run the responder. `NodeToNodeV_4` is not used by the node yet. It can be taken up by `p2p-node` (in PR #2526) or by consensus, whoever needs it first.

Handshake is extended to return 'agreedOptions'. For 'NodeToNode' we then abuse it by allowing the existential 'vData' to escape. This is acceptable because in the near future we will refactor the handshake so that it no longer needs its dependent-type machinery. Instead we use a version-aware codec for 'NodeToNodeVersionData', which this PR introduces (together with tests). Note that `InitiatorAndResponderDiffusionMode` will be enforced by the codec if the version is strictly lower than 'NodeToNodeV_4'.

Co-authored-by: Marcin Szamotulski <profunctor@pm.me>
Force-pushed 179331e to 163162d
Force-pushed 163162d to 76b2207
This is generally really good, and a very cleanly constructed series of patches, which made it much easier to review than many PRs I've seen.
I think most of my comments are about the EstablishedPeers stuff. We could probably merge the earlier patches pretty quickly if that's easier.
let (failCount, knownPeers') = KnownPeers.incrementFailCount
                                 peeraddr
                                 (knownPeers st)

-- exponential backoff: 5s, 10s, 20s, 40s, 80s, 160s.
delay :: DiffTime
delay = fromIntegral $
          2 ^ (pred failCount `min` maxColdPeerRetryBackoff) * baseColdPeerRetryDiffTime
Could we structure this so it's more similar to the existing backoff code in the RootPeers job handler? It'd make it easier to see that we're doing the same thing, or what the differences in the parameters are.
Why do we want to pick 2^5 * 5 = 160 as the max time limit? For the root peers backoff we backed off for a max of around an hour. I'm not saying this is wrong, just that we should probably say something about why we chose these numbers.
Could we structure this so it's more similar to the existing backoff code in the RootPeers job handler? It'd make it easier to see that we're doing the same thing, or what the differences in the parameters are.
You'd like to count the number of consecutive failures or successes? This could be a useful metric. So instead of incrementFailCount / resetFailCount we'll use decrementBackoff / incrementBackoff.
Why do we want to pick 2^5 * 5 = 160 as the max time limit? For the root peers backoff we backed off for a max of around an hour. I'm not saying this is wrong, just that we should probably say something about why we chose these numbers.
Let's see: if we have 100 cold peers and 30% are offline, then we will try to reconnect to one of them every 160/30 = 5.33s (assuming that the policy keeps asking for the 30% offline peers, and that the governor is constantly looking for more warm peers). Do you think we should have a slower asymptotic behaviour?
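For illustration, the delay series implied by the snippet under discussion can be sketched as below. The parameter values (`baseColdPeerRetryDiffTime = 5`, `maxColdPeerRetryBackoff = 5`) are assumptions inferred from the "5s, 10s, 20s, 40s, 80s, 160s" comment in the diff, not taken from the actual code.

```haskell
-- Sketch of the cold-peer retry backoff discussed above.
-- Parameter values are ASSUMED from the "5s, 10s, 20s, 40s, 80s, 160s" comment.

baseColdPeerRetryDiffTime :: Int
baseColdPeerRetryDiffTime = 5    -- seconds (assumed)

maxColdPeerRetryBackoff :: Int
maxColdPeerRetryBackoff = 5      -- caps the exponent: max delay = 2^5 * 5 = 160s

-- Delay in seconds after the n-th consecutive failure (n >= 1).
coldPeerRetryDelay :: Int -> Int
coldPeerRetryDelay failCount =
  2 ^ (pred failCount `min` maxColdPeerRetryBackoff) * baseColdPeerRetryDiffTime
```

With 100 cold peers of which 30 are offline and all at the 160s cap, a reconnect attempt to one of them happens roughly every 160/30 ≈ 5.3s, matching the estimate above.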
I really just meant the structure of the code should be similar, since as far as I understand it they're using the same way of counting failure and same backoff strategy.
Let's include that explanation in the code next to where we pick the backoff parameters.
I pushed the refactoring in 1f90e16. I think it makes sense to use decrementBackoff and incrementBackoff.
prop_governor_overlapping_local_and_root_peers :: Property
prop_governor_overlapping_local_and_root_peers = prop_governor_nolivelock env
So perhaps this was a bug then?
The unpatched assertion indeed fails on this test case, so the commit message is right: the local and root peers might overlap. I checked that in this test case the local and public peers of env :: GovernorMockEnvironment intersect. So it's actually the test generator that should be improved.
I'm still not sure I follow.
Which bug did this test case check for? Which patch fixed it?
To be honest I am not totally sure why I originally included this test case. The environment in this test case contains a non-trivial intersection of local and public root peers, and the test succeeds. This means that the governor keeps the local and root peer sets non-overlapping.
| Map.null (EstablishedPeers.establishedReady establishedPeers)
= GuardedSkip (Min <$> EstablishedPeers.minActivateTime establishedPeers)
I guess so. This whole thing with a time delay on the re-promotion from warm to hot is rather crude.
It's just one fixed delay, when in the end what we'd want is to look at the tip sample protocol result and only consider promoting if that tells us that the node might be a good hot candidate again.
I mean it's not terrible, and it's probably ok for now, but we should think about the longer term too. Perhaps what we want is to allow the policy for picking warm peers to promote to be allowed to make no progress, if none of the current warm peers are suitable. Then we could have a policy that relies on looking at the tip sample results.
If we went that direction we'd need a governor change to let it deal with warm->hot picking policies that do not always make progress. It'd need some back-off and retry there, so it can try again once we have some more info or some new candidates.
This whole thing with a time delay on the re-promotion from warm to hot is rather crude.
What is progress? A few possible choices:
- good progress: adding a peer from whom we have a chance of getting a new header / block fastest with some reasonable chance of success
- fake progress: adding a peer from whom we have little chance of getting a new header before getting it from any other peer.
- governor progress: progress from the governor perspective (I think this is the meaning you're referring to)
Is fake progress a bad thing? Maybe not: having more hot peers allows transactions to flow faster between peers, and since SPOs are prepared for all their hot peers to be good (able to deliver both blocks and txs), they will be prepared for fake ones too. But on the other hand, if we organise the network based on the dissemination of headers / blocks, then transactions should have similarly good propagation characteristics (they flow in the opposite direction, but along the same paths as headers). So we don't need to worry about them.
As for governor progress, I think we should allow no progress, for the above reasons.
Technically it shouldn't be difficult to add additional metrics to KnownPeerInfo
available in policies that can be read from an external store, it just needs an STM
interface, but indeed we'd need to extend the governor side. When no progress is made for a longer period, maybe it should do more aggressive churn of warm peers. But we should avoid making it too complex.
It'd need some back-off and retry there, so it can try again once we have some more info or some new candidates.
The current implementation will retry as long as it's under its targets and as soon as there are available warm peers. Are you thinking about changing the rate of governor adaptation? The information about peers will change slowly, and we want the governor to adapt on a similar timescale, which could be slower than the changes in the environment (connection quality fluctuations, peers becoming busy at times, etc.).
Force-pushed 1f90e16 to 15a429b
It's good to keep the peer state change callbacks in a separate record; they will likely not change much as we go towards decentralisation.
This is only enabled in the re-enabled `prop_governor_nolivelock`. Running the test throws an exception (failed assertion); this is a bug fixed in the next commit. If during an asynchronous demotion a peer is demoted to 'ColdPeer', we interpret this as an error and throw an exception. This corresponds to a real scenario. This is important because otherwise the governor would return and finish the transition without running its error handlers, which bring the governor to the right state. The above corresponds to the property of 'PeerStateActions': on failures the peer state changes to 'PeerCold'.
'setConnectTime' is supposed to update 'availableToConnect' and 'nextConnectTimes' rather than 'availableForGossip' and 'nextGossipTimes'.
When checking if we need to update the state, we need to take 'nextConnectTimes' into account.
Force-pushed 182ae16 to a37383f
When a connection errors, the governor will see it as a demotion to a cold peer. In such a case we update the re-connect time as well as incrementing the fail count.
The connection monitoring needs to update inProgressPromoteWarm, by removing peers that fail to be promoted.
It's more useful to get assertion failures where the state changes. This is only done for some of the state changes.
Use Debug.Trace to annotate the invariant. This gives better debugging information.
EstablishedPeers keeps track of established peer connections and their status. It uses the same logic as KnownPeers to track peers which are not ready to be promoted to the hot state. This patch just replaces the original logic; it is preparation for the following patch.
When `chain-sync` returns, we use `activateDelay :: DiffTime` to delay re-promotion of that peer to `hot` again.
TracePromoteWarmFailed - show target & number of active peers
TracePromoteWarmDone   - show target & number of active peers
TraceDemoteHotFailed   - show target & number of active peers
TraceDemoteHotDone     - show target & number of active peers
TracePromoteColdFailed - show target & number of established peers
TracePromoteColdDone   - show target & number of established peers
TraceDemoteWarmFailed  - show target & number of established peers
TraceDemoteWarmDone    - show target & number of established peers
We do not want to check that the available sets are non-empty since these conditions are checked at the call sites, and the condition at the call sites are more subtle, since it's other subsets that have to be considered.
To match names used in other abstractions like KnownPeers.
The EstablishedPeers contained a Map that kept track of the cold/warm/hot status of each peer. This is redundant information since it can be derived from the known, established and active peer sets. We can simplify things a bit by eliminating the tracking of this redundant information. The info was only being used in one place in the implementation (to determine the async demotions) and in one test (to compare the environment and governor's views of the peers status). So those places have been changed to use the established and active peer sets.
We do not use it in any existing policy. If we need it, we can derive the info by other means, and do so on demand, as policies need it. We do not need to maintain it incrementally.
Turns out we don't currently use any of the KnownPeerInfo in our policies.
We do not use it in any existing policy. If we need it, we can derive the info by other means, and do so on demand, as policies need it. We do not need to maintain it incrementally. This also fixes a bug in the way we were incrementally maintaining the knownPeerAdvertise info. Not tracking it at all is the simple solution.
So also remove KnownPeers.toMap and switch the last remaining uses of it over to toSet.
Force-pushed a37383f to 3d3068b
I've made a few changes on top of Marcin's original:
- I've dropped the patch that relaxed the invariant to knownPeerSource. I wasn't happy with relaxing the invariant, since my view is that it was simply accepting that the results were wrong. So that also dropped the huge unit test (a saved test that QC found)
- I've removed most of the info in the KnownPeerInfo since it was redundant. It was incrementally maintaining info that we could calculate from other canonical state. For the peer source it was being maintained incorrectly (the above point). So that removed the peer source and the advertising state.
- I've stopped passing the KnownPeerInfo to the pick policy functions. We can add any of this info back if we need it later, but in an on-demand style as an attribute lookup function.
- This meant adjusting the "available peers" calculations to use sets rather than maps.
@@ -92,6 +92,7 @@ belowTarget actions
-- If we could connect except that there are no peers currently available
-- then we return the next wakeup time (if any)
| numEstablishedPeers + numConnectInProgress < targetNumberOfEstablishedPeers
, Set.null (availableToConnect Set.\\ inProgressOrEstablishedPeers)
I think we don't need this because it is redundant.
The guard above was
| numEstablishedPeers + numConnectInProgress < targetNumberOfEstablishedPeers
, Set.size availableToConnect - numEstablishedPeers - numConnectInProgress > 0
and our guard here repeats the first clause. So we know already that the second clause of the guard is false. Thus we know that
Set.size availableToConnect - numEstablishedPeers - numConnectInProgress = 0
(it cannot be negative since the established and in-progress ones are a subset of the available ones).
But what is this guard?
Set.null (availableToConnect Set.\\ inProgressOrEstablishedPeers)
Isn't it exactly saying that the number available to connect, minus the ones that are established or in progress is zero?
Again, we know this because the established and in-progress are a subset of the available, and they don't overlap with each other. So in this case counting becomes equivalent to the set operations.
And so this is exactly the negation of the previous guard, which we already knew!
QED.
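The counting argument above can be double-checked with a small sketch (hypothetical names, using `Data.Set` from the containers package): whenever the established and in-progress peers are disjoint subsets of the available peers, the set-difference guard and the counting guard necessarily agree.

```haskell
import qualified Data.Set as Set

-- Hypothetical check of the equivalence argued above: given the preconditions
-- (established and in-progress peers are disjoint subsets of the available
-- peers), the set guard
--   Set.null (available Set.\\ (established `Set.union` inProgress))
-- agrees with the counting guard
--   Set.size available - Set.size established - Set.size inProgress == 0
guardsAgree :: Ord a => Set.Set a -> Set.Set a -> Set.Set a -> Bool
guardsAgree available established inProgress =
    setGuard == countGuard
  where
    setGuard   = Set.null (available Set.\\ (established `Set.union` inProgress))
    countGuard = Set.size available
               - Set.size established
               - Set.size inProgress == 0
```

For example, with available = {1..5}, established = {1,2}, inProgress = {3}, both guards are False (two peers remain available); with inProgress = {3,4,5}, both are True.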
-- * the script returns 'Noop'
-- * peer demoted to 'PeerCold'
-- + the script returns 'Noop'
-- + peer demoted to 'PeerCold'
I heard there are plans to fix this in haddock :), it will just generate a warning.
btw, the policies used in
LGTM from me (I cannot approve as I am the author of the PR).
bors merge
Build succeeded.