
p2p-governor: fix various P2P governor bugs (with tests) #2526

Merged: 28 commits into master from coot/governor-fix, Mar 9, 2021

Conversation

@coot (Contributor) commented Aug 20, 2020:

No description provided.

@dcoutts (Contributor) left a comment:

As discussed in relation to other similar bugs, I'd like us to adjust the tests so that they can catch these bugs, and then fix them.

@coot (Contributor, Author) commented Aug 27, 2020:

@dcoutts by the end of the day I will put here all the changes that are now in coot/connection-manager. Then I'll be able to work on this branch to update the tests (going through all patches). This PR should be a draft.

@coot force-pushed the coot/governor-fix branch 4 times, most recently from fd3769d to dadfabd on September 3, 2020 11:47
@coot force-pushed the coot/governor-fix branch 2 times, most recently from b9ac698 to b8a8f19 on September 17, 2020 10:41
@coot changed the title from "p2p-governor: fix acounting of established peers" to "p2p-governor: fix various P2P governor bugs (with tests)" on Sep 17, 2020
iohk-bors bot added a commit that referenced this pull request Oct 5, 2020
2658: Network version: NodeToNodeV_4 r=coot a=coot

`NodeToNodeV_4` extends `NodeToNodeVersionData` with a `DiffusionMode` flag.  This flag decides whether a node runs a responder or not and is presented when negotiating connection options using the `Handshake` protocol.  It is also passed through `DiffusionArguments`: if `InitiatorOnlyDiffusionMode` is used, the diffusion will not run the responder.

`NodeToNodeV_4` is not used by the node yet.  It can be taken up by `p2p-node` (in PR #2526) or by consensus, whoever needs it first.

Handshake is extended to return 'agreedOptions'.  For 'NodeToNode' we then
abuse it by allowing the existential 'vData' to escape.  This is because in
the near future we will refactor the handshake so that it no longer needs its
dependent type machinery.  Instead we use a version-aware codec for
'NodeToNodeVersionData', which this PR introduces (together with tests).

Note that `InitiatorAndResponderDiffusionMode` will be enforced by the codec if the version is strictly lower than 'NodeToNodeV_4'.


Co-authored-by: Marcin Szamotulski <profunctor@pm.me>
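A rough sketch of the negotiation described in the commit message above, assuming illustrative types (a DiffusionMode flag inside NodeToNodeVersionData and an agreement rule) rather than the exact ouroboros-network definitions:

-- Sketch only: illustrative types, not the exact ouroboros-network API.
data DiffusionMode
  = InitiatorOnlyDiffusionMode
  | InitiatorAndResponderDiffusionMode
  deriving (Eq, Show)

-- Hypothetical version data carried by the handshake from NodeToNodeV_4 on.
data NodeToNodeVersionData = NodeToNodeVersionData
  { networkMagic  :: Word
  , diffusionMode :: DiffusionMode
  } deriving (Eq, Show)

-- When negotiating, the more restrictive mode wins: if either side is
-- initiator-only, the agreed connection does not run responders.
agreeDiffusionMode :: DiffusionMode -> DiffusionMode -> DiffusionMode
agreeDiffusionMode InitiatorAndResponderDiffusionMode
                   InitiatorAndResponderDiffusionMode =
    InitiatorAndResponderDiffusionMode
agreeDiffusionMode _ _ = InitiatorOnlyDiffusionMode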
@coot force-pushed the coot/governor-fix branch 2 times, most recently from 179331e to 163162d on October 12, 2020 08:04
@coot requested a review from dcoutts on October 13, 2020 14:22
@dcoutts (Contributor) left a comment:

This is generally really good, and a very cleanly constructed series of patches, which made it much easier to review than many PRs I've seen.

I think most of my comments are about the EstablishedPeers stuff. We could probably merge the earlier patches pretty quickly if that's easier.

Comment on lines +124 to +136
let (failCount, knownPeers') = KnownPeers.incrementFailCount
                                 peeraddr
                                 (knownPeers st)

    -- exponential backoff: 5s, 10s, 20s, 40s, 80s, 160s.
    delay :: DiffTime
    delay = fromIntegral $
              2 ^ (pred failCount `min` maxColdPeerRetryBackoff) * baseColdPeerRetryDiffTime
Contributor:

Could we structure this so it's more similar to the existing backoff code in the RootPeers job handler? It'd make it easier to see that we're doing the same thing, or what the differences in the parameters are.

Why do we want to pick 2^5 * 5 = 160 as the max time limit? For the root peers backoff we backed off for a max of around an hour. I'm not saying this is wrong, just that we should probably say something about why we chose these numbers.

@coot (Contributor, Author) commented Feb 9, 2021:

> Could we structure this so it's more similar to the existing backoff code in the RootPeers job handler? It'd make it easier to see that we're doing the same thing, or what the differences in the parameters are.

You'd like to count the number of consecutive failures or successes? That could be a useful metric.
So instead of incrementFailCount / resetFailCount we'll use decrementBackoff / incrementBackoff.

> Why do we want to pick 2^5 * 5 = 160 as the max time limit? For the root peers backoff we backed off for a max of around an hour. I'm not saying this is wrong, just that we should probably say something about why we chose these numbers.

Let's see: if we have 100 cold peers and 30% are offline, then we will try to reconnect to one of them every 160/30 = 5.33s (assuming that the policy keeps asking for the 30% of offline peers and assuming that the governor is constantly looking for more warm peers). Do you think we should have a slower asymptotic behaviour?
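For concreteness, a small sketch that reproduces these numbers, assuming the parameters implied by the snippet above (a 5s base delay and an exponent capped at 5); the names backoffDelay and retryInterval are made up for illustration:

backoffDelay :: Int -> Double
backoffDelay failCount =
    2 ^ (pred failCount `min` maxBackoff) * baseDelay
  where
    baseDelay :: Double
    baseDelay = 5          -- assumed base delay in seconds
    maxBackoff :: Int
    maxBackoff = 5         -- cap the exponent, so at most 2^5 * 5 = 160s

-- With 100 cold peers of which 30% are offline, each offline peer is retried
-- at most every 160s, so some reconnection attempt happens roughly every
-- 160 / 30 ≈ 5.3 seconds.
retryInterval :: Double
retryInterval = backoffDelay 10 / (100 * 0.3)

main :: IO ()
main = do
  print (map backoffDelay [1..7])  -- [5.0,10.0,20.0,40.0,80.0,160.0,160.0]
  print retryInterval              -- ≈ 5.33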

Contributor:

I really just meant that the structure of the code should be similar, since as far as I understand it they're using the same way of counting failures and the same backoff strategy.

Let's include that explanation in the code next to where we pick the backoff parameters.

@coot (Contributor, Author) commented Feb 15, 2021:

I pushed the refactoring in 1f90e16. I think it makes sense to use decrementBackoff and incrementBackoff.
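For readers without commit 1f90e16 at hand, a purely hypothetical sketch of what a symmetric counter with these names could look like; the representation (a signed count of consecutive outcomes) is an assumption, not necessarily what the commit does:

-- Hypothetical sketch: a signed counter of consecutive outcomes,
-- positive for failures, negative for successes.
newtype Backoff = Backoff Int
  deriving (Eq, Show)

-- On failure: extend a failure streak, or start a new one after successes.
incrementBackoff :: Backoff -> Backoff
incrementBackoff (Backoff n)
  | n >= 0    = Backoff (n + 1)
  | otherwise = Backoff 1

-- On success: extend a success streak, or start a new one after failures.
decrementBackoff :: Backoff -> Backoff
decrementBackoff (Backoff n)
  | n <= 0    = Backoff (n - 1)
  | otherwise = Backoff (-1)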



prop_governor_overlapping_local_and_root_peers :: Property
prop_governor_overlapping_local_and_root_peers = prop_governor_nolivelock env
Contributor:

So perhaps this was a bug then?

@coot (Contributor, Author):

The unpatched assertion indeed fails on this test case, so the commit message is right: the local and root peers might overlap. I checked that in this test case the local and public peers of env :: GovernorMockEnvironment do intersect. So it's actually the test generator that should be improved.

Contributor:

I'm still not sure I follow.

Which bug did this test case check for? Which patch fixed it?

@coot (Contributor, Author):

To be honest I am not totally sure why I originally included this test case. The environment in this test case contains a non-trivial intersection of local and public root peers and the test succeeds. This means that the governor keeps the local and root peer sets non-overlapping.

Comment on lines 53 to 54
| Map.null (EstablishedPeers.establishedReady establishedPeers)
= GuardedSkip (Min <$> EstablishedPeers.minActivateTime establishedPeers)
Contributor:

I guess so. This whole thing with a time delay on the re-promotion from warm to hot is rather crude.

It's just one fixed delay, when in the end what we'd want is to look at the tip sample protocol result and only consider promoting if that tells us that the node might be a good hot candidate again.

I mean it's not terrible, and it's probably ok for now, but we should think about the longer term too. Perhaps what we want is to allow the policy for picking warm peers to promote to make no progress, if none of the current warm peers are suitable. Then we could have a policy that relies on looking at the tip sample results.

If we went that direction we'd need a governor change to let it deal with warm->hot picking policies that do not always make progress. It'd need some back-off and retry there, so it can try again once we have some more info or some new candidates.

@coot (Contributor, Author):

> This whole thing with a time delay on the re-promotion from warm to hot is rather crude.

What is progress? A few possible choices:

  • good progress: adding a peer from whom we have a reasonable chance of getting a new header / block fastest
  • fake progress: adding a peer from whom we have little chance of getting a new header before getting it from any other peer
  • governor progress: progress from the governor's perspective (I think this is the meaning you're referring to)

Is fake progress a bad thing? Maybe not: having more hot peers allows transactions to flow faster between peers, and since SPOs are prepared for all their hot peers being good (able to deliver both blocks and txs), they will be prepared for fake ones too. On the other hand, if we organise the network based on dissemination of headers / blocks, the transactions should have similarly good propagation characteristics (they flow in the opposite direction but along the same paths as headers). So we don't need to worry about them.

As for governor progress, I think we should allow no progress, for the above reasons.

Technically it shouldn't be difficult to add additional metrics to the KnownPeerInfo available in policies, metrics that can be read from an external store; it just needs an STM interface. But indeed we'd need to extend the governor side. When no progress is made for a longer period, maybe it should do a more aggressive churn of warm peers. But we should avoid making it too complex.

> It'd need some back-off and retry there, so it can try again once we have some more info or some new candidates.

The current implementation will retry as long as it's under its targets and as soon as there are available warm peers. Are you thinking about changing the rate of governor adaptation? The information about peers changes slowly, and we want the governor to adapt on a similar timescale, which could be slower than the changes in the environment (connection quality fluctuations, peers becoming busy at times, etc.).
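Relating to the suggestion above that the warm-to-hot pick policy be allowed to make no progress, a hedged sketch of what such an interface could look like; every name here is an illustrative assumption, not the governor's actual API:

import           Data.Set (Set)
import qualified Data.Set as Set

-- Illustrative only: a warm-to-hot pick policy that may decline to pick.
-- The available set holds the current warm peers; the Int says how many we
-- would like to promote.  The policy may return any subset, including the
-- empty set, e.g. because tip-sample results say none of the warm peers is
-- a good hot candidate yet.
type PickPolicy peeraddr = Set peeraddr -> Int -> Set peeraddr

-- The governor side then needs a fallback: an empty pick means no progress,
-- so back off and retry later rather than insisting on a promotion now.
promoteWarmPeers :: Ord peeraddr
                 => PickPolicy peeraddr
                 -> Set peeraddr          -- available warm peers
                 -> Int                   -- how many to promote
                 -> Maybe (Set peeraddr)
promoteWarmPeers pick available n
  | Set.null chosen = Nothing             -- no progress: schedule a retry
  | otherwise       = Just chosen
  where
    -- guard against a policy returning peers that are not actually available
    chosen = pick available n `Set.intersection` available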

@coot force-pushed the coot/governor-fix branch 2 times, most recently from 1f90e16 to 15a429b on February 16, 2021 10:53
coot added 7 commits March 2, 2021 20:52
It's good to keep peer state change callbacks in a separate record; they
will likely not change much as we go towards decentralisation.
This is only enabled in the re-enabled `prop_governor_nolivelock`.
Running the test throws an exception (failed assertion); this is a bug
fixed in the next commit.

If, during an asynchronous demotion, a peer is demoted to 'ColdPeer', we
interpret this as an error and throw an exception.  This corresponds to
a real scenario.  This is important because otherwise the governor would
return and finish the transition without running its error handlers,
which bring the governor to the right state.

The above corresponds to the property of 'PeerStateActions': on failures
the peer state changes to 'PeerCold'.
'setConnectTime' is supposed to update 'availableToConnect' and
`nextConnectTimes' rather than 'availableForGossip' and
'nextGossipTimes'.
When checking if we need to update the state, we need to take
'nextConnectTimes' into account.
@coot force-pushed the coot/governor-fix branch 2 times, most recently from 182ae16 to a37383f on March 3, 2021 08:56
coot and others added 17 commits March 8, 2021 12:01
When a connection errors, the governor sees it as a demotion to
a cold peer.  In such a case we update the re-connect time and
increment the fail count.
The connection monitoring needs to update inProgressPromoteWarm, by
removing peers that fail to be promoted.
It's more useful to get assertion failures where the state changes.
This is only done for some of the state changes.
Use Debug.Trace to annotate the invariant.  This gives better
debugging information.
EstablishedPeers keeps track of established peer connections and their
status.  It uses the same logic as KnownPeers to track peers which
are not ready to be promoted to the hot state.  This patch just replaces the
original logic; it is a preparation for the following patch.
When `chain-sync` returns, we use `activateDelay :: DiffTime` to delay
re-promotion of that peer to `hot` again.
TracePromoteWarmFailed - show target & number of active peers
TracePromoteWarmDone   - show target & number of active peers
TraceDemoteHotFailed   - show target & number of active peers
TraceDemoteHotDone     - show target & number of active peers
TracePromoteColdFailed - show target & number of established peers
TracePromoteColdDone   - show target & number of established peers
TraceDemoteWarmFailed  - show target & number of established peers
TraceDemoteWarmDone    - show target & number of established peers
We do not want to check that the available sets are non-empty since
these conditions are checked at the call sites, and the conditions at the
call sites are more subtle: it is other subsets that have to be
considered.
To match names used in other abstractions like KnownPeers.
The EstablishedPeers structure contained a Map that kept track of the
cold/warm/hot status of each peer. This is redundant information since
it can be derived from the known, established and active peer sets.
We can simplify things a bit by eliminating the tracking of this redundant
information (see the sketch after this list of commits).

The info was only being used in one place in the implementation (to determine
the async demotions) and in one test (to compare the environment's and
governor's views of the peers' status). So those places have been changed to
use the established and active peer sets.
We do not use it in any existing policy. If we need it, we can derive
the info by other means, and do so on demand, as policies need it. We
do not need to maintain it incrementally.
Turns out we don't currently use any of the KnownPeerInfo in our
policies.
We do not use it in any existing policy. If we need it, we can derive
the info by other means, and do so on demand, as policies need it. We
do not need to maintain it incrementally.

This also fixes a bug in the way we were incrementally maintaining the
knownPeerAdvertise info. Not tracking it at all is the simple solution.
So also remove KnownPeers.toMap and switch the last remaining uses of
it over to toSet.
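The commit above that removes the redundant status Map relies on the status being derivable from set membership alone; a minimal sketch of that derivation, with names assumed for illustration:

import           Data.Set (Set)
import qualified Data.Set as Set

data PeerStatus = PeerCold | PeerWarm | PeerHot
  deriving (Eq, Show)

-- The status of a known peer is fully determined by set membership,
-- given the invariant active ⊆ established ⊆ known.
peerStatus :: Ord peeraddr
           => Set peeraddr   -- established peers
           -> Set peeraddr   -- active peers
           -> peeraddr       -- a known peer
           -> PeerStatus
peerStatus established active peer
  | peer `Set.member` active      = PeerHot
  | peer `Set.member` established = PeerWarm
  | otherwise                     = PeerCold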
@dcoutts (Contributor) left a comment:

I've made a few changes on top of Marcin's original:

  • I've dropped the patch that relaxed the invariant to knownPeerSource. I wasn't happy with relaxing the invariant, since my view is that it was simply accepting that the results were wrong. So that also dropped the huge unit test (a saved test that QC found).
  • I've removed most of the info in the KnownPeerInfo since it was redundant. It was incrementally maintaining info that we could calculate from other canonical state. For the peer source it was being maintained incorrectly (the above point). So that removed the peer source and the advertising state.
  • I've stopped passing the KnownPeerInfo to the pick policy functions. We can add any of this info back if we need it later, but in an on-demand style as an attribute lookup function.
  • This meant adjusting the "available peers" calculations to use sets rather than maps.

@@ -92,6 +92,7 @@ belowTarget actions
    -- If we could connect except that there are no peers currently available
    -- then we return the next wakeup time (if any)
  | numEstablishedPeers + numConnectInProgress < targetNumberOfEstablishedPeers
  , Set.null (availableToConnect Set.\\ inProgressOrEstablishedPeers)
Contributor:

I think we don't need this because it is redundant.

The guard above was

| numEstablishedPeers + numConnectInProgress < targetNumberOfEstablishedPeers
, Set.size availableToConnect - numEstablishedPeers - numConnectInProgress > 0

and our guard here repeats the first clause. Since the previous guard did not fire, we already know that its second clause is false. Thus we know that

Set.size availableToConnect - numEstablishedPeers - numConnectInProgress = 0

(it cannot be negative since the established and in-progress ones are a subset of the available ones).

But what is this guard?

Set.null (availableToConnect Set.\\ inProgressOrEstablishedPeers)

Isn't it exactly saying that the number available to connect, minus the ones that are established or in progress is zero?

Again, we know this because the established and in-progress are a subset of the available, and they don't overlap with each other. So in this case counting becomes equivalent to the set operations.

And so this is exactly the negation of the previous guard, which we already knew!

QED.
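The counting-versus-set-difference step in this argument can be stated as a small QuickCheck property; the following is a simplified sketch of the claim (a single subset b of a), not a test from the PR:

import qualified Data.Set as Set
import           Test.QuickCheck

-- For b ⊆ a, "size a - size b == 0" says exactly the same thing as
-- "Set.null (a Set.\\ b)", which is the equivalence used above.
prop_counting_matches_setDiff :: [Int] -> [Int] -> Property
prop_counting_matches_setDiff xs ys =
    (Set.size a - Set.size b == 0) === Set.null (a Set.\\ b)
  where
    a = Set.fromList (xs ++ ys)   -- the available peers
    b = Set.fromList xs           -- a subset of a, by construction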

-- * the script returns 'Noop'
-- * peer demoted to 'PeerCold'
-- + the script returns 'Noop'
-- + peer demoted to 'PeerCold'
@coot (Contributor, Author):

I heard there are plans to fix this in Haddock :), so it will just generate a warning.

@coot (Contributor, Author) commented Mar 9, 2021:

btw, the policies used in the p2p-master branch are implemented here.

@coot (Contributor, Author) commented Mar 9, 2021:

LGTM from me (I cannot approve as I am the author of the PR).

@dcoutts (Contributor) commented Mar 9, 2021:

bors merge

@iohk-bors bot commented Mar 9, 2021

@iohk-bors bot merged commit f01599f into master on Mar 9, 2021
@iohk-bors bot deleted the coot/governor-fix branch on March 9, 2021 14:17
coot added a commit that referenced this pull request May 16, 2022
2526: p2p-governor: fix various P2P governor bugs (with tests) r=dcoutts a=coot



Co-authored-by: Marcin Szamotulski <profunctor@pm.me>
Co-authored-by: Duncan Coutts <duncan@well-typed.com>