New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consensus: do not assume the anchor point is genesis in prevPointAndBlockNo #1544
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! 👍
(cc, cBno) <- lift $ atomically $ | ||
(,) <$> ChainDB.getCurrentChain chainDB | ||
<*> ChainDB.getTipBlockNo chainDB | ||
let mPrev = prevPointAndBlockNo currentSlot cBno cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the following:
mPrev <- lift $ atomically $
prevPointAndBlockNo currentSlot
<$> ChainDB.getTipBlockNo chainDB
<*> ChainDB.getCurrentChain chainDB
Because of laziness, we won't be doing the extra work in the transaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's do this.
Hold on a second. Does this mean switch-to-fork as rollback-followed-by-foll-forward is observable, and we might produce a block after the rollback but before the roll-forward? That can't happen, can it? We just add those blocks to the chain DB, and it picks a different chain. Why would block forging be able to observe a rollback (as opposed to a switch-to-fork)? |
Hmm, yeah, that would be very strange 🤔. @nfrisby Can you give some more details about the scenario in which this happens? |
I'll dig in a bit and report back. As of yet, all I retained was " Edit: OK, I think I've sorted it @mrBliss @edsko . The repro is:
The only remarkable part is that node 1 restarts in slot 1. Node 1 also leads that slot. In the trace, I see that node 1 leads and forges a block in slot 1 before it restarts. After it restarts (there's currently no Here are the two anchored chain fragments passed to
3c2a8234e is the EBB in slot 0. 6d0c1911285 is the subsequent block in slot 0. 9738c2b5448 is the first block that node 1 forges in slot 1. The above explanation seems to justify this PR's current change to the So I think the only adjustment would be to change the outer |
Hmm. Could there be a scenario where the ChainDB was truncated by exactly |
The chain of the ChainDB will never shrink while running, only when starting up. For example, when the whole VolatileDB folder is missing, the ChainDB can start with an empty fragment that is anchored at a point that is not genesis. |
I think this can be merged as-is or with my suggestion. |
Ok, so, this can happen without EBBs or node restarts. Let |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite right. Will explain in more detail in a separate comment.
prevPointAndBlockNo slot c = case c of | ||
Empty _ -> Right (genesisPoint, genesisBlockNo) | ||
prevPointAndBlockNo slot bno c = case c of | ||
Empty p -> Right $ mkPredecessor p bno |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty p -> Right $ mkPredecessor p bno | |
Empty p | |
| pointSlot p < At slot | |
-> Right (castPoint p, succ bno) | |
| otherwise | |
-> Left _ -- The new error message, see the comment above |
-> Right (genesisPoint, genesisBlockNo) | ||
-> Right $ case c' of | ||
_ :> hdr' -> (headerPoint hdr', blockNo hdr') | ||
Empty p -> mkPredecessor p (blockNo hdr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty p -> mkPredecessor p (blockNo hdr) | |
Empty p -> (castPoint p, blockNo hdr) |
And we won't even need mkPredecessor
@@ -498,10 +500,11 @@ forkBlockProduction maxBlockSizeOverride IS{..} BlockProduction{..} = | |||
-- another node was also elected leader and managed to produce a block | |||
-- before us, the header right before the one at the tip of the chain. | |||
prevPointAndBlockNo :: SlotNo | |||
-> BlockNo | |||
-> AnchoredFragment (Header blk) | |||
-> Either SlotNo (Point blk, BlockNo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change this BlockNo
into the block number of the block we're going to produce, not the block number of the predecessor, because there might not be a predecessor. The docstring needs to be updated and the name of this function too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning the slot from the future as Left
, return the actual error, since we can fail not only because of a slot from the future, but also when there's already a block in the slot we're trying to produce a block for, and, moreover, that block is already immutable. This might happen in two cases: 1) the clock moved backwards, on restart we ignored everything from the VolatileDB since it's all in the future, and now the tip of the ImmutableDB (= anchor point of our empty chain fragment) points to a block produced in the same slot we're trying to produce a block in; 2) k = 0 and we already adopted a block from another leader of the same slot.
So introduce an actual error type for these two cases (use the above as part of its docstring) and add a new trace message for the extra case.
So discussed this in detail with @mrBliss . We left concrete suggestions, but here's what could go wrong with the PR as it stands:
I think the suggestions we made above will address both of these issues. |
I have created #1570 for tracking the bugs. |
b43da03
to
b31933f
Compare
I rewrote it and rebased onto origin/master. The diff is big, but I think it's clearer this way. I am seeing one failure in RealPBFT when I cherry-pick this onto my omnibus PR for Issue 1489. I'm not sure if it's due to this, but it might be. On the other hand, BFT, LeaderSchedule, PBFT, and DualPBFT all passed 40000 tests. The error involves I unfortunately need to prepare for the meeting in the morning, so I'm stopping on this for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks right to me 👍
I'll take care of these minor changes. |
bors r+ |
1544: consensus: do not assume the anchor point is genesis in prevPointAndBlockNo r=edsko a=nfrisby Fixes the first item in #1511. Unfortunately I don't have a test case here, because the only repro I have only fails on my WIP branch for Issue 1489. On that branch, this commit does fix the repro. I'm preparing an omnibus PR for 1489, but I thought I'd open this one too, since the fix seems clear in hindsight (unless I'm overlooking some invariant about the current chain here?) and it's in the library not just the test infrastructure. 1560: Provisioning power-shell script r=coot a=coot Using chocolatey install: * ghc-8.6.5 * git (and git-bash) * firefox * ~openssh~ Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io> Co-authored-by: Edsko de Vries <edsko@well-typed.com> Co-authored-by: Marcin Szamotulski <profunctor@pm.me>
1506: Fix cloneBlockchainTime r=nfrisby a=nfrisby Fixes #1489. Fixes #1524. Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the range [2 .. 10] since the dual-ledger tests do that now revealed several Issues. Issues in the library, not just the test infrastructure: * Issue #1505 -- `removeTxs` cannot use the fast path when validating after removing or else we might have dangling tx inputs references. Was fixed by #1565. * Issue #1511, bullet 1 (closed) -- The `Empty` cases in `prevPointAndBlockNo` were wrong. Recent PRs have addressed this: #1544 and #1589. * Issue #1543 (closed) -- A bracket in `registeredStream` was spoiled by an interruptible operation. Thomas' PR re-designed the vulnerability away. I think this is unrelated to the other changes; it was lurking and happened to pop up here just because I've been running hundreds of thousands of tests. Issues only in the test infrastructure: * Issue #1489: let nodes lead when they join. This bug slipped in recently, when I added cloning of `BlockchainTime`s as part of the restarting/rekeying loop in the test infrastructure. * _PBFT reference simulator_. (Was masked by 1489.) Model competing 1-block chains in Ref.PBFT and use its results where applicable instead of the .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate results agree on `Ref.Nominal` slots. The Ref.PBFT module had been making assumptions that were accurate given the guards in RealPBFT generators, given the `k >= n` regime. But outside of that regime, when the security parameter exceeds the node count, it wasn't enough. Also, it couldn't be compared against the PBFT threadnet because of those assumptions. * _PBFT reference simulator_. (Cascade of above.) Add `definitelyEnoughBlocks` to confirm the "at least k blocks in 2k slots" invariant in `genRealPBFTNodeJoinPlan`. The existing guards in the RealPBFT generators are intentionally insufficient by themselves; this way we can optimize them to avoid O(n^2) complexity without risking divergence from the `suchThat`s. * _Origin corner-case_. (Was masked by 1489.) Discard DualPBFT tests that forge in slot 0. The current `cardano-ledger-specs` doesn't allow for that. My hypothesis is that `cvsLastSlot` would need to be able to represent origin. * _Dlg cert tx_. Adjust `genNodeRekeys` to respect "If node X rekeys in slot S and Y leads slot S+1, then either the topology must connect X and Y directly, or Y must join before slot S." * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add `rekeyOracle` and use it to determine which epoch number to record in the dlg cert. The correct value depends on which block the dlg cert tx will end up in, not when we first add it to our mempool. * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add the dlg cert tx to the mempool each time the ledger changes. Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
Fixes the first item in IntersectMBO/ouroboros-consensus#726. (Edit: actually, it doesn't. See Issue 1584.)
Unfortunately I don't have a test case here, because the only repro I have only fails on my WIP branch for Issue 1489. On that branch, this commit does fix the repro.
I'm preparing an omnibus PR for 1489, but I thought I'd open this one too, since the fix seems clear in hindsight (unless I'm overlooking some invariant about the current chain here?) and it's in the library not just the test infrastructure.