
consensus: do not assume the anchor point is genesis in prevPointAndBlockNo #1544

Merged: 2 commits merged into master on Feb 5, 2020

Conversation

@nfrisby (Contributor) commented Jan 31, 2020

Fixes the first item in IntersectMBO/ouroboros-consensus#726. (Edit: actually, it doesn't. See Issue 1584.)

Unfortunately I don't have a test case here, because the only repro I have only fails on my WIP branch for Issue 1489. On that branch, this commit does fix the repro.

I'm preparing an omnibus PR for 1489, but I thought I'd open this one too, since the fix seems clear in hindsight (unless I'm overlooking some invariant about the current chain here?) and it's in the library not just the test infrastructure.

@nfrisby nfrisby added the consensus issues related to ouroboros-consensus label Jan 31, 2020
@nfrisby nfrisby requested review from mrBliss and edsko January 31, 2020 01:42
@nfrisby nfrisby mentioned this pull request Jan 31, 2020
@mrBliss (Contributor) left a comment

LGTM

There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! 👍

Comment on lines 362 to 365
          (cc, cBno) <- lift $ atomically $
            (,) <$> ChainDB.getCurrentChain chainDB
                <*> ChainDB.getTipBlockNo chainDB
          let mPrev = prevPointAndBlockNo currentSlot cBno cc
A contributor commented:

What about the following:

          mPrev <- lift $ atomically $
            prevPointAndBlockNo currentSlot
                <$> ChainDB.getTipBlockNo chainDB
                <*> ChainDB.getCurrentChain chainDB

Because of laziness, we won't be doing the extra work in the transaction.
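For context, a self-contained toy illustrating the laziness point (the names are made up and this is not the ChainDB API): the STM transaction only reads the two variables, and the pure combining function stays a thunk that is forced after the transaction commits.

    import Control.Concurrent.STM (TVar, atomically, readTVar)

    -- Toy stand-in: 'combine' plays the role of prevPointAndBlockNo.
    readAndCombine :: TVar Int -> TVar [Int] -> IO Int
    readAndCombine tipVar chainVar =
        atomically $ combine <$> readTVar tipVar <*> readTVar chainVar
      where
        -- Pure and potentially expensive; thanks to laziness it is not
        -- evaluated inside the transaction, only when the caller demands
        -- the result.
        combine tip xs = tip + sum xs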

Another contributor replied:

Yes, let's do this.

@edsko (Contributor) commented Jan 31, 2020

> There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! +1

Hold on a second. Does this mean switch-to-fork as rollback-followed-by-roll-forward is observable, and we might produce a block after the rollback but before the roll-forward? That can't happen, can it? We just add those blocks to the chain DB, and it picks a different chain. Why would block forging be able to observe a rollback (as opposed to a switch-to-fork)?

@mrBliss (Contributor) commented Jan 31, 2020

> There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! +1
>
> Hold on a second. Does this mean switch-to-fork as rollback-followed-by-roll-forward is observable, and we might produce a block after the rollback but before the roll-forward? That can't happen, can it? We just add those blocks to the chain DB, and it picks a different chain. Why would block forging be able to observe a rollback (as opposed to a switch-to-fork)?

Hmm, yeah, that would be very strange 🤔. @nfrisby Can you give some more details about the scenario in which this happens?

@nfrisby (Contributor, Author) commented Jan 31, 2020

> Hmm, yeah, that would be very strange 🤔. @nfrisby Can you give some more details about the scenario in which this happens?

I'll dig in a bit and report back. As of yet, all I retained was "k = 1".

Edit: OK, I think I've sorted it, @mrBliss @edsko. The repro is:

      prop_simple_real_pbft_convergence NoEBBs (SecurityParam 1) TestConfig
        { numCoreNodes = ncn3
        , numSlots     = NumSlots 2
        , nodeJoinPlan = trivialNodeJoinPlan ncn3
        , nodeRestarts = NodeRestarts $ Map.fromList
            [(SlotNo 1,Map.fromList [(CoreNodeId 1,NodeRestart)])]
        , nodeTopology = meshNodeTopology ncn3
        , slotLengths  = defaultSlotLengths
        , initSeed     = ...
        }

The only remarkable part is that node 1 restarts in slot 1. Node 1 also leads that slot.

In the trace, I see that node 1 leads and forges a block in slot 1 before it restarts. After it restarts (there's currently no threadDelay or anything like that before a node restarts), it leads again in slot 1 and forges another block. So it's exercising the EQ alternative of the case expression.
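For readers without the diff at hand, the case expression in question has roughly the following shape; this is a simplified paraphrase pieced together from the review snippets further down, not the exact library code.

    prevPointAndBlockNo slot c = case c of
        -- the current chain fragment is empty
        Empty _   -> Right (genesisPoint, genesisBlockNo)
        c' :> hdr -> case blockSlot hdr `compare` slot of
          -- the tip is in an earlier slot: build on the tip
          LT -> Right (headerPoint hdr, blockNo hdr)
          -- the tip is in a future slot: refuse to forge
          GT -> Left (blockSlot hdr)
          -- the tip is already in the slot we lead (here: node 1's own
          -- pre-restart block): build on the tip's predecessor, i.e. look
          -- at c'
          EQ -> Right $ case c' of
            _ :> hdr' -> (headerPoint hdr', blockNo hdr')
            -- the assumption the repro violates: c' may be empty yet
            -- anchored at a block other than genesis
            Empty _   -> (genesisPoint, genesisBlockNo)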

Here are the two anchored chain fragments passed to prevPointAndBlockNo:

TraceNodeIsLeader (SlotNo {unSlotNo = 1}) (AnchoredFragment
  { anchorPoint = At (Block {blockPointSlot = SlotNo {unSlotNo = 0}, blockPointHash = ByronHash {unByronHash = AbstractHash 3c2a8234e7b7b68b749b9ffd2a5ca4b8dbd97d4ae08285524f109abaa967f980}})
  , unanchorFragment = ChainFragment (SFT (fromList
    [ByronHeader { byronHeaderRaw = ..., byronHeaderSlotNo = SlotNo {unSlotNo = 0}, byronHeaderHash = ByronHash {unByronHash = AbstractHash 6d0c191128571adf009ecb5b8c60834afe6713f7bf4fdd0b5edb57082718a1c6}}
    ]))})
TraceNodeIsLeader (SlotNo {unSlotNo = 1}) (AnchoredFragment
  { anchorPoint = At (Block {blockPointSlot = SlotNo {unSlotNo = 0}, blockPointHash = ByronHash {unByronHash = AbstractHash 6d0c191128571adf009ecb5b8c60834afe6713f7bf4fdd0b5edb57082718a1c6}})
  , unanchorFragment = ChainFragment (SFT (fromList
    [ByronHeader {byronHeaderRaw = ..., byronHeaderSlotNo = SlotNo {unSlotNo = 1}, byronHeaderHash = ByronHash {unByronHash = AbstractHash 9738c2b54489d606e097571c82ab0f7814ad42111814f0e54bfabdfeb66431dd}}
    ]))})

3c2a8234e is the EBB in slot 0. 6d0c1911285 is the subsequent block in slot 0. 9738c2b5448 is the first block that node 1 forges in slot 1.


The above explanation seems to justify this PR's current change to the EQ subcase.

So I think the only adjustment would be to change the outer Empty alternative to assert that the anchor is genesis. Agreed? (I suppose that's the one Edsko had in mind when he raised his objection.)

@nfrisby (Contributor, Author) commented Jan 31, 2020

> So I think the only adjustment would be to change the outer Empty alternative to assert that the anchor is genesis. Agreed? (I suppose that's the one Edsko had in mind when he raised his objection.)

Hmm. Could there be a scenario where the ChainDB was truncated by exactly k blocks? Or would that cause a restart/reopen/etc. such that the getCurrentChain AnchoredFragment would still be non-Empty?

@mrBliss (Contributor) commented Feb 4, 2020

> So I think the only adjustment would be to change the outer Empty alternative to assert that the anchor is genesis. Agreed? (I suppose that's the one Edsko had in mind when he raised his objection.)
>
> Hmm. Could there be a scenario where the ChainDB was truncated by exactly k blocks? Or would that cause a restart/reopen/etc. such that the getCurrentChain AnchoredFragment would still be non-Empty?

The chain of the ChainDB will never shrink while running, only when starting up. For example, when the whole VolatileDB folder is missing, the ChainDB can start with an empty fragment that is anchored at a point that is not genesis.
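A toy picture of that start-up state (made-up types, not the real ChainDB ones): the fragment of volatile blocks is empty, but its anchor is the ImmutableDB tip rather than genesis.

    -- Toy model only: the current chain as "anchor + up to k volatile blocks".
    data ToyFragment = ToyFragment
      { anchor :: String   -- ^ point of the block right before the fragment
      , blocks :: [String] -- ^ the most recent (volatile) blocks, at most k
      } deriving Show

    -- After losing the VolatileDB, the ChainDB starts from the ImmutableDB:
    afterLosingVolatileDB :: ToyFragment
    afterLosingVolatileDB = ToyFragment
      { anchor = "immutable tip (some block well past genesis)"
      , blocks = []   -- empty fragment, non-genesis anchor
      }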

@mrBliss (Contributor) commented Feb 4, 2020

I think this can be merged as-is or with my suggestion.

@edsko (Contributor) commented Feb 4, 2020

Ok, so, this can happen without EBBs or node restarts. Let k = 1. This means that under normal circumstances our fragment will be 1 block long. Now if another node manages to produce a block and send it to us before we manage to produce our own, we strip off the one block from our fragment and are left with an empty fragment. We should make sure that this scenario can occur in tests (#1569).

@edsko (Contributor) left a comment

This is not quite right. Will explain in more detail in a separate comment.

-    prevPointAndBlockNo slot c = case c of
-        Empty _ -> Right (genesisPoint, genesisBlockNo)
+    prevPointAndBlockNo slot bno c = case c of
+        Empty p -> Right $ mkPredecessor p bno
@mrBliss (Contributor) commented Feb 4, 2020

Suggested change:
-        Empty p -> Right $ mkPredecessor p bno
+        Empty p
+          | pointSlot p < At slot
+          -> Right (castPoint p, succ bno)
+          | otherwise
+          -> Left _ -- The new error message, see the comment above

-          -> Right (genesisPoint, genesisBlockNo)
+          -> Right $ case c' of
+               _ :> hdr' -> (headerPoint hdr', blockNo hdr')
+               Empty p   -> mkPredecessor p (blockNo hdr)
A contributor commented:

Suggested change:
-               Empty p -> mkPredecessor p (blockNo hdr)
+               Empty p -> (castPoint p, blockNo hdr)

And we won't even need mkPredecessor

@@ -498,10 +500,11 @@ forkBlockProduction maxBlockSizeOverride IS{..} BlockProduction{..} =
     -- another node was also elected leader and managed to produce a block
     -- before us, the header right before the one at the tip of the chain.
     prevPointAndBlockNo :: SlotNo
                         -> BlockNo
                         -> AnchoredFragment (Header blk)
                         -> Either SlotNo (Point blk, BlockNo)
A contributor commented:

Let's change this BlockNo into the block number of the block we're going to produce, not the block number of the predecessor, because there might not be a predecessor. The docstring needs to be updated and the name of this function too.

A contributor commented:

Instead of returning the slot from the future as Left, return the actual error, since we can fail not only because of a slot from the future, but also when there's already a block in the slot we're trying to produce a block for, and, moreover, that block is already immutable. This might happen in two cases:

  1. The clock moved backwards; on restart we ignored everything from the VolatileDB since it's all in the future, and now the tip of the ImmutableDB (= anchor point of our empty chain fragment) points to a block produced in the same slot we're trying to produce a block in.

  2. k = 0 and we already adopted a block from another leader of the same slot.

So introduce an actual error type for these two cases (use the above as part of its docstring) and add a new trace message for the extra case.
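A hypothetical sketch of what such an error type could look like (illustrative names only; SlotNo is a stand-in newtype here rather than the real slot type):

    import Data.Word (Word64)

    newtype SlotNo = SlotNo Word64
      deriving (Show, Eq, Ord)

    data CannotForge
      = -- | The tip of the current chain is in a slot after the current slot.
        BlockFromTheFuture SlotNo
      | -- | The slot we want to forge in is already occupied by an immutable
        -- block: either the clock moved backwards and everything in the
        -- VolatileDB was ignored on restart as being "from the future", or
        -- k = 0 and we already adopted a block from another leader of the
        -- same slot.
        SlotAlreadyImmutable SlotNo
      deriving Show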

@edsko (Contributor) commented Feb 4, 2020

So I discussed this in detail with @mrBliss. We left concrete suggestions, but here's what could go wrong with the PR as it stands:

  1. If the fragment is empty (say, due to data corruption and an empty volatile DB), then bno will be the tip of the immutable DB. That bno should then be the block number of the predecessor; we shouldn't try to take its predecessor (which is what currently happens).

  2. If the fragment is empty after stripping off a block, and we reach genesis, mkPredecessor must produce a block number for the predecessor, but there is none, and so it has little choice but to return genesisBlockNo. However, that is wrong: when we then call succ later, we will have skipped a block number.

I think the suggestions we made above will address both of these issues.

@mrBliss (Contributor) commented Feb 4, 2020

I have created #1570 for tracking the bugs.

@nfrisby (Contributor, Author) commented Feb 5, 2020

I rewrote it and rebased onto origin/master. The diff is big, but I think it's clearer this way.

I am seeing one failure in RealPBFT when I cherry-pick this onto my omnibus PR for Issue 1489. I'm not sure if it's due to this, but it might be. On the other hand, BFT, LeaderSchedule, PBFT, and DualPBFT all passed 40000 tests. The error involves ChainValidationDelegationSchedulingError, so there are plenty of other moving parts besides this PR that could be responsible for it.

I unfortunately need to prepare for the meeting in the morning, so I'm stopping on this for now.

@mrBliss (Contributor) left a comment

This looks right to me 👍

@edsko (Contributor) commented Feb 5, 2020

I'll take care of these minor changes.

@edsko (Contributor) commented Feb 5, 2020

bors r+

iohk-bors bot added a commit that referenced this pull request Feb 5, 2020
1544: consensus: do not assume the anchor point is genesis in prevPointAndBlockNo r=edsko a=nfrisby

Fixes the first item in #1511.

Unfortunately I don't have a test case here, because the only repro I have only fails on my WIP branch for Issue 1489. On that branch, this commit does fix the repro.

I'm preparing an omnibus PR for 1489, but I thought I'd open this one too, since the fix seems clear in hindsight (unless I'm overlooking some invariant about the current chain here?) and it's in the library not just the test infrastructure.

1560: Provisioning power-shell script r=coot a=coot

Using chocolatey install:
* ghc-8.6.5
* git (and git-bash)
* firefox
* ~openssh~

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
Co-authored-by: Edsko de Vries <edsko@well-typed.com>
Co-authored-by: Marcin Szamotulski <profunctor@pm.me>

@iohk-bors iohk-bors bot merged commit 576ead9 into master Feb 5, 2020
@iohk-bors iohk-bors bot deleted the nfrisby/issue-1511 branch February 5, 2020 09:43
@mrBliss mrBliss mentioned this pull request Feb 6, 2020
iohk-bors bot added a commit that referenced this pull request Mar 2, 2020
1506: Fix cloneBlockchainTime r=nfrisby a=nfrisby

Fixes #1489. Fixes #1524.

Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the
range [2 .. 10] (since the dual-ledger tests do that now) revealed several
issues.

Issues in the library, not just the test infrastructure:

  * Issue #1505 -- `removeTxs` cannot use the fast path when validating after
    removing, or else we might have dangling tx input references. Was fixed
    by #1565.

  * Issue #1511, bullet 1 (closed) -- The `Empty` cases in
    `prevPointAndBlockNo` were wrong. Recent PRs have addressed this: #1544 and
    #1589.

  * Issue #1543 (closed) -- A bracket in `registeredStream` was spoiled by an
    interruptible operation. Thomas' PR re-designed the vulnerability away. I
    think this is unrelated to the other changes; it was lurking and happened
    to pop up here just because I've been running hundreds of thousands of
    tests.

Issues only in the test infrastructure:

  * Issue #1489: let nodes lead when they join. This bug slipped in recently,
    when I added cloning of `BlockchainTime`s as part of the
    restarting/rekeying loop in the test infrastructure.

  * _PBFT reference simulator_. (Was masked by 1489.) Model competing 1-block
    chains in Ref.PBFT and use its results where applicable instead of the
    .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate
    results agree on `Ref.Nominal` slots.

    The Ref.PBFT module had been making assumptions that were accurate given
    the guards in RealPBFT generators, given the `k >= n` regime. But outside
    of that regime, when the security parameter exceeds the node count, it
    wasn't enough. Also, it couldn't be compared against the PBFT threadnet
    because of those assumptions.

  * _PBFT reference simulator_. (Cascade of above.) Add
    `definitelyEnoughBlocks` to confirm the "at least k blocks in 2k slots"
    invariant in `genRealPBFTNodeJoinPlan`. The existing guards in the RealPBFT
    generators are intentionally insufficient by themselves; this way we can
    optimize them to avoid O(n^2) complexity without risking divergence from
    the `suchThat`s.

  * _Origin corner-case_. (Was masked by 1489.) Discard DualPBFT tests that
    forge in slot 0. The current `cardano-ledger-specs` doesn't allow for that.
    My hypothesis is that `cvsLastSlot` would need to be able to represent
    origin.

  * _Dlg cert tx_. Adjust `genNodeRekeys` to respect "If node X rekeys in slot
    S and Y leads slot S+1, then either the topology must connect X and Y
    directly, or Y must join before slot S."

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add
    `rekeyOracle` and use it to determine which epoch number to record in the
    dlg cert. The correct value depends on which block the dlg cert tx will end
    up in, not when we first add it to our mempool.

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add the
    dlg cert tx to the mempool each time the ledger changes.

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>