
consensus: do not assume the anchor point is genesis in prevPointAndBlockNo #1544

Merged: 2 commits merged into master on Feb 5, 2020

Conversation

@nfrisby (Contributor) commented Jan 31, 2020

Fixes the first item in IntersectMBO/ouroboros-consensus#726. (Edit: actually, it doesn't. See Issue 1584.)

Unfortunately I don't have a test case here, because the only repro I have only fails on my WIP branch for Issue 1489. On that branch, this commit does fix the repro.

I'm preparing an omnibus PR for 1489, but I thought I'd open this one too, since the fix seems clear in hindsight (unless I'm overlooking some invariant about the current chain here?) and it's in the library not just the test infrastructure.

@nfrisby nfrisby added the consensus issues related to ouroboros-consensus label Jan 31, 2020
@nfrisby nfrisby requested review from mrBliss and edsko January 31, 2020 01:42
@nfrisby nfrisby mentioned this pull request Jan 31, 2020
@mrBliss (Contributor) left a comment

LGTM

There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! 👍

Comment on lines 362 to 365
          (cc, cBno) <- lift $ atomically $
            (,) <$> ChainDB.getCurrentChain chainDB
                <*> ChainDB.getTipBlockNo chainDB
          let mPrev = prevPointAndBlockNo currentSlot cBno cc
A contributor commented:

What about the following:

          mPrev <- lift $ atomically $
            prevPointAndBlockNo currentSlot
                <$> ChainDB.getTipBlockNo chainDB
                <*> ChainDB.getCurrentChain chainDB

Because of laziness, we won't be doing the extra work in the transaction.
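For context, a self-contained toy illustrating the laziness point (the names are made up and this is not the ChainDB API): the STM transaction only reads the two variables, and the pure combining function stays a thunk that is forced after the transaction commits.

    import Control.Concurrent.STM (TVar, atomically, readTVar)

    -- Toy stand-in: 'combine' plays the role of prevPointAndBlockNo.
    readAndCombine :: TVar Int -> TVar [Int] -> IO Int
    readAndCombine tipVar chainVar =
        atomically $ combine <$> readTVar tipVar <*> readTVar chainVar
      where
        -- Pure and potentially expensive; thanks to laziness it is not
        -- evaluated inside the transaction, only when the caller demands
        -- the result.
        combine tip xs = tip + sum xs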

Another contributor replied:

Yes, let's do this.

@edsko (Contributor) commented Jan 31, 2020

> There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! +1

Hold on a second. Does this mean switch-to-fork as rollback-followed-by-roll-forward is observable, and we might produce a block after the rollback but before the roll-forward? That can't happen, can it? We just add those blocks to the chain DB, and it picks a different chain. Why would block forging be able to observe a rollback (as opposed to a switch-to-fork)?

@mrBliss (Contributor) commented Jan 31, 2020

> There must have been a maximal rollback (so that we have an empty fragment not anchored at genesis) before we try to forge a block. Nice find! +1
>
> Hold on a second. Does this mean switch-to-fork as rollback-followed-by-roll-forward is observable, and we might produce a block after the rollback but before the roll-forward? That can't happen, can it? We just add those blocks to the chain DB, and it picks a different chain. Why would block forging be able to observe a rollback (as opposed to a switch-to-fork)?

Hmm, yeah, that would be very strange 🤔. @nfrisby Can you give some more details about the scenario in which this happens?

@nfrisby (Contributor, Author) commented Jan 31, 2020

> Hmm, yeah, that would be very strange 🤔. @nfrisby Can you give some more details about the scenario in which this happens?

I'll dig in a bit and report back. As of yet, all I retained was "k = 1".

Edit: OK, I think I've sorted it, @mrBliss @edsko. The repro is:

      prop_simple_real_pbft_convergence NoEBBs (SecurityParam 1) TestConfig
        { numCoreNodes = ncn3
        , numSlots     = NumSlots 2
        , nodeJoinPlan = trivialNodeJoinPlan ncn3
        , nodeRestarts = NodeRestarts $ Map.fromList
            [(SlotNo 1,Map.fromList [(CoreNodeId 1,NodeRestart)])]
        , nodeTopology = meshNodeTopology ncn3
        , slotLengths  = defaultSlotLengths
        , initSeed     = ...
        }

The only remarkable part is that node 1 restarts in slot 1. Node 1 also leads that slot.

In the trace, I see that node 1 leads and forges a block in slot 1 before it restarts. After it restarts (there's currently no threadDelay or anything like that before a node restarts), it leads again in slot 1 and forges another block. So it's exercising the EQ alternative of the case expression.
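For readers without the diff at hand, the case expression in question has roughly the following shape; this is a simplified paraphrase pieced together from the review snippets further down, not the exact library code.

    prevPointAndBlockNo slot c = case c of
        -- the current chain fragment is empty
        Empty _   -> Right (genesisPoint, genesisBlockNo)
        c' :> hdr -> case blockSlot hdr `compare` slot of
          -- the tip is in an earlier slot: build on the tip
          LT -> Right (headerPoint hdr, blockNo hdr)
          -- the tip is in a future slot: refuse to forge
          GT -> Left (blockSlot hdr)
          -- the tip is already in the slot we lead (here: node 1's own
          -- pre-restart block): build on the tip's predecessor, i.e. look
          -- at c'
          EQ -> Right $ case c' of
            _ :> hdr' -> (headerPoint hdr', blockNo hdr')
            -- the assumption the repro violates: c' may be empty yet
            -- anchored at a block other than genesis
            Empty _   -> (genesisPoint, genesisBlockNo)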

Here are the two anchored chain fragments passed to prevPointAndBlockNo:

TraceNodeIsLeader (SlotNo {unSlotNo = 1}) (AnchoredFragment
  { anchorPoint = At (Block {blockPointSlot = SlotNo {unSlotNo = 0}, blockPointHash = ByronHash {unByronHash = AbstractHash 3c2a8234e7b7b68b749b9ffd2a5ca4b8dbd97d4ae08285524f109abaa967f980}})
  , unanchorFragment = ChainFragment (SFT (fromList
    [ByronHeader { byronHeaderRaw = ..., byronHeaderSlotNo = SlotNo {unSlotNo = 0}, byronHeaderHash = ByronHash {unByronHash = AbstractHash 6d0c191128571adf009ecb5b8c60834afe6713f7bf4fdd0b5edb57082718a1c6}}
    ]))})
TraceNodeIsLeader (SlotNo {unSlotNo = 1}) (AnchoredFragment
  { anchorPoint = At (Block {blockPointSlot = SlotNo {unSlotNo = 0}, blockPointHash = ByronHash {unByronHash = AbstractHash 6d0c191128571adf009ecb5b8c60834afe6713f7bf4fdd0b5edb57082718a1c6}})
  , unanchorFragment = ChainFragment (SFT (fromList
    [ByronHeader {byronHeaderRaw = ..., byronHeaderSlotNo = SlotNo {unSlotNo = 1}, byronHeaderHash = ByronHash {unByronHash = AbstractHash 9738c2b54489d606e097571c82ab0f7814ad42111814f0e54bfabdfeb66431dd}}
    ]))})

3c2a8234e is the EBB in slot 0. 6d0c1911285 is the subsequent block in slot 0. 9738c2b5448 is the first block that node 1 forges in slot 1.


The above explanation seems to justify this PR's current change to the EQ subcase.

So I think the only adjustment would be to change the outer Empty alternative to assert that the anchor is genesis. Agreed? (I suppose that's the one Edsko had in mind when he raised his objection.)

@nfrisby (Contributor, Author) commented Jan 31, 2020

> So I think the only adjustment would be to change the outer Empty alternative to assert that the anchor is genesis. Agreed? (I suppose that's the one Edsko had in mind when he raised his objection.)

Hmm. Could there be a scenario where the ChainDB was truncated by exactly k blocks? Or would that cause a restart/reopen/etc. such that the getCurrentChain AnchoredFragment would still be non-Empty?

@mrBliss (Contributor) commented Feb 4, 2020

> So I think the only adjustment would be to change the outer Empty alternative to assert that the anchor is genesis. Agreed? (I suppose that's the one Edsko had in mind when he raised his objection.)
>
> Hmm. Could there be a scenario where the ChainDB was truncated by exactly k blocks? Or would that cause a restart/reopen/etc. such that the getCurrentChain AnchoredFragment would still be non-Empty?

The chain of the ChainDB will never shrink while running, only when starting up. For example, when the whole VolatileDB folder is missing, the ChainDB can start with an empty fragment that is anchored at a point that is not genesis.
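A toy picture of that start-up state (made-up types, not the real ChainDB ones): the fragment of volatile blocks is empty, but its anchor is the ImmutableDB tip rather than genesis.

    -- Toy model only: the current chain as "anchor + up to k volatile blocks".
    data ToyFragment = ToyFragment
      { anchor :: String   -- ^ point of the block right before the fragment
      , blocks :: [String] -- ^ the most recent (volatile) blocks, at most k
      } deriving Show

    -- After losing the VolatileDB, the ChainDB starts from the ImmutableDB:
    afterLosingVolatileDB :: ToyFragment
    afterLosingVolatileDB = ToyFragment
      { anchor = "immutable tip (some block well past genesis)"
      , blocks = []   -- empty fragment, non-genesis anchor
      }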

@mrBliss (Contributor) commented Feb 4, 2020

I think this can be merged as-is or with my suggestion.

@edsko (Contributor) commented Feb 4, 2020

Ok, so, this can happen without EBBs or node restarts. Let k = 1. This means that under normal circumstances our fragment will be 1 block long. Now if another node manages to produce a block and send it to us before we manage to produce our own, we strip off the one block from our fragment and are left with an empty fragment. We should make sure that this scenario can occur in tests (#1569).

@edsko (Contributor) left a comment

This is not quite right. Will explain in more detail in a separate comment.

-    prevPointAndBlockNo slot c = case c of
-        Empty _ -> Right (genesisPoint, genesisBlockNo)
+    prevPointAndBlockNo slot bno c = case c of
+        Empty p -> Right $ mkPredecessor p bno
@mrBliss (Contributor) commented Feb 4, 2020

Suggested change:
-        Empty p -> Right $ mkPredecessor p bno
+        Empty p
+          | pointSlot p < At slot
+          -> Right (castPoint p, succ bno)
+          | otherwise
+          -> Left _ -- The new error message, see the comment above

-          -> Right (genesisPoint, genesisBlockNo)
+          -> Right $ case c' of
+               _ :> hdr' -> (headerPoint hdr', blockNo hdr')
+               Empty p   -> mkPredecessor p (blockNo hdr)
A contributor commented:

Suggested change:
-               Empty p -> mkPredecessor p (blockNo hdr)
+               Empty p -> (castPoint p, blockNo hdr)

And we won't even need mkPredecessor

@@ -498,10 +500,11 @@ forkBlockProduction maxBlockSizeOverride IS{..} BlockProduction{..} =
     -- another node was also elected leader and managed to produce a block
     -- before us, the header right before the one at the tip of the chain.
     prevPointAndBlockNo :: SlotNo
                         -> BlockNo
                         -> AnchoredFragment (Header blk)
                         -> Either SlotNo (Point blk, BlockNo)
A contributor commented:

Let's change this BlockNo into the block number of the block we're going to produce, not the block number of the predecessor, because there might not be a predecessor. The docstring needs to be updated and the name of this function too.

A contributor commented:

Instead of returning the slot from the future as Left, return the actual error, since we can fail not only because of a slot from the future, but also when there's already a block in the slot we're trying to produce a block for, and, moreover, that block is already immutable. This might happen in two cases:

  1. The clock moved backwards; on restart we ignored everything from the VolatileDB since it's all in the future, and now the tip of the ImmutableDB (= anchor point of our empty chain fragment) points to a block produced in the same slot we're trying to produce a block in.

  2. k = 0 and we already adopted a block from another leader of the same slot.

So introduce an actual error type for these two cases (use the above as part of its docstring) and add a new trace message for the extra case.
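A hypothetical sketch of what such an error type could look like (illustrative names only; SlotNo is a stand-in newtype here rather than the real slot type):

    import Data.Word (Word64)

    newtype SlotNo = SlotNo Word64
      deriving (Show, Eq, Ord)

    data CannotForge
      = -- | The tip of the current chain is in a slot after the current slot.
        BlockFromTheFuture SlotNo
      | -- | The slot we want to forge in is already occupied by an immutable
        -- block: either the clock moved backwards and everything in the
        -- VolatileDB was ignored on restart as being "from the future", or
        -- k = 0 and we already adopted a block from another leader of the
        -- same slot.
        SlotAlreadyImmutable SlotNo
      deriving Show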

@edsko (Contributor) commented Feb 4, 2020

So I discussed this in detail with @mrBliss. We left concrete suggestions, but here's what could go wrong with the PR as it stands:

  1. If the fragment is empty (say, due to data corruption and an empty volatile DB), then bno will be the tip of the immutable DB. That bno should then be the block number of the predecessor; we shouldn't try to take its predecessor (which is what currently happens).

  2. If the fragment is empty after stripping off a block, and we reach genesis, mkPredecessor must produce a block number for the predecessor, but there is none, and so it has little choice but to return genesisBlockNo. However, that is wrong: when we then call succ later, we will have skipped a block number.

I think the suggestions we made above will address both of these issues.

@mrBliss (Contributor) commented Feb 4, 2020

I have created #1570 for tracking the bugs.

@nfrisby (Contributor, Author) commented Feb 5, 2020

I rewrote it and rebased onto origin/master. The diff is big, but I think it's clearer this way.

I am seeing one failure in RealPBFT when I cherry-pick this onto my omnibus PR for Issue 1489. I'm not sure if it's due to this, but it might be. On the other hand, BFT, LeaderSchedule, PBFT, and DualPBFT all passed 40000 tests. The error involves ChainValidationDelegationSchedulingError, so there are plenty of other moving parts besides this PR that could be responsible for it.

I unfortunately need to prepare for the meeting in the morning, so I'm stopping on this for now.

@mrBliss (Contributor) left a comment

This looks right to me 👍

@edsko (Contributor) commented Feb 5, 2020

I'll take care of these minor changes.

@edsko (Contributor) commented Feb 5, 2020

bors r+

iohk-bors bot added a commit that referenced this pull request Feb 5, 2020
1544: consensus: do not assume the anchor point is genesis in prevPointAndBlockNo r=edsko a=nfrisby

Fixes the first item in #1511.

Unfortunately I don't have a test case here, because the only repro I have only fails on my WIP branch for Issue 1489. On that branch, this commit does fix the repro.

I'm preparing an omnibus PR for 1489, but I thought I'd open this one too, since the fix seems clear in hindsight (unless I'm overlooking some invariant about the current chain here?) and it's in the library not just the test infrastructure.

1560: Provisioning power-shell script r=coot a=coot

Using chocolatey install:
* ghc-8.6.5
* git (and git-bash)
* firefox
* ~openssh~

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
Co-authored-by: Edsko de Vries <edsko@well-typed.com>
Co-authored-by: Marcin Szamotulski <profunctor@pm.me>

@iohk-bors iohk-bors bot merged commit 576ead9 into master Feb 5, 2020
@iohk-bors iohk-bors bot deleted the nfrisby/issue-1511 branch February 5, 2020 09:43
@mrBliss mrBliss mentioned this pull request Feb 6, 2020
iohk-bors bot added a commit that referenced this pull request Mar 2, 2020
1506: Fix cloneBlockchainTime r=nfrisby a=nfrisby

Fixes #1489. Fixes #1524.

Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the
range [2 .. 10] (since the dual-ledger tests do that now) revealed several
issues.

Issues in the library, not just the test infrastructure:

  * Issue #1505 -- `removeTxs` cannot use the fast path when validating after
    removing, or else we might have dangling tx input references. Was fixed
    by #1565.

  * Issue #1511, bullet 1 (closed) -- The `Empty` cases in
    `prevPointAndBlockNo` were wrong. Recent PRs have addressed this: #1544 and
    #1589.

  * Issue #1543 (closed) -- A bracket in `registeredStream` was spoiled by an
    interruptible operation. Thomas' PR re-designed the vulnerability away. I
    think this is unrelated to the other changes; it was lurking and happened
    to pop up here just because I've been running hundreds of thousands of
    tests.

Issues only in the test infrastructure:

  * Issue #1489: let nodes lead when they join. This bug slipped in recently,
    when I added cloning of `BlockchainTime`s as part of the
    restarting/rekeying loop in the test infrastructure.

  * _PBFT reference simulator_. (Was masked by 1489.) Model competing 1-block
    chains in Ref.PBFT and use its results where applicable instead of the
    .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate
    results agree on `Ref.Nominal` slots.

    The Ref.PBFT module had been making assumptions that were accurate given
    the guards in RealPBFT generators, given the `k >= n` regime. But outside
    of that regime, when the security parameter exceeds the node count, it
    wasn't enough. Also, it couldn't be compared against the PBFT threadnet
    because of those assumptions.

  * _PBFT reference simulator_. (Cascade of above.) Add
    `definitelyEnoughBlocks` to confirm the "at least k blocks in 2k slots"
    invariant in `genRealPBFTNodeJoinPlan`. The existing guards in the RealPBFT
    generators are intentionally insufficient by themselves; this way we can
    optimize them to avoid O(n^2) complexity without risking divergence from
    the `suchThat`s.

  * _Origin corner-case_. (Was masked by 1489.) Discard DualPBFT tests that
    forge in slot 0. The current `cardano-ledger-specs` doesn't allow for that.
    My hypothesis is that `cvsLastSlot` would need to be able to represent
    origin.

  * _Dlg cert tx_. Adjust `genNodeRekeys` to respect "If node X rekeys in slot
    S and Y leads slot S+1, then either the topology must connect X and Y
    directly, or Y must join before slot S."

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add
    `rekeyOracle` and use it to determine which epoch number to record in the
    dlg cert. The correct value depends on which block the dlg cert tx will end
    up in, not when we first add it to our mempool.

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add the
    dlg cert tx to the mempool each time the ledger changes.

Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>