
Wrong validation of LightBlock's commits #650

Closed
andrey-kuprianov opened this issue Oct 22, 2020 · 25 comments · Fixed by #652
Labels: bug (Something isn't working), light-client (Issues/features which involve the light client), tests, tla+, verification

Comments

@andrey-kuprianov
Contributor

andrey-kuprianov commented Oct 22, 2020

I have a failing MBT test for the LightClient verifier, with an empty commit in the LightBlock.

This is how the test fails:

    > step 2, expecting Invalid
      > lite: NotEnoughTrust(NotEnoughTrust(VotingPowerTally { total: 100, tallied: 0, trust_threshold: TrustThresholdFraction { numerator: 1, denominator: 3 } }))

The relevant test part is this:

[current |->
    [Commits |-> {},
      header |->
        [NextVS |-> { "n1", "n2", "n3", "n4" },
          VS |-> { "n2", "n3" },
          height |-> 5,
          lastCommit |-> { "n1", "n3" },
          time |-> 6]],
  now |-> 1401,
  verdict |-> "INVALID",
  verified |->
    [Commits |-> { "n1", "n2", "n4" },
      header |->
        [NextVS |-> { "n2", "n3" },
          VS |-> { "n1", "n2", "n4" },
          height |-> 2,
          lastCommit |-> { "n1", "n2", "n3", "n4" },
          time |-> 3]]]

current here is the untrusted header, and verified is the trusted header; both are submitted to the verifier.verify() function. From the point of view of the spec, and also from my personal understanding, an untrusted signed header with an empty commit should be rejected as invalid. The implementation, on the other hand, returns the NotEnoughTrust verdict. My understanding is that this comes from the specific order of executing checks in this code, probably chosen for performance reasons, as verifying the signatures is costly.

We could address this from different angles:

  • Change the spec to allow more behaviors in this case. While this is doable, in my view a signed header with an empty commit should still fall under the Invalid category, not under NotEnoughTrust.
  • Change the order of checks in the code; I don't know whether this is doable or acceptable from a performance point of view.
  • Add some lightweight checks that would verify, e.g., that the commit contains more than 1/3 of the voting power of the validator set, without actually verifying the signatures. This would be lightweight from a performance point of view, but would make the code a bit more complicated.
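For illustration, the third option could look roughly like the following sketch. All types and names here are mine, chosen for readability; they do not reflect the actual tendermint-rs API:

```rust
// Hypothetical, simplified sketch of a cheap pre-check: sum the voting power
// of the validators that appear (non-absent) in the commit, without verifying
// any signatures, and compare it against a threshold fraction.

#[derive(Clone)]
struct Validator {
    address: String,
    power: u64,
}

/// A commit entry: Some(address) if the validator signed, None if absent.
struct CommitSig {
    validator_address: Option<String>,
}

/// Returns true if the commit could possibly reach `numerator/denominator`
/// of the total voting power, counted by power rather than by validator count.
fn has_enough_potential_power(
    validators: &[Validator],
    commit_sigs: &[CommitSig],
    numerator: u64,
    denominator: u64,
) -> bool {
    let total: u64 = validators.iter().map(|v| v.power).sum();
    let tallied: u64 = commit_sigs
        .iter()
        .filter_map(|sig| sig.validator_address.as_ref())
        .filter_map(|addr| validators.iter().find(|v| &v.address == addr))
        .map(|v| v.power)
        .sum();
    // Cross-multiply to avoid fractions: tallied / total > numerator / denominator
    tallied * denominator > numerator * total
}

fn main() {
    let vals = vec![
        Validator { address: String::from("n2"), power: 50 },
        Validator { address: String::from("n3"), power: 50 },
    ];
    // An empty commit can never reach the 1/3 trust threshold.
    assert!(!has_enough_potential_power(&vals, &[], 1, 3));
}
```

Note that this pre-check cannot replace signature verification; it only rules out commits that could never reach the threshold even if every listed signature turned out to be valid.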

I think @romac should probably decide here; I am open to discussion.

P.S. The English spec (see especially the text below under "Details of the Functions") is also not very precise about how this is computed (e.g., in multiple places it talks about the number of validators instead of their voting power), so it should probably also be improved. (cc @josef-widder?)

@andrey-kuprianov added the bug, light-client, verification, tests, tla+ labels on Oct 22, 2020
@andrey-kuprianov
Contributor Author

Ok, now I understand more, and I think this is a real bug in the implementation.

I actually have two distinct tests, where there is an empty commit. The one above fails, but the one below passes:

[current |->
    [Commits |-> {},
      header |->
        [NextVS |-> { "n3", "n5", "n6", "n7", "n8" },
          VS |-> {},
          height |-> 3,
          lastCommit |->
            { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
          time |-> 2]],
  now |-> 4,
  verdict |-> "INVALID",
  verified |->
    [Commits |->
        { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
      header |->
        [NextVS |-> { "n1", "n10", "n2", "n4", "n9" },
          VS |->
            { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
          height |-> 1,
          lastCommit |-> {},
          time |-> 1]]]

This one has the outcome:

    > step 0, expecting Invalid
      > lite: Invalid(ImplementationSpecific("no signatures for commit"))

Examining this outcome points to CommitValidator.validate().

The reason why this check fires for the second case is that the validator set is empty here, but non-empty in the first case. This results in the following translations by Testgen:

First case:

          "commit": {
            "height": "5",
            "round": 1,
            "block_id": {
              "hash": "43656DB8FE2E7A7C7F0FBEFED072538FEBF7CF66447E850B60A5242E2E832732",
              "parts": {
                "total": 0,
                "hash": "0000000000000000000000000000000000000000000000000000000000000000"
              }
            },
            "signatures": [
              {
                "block_id_flag": 1,
                "validator_address": null,
                "timestamp": null,
                "signature": null
              },
              {
                "block_id_flag": 1,
                "validator_address": null,
                "timestamp": null,
                "signature": null
              }
            ]
          }
        },

Second case:

          "commit": {
            "height": "3",
            "round": 1,
            "block_id": {
              "hash": "DD774230DAE6853ED35710C7FCC735F33B9A97E09A8E25A7F5495DF61BF13727",
              "parts": {
                "total": 0,
                "hash": "0000000000000000000000000000000000000000000000000000000000000000"
              }
            },
            "signatures": []
          }
        },

As can be seen, Testgen models absent signatures as BlockIDFlagAbsent, and this case is not accounted for by CommitValidator.validate().
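In other words, an empty signature list and a list containing only BlockIDFlagAbsent entries should be treated the same way. A simplified sketch of such a check (illustrative types, not the actual CommitValidator code):

```rust
// Simplified sketch of the missing validation; the types here are
// illustrative stand-ins, not the actual tendermint-rs definitions.

#[derive(PartialEq)]
enum BlockIdFlag {
    Absent, // corresponds to block_id_flag = 1 in the JSON above
    Commit,
    Nil,
}

struct CommitSig {
    block_id_flag: BlockIdFlag,
}

/// True if at least one signature in the commit is non-absent. An empty
/// signature list and a list of only `Absent` entries both yield false,
/// so both commits above would be rejected as invalid.
fn has_present_signatures(signatures: &[CommitSig]) -> bool {
    signatures
        .iter()
        .any(|sig| sig.block_id_flag != BlockIdFlag::Absent)
}

fn main() {
    // The first test case: two signatures, both absent.
    let first = vec![
        CommitSig { block_id_flag: BlockIdFlag::Absent },
        CommitSig { block_id_flag: BlockIdFlag::Absent },
    ];
    assert!(!has_present_signatures(&first));
    // The second test case: no signatures at all.
    assert!(!has_present_signatures(&[]));
}
```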

@andrey-kuprianov changed the title from "Discrepancy between code and specs when verifying LightBlock's commit" to "Wrong validation of LightBlock's commits" on Oct 22, 2020
@konnov
Contributor

konnov commented Oct 22, 2020

Accepting an empty commit is definitely a bug. Amazing! How many tests did you have to generate to stumble upon that one?

@andrey-kuprianov
Contributor Author

If I correctly understand the implications, in the function verify_to_target(), incorrectly classifying the LightBlock as NotEnoughTrust instead of Invalid has pretty severe consequences: the block will be added to the light store, and on any further attempt to get the block for this height it will be fetched from the light store, i.e. it will be stuck there forever.

@romac could you please confirm whether I understand things right?
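To make the caching consequence concrete, here is a minimal sketch with hypothetical names (LightStore, get_or_fetch, the tag field), not the actual tendermint-rs API, of how a cached block shadows any later fetch for the same height:

```rust
// Minimal illustration of the consequence described above: once a block is
// cached in the light store, later lookups for that height return the cached
// copy instead of re-fetching from the peer, so a misclassified block is
// stuck there forever.

use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct LightBlock {
    height: u64,
    tag: &'static str,
}

#[derive(Default)]
struct LightStore {
    blocks: HashMap<u64, LightBlock>,
}

impl LightStore {
    /// Return the stored block for `height` if present;
    /// otherwise "fetch it from the peer" and cache it.
    fn get_or_fetch(&mut self, height: u64, fetch: impl FnOnce() -> LightBlock) -> LightBlock {
        self.blocks.entry(height).or_insert_with(fetch).clone()
    }
}

fn main() {
    let mut store = LightStore::default();
    // A bad block gets stored after being misclassified as NotEnoughTrust.
    store.blocks.insert(5, LightBlock { height: 5, tag: "bad" });
    // Even if the peer could now serve a good block, we get the cached bad one.
    let got = store.get_or_fetch(5, || LightBlock { height: 5, tag: "good" });
    assert_eq!(got.tag, "bad");
}
```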

@andrey-kuprianov
Contributor Author

@konnov I was in the process of replacing our JSON fixtures with TLA+ tests, and wrote one TLA+ test that checks for <2/3 validator power in the commit, and another test that has an empty commit. In fact, both of those were kind of "flickering" -- on one execution of Apalache a test passes, on another it fails... What I've definitely learned so far is not to ignore any failing test ;) Because it's so simple to rerun Apalache, and then the test will pass...

What this also shows, I think, is that we should eventually incorporate running Apalache as part of the CI, not only running the static auto-generated tests. Because each time you run the model checker, the produced test is a bit different.

@konnov
Contributor

konnov commented Oct 22, 2020

This also shows that we should think of initializing the random seeds in SMT. Flickering is good, but it is not good :)

@andrey-kuprianov
Contributor Author

andrey-kuprianov commented Oct 22, 2020

I think we probably want both modes: with seed initialization for predictability, and without it for variability ;)

@andrey-kuprianov
Contributor Author

@romac FYI: this PR now contains a static auto-generated test that fails as described above. The branch is updated to the latest master.

@romac
Member

romac commented Oct 22, 2020

Awesome work, @andrey-kuprianov!

> If I correctly understand the implications, in the function verify_to_target(), incorrectly classifying the LightBlock as NotEnoughTrust instead of Invalid has pretty severe consequences: the block will be added to the light store, and on any further attempt to get the block for this height it will be fetched from the light store, i.e. it will be stuck there forever.
>
> @romac could you please confirm whether I understand things right?

Yep, that's correct.

> @romac FYI: this PR now contains a static auto-generated test that fails as described above. The branch is updated to the latest master.

Awesome, will take a look at it asap!

@romac
Member

romac commented Oct 22, 2020

I have a fix for this, should I push it to #641?

@romac
Member

romac commented Oct 22, 2020

While we are at it, does anyone know what (if anything) should be done about this TODO?

// TODO: `self.commit.block_id` cannot be zero in the same way as in Go
// Clarify if this another encoding related issue

@romac
Member

romac commented Oct 22, 2020

Here is the result of running cargo test -- --nocapture after applying this patch to #641: https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f

@andrey-kuprianov Any idea what might be going wrong at https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f#file-gistfile1-txt-L388-L430?

@andrey-kuprianov
Contributor Author

> I have a fix for this, should I push it to #641?

Great, thanks @romac! Does it make sense to merge the fix to master? I will then merge master into my branch. Or, alternatively, pushing your fix to #641 is also fine; whatever you prefer.

@andrey-kuprianov
Contributor Author

> Here is the result of running cargo test -- --nocapture after applying this patch to #641: https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f
>
> @andrey-kuprianov Any idea what might be going wrong at https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f#file-gistfile1-txt-L388-L430?

It looks like you have tried to set up the full MBT on your Mac, and this doesn't work, again... It didn't work for Shivani and Greg, for different reasons; maybe you have a third reason ;) One reason might be that you need to compile Testgen separately, because the model-based tests depend on the Testgen binary, and this dependency is not discovered by Cargo automatically.

In any case, you don't need the full MBT with Apalache running to test against static files. If you do want to have the full MBT environment, and recompiling Testgen doesn't help, I am ready to assist in setting it up. It's not yet a very user-friendly process, unfortunately.

@romac
Member

romac commented Oct 22, 2020

> I have a fix for this, should I push it to #641?
>
> Great, thanks @romac! Does it make sense to merge the fix to master? I will then merge master into my branch. Or, alternatively, pushing your fix to #641 is also fine; whatever you prefer.

Okay I just opened a PR against master, let's now see what the CI says.

@romac
Member

romac commented Oct 22, 2020

> It looks like you have tried to set up the full MBT on your Mac, and this doesn't work, again... It didn't work for Shivani and Greg, for different reasons; maybe you have a third reason ;) One reason might be that you need to compile Testgen separately, because the model-based tests depend on the Testgen binary, and this dependency is not discovered by Cargo automatically.

Yeah perhaps my installed versions of Testgen and Apalache are out of date, will update them and try again :)

@konnov
Contributor

konnov commented Oct 22, 2020

> In any case, you don't need the full MBT with Apalache running to test against static files. If you do want to have the full MBT environment, and recompiling Testgen doesn't help, I am ready to assist in setting it up. It's not yet a very user-friendly process, unfortunately.

Would it make sense to generate new static tests every night with CI? In the course of several months, you will have myriads of tests.

@ebuchman
Member

What about a continuous test streaming service? So rather than generating new static tests every day or whatever, we're just continuously feeding tests from Apalache to the code, until the end of time?

@konnov
Contributor

konnov commented Oct 22, 2020

...and record the failing ones? Yes, that may work.

@andrey-kuprianov
Contributor Author

Unfortunately, the fix proposed in this commit doesn't work.

I have a more refined test; here is the relevant part of it:

TestLessThanTwoThirdsCommit ==
    /\ \E s \in DOMAIN history :
       LET CMS == history[s].current.Commits
           UVS == history[s].current.header.VS
           TVS == history[s].verified.header.VS
       IN
       /\ history[s].current.header.height > history[s].verified.header.height + 1
       /\ CMS /= ({} <: {STRING})
       /\ CMS \subseteq UVS
       /\ 3 * Cardinality(CMS) < 2 * Cardinality(UVS)
       /\ 3 * Cardinality(CMS \intersect TVS) < Cardinality(TVS)

and here is the relevant part of the failing test:

[current |->
    [Commits |-> { "n2", "n4", "n8" },
      header |->
        [NextVS |-> {"n8"},
          VS |-> { "n1", "n10", "n2", "n3", "n4", "n6", "n7", "n8", "n9" },
          height |-> 3,
          lastCommit |-> { "n3", "n6", "n8" },
          time |-> 3]],
  now |-> 1400,
  verdict |-> "INVALID",
  verified |->
    [Commits |->
        { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
      header |->
        [NextVS |-> { "n10", "n3", "n6", "n8" },
          VS |->
            { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
          height |-> 1,
          lastCommit |-> {},
          time |-> 1]]]
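For what it's worth, the predicate's arithmetic can be checked against this instance with a small sketch, counting validators rather than voting power since the model gives every node equal weight (the function name and structure are mine, not part of the codebase):

```rust
// Illustrative re-statement of the TestLessThanTwoThirdsCommit predicate:
// the commit is a non-empty subset of the untrusted validator set, holds
// less than 2/3 of it, and covers less than 1/3 of the trusted set.

use std::collections::HashSet;

fn less_than_two_thirds_commit(
    cms: &HashSet<&str>, // Commits of the current (untrusted) block
    uvs: &HashSet<&str>, // untrusted validator set
    tvs: &HashSet<&str>, // trusted validator set
) -> bool {
    let overlap = cms.intersection(tvs).count();
    !cms.is_empty()
        && cms.is_subset(uvs)
        // 3 * |CMS| < 2 * |UVS|: less than 2/3 of the untrusted set signed...
        && 3 * cms.len() < 2 * uvs.len()
        // ...and 3 * |CMS ∩ TVS| < |TVS|: less than 1/3 of the trusted set
        && 3 * overlap < tvs.len()
}

fn main() {
    let cms: HashSet<&str> = ["n2", "n4", "n8"].into_iter().collect();
    let uvs: HashSet<&str> =
        ["n1", "n10", "n2", "n3", "n4", "n6", "n7", "n8", "n9"].into_iter().collect();
    let tvs: HashSet<&str> =
        ["n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"].into_iter().collect();
    // 3 of 9 untrusted validators signed (< 2/3), covering
    // only 3 of the 10 trusted validators (< 1/3).
    assert!(less_than_two_thirds_commit(&cms, &uvs, &tvs));
}
```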

I think the right solution is to move this call into this position.

@romac
Member

romac commented Oct 23, 2020

Right, we should indeed only check the validators overlap (i.e., trust) if the block has already passed verification. I pushed a commit with the fix you suggested.
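The resulting ordering can be sketched as follows; all names and error strings here are illustrative, not the actual tendermint-rs verifier:

```rust
// Hypothetical sketch of the check ordering discussed above: the trust
// (validator-overlap) check only runs after the untrusted block has passed
// all validity checks, so an invalid commit can never be misreported as
// NotEnoughTrust.

#[derive(Debug, PartialEq)]
enum Verdict {
    Success,
    NotEnoughTrust,
    Invalid(&'static str),
}

// Pre-computed check results, standing in for the real validation logic.
struct UntrustedBlock {
    commit_is_well_formed: bool,          // e.g. has at least one present signature
    commit_has_quorum: bool,              // > 2/3 of the untrusted validator set signed
    overlap_meets_trust_threshold: bool,  // > 1/3 of the trusted set signed
}

fn verify(block: &UntrustedBlock) -> Verdict {
    // 1. Validity first: a malformed or under-signed commit is Invalid.
    if !block.commit_is_well_formed {
        return Verdict::Invalid("no present signatures in commit");
    }
    if !block.commit_has_quorum {
        return Verdict::Invalid("commit has less than 2/3 of voting power");
    }
    // 2. Only a valid block may be classified as NotEnoughTrust.
    if !block.overlap_meets_trust_threshold {
        return Verdict::NotEnoughTrust;
    }
    Verdict::Success
}

fn main() {
    // An empty commit now fails validity before trust is even considered.
    let empty_commit = UntrustedBlock {
        commit_is_well_formed: false,
        commit_has_quorum: false,
        overlap_meets_trust_threshold: false,
    };
    assert!(matches!(verify(&empty_commit), Verdict::Invalid(_)));
}
```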

@andrey-kuprianov
Contributor Author

Looks good, thanks:)

romac added a commit that referenced this issue Oct 23, 2020
…lid instead of invalid (#652)

* Ensure there is at least one non-absent signature in the commit. See #650

* Update changelog

* Improve readability by renaming `has_non_absent_signatures` to `has_present_signatures`

Co-authored-by: Thane Thomson <thane@informal.systems>

* Fix typo

Co-authored-by: Thane Thomson <thane@informal.systems>

* Only check for validators overlap after a block has been deemed valid

* Remove and ignore detailed.log file

* Remove outdated TODO

* Link to issue instead of PR in changelog

Co-authored-by: Thane Thomson <thane@informal.systems>
@andrey-kuprianov
Contributor Author

andrey-kuprianov commented Oct 24, 2020

Re: #652 (review)

That's cool! I am glad that the problem we've discovered helps to rethink the protocols; this is the best outcome possible. I agree that the current LightClient protocol doesn't look repairable to any satisfactory state -- for a malicious peer there will always be a possibility to trigger long computations. The idea of changing the client-server protocol also sounds very nice to me! It would mean that the misbehavior of a full node is detectable in one step -- if it provides a header that's not immediately verifiable by the LightClient.

One issue with what we've found is that our current single-step MBT tests look at only one step of the whole protocol, i.e. a very local view, which doesn't tell the whole story. So we need to do the bisection tests sooner rather than later. What also comes to my mind in this respect is that we probably want MBT to drive performance tests for us: it would generate different scenarios/schedules for the protocols, and besides checking logical correctness we would also measure wall-clock performance and store the outliers for later analysis. One more point for integrating MBT as a continuously running service generating tests.

@konnov
Contributor

konnov commented Oct 24, 2020

It is really cool that the bug provokes new thoughts. But we have to be a bit careful here. If we start increasing the protocol complexity only because we believe it works better in some cases, we may end up with a very complex protocol that is much harder to analyze. In the presence of faults, computational complexity becomes a nightmare.

There is also the detector protocol that works on top of this verification protocol. It runs the verifier multiple times. It is quite important not to accidentally break the assumptions about this composition.

@andrey-kuprianov
Contributor Author

andrey-kuprianov commented Oct 24, 2020

Yes, I agree that we should be careful. But my feeling is that this change would make the protocols and their analysis much simpler. In particular, there would be no bisection anymore, only one-step validation, and probably cross-validation to detect full nodes that might try to slow light clients down.

@ebuchman
Member

Note this is already how the protocol works for the on-chain IBC handler, which is the more important case anyway. We've been trying to align how we think about the light node to be closer to the IBC handler, and this would be a significant step in that direction as well.
