
Wrong validation of LightBlock's commits #650

Closed
andrey-kuprianov opened this issue Oct 22, 2020 · 25 comments · Fixed by #652
Labels: bug (Something isn't working), light-client (Issues/features which involve the light client), tests, tla+, verification

Comments

@andrey-kuprianov
Contributor

andrey-kuprianov commented Oct 22, 2020

I have a failing MBT test for the LightClient verifier, with an empty commit in the LightBlock.

This is how the test fails:

    > step 2, expecting Invalid
      > lite: NotEnoughTrust(NotEnoughTrust(VotingPowerTally { total: 100, tallied: 0, trust_threshold: TrustThresholdFraction { numerator: 1, denominator: 3 } }))

The relevant test part is this:

[current |->
    [Commits |-> {},
      header |->
        [NextVS |-> { "n1", "n2", "n3", "n4" },
          VS |-> { "n2", "n3" },
          height |-> 5,
          lastCommit |-> { "n1", "n3" },
          time |-> 6]],
  now |-> 1401,
  verdict |-> "INVALID",
  verified |->
    [Commits |-> { "n1", "n2", "n4" },
      header |->
        [NextVS |-> { "n2", "n3" },
          VS |-> { "n1", "n2", "n4" },
          height |-> 2,
          lastCommit |-> { "n1", "n2", "n3", "n4" },
          time |-> 3]]]

current here is the untrusted header, and verified is the trusted header; both are submitted to the verifier.verify() function. From the point of view of the spec, and also from my personal understanding, an untrusted signed header with an empty commit should be rejected as invalid. The implementation, on the other hand, returns the NotEnoughTrust verdict. My understanding is that this comes from the specific order of executing checks in this code, probably chosen for performance reasons, as verifying the signatures is costly.

We could address this from different angles:

  • Change the spec to allow more behaviors in this case. While this is doable, in my view a signed header with an empty commit should still fall under the Invalid category, not under NotEnoughTrust.
  • Change the order of checks in the code; I don't know whether this is doable or acceptable from a performance point of view.
  • Add some lightweight checks that would verify, e.g., that the commit contains more than 1/3 of the voting power of the validator set, without actually verifying the signatures. This would be lightweight from a performance point of view, but would make the code a bit more complicated.
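For illustration, the third option could look roughly like the following sketch. All types and names here are mine, chosen for readability; they do not reflect the actual tendermint-rs API:

```rust
// Hypothetical, simplified sketch of a cheap pre-check: sum the voting power
// of the validators that appear (non-absent) in the commit, without verifying
// any signatures, and compare it against a threshold fraction.

#[derive(Clone)]
struct Validator {
    address: String,
    power: u64,
}

/// A commit entry: Some(address) if the validator signed, None if absent.
struct CommitSig {
    validator_address: Option<String>,
}

/// Returns true if the commit could possibly reach `numerator/denominator`
/// of the total voting power, counted by power rather than by validator count.
fn has_enough_potential_power(
    validators: &[Validator],
    commit_sigs: &[CommitSig],
    numerator: u64,
    denominator: u64,
) -> bool {
    let total: u64 = validators.iter().map(|v| v.power).sum();
    let tallied: u64 = commit_sigs
        .iter()
        .filter_map(|sig| sig.validator_address.as_ref())
        .filter_map(|addr| validators.iter().find(|v| &v.address == addr))
        .map(|v| v.power)
        .sum();
    // Cross-multiply to avoid fractions: tallied / total > numerator / denominator
    tallied * denominator > numerator * total
}

fn main() {
    let vals = vec![
        Validator { address: String::from("n2"), power: 50 },
        Validator { address: String::from("n3"), power: 50 },
    ];
    // An empty commit can never reach the 1/3 trust threshold.
    assert!(!has_enough_potential_power(&vals, &[], 1, 3));
}
```

Note that this pre-check cannot replace signature verification; it only rules out commits that could never reach the threshold even if every listed signature turned out to be valid.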

I think @romac should probably decide here; I am open to discussion.

P.S. The English spec (see especially the text below under "Details of the Functions") is also not very precise about how this is computed (e.g., in multiple places it talks about the number of validators instead of their voting power), so it should probably also be improved. (cc @josef-widder?)

@andrey-kuprianov added the bug, light-client, verification, tests, tla+ labels on Oct 22, 2020
@andrey-kuprianov
Contributor Author

Ok, now I understand more, and I think this is a real bug in the implementation.

I actually have two distinct tests, where there is an empty commit. The one above fails, but the one below passes:

[current |->
    [Commits |-> {},
      header |->
        [NextVS |-> { "n3", "n5", "n6", "n7", "n8" },
          VS |-> {},
          height |-> 3,
          lastCommit |->
            { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
          time |-> 2]],
  now |-> 4,
  verdict |-> "INVALID",
  verified |->
    [Commits |->
        { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
      header |->
        [NextVS |-> { "n1", "n10", "n2", "n4", "n9" },
          VS |->
            { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
          height |-> 1,
          lastCommit |-> {},
          time |-> 1]]]

This one has the outcome:

    > step 0, expecting Invalid
      > lite: Invalid(ImplementationSpecific("no signatures for commit"))

Examining this outcome points to CommitValidator.validate().

The reason why this check fires for the second case is that the validator set is empty here, but non-empty in the first case. This results in the following translations by Testgen:

First case:

          "commit": {
            "height": "5",
            "round": 1,
            "block_id": {
              "hash": "43656DB8FE2E7A7C7F0FBEFED072538FEBF7CF66447E850B60A5242E2E832732",
              "parts": {
                "total": 0,
                "hash": "0000000000000000000000000000000000000000000000000000000000000000"
              }
            },
            "signatures": [
              {
                "block_id_flag": 1,
                "validator_address": null,
                "timestamp": null,
                "signature": null
              },
              {
                "block_id_flag": 1,
                "validator_address": null,
                "timestamp": null,
                "signature": null
              }
            ]
          }
        },

Second case:

          "commit": {
            "height": "3",
            "round": 1,
            "block_id": {
              "hash": "DD774230DAE6853ED35710C7FCC735F33B9A97E09A8E25A7F5495DF61BF13727",
              "parts": {
                "total": 0,
                "hash": "0000000000000000000000000000000000000000000000000000000000000000"
              }
            },
            "signatures": []
          }
        },

As can be seen, Testgen models absent signatures as BlockIDFlagAbsent, and this case is not accounted for by CommitValidator.validate().
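In other words, an empty signature list and a list containing only BlockIDFlagAbsent entries should be treated the same way. A simplified sketch of such a check (illustrative types, not the actual CommitValidator code):

```rust
// Simplified sketch of the missing validation; the types here are
// illustrative stand-ins, not the actual tendermint-rs definitions.

#[derive(PartialEq)]
enum BlockIdFlag {
    Absent, // corresponds to block_id_flag = 1 in the JSON above
    Commit,
    Nil,
}

struct CommitSig {
    block_id_flag: BlockIdFlag,
}

/// True if at least one signature in the commit is non-absent. An empty
/// signature list and a list of only `Absent` entries both yield false,
/// so both commits above would be rejected as invalid.
fn has_present_signatures(signatures: &[CommitSig]) -> bool {
    signatures
        .iter()
        .any(|sig| sig.block_id_flag != BlockIdFlag::Absent)
}

fn main() {
    // The first test case: two signatures, both absent.
    let first = vec![
        CommitSig { block_id_flag: BlockIdFlag::Absent },
        CommitSig { block_id_flag: BlockIdFlag::Absent },
    ];
    assert!(!has_present_signatures(&first));
    // The second test case: no signatures at all.
    assert!(!has_present_signatures(&[]));
}
```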

@andrey-kuprianov changed the title from "Discrepancy between code and specs when verifying LightBlock's commit" to "Wrong validation of LightBlock's commits" on Oct 22, 2020
@konnov
Contributor

konnov commented Oct 22, 2020

Accepting an empty commit is definitely a bug. Amazing! How many tests did you have to generate to stumble upon that one?

@andrey-kuprianov
Contributor Author

If I correctly understand the implications, in the function verify_to_target(), incorrectly classifying the LightBlock as NotEnoughTrust instead of Invalid has pretty severe consequences: the block will be added to the light store, and on any further attempt to get the block for this height it will be fetched from the light store, i.e. it will be stuck there forever.

@romac could you please confirm whether I understand things right?
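To make the caching consequence concrete, here is a minimal sketch with hypothetical names (LightStore, get_or_fetch, the tag field), not the actual tendermint-rs API, of how a cached block shadows any later fetch for the same height:

```rust
// Minimal illustration of the consequence described above: once a block is
// cached in the light store, later lookups for that height return the cached
// copy instead of re-fetching from the peer, so a misclassified block is
// stuck there forever.

use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct LightBlock {
    height: u64,
    tag: &'static str,
}

#[derive(Default)]
struct LightStore {
    blocks: HashMap<u64, LightBlock>,
}

impl LightStore {
    /// Return the stored block for `height` if present;
    /// otherwise "fetch it from the peer" and cache it.
    fn get_or_fetch(&mut self, height: u64, fetch: impl FnOnce() -> LightBlock) -> LightBlock {
        self.blocks.entry(height).or_insert_with(fetch).clone()
    }
}

fn main() {
    let mut store = LightStore::default();
    // A bad block gets stored after being misclassified as NotEnoughTrust.
    store.blocks.insert(5, LightBlock { height: 5, tag: "bad" });
    // Even if the peer could now serve a good block, we get the cached bad one.
    let got = store.get_or_fetch(5, || LightBlock { height: 5, tag: "good" });
    assert_eq!(got.tag, "bad");
}
```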

@andrey-kuprianov
Contributor Author

@konnov I was in the process of replacing our JSON fixtures with TLA+ tests, and wrote one TLA+ test that checks for <2/3 validator power in the commit, and another test that has an empty commit. In fact, both of those were kind of "flickering" -- on one execution of Apalache a test passes, on another it fails... What I've definitely learned so far is not to ignore any failing test ;) Because it's so simple to rerun Apalache, and then the test will pass...

What this also shows, I think, is that we should eventually incorporate running Apalache as part of the CI, not only running the static auto-generated tests. Because each time you run the model checker, the produced test is a bit different.

@konnov
Contributor

konnov commented Oct 22, 2020

This also shows that we should think of initializing the random seeds in SMT. Flickering is good, but it is not good :)

@andrey-kuprianov
Contributor Author

andrey-kuprianov commented Oct 22, 2020

I think we probably want both modes: with seed initialization for predictability, and without it for variability ;)

@andrey-kuprianov
Contributor Author

@romac FYI: this PR now contains a static auto-generated test that fails as described above. The branch is updated to the latest master.

@romac
Member

romac commented Oct 22, 2020

Awesome work, @andrey-kuprianov!

> If I correctly understand the implications, in the function verify_to_target(), incorrectly classifying the LightBlock as NotEnoughTrust instead of Invalid has pretty severe consequences: the block will be added to the light store, and on any further attempt to get the block for this height it will be fetched from the light store, i.e. it will be stuck there forever.
>
> @romac could you please confirm whether I understand things right?

Yep, that's correct.

> @romac FYI: this PR now contains a static auto-generated test that fails as described above. The branch is updated to the latest master.

Awesome, will take a look at it asap!

@romac
Member

romac commented Oct 22, 2020

I have a fix for this, should I push it to #641?

@romac
Member

romac commented Oct 22, 2020

While we are at it, does anyone know what (if anything) should be done about this TODO?

// TODO: `self.commit.block_id` cannot be zero in the same way as in Go
// Clarify if this another encoding related issue

@romac
Member

romac commented Oct 22, 2020

Here is the result of running cargo test -- --nocapture after applying this patch to #641: https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f

@andrey-kuprianov Any idea what might be going wrong at https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f#file-gistfile1-txt-L388-L430?

@andrey-kuprianov
Contributor Author

> I have a fix for this, should I push it to #641?

Great, thanks @romac! Does it make sense to merge the fix to master? I will then merge master into my branch. Or, alternatively, pushing your fix to #641 is also fine; whatever you prefer.

@andrey-kuprianov
Contributor Author

> Here is the result of running cargo test -- --nocapture after applying this patch to #641: https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f
>
> @andrey-kuprianov Any idea what might be going wrong at https://gist.github.com/romac/b40640cab1e62765bdca14fff9066d9f#file-gistfile1-txt-L388-L430?

It looks like you have tried to set up the full MBT on your Mac, and this doesn't work, again... It didn't work for Shivani and Greg, for different reasons; maybe you have a third reason ;) One reason might be that you need to compile Testgen separately, because the model-based tests depend on the Testgen binary, and this dependency is not discovered by Cargo automatically.

In any case, you don't need the full MBT with Apalache running to test against static files. If you do want to have the full MBT environment, and recompiling Testgen doesn't help, I am ready to assist in setting it up. It's not yet a very user-friendly process, unfortunately.

@romac
Member

romac commented Oct 22, 2020

> I have a fix for this, should I push it to #641?
>
> Great, thanks @romac! Does it make sense to merge the fix to master? I will then merge master into my branch. Or, alternatively, pushing your fix to #641 is also fine; whatever you prefer.

Okay I just opened a PR against master, let's now see what the CI says.

@romac
Member

romac commented Oct 22, 2020

> It looks like you have tried to set up the full MBT on your Mac, and this doesn't work, again... It didn't work for Shivani and Greg, for different reasons; maybe you have a third reason ;) One reason might be that you need to compile Testgen separately, because the model-based tests depend on the Testgen binary, and this dependency is not discovered by Cargo automatically.

Yeah perhaps my installed versions of Testgen and Apalache are out of date, will update them and try again :)

@konnov
Contributor

konnov commented Oct 22, 2020

> In any case, you don't need the full MBT with Apalache running to test against static files. If you do want to have the full MBT environment, and recompiling Testgen doesn't help, I am ready to assist in setting it up. It's not yet a very user-friendly process, unfortunately.

Would it make sense to generate new static tests every night with CI? In the course of several months, you will have myriads of tests.

@ebuchman
Member

What about a continuous test streaming service? So rather than generating new static tests every day or whatever, we're just continuously feeding tests from Apalache to the code, until the end of time?

@konnov
Contributor

konnov commented Oct 22, 2020

...and record the failing ones? Yes, that may work.

@andrey-kuprianov
Contributor Author

Unfortunately, the fix proposed in this commit doesn't work.

I have a more refined test; here is the relevant part of it:

TestLessThanTwoThirdsCommit ==
    /\ \E s \in DOMAIN history :
       LET CMS == history[s].current.Commits
           UVS == history[s].current.header.VS
           TVS == history[s].verified.header.VS
       IN
       /\ history[s].current.header.height > history[s].verified.header.height + 1
       /\ CMS /= ({} <: {STRING})
       /\ CMS \subseteq UVS
       /\ 3 * Cardinality(CMS) < 2 * Cardinality(UVS)
       /\ 3 * Cardinality(CMS \intersect TVS) < Cardinality(TVS)

and here is the relevant part of the failing test:

[current |->
    [Commits |-> { "n2", "n4", "n8" },
      header |->
        [NextVS |-> {"n8"},
          VS |-> { "n1", "n10", "n2", "n3", "n4", "n6", "n7", "n8", "n9" },
          height |-> 3,
          lastCommit |-> { "n3", "n6", "n8" },
          time |-> 3]],
  now |-> 1400,
  verdict |-> "INVALID",
  verified |->
    [Commits |->
        { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
      header |->
        [NextVS |-> { "n10", "n3", "n6", "n8" },
          VS |->
            { "n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9" },
          height |-> 1,
          lastCommit |-> {},
          time |-> 1]]]
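For what it's worth, the predicate's arithmetic can be checked against this instance with a small sketch, counting validators rather than voting power since the model gives every node equal weight (the function name and structure are mine, not part of the codebase):

```rust
// Illustrative re-statement of the TestLessThanTwoThirdsCommit predicate:
// the commit is a non-empty subset of the untrusted validator set, holds
// less than 2/3 of it, and covers less than 1/3 of the trusted set.

use std::collections::HashSet;

fn less_than_two_thirds_commit(
    cms: &HashSet<&str>, // Commits of the current (untrusted) block
    uvs: &HashSet<&str>, // untrusted validator set
    tvs: &HashSet<&str>, // trusted validator set
) -> bool {
    let overlap = cms.intersection(tvs).count();
    !cms.is_empty()
        && cms.is_subset(uvs)
        // 3 * |CMS| < 2 * |UVS|: less than 2/3 of the untrusted set signed...
        && 3 * cms.len() < 2 * uvs.len()
        // ...and 3 * |CMS ∩ TVS| < |TVS|: less than 1/3 of the trusted set
        && 3 * overlap < tvs.len()
}

fn main() {
    let cms: HashSet<&str> = ["n2", "n4", "n8"].into_iter().collect();
    let uvs: HashSet<&str> =
        ["n1", "n10", "n2", "n3", "n4", "n6", "n7", "n8", "n9"].into_iter().collect();
    let tvs: HashSet<&str> =
        ["n1", "n10", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"].into_iter().collect();
    // 3 of 9 untrusted validators signed (< 2/3), covering
    // only 3 of the 10 trusted validators (< 1/3).
    assert!(less_than_two_thirds_commit(&cms, &uvs, &tvs));
}
```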

I think the right solution is to move this call into this position.

@romac
Member

romac commented Oct 23, 2020

Right, we should indeed only check the validators overlap (i.e., trust) if the block has already passed verification. I pushed a commit with the fix you suggested.
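The resulting ordering can be sketched as follows; all names and error strings here are illustrative, not the actual tendermint-rs verifier:

```rust
// Hypothetical sketch of the check ordering discussed above: the trust
// (validator-overlap) check only runs after the untrusted block has passed
// all validity checks, so an invalid commit can never be misreported as
// NotEnoughTrust.

#[derive(Debug, PartialEq)]
enum Verdict {
    Success,
    NotEnoughTrust,
    Invalid(&'static str),
}

// Pre-computed check results, standing in for the real validation logic.
struct UntrustedBlock {
    commit_is_well_formed: bool,          // e.g. has at least one present signature
    commit_has_quorum: bool,              // > 2/3 of the untrusted validator set signed
    overlap_meets_trust_threshold: bool,  // > 1/3 of the trusted set signed
}

fn verify(block: &UntrustedBlock) -> Verdict {
    // 1. Validity first: a malformed or under-signed commit is Invalid.
    if !block.commit_is_well_formed {
        return Verdict::Invalid("no present signatures in commit");
    }
    if !block.commit_has_quorum {
        return Verdict::Invalid("commit has less than 2/3 of voting power");
    }
    // 2. Only a valid block may be classified as NotEnoughTrust.
    if !block.overlap_meets_trust_threshold {
        return Verdict::NotEnoughTrust;
    }
    Verdict::Success
}

fn main() {
    // An empty commit now fails validity before trust is even considered.
    let empty_commit = UntrustedBlock {
        commit_is_well_formed: false,
        commit_has_quorum: false,
        overlap_meets_trust_threshold: false,
    };
    assert!(matches!(verify(&empty_commit), Verdict::Invalid(_)));
}
```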

@andrey-kuprianov
Contributor Author

Looks good, thanks:)

romac added a commit that referenced this issue Oct 23, 2020
…lid instead of invalid (#652)

* Ensure there is at least one non-absent signature in the commit. See #650

* Update changelog

* Improve readability by renaming `has_non_absent_signatures` to `has_present_signatures`

Co-authored-by: Thane Thomson <thane@informal.systems>

* Fix typo

Co-authored-by: Thane Thomson <thane@informal.systems>

* Only check for validators overlap after a block has been deemed valid

* Remove and ignore detailed.log file

* Remove outdated TODO

* Link to issue instead of PR in changelog

Co-authored-by: Thane Thomson <thane@informal.systems>
@andrey-kuprianov
Contributor Author

andrey-kuprianov commented Oct 24, 2020

Re: #652 (review)

That's cool! I am glad that the problem we've discovered helps to rethink the protocols; this is the best outcome possible. I agree that the current LightClient protocol doesn't look repairable to any satisfactory state -- for a malicious peer there will always be a possibility to trigger long computations. The idea of changing the client-server protocol also sounds very nice to me! It would mean that the misbehavior of a full node is detectable in one step -- if it provides a header that's not immediately verifiable by the LightClient.

One issue with what we've found is that our current single-step MBT tests look at only one step of the whole protocol, i.e. a very local view, which doesn't tell the whole story. So we need to do the bisection tests sooner rather than later. What also comes to my mind in this respect is that we probably want MBT to drive performance tests for us: it would generate different scenarios/schedules for the protocols, and besides checking logical correctness we would also measure wall-clock performance and store the outliers for later analysis. One more point for integrating MBT as a continuously running service generating tests.

@konnov
Contributor

konnov commented Oct 24, 2020

It is really cool that the bug provokes new thoughts. But we have to be a bit careful here. If we start increasing the protocol complexity only because we believe it works better in some cases, we may end up with a very complex protocol that is much harder to analyze. In the presence of faults, computational complexity becomes a nightmare.

There is also the detector protocol that works on top of this verification protocol. It runs the verifier multiple times. It is quite important not to accidentally break the assumptions about this composition.

@andrey-kuprianov
Contributor Author

andrey-kuprianov commented Oct 24, 2020

Yes, I agree that we should be careful. But my feeling is that this change would make the protocols and their analysis much simpler. In particular, there would be no bisection anymore, only one-step validation, and probably cross-validation to detect full nodes that might try to slow light clients down.

@ebuchman
Member

Note this is already how the protocol works for the on-chain IBC handler, which is the more important case anyway. We've been trying to align how we think about the light node to be closer to the IBC handler, and this would be a significant step in that direction as well.
