Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stateless_validation] Fix tests that fail with stateless validation enabled #10506

Closed
8 of 9 tasks
shreyan-gupta opened this issue Jan 25, 2024 · 3 comments
Closed
8 of 9 tasks
Assignees
Labels
A-stateless-validation Area: stateless validation

Comments

@shreyan-gupta
Copy link
Contributor

shreyan-gupta commented Jan 25, 2024

When trying to enable stateless validation, some of the integration tests are failing. I did some initial investigation and found all the problems are related to our testing infrastructure. It's impossible to fix all these in one go so I'm creating a task to track these.

Tasks

github-merge-queue bot pushed a commit that referenced this issue Jan 26, 2024
…10507)

This PR uncomments two checks
- We now filter chunks based on whether they have enough stake endorsed
or not.
- We validate chunk endorsements in block header.

Enabling this breaks a bunch of tests mainly for two reasons
- Some tests are handled by mocking out parts/doing manual block
production which doesn't take care of chunk endorsement signatures
- A number of tests that work with multiple validators don't have the
appropriate network request handlers for chunk endorsement and state
witness processing.

These tests are impossible to explore and fix in a single PR and I've
created a task list to help with this here:
#10506

Note that we need to merge in the changes from
#10501 ,
#10503 and
#10504 before we can merge this.
Till then probably the tests would keep failing

Tip: Probably a good idea to review PR commit by commit
@jancionear jancionear self-assigned this Jan 29, 2024
github-merge-queue bot pushed a commit that referenced this issue Jan 31, 2024
This test didn't work with stateless validation. Let's fix it by
properly handling `ChunkStateWitness` and `ChunkEndorsement` messages.

Refs: #10506
@Longarithm
Copy link
Member

We should also look into nayduck, which indicates ~9 more failing tests since enabling endorsements https://nayduck.near.org/#/

github-merge-queue bot pushed a commit that referenced this issue Jan 31, 2024
It looks like this test now works with stateless validation, so let's
enable it again.
I didn't need to make any changes to the test, maybe something was fixed
in the meantime.

Refs: #10506
github-merge-queue bot pushed a commit that referenced this issue Feb 5, 2024
…es writing integration tests easier (#10559)

State witnesses are processed asynchronously - a task is spawned on a
separate thread, which validates the state witness and then sends out
chunk endorsements. Currently there's no way to directly wait for this
processing to finish, one can only wait for the chunk endorsements to
appear. This makes writing integration tests difficult. Let's add a way
to wait for the processing to finish, which will make writing the tests
easier.

Details:
There's no network stack in integration tests, all messages are passed
around manually. We first propagate all state witness messages, which
starts their processing in the background, but then we have to wait for
this processing to finish before propagating chunk endorsements. As
there's no way to wait for it, we have to do tricky things. The current
approach is to
define the set of expected chunk endorsements that should appear, and
then wait for the right endorsement messages to be sent out. This
approach has some limitations - it's often hard to figure out which
endorsements should be sent out exactly, as the tests create various
strange scenarios - missing chunks, blocks, forks, etc... It requires a
lot of manual work to fully understand the test and prepare the right
set. It's also quite fragile, any change in chunk endorsements (e.g not
sending them to ourselves) will break all the tests.
Adding the ability to wait for witness processing to finish greatly
simplifies things. We can just pass around witnesses, wait for them to
finish, and then propagate the generated endorsements. Without defining
any expected sets to wait for.
As an example of the benefits offered by this solution, compare the fix
in #10537 to the one added in this
PR. The one in #10537 is much more
complex and fragile.

I saw that there is also a thing called `ChunkEndorsementTracker`, which
tracks what chunk endorsements were sent out, but IMO it's a different
thing from `ProcessingDoneTracker`. We were already able to track which
endorsements are sent out in the tests, but that didn't make writing
tests easier. Additionally, in some tests we send out invalid chunks,
which wouldn't produce any endorsements at all, so we couldn't wait for
them.

Refs: #10506,
#10537, [zulip
conversation](https://near.zulipchat.com/#narrow/stream/407237-pagoda.2Fcore.2Fstateless-validation/topic/A.20way.20to.20wait.20for.20state.20witness.20processing.20to.20finish/near/419062119)
github-merge-queue bot pushed a commit that referenced this issue Feb 6, 2024
…10571)

## Description
The tests in `adversarial_behaviors.rs` didn't work with stateless
validation, let's make them pass again.
To achieve this we first need to propagate chunk state witnesses and
endorsements, and then adjust the expected behavior to match the one
exhibited by stateless validation.

The most complex part of this PR is modifying which blocks should be
skipped in
`test_banning_chunk_producer_when_seeing_invalid_chunk_base()`.
This function creates a blockchain with one malicious chunk producer and
runs it for a few epochs.
Which blocks will be skipped depends on whether we are using stateless
validation or not.

### Without stateless validation
In the old version of the protocol the invalid chunk is included in a
block and then the whole block is validated. The chunk is invalid, so
the whole block is also invalid and the block should be skipped. After
that the malicious chunk producer is banned for the rest of the epoch,
which means that block producers will reject chunks sent by this chunk
producer, so the subsequent blocks won't contain any invalid chunks and
won't be skipped.
Only the first block with invalid chunks in an epoch should be skipped. 

### With stateless validation
With stateless validation the situation is entirely different. When the
malicious chunk producer produces an invalid chunk, chunk validators
will refuse to send out endorsements for this chunk, so the chunk won't
be included in any block. This means that no blocks should be skipped.

There is one exception - in the test the block produced at height 2 is
invalid and gets skipped. I suspect that this is because the first few
blocks after genesis don't use stateless validation, and they are still
handled by old protocol. I'm not 100% sure if that's the case,
@Longarithm could you confirm that this explanation makes sense?
Something is definitely different for the first few blocks, in the logs
I can see `client: Banning chunk producer for producing invalid` at
height 2, which seems to belong to the old protocol. In the new version
malicious chunk producers are banned using `BanPeer`, without any log
messages.

Refs: #10506, [zulip
discussion](https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/Skipping.20the.20first.20block.20with.20missing.20chunks.20in.20an.20epoch.3F/near/419838610)
jancionear added a commit to jancionear/nearcore that referenced this issue Feb 7, 2024
This test didn't work with stateless validation because
it didn't propagate chunk state witnesses and endorsements as needed.

I modified the test to use `TestEnv`, which allowed me to use the
exisiting functionality for propagating all the partial chunk requests,
state witnesses and endorsements.
This changed the assignment of chunk and block producers.
Previously one of the clients was a chunk producer and the
other a block producer, with their roles swapped on each block.
Now the roles are assigned based on the logic in `EpochManager`.
IMO it doesn't affect the test, in the crucial moment when the
bad chunk mask is tested, the chunk validator is different from
the chunk producer, just like in the original test.

The test manually produces a single chunk on shard 0, so I couldn't just
use the default way of generating chunks during block processing. I had
to implement a function that would manually produce the required chunk.

Refs: near#10506
jancionear added a commit to jancionear/nearcore that referenced this issue Feb 7, 2024
This test didn't work with stateless validation because
it didn't propagate chunk state witnesses and endorsements as needed.

I modified the test to use `TestEnv`, which allowed me to use the
exisiting functionality for propagating all the partial chunk requests,
state witnesses and endorsements.
This changed the assignment of chunk and block producers.
Previously one of the clients was a chunk producer and the
other a block producer, with their roles swapped on each block.
Now the roles are assigned based on the logic in `EpochManager`.
IMO it doesn't affect the test, in the crucial moment when the
bad chunk mask is tested, the chunk producer is different from
the block producer, just like in the original test.

The test manually produces a single chunk on shard 0, so I couldn't just
use the default way of generating chunks during block processing. I had
to implement a function that would manually produce the required chunk.

Refs: near#10506
jancionear added a commit to jancionear/nearcore that referenced this issue Feb 7, 2024
This test didn't work with stateless validation because
it didn't propagate chunk state witnesses and endorsements as needed.

I modified the test to use `TestEnv`, which allowed me to use the
exisiting functionality for propagating all the partial chunk requests,
state witnesses and endorsements.
This changed the assignment of chunk and block producers.
Previously one of the clients was a chunk producer and the
other a block producer, with their roles swapped on each block.
Now the roles are assigned based on the logic in `EpochManager`.
IMO it doesn't affect the test, in the crucial moment when the
bad chunk mask is tested, the chunk producer is different from
the block producer, just like in the original test.

The test manually produces a single chunk on shard 0 to achieve the desired
chunk mask, so I couldn't just use the default way of generating chunks during
block processing. I had to implement a function that would manually produce
the required chunk.

Refs: near#10506
jancionear added a commit to jancionear/nearcore that referenced this issue Feb 7, 2024
This test didn't work with stateless validation because
it didn't propagate chunk state witnesses and endorsements as needed.

I modified the test to use `TestEnv`, which allowed me to use the
exisiting functionality for propagating all the partial chunk requests,
state witnesses and endorsements.
This changed the assignment of chunk and block producers.
Previously one of the clients was a chunk producer and the
other a block producer, with their roles swapped on each block.
Now the roles are assigned based on the logic in `EpochManager`.
IMO it doesn't affect the test, in the crucial moment when the
bad chunk mask is tested, the chunk producer is different from
the block producer, just like in the original test.

The test manually produces a single chunk on shard 0 to achieve the desired
chunk mask, so I couldn't just use the default way of generating chunks during
block processing. I had to implement a function that would manually produce
the required chunk.

Refs: near#10506
jancionear added a commit to jancionear/nearcore that referenced this issue Feb 7, 2024
This test didn't work with stateless validation because
it didn't propagate chunk state witnesses and endorsements as needed.

I modified the test to use `TestEnv`, which allowed me to use the
exisiting functionality for propagating all the partial chunk requests,
state witnesses and endorsements.
This changed the assignment of chunk and block producers.
Previously one of the clients was a chunk producer and the
other a block producer, with their roles swapped on each block.
Now the roles are assigned based on the logic in `EpochManager`.
IMO it doesn't affect the test, in the crucial moment when the
bad chunk mask is tested, the chunk producer is different from
the block producer, just like in the original test.

The test manually produces a single chunk on shard 0 to achieve the desired
chunk mask, so I couldn't just use the default way of generating chunks during
block processing. I had to implement a function that would manually produce
the required chunk.

Refs: near#10506
github-merge-queue bot pushed a commit that referenced this issue Feb 8, 2024
This test didn't work with stateless validation because it didn't
propagate chunk state witnesses and endorsements as needed.

I modified the test to use `TestEnv`, which allowed me to use the
exisiting functionality for propagating all the partial chunk requests,
state witnesses and endorsements.
This changed the assignment of chunk and block producers. Previously one
of the clients was a chunk producer and the other a block producer, with
their roles swapped on each block. Now the roles are assigned based on
the logic in `EpochManager`. IMO it doesn't affect the test, in the
crucial moment when the bad chunk mask is tested, the chunk producer is
different from the block producer, just like in the original test.

The test manually produces a single chunk on shard 0 to achieve the
desired chunk mask, so I couldn't just use the default way of generating
chunks during block processing. I had to implement a function that would
manually produce the required chunk.

Refs: #10506
github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2024
…10603)

This test didn't work with stateless validation.
It manually creates a chunk and blocks containing this chunk, but it
didn't set the `chunk_endorsements` field in the blocks, because of
which the blocks couldn't be accepted.
Let's fix it by generating the correct chunk endoresment and setting the
field in the generated blocks.

Refs: #10506
@walnut-the-cat
Copy link
Contributor

cc. @shreyan-gupta to have another look at the current status of this task

@tayfunelmas tayfunelmas added the A-stateless-validation Area: stateless validation label Apr 1, 2024
@tayfunelmas tayfunelmas changed the title [stateless_validation] Fix tests that fail stateless validation [stateless_validation] Fix tests that fail with stateless validation enabled Apr 12, 2024
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
To make chunks management tests work for stateless validation (#10506),
I want to simplify them and fix the broken ones.

One problem with these is that main assertions are hidden inside
"peer_manager_mock" function which intercepts internal Block messages
and checks that chunks are (not) included. This is very ugly. Instead, I
create a separate loop which calls view_client GetBlock method
sequentially for tested heights. That's the main part of change -
unfortunately diff doesn't handle tabs, but you can find that checks
move from `NetworkRequests::Block { block } => { ... }` to simple `async
move { ... }`.

And this allows to fix long-failing
`chunks_recovered_from_full_timeout_too_short` tests. As pointed out on
[Zulip](https://near.zulipchat.com/#narrow/stream/299922-nearone.2Fnode.2Fprivate.2Farchive/topic/.E2.9C.94.20oncall.20week.20Nov.206.202023/near/402138497),
these tests never worked, they panicked because node test4 couldn't
state sync, and once it got handled better, test4 stopped failing.

Instead, we need to fix the assert condition. The check should be "if
some messages to test4 are dropped, then we should assert that some
chunks are missing in blocks it produces". This allows to replace
`should_panic` with proper assertion. With `block.author` field we can
look who produced block directly, instead of check `h % 4 == 3` which
was in fact incorrect.

Next steps are to handle stateless validation messages and see that
everything still works. Could be nice to move that to TestLoop and make
tests less expensive, but that's a long way. First step for that could
be extending `multi.rs` with tests with assertions on sequences of
blocks, not a single block.

Nayduck https://nayduck.nearone.org/#/run/77

---------

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
@shreyan-gupta
Copy link
Contributor Author

Closing this task. We can separately track any issues related to chunk management tests and other failing tests in nayduck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-stateless-validation Area: stateless validation
Projects
None yet
Development

No branches or pull requests

6 participants