[stateless_validation] Fix tests that fail with stateless validation enabled #10506

shreyan-gupta · 2024-01-25T23:43:33Z

When trying to enable stateless validation, some of the integration tests are failing. I did some initial investigation and found all the problems are related to our testing infrastructure. It's impossible to fix all these in one go so I'm creating a task to track these.

…10507) This PR uncomments two checks - We now filter chunks based on whether they have enough stake endorsed or not. - We validate chunk endorsements in block header. Enabling this breaks a bunch of tests mainly for two reasons - Some tests are handled by mocking out parts/doing manual block production which doesn't take care of chunk endorsement signatures - A number of tests that work with multiple validators don't have the appropriate network request handlers for chunk endorsement and state witness processing. These tests are impossible to explore and fix in a single PR and I've created a task list to help with this here: #10506 Note that we need to merge in the changes from #10501 , #10503 and #10504 before we can merge this. Till then probably the tests would keep failing Tip: Probably a good idea to review PR commit by commit

This test didn't work with stateless validation. Let's fix it by properly handling `ChunkStateWitness` and `ChunkEndorsement` messages. Refs: #10506

Longarithm · 2024-01-31T12:43:47Z

We should also look into nayduck, which indicates ~9 more failing tests since enabling endorsements https://nayduck.near.org/#/

It looks like this test now works with stateless validation, so let's enable it again. I didn't need to make any changes to the test, maybe something was fixed in the meantime. Refs: #10506

…es writing integration tests easier (#10559) State witnesses are processed asynchronously - a task is spawned on a separate thread, which validates the state witness and then sends out chunk endorsements. Currently there's no way to directly wait for this processing to finish, one can only wait for the chunk endorsements to appear. This makes writing integration tests difficult. Let's add a way to wait for the processing to finish, which will make writing the tests easier. Details: There's no network stack in integration tests, all messages are passed around manually. We first propagate all state witness messages, which starts their processing in the background, but then we have to wait for this processing to finish before propagating chunk endorsements. As there's no way to wait for it, we have to do tricky things. The current approach is to define the set of expected chunk endorsements that should appear, and then wait for the right endorsement messages to be sent out. This approach has some limitations - it's often hard to figure out which endorsements should be sent out exactly, as the tests create various strange scenarios - missing chunks, blocks, forks, etc... It requires a lot of manual work to fully understand the test and prepare the right set. It's also quite fragile, any change in chunk endorsements (e.g not sending them to ourselves) will break all the tests. Adding the ability to wait for witness processing to finish greatly simplifies things. We can just pass around witnesses, wait for them to finish, and then propagate the generated endorsements. Without defining any expected sets to wait for. As an example of the benefits offered by this solution, compare the fix in #10537 to the one added in this PR. The one in #10537 is much more complex and fragile. I saw that there is also a thing called `ChunkEndorsementTracker`, which tracks what chunk endorsements were sent out, but IMO it's a different thing from `ProcessingDoneTracker`. We were already able to track which endorsements are sent out in the tests, but that didn't make writing tests easier. Additionally, in some tests we send out invalid chunks, which wouldn't produce any endorsements at all, so we couldn't wait for them. Refs: #10506, #10537, [zulip conversation](https://near.zulipchat.com/#narrow/stream/407237-pagoda.2Fcore.2Fstateless-validation/topic/A.20way.20to.20wait.20for.20state.20witness.20processing.20to.20finish/near/419062119)

@Longarithm

…10571) ## Description The tests in `adversarial_behaviors.rs` didn't work with stateless validation, let's make them pass again. To achieve this we first need to propagate chunk state witnesses and endorsements, and then adjust the expected behavior to match the one exhibited by stateless validation. The most complex part of this PR is modifying which blocks should be skipped in `test_banning_chunk_producer_when_seeing_invalid_chunk_base()`. This function creates a blockchain with one malicious chunk producer and runs it for a few epochs. Which blocks will be skipped depends on whether we are using stateless validation or not. ### Without stateless validation In the old version of the protocol the invalid chunk is included in a block and then the whole block is validated. The chunk is invalid, so the whole block is also invalid and the block should be skipped. After that the malicious chunk producer is banned for the rest of the epoch, which means that block producers will reject chunks sent by this chunk producer, so the subsequent blocks won't contain any invalid chunks and won't be skipped. Only the first block with invalid chunks in an epoch should be skipped. ### With stateless validation With stateless validation the situation is entirely different. When the malicious chunk producer produces an invalid chunk, chunk validators will refuse to send out endorsements for this chunk, so the chunk won't be included in any block. This means that no blocks should be skipped. There is one exception - in the test the block produced at height 2 is invalid and gets skipped. I suspect that this is because the first few blocks after genesis don't use stateless validation, and they are still handled by old protocol. I'm not 100% sure if that's the case, @Longarithm could you confirm that this explanation makes sense? Something is definitely different for the first few blocks, in the logs I can see `client: Banning chunk producer for producing invalid` at height 2, which seems to belong to the old protocol. In the new version malicious chunk producers are banned using `BanPeer`, without any log messages. Refs: #10506, [zulip discussion](https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/Skipping.20the.20first.20block.20with.20missing.20chunks.20in.20an.20epoch.3F/near/419838610)

This test didn't work with stateless validation because it didn't propagate chunk state witnesses and endorsements as needed. I modified the test to use `TestEnv`, which allowed me to use the exisiting functionality for propagating all the partial chunk requests, state witnesses and endorsements. This changed the assignment of chunk and block producers. Previously one of the clients was a chunk producer and the other a block producer, with their roles swapped on each block. Now the roles are assigned based on the logic in `EpochManager`. IMO it doesn't affect the test, in the crucial moment when the bad chunk mask is tested, the chunk validator is different from the chunk producer, just like in the original test. The test manually produces a single chunk on shard 0, so I couldn't just use the default way of generating chunks during block processing. I had to implement a function that would manually produce the required chunk. Refs: near#10506

This test didn't work with stateless validation because it didn't propagate chunk state witnesses and endorsements as needed. I modified the test to use `TestEnv`, which allowed me to use the exisiting functionality for propagating all the partial chunk requests, state witnesses and endorsements. This changed the assignment of chunk and block producers. Previously one of the clients was a chunk producer and the other a block producer, with their roles swapped on each block. Now the roles are assigned based on the logic in `EpochManager`. IMO it doesn't affect the test, in the crucial moment when the bad chunk mask is tested, the chunk producer is different from the block producer, just like in the original test. The test manually produces a single chunk on shard 0, so I couldn't just use the default way of generating chunks during block processing. I had to implement a function that would manually produce the required chunk. Refs: near#10506

This test didn't work with stateless validation because it didn't propagate chunk state witnesses and endorsements as needed. I modified the test to use `TestEnv`, which allowed me to use the exisiting functionality for propagating all the partial chunk requests, state witnesses and endorsements. This changed the assignment of chunk and block producers. Previously one of the clients was a chunk producer and the other a block producer, with their roles swapped on each block. Now the roles are assigned based on the logic in `EpochManager`. IMO it doesn't affect the test, in the crucial moment when the bad chunk mask is tested, the chunk producer is different from the block producer, just like in the original test. The test manually produces a single chunk on shard 0 to achieve the desired chunk mask, so I couldn't just use the default way of generating chunks during block processing. I had to implement a function that would manually produce the required chunk. Refs: near#10506

This test didn't work with stateless validation because it didn't propagate chunk state witnesses and endorsements as needed. I modified the test to use `TestEnv`, which allowed me to use the exisiting functionality for propagating all the partial chunk requests, state witnesses and endorsements. This changed the assignment of chunk and block producers. Previously one of the clients was a chunk producer and the other a block producer, with their roles swapped on each block. Now the roles are assigned based on the logic in `EpochManager`. IMO it doesn't affect the test, in the crucial moment when the bad chunk mask is tested, the chunk producer is different from the block producer, just like in the original test. The test manually produces a single chunk on shard 0 to achieve the desired chunk mask, so I couldn't just use the default way of generating chunks during block processing. I had to implement a function that would manually produce the required chunk. Refs: #10506

…10603) This test didn't work with stateless validation. It manually creates a chunk and blocks containing this chunk, but it didn't set the `chunk_endorsements` field in the blocks, because of which the blocks couldn't be accepted. Let's fix it by generating the correct chunk endoresment and setting the field in the generated blocks. Refs: #10506

walnut-the-cat · 2024-03-14T16:28:43Z

cc. @shreyan-gupta to have another look at the current status of this task

To make chunks management tests work for stateless validation (#10506), I want to simplify them and fix the broken ones. One problem with these is that main assertions are hidden inside "peer_manager_mock" function which intercepts internal Block messages and checks that chunks are (not) included. This is very ugly. Instead, I create a separate loop which calls view_client GetBlock method sequentially for tested heights. That's the main part of change - unfortunately diff doesn't handle tabs, but you can find that checks move from `NetworkRequests::Block { block } => { ... }` to simple `async move { ... }`. And this allows to fix long-failing `chunks_recovered_from_full_timeout_too_short` tests. As pointed out on [Zulip](https://near.zulipchat.com/#narrow/stream/299922-nearone.2Fnode.2Fprivate.2Farchive/topic/.E2.9C.94.20oncall.20week.20Nov.206.202023/near/402138497), these tests never worked, they panicked because node test4 couldn't state sync, and once it got handled better, test4 stopped failing. Instead, we need to fix the assert condition. The check should be "if some messages to test4 are dropped, then we should assert that some chunks are missing in blocks it produces". This allows to replace `should_panic` with proper assertion. With `block.author` field we can look who produced block directly, instead of check `h % 4 == 3` which was in fact incorrect. Next steps are to handle stateless validation messages and see that everything still works. Could be nice to move that to TestLoop and make tests less expensive, but that's a long way. First step for that could be extending `multi.rs` with tests with assertions on sequences of blocks, not a single block. Nayduck https://nayduck.nearone.org/#/run/77 --------- Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>

shreyan-gupta · 2024-06-13T21:01:34Z

Closing this task. We can separately track any issues related to chunk management tests and other failing tests in nayduck.

shreyan-gupta mentioned this issue Jan 26, 2024

[stateless_validation] Enable chunk endorsements in block validation #10507

Merged

wacban mentioned this issue Jan 26, 2024

[ProjectTracking]: Stateless validation MVP near/near-one-project-tracking#5

Closed

23 tasks

jancionear self-assigned this Jan 29, 2024

jancionear mentioned this issue Jan 30, 2024

fix: test_chunk_forwarding_optimization #10527

Merged

github-merge-queue bot pushed a commit that referenced this issue Jan 31, 2024

fix: test_chunk_forwarding_optimization (#10527)

c11ad4a

This test didn't work with stateless validation. Let's fix it by properly handling `ChunkStateWitness` and `ChunkEndorsement` messages. Refs: #10506

This was referenced Jan 31, 2024

fix: make test_in_memory_trie_node_consistency work with stateless validation #10537

Closed

fix: enable test_function_limit_change #10538

Merged

github-actions bot mentioned this issue Feb 1, 2024

Monthly issue metrics report utnet-org/utility-readonly#7

Closed

This was referenced Feb 2, 2024

Add a way to wait until a state witness is fully processed, which makes writing integration tests easier #10559

Merged

Enable some integration tests that were disabled because of stateless validation #10562

Closed

Longarithm mentioned this issue Feb 2, 2024

[ProjectTracking]: Stateless validation Mainnet Release near/near-one-project-tracking#46

Open

52 tasks

jancionear mentioned this issue Feb 6, 2024

fix: make adversarial_behaviors tests pass with stateless validation #10571

Merged

jancionear mentioned this issue Feb 7, 2024

fix: make test_bad_chunk_mask work with stateless validation #10583

Merged

This was referenced Feb 12, 2024

[stateless_validation] test_chunk_forwarding_optimization fails with --features statelessnet_protocol #10600

Closed

fix: make test_validate_chunk_extra work with stateless validation #10603

Merged

tayfunelmas added the A-stateless-validation Area: stateless validation label Apr 1, 2024

tayfunelmas changed the title ~~[stateless_validation] Fix tests that fail stateless validation~~ [stateless_validation] Fix tests that fail with stateless validation enabled Apr 12, 2024

Longarithm mentioned this issue May 3, 2024

fix: stateful chunk management tests #11224

Merged

walnut-the-cat assigned pugachAG and unassigned jancionear May 28, 2024

Longarithm mentioned this issue May 30, 2024

Re-enable chunk management tests on nightly #11422

Closed

shreyan-gupta closed this as completed Jun 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[stateless_validation] Fix tests that fail with stateless validation enabled #10506

[stateless_validation] Fix tests that fail with stateless validation enabled #10506

shreyan-gupta commented Jan 25, 2024 •

edited by jancionear

Loading

Tasks

Longarithm commented Jan 31, 2024

walnut-the-cat commented Mar 14, 2024

shreyan-gupta commented Jun 13, 2024

[stateless_validation] Fix tests that fail with stateless validation enabled #10506

[stateless_validation] Fix tests that fail with stateless validation enabled #10506

Comments

shreyan-gupta commented Jan 25, 2024 • edited by jancionear Loading

Tasks

Longarithm commented Jan 31, 2024

walnut-the-cat commented Mar 14, 2024

shreyan-gupta commented Jun 13, 2024

shreyan-gupta commented Jan 25, 2024 •

edited by jancionear

Loading