fix: Third wave of fixing nightly tests #2207

SkidanovAlex · 2020-02-28T23:07:27Z

Many timeout tweaks, and small typos.
cross_shard_tx_with_validator_rotation with 150 block time has
extremely long forks (doomslug is disabled), and takes lots of time per
iteration. I split it into two: one with 150ms block time, but only 16
iterations, and one with 400ms block time, but all 64 iterations. Both
locally take ~45 minutes, will see how long it takes on gcloud.

Many timeout tweaks, and small typos. `cross_shard_tx_with_validator_rotation` with 150 block time has extremely long forks (doomslug is disabled), and takes lots of time per iteration. I split it into two: one with 150ms block time, but only 16 iterations, and one with 400ms block time, but all 64 iterations. Both locally take ~45 minutes, will see how long it takes on gcloud.

@Kouprin

* Enable floats but prohibit some CPU architectures (#1941) (#2079) * Enable floats but prohibit some CPU architectures * Merge branch 'staging' into enable_floats * Merge refs/heads/staging into enable_floats * Avoid overflowing u32 during contract preparation (#1946) * Update runtime/near-vm-runner/src/runner.rs Co-Authored-By: Evgeny Kuzyakov <ek@nearprotocol.com> * Merge branch 'staging' into enable_floats * Nit Co-authored-by: Maksym Zavershynskyi <35039879+nearmax@users.noreply.github.com> * Move sysinfo fix from staging (#2085) * Move sysinfo fix from staging * Introduce keccak256 and keccak512 native support (#2072) * Introduce keccak256 and keccak512 native support * modify genesis Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com> * Create code-of-conduct.md (#1932) * Create code-of-conduct.md * Move code of conduct to proper name Co-authored-by: Illia Polosukhin <illia@nearprotocol.com> * Ref #2067: querying contract state (#2075) * Fix #2067: querying contract state * Change format to {key, value, proof} for state view. Proofs are empty for now * Change state dump to binary format (#2086) Instead of dumping state in a json file which is slow, this PR changes it to dumping state in a binary format. Fixes #2070. Test plan --------- Manually test this with a local node to make sure that the state is preserved correctly after state dumping. * Add json state dump for debugging (#2120) * Add json state dump for debugging * state_dump.json * fix * docs: add links to Issues/Milestones (#2144) * fix(runtime): Fix empty method_names bug (#2188) Fixes: #2183 # Test plan: - Extracted the code into utils crate and added unit tests. * State dump split (#2126) - Revert binary state dump - #1922 for separating config and records - scripts/new-genesis-from-existing-state.sh that dump state, calculate new genesis hash, upload to s3 - testnet genesis records separate from near binary - download testnet genesis from s3 in python start_testnet - check genesis hash when run testnet Co-authored-by: Evgeny Kuzyakov <h3r0k1ll3r@gmail.com> Co-authored-by: Bo Yao <bo@nearprotocol.com> Test Plans ------------- - it can `near init --chain-id=testnet --genesis-config near/res/testnet_genesis_config.json --genesis-records <records download from s3> --genesis-hash <expected-hash>` to initialize testnet config from external genesis records - When `near run`, with incorrect `~/.near/genesis_hash` it will panic - It can start testnet with updated start_testnet.py, which download genesis_records from s3 - After stop a node, we can call `state-viewer dump_genesis` to dump genesis_records, config, and genesis_hash. - If there was no state change happen since genesis block, load the dump genesis_records/config will give you a same genesis_hash, and dump again generate same genesis_records, config & genesis_hash. * Avoid panic while panicking because of actix isn't running (#2178) Some of our tests don't have actix running. @Kouprin found when assert in such tests failed it's not able to see which test fails, only `thread panicked while panicking`. So check actix is running and only shutdown it if so fix this. Test Plan --------- ``` #[test] fn test_assert() { init_stop_on_panic(); assert(false); } ``` should give assert fails instead of `thread panicked while panicking` * Fix epoch manager bug (#2193) * feat(runtime): Validate incoming receipts #2155 If an invalid chunk made it into block the invalid incoming receipts might be forwarded to different shards. Ideally the chunk should be challenged and the block is reverted, but before this happens we need to handle invalid receipts in the Runtime. Another possibility is to create invalid receipts in the state directly and then create a challenge on this invalid state. So any field in the state can potentially contain invalid value. So if an invalid receipt is present in the delayed receipts state, we consider it a StorageError. Fixes: #1850 # Test plan - Added 2 tests to cover handling of invalid receipts in the Runtime - Filed an issue to handle error in NightshadeRuntime #2152 * fix: Use stderr for logs generated with tracing. (#2198) * fix: Fixing nightly python tests (#2201) * `block_production.py` started failing because the blocks are now produced faster, and by the first time it checked the heights the heights exceeded 2, causing poorly stated assert to trigger * `staking*.py` were never adjusted for yocto-near * `state_sync*.py` both header sync and state sync were broken Header sync: we were cleaning up mapping from heights to headers during GC, which is necessary for header sync. State sync: we changed the state sync to always happen at the epoch boundary, but did it in a wrong way: the hash we stored locally was pre-pushing to the epoch boundary, and thus the node was rejecting any incoming state responces * `state_sync1.py` was also failing because it was relying on one out of two validating nodes being able to produce blocks, which it cannot, because of Doomslug * `skip_epoch.py` was expecting one validator to be producing blocks, which with doomslug requires such validator to have more than half the stake. Also was not updated for yocto-near * `one_val.py` similarly was expecting one validator to be producing blocks, and thus needed that validator stake bumped to worh with DS * `lightclnt.py` with doomslug blocks are produced faster, and by the time the test started a few blocks were already produced. Making the test expect such behavior * `block_sync.py` was failing because Doomslug requires `max_block_production_delay` to be at least 2x `min_block_production_delay`. Also disabling `network_stress`, because the current runner crashes trying to launch it and skips all the consecutive tests Test Plan --------- All the above python tests pass. * fix: Fix pytest after log to stderr (#2203) Similar to #1985 but now we changed back to log to stderr Test Plans ------------- Run pytest locally, now it's no longer have `AssertionError: node dirs: 0 num_nodes: 4 num_observers: 0` error * fix: fixing cross_shard_tx* tests and small fixes for some other nightly failures (#2204) There were several issues in the test infra: 1. The peer info in the client test infra was the largest height and the largest score ever observed. If a block with a higher score but lower height than the previous tip was created, it would report incorrect peer info, and peers would attempt header sync believing the peer has higher header head height, and such header sync would fail. 2. The tests tamper with the FG, and the last final block could be way more than 5 epochs in the past. That makes creating light client blocks potentially require blocks from 5 epoch lengths ago. I'm just making all nodes in cross_shard_tx archival. In practice if one epoch has been lasting for five epoch lengths, we have bigger problems. 3. We historically see cross_shard_tx tests fail with `InvalidBlockHeight` error when a block is more than epoch length ahead of the previous block. Since that check is a heuristic anyway, I'm doubling the distance, to reduce the flakiness of the test. Separately, increasing the timeouts for NFG tests, they take more than 15 minutes. Also bumping timeouts for the `test_all_chunks_accepted_1000*` tests, it's clear that they need at least 2000 / 4000 / 1000 seconds to complete, I set the timeouts to 3600 / 7200 / 1800 for some extra room. Also the one that requires 7200 (`*_slow`) seems to provide no value compared to the base test, and is the slowest test in our entire suite, so I completely disable it. Separately, fixing the issue with state sync tests, where the transition to state sync happens before the log tracker is initialized, and the check for the transition in the log later fails Slightly bumping block production time in `test_catchup_sanity_blocks_produced`, it works on local machine, but on the gcloud runner doesn't keep up. Test plan --------- All cross_shard_tx* tests passed at least three runs. If they are flaky, nightly will catch that. * feat(chain): Exposed genesis config + runtime config and genesis records via RPC Resolves #2007 and #2025 We needed to make Near config query-able through node RPC. Specifically, one of our clients wanted to know how many blocks remain until an account is going to be evicted due to rent. This information can be derived from the account balance and the config. In this commit, two new RPC endpoints are exposed: EXPERIMENTAL_genesis_config and EXPERIMENTAL_genesis_records. Learn more in PR #2109. # Test plan Added tests to query the endpoints with a happy path and also invalid parameters. * fix: Third wave of fixing nightly tests (#2207) Many timeout tweaks, and small typos. `cross_shard_tx_with_validator_rotation` with 150 block time has extremely long forks (doomslug is disabled), and takes lots of time per iteration. I split it into two: one with 150ms block time, but only 16 iterations, and one with 400ms block time, but all 64 iterations. Both locally take ~45 minutes, will see how long it takes on gcloud. * fix: Fixing old rust multi-node tests (#2210) Fixing the following issues: 1. `ThreadNode` was not properly setting its state on kill 2. `ThreadNode` doesn't properly free up its port, but instead of figuring out why, I just replaced it with `ProcessNode` in the tests that are affected 3. `test_4_20_kill1` wasn't accounting for fees. Disabling fees. 4. In the same test, the same chunk producer was always mapped to the same block producer, and thus killing the second node was making the 0-th chunk producer (who happens to be attached to the 2nd BP) to not be able to have their transactions included. Address it by having 17 seats. Also split the test in two, with one shard and with two shards 5. In multiple places we were using the wrong node to get the access key Test plan --------- Locally `test_4_20_kill1` and `test_*_multiple_nodes` pass, let's see how the next nightly looks * fix: Disabling some tests + small fixes (#2211) Disabling old tests that fail due to the runtime cache unti they are fixed; Changing the `cross_shard_tx_with_validator_rotation` slightly based on its performance on gcloud. (increasing the block prod time speeds up test, because it results in fewer forks, so the 150->200 change is to make more iterations fit. With 150ms it fits ~6 iterations into one hour) * Buildkite CI (#2217) The actual pipeline definition is saved on buildkite ui, this way, we got it shared between stable, beta and master. There'll be some master only builds (nightly release) Test Plans -------------- Buildkite CI should pass * fix(doctest): Refactor comment. (#2219) Remove doctest that was running incorrectly. * fix: Fixing stress.py (#2220) The test was not updated after some config parameters were renamed. Also because of #2195, tx status of lost transactions times out, so adding a workaround into the test for now Disabling the version of stress that messes up with network, because the nightly runner is currently not configured to support the utility I use to stop network between processes Couple other changes: 1. Made `node_restart` worker not restart the node if no blocks were produced in the meantime. It is needed because with only two nodes after a long restart doomslug can take a while to recover (it is equivalent to half the network shutting down and restarting after some delay). Block production worker will fail the test if the block production actually stalls 2. For the same reason increased the tolerated delays for block production 3. Limited how many txs are sent per iteration of tx worker, since due to #2195 it takes one second to query one transaction, and if the test finished in the middle of querying, the allowed one minute for workers to stop is not sufficient for the tx worker. 4. Also generally increasing the allowance from 1m to 2m at the end, since the tx worker at the end of the test might take some time before it even starts querying the transaction outcomes * Try setting up GitPod (#2190) * Try setting up GitPod * Try pre-building nearcore in Docker image * Remove rustup command (nightly should already be available) * Update location for nearcore prebuild * Fix the way cargo is executed in Dockerfile * Do cargo test as well when building docker image * fix(epoch manager): Fix fishermen unstake (#2212) * Fix fishermen unstake * fix kickout set * add test * fix coverage, badge, docker release on in master branch (#2218) Equivalent coverage and docker image release from gitlab ci. Except release is a docker image release and will also add s3 release in future, since gcloud storage requires google login, which is inconvenient for used in scripts Test Plans -------------- coverage only in master, beta, stable branch and docker release only in master branch (For test purpose also this cov-release branch but will be removed). badge updated. docker release in beta/stable branch will be added but not ready now as tests in beta/stable is more strict * feat(adversary): adv_disable_doomslug + tests update * fix(epoch manager): Fix validator kickout set (#2214) * Fix validator kickout set * fix stake change * fix(epoch_manager): Some fixes and comments (#2222) Some fixes, refactoring and comments. Test plan --------- Run existing tests Add more asserts * fix: Increasing tolerance in stress.py, and fixing nightly.txt (#2228) Further investigation of `stress.py` failures in nightly shows that all the workers appear to work as expected, but frequent restarts cause block production to be delayed more than the current tolerance. Locally spradically also more than 50% of transactions get lost if the node get restarted too frequently. Increasing both tolerances. Also making the prints unbuffered, otherwise several workers in the nightly runs don't flush their outputs. Finally, fixing a typo in the nightly.txt Test plan --------- Locally `stress.py` doesn't fail, so there's no easy way to test whether the new tolerances would be sufficient. It also doesn't completely fix all the known issues, there are some other failures in stress.py that I haven't gotten to yet * Add doc test run in CI (#2230) Add doc test runs in CI Test Plan ------------ Should see doc test log at begining of other tests * fix(runtime): Fix state change check during view call (#2229) Fixes #2226 The current behavior of a view call is to prohibit state changes. This change moves check from the end of the state viewer to VMLogic by prohibiting state changes functions during view calls. ## Test plan: - Fixed state change tests. Previously the test didn't verify error type. The error was `MethodNotFound` and also alice account was wrong. - Added unit tests for the new prohibited methods. * Bump Borsh and runtime version (#2232) * Bump Borsh version * Nit * Bump Borsh versions * Bump borsh more * Add binary release script (#2235) scripts for binary release Test Plan ------------ Download uploaded binary in a few popular linux vm and see if they works * Skip phantom in genesis serialization (#2238) * Improve 100 node test to not reserve static ip and firewall rule (#2241) Two people run 100 node together is often blocked by firewall rules limit and static ip address limit (The recent fail by @mfornet is this case). Made they do not reserve ip address and use organization global firewall rules would fix this. Test Plans -------------- The same setting has been proved to work in create devnet nodes. * feat(chain): Improved `changes` RPC (#2148) Resolves: #2034 and #2048 The changes RPC API was broken implementation-wise (#2048), and design-wise (#2034). This version has meaningful API design for all the exposed data, and it is also tested better. This PR is massive since, initially, we missed quite a point of exposing deserialized internal data (like account info, access key used to be returned as a Borsh-serialized blob, which is useless for the API user as they don't have easy access to the schema of those structures). ## Test plan I did not succeed in writing Rust tests as we used a mocked runtime there. Thus, I had extended end-to-end tests (pytest) with: * Test for account changes on account creation * Test for access key changes on account creation and access key removal * Test for code changes * Test for several transactions on the same block/chunk Co-authored-by: Anton Bukov <k06aaa@gmail.com> Co-authored-by: Maksym Zavershynskyi <35039879+nearmax@users.noreply.github.com> Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com> Co-authored-by: AnaisUrlichs <33576047+AnaisUrlichs@users.noreply.github.com> Co-authored-by: Illia Polosukhin <illia@nearprotocol.com> Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com> Co-authored-by: Lazaridis <59408072+lazaridiscom@users.noreply.github.com> Co-authored-by: Evgeny Kuzyakov <ek@nearprotocol.com> Co-authored-by: mikhailOK <mikhail.kever@gmail.com> Co-authored-by: Marcelo Fornet <mfornet94@gmail.com> Co-authored-by: Alexander Skidanov <skidanov.alexander@gmail.com> Co-authored-by: Bowen Wang <bowenwang1996@uchicago.edu> Co-authored-by: Vlad Frolov <frolvlad@gmail.com> Co-authored-by: Vladimir Grichina <vgrichina@gmail.com> Co-authored-by: nearprotocol-bulldozer[bot] <56702484+nearprotocol-bulldozer[bot]@users.noreply.github.com> Co-authored-by: Alex Kouprin <kpr@nearprotocol.com>

SkidanovAlex requested a review from bowenwang1996 as a code owner February 28, 2020 23:07

bowenwang1996 approved these changes Feb 28, 2020

View reviewed changes

SkidanovAlex merged commit 2e91de1 into master Feb 29, 2020

bowenwang1996 deleted the fix_nightly3 branch February 29, 2020 07:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Third wave of fixing nightly tests #2207

fix: Third wave of fixing nightly tests #2207

SkidanovAlex commented Feb 28, 2020

fix: Third wave of fixing nightly tests #2207

fix: Third wave of fixing nightly tests #2207

Conversation

SkidanovAlex commented Feb 28, 2020