fix: Third wave of fixing nightly tests #2207

Merged
SkidanovAlex merged 1 commit into master from fix_nightly3 on Feb 29, 2020

SkidanovAlex
Collaborator

Many timeout tweaks, and small typos.
`cross_shard_tx_with_validator_rotation` with a 150ms block time has
extremely long forks (doomslug is disabled), and takes a long time per
iteration. I split it into two: one with a 150ms block time, but only 16
iterations, and one with a 400ms block time, but all 64 iterations. Both
take ~45 minutes locally; we will see how long they take on gcloud.
@SkidanovAlex SkidanovAlex merged commit 2e91de1 into master Feb 29, 2020
@bowenwang1996 bowenwang1996 deleted the fix_nightly3 branch February 29, 2020 07:56
ailisp added a commit that referenced this pull request Mar 11, 2020
* Enable floats but prohibit some CPU architectures (#1941) (#2079)

* Enable floats but prohibit some CPU architectures
* Merge branch 'staging' into enable_floats
* Merge refs/heads/staging into enable_floats
* Avoid overflowing u32 during contract preparation (#1946)
* Update runtime/near-vm-runner/src/runner.rs

Co-Authored-By: Evgeny Kuzyakov <ek@nearprotocol.com>
* Merge branch 'staging' into enable_floats
* Nit

Co-authored-by: Maksym Zavershynskyi <35039879+nearmax@users.noreply.github.com>

* Move sysinfo fix from staging (#2085)

* Move sysinfo fix from staging

* Introduce keccak256 and keccak512 native support (#2072)

* Introduce keccak256 and keccak512 native support

* modify genesis

Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com>

* Create code-of-conduct.md (#1932)

* Create code-of-conduct.md

* Move code of conduct to proper name

Co-authored-by: Illia Polosukhin <illia@nearprotocol.com>

* Ref #2067: querying contract state (#2075)

* Fix #2067: querying contract state

* Change format to {key, value, proof} for state view. Proofs are empty for now

* Change state dump to binary format (#2086)

Instead of dumping state to a JSON file, which is slow, this PR dumps state in a binary format. Fixes #2070.

Test plan
---------
Manually test this with a local node to make sure that the state is preserved correctly after state dumping.

* Add json state dump for debugging (#2120)

* Add json state dump for debugging

* state_dump.json

* fix

* docs: add links to Issues/Milestones (#2144)

* fix(runtime): Fix empty method_names bug (#2188)

Fixes: #2183

# Test plan:
- Extracted the code into utils crate and added unit tests.

* State dump split (#2126)

- Revert binary state dump
- #1922 for separating config and records
- scripts/new-genesis-from-existing-state.sh, which dumps state, calculates the new genesis hash, and uploads to s3
- testnet genesis records are separate from the near binary
- download testnet genesis from s3 in python start_testnet
- check the genesis hash when running testnet
 
Co-authored-by: Evgeny Kuzyakov <h3r0k1ll3r@gmail.com>
Co-authored-by: Bo Yao <bo@nearprotocol.com>

Test Plans
-------------
- `near init --chain-id=testnet --genesis-config near/res/testnet_genesis_config.json --genesis-records <records download from s3> --genesis-hash <expected-hash>` initializes the testnet config from external genesis records
- `near run` panics if `~/.near/genesis_hash` is incorrect
- Testnet can be started with the updated start_testnet.py, which downloads genesis_records from s3
- After stopping a node, `state-viewer dump_genesis` dumps genesis_records, config, and genesis_hash.
- If no state change has happened since the genesis block, loading the dumped genesis_records/config gives the same genesis_hash, and dumping again generates the same genesis_records, config & genesis_hash.
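
The hash check in the test plan above can be sketched as follows. This is a minimal illustration, not the nearcore implementation: `genesis_hash` and `check_genesis_hash` are hypothetical helpers, and std's `DefaultHasher` stands in for the real (cryptographic) hash, which this message does not describe.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the genesis hash: hashes the config and all
/// records with std's DefaultHasher. This is NOT a cryptographic hash and
/// is only meant to show that the same inputs give the same hash.
fn genesis_hash(config: &str, records: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    config.hash(&mut hasher);
    records.hash(&mut hasher);
    hasher.finish()
}

/// Compares the recomputed hash with the expected one. The caller decides
/// what to do on mismatch (the test plan above says `near run` panics).
fn check_genesis_hash(config: &str, records: &[String], expected: u64) -> Result<(), String> {
    let actual = genesis_hash(config, records);
    if actual == expected {
        Ok(())
    } else {
        Err(format!("genesis hash mismatch: expected {}, got {}", expected, actual))
    }
}
```

Dumping and reloading unchanged records corresponds to calling `genesis_hash` twice on the same inputs, which yields the same value.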

* Avoid panic while panicking because actix isn't running (#2178)

Some of our tests don't have actix running. @Kouprin found that when an assert fails in such a test, it is not possible to see which test failed, only `thread panicked while panicking`. This change checks whether actix is running and shuts it down only if it is.

Test Plan
---------
```
#[test]
fn test_assert() {
    // Install the stop-on-panic hook under test.
    init_stop_on_panic();
    // Deliberately fail so that the panic hook runs.
    assert!(false);
}
```
This should report the assertion failure instead of `thread panicked while panicking`.

* Fix epoch manager bug (#2193)

* feat(runtime): Validate incoming receipts #2155

If an invalid chunk makes it into a block, the invalid incoming receipts might be forwarded to different shards. Ideally the chunk should be challenged and the block reverted, but before this happens we need to handle invalid receipts in the Runtime.

Another possibility is to create invalid receipts in the state directly and then create a challenge on this invalid state, so any field in the state can potentially contain an invalid value. If an invalid receipt is present in the delayed receipts state, we consider it a StorageError.

Fixes: #1850

# Test plan

- Added 2 tests to cover handling of invalid receipts in the Runtime
- Filed an issue to handle error in NightshadeRuntime #2152

* fix: Use stderr for logs generated with tracing. (#2198)

* fix: Fixing nightly python tests (#2201)

* `block_production.py` started failing because blocks are now
produced faster, and by the first time it checked the heights they
already exceeded 2, causing a poorly stated assert to trigger
* `staking*.py` were never adjusted for yocto-near
* `state_sync*.py`: both header sync and state sync were broken.
Header sync: during GC we were cleaning up the mapping from heights to
headers, which header sync needs.
State sync: we changed the state sync to always happen at the epoch
boundary, but did it in a wrong way: the hash we stored locally was
the one from before pushing to the epoch boundary, and thus the node was
rejecting any incoming state responses
* `state_sync1.py` was also failing because it was relying on one out of
two validating nodes being able to produce blocks, which it cannot,
because of Doomslug
* `skip_epoch.py` was expecting one validator to be producing blocks,
which with doomslug requires such validator to have more than half the
stake. Also was not updated for yocto-near
* `one_val.py` similarly was expecting one validator to be producing
blocks, and thus needed that validator's stake bumped to work with DS
* `lightclnt.py`: with doomslug, blocks are produced faster, and by the
time the test started a few blocks were already produced. Making the
test expect such behavior
* `block_sync.py` was failing because Doomslug requires
`max_block_production_delay` to be at least 2x
`min_block_production_delay`.
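
Two of the Doomslug constraints mentioned in the list above (a single producing validator needs more than half the stake, and `max_block_production_delay` must be at least 2x `min_block_production_delay`) can be expressed as simple checks. This is a sketch only: the struct and function names are hypothetical, and the millisecond and integer stake units are assumptions for illustration.

```rust
/// Hypothetical holder for the two delay parameters named above,
/// expressed in milliseconds for this sketch.
struct BlockProductionDelays {
    min_block_production_delay_ms: u64,
    max_block_production_delay_ms: u64,
}

/// True when the max delay is at least twice the min delay.
fn delays_satisfy_doomslug(delays: &BlockProductionDelays) -> bool {
    delays.max_block_production_delay_ms >= 2 * delays.min_block_production_delay_ms
}

/// True when one validator holds strictly more than half of the total
/// stake. Written as a comparison against the remaining stake so that it
/// cannot overflow for large (for example yocto-near) amounts.
fn can_produce_blocks_alone(validator_stake: u128, total_stake: u128) -> bool {
    validator_stake > total_stake.saturating_sub(validator_stake)
}
```

A test setup with one validator at exactly half the stake would fail the second check, which matches the failures described for `skip_epoch.py` and `one_val.py`.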

Also disabling `network_stress`, because the current runner crashes
trying to launch it and skips all the subsequent tests

Test Plan
---------
All the above python tests pass.

* fix: Fix pytest after log to stderr (#2203)

Similar to #1985 but now we changed back to log to stderr

Test Plans
-------------
Run pytest locally; it no longer fails with `AssertionError: node dirs: 0 num_nodes: 4 num_observers: 0`

* fix: fixing cross_shard_tx* tests and small fixes for some other nightly failures (#2204)

There were several issues in the test infra:
1. The peer info in the client test infra was the largest height and the
largest score ever observed. If a block with a higher score but lower
height than the previous tip was created, it would report incorrect peer
info, and peers would attempt header sync believing the peer has higher
header head height, and such header sync would fail.
2. The tests tamper with the FG, and the last final block could be way
more than 5 epochs in the past. That makes creating light client blocks
potentially require blocks from 5 epoch lengths ago. I'm just making all
nodes in cross_shard_tx archival. In practice if one epoch has been
lasting for five epoch lengths, we have bigger problems.
3. We historically see cross_shard_tx tests fail with
`InvalidBlockHeight` error when a block is more than epoch length ahead
of the previous block. Since that check is a heuristic anyway, I'm
doubling the distance, to reduce the flakiness of the test.
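
Issue 1 above can be illustrated with a small sketch. It assumes, for illustration only, that the head is the block with the highest score; the `Tip` type and both functions are hypothetical and are not the test infra code.

```rust
/// Hypothetical (height, score) pair describing a block.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tip {
    height: u64,
    score: u64,
}

/// Problematic reporting: takes the largest height and the largest score
/// independently, so the result may describe a block that does not exist.
fn peer_info_independent_max(blocks: &[Tip]) -> Option<Tip> {
    let height = blocks.iter().map(|b| b.height).max()?;
    let score = blocks.iter().map(|b| b.score).max()?;
    Some(Tip { height, score })
}

/// Consistent reporting: takes height and score from the same block,
/// here the one with the highest score (an assumption of this sketch).
fn peer_info_from_head(blocks: &[Tip]) -> Option<Tip> {
    blocks.iter().copied().max_by_key(|b| b.score)
}
```

With blocks `(height 10, score 5)` and `(height 8, score 7)`, the first function reports height 10 with score 7, which is the kind of mismatch that makes peers attempt a header sync that cannot succeed.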

Separately, increasing the timeouts for NFG tests, they take more than
15 minutes.

Also bumping timeouts for the `test_all_chunks_accepted_1000*` tests,
it's clear that they need at least 2000 / 4000 / 1000 seconds to
complete, I set the timeouts to 3600 / 7200 / 1800 for some extra room.
Also the one that requires 7200 (`*_slow`) seems to provide no value
compared to the base test, and is the slowest test in our entire suite,
so I completely disable it.

Separately, fixing the issue with state sync tests, where the transition
to state sync happens before the log tracker is initialized, and the
check for the transition in the log later fails

Slightly bumping block production time in
`test_catchup_sanity_blocks_produced`; it works on a local machine, but
the gcloud runner doesn't keep up.

Test plan
---------
All cross_shard_tx* tests passed at least three runs.
If they are flaky, nightly will catch that.

* feat(chain): Exposed genesis config + runtime config and genesis records via RPC

Resolves #2007 and #2025

We needed to make Near config query-able through node RPC. Specifically,
one of our clients wanted to know how many blocks remain until an
account is going to be evicted due to rent. This information can be
derived from the account balance and the config.

In this commit, two new RPC endpoints are exposed:
EXPERIMENTAL_genesis_config and EXPERIMENTAL_genesis_records. Learn more
in PR #2109.
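
As a rough illustration, a JSON-RPC 2.0 request body for one of these methods could be built as below. The helper is hypothetical, and the empty `params` array is an assumption: the parameters each endpoint actually accepts are described in PR #2109 and are not reproduced here.

```rust
/// Builds a JSON-RPC 2.0 request body with an empty `params` array.
/// Assumes `method` and `id` contain no characters that need JSON
/// escaping; a real client would use a JSON library instead.
fn rpc_request_body(method: &str, id: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":"{}","method":"{}","params":[]}}"#,
        id, method
    )
}
```

For example, `rpc_request_body("EXPERIMENTAL_genesis_config", "1")` produces a body that can be sent with an HTTP POST to the node's RPC address.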

# Test plan

Added tests to query the endpoints with a happy path and also invalid
parameters.

* fix: Third wave of fixing nightly tests (#2207)

Many timeout tweaks, and small typos.
`cross_shard_tx_with_validator_rotation` with a 150ms block time has
extremely long forks (doomslug is disabled), and takes a long time per
iteration. I split it into two: one with a 150ms block time, but only 16
iterations, and one with a 400ms block time, but all 64 iterations. Both
take ~45 minutes locally; we will see how long they take on gcloud.
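
The split can be pictured as one test body driven by two parameter sets. The sketch below only mirrors the numbers from this message; the struct, constants, and `run_variant` helper are hypothetical and do not reflect the actual test code.

```rust
/// Parameters for one variant of the split test (hypothetical struct).
#[derive(Clone, Copy, Debug, PartialEq)]
struct RotationTestParams {
    block_prod_time_ms: u64,
    iterations: usize,
}

/// Fast blocks, fewer iterations.
const FAST_BLOCKS: RotationTestParams =
    RotationTestParams { block_prod_time_ms: 150, iterations: 16 };

/// Slower blocks, all iterations.
const SLOW_BLOCKS: RotationTestParams =
    RotationTestParams { block_prod_time_ms: 400, iterations: 64 };

/// Runs `body` once per iteration, passing the iteration index and the
/// block production time, and returns how many iterations ran.
fn run_variant(params: RotationTestParams, mut body: impl FnMut(usize, u64)) -> usize {
    for iteration in 0..params.iterations {
        body(iteration, params.block_prod_time_ms);
    }
    params.iterations
}
```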

* fix: Fixing old rust multi-node tests (#2210)

Fixing the following issues:
1. `ThreadNode` was not properly setting its state on kill
2. `ThreadNode` doesn't properly free up its port, but instead of
figuring out why, I just replaced it with `ProcessNode` in the tests
that are affected
3. `test_4_20_kill1` wasn't accounting for fees. Disabling fees.
4. In the same test, the same chunk producer was always mapped to the
same block producer, and thus killing the second node was making the
0-th chunk producer (who happens to be attached to the 2nd BP) to not be
able to have their transactions included. Address it by having 17 seats.
Also split the test in two, with one shard and with two shards
5. In multiple places we were using the wrong node to get the access key

Test plan
---------
Locally `test_4_20_kill1` and `test_*_multiple_nodes` pass, let's see
how the next nightly looks

* fix: Disabling some tests + small fixes (#2211)

Disabling old tests that fail due to the runtime cache until they are
fixed.

Changing the `cross_shard_tx_with_validator_rotation` slightly based on
its performance on gcloud.
(Increasing the block production time speeds up the test, because it results in
fewer forks, so the 150->200 change is to make more iterations fit.
With 150ms it fits ~6 iterations into one hour.)

* Buildkite CI (#2217)

The actual pipeline definition is saved in the Buildkite UI; this way it is shared between stable, beta and master. There will be some master-only builds (nightly release).

Test Plans
--------------
Buildkite CI should pass

* fix(doctest): Refactor comment. (#2219)

Remove doctest that was running incorrectly.

* fix: Fixing stress.py (#2220)

The test was not updated after some config parameters were renamed.
Also because of #2195, tx status of lost transactions times out, so
adding a workaround into the test for now

Disabling the version of stress that messes with the network, because the
nightly runner is currently not configured to support the utility I use
to stop network between processes

Couple other changes:
1. Made `node_restart` worker not restart the node if no blocks were
produced in the meantime. It is needed because with only two nodes after
a long restart doomslug can take a while to recover (it is equivalent to
half the network shutting down and restarting after some delay).
Block production worker will fail the test if the block production
actually stalls
2. For the same reason increased the tolerated delays for block
production
3. Limited how many txs are sent per iteration of tx worker, since due
to #2195 it takes one second to query one transaction, and if the test
finished in the middle of querying, the allowed one minute for workers
to stop is not sufficient for the tx worker.
4. Also generally increasing the allowance from 1m to 2m at the end,
since the tx worker at the end of the test might take some time before
it even starts querying the transaction outcomes
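
Changes 1 and 3 above amount to two small guards, sketched here. The names are hypothetical and the real worker logic lives in the Python test `stress.py`; this only illustrates the conditions.

```rust
/// Remembers the chain height seen at the last restart (hypothetical).
#[derive(Default)]
struct RestartGuard {
    height_at_last_restart: Option<u64>,
}

impl RestartGuard {
    /// Allows a restart only if at least one block was produced since the
    /// previous restart, and records the height when it allows one.
    fn should_restart(&mut self, current_height: u64) -> bool {
        match self.height_at_last_restart {
            Some(last) if current_height <= last => false,
            _ => {
                self.height_at_last_restart = Some(current_height);
                true
            }
        }
    }
}

/// Caps how many transactions the tx worker sends in one iteration.
fn txs_this_iteration(pending: usize, max_per_iteration: usize) -> usize {
    pending.min(max_per_iteration)
}
```

The cap keeps the time spent querying outcomes bounded, which matters when each query takes about one second because of #2195.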

* Try setting up GitPod (#2190)

* Try setting up GitPod

* Try pre-building nearcore in Docker image

* Remove rustup command  (nightly should already be available)

* Update location for nearcore prebuild

* Fix the way cargo is executed in Dockerfile

* Do cargo test as well when building docker image

* fix(epoch manager): Fix fishermen unstake (#2212)

* Fix fishermen unstake

* fix kickout set

* add test

* fix coverage, badge, docker release on in master branch (#2218)

Equivalent of the coverage and docker image release from gitlab ci, except that the release is a docker image release. An s3 release will also be added in the future, since gcloud storage requires google login, which is inconvenient for use in scripts.

Test Plans
--------------
Coverage runs only in the master, beta, and stable branches, and docker release only in the master branch (for test purposes also in this cov-release branch, but that will be removed). Badge updated. Docker release in the beta/stable branches will be added later; it is not ready now because tests in beta/stable are more strict.

* feat(adversary): adv_disable_doomslug + tests update

* fix(epoch manager): Fix validator kickout set (#2214)

* Fix validator kickout set

* fix stake change

* fix(epoch_manager): Some fixes and comments (#2222)

Some fixes, refactoring and comments.

Test plan
---------
Run existing tests
Add more asserts

* fix: Increasing tolerance in stress.py, and fixing nightly.txt (#2228)

Further investigation of `stress.py` failures in nightly shows that all
the workers appear to work as expected, but frequent restarts cause
block production to be delayed more than the current tolerance.
Locally, more than 50% of transactions also sporadically get lost if the
node gets restarted too frequently.
Increasing both tolerances.

Also making the prints unbuffered, otherwise several workers in the
nightly runs don't flush their outputs.

Finally, fixing a typo in the nightly.txt

Test plan
---------
Locally `stress.py` doesn't fail, so there's no easy way to test whether
the new tolerances would be sufficient.
It also doesn't completely fix all the known issues, there are some
other failures in stress.py that I haven't gotten to yet

* Add doc test run in CI (#2230)

Add doc test runs in CI

Test Plan
------------
Should see the doc test log at the beginning of other tests

* fix(runtime): Fix state change check during view call (#2229)

Fixes #2226 
The current behavior of a view call is to prohibit state changes. This change moves the check from the end of the state viewer to VMLogic by prohibiting state-changing functions during view calls.
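
The idea of rejecting state-changing host functions up front can be sketched as follows. This is an illustration under assumptions, not the VMLogic code: the `Logic` struct, the `is_view` flag, and the `HostError::ProhibitedInView` variant are names chosen for this sketch.

```rust
use std::collections::HashMap;

/// Illustrative error type for this sketch.
#[derive(Debug, PartialEq)]
enum HostError {
    ProhibitedInView { method_name: String },
}

/// Minimal stand-in for the host logic with an in-memory key-value state.
struct Logic {
    is_view: bool,
    storage: HashMap<Vec<u8>, Vec<u8>>,
}

impl Logic {
    fn new(is_view: bool) -> Self {
        Logic { is_view, storage: HashMap::new() }
    }

    /// State-changing function: fails immediately during a view call
    /// instead of letting the change happen and detecting it afterwards.
    fn storage_write(&mut self, key: &[u8], value: &[u8]) -> Result<(), HostError> {
        if self.is_view {
            return Err(HostError::ProhibitedInView {
                method_name: "storage_write".to_string(),
            });
        }
        self.storage.insert(key.to_vec(), value.to_vec());
        Ok(())
    }

    /// Read-only function: allowed in both view and regular calls.
    fn storage_read(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.storage.get(key)
    }
}
```

Failing at the call site also gives the caller a specific error naming the prohibited function, rather than a generic failure at the end of the view call.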

## Test plan:
- Fixed state change tests. Previously the test didn't verify the error type. The error was `MethodNotFound`, and the alice account was also wrong.
- Added unit tests for the new prohibited methods.

* Bump Borsh and runtime version (#2232)

* Bump Borsh version

* Nit

* Bump Borsh versions

* Bump borsh more

* Add binary release script (#2235)

scripts for binary release

Test Plan
------------
Download the uploaded binary in a few popular Linux VMs and see if it works

* Skip phantom in genesis serialization (#2238)

* Improve 100 node test to not reserve static ip and firewall rule (#2241)

Two people running the 100 node test at the same time are often blocked by the firewall rules limit and the static ip address limit (the recent failure by @mfornet is such a case). Not reserving ip addresses and using the organization's global firewall rules fixes this.

Test Plans
--------------
The same setting has been proven to work when creating devnet nodes.

* feat(chain): Improved `changes` RPC (#2148)

Resolves: #2034 and #2048

The changes RPC API was broken implementation-wise (#2048), and design-wise (#2034).

This version has a meaningful API design for all the exposed data, and it is also better tested. This PR is massive because initially we missed the point of exposing deserialized internal data (for example, account info and access keys used to be returned as Borsh-serialized blobs, which are useless to API users because they don't have easy access to the schema of those structures).

## Test plan

I did not succeed in writing Rust tests as we use a mocked runtime there. Thus, I extended the end-to-end tests (pytest) with:

* Test for account changes on account creation
* Test for access key changes on account creation and access key removal
* Test for code changes
* Test for several transactions on the same block/chunk

Co-authored-by: Anton Bukov <k06aaa@gmail.com>
Co-authored-by: Maksym Zavershynskyi <35039879+nearmax@users.noreply.github.com>
Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com>
Co-authored-by: AnaisUrlichs <33576047+AnaisUrlichs@users.noreply.github.com>
Co-authored-by: Illia Polosukhin <illia@nearprotocol.com>
Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com>
Co-authored-by: Lazaridis <59408072+lazaridiscom@users.noreply.github.com>
Co-authored-by: Evgeny Kuzyakov <ek@nearprotocol.com>
Co-authored-by: mikhailOK <mikhail.kever@gmail.com>
Co-authored-by: Marcelo Fornet <mfornet94@gmail.com>
Co-authored-by: Alexander Skidanov <skidanov.alexander@gmail.com>
Co-authored-by: Bowen Wang <bowenwang1996@uchicago.edu>
Co-authored-by: Vlad Frolov <frolvlad@gmail.com>
Co-authored-by: Vladimir Grichina <vgrichina@gmail.com>
Co-authored-by: nearprotocol-bulldozer[bot] <56702484+nearprotocol-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Alex Kouprin <kpr@nearprotocol.com>
2 participants