This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

ParityDb migration tests are flaky #7609

Closed
altonen opened this issue Aug 14, 2023 · 19 comments · Fixed by #7612
Labels
I5-tests Tests need fixing, improving or augmenting.

Comments

@altonen
Contributor

altonen commented Aug 14, 2023

PR #7601 has now failed twice on the parachains_db::upgrade::tests::test_paritydb_migrate_2_to_3 test. The issue is also reproducible locally. If the tests are run in a loop, both test_paritydb_migrate_2_to_3 and test_paritydb_migrate_1_to_2 fail, but not consistently.

The error that is printed:

---- parachains_db::upgrade::tests::test_paritydb_migrate_1_to_2 stdout ----
thread 'parachains_db::upgrade::tests::test_paritydb_migrate_1_to_2' panicked at 'called `Result::unwrap()` on an `Err` value: Locked(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" })', node/service/src/parachains_db/upgrade.rs:387:62
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

These test failures are also observable on master. A bisect revealed #7337 to be the PR that introduced the regression. cc @mrcnski @s0me0ne-unkn0wn

@altonen altonen added the I5-tests Tests need fixing, improving or augmenting. label Aug 14, 2023
@ggwpez
Member

ggwpez commented Aug 14, 2023

Yeah, we may need to comment those out if they impede the development workflow.

@mrcnski
Contributor

mrcnski commented Aug 14, 2023

Trying to reproduce but unsuccessful so far. What command did you run? I'm doing

while cargo ltest test_paritydb_migrate_2_to_3; do :; done

@mrcnski
Contributor

mrcnski commented Aug 14, 2023

OK I get it if I just do while cargo ltest; [...]. Weird.

@altonen
Contributor Author

altonen commented Aug 14, 2023

Trying to reproduce but unsuccessful so far. What command did you run? I'm doing

while cargo ltest test_paritydb_migrate_2_to_3; do :; done

Running

while true; do cargo test || break; done

in node/service/src/parachains_db worked for me.

@mrcnski
Contributor

mrcnski commented Aug 14, 2023

I'm flummoxed; I don't see how these tests can be related to #7337. The tests and the error seem quite specific to paritydb.

called `Result::unwrap()` on an `Err` value: Io(Custom { kind: Other, error: "Error removing COL_SESSION_WINDOW_DATA Locked(Os { code: 35, kind: WouldBlock, message: \"Resource temporarily unavailable\" })" })

@arkpar
Member

arkpar commented Aug 14, 2023

If I understand correctly, this only happens when there are multiple tests running? The error message here means that the database is already open. This could happen if multiple tests use the same temp dir.

In any case, these upgrade tests are probably obsolete. I'd check with the polkadot team if it even makes sense to still maintain the upgrade code.
cc @ordian @sandreim

@mrcnski
Contributor

mrcnski commented Aug 14, 2023

I've confirmed locally the issue was introduced by #7337. Weird.

@ordian
Member

ordian commented Aug 14, 2023

If I understand correctly, this only happens when there are multiple tests running? The error message here means that the database is already open. This could happen if multiple tests use the same temp dir.

In any case, these upgrade tests are probably obsolete. I'd check with the polkadot team if it even makes sense to still maintain the upgrade code. cc @ordian @sandreim

Indeed, the test can be removed now.

@altonen
Contributor Author

altonen commented Aug 14, 2023

I've confirmed locally the issue was introduced by #7337. Weird.

It is weird. Since the branch is still available, it would be possible to go through the entire history and check which commit caused the regression, but if the tests can be removed, I'll create a PR.

@mrcnski
Contributor

mrcnski commented Aug 15, 2023

Yeah I'll go bisect the branch. Partly out of morbid curiosity and partly because I'm afraid there is a deeper issue that could reappear later.

@mrcnski
Contributor

mrcnski commented Aug 15, 2023

The test failures were introduced in dc5d6a0, where we added some other tests in the same crate. The new tests could not run at the same time as each other¹, so I had added some code like this:

// Ensure tests run one at a time so they don't interfere with each other.
static OVERRIDES_LOCK: Mutex<()> = Mutex::new(());
let _lock = OVERRIDES_LOCK.lock()?;

In the next commit this was replaced with the serial_test crate, which does the same thing, and the test failures persisted.
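For reference, the serial_test version looks roughly like this (a sketch only; the test names and bodies are illustrative, not the actual tests from that commit):

use serial_test::serial;

#[test]
#[serial]
fn override_test_a() {
    // Touches process-wide state, so it must not overlap with
    // other #[serial] tests.
}

#[test]
#[serial]
fn override_test_b() {
    // Runs only after override_test_a has finished.
}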

My guess is that this affected the order in which the tests are run, in such a way that the paritydb tests stepped on each other's toes.

Question: is this expected behavior in paritydb or a bug that could manifest in production? I guess in production there is only ever one instance of paritydb? Maybe we should at least catch this error and throw a more descriptive one for the benefit of future devs? @ordian
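One possible shape for that (a sketch only; the open_db helper and the error message are illustrative, not existing code, assuming parity-db's Options::with_columns / Db::open_or_create API):

use std::path::Path;

// Hypothetical wrapper that turns the opaque lock error into something
// that points future devs at the likely cause.
fn open_db(path: &Path, num_columns: u8) -> Result<parity_db::Db, String> {
    let options = parity_db::Options::with_columns(path, num_columns);
    parity_db::Db::open_or_create(&options).map_err(|e| {
        format!(
            "failed to open parity-db at {}: {e}; is another instance (or test) holding the lock on this path?",
            path.display()
        )
    })
}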

Footnotes

  1. though in retrospect there is a slight tweak that could be made so they don't step on each other's toes.

@arkpar
Member

arkpar commented Aug 15, 2023

Question: is this expected behavior in paritydb or a bug that could manifest in production? I guess in production there is only ever one instance of paritydb? Maybe we should at least catch this error and throw a more descriptive one for the benefit of future devs? @ordian

Multiple database instances should not interfere unless they use the same path. Each test must use a unique temp dir. No serial_test or mutex hacks should be required.
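A minimal sketch of that pattern with the tempfile crate (the test name and column count are placeholders, not the real migration test; assumes parity-db's Options::with_columns / Db::open_or_create API):

use tempfile::TempDir;

#[test]
fn opens_db_in_unique_dir() {
    // Each test gets its own directory, so two parity-db instances never
    // contend for the same lock file. The dir is deleted on drop.
    let dir = TempDir::new().unwrap();
    let options = parity_db::Options::with_columns(dir.path(), 1);
    let db = parity_db::Db::open_or_create(&options).unwrap();
    drop(db); // closing the handle should also release the on-disk lock
}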

@mrcnski
Contributor

mrcnski commented Aug 15, 2023

@arkpar If I understand correctly, the paritydb tests which have now been removed were simply not written correctly, so there's no actual bug there? BTW, the hacks were for tests related to #7337 (totally unrelated to paritydb), and on further inspection they do actually need to be there; using a unique temp dir is not enough.

@arkpar
Member

arkpar commented Aug 15, 2023

I don't see anything wrong with the tests. They should have worked. I've looked at the tempfile crate to see if it could produce the same path if called simultaneously from two different threads, but it seems to be all right.
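For what it's worth, that check can be reduced to something like this (a sketch; the assertion and program are mine, not code from the repo):

use std::thread;

fn main() {
    // Create temp dirs from two threads at roughly the same time and
    // verify the resulting paths never collide.
    let a = thread::spawn(|| tempfile::tempdir().unwrap());
    let b = thread::spawn(|| tempfile::tempdir().unwrap());
    let (a, b) = (a.join().unwrap(), b.join().unwrap());
    assert_ne!(a.path(), b.path());
    println!("{}\n{}", a.path().display(), b.path().display());
}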

I've tried reproducing the issue with while true; do cargo test || break; done on macOS, but was unable to. So maybe the issue is platform-dependent. I'd add some logs to see if the paths are indeed different. Setting RUST_LOG=parity-db=debug and running cargo test --release -- --nocapture could also shed some light.

@ggwpez
Member

ggwpez commented Aug 15, 2023

Someone sent me this log output from a parachain DB migration:

Aug 11 13:14:21 region2-collator-us-1 systemd[1]: Started watr Systemd Service.
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: CLI parameter `--execution` has no effect anymore and will be removed in the future!
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.648  INFO main sc_cli::runner: Watr Collator
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.649  INFO main sc_cli::runner: ✌️   version 1.0.0-6ec9229d97c
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.649  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2023-2023
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.649  INFO main sc_cli::runner: 📋 Chain specification: Watr Network
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.649  INFO main sc_cli::runner: 🏷   Node name: region2-collator-us-1
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.649  INFO main sc_cli::runner: 👤 Role: AUTHORITY
Aug 11 13:14:21 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:21.649  INFO main sc_cli::runner: 💾 Database: RocksDb at /home/polkadot/.local/share/polkadot/chains/watr_network/db/full
Aug 11 13:14:23 region2-collator-us-1 watr[2020990]: CLI parameter `--execution` has no effect anymore and will be removed in the future!
Aug 11 13:14:23 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:23.804  INFO main watr_node::command: Parachain id: Id(2058)
Aug 11 13:14:23 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:23.804  INFO main watr_node::command: Parachain Account: 5Ec4AhNoGMRn8j7rpcumbokVMzWeTNdqw42r7pcH8dtqH29G
Aug 11 13:14:23 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:23.804  INFO main watr_node::command: Parachain genesis state: 0x00000000000000000000000000000000000000000000000000000000000000000032d52a557fc0b1e3b6ce56981091c11c652646eb3b39733928b143f2993fe47003170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c11131400
Aug 11 13:14:23 region2-collator-us-1 watr[2020990]: 2023-08-11 13:14:23.804  INFO main watr_node::command: Is collating: yes
Aug 11 13:15:36 region2-collator-us-1 watr[2020990]: 2023-08-11 13:15:36.803  INFO main sub-libp2p: [Relaychain] 🏷   Local node identity is: 12D3KooWPELBXukq5cju5VKoKaFcRe5r1ti8jBTXeQAEoXZTjHyk
Aug 11 13:15:36 region2-collator-us-1 watr[2020990]: 2023-08-11 13:15:36.818  INFO main sub-libp2p: [Relaychain] 🔍 Discovered new external address for our node: /ip4/3.93.142.105/tcp/30333/p2p/12D3KooWPELBXukq5cju5VKoKaFcRe5r1ti8jBTXeQAEoXZTjHyk
Aug 11 13:15:36 region2-collator-us-1 watr[2020990]: 2023-08-11 13:15:36.821  INFO main parachain::db: [Relaychain] Migrating parachains db from version 1 to version 2 ...
Aug 11 13:15:50 region2-collator-us-1 watr[2020990]: 2023-08-11 13:15:50.022  INFO main parachain::db: [Relaychain] Migration complete!
Aug 11 13:15:50 region2-collator-us-1 watr[2020990]: 2023-08-11 13:15:50.025  WARN tokio-runtime-worker sc_service::builder: [Relaychain] The NetworkStart returned as part of `build_network` has been silently dropped
Aug 11 13:15:50 region2-collator-us-1 watr[2020990]: Error: Service(Application(Application(Io(Custom { kind: Other, error: Error { message: "Invalid argument: Column families not opened: col5, col4, col3, col2, col1, col0" } }))))
Aug 11 13:15:50 region2-collator-us-1 systemd[1]: watr.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 13:15:50 region2-collator-us-1 systemd[1]: watr.service: Failed with result 'exit-code'.

There is a question in the General Element chat from Peter W. about it. Could this be related?

@mrcnski
Contributor

mrcnski commented Aug 16, 2023

@arkpar I was able to reproduce on an M1 Mac as well as Linux, but it is flaky. Try cargo clean first; otherwise, also try the next commit 6ef14fe.

@mrcnski
Contributor

mrcnski commented Aug 16, 2023

@ggwpez No idea if there is a connection, but IMO it should be investigated. Reopening.

@mrcnski mrcnski reopened this Aug 16, 2023
@arkpar
Member

arkpar commented Aug 16, 2023

There is a question in the General Element chat from Peter W. about it. Could this be related?

I don't think so. This is a RocksDB error. If this can be reproduced, please file a separate issue.

@arkpar
Member

arkpar commented Aug 17, 2023

Added an explicit DB unlock in paritytech/parity-db#218, which resolves this. Normally, file locks should be removed automatically once the file handle is closed. For some reason, when running tests, the lock may take some time to disappear on its own.
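For illustration, the underlying behaviour can be reproduced with plain advisory file locks (a sketch using the fs2 crate, not parity-db internals; on Unix, contention surfaces as the same "Resource temporarily unavailable" error seen in the test output):

use fs2::FileExt;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let dir = tempfile::tempdir()?;
    let path = dir.path().join("lock");

    let first = File::create(&path)?;
    first.try_lock_exclusive()?; // first handle takes the lock

    let second = File::create(&path)?;
    // While `first` is still open, this fails; on Unix the error is
    // EWOULDBLOCK, i.e. "Resource temporarily unavailable".
    assert!(second.try_lock_exclusive().is_err());

    drop(first); // closing the handle releases the lock ...
    second.try_lock_exclusive()?; // ... so now it succeeds
    Ok(())
}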

@arkpar arkpar closed this as completed Aug 17, 2023