This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Mitigation of SIGBUS #2440

Merged: 14 commits merged into master from ser-sigbus-fix on Feb 15, 2021

Conversation

pepyakin (Contributor):

On Rococo we observe that some nodes constantly crash with a SIGBUS signal. The coredump shows the following stack trace:

  * frame #0: 0x00005622f1602294 rococo-20210212`shared_memory::conf::SharedMemConf::create::h3b4bdaf4fd303997 + 1140
    frame #1: 0x00005622f15f3063 rococo-20210212`polkadot_parachain::wasm_executor::validation_host::ValidationHost::validate_candidate::h841dbe531d5b4b1f + 2371
    frame #2: 0x00005622f15f0837 rococo-20210212`polkadot_parachain::wasm_executor::validation_host::ValidationPool::validate_candidate_custom::hc64b3ac33eb43727 + 407
    frame #3: 0x00005622f15f0605 rococo-20210212`polkadot_parachain::wasm_executor::validation_host::ValidationPool::validate_candidate::h250300cc4b686d08 + 661
    frame #4: 0x00005622ef1f337b rococo-20210212`polkadot_parachain::wasm_executor::validate_candidate::h0dd7d3b4f578e7df + 379
    frame #5: 0x00005622f018aa86 rococo-20210212`polkadot_node_core_candidate_validation::validate_candidate_exhaustive::h601288254d015d57 + 1814
    frame #6: 0x00005622ef9d51df rococo-20210212`_$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h2c834a476e8bd245 + 335
    frame #7: 0x00005622f092f3e1 rococo-20210212`_$LT$sc_service..task_manager..prometheus_future..PrometheusFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hb22887fdbf627ea5 + 65
    frame #8: 0x00005622f092ab4b rococo-20210212`_$LT$futures_util..future..select..Select$LT$A$C$B$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hac8861c9054cdc28 + 75
    frame #9: 0x00005622f0917b66 rococo-20210212`_$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hb8a5b16daeea0b8a + 230
    frame #10: 0x00005622f091aba6 rococo-20210212`_$LT$tracing_futures..Instrumented$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h869f77cae2e9183b (.llvm.13681125076900316595) + 38
    frame #11: 0x00005622f092d4a7 rococo-20210212`_$LT$sc_service..task_manager..WithTelemetrySpan$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h7e367afd349789ec + 135
    frame #12: 0x00005622ef0b5b31 rococo-20210212`std::thread::local::LocalKey$LT$T$GT$::with::hac0311a348c6d2f7 + 97
    frame #13: 0x00005622eef8db9e rococo-20210212`futures_executor::local_pool::block_on::hc1ec056b085a3d59 + 62
    frame #14: 0x00005622ef092d5c rococo-20210212`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::ha0d22998eaf8d416 + 252
    frame #15: 0x00005622ef01327d rococo-20210212`tokio::runtime::task::harness::Harness$LT$T$C$S$GT$::poll::h1dea41e2d7641c36 + 125
    frame #16: 0x00005622f1ebfee1 rococo-20210212`tokio::runtime::blocking::pool::Inner::run::hc38f319445b5d344 + 257
    frame #17: 0x00005622f1ed09a8 rococo-20210212`tokio::runtime::context::enter::h8923a2a871eec0e9 + 104
    frame #18: 0x00005622f1ecefef rococo-20210212`std::sys_common::backtrace::__rust_begin_short_backtrace::h7a3e99e8a2c2c4bd + 79
    frame #19: 0x00005622f1ecf8a0 rococo-20210212`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h895c3678b7c51e76 + 224
    frame #20: 0x00005622f227b59a rococo-20210212`std::sys::unix::thread::Thread::new::thread_start::h587efff279c68ba7 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h09ff301006f1aeca at boxed.rs:1307:9
    frame #21: 0x00005622f227b594 rococo-20210212`std::sys::unix::thread::Thread::new::thread_start::h587efff279c68ba7 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::he79488c8f00b5f31 at boxed.rs:1307
    frame #22: 0x00005622f227b58b rococo-20210212`std::sys::unix::thread::Thread::new::thread_start::h587efff279c68ba7 at thread.rs:71
    frame #23: 0x00007f76b4431fa3 libpthread.so.0`setxid_mark_thread.isra.0 + 99

The error comes from the fact that /dev/shm, the place where shmem files are typically placed on Linux, is full. In this case /dev/shm was backed by a tmpfs with a limited amount of memory. Calling shm_open and ftruncate on a full filesystem actually succeeds, but the first access to the mapped region raises SIGBUS.
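As a hedged illustration of this failure mode (not code from this PR; it assumes the libc crate as a dependency), the sketch below succeeds through shm_open, ftruncate, and mmap even when the tmpfs backing /dev/shm is full, because tmpfs pages are allocated lazily; the SIGBUS only fires when a page is first touched:

    // Sketch of the failure mode, not code from this PR.
    use std::ffi::CString;

    fn main() {
        const LEN: usize = 32 * 1024 * 1024;
        unsafe {
            let name = CString::new("/sigbus-demo").unwrap();
            let fd = libc::shm_open(name.as_ptr(), libc::O_CREAT | libc::O_RDWR, 0o600);
            assert!(fd >= 0, "shm_open failed");

            // Succeeds even on a full tmpfs: it only records the file size,
            // no pages are reserved yet.
            assert_eq!(libc::ftruncate(fd, LEN as libc::off_t), 0, "ftruncate failed");

            let ptr = libc::mmap(
                std::ptr::null_mut(),
                LEN,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED,
                fd,
                0,
            );
            assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");

            // The first store touches a page; if tmpfs has no space left,
            // this is the access that raises SIGBUS.
            *(ptr as *mut u8) = 1;

            libc::munmap(ptr, LEN);
            libc::close(fd);
            libc::shm_unlink(name.as_ptr());
        }
    }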

To mitigate this issue we first decrease the amount of memory required: prior to this PR it was 1 GiB, now it is around 32 MiB. Secondly, as soon as the worker connects to the shared memory we unlink the file; even if that does not stop it from counting towards the total occupied size of /dev/shm, it at least ensures the shm file is not left lying around if something goes wrong.
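The unlink-early half of that mitigation can be sketched roughly as follows (again assuming the libc crate; the function name and structure are illustrative, not the PR's actual API): once the region is mapped, the entry in /dev/shm is no longer needed, so removing it immediately means nothing is left behind even if a worker dies.

    // Hedged sketch of "unlink as soon as the worker is connected";
    // the function name and structure are assumptions, not the PR's code.
    use std::ffi::CStr;

    /// Map an already-created shm object and immediately unlink its name.
    /// The mapping stays valid until munmap; only the /dev/shm entry is
    /// removed, so the object is freed once the last mapping goes away.
    unsafe fn map_and_unlink(name: &CStr, len: usize) -> *mut libc::c_void {
        let fd = libc::shm_open(name.as_ptr(), libc::O_RDWR, 0o600);
        assert!(fd >= 0, "shm_open failed");

        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");

        // Neither the fd nor the name is needed once the mapping exists.
        libc::close(fd);
        libc::shm_unlink(name.as_ptr());

        ptr
    }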

The ideal solution would be to avoid using shared memory altogether. I think we should solve this problem with a better approach to caching, outlined here.

These two changes are combined in a single commit because the new version of
shared-memory no longer provides the functionality we relied on.

Therefore, in order to update that crate, we implement the functionality we
need ourselves, providing a cleaner API along the way.
For some reason it was allocating an entire GiB of memory. I suspect
this has something to do with the current memory size limit of a PVF
execution environment (the prior name suggests that). However, we don't
need anywhere near that amount.

In fact, we could reduce the allocated size even further, but that can
wait for a follow-up.
That will make sure that we don't leak the shmem accidentally.
pepyakin added the A0-please_review (Pull request needs code review), B0-silent (Changes should not be mentioned in any release notes), and C1-low (PR touches the given topic and has a low impact on builders) labels on Feb 15, 2021.
bkchr (Member) left a comment:
Much better interface! :)

parachain/src/wasm_executor/workspace.rs (several review threads, outdated and resolved)
let base_ptr = shmem.as_ptr();
let mut consumed = 0;

let candidate_ready_ev = add_event(base_ptr, &mut consumed, mode);
Member: Should we not ensure that consumed stays less than the total size of the shmem? And if not, throw an error?

pepyakin (Contributor, author), Feb 15, 2021: I figured that this is not important, since we control all the sizes and those events (they are backed by pthread cond vars) are negligibly small compared to the whole size. I can add a check, though.

Member: Fine :)
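A minimal sketch of the kind of check discussed in this thread, under assumed names (the total argument, the error type, and the usage comment are not the PR's actual code):

    // Illustrative bounds check: make sure the layout bookkeeping never
    // runs past the end of the shared memory region.
    fn ensure_fits(consumed: usize, requested: usize, total: usize) -> Result<usize, String> {
        let end = consumed
            .checked_add(requested)
            .ok_or_else(|| "shared memory layout overflowed usize".to_string())?;
        if end > total {
            return Err(format!(
                "shared memory layout needs {} bytes but only {} are available",
                end, total
            ));
        }
        Ok(end)
    }

    // Hypothetical usage before each carve-out, e.g. before add_event
    // advances `consumed`:
    //
    //     ensure_fits(consumed, EVENT_SIZE, shmem.len())?;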

parachain/src/wasm_executor/workspace.rs (two more review threads, outdated and resolved)
rphmeier (Contributor) left a comment:
LGTM

pepyakin (Contributor, author) commented on Feb 15, 2021:
  • localnet with a validator built from this branch produces blocks OK

rphmeier merged commit 69b1058 into master on Feb 15, 2021, and deleted the ser-sigbus-fix branch at 20:40.
};

// maximum memory in bytes
const MAX_PARAMS_MEM: usize = 1024 * 1024; // 1 MiB
Member: Is a 1 MiB hard limit on proof size really enough?

pepyakin (Contributor, author): Yeah, good point. We over-allocate the buffer so it will still fit, but this could be expressed more clearly.

Member: By how much do we over-allocate? Or do you mean that the PoV could be any size and we can always allocate enough memory for it?

pepyakin (Contributor, author): Scratch that; even if we over-allocate, the call data is rejected by the params size check. See the fix in #2445.
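A hedged sketch of such a params size check (the constant comes from the hunk above; the function shape and error handling are assumptions, and the actual change lives in #2445):

    // Sketch only: reject call data that would not fit into the fixed-size
    // params area before it is copied into shared memory.
    const MAX_PARAMS_MEM: usize = 1024 * 1024; // 1 MiB

    fn write_params(dest: &mut [u8], params: &[u8]) -> Result<(), String> {
        if params.len() > MAX_PARAMS_MEM || params.len() > dest.len() {
            return Err(format!(
                "validation params are {} bytes, which exceeds the {} byte limit",
                params.len(),
                MAX_PARAMS_MEM
            ));
        }
        dest[..params.len()].copy_from_slice(params);
        Ok(())
    }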

ordian added a commit that referenced this pull request on Feb 17, 2021:
…ing-work-on-a-small-testnet

* ao-fix-approval-import-tests:
  fix expected ancestry in tests
  fix ordering in determine_new_blocks
  fix infinite loop in determine_new_blocks
  fix test assertion
  fix panic in cache_session_info_for_head
  tests: use future::join
  Clean up sizes for a workspace (#2445)
  Integrate Approval Voting into Overseer / Service / GRANDPA (#2412)
  Mitigation of SIGBUS (#2440)
  node: migrate grandpa voting rule to async api (#2422)
  Initializer + Paras Clean Up Messages When Offboarding (#2413)
  Polkadot companion for 8114 (#2437)
  Companion for substrate#8079 (#2408)
  Bump if-watch to `0.1.8` to fix panic (#2436)
  Notify collators about seconded collation (#2430)
  CancelProxy uses `reject_announcement` instead of `remove_announcement` (#2429)
  Rococo genesis 1337 (#2425)