This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Mitigation of SIGBUS #2440

Merged: 14 commits merged into master from ser-sigbus-fix on Feb 15, 2021

Conversation

pepyakin (Contributor):

On Rococo we observe that some nodes constantly crash with a SIGBUS signal. The coredump shows the following stack trace:

  * frame #0: 0x00005622f1602294 rococo-20210212`shared_memory::conf::SharedMemConf::create::h3b4bdaf4fd303997 + 1140
    frame #1: 0x00005622f15f3063 rococo-20210212`polkadot_parachain::wasm_executor::validation_host::ValidationHost::validate_candidate::h841dbe531d5b4b1f + 2371
    frame #2: 0x00005622f15f0837 rococo-20210212`polkadot_parachain::wasm_executor::validation_host::ValidationPool::validate_candidate_custom::hc64b3ac33eb43727 + 407
    frame #3: 0x00005622f15f0605 rococo-20210212`polkadot_parachain::wasm_executor::validation_host::ValidationPool::validate_candidate::h250300cc4b686d08 + 661
    frame #4: 0x00005622ef1f337b rococo-20210212`polkadot_parachain::wasm_executor::validate_candidate::h0dd7d3b4f578e7df + 379
    frame #5: 0x00005622f018aa86 rococo-20210212`polkadot_node_core_candidate_validation::validate_candidate_exhaustive::h601288254d015d57 + 1814
    frame #6: 0x00005622ef9d51df rococo-20210212`_$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h2c834a476e8bd245 + 335
    frame #7: 0x00005622f092f3e1 rococo-20210212`_$LT$sc_service..task_manager..prometheus_future..PrometheusFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hb22887fdbf627ea5 + 65
    frame #8: 0x00005622f092ab4b rococo-20210212`_$LT$futures_util..future..select..Select$LT$A$C$B$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hac8861c9054cdc28 + 75
    frame #9: 0x00005622f0917b66 rococo-20210212`_$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hb8a5b16daeea0b8a + 230
    frame #10: 0x00005622f091aba6 rococo-20210212`_$LT$tracing_futures..Instrumented$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h869f77cae2e9183b (.llvm.13681125076900316595) + 38
    frame #11: 0x00005622f092d4a7 rococo-20210212`_$LT$sc_service..task_manager..WithTelemetrySpan$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h7e367afd349789ec + 135
    frame #12: 0x00005622ef0b5b31 rococo-20210212`std::thread::local::LocalKey$LT$T$GT$::with::hac0311a348c6d2f7 + 97
    frame #13: 0x00005622eef8db9e rococo-20210212`futures_executor::local_pool::block_on::hc1ec056b085a3d59 + 62
    frame #14: 0x00005622ef092d5c rococo-20210212`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::ha0d22998eaf8d416 + 252
    frame #15: 0x00005622ef01327d rococo-20210212`tokio::runtime::task::harness::Harness$LT$T$C$S$GT$::poll::h1dea41e2d7641c36 + 125
    frame #16: 0x00005622f1ebfee1 rococo-20210212`tokio::runtime::blocking::pool::Inner::run::hc38f319445b5d344 + 257
    frame #17: 0x00005622f1ed09a8 rococo-20210212`tokio::runtime::context::enter::h8923a2a871eec0e9 + 104
    frame #18: 0x00005622f1ecefef rococo-20210212`std::sys_common::backtrace::__rust_begin_short_backtrace::h7a3e99e8a2c2c4bd + 79
    frame #19: 0x00005622f1ecf8a0 rococo-20210212`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h895c3678b7c51e76 + 224
    frame #20: 0x00005622f227b59a rococo-20210212`std::sys::unix::thread::Thread::new::thread_start::h587efff279c68ba7 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h09ff301006f1aeca at boxed.rs:1307:9
    frame #21: 0x00005622f227b594 rococo-20210212`std::sys::unix::thread::Thread::new::thread_start::h587efff279c68ba7 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::he79488c8f00b5f31 at boxed.rs:1307
    frame #22: 0x00005622f227b58b rococo-20210212`std::sys::unix::thread::Thread::new::thread_start::h587efff279c68ba7 at thread.rs:71
    frame #23: 0x00007f76b4431fa3 libpthread.so.0`setxid_mark_thread.isra.0 + 99

The error comes from the fact that /dev/shm, the place where shmem files are typically placed on Linux, is full. In this case /dev/shm was backed by a tmpfs with a limited amount of memory. Calling shm_open and ftruncate on a full filesystem actually succeeds, but the first access to the mapped region raises SIGBUS.
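As a hedged illustration of this failure mode (not code from this PR; it assumes the libc crate as a dependency), the sketch below succeeds through shm_open, ftruncate, and mmap even when the tmpfs backing /dev/shm is full, because tmpfs pages are allocated lazily; the SIGBUS only fires when a page is first touched:

    // Sketch of the failure mode, not code from this PR.
    use std::ffi::CString;

    fn main() {
        const LEN: usize = 32 * 1024 * 1024;
        unsafe {
            let name = CString::new("/sigbus-demo").unwrap();
            let fd = libc::shm_open(name.as_ptr(), libc::O_CREAT | libc::O_RDWR, 0o600);
            assert!(fd >= 0, "shm_open failed");

            // Succeeds even on a full tmpfs: it only records the file size,
            // no pages are reserved yet.
            assert_eq!(libc::ftruncate(fd, LEN as libc::off_t), 0, "ftruncate failed");

            let ptr = libc::mmap(
                std::ptr::null_mut(),
                LEN,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED,
                fd,
                0,
            );
            assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");

            // The first store touches a page; if tmpfs has no space left,
            // this is the access that raises SIGBUS.
            *(ptr as *mut u8) = 1;

            libc::munmap(ptr, LEN);
            libc::close(fd);
            libc::shm_unlink(name.as_ptr());
        }
    }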

To mitigate this issue we first decrease the amount of memory required: prior to this PR it was 1 GiB, now it is around 32 MiB. Secondly, as soon as the worker connects to the shared memory we unlink the file; even if that does not stop it from counting towards the total occupied size of /dev/shm, it at least ensures the shm file is not left lying around if something goes wrong.
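The unlink-early half of that mitigation can be sketched roughly as follows (again assuming the libc crate; the function name and structure are illustrative, not the PR's actual API): once the region is mapped, the entry in /dev/shm is no longer needed, so removing it immediately means nothing is left behind even if a worker dies.

    // Hedged sketch of "unlink as soon as the worker is connected";
    // the function name and structure are assumptions, not the PR's code.
    use std::ffi::CStr;

    /// Map an already-created shm object and immediately unlink its name.
    /// The mapping stays valid until munmap; only the /dev/shm entry is
    /// removed, so the object is freed once the last mapping goes away.
    unsafe fn map_and_unlink(name: &CStr, len: usize) -> *mut libc::c_void {
        let fd = libc::shm_open(name.as_ptr(), libc::O_RDWR, 0o600);
        assert!(fd >= 0, "shm_open failed");

        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");

        // Neither the fd nor the name is needed once the mapping exists.
        libc::close(fd);
        libc::shm_unlink(name.as_ptr());

        ptr
    }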

The ideal solution would be to avoid using shared memory altogether. I think we should solve this problem with a better approach to caching, outlined here.

These two changes are combined in a single commit because the new version of
shared-memory no longer provides the functionality we relied on.

Therefore, in order to update that crate, we implement the functionality we
need ourselves, providing a cleaner API along the way.
For some reason it was allocating an entire GiB of memory. I suspect
this has something to do with the current memory size limit of a PVF
execution environment (the prior name suggests that). However, we don't
need anywhere near that amount.

In fact, we could reduce the allocated size even further, but that can
wait for a follow-up.
That will make sure that we don't leak the shmem accidentally.
pepyakin added the A0-please_review (Pull request needs code review), B0-silent (Changes should not be mentioned in any release notes), and C1-low (PR touches the given topic and has a low impact on builders) labels on Feb 15, 2021.
bkchr (Member) left a comment:
Much better interface! :)

parachain/src/wasm_executor/workspace.rs (several review threads, outdated and resolved)
let base_ptr = shmem.as_ptr();
let mut consumed = 0;

let candidate_ready_ev = add_event(base_ptr, &mut consumed, mode);
Member: Should we not ensure that consumed stays less than the total size of the shmem? And if not, throw an error?

pepyakin (Contributor, author), Feb 15, 2021: I figured that this is not important, since we control all the sizes and those events (they are backed by pthread cond vars) are negligibly small compared to the whole size. I can add a check, though.

Member: Fine :)
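A minimal sketch of the kind of check discussed in this thread, under assumed names (the total argument, the error type, and the usage comment are not the PR's actual code):

    // Illustrative bounds check: make sure the layout bookkeeping never
    // runs past the end of the shared memory region.
    fn ensure_fits(consumed: usize, requested: usize, total: usize) -> Result<usize, String> {
        let end = consumed
            .checked_add(requested)
            .ok_or_else(|| "shared memory layout overflowed usize".to_string())?;
        if end > total {
            return Err(format!(
                "shared memory layout needs {} bytes but only {} are available",
                end, total
            ));
        }
        Ok(end)
    }

    // Hypothetical usage before each carve-out, e.g. before add_event
    // advances `consumed`:
    //
    //     ensure_fits(consumed, EVENT_SIZE, shmem.len())?;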

parachain/src/wasm_executor/workspace.rs (two more review threads, outdated and resolved)
rphmeier (Contributor) left a comment:
LGTM

pepyakin (Contributor, author) commented on Feb 15, 2021:
  • localnet with a validator built from this branch produces blocks OK

rphmeier merged commit 69b1058 into master on Feb 15, 2021, and deleted the ser-sigbus-fix branch at 20:40.
};

// maximum memory in bytes
const MAX_PARAMS_MEM: usize = 1024 * 1024; // 1 MiB
Member: Is a 1 MiB hard limit on proof size really enough?

pepyakin (Contributor, author): Yeah, good point. We over-allocate the buffer so it will still fit, but this could be expressed more clearly.

Member: By how much do we over-allocate? Or do you mean that the PoV could be any size and we can always allocate enough memory for it?

pepyakin (Contributor, author): Scratch that; even if we over-allocate, the call data is rejected by the params size check. See the fix in #2445.
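A hedged sketch of such a params size check (the constant comes from the hunk above; the function shape and error handling are assumptions, and the actual change lives in #2445):

    // Sketch only: reject call data that would not fit into the fixed-size
    // params area before it is copied into shared memory.
    const MAX_PARAMS_MEM: usize = 1024 * 1024; // 1 MiB

    fn write_params(dest: &mut [u8], params: &[u8]) -> Result<(), String> {
        if params.len() > MAX_PARAMS_MEM || params.len() > dest.len() {
            return Err(format!(
                "validation params are {} bytes, which exceeds the {} byte limit",
                params.len(),
                MAX_PARAMS_MEM
            ));
        }
        dest[..params.len()].copy_from_slice(params);
        Ok(())
    }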

ordian added a commit that referenced this pull request on Feb 17, 2021:
…ing-work-on-a-small-testnet

* ao-fix-approval-import-tests:
  fix expected ancestry in tests
  fix ordering in determine_new_blocks
  fix infinite loop in determine_new_blocks
  fix test assertion
  fix panic in cache_session_info_for_head
  tests: use future::join
  Clean up sizes for a workspace (#2445)
  Integrate Approval Voting into Overseer / Service / GRANDPA (#2412)
  Mitigation of SIGBUS (#2440)
  node: migrate grandpa voting rule to async api (#2422)
  Initializer + Paras Clean Up Messages When Offboarding (#2413)
  Polkadot companion for 8114 (#2437)
  Companion for substrate#8079 (#2408)
  Bump if-watch to `0.1.8` to fix panic (#2436)
  Notify collators about seconded collation (#2430)
  CancelProxy uses `reject_announcement` instead of `remove_announcement` (#2429)
  Rococo genesis 1337 (#2425)