These two changes are combined in a single commit because the new version of shared-memory no longer provides the functionality we used. Therefore, in order to update the version of this crate, we implement the functionality we need ourselves, providing a cleaner API along the way.
For some reason it was allocating an entire GiB of memory. I suspect this has something to do with the current memory size limit of a PVF execution environment (the prior name suggests that). However, we don't need anywhere near that amount of memory. In fact, we could reduce the allocated size even further, but that can wait for next time.
That will make sure that we don't leak the shmem accidentally.
Much better interface! :)
```rust
let base_ptr = shmem.as_ptr();
let mut consumed = 0;

let candidate_ready_ev = add_event(base_ptr, &mut consumed, mode);
```
Should we not ensure that `consumed` stays less than the total size of the shmem? And if not, throw an error?
I figured that's not important since we control all the sizes, and those events (they are backed by pthread cond vars) are negligibly small compared to the whole size. I can add a check though.
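For reference, the check discussed here could look roughly like the following. This is a hypothetical sketch, not the actual crate code: the `alloc_region` helper and its signature are invented for illustration, modeling `consumed` as a running cursor into a fixed-size shmem block.

```rust
/// Hypothetical sketch: carve a `size`-byte region out of a shared-memory
/// block of `total` bytes, advancing the `consumed` cursor, and fail
/// instead of silently running past the end of the mapping.
fn alloc_region(consumed: &mut usize, size: usize, total: usize) -> Result<usize, &'static str> {
    let offset = *consumed;
    // checked_add guards against integer overflow on pathological sizes.
    let end = offset.checked_add(size).ok_or("size overflow")?;
    if end > total {
        return Err("shmem exhausted");
    }
    *consumed = end;
    Ok(offset)
}
```

Each event allocation would then become `alloc_region(&mut consumed, EVENT_SIZE, shmem_len)?`, turning an out-of-bounds layout into an error rather than undefined behavior.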
Fine :)
LGTM
```rust
};

// maximum memory in bytes
const MAX_PARAMS_MEM: usize = 1024 * 1024; // 1 MiB
```
Is a 1 MiB hard limit on the proof size really enough?
Yeah, good point. We over-allocate the buffer so it will still fit, but this could be expressed more clearly.
By how much do we over-allocate? Or do you mean that the PoV could be any size and we can always allocate enough memory for it?
Scratch that: even if we over-allocate, the call data is rejected by the params size check. See the fix here: #2445
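A minimal sketch of what such a params size check might look like (hypothetical, not the code from #2445; the `check_params_size` name is invented, only the `MAX_PARAMS_MEM` constant appears in the diff above):

```rust
/// Same limit as in the quoted diff.
const MAX_PARAMS_MEM: usize = 1024 * 1024; // 1 MiB

/// Hypothetical sketch: reject call data up front if its encoded size
/// would not fit into the params region of the shared-memory block,
/// instead of writing past the buffer.
fn check_params_size(params: &[u8]) -> Result<(), &'static str> {
    if params.len() > MAX_PARAMS_MEM {
        return Err("params exceed MAX_PARAMS_MEM");
    }
    Ok(())
}
```

With a check like this in place, the hard limit is enforced explicitly at the boundary rather than relying on the buffer over-allocation.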
…ing-work-on-a-small-testnet

* ao-fix-approval-import-tests:
  - fix expected ancestry in tests
  - fix ordering in determine_new_blocks
  - fix infinite loop in determine_new_blocks
  - fix test assertion
  - fix panic in cache_session_info_for_head
  - tests: use future::join
  - Clean up sizes for a workspace (#2445)
  - Integrate Approval Voting into Overseer / Service / GRANDPA (#2412)
  - Mitigation of SIGBUS (#2440)
  - node: migrate grandpa voting rule to async api (#2422)
  - Initializer + Paras Clean Up Messages When Offboarding (#2413)
  - Polkadot companion for 8114 (#2437)
  - Companion for substrate#8079 (#2408)
  - Bump if-watch to `0.1.8` to fix panic (#2436)
  - Notify collators about seconded collation (#2430)
  - CancelProxy uses `reject_announcement` instead of `remove_announcement` (#2429)
  - Rococo genesis 1337 (#2425)
On Rococo we observe that some nodes constantly crash with a SIGBUS signal. The coredump shows this stack trace:

The error comes from the fact that /dev/shm, the place where shmem files are typically placed on Linux, is full. /dev/shm in that case was backed by a tmpfs with a limited amount of memory. It is actually OK to call `shm_open` and `ftruncate` on a full filesystem, but the first access to the region will lead to a SIGBUS.

In order to mitigate this issue we first decrease the amount of memory required: prior to this PR it was 1 GiB and now it's around 32 MiB. Secondly, as soon as the worker has connected to the shared memory, we unlink the file, which, even if it does not prevent the file from counting towards the total occupied size of /dev/shm, at least makes sure the shm file is not left lying around if something goes wrong.
The ideal solution would be to avoid using shared memory altogether. I think we should solve this problem with a better approach to caching, outlined here