runtime: temporarily workaround the LimitedMemoryPool limiting the concurrency of the contract runtime too much #10733
Conversation
```diff
@@ -258,7 +258,7 @@ impl NearVM {
 // that particular function call will be slower. Not to mention there isn't a
 // strong guarantee on the upper bound of the memory that the contract runtime may
 // require.
-LimitedMemoryPool::new(8, 64 * 1024 * 1024).unwrap_or_else(|e| {
+LimitedMemoryPool::new(256, 1 * 1024 * 1024).unwrap_or_else(|e| {
```
The size of each individual memory in the pool grows as required (see CodeMemory::resize). This grow operation is fallible, unfortunately, but at the very least with the change to the LIFO queue for the memory pool, in most "regular" applications only the required number of memories should actually grow. For a validator that tracks 4 shards, that would mean just 4 of these memories.
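For a rough sense of the numbers involved, here is some back-of-the-envelope arithmetic (mapping sizes only, not necessarily resident memory; the 64 MiB growth target is simply the old per-memory size, used here for comparison):

```rust
// Illustrative arithmetic for the pool configurations discussed above.
const MIB: usize = 1024 * 1024;

fn main() {
    let old = 8 * 64 * MIB; // previous pool: 8 memories x 64 MiB up-front
    let naive = 256 * 64 * MIB; // bumping only the count would reserve this much
    let new_initial = 256 * 1 * MIB; // this PR: 256 memories x 1 MiB up-front
    // e.g. a validator tracking 4 shards, assuming each in-use memory
    // eventually grows to the old 64 MiB size:
    let new_grown = 4 * 64 * MIB + 252 * 1 * MIB;

    println!("old pool:            {} MiB", old / MIB); // 512
    println!("naive 256 x 64 MiB:  {} MiB", naive / MIB); // 16384
    println!("new pool, initial:   {} MiB", new_initial / MIB); // 256
    println!("new pool, grown (4): {} MiB", new_grown / MIB); // 508
}
```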
At first glance, this looks good so I am approving to unblock. Still a thorough review from Leo would be great.
LGTM! Just a few comments around mutex handling :)
See individual commits and their descriptions for an easier review experience.
This PR is intended to be easily backportable with minimal risk, with perhaps only the 2nd commit being superfluous in that context. There are definitely things we can do to make this area much more reliable and sensible, but they would result in a much more invasive change. My belief is that we should land both this workaround and fix #10732 as a follow-up.
cc #10732
cc @posvyatokum: we will want this backported to 1.37.1. Let me know if you'd like me to prepare a backport and/or where you'd like it to end up.
runtime: hotfix to allow up to 256 concurrent runtime invocations
This is not a proper fix, as certain configuration settings can still
cause this number to be exceeded, but the changes needed there are a
little more involved than what I'd be comfortable making in a rush,
especially for a backport to the 1.37 release.
In order to keep the initial memory use reasonably bounded, I have
chosen to reduce the initial size of each memory mapping to a smaller
number. Since the runtime reuses the most recently returned memories
first, in use-cases with few concurrent accesses only that many
memories will get resized over time to fit all the required contracts,
thus keeping the memory use reasonably limited. This will result in
slightly less predictable function call performance for the first few
contract invocations that require more memory, but this seems like a
necessary tradeoff to avoid preallocating dozens of GB of mappings that
would later go mostly unused.
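A minimal sketch of the grow-on-demand idea (illustrative only; `PoolMemory` and `ensure_size` are hypothetical stand-ins for the real CodeMemory type and its resize operation, and the real pool is not a plain Vec):

```rust
// Illustrative only: memories start small and are grown lazily the first
// time a contract needs more space.
struct PoolMemory {
    buf: Vec<u8>, // stand-in for the real mmap-backed region
}

impl PoolMemory {
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: Vec::with_capacity(bytes) }
    }

    /// Grow the memory if the contract needs more than is currently mapped.
    /// Mirrors the fallible nature of the real resize operation.
    fn ensure_size(&mut self, required: usize) -> Result<(), String> {
        if required > self.buf.capacity() {
            self.buf
                .try_reserve(required - self.buf.len())
                .map_err(|e| format!("failed to grow memory: {e}"))?;
        }
        Ok(())
    }
}

fn main() {
    // Many small memories up-front instead of a few huge ones.
    let mut pool: Vec<PoolMemory> =
        (0..256).map(|_| PoolMemory::with_capacity(1024 * 1024)).collect();

    // Under low concurrency only the memories that actually get checked out
    // ever grow; the rest stay at their 1 MiB initial size.
    let mut mem = pool.pop().expect("pool is non-empty");
    mem.ensure_size(48 * 1024 * 1024).expect("failed to grow memory");
    pool.push(mem); // LIFO return: this grown memory is reused first next time
}
```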
runtime: avoid losing memories when failing to resize them
This is not something that I am aware of happening, but a fresh look
at the code immediately spotted it as an icky part of the code. The new
ordering will still fail to resize and most likely lead to the
termination of the node, but if the caller chooses to do something else
about the error, this new setup will correctly return the previous
memory to the memory pool.
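A sketch of the reordering (simplified; `Pool`, `Memory`, and `try_resize` are hypothetical stand-ins, not the actual NearVM types): the memory is pushed back into the pool before the resize error is propagated, so a caller that recovers from the error does not permanently shrink the pool.

```rust
// Illustrative only: return the memory to the pool when resizing fails.
struct Memory { size: usize }
struct Pool { free: Vec<Memory> }

impl Pool {
    /// Check a memory out of the pool and grow it to `required` bytes.
    /// On a failed grow, the memory goes back into the pool instead of
    /// being dropped, so the pool does not silently lose capacity.
    fn get_resized(&mut self, required: usize) -> Result<Memory, String> {
        let mut mem = self.free.pop().ok_or("pool exhausted")?;
        if let Err(err) = try_resize(&mut mem, required) {
            // The old ordering effectively lost `mem` here; returning it
            // keeps the pool intact if the caller handles the error.
            self.free.push(mem);
            return Err(err);
        }
        Ok(mem)
    }
}

fn try_resize(mem: &mut Memory, required: usize) -> Result<(), String> {
    // Stand-in for the fallible resize (e.g. a remap that can fail).
    if required > usize::MAX / 2 {
        return Err("resize failed".to_string());
    }
    mem.size = required;
    Ok(())
}

fn main() {
    let mut pool = Pool { free: vec![Memory { size: 1 << 20 }] };
    match pool.get_resized(64 << 20) {
        Ok(mem) => println!("got a memory of {} bytes", mem.size),
        Err(err) => println!("{err}; pool still holds {} memories", pool.free.len()),
    }
}
```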
runtime: switch LimitedMemoryPool to a LIFO queue
crossbeam's implementation is a FIFO queue, which repeatedly touches
all of the memories regardless of the concurrency; this can result in
more memory use than strictly necessary in low-concurrency scenarios.
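To illustrate the difference, here is a minimal simulation (not the actual pool implementation): at concurrency 1, a FIFO queue cycles through every slot in the pool, while a LIFO stack keeps reusing the single most recently returned one.

```rust
use std::collections::{HashSet, VecDeque};

// Simulate repeated checkout/return at concurrency 1 and count how many
// distinct slots ever get handed out. Illustrative only; the real pool
// holds memories, not indices.
fn main() {
    let n: usize = 8;
    let rounds = 100;

    // FIFO: pop from the front, return to the back (the behavior of a
    // FIFO queue such as crossbeam's). Every slot eventually gets used.
    let mut fifo: VecDeque<usize> = (0..n).collect();
    let mut fifo_touched = HashSet::new();
    for _ in 0..rounds {
        let slot = fifo.pop_front().unwrap();
        fifo_touched.insert(slot);
        fifo.push_back(slot);
    }

    // LIFO: pop from the back, return to the back. At concurrency 1 only
    // one slot is ever used, so only that one memory would ever grow.
    let mut lifo: Vec<usize> = (0..n).collect();
    let mut lifo_touched = HashSet::new();
    for _ in 0..rounds {
        let slot = lifo.pop().unwrap();
        lifo_touched.insert(slot);
        lifo.push(slot);
    }

    println!("FIFO touched {} of {n} slots", fifo_touched.len()); // 8
    println!("LIFO touched {} of {n} slots", lifo_touched.len()); // 1
}
```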