Feat: lazy initialization memory #107

mohanson · 2020-08-07T02:41:24Z

Description

Split program memory into 256 KB frames; a frame is filled with 0 only when it is first used.
The maximum memory is still a fixed value: 4 MB.
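
To make the idea concrete, here is a minimal sketch of lazy frame initialization under the sizes stated above. The `LazyMemory` type, its field names, and the `touch`/`load8` helpers are hypothetical and are not taken from the ckb-vm code; the `0xAA` fill only stands in for memory whose contents are not yet defined.

```rust
// Sizes taken from the description: 256 KB frames, 4 MB of total memory.
const FRAME_SIZE: usize = 256 * 1024;
const MAX_MEMORY: usize = 4 * 1024 * 1024;
const FRAMES: usize = MAX_MEMORY / FRAME_SIZE; // 16 frames

// Hypothetical container: the backing buffer plus one "initialized" flag per frame.
struct LazyMemory {
    data: Vec<u8>,
    inited: [bool; FRAMES],
}

impl LazyMemory {
    fn new() -> Self {
        // 0xAA stands in for stale, not-yet-zeroed contents.
        LazyMemory { data: vec![0xAA; MAX_MEMORY], inited: [false; FRAMES] }
    }

    // Zero every frame overlapped by [addr, addr + len) that has not been used yet.
    fn touch(&mut self, addr: usize, len: usize) -> Result<(), &'static str> {
        let end = addr.checked_add(len).ok_or("out of bound")?;
        if end > MAX_MEMORY {
            return Err("out of bound");
        }
        if len == 0 {
            return Ok(());
        }
        // `end` is exclusive, so the last touched byte is `end - 1`.
        for frame in (addr / FRAME_SIZE)..=((end - 1) / FRAME_SIZE) {
            if !self.inited[frame] {
                let base = frame * FRAME_SIZE;
                self.data[base..base + FRAME_SIZE].fill(0);
                self.inited[frame] = true;
            }
        }
        Ok(())
    }

    // A load initializes the frame first, so it never observes stale bytes.
    fn load8(&mut self, addr: usize) -> Result<u8, &'static str> {
        self.touch(addr, 1)?;
        Ok(self.data[addr])
    }
}
```

In this sketch only the frames a program actually touches pay the zeroing cost, while the 4 MB address space stays fixed.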

Bench

Origin

interpret secp256k1_bench     time:   [21.956 ms 22.008 ms 22.066 ms]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

interpret secp256k1_bench via assembly     time:   [5.6998 ms 5.7362 ms 5.7741 ms]

Benchmarking aot secp256k1_bench: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.0s, enable flat sampling, or reduce sample count to 50.
aot secp256k1_bench     time:   [1.4025 ms 1.4070 ms 1.4124 ms]
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

compiling secp256k1_bench for aot     time:   [26.887 ms 26.990 ms 27.114 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Now

interpret secp256k1_bench     time:   [22.170 ms 22.221 ms 22.278 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

interpret secp256k1_bench via assembly     time:   [5.6565 ms 5.6904 ms 5.7279 ms]
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

Benchmarking aot secp256k1_bench: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
aot secp256k1_bench     time:   [1.6108 ms 1.6138 ms 1.6174 ms]
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

compiling secp256k1_bench for aot     time:   [27.741 ms 27.806 ms 27.873 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Almost unchanged for the interpreter and assembly runs; the aot run went from about 1.41 ms to about 1.61 ms.

xxuejie · 2020-08-07T04:45:57Z

src/machine/aot/aot.x64.c

+  | mov byte [rdx+rcx], 1
+  | push rax
+  | mov rax, rcx
+  | call ->zeroed_memory


I feel it would be better to leverage memset rather than hand-writing a zeroed_memory function. A modern memset leverages SIMD for more speedup.

It might also be better to add a new exit code here: when an uninitialized frame is detected, we exit from AOT and ASM mode with that error code, and then on the Rust side we fill in the zeros with code that is just a memset. That, to me, might be a better and (hopefully) faster way.

I replaced it with memset(). It is called differently on different operating systems, and it took me a long time to understand this. It should work well for now.
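
For comparison, the exit-code alternative suggested above could look roughly like the following sketch. Everything here is hypothetical: the `Exit` enum, the `Machine` struct, and `run_fast_path` (a stand-in for the AOT/ASM code that simply replays a list of store addresses) are illustrative names, not ckb-vm APIs, and this is not what the PR ended up doing.

```rust
const FRAME_SIZE: usize = 256 * 1024;
const FRAMES: usize = 16;

// Hypothetical exit reasons reported by the native fast path.
#[derive(Debug, PartialEq)]
enum Exit {
    Done,
    UninitializedFrame(usize),
}

struct Machine {
    memory: Vec<u8>,
    frames: [u8; FRAMES],
    // Addresses the stand-in "program" still has to store the value 1 to.
    pending: Vec<usize>,
}

// Stand-in for the AOT/ASM loop: it never zeroes memory itself and instead
// bails out as soon as a store would hit an uninitialized frame.
fn run_fast_path(m: &mut Machine) -> Exit {
    while let Some(&addr) = m.pending.last() {
        let frame = addr / FRAME_SIZE;
        if m.frames[frame] == 0 {
            return Exit::UninitializedFrame(frame);
        }
        m.memory[addr] = 1;
        m.pending.pop();
    }
    Exit::Done
}

// Rust-side driver: zero the reported frame, then re-enter the fast path.
// Returns how many frames had to be zeroed.
fn run(m: &mut Machine) -> usize {
    let mut refills = 0;
    loop {
        match run_fast_path(m) {
            Exit::Done => return refills,
            Exit::UninitializedFrame(frame) => {
                let base = frame * FRAME_SIZE;
                m.memory[base..base + FRAME_SIZE].fill(0);
                m.frames[frame] = 1;
                refills += 1;
            }
        }
    }
}
```

The trade-off in this sketch is one extra exit and re-entry per newly touched frame in exchange for keeping the zeroing code out of the hot loop.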

xxuejie · 2020-08-07T04:53:59Z

src/machine/asm/execute.S

+  shr $2, %ecx; \
+  cld; rep; stosl;
+
+#define INITED_MEMORY(address_reg, temp_reg1, temp_reg2, temp_reg2d, length) \


I'm concerned about the duplication of all the code here. Can you get some numbers on the code size before and after this change for the whole execute.o section? We want all the code here to fit in the L1 cache so that it is as performant as possible.

Also, in the worst case, the code is mostly the same as CHECK_WRITE_PERMISSION; we should be able to combine them for smaller code size and fewer executed instructions.

> Also, in the worst case, the code is mostly the same as CHECK_WRITE_PERMISSION; we should be able to combine them for smaller code size and fewer executed instructions.

Yes, the code for checking out of bounds is duplicated. I am going to remove it in CHECK_WRITE_PERMISSION.

> I'm concerned about the duplication of all the code here. Can you get some numbers on the code size before and after this change for the whole execute.o section? We want all the code here to fit in the L1 cache so that it is as performant as possible.

I made a comparison: the original execute.o is 15K; now (based on the latest commit) it is 18K.

Let's merge CHECK_WRITE_PERMISSION and INITED_MEMORY so that fewer instructions are executed. Also keep in mind that INITED_MEMORY might also be used by memory reads.

I have combined check_write() and inited_memory into one function named access_control(); please take a look.

Commits: 088176f, 7ce3db3

quake · 2020-08-12T12:15:40Z

src/machine/asm/mod.rs

+fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
+    let frame_from = addr / MEMORY_FRAMESIZE as u64;
+    let (addr_to, overflowed) = addr.overflowing_add(size);
+    if overflowed {
+        return Err(Error::OutOfBound);
+    }
+    let frame_to = addr_to / MEMORY_FRAMESIZE as u64;
+    for i in frame_from..=std::cmp::min(MEMORY_FRAMES as u64 - 1, frame_to) {
+        if machine.frames[i as usize] == 0 {
+            let base_addr = i * MEMORY_FRAMESIZE as u64;
+            memset(
+                &mut machine.memory
+                    [base_addr as usize..(base_addr + MEMORY_FRAMESIZE as u64) as usize],
+                0,
+            );
+            machine.frames[i as usize] = 1;
+        }
+    }
+    Ok(())
+}
+


Since we check for out-of-bounds here, can we remove the code blocks below?

if addr + 2 > self.memory.len() as u64 {
    return Err(Error::OutOfBound);
}
if addr + 4 > self.memory.len() as u64 {
    return Err(Error::OutOfBound);
}

I suggest casting to usize instead of u64.

Suggested change

-fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
-    let frame_from = addr / MEMORY_FRAMESIZE as u64;
-    let (addr_to, overflowed) = addr.overflowing_add(size);
-    if overflowed {
-        return Err(Error::OutOfBound);
-    }
-    let frame_to = addr_to / MEMORY_FRAMESIZE as u64;
-    for i in frame_from..=std::cmp::min(MEMORY_FRAMES as u64 - 1, frame_to) {
-        if machine.frames[i as usize] == 0 {
-            let base_addr = i * MEMORY_FRAMESIZE as u64;
-            memset(
-                &mut machine.memory
-                    [base_addr as usize..(base_addr + MEMORY_FRAMESIZE as u64) as usize],
-                0,
-            );
-            machine.frames[i as usize] = 1;
-        }
-    }
-    Ok(())
-}
+fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
+    let (addr_to, overflowed) = addr.overflowing_add(size);
+    if overflowed || addr_to as usize > RISCV_MAX_MEMORY {
+        return Err(Error::OutOfBound);
+    }
+    let frame_from = addr as usize / MEMORY_FRAMESIZE;
+    let frame_to = addr_to as usize / MEMORY_FRAMESIZE;
+    for i in frame_from..=std::cmp::min(MEMORY_FRAMES - 1, frame_to) {
+        if machine.frames[i] == 0 {
+            let base_addr = i * MEMORY_FRAMESIZE;
+            memset(
+                &mut machine.memory[base_addr..(base_addr + MEMORY_FRAMESIZE)],
+                0,
+            );
+            machine.frames[i] = 1;
+        }
+    }
+    Ok(())
+}

> Since we check for out-of-bounds here, can we remove the code blocks below?

inited_memory() does not check for out-of-bounds access.

As https://github.com/nervosnetwork/ckb-vm/pull/107/files#r469094023 says, I will combine these pieces of logic: inited_memory, the out-of-bounds check, and check_write_permission.

I do suggest we keep u64 here to be explicit about the types, and only cast to usize when we actually need to index. ckb-vm follows this approach throughout the codebase to be more precise with types.
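
As an illustration of the two points in this thread (an overflow-aware bounds check, and keeping addresses as u64 until indexing), here is a small sketch. The `check_bounds` and `load16` helpers and the local `Error` enum are hypothetical stand-ins, not the ckb-vm definitions. Note that a plain `addr + 2 > len` comparison either panics (with overflow checks enabled) or wraps around and passes (without them) when `addr` is close to `u64::MAX`.

```rust
// Local stand-in for the VM error type.
#[derive(Debug, PartialEq)]
enum Error {
    OutOfBound,
}

// Checks that [addr, addr + size) lies inside a memory of `memory_len` bytes.
// All arithmetic stays in u64, and the addition is overflow-aware.
fn check_bounds(addr: u64, size: u64, memory_len: usize) -> Result<(), Error> {
    let (end, overflowed) = addr.overflowing_add(size);
    if overflowed || end > memory_len as u64 {
        return Err(Error::OutOfBound);
    }
    Ok(())
}

// Loads a little-endian 16-bit value; the cast to usize happens only at the
// point where the address is used as an index.
fn load16(memory: &[u8], addr: u64) -> Result<u16, Error> {
    check_bounds(addr, 2, memory.len())?;
    let i = addr as usize;
    Ok(u16::from_le_bytes([memory[i], memory[i + 1]]))
}
```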

xxuejie · 2020-08-14T06:00:57Z

src/machine/aot/aot.x64.c

+  | jne >4
+  | mov byte [r9+rsi], 1
+  | push rax
+  | mov rax, rsi


Just one minor suggestion here: if zeroed_memory takes the address from rsi instead of rax, we might not need to save, update, and restore rax here, correct?

Afraid not, the syscall number needs to be stored in rax.

What is the syscall number here?

https://github.com/nervosnetwork/ckb-vm/pull/107/files/822591565524b1a036cfdbc226618b418ff9c74f#diff-429dcb115a620eba83a6f6512807bfffR433

The address of memset is stored in rax, and rsi may be used as a parameter. It is not a syscall number; I described this code incorrectly.

Wait a minute, I will try to change it; you may be right.

What I'm saying is that you can just use rsi as the parameter to zeroed_memory instead of rax. What happens internally in zeroed_memory is a different story. This way, you don't need the push rax, mov rax, rsi, and pop rax surrounding the call to zeroed_memory. The address that zeroed_memory will use is already in rsi, and we can rely on this fact.

I changed it; the code is clearer than before.

xxuejie · 2020-08-14T06:07:16Z

src/machine/aot/aot.x64.c

@@ -1224,7 +1317,8 @@ int aot_memory_write(AotContext* context, AotValue address, AotValue v, uint32_t
  if (ret != DASM_S_OK) { return ret; }

  | mov rdx, size
-  | call ->check_write
+  | mov rcx, 1


A common trick here is to write mov ecx, 1 instead: it gives the same result, because a write to a 32-bit register zero-extends into the full 64-bit register, but it is encoded as a shorter instruction.
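
The size difference can be checked from the machine-code bytes. The sketch below lists the encodings an assembler would typically emit for the two forms (the exact choice can vary between assemblers) and models the zero-extension rule; the constant and function names are hypothetical.

```rust
// `mov rcx, 1`: REX.W prefix (0x48), opcode C7 /0 with ModRM 0xC1 (rcx),
// followed by a sign-extended 32-bit immediate. Seven bytes in total.
const MOV_RCX_1: [u8; 7] = [0x48, 0xC7, 0xC1, 0x01, 0x00, 0x00, 0x00];

// `mov ecx, 1`: opcode B8+rd (0xB9 for ecx) followed by a 32-bit immediate.
// Five bytes in total, with no REX prefix and no ModRM byte.
const MOV_ECX_1: [u8; 5] = [0xB9, 0x01, 0x00, 0x00, 0x00];

// Bytes saved per occurrence by using the 32-bit form.
fn bytes_saved() -> usize {
    MOV_RCX_1.len() - MOV_ECX_1.len()
}

// Model of the architectural rule that makes the trick safe: writing a 32-bit
// register clears the upper 32 bits of the corresponding 64-bit register.
fn mov_ecx(_old_rcx: u64, imm: u32) -> u64 {
    u64::from(imm)
}
```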

xxuejie · 2020-08-14T06:13:33Z

src/machine/asm/execute.S

+  CALL_MEMSET; \
+  POSTCALL; \
+2: \
+  movq $check_write, temp_reg1; \


I feel it would be better to define two macros here: one that handles both memory initialization and the write permission check, for store operations, and one that handles only memory initialization, for load operations.

The problem here is that even though we do a comparison based on check_write and skip those operations in the read calls, they will still be generated in the code, resulting in larger code size. For a hot loop that is extremely performance sensitive, I suggest we keep the code as short as possible, even if that means some code duplication.

I rolled back the modifications to this file.

xxuejie · 2020-08-14T06:18:35Z

src/machine/asm/mod.rs

+    if overflowed {
+        return Err(Error::OutOfBound);
+    }
+    let frame_to = addr_to >> MEMORY_FRAME_SHIFTS;


Should frame_to be (addr_to - 1) >> MEMORY_FRAME_SHIFTS? frame_to is inclusive below.

Yes, there is a bug here.
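
A worked example of the boundary issue, assuming MEMORY_FRAME_SHIFTS is 18 (so that 1 << 18 is the 256 KB frame size from the description) and assuming size is nonzero and addr + size does not overflow. The two helper functions are hypothetical and only compute the inclusive frame range.

```rust
// Assumed value: 1 << 18 = 262144 bytes = 256 KB per frame.
const MEMORY_FRAME_SHIFTS: u64 = 18;

// Original computation: `addr + size` is one past the last byte, so an access
// that ends exactly on a frame boundary also selects the following frame.
fn frame_range_buggy(addr: u64, size: u64) -> (u64, u64) {
    let addr_to = addr + size;
    (addr >> MEMORY_FRAME_SHIFTS, addr_to >> MEMORY_FRAME_SHIFTS)
}

// Suggested fix: use the last byte actually touched, `addr + size - 1`,
// because the upper end of the frame range is inclusive.
fn frame_range_fixed(addr: u64, size: u64) -> (u64, u64) {
    let addr_to = addr + size;
    (addr >> MEMORY_FRAME_SHIFTS, (addr_to - 1) >> MEMORY_FRAME_SHIFTS)
}
```

For a 4-byte access at address 262140 (the last four bytes of frame 0), the original computation yields frames 0 through 1, while the fixed one yields only frame 0, so the next frame is not zeroed unnecessarily.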

xxuejie

🎆 🎆 🎆

mohanson added 4 commits August 5, 2020 15:08

feat: lazy initialize memory

da4a27d

fix: missing return

5c80a40

chore

8d56b2b

fix: a bug that I don't know why it can be solved

27f34da

mohanson requested review from a team and doitian August 7, 2020 02:41

mohanson added 2 commits August 7, 2020 11:13

fix: ci

189217a

fix: ci

01432bc

xxuejie reviewed Aug 7, 2020

mohanson force-pushed the mem branch from 853db40 to 40c76de Compare August 9, 2020 09:33

refactor: use c's memset for zeroed_memory() in aot.x64.c

6959597

mohanson force-pushed the mem branch from 40c76de to 6959597 Compare August 9, 2020 09:45

mohanson added 13 commits August 11, 2020 22:45

refactor: rewrite inited_memory with memset

7940c4d

fix: ci failed on windows

ee76c8e

fix: ci failed on windows

931846f

fix: ci failed on windows

f9bf870

fix: ci failed on windows

4b5c74b

fix: ci failed on windows

25a1381

fix: ci failed on windows

92652df

fix: ci failed on windows

726abcc

fix: ci failed on windows

7b48287

fix: ci failed on windows

9a2e6d8

fix: ci failed on windows

4d44ff0

optmize: remove duplicate out-of-bounds checks

b28d10c

optmize: remove unnecessary codes in zeroed_memory

3712335

xxuejie reviewed Aug 12, 2020

definitions/src/asm.rs (resolved)

quake reviewed Aug 12, 2020

jjyr reviewed Aug 13, 2020

src/machine/asm/mod.rs (outdated, resolved)

mohanson added 2 commits August 13, 2020 11:49

refactor: combine check_write and inited_meory in aot.x64.c

088176f

refactor: combine inited_memory and check_write_permission in executes.S

7ce3db3

refine: slightly speed up inited_memory

8225915

xxuejie reviewed Aug 14, 2020

mohanson added 4 commits August 14, 2020 14:46

fix: boundary error in inited_memory

4d2daa2

refine: zeroed_memory use rsi directly

691612e

refine: use ecx as the third parameter for access_control

2846a3a

Roll back changes on execute.S

b4b48aa

mohanson added 3 commits August 17, 2020 15:02

refactor: Combine memory initialization into check_write_permission

508e45c

update: reduce parameter passing

ebc74d3

update: aot.x64.c

de58f43

xxuejie approved these changes Aug 17, 2020

xxuejie merged commit c469792 into nervosnetwork:develop Aug 17, 2020

mohanson deleted the mem branch July 13, 2023 14:58
