Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: lazy initialization memory #107

Merged
merged 30 commits into from
Aug 17, 2020
Merged

Conversation

mohanson
Copy link
Collaborator

@mohanson mohanson commented Aug 7, 2020

Description

  • Split program memory by 256k size frames, only when a frame is used will it be filled with 0.
  • Maximum memory is still a fixed value: 4MB

Bench

Origin

interpret secp256k1_bench     time:   [21.956 ms 22.008 ms 22.066 ms]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

interpret secp256k1_bench via assembly     time:   [5.6998 ms 5.7362 ms 5.7741 ms]

Benchmarking aot secp256k1_bench: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.0s, enable flat sampling, or reduce sample count to 50.
aot secp256k1_bench     time:   [1.4025 ms 1.4070 ms 1.4124 ms]                                 
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

compiling secp256k1_bench for aot     time:   [26.887 ms 26.990 ms 27.114 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Now

interpret secp256k1_bench     time:   [22.170 ms 22.221 ms 22.278 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

interpret secp256k1_bench via assembly     time:   [5.6565 ms 5.6904 ms 5.7279 ms]
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

Benchmarking aot secp256k1_bench: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
aot secp256k1_bench     time:   [1.6108 ms 1.6138 ms 1.6174 ms]                                 
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

compiling secp256k1_bench for aot     time:   [27.741 ms 27.806 ms 27.873 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Almost unchanged.

@mohanson mohanson requested review from a team and doitian August 7, 2020 02:41
| mov byte [rdx+rcx], 1
| push rax
| mov rax, rcx
| call ->zeroed_memory
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it's better we leverage memset than hand writing zeroed_memory function. Modern memset would leverage SIMD for more speedups.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do feel like it might be better we add a new exit code here, when an uninitialized frame is detected, we exit from AOT and ASM mode with the error code, then at Rust side, we can fill in the zeros using code which is just a memset. That to me, might be a better and (hopefully) faster way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced it with memset(). They are called differently on different operating systems, and it took me a long time to understand this. It should work well for now.

shr $2, %ecx; \
cld; rep; stosl;

#define INITED_MEMORY(address_reg, temp_reg1, temp_reg2, temp_reg2d, length) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about the duplication of all the code here, can you get some stat number on the code size before and after this change for the whole execute.o section? We do want all the code here to fit in L1 cache so as to be as performant as possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in the worse case, the code for CHECK_WRITE_PERMISSION are mostly the same, we should be able to combine them for less code size and less executed instructions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in the worse case, the code for CHECK_WRITE_PERMISSION are mostly the same, we should be able to combine them for less code size and less executed instructions.

Yes, the code for checking out of bounds is duplicated. I am going to remove this in CHECK_WRITE_PERMISSION.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about the duplication of all the code here, can you get some stat number on the code size before and after this change for the whole execute.o section? We do want all the code here to fit in L1 cache so as to be as performant as possible.

I made a comparison, the original execute.o is 15K, now (based on the latest commit) it is 18K.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge CHECK_WRITE_PERMISSION and INITED_MEMORY for less instructions executed. Also keep in mind that INITED_MEMORY might also be used by memory reading.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have combined the check_write() and inited_meory in one function named access_control(), please take a look.

commit:

src/machine/asm/execute.S Outdated Show resolved Hide resolved
Comment on lines 57 to 77
fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
let frame_from = addr / MEMORY_FRAMESIZE as u64;
let (addr_to, overflowed) = addr.overflowing_add(size);
if overflowed {
return Err(Error::OutOfBound);
}
let frame_to = addr_to / MEMORY_FRAMESIZE as u64;
for i in frame_from..=std::cmp::min(MEMORY_FRAMES as u64 - 1, frame_to) {
if machine.frames[i as usize] == 0 {
let base_addr = i * MEMORY_FRAMESIZE as u64;
memset(
&mut machine.memory
[base_addr as usize..(base_addr + MEMORY_FRAMESIZE as u64) as usize],
0,
);
machine.frames[i as usize] = 1;
}
}
Ok(())
}

Copy link
Member

@quake quake Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we check for out-of-bounds here, can we remove these code blocks below ?

        if addr + 2 > self.memory.len() as u64 {
            return Err(Error::OutOfBound);
        }

        if addr + 4 > self.memory.len() as u64 {
            return Err(Error::OutOfBound);
        }

suggest to cast to usize instead of u64

Suggested change
fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
let frame_from = addr / MEMORY_FRAMESIZE as u64;
let (addr_to, overflowed) = addr.overflowing_add(size);
if overflowed {
return Err(Error::OutOfBound);
}
let frame_to = addr_to / MEMORY_FRAMESIZE as u64;
for i in frame_from..=std::cmp::min(MEMORY_FRAMES as u64 - 1, frame_to) {
if machine.frames[i as usize] == 0 {
let base_addr = i * MEMORY_FRAMESIZE as u64;
memset(
&mut machine.memory
[base_addr as usize..(base_addr + MEMORY_FRAMESIZE as u64) as usize],
0,
);
machine.frames[i as usize] = 1;
}
}
Ok(())
}
fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
let (addr_to, overflowed) = addr.overflowing_add(size);
if overflowed || addr_to as usize > RISCV_MAX_MEMORY {
return Err(Error::OutOfBound);
}
let frame_from = addr as usize / MEMORY_FRAMESIZE;
let frame_to = addr_to as usize / MEMORY_FRAMESIZE;
for i in frame_from..=std::cmp::min(MEMORY_FRAMES - 1, frame_to) {
if machine.frames[i] == 0 {
let base_addr = i * MEMORY_FRAMESIZE;
memset(
&mut machine.memory[base_addr..(base_addr + MEMORY_FRAMESIZE)],
0,
);
machine.frames[i] = 1;
}
}
Ok(())
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we check for out-of-bounds here, can we remove these code blocks below ?

inited_memory() does not check out-of-bounds.

As https://github.com/nervosnetwork/ckb-vm/pull/107/files#r469094023 says, I will combine these logics: inited_memory, out-of-bounds-check and check_write_permission

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do suggest we keep u64 here to be explicit with the types, and only cast it to usize when we do need indexing. ckb-vm has employed this way throughout the codebase to be more precise with types.

src/machine/asm/mod.rs Outdated Show resolved Hide resolved
| jne >4
| mov byte [r9+rsi], 1
| push rax
| mov rax, rsi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one minor suggestion here: if zeroed_memory takes address from rsi instead of rax, we might not need to save, update and restore rax here, correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afraid not, the syscall number needs to be stored in rax.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is syscall number here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/nervosnetwork/ckb-vm/pull/107/files/822591565524b1a036cfdbc226618b418ff9c74f#diff-429dcb115a620eba83a6f6512807bfffR433

The address of memset is stored in rax, and rsi may be used as a parameter. Not a syscall number, I described this code incorrectly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a minute, I will try to change it, you may be right

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying, is that you can just use rsi as a parameter to zeroed_memory instead of rax. Internally in zeroed_memory is a different story. This way, you don't need to push rax, mov rax, rsi and then pop rax surrounding zeroed_memory. The address, that will be used by zeroed_memory, is already in rsi, and we can rely on this fact.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it, the code is more clear than before.

@@ -1224,7 +1317,8 @@ int aot_memory_write(AotContext* context, AotValue address, AotValue v, uint32_t
if (ret != DASM_S_OK) { return ret; }

| mov rdx, size
| call ->check_write
| mov rcx, 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common trick here, is that we can write mov ecx, 1 for the same result, but it will be encoded in a shorter instruction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

CALL_MEMSET; \
POSTCALL; \
2: \
movq $check_write, temp_reg1; \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it is better if we define 2 macros here, one deals with both memory initialization and write permission check, for store operations. One deals only with memory initialization for load operations.

The problem here, is that even though we do a comparison based on check_write, and skip the operations in the read calls, those will still be generated in the code, resulting in larger code size. For a hot loop that is extremely performance sensitive, I suggest we keep the code as short as possible, even though we might experience some code duplication

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rolled back the modification of this file

if overflowed {
return Err(Error::OutOfBound);
}
let frame_to = addr_to >> MEMORY_FRAME_SHIFTS;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should frame_to be (addr_to - 1) >> MEMORY_FRAME_SHIFTS? frame_to is inclusive below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a bug here

Copy link
Collaborator

@xxuejie xxuejie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎆 🎆 🎆

@xxuejie xxuejie merged commit c469792 into nervosnetwork:develop Aug 17, 2020
@mohanson mohanson deleted the mem branch July 13, 2023 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants