Feat: lazy initialization memory #107
Conversation
| mov byte [rdx+rcx], 1
| push rax
| mov rax, rcx
| call ->zeroed_memory
I think it would be better to use `memset` than to hand-write a `zeroed_memory` function. Modern `memset` implementations use SIMD for additional speedup.
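As a rough illustration of this point in Rust terms (a sketch, not code from this PR): zeroing a byte slice with `fill(0)` leaves the choice of zeroing routine to the compiler and the platform library, and for `u8` slices it is commonly lowered to a `memset` call. `FRAME_SIZE` and `zero_frame` are hypothetical names introduced only for this example.

```rust
/// Hypothetical frame size, chosen only for this sketch.
const FRAME_SIZE: usize = 4096;

/// Zero one frame of `memory`, where `frame` is the frame index.
/// Panics if the frame lies outside `memory`.
fn zero_frame(memory: &mut [u8], frame: usize) {
    let start = frame * FRAME_SIZE;
    // No hand-written loop: the optimizer typically turns this into memset.
    memory[start..start + FRAME_SIZE].fill(0);
}
```

Calling `zero_frame(&mut memory, 1)` on a two-frame buffer would clear only the second frame and leave the first untouched.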
I replaced it with `memset()`. The way it is called differs between operating systems, and it took me a long time to understand this. It should work well for now.
src/machine/asm/execute.S
Outdated
  shr $2, %ecx; \
  cld; rep; stosl;

#define INITED_MEMORY(address_reg, temp_reg1, temp_reg2, temp_reg2d, length) \
I'm concerned about the duplication of all the code here. Can you get some numbers on the code size of the whole `execute.o` section before and after this change? We want all the code here to fit in the L1 cache so that it is as performant as possible.
Also, in the worst case, the code here and the code for `CHECK_WRITE_PERMISSION` are mostly the same, so we should be able to combine them for smaller code size and fewer executed instructions.
> Also, in the worst case, the code here and the code for `CHECK_WRITE_PERMISSION` are mostly the same, so we should be able to combine them for smaller code size and fewer executed instructions.

Yes, the out-of-bounds checking code is duplicated. I am going to remove it from `CHECK_WRITE_PERMISSION`.
> I'm concerned about the duplication of all the code here. Can you get some numbers on the code size of the whole `execute.o` section before and after this change? We want all the code here to fit in the L1 cache so that it is as performant as possible.

I made a comparison: the original `execute.o` is 15K; now (based on the latest commit) it is 18K.
Let's merge `CHECK_WRITE_PERMISSION` and `INITED_MEMORY` so that fewer instructions are executed. Also keep in mind that `INITED_MEMORY` might also be used for memory reads.
src/machine/asm/mod.rs
Outdated
    fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
        let frame_from = addr / MEMORY_FRAMESIZE as u64;
        let (addr_to, overflowed) = addr.overflowing_add(size);
        if overflowed {
            return Err(Error::OutOfBound);
        }
        let frame_to = addr_to / MEMORY_FRAMESIZE as u64;
        for i in frame_from..=std::cmp::min(MEMORY_FRAMES as u64 - 1, frame_to) {
            if machine.frames[i as usize] == 0 {
                let base_addr = i * MEMORY_FRAMESIZE as u64;
                memset(
                    &mut machine.memory
                        [base_addr as usize..(base_addr + MEMORY_FRAMESIZE as u64) as usize],
                    0,
                );
                machine.frames[i as usize] = 1;
            }
        }
        Ok(())
    }
Since we check for out-of-bounds here, can we remove the code blocks below?

    if addr + 2 > self.memory.len() as u64 {
        return Err(Error::OutOfBound);
    }
    if addr + 4 > self.memory.len() as u64 {
        return Err(Error::OutOfBound);
    }

I also suggest casting to `usize` instead of `u64`:
    fn inited_memory(machine: &mut AsmCoreMachine, addr: u64, size: u64) -> Result<(), Error> {
        let (addr_to, overflowed) = addr.overflowing_add(size);
        if overflowed || addr_to as usize > RISCV_MAX_MEMORY {
            return Err(Error::OutOfBound);
        }
        let frame_from = addr as usize / MEMORY_FRAMESIZE;
        let frame_to = addr_to as usize / MEMORY_FRAMESIZE;
        for i in frame_from..=std::cmp::min(MEMORY_FRAMES - 1, frame_to) {
            if machine.frames[i] == 0 {
                let base_addr = i * MEMORY_FRAMESIZE;
                memset(
                    &mut machine.memory[base_addr..(base_addr + MEMORY_FRAMESIZE)],
                    0,
                );
                machine.frames[i] = 1;
            }
        }
        Ok(())
    }
> Since we check for out-of-bounds here, can we remove the code blocks below?

`inited_memory()` does not check for out-of-bounds access.
As https://github.com/nervosnetwork/ckb-vm/pull/107/files#r469094023 says, I will combine these pieces of logic: `inited_memory`, the out-of-bounds check, and `check_write_permission`.
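One way the combined logic could look is sketched below. This is an illustration under stated assumptions, not the code that was merged: `Machine`, `prepare_write`, the constants, the `InvalidPermission` variant, and the per-frame `writable` flags are all hypothetical (the real permission granularity may differ). Addresses stay `u64` and are cast to `usize` only where a slice or array is indexed.

```rust
// All names and sizes below are hypothetical, chosen for this sketch only.
const FRAME_SHIFT: u64 = 12;
const FRAME_SIZE: u64 = 1 << FRAME_SHIFT;
const MEMORY_SIZE: u64 = 16 * FRAME_SIZE;
const FRAMES: usize = (MEMORY_SIZE >> FRAME_SHIFT) as usize;

#[derive(Debug, PartialEq)]
enum Error {
    OutOfBound,
    InvalidPermission,
}

struct Machine {
    memory: Vec<u8>,
    frames: [u8; FRAMES],      // 1 = frame has already been zeroed
    writable: [bool; FRAMES],  // assumption: one permission flag per frame
}

impl Machine {
    fn new() -> Self {
        Machine {
            // Non-zero fill stands in for memory that was never initialized.
            memory: vec![0xaa; MEMORY_SIZE as usize],
            frames: [0; FRAMES],
            writable: [true; FRAMES],
        }
    }

    /// Bounds check, permission check, and lazy zeroing in a single pass.
    fn prepare_write(&mut self, addr: u64, size: u64) -> Result<(), Error> {
        let addr_to = addr.checked_add(size).ok_or(Error::OutOfBound)?;
        if addr_to > MEMORY_SIZE {
            return Err(Error::OutOfBound);
        }
        if size == 0 {
            return Ok(());
        }
        let frame_from = addr >> FRAME_SHIFT;
        // addr_to is exclusive, so the last touched byte is addr_to - 1.
        let frame_to = (addr_to - 1) >> FRAME_SHIFT;
        for frame in frame_from..=frame_to {
            let i = frame as usize; // cast only for indexing
            if !self.writable[i] {
                return Err(Error::InvalidPermission);
            }
            if self.frames[i] == 0 {
                let base = (frame << FRAME_SHIFT) as usize;
                self.memory[base..base + FRAME_SIZE as usize].fill(0);
                self.frames[i] = 1;
            }
        }
        Ok(())
    }
}
```

Because the range is validated once up front, the loop needs no `min` clamp against the frame count, and a load path could reuse the same shape without the permission test.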
I do suggest we keep `u64` here to be explicit about the types, and only cast to `usize` when we actually need to index. ckb-vm follows this approach throughout the codebase to be more precise with types.
src/machine/aot/aot.x64.c
Outdated
| jne >4
| mov byte [r9+rsi], 1
| push rax
| mov rax, rsi
Just one minor suggestion here: if `zeroed_memory` takes the address from rsi instead of rax, we might not need to save, update, and restore rax here, correct?
Afraid not, the syscall number needs to be stored in rax.
What is the syscall number here?
The address of `memset` is stored in rax, and rsi may be used as a parameter. It is not a syscall number; I described this code incorrectly.
Wait a minute, I will try to change it. You may be right.
What I'm saying is that you can just use rsi as the parameter to `zeroed_memory` instead of rax. What happens internally in `zeroed_memory` is a different story. This way, you don't need `push rax`, `mov rax, rsi`, and then `pop rax` around the call to `zeroed_memory`. The address that `zeroed_memory` will use is already in rsi, and we can rely on this fact.
I changed it; the code is clearer than before.
src/machine/aot/aot.x64.c
Outdated
@@ -1224,7 +1317,8 @@ int aot_memory_write(AotContext* context, AotValue address, AotValue v, uint32_t
  if (ret != DASM_S_OK) { return ret; }

  | mov rdx, size
  | call ->check_write
  | mov rcx, 1
A common trick here is to write `mov ecx, 1` instead: it gives the same result, because writing a 32-bit register zero-extends into the full 64-bit register, but it is encoded as a shorter instruction.
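To make the size difference concrete, the sketch below lists the standard x86-64 encodings of the two instructions as byte arrays. The constant and function names are hypothetical, and an assembler is free to pick a different (longer) encoding for the 64-bit form, so treat the byte counts as typical rather than guaranteed.

```rust
/// `mov ecx, 1`: opcode B8+r with r = 1 (ecx), followed by a 32-bit immediate.
const MOV_ECX_1: [u8; 5] = [0xb9, 0x01, 0x00, 0x00, 0x00];

/// `mov rcx, 1`: REX.W prefix (48), opcode C7 /0 with ModRM C1 selecting rcx,
/// followed by a sign-extended 32-bit immediate.
const MOV_RCX_1: [u8; 7] = [0x48, 0xc7, 0xc1, 0x01, 0x00, 0x00, 0x00];

/// Bytes saved per occurrence by using the 32-bit form.
fn bytes_saved() -> usize {
    MOV_RCX_1.len() - MOV_ECX_1.len()
}
```

Two bytes per occurrence is small, but it adds up in code that is emitted for every memory write.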
Fixed.
src/machine/asm/execute.S
Outdated
  CALL_MEMSET; \
  POSTCALL; \
2: \
  movq $check_write, temp_reg1; \
I feel it would be better to define two macros here: one that handles both memory initialization and the write permission check, for store operations, and one that handles only memory initialization, for load operations.
The problem is that even though we compare against `check_write` and skip those operations in the read calls, they are still generated in the code, resulting in larger code size. For a hot loop that is extremely performance sensitive, I suggest we keep the code as short as possible, even if that means some code duplication.
I rolled back the modifications to this file.
src/machine/asm/mod.rs
Outdated
    if overflowed {
        return Err(Error::OutOfBound);
    }
    let frame_to = addr_to >> MEMORY_FRAME_SHIFTS;
Should `frame_to` be `(addr_to - 1) >> MEMORY_FRAME_SHIFTS`? `frame_to` is used as an inclusive bound below.
Yes, this is a bug.
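The off-by-one can be seen in a small standalone sketch. The value of `MEMORY_FRAME_SHIFTS` used here is an assumption for illustration, and `last_frame_buggy` and `frame_range` are hypothetical helper names, not functions from the PR.

```rust
/// Assumed value for illustration only (256 KiB frames).
const MEMORY_FRAME_SHIFTS: u64 = 18;
const MEMORY_FRAMESIZE: u64 = 1 << MEMORY_FRAME_SHIFTS;

/// The computation under review: shifts the exclusive end address, so an
/// access ending exactly on a frame boundary also selects the next frame.
fn last_frame_buggy(addr: u64, size: u64) -> u64 {
    (addr + size) >> MEMORY_FRAME_SHIFTS
}

/// Inclusive range of frames touched by `size` bytes starting at `addr`.
/// Returns None for an empty access or when `addr + size` overflows.
fn frame_range(addr: u64, size: u64) -> Option<(u64, u64)> {
    if size == 0 {
        return None;
    }
    let addr_to = addr.checked_add(size)?;
    // The last byte touched is addr_to - 1, which determines the last frame.
    Some((addr >> MEMORY_FRAME_SHIFTS, (addr_to - 1) >> MEMORY_FRAME_SHIFTS))
}
```

For an access that covers exactly the first frame, `last_frame_buggy(0, MEMORY_FRAMESIZE)` yields frame 1, while `frame_range(0, MEMORY_FRAMESIZE)` yields the range from frame 0 to frame 0, so only the corrected form avoids zeroing a frame that was never touched.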
🎆 🎆 🎆
Description
Bench
Origin
Now
Almost unchanged.