New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pause ckb-vm by signal #348
Conversation
All my concerned about this PR is the consensus. It is important to note that when two identical running VMs receive a pause signal, the resulting snapshots may be different. |
src/machine/asm/execute_aarch64.S
Outdated
@@ -336,6 +339,9 @@ ckb_vm_x64_execute: | |||
add INST_PC, TRACE, CKB_VM_ASM_TRACE_OFFSET_THREAD | |||
NEXT_INST | |||
.prepare_trace: | |||
ldrb TEMP2w, [PAUSE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure on this, can you check the literature if we need atomic loads on aarch64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I compiled Pause
's has_interpreted
function on aarch64, the result is as follows:
0000000000000000 <_ZN2rs5Pause15has_interrupted17h49eea8a15effbf29E>:
0: f9400008 ldr x8, [x0]
4: 91004108 add x8, x8, #0x10
8: 08dffd08 ldarb w8, [x8]
c: 7100011f cmp w8, #0x0
10: 1a9f07e0 cset w0, ne // ne = any
14: d65f03c0 ret
Looks like we need to use ldarb
.
In addition, on x64, the result is
0000000000000000 <_ZN2rs5Pause15has_interrupted17h64f564067071b9e2E>:
0: 48 8b 07 mov (%rdi),%rax
3: 0f b6 40 10 movzbl 0x10(%rax),%eax
7: 84 c0 test %al,%al
9: 0f 95 c0 setne %al
c: c3 ret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that's because on x64, movzbl
guarentees atomic load, but on aarch64
we need to explicit use the atomic load operations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't think there is an error using ldrb
either, I use the Ordering::SeqCst
when implementing Pause
, if I use Ordering::Relaxed
, the rust compiler will use ldrb
instead of ldarb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordering::Relaxed
essentially means "I don't care in what order you load the memory, it could happen before the other side writes the stop signal, or after that", here we need acquiring load to ensure that loading operations happen after any flying storing operations. I do believe we need ldarb
here. In any case the compiler will not warn us, this will just be a mysterious runtime error later when at certain chip / OS configuration, we found that pausing does not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically when I write multi-threaded software, I use release
ordering store and acquire
ordering load, that can get you correct for the majority of the time. Only in very rare situations we would need acq-rel
or even seq-cst
.
} | ||
|
||
pub fn get_raw_ptr(&self) -> *mut u8 { | ||
&*self.s as *const _ as *mut u8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need get_mut here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot borrow data in an Arc
as mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that sense you will need Mutex to wrap on AtomicU8
.
What I'm concerned here, is that we are casting a *AtomicU8
directly to a *u8
, which means we are relying on the underlying layout that the u8 value lies at the first of the AtomicU8
structure. This is something that worries me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can write a test case to guarantee the relationship between AtomicU8 and u8.
Personally I don't think this is a big problem, yes 2 running VMs might stop at different places when using pause signals. But what we need to guarentee, is that each VM will finally end up in the idential process after multiple pause-snapshot-resume actions. We will not restore VM 2 using VM 1's snapshot. |
Can we dive in the signal latency issue a bit? Why reducing the signal latency makes test case pass? I'm concerned we might have instable code or at least instable tests here |
I think the execution of fib(30) is done within a second. (the signal was sent every 100 milliseconds, 10 times in total) |
I understand that the code works here now, but if we had a little time I strongly recommended we restructure the test code here. We could introduce a new syscall that maybe like sleeps for 100ms, or even better, build a semaphore between the syscall and the outsider code, so we can precisely control when to pause the program, providing more stable and robust tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function looks okay, I do wish we can adjust the tests a bit but no strong blocker against merging
* POC * The signal renamed to pause * Add signal structure * Implement it in aarch64 * Rename sturct Signal to Pause * Implemented in the interpreter * Fix ci * Fix aarch64 * Use ldarb to load the pause signal on aarch64 * Assert AtomicU8 type has the same in-memory representation as u8 * Reduce signal latency * Refactor pause tests cases
* POC * The signal renamed to pause * Add signal structure * Implement it in aarch64 * Rename sturct Signal to Pause * Implemented in the interpreter * Fix ci * Fix aarch64 * Use ldarb to load the pause signal on aarch64 * Assert AtomicU8 type has the same in-memory representation as u8 * Reduce signal latency * Refactor pause tests cases
In this PR, a new way to "pause" a running vm is implemented for ckb-vm. We can pause a running vm by sending a signal.
In addition, the signal sender and the vm instance may not be in the same thread.