
Pause ckb-vm by signal #348

Merged: 12 commits merged into nervosnetwork:develop on Jun 7, 2023

Conversation

@mohanson (Collaborator) commented Jun 6, 2023

In this PR, a new way to "pause" a running VM is implemented for ckb-vm: we can pause a running VM by sending it a signal.

In addition, the signal sender and the VM instance need not be on the same thread.

@XuJiandong (Collaborator)

My only concern about this PR is consensus. It is important to note that when two identical running VMs receive a pause signal, the resulting snapshots may differ.

@mohanson mohanson requested a review from xxuejie June 6, 2023 09:19
@@ -336,6 +339,9 @@ ckb_vm_x64_execute:
add INST_PC, TRACE, CKB_VM_ASM_TRACE_OFFSET_THREAD
NEXT_INST
.prepare_trace:
ldrb TEMP2w, [PAUSE]
Collaborator:

I wasn't sure about this; can you check the literature on whether we need atomic loads on aarch64?

Collaborator Author:

I compiled Pause's has_interrupted function on aarch64; the result is as follows:

0000000000000000 <_ZN2rs5Pause15has_interrupted17h49eea8a15effbf29E>:
   0:   f9400008        ldr     x8, [x0]
   4:   91004108        add     x8, x8, #0x10
   8:   08dffd08        ldarb   w8, [x8]
   c:   7100011f        cmp     w8, #0x0
  10:   1a9f07e0        cset    w0, ne  // ne = any
  14:   d65f03c0        ret

Looks like we need to use ldarb.

In addition, on x64 the result is:

0000000000000000 <_ZN2rs5Pause15has_interrupted17h64f564067071b9e2E>:
   0:   48 8b 07                mov    (%rdi),%rax
   3:   0f b6 40 10             movzbl 0x10(%rax),%eax
   7:   84 c0                   test   %al,%al
   9:   0f 95 c0                setne  %al
   c:   c3                      ret

Collaborator:

Yeah, I think that's because on x64 movzbl guarantees an atomic load, but on aarch64 we need to explicitly use the atomic load operations.

Collaborator Author:

But I don't think using ldrb is an error either. I used Ordering::SeqCst when implementing Pause; if I use Ordering::Relaxed, the Rust compiler emits ldrb instead of ldarb.

Collaborator:

Ordering::Relaxed essentially means "I don't care in what order you load the memory; it could happen before the other side writes the stop signal, or after it." Here we need an acquiring load to ensure that loading operations happen after any in-flight storing operations. I do believe we need ldarb here. In any case the compiler will not warn us; this would just be a mysterious runtime error later when, on some chip/OS configuration, we find that pausing does not work.

Collaborator:

Typically when I write multi-threaded software, I use release-ordering stores and acquire-ordering loads; that is correct the majority of the time. Only in very rare situations do we need acq-rel or even seq-cst.

}

pub fn get_raw_ptr(&self) -> *mut u8 {
&*self.s as *const _ as *mut u8
Collaborator:

Do we need get_mut here?

Collaborator Author:

We cannot borrow data in an Arc as mutable.

Collaborator:

In that sense you would need a Mutex wrapped around the AtomicU8.

What concerns me here is that we are casting a *const AtomicU8 directly to a *mut u8, which means we are relying on the underlying layout, i.e. that the u8 value lies at the start of the AtomicU8 structure. This is something that worries me.

Collaborator Author:

We can write a test case to guarantee the relationship between AtomicU8 and u8.

@xxuejie (Collaborator) commented Jun 7, 2023

> My only concern about this PR is consensus. It is important to note that when two identical running VMs receive a pause signal, the resulting snapshots may differ.

Personally I don't think this is a big problem; yes, two running VMs might stop at different places when using pause signals. But what we need to guarantee is that each VM finally goes through an identical process after multiple pause-snapshot-resume actions. We will not restore VM 2 using VM 1's snapshot.

@xxuejie (Collaborator) commented Jun 7, 2023

Can we dig into the signal latency issue a bit? Why does reducing the signal latency make the test case pass? I'm concerned we might have unstable code, or at least unstable tests, here.

@mohanson (Collaborator Author) commented Jun 7, 2023

> Can we dig into the signal latency issue a bit? Why does reducing the signal latency make the test case pass?

I think the execution of fib(30) is done within a second (the signal was sent every 100 milliseconds, 10 times in total).

@xxuejie (Collaborator) commented Jun 7, 2023

I understand that the code works here now, but if we have a little time, I strongly recommend we restructure the test code. We could introduce a new syscall that, say, sleeps for 100ms, or, even better, build a semaphore between the syscall and the outside code, so we can precisely control when to pause the program, providing more stable and robust tests.

xxuejie previously approved these changes Jun 7, 2023

@xxuejie left a comment:

The function looks okay. I do wish we could adjust the tests a bit, but that's no strong blocker against merging.

@mohanson mohanson merged commit 1f9c765 into nervosnetwork:develop Jun 7, 2023
11 checks passed
@mohanson mohanson deleted the signal branch June 7, 2023 06:18
mohanson added a commit to mohanson/ckb-vm that referenced this pull request Jun 8, 2023
* POC

* The signal renamed to pause

* Add signal structure

* Implement it in aarch64

* Rename sturct Signal to Pause

* Implemented in the interpreter

* Fix ci

* Fix aarch64

* Use ldarb to load the pause signal on aarch64

* Assert AtomicU8 type has the same in-memory representation as u8

* Reduce signal latency

* Refactor pause tests cases