
Pause ckb-vm by signal #348

Merged: 12 commits merged into nervosnetwork:develop on Jun 7, 2023

Conversation

@mohanson (Collaborator) commented Jun 6, 2023

In this PR, a new way to "pause" a running VM is implemented for ckb-vm: we can pause a running VM by sending it a signal.

In addition, the signal sender and the VM instance need not be on the same thread.

@XuJiandong (Collaborator)

My only concern about this PR is consensus. It is important to note that when two identical running VMs receive a pause signal, the resulting snapshots may differ.

@mohanson mohanson requested a review from xxuejie June 6, 2023 09:19
@@ -336,6 +339,9 @@ ckb_vm_x64_execute:
add INST_PC, TRACE, CKB_VM_ASM_TRACE_OFFSET_THREAD
NEXT_INST
.prepare_trace:
ldrb TEMP2w, [PAUSE]
Collaborator:

I wasn't sure about this; can you check the literature on whether we need atomic loads on aarch64?

Collaborator Author:

I compiled Pause's has_interrupted function on aarch64; the result is as follows:

0000000000000000 <_ZN2rs5Pause15has_interrupted17h49eea8a15effbf29E>:
   0:   f9400008        ldr     x8, [x0]
   4:   91004108        add     x8, x8, #0x10
   8:   08dffd08        ldarb   w8, [x8]
   c:   7100011f        cmp     w8, #0x0
  10:   1a9f07e0        cset    w0, ne  // ne = any
  14:   d65f03c0        ret

Looks like we need to use ldarb.

In addition, on x64 the result is:

0000000000000000 <_ZN2rs5Pause15has_interrupted17h64f564067071b9e2E>:
   0:   48 8b 07                mov    (%rdi),%rax
   3:   0f b6 40 10             movzbl 0x10(%rax),%eax
   7:   84 c0                   test   %al,%al
   9:   0f 95 c0                setne  %al
   c:   c3                      ret

Collaborator:

Yeah, I think that's because on x64 movzbl guarantees an atomic load, but on aarch64 we need to explicitly use the atomic load operations.

Collaborator Author:

But I don't think using ldrb is an error either. I used Ordering::SeqCst when implementing Pause; if I use Ordering::Relaxed, the Rust compiler emits ldrb instead of ldarb.

Collaborator:

Ordering::Relaxed essentially means "I don't care in what order you load the memory; it could happen before the other side writes the stop signal, or after it." Here we need an acquiring load to ensure that loading operations happen after any in-flight storing operations. I do believe we need ldarb here. In any case the compiler will not warn us; this would just be a mysterious runtime error later when, on some chip/OS configuration, we find that pausing does not work.

Collaborator:

Typically when I write multi-threaded software, I use release-ordering stores and acquire-ordering loads; that is correct the majority of the time. Only in very rare situations do we need acq-rel or even seq-cst.

}

pub fn get_raw_ptr(&self) -> *mut u8 {
&*self.s as *const _ as *mut u8
Collaborator:

Do we need get_mut here?

Collaborator Author:

We cannot borrow data in an Arc as mutable.

Collaborator:

In that sense you would need a Mutex wrapped around the AtomicU8.

What concerns me here is that we are casting a *const AtomicU8 directly to a *mut u8, which means we are relying on the underlying layout, i.e. that the u8 value lies at the start of the AtomicU8 structure. This is something that worries me.

Collaborator Author:

We can write a test case to guarantee the relationship between AtomicU8 and u8.

@xxuejie (Collaborator) commented Jun 7, 2023

> My only concern about this PR is consensus. It is important to note that when two identical running VMs receive a pause signal, the resulting snapshots may differ.

Personally I don't think this is a big problem; yes, two running VMs might stop at different places when using pause signals. But what we need to guarantee is that each VM finally goes through an identical process after multiple pause-snapshot-resume actions. We will not restore VM 2 using VM 1's snapshot.

@xxuejie (Collaborator) commented Jun 7, 2023

Can we dig into the signal latency issue a bit? Why does reducing the signal latency make the test case pass? I'm concerned we might have unstable code, or at least unstable tests, here.

@mohanson (Collaborator Author) commented Jun 7, 2023

> Can we dig into the signal latency issue a bit? Why does reducing the signal latency make the test case pass?

I think the execution of fib(30) is done within a second (the signal was sent every 100 milliseconds, 10 times in total).

@xxuejie (Collaborator) commented Jun 7, 2023

I understand that the code works here now, but if we have a little time, I strongly recommend we restructure the test code. We could introduce a new syscall that, say, sleeps for 100ms, or, even better, build a semaphore between the syscall and the outside code, so we can precisely control when to pause the program, providing more stable and robust tests.

xxuejie previously approved these changes Jun 7, 2023

@xxuejie left a comment:

The function looks okay. I do wish we could adjust the tests a bit, but that's no strong blocker against merging.

@mohanson mohanson merged commit 1f9c765 into nervosnetwork:develop Jun 7, 2023
11 checks passed
@mohanson mohanson deleted the signal branch June 7, 2023 06:18
mohanson added a commit to mohanson/ckb-vm that referenced this pull request Jun 8, 2023
* POC

* The signal renamed to pause

* Add signal structure

* Implement it in aarch64

* Rename sturct Signal to Pause

* Implemented in the interpreter

* Fix ci

* Fix aarch64

* Use ldarb to load the pause signal on aarch64

* Assert AtomicU8 type has the same in-memory representation as u8

* Reduce signal latency

* Refactor pause tests cases