Skip to content

Conversation

shqking
Copy link
Contributor

@shqking shqking commented May 25, 2021

As updated in file TSRM.c, function tsrm_get_ls_cache_tcb_offset() would
return an address, which stores the address of function tlv_get_addr(),
and this return value would be further used as the first argument of
function tlv_get_addr().

Note that 1) the address of thread local variable "_tsrm_ls_cache" is
obtained by invoking tlv_get_addr() in macOS, and 2) 'ldr' instruction
would be patched to "add" by the linker. See the example in [1].

The disassembly code for function tlv_get_addr() is shown in [2].

In this patch, we implement one fast path for tlv_get_addr in macro
LOAD_TSRM_CACHE and will fall back to slow path if needed.

Based on the comment for function tlv_allocate_and_initialize_for_key()
from [3], the slow path is supposed to be executed only once for each
thread.

Even though a number of test cases failed due to the potential conflicts
of local labels, this patch can be submitted for internal/upstream
reviews.

[1] https://gist.github.com/shqking/4aab67e0105f7c1f2c549d57d5799f94
[2] https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43
[3]
https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html

Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3

As updated in file TSRM.c, function tsrm_get_ls_cache_tcb_offset() would
return an address, which stores the address of function tlv_get_addr(),
and this return value would be further used as the first argument of
function tlv_get_addr().

Note that 1) the address of thread local variable "_tsrm_ls_cache" is
obtained by invoking tlv_get_addr() in macOS, and 2) 'ldr' instruction
would be patched to "add" by the linker. See the example in [1].

The disassembly code for function tlv_get_addr() is shown in [2].

In this patch, we implement one fast path for tlv_get_addr in macro
LOAD_TSRM_CACHE and will fall back to slow path if needed.

Based on the comment for function tlv_allocate_and_initialize_for_key()
from [3], the slow path is supposed to be executed only once for each
thread.

Even though a number of test cases failed due to the potential conflicts
of local labels, this patch can be submitted for internal/upstream
reviews.

[1] https://gist.github.com/shqking/4aab67e0105f7c1f2c549d57d5799f94
[2] https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43
[3]
https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html

Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3
@shqking shqking requested a review from dstogov May 25, 2021 09:37
| ldr TMP2, [TMP3, #0x10]
| ldr reg, [TMP1, TMP2]
| b >7
|4:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion:
The slow path is common code and it would be good to put it in some global stub, however, we have to jump back from global stub.
As per "cold code", this would increase the code size since this piece of code would be generated for each invocation of macro "LOAD_TSRM_CACHE".

Another solution might be that, we access variable tsrm_ls_cache just after the thread is created, and in this way the slow path can be avoided. Yes. This would be correct with the premise that slow path is executed only once for each thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that it works in my local testing environment with the following patch. All test cases can pass. But I'm not sure if this is sound.

diff --git a/TSRM/TSRM.c b/TSRM/TSRM.c
index 18cf0d163c..c2907dee56 100644
--- a/TSRM/TSRM.c
+++ b/TSRM/TSRM.c
@@ -745,6 +745,7 @@ TSRM_API size_t tsrm_get_ls_cache_tcb_offset(void)
        size_t ret;

 # ifdef __APPLE__
+       ret = _tsrm_ls_cache;
        // In macOS, "ret" holds the address of one TLVDecriptor, where the first
        // member is the address of function tlv_get_addr().
        asm("adrp %0, #__tsrm_ls_cache@TLVPPAGE\n\t"
diff --git a/ext/opcache/jit/zend_jit_arm64.dasc b/ext/opcache/jit/zend_jit_arm64.dasc
index 0fc88a5090..615c08a749 100644
--- a/ext/opcache/jit/zend_jit_arm64.dasc
+++ b/ext/opcache/jit/zend_jit_arm64.dasc
@@ -554,24 +554,8 @@ static int logical_immediate_p (uint64_t value, uint32_t reg_size)
 |      and TMP1, TMP1, #0xfffffffffffffff8
 |      ldr TMP2, [TMP3, #0x8]
 |      ldr TMP1, [TMP1, TMP2, lsl #3]
-|      cbz TMP1, >4     // TODO: There is conflict for local labels. A number of test cases would fail.
 |      ldr TMP2, [TMP3, #0x10]
 |      ldr reg, [TMP1, TMP2]
-|      b >7
-|4:
-|      // Slow path. Invoke tlv_get_addr(). TODO: Put it in cold code or global stub?
-|      stp x29, x30, [sp, #-16]!
-|      mov x29, sp
-|      SAVE_REGS
-|      mov x0, TMP3
-|      ldr TMP1, [TMP3]
-|      blr TMP1
-|      mov TMP3, x0
-|      RESTORE_REGS     // TMP3 is not in this list
-|      ldr reg, [TMP3]
-|      mov sp, x29
-|      ldp, x29, x30, [sp], #16
-|7:
 ||#else
 |      .long 0xd53bd051 // TODO: hard-coded: mrs TMP3, tpidr_el0
 ||     ZEND_ASSERT(tsrm_ls_cache_tcb_offset <= LDR_STR_PIMM64);

| and TMP1, TMP1, #0xfffffffffffffff8
| ldr TMP2, [TMP3, #0x8]
| ldr TMP1, [TMP1, TMP2, lsl #3]
| cbz TMP1, >4 // TODO: There is conflict for local labels. A number of test cases would fail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slow patch should be never reached. It may be called when thread accesses tsrm_ls_cache for the first time, but it definitely must be accessed before control come to JIT-ed code. Do you understand the meaning of the constants above 0xfffffffffffffff8, 0x9, lsl 3?

I think, ldr TMP2, [TMP3, #0x8] and following lsl #3 should be constant for all threads. Probably, it may be executed in TSRM.c code. Then resulting code should be:

mrs TMP1, tpidrro_el0
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, tsrm_ls_cache_tcb_offset_0x8_lsl_3, TMP2
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, tsrm_ls_cache_tcb_offset_0x10, TMP2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Agree. The slow path should be avoided. But may I ask where is the proper place to access tsrm_ls_cache?

0xfffffffffffffff8 is used as CPU number mask. See https://opensource.apple.com/source/xnu/xnu-4570.1.46/osfmk/arm64/cswitch.s.auto.html, the "set_thread_registers" section.

ldr TMP2, [TMP3, #0x8] is to get the "key" and ldr TMP2, [TMP3, #0x10] is used to get the "offset". See struct TLVDescriptor from https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html

I'm afraid I didn't get lsl #3 either.

Yes. I agree. "key" and "offset" are both constant and we can get them in the TSRM.c code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we may keep TSRM.c and do something like this.

mrs TMP1, tpidrro_el0
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->key << 3, TMP2
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->offset, TMP2

may be need also add and TMP1, TMP1, #0xfffffffffffffff8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got your point. I will update the code soon.

Another question for me is where is the best place for each thread to access thread local varaible_tsrm_ls_cache ?

I guess zend_jit_setup() -> tsrm_get_ls_cache_tcb_offset() is executed only once for each process, not for each thread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_tsrm_ls_cache is guaranteed to be accessed before any VM or JIT code, during request startup, e.g. in init_executor()

zend_jit_setup() executed once, but in case TLVDescriptor is immutable, this should be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLVDescriptor for variable _tsrm_ls_cache is immutable, but the value in TMP1 at line 556 varies for different threads. This value is also SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->key << 3, TMP2 for your update.

If TMP1 is zero at line 556, slow path would be invoked.

Here I think we should guarantee this TMP1 at line 556 (the value stored in
TMP1 + ((TLVDescriptor*)tsrm_ls_cache_tcb_offset))->key << 3) ) for each thread is initialized in advance. Right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLVDescriptor for variable _tsrm_ls_cache is immutable, but the value in TMP1 at line 556 varies for different threads. This value is also SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->key << 3, TMP2 for your update.

This is expected, but I don't see any problems.

If TMP1 is zero at line 556, slow path would be invoked.

Here I think we should guarantee this TMP1 at line 556 (the value stored in
TMP1 + ((TLVDescriptor*)tsrm_ls_cache_tcb_offset))->key << 3) ) for each thread is initialized in advance. Right?

In my opinion, it can't be zero at the moment when JIT-ed code is executed.

Access to TLV(thread local variable) in macOS is in "dynamic" form and
function tlv_get_addr() is invoked to resolve the address. See the
example in [1].

Note there is one struct TLVDescriptor [2] for each TLV. The first
member holds the address of function tlv_get_addr(), and the other two
members, "key" and "offset", would be used inside tlv_get_addr().

The disassembly code for function tlv_get_addr() is shown in [3]. With
the value from system register, i.e. tpidrro_el0, together with "key"
and "offset", the TLV address can be obtained.

Note that the value from tpidrro_el0 varies for different threads, and
unique address for TLV is resolved.

It's worth noting that slow path would be executed, i.e. function
tlv_allocate_and_initialize_for_key(), for the first time of TLV access.

In this patch:
1. "_tsrm_ls_cache" is guaranteed to be accessed before any VM/JIT code
during the request startup, e.g. in init_executor(), therefore, slow
path can be avoided.

2. As TLVDecriptor is immutable and zend_jit_setup() executes once, we
get this structure in tsrm_get_ls_cache_tcb_offset(). Note the 'ldr'
instruction would be patched to 'add' by the linker.

3. Only fast path for tlv_get_addr() is implemented in macro
LOAD_TSRM_CACHE.

With this patch, all ~4k test cases can pass for ZTS+CALL in macOS on
Apple silicon.

[1] https://gist.github.com/shqking/4aab67e0105f7c1f2c549d57d5799f94
[2]
https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html
[3] https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43

Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3
||#ifdef __APPLE__
| .long 0xd53bd071 // TODO: hard-coded: mrs TMP3, tpidrro_el0
| and TMP3, TMP3, #0xfffffffffffffff8
| SAFE_MEM_ACC_WITH_64_UOFFSET ldr, TMP3, TMP3, (((TLVDescriptor*)tsrm_ls_cache_tcb_offset)->key << 3), TMP1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slow path is avoided with the premise that tlv _tsrm_ls_cache is guaranteed to be accessed before the JIT-ed.

We have one concern that it would be a bit difficult for debugging if the premise is not satisfied with the software evolution in the future.

@dstogov Do you think we should add one check here?
How to do:

  1. add cbz TMP3, ->trap_label here
  2. define one global label named trap_label in one new defined stub function only for macOS case. In this label, brk #0 is the only instruction.

Impact:

  1. Performance: one cbz instruction would be executed each time, but it would not impose much overhead with the help of branch prediction.
  2. Code size: one cbz instruction would be produced for each invocation of macro LOAD_TSRM_CACHE.

Pros: In the future, if the premise is missing, it helps to debug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, everything fine now. There is no a big difference between debugging brk #0 trap or crash right after mrs TMP3, tpidrro_el0. I prefer not to add cbz.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cbz is to check the value in TMP3 after the first load at line 508.
It's possible that the crash site is far away mrs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recall that the value in TMP3 after line 508 corresponds to the x17 after line 5 in https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43#file-tlv_get_addr-s-L6

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume *.phpt tests are passed.

||#ifdef __APPLE__
| .long 0xd53bd071 // TODO: hard-coded: mrs TMP3, tpidrro_el0
| and TMP3, TMP3, #0xfffffffffffffff8
| SAFE_MEM_ACC_WITH_64_UOFFSET ldr, TMP3, TMP3, (((TLVDescriptor*)tsrm_ls_cache_tcb_offset)->key << 3), TMP1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, everything fine now. There is no a big difference between debugging brk #0 trap or crash right after mrs TMP3, tpidrro_el0. I prefer not to add cbz.

The update in zend_jit.c is one workaround for macOS ZTS support. One
TODO comment is added.

Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3
@shqking
Copy link
Contributor Author

shqking commented May 31, 2021

I assume *.phpt tests are passed.

Yes. All ~4k *.phpt test cases under Zend/tests/, tests/, ext/opcache/tests/jit/ would pass for JIT/arm64 under "ZTS+CALL" mode.

@shqking shqking merged commit 7c2a3a9 into php:master May 31, 2021
@shqking shqking deleted the fast-path-tls branch May 31, 2021 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants