-
Notifications
You must be signed in to change notification settings - Fork 7.9k
JIT/AArch64: [macos][ZTS] Support fast path for tlv_get_addr #7042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
As updated in file TSRM.c, function tsrm_get_ls_cache_tcb_offset() would return an address, which stores the address of function tlv_get_addr(), and this return value would be further used as the first argument of function tlv_get_addr(). Note that 1) the address of thread local variable "_tsrm_ls_cache" is obtained by invoking tlv_get_addr() in macOS, and 2) 'ldr' instruction would be patched to "add" by the linker. See the example in [1]. The disassembly code for function tlv_get_addr() is shown in [2]. In this patch, we implement one fast path for tlv_get_addr in macro LOAD_TSRM_CACHE and will fall back to slow path if needed. Based on the comment for function tlv_allocate_and_initialize_for_key() from [3], the slow path is supposed to be executed only once for each thread. Even though a number of test cases failed due to the potential conflicts of local labels, this patch can be submitted for internal/upstream reviews. [1] https://gist.github.com/shqking/4aab67e0105f7c1f2c549d57d5799f94 [2] https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43 [3] https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3
ext/opcache/jit/zend_jit_arm64.dasc
Outdated
| ldr TMP2, [TMP3, #0x10] | ||
| ldr reg, [TMP1, TMP2] | ||
| b >7 | ||
|4: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion:
The slow path is common code and it would be good to put it in some global stub, however, we have to jump back from global stub.
As per "cold code", this would increase the code size since this piece of code would be generated for each invocation of macro "LOAD_TSRM_CACHE".
Another solution might be that, we access variable tsrm_ls_cache
just after the thread is created, and in this way the slow path can be avoided. Yes. This would be correct with the premise that slow path is executed only once for each thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that it works in my local testing environment with the following patch. All test cases can pass. But I'm not sure if this is sound.
diff --git a/TSRM/TSRM.c b/TSRM/TSRM.c
index 18cf0d163c..c2907dee56 100644
--- a/TSRM/TSRM.c
+++ b/TSRM/TSRM.c
@@ -745,6 +745,7 @@ TSRM_API size_t tsrm_get_ls_cache_tcb_offset(void)
size_t ret;
# ifdef __APPLE__
+ ret = _tsrm_ls_cache;
// In macOS, "ret" holds the address of one TLVDecriptor, where the first
// member is the address of function tlv_get_addr().
asm("adrp %0, #__tsrm_ls_cache@TLVPPAGE\n\t"
diff --git a/ext/opcache/jit/zend_jit_arm64.dasc b/ext/opcache/jit/zend_jit_arm64.dasc
index 0fc88a5090..615c08a749 100644
--- a/ext/opcache/jit/zend_jit_arm64.dasc
+++ b/ext/opcache/jit/zend_jit_arm64.dasc
@@ -554,24 +554,8 @@ static int logical_immediate_p (uint64_t value, uint32_t reg_size)
| and TMP1, TMP1, #0xfffffffffffffff8
| ldr TMP2, [TMP3, #0x8]
| ldr TMP1, [TMP1, TMP2, lsl #3]
-| cbz TMP1, >4 // TODO: There is conflict for local labels. A number of test cases would fail.
| ldr TMP2, [TMP3, #0x10]
| ldr reg, [TMP1, TMP2]
-| b >7
-|4:
-| // Slow path. Invoke tlv_get_addr(). TODO: Put it in cold code or global stub?
-| stp x29, x30, [sp, #-16]!
-| mov x29, sp
-| SAVE_REGS
-| mov x0, TMP3
-| ldr TMP1, [TMP3]
-| blr TMP1
-| mov TMP3, x0
-| RESTORE_REGS // TMP3 is not in this list
-| ldr reg, [TMP3]
-| mov sp, x29
-| ldp, x29, x30, [sp], #16
-|7:
||#else
| .long 0xd53bd051 // TODO: hard-coded: mrs TMP3, tpidr_el0
|| ZEND_ASSERT(tsrm_ls_cache_tcb_offset <= LDR_STR_PIMM64);
ext/opcache/jit/zend_jit_arm64.dasc
Outdated
| and TMP1, TMP1, #0xfffffffffffffff8 | ||
| ldr TMP2, [TMP3, #0x8] | ||
| ldr TMP1, [TMP1, TMP2, lsl #3] | ||
| cbz TMP1, >4 // TODO: There is conflict for local labels. A number of test cases would fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This slow patch should be never reached. It may be called when thread accesses tsrm_ls_cache
for the first time, but it definitely must be accessed before control come to JIT-ed code. Do you understand the meaning of the constants above 0xfffffffffffffff8
, 0x9
, lsl 3
?
I think, ldr TMP2, [TMP3, #0x8]
and following lsl #3
should be constant for all threads. Probably, it may be executed in TSRM.c code. Then resulting code should be:
mrs TMP1, tpidrro_el0
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, tsrm_ls_cache_tcb_offset_0x8_lsl_3, TMP2
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, tsrm_ls_cache_tcb_offset_0x10, TMP2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Agree. The slow path should be avoided. But may I ask where is the proper place to access tsrm_ls_cache
?
0xfffffffffffffff8
is used as CPU number mask. See https://opensource.apple.com/source/xnu/xnu-4570.1.46/osfmk/arm64/cswitch.s.auto.html, the "set_thread_registers" section.
ldr TMP2, [TMP3, #0x8]
is to get the "key" and ldr TMP2, [TMP3, #0x10]
is used to get the "offset". See struct TLVDescriptor
from https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html
I'm afraid I didn't get lsl #3
either.
Yes. I agree. "key" and "offset" are both constant and we can get them in the TSRM.c code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we may keep TSRM.c and do something like this.
mrs TMP1, tpidrro_el0
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->key << 3, TMP2
SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->offset, TMP2
may be need also add and TMP1, TMP1, #0xfffffffffffffff8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your point. I will update the code soon.
Another question for me is where is the best place for each thread to access thread local varaible_tsrm_ls_cache
?
I guess zend_jit_setup()
-> tsrm_get_ls_cache_tcb_offset()
is executed only once for each process, not for each thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_tsrm_ls_cache
is guaranteed to be accessed before any VM or JIT code, during request startup, e.g. in init_executor()
zend_jit_setup() executed once, but in case TLVDescriptor is immutable, this should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLVDescriptor
for variable _tsrm_ls_cache
is immutable, but the value in TMP1
at line 556 varies for different threads. This value is also SAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->key << 3, TMP2
for your update.
If TMP1
is zero at line 556, slow path would be invoked.
Here I think we should guarantee this TMP1
at line 556 (the value stored in
TMP1 + ((TLVDescriptor*)tsrm_ls_cache_tcb_offset))->key << 3)
) for each thread is initialized in advance. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLVDescriptor
for variable_tsrm_ls_cache
is immutable, but the value inTMP1
at line 556 varies for different threads. This value is alsoSAFE_MEM_ACC_WITH_UOFFSET ldr, TMP1, TMP1, ((TLVDescriptor*)(tsrm_ls_cache_tcb_offset))->key << 3, TMP2
for your update.
This is expected, but I don't see any problems.
If
TMP1
is zero at line 556, slow path would be invoked.Here I think we should guarantee this
TMP1
at line 556 (the value stored in
TMP1 + ((TLVDescriptor*)tsrm_ls_cache_tcb_offset))->key << 3)
) for each thread is initialized in advance. Right?
In my opinion, it can't be zero at the moment when JIT-ed code is executed.
Access to TLV(thread local variable) in macOS is in "dynamic" form and function tlv_get_addr() is invoked to resolve the address. See the example in [1]. Note there is one struct TLVDescriptor [2] for each TLV. The first member holds the address of function tlv_get_addr(), and the other two members, "key" and "offset", would be used inside tlv_get_addr(). The disassembly code for function tlv_get_addr() is shown in [3]. With the value from system register, i.e. tpidrro_el0, together with "key" and "offset", the TLV address can be obtained. Note that the value from tpidrro_el0 varies for different threads, and unique address for TLV is resolved. It's worth noting that slow path would be executed, i.e. function tlv_allocate_and_initialize_for_key(), for the first time of TLV access. In this patch: 1. "_tsrm_ls_cache" is guaranteed to be accessed before any VM/JIT code during the request startup, e.g. in init_executor(), therefore, slow path can be avoided. 2. As TLVDecriptor is immutable and zend_jit_setup() executes once, we get this structure in tsrm_get_ls_cache_tcb_offset(). Note the 'ldr' instruction would be patched to 'add' by the linker. 3. Only fast path for tlv_get_addr() is implemented in macro LOAD_TSRM_CACHE. With this patch, all ~4k test cases can pass for ZTS+CALL in macOS on Apple silicon. [1] https://gist.github.com/shqking/4aab67e0105f7c1f2c549d57d5799f94 [2] https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html [3] https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43 Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3
||#ifdef __APPLE__ | ||
| .long 0xd53bd071 // TODO: hard-coded: mrs TMP3, tpidrro_el0 | ||
| and TMP3, TMP3, #0xfffffffffffffff8 | ||
| SAFE_MEM_ACC_WITH_64_UOFFSET ldr, TMP3, TMP3, (((TLVDescriptor*)tsrm_ls_cache_tcb_offset)->key << 3), TMP1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slow path is avoided with the premise that tlv _tsrm_ls_cache
is guaranteed to be accessed before the JIT-ed.
We have one concern that it would be a bit difficult for debugging if the premise is not satisfied with the software evolution in the future.
@dstogov Do you think we should add one check here?
How to do:
- add
cbz TMP3, ->trap_label
here - define one global label named
trap_label
in one new defined stub function only for macOS case. In this label,brk #0
is the only instruction.
Impact:
- Performance: one
cbz
instruction would be executed each time, but it would not impose much overhead with the help of branch prediction. - Code size: one
cbz
instruction would be produced for each invocation of macroLOAD_TSRM_CACHE
.
Pros: In the future, if the premise is missing, it helps to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, everything fine now. There is no a big difference between debugging brk #0
trap or crash right after mrs TMP3, tpidrro_el0
. I prefer not to add cbz
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the cbz
is to check the value in TMP3
after the first load at line 508.
It's possible that the crash site is far away mrs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recall that the value in TMP3
after line 508 corresponds to the x17
after line 5 in https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43#file-tlv_get_addr-s-L6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume *.phpt tests are passed.
||#ifdef __APPLE__ | ||
| .long 0xd53bd071 // TODO: hard-coded: mrs TMP3, tpidrro_el0 | ||
| and TMP3, TMP3, #0xfffffffffffffff8 | ||
| SAFE_MEM_ACC_WITH_64_UOFFSET ldr, TMP3, TMP3, (((TLVDescriptor*)tsrm_ls_cache_tcb_offset)->key << 3), TMP1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, everything fine now. There is no a big difference between debugging brk #0
trap or crash right after mrs TMP3, tpidrro_el0
. I prefer not to add cbz
.
The update in zend_jit.c is one workaround for macOS ZTS support. One TODO comment is added. Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3
Yes. All ~4k *.phpt test cases under |
As updated in file TSRM.c, function tsrm_get_ls_cache_tcb_offset() would
return an address, which stores the address of function tlv_get_addr(),
and this return value would be further used as the first argument of
function tlv_get_addr().
Note that 1) the address of thread local variable "_tsrm_ls_cache" is
obtained by invoking tlv_get_addr() in macOS, and 2) 'ldr' instruction
would be patched to "add" by the linker. See the example in [1].
The disassembly code for function tlv_get_addr() is shown in [2].
In this patch, we implement one fast path for tlv_get_addr in macro
LOAD_TSRM_CACHE and will fall back to slow path if needed.
Based on the comment for function tlv_allocate_and_initialize_for_key()
from [3], the slow path is supposed to be executed only once for each
thread.
Even though a number of test cases failed due to the potential conflicts
of local labels, this patch can be submitted for internal/upstream
reviews.
[1] https://gist.github.com/shqking/4aab67e0105f7c1f2c549d57d5799f94
[2] https://gist.github.com/shqking/329d7712c26bad49786ab0a544a4af43
[3]
https://opensource.apple.com/source/dyld/dyld-195.6/src/threadLocalVariables.c.auto.html
Change-Id: I613e9c37e3ff2ecc3fab0f53f1e48a0246e12ee3