
CodeGen: Preserve known tags for LOAD_TVALUE synthesized from LOADK #1201

Merged 2 commits on Mar 15, 2024

Conversation

@zeux (Collaborator) commented Mar 14, 2024

When lowering LOADK for booleans/numbers/nils, we deconstruct the operation using STORE_TAG which informs the rest of the optimization pipeline about the tag of the value. This is helpful to remove various tag checks.

When the constant is a string or a vector, we just use LOAD_TVALUE/STORE_TVALUE. For strings, this could be replaced by pointer load/store, but for vectors there's no great alternative using current IR ops; in either case, the optimization needs to be carefully examined for profitability as simply copying constants into registers for function calls could become more expensive.

However, there are cases where it's still valuable to preserve the tag. For vectors, any math with vector constants involves tag checks that could be removed. For both strings and vectors, storing them into a table has a barrier that for vectors could be elided, and for strings could be simplified, as there's no need to confirm the tag.

With this change we now carry the optional tag of the value with LOAD_TVALUE. This has no performance effect on existing benchmarks but does reduce the generated code for benchmarks by ~0.1%, and it makes vector code more efficient (~5% lift on X64 log1p approximation).

@zeux (Collaborator, Author) commented Mar 14, 2024

This change results in a ~4% lift on A64 and a ~5% lift on X64 for the vector approximation, which is the main goal here. It also doesn't seem to hurt for strings (the code size reduction on benchmark code is due to strings exclusively; I don't have access to a large Luau codebase, so I can't measure broad impact, but I assume it's a slight code size reduction there as well), so it's applied unconditionally.

@zeux (Collaborator, Author) commented Mar 14, 2024

fwiw, the A64 assembly diff for the vector approximation:

     8: 	local r: vector = vector(0, 0, 0)
 LOADK R1 K0 [0, 0, 0]
 vector <- vector, any
-#   %4 = LOAD_TVALUE K0                                       ; useCount: 2, lastUse: %14
+#   %4 = LOAD_TVALUE K0, 0i, tvector                          ; useCount: 1, lastUse: %14
  ldr         q31,[x22]
-#   STORE_TVALUE R1, %4                                       ; %5
- str         q31,[x25,#16]
     9: 	local a: vector = x
 MOVE R2 R0
 vector <- vector, any
 .L22:
-#   %6 = LOAD_TVALUE R0                                       ; useCount: 8, lastUse: %123
+#   %6 = LOAD_TVALUE R0                                       ; useCount: 7, lastUse: %123
  ldr         q30,[x25]
-#   STORE_TVALUE R2, %6                                       ; %7
- str         q30,[x25,#32]
    10: 	r += a -- todo: * 1 doesn't fold yet
 ADD R1 R1 R2
 vector <- vector, vector
 .L23:
-#   %8 = LOAD_TAG R1                                          ; useCount: 1, lastUse: %9
- ldr         w17,[x25,#28]
-#   CHECK_TAG %8, tvector, exit(2)                            ; %9
- cmp         w17,#4
- b.ne        .L24
 #   %14 = ADD_VEC %4, %6                                      ; useCount: 1, lastUse: %39
  fadd        v31.4s,v31.4s,v30.4s
    11: 	a *= x
 MUL R2 R2 R0
 vector <- vector, vector

@vegorov-rbx (Collaborator) commented:

I think this change triggers a bug in another optimization and causes tests on Ubuntu to fail.
I believe it will be fixed by the sync PR today, so will have to be rebased on that.

@zeux (Collaborator, Author) commented Mar 15, 2024

My prediction is that the sync will hit the same issue but we'll see :)

The error shows up in a slightly different run every time, afaict: the first run of this PR actually failed the main test execution (no codegen), and a rerun just now failed in a flags=false no-codegen run:

[screenshot: CI failure output]

The failure happens before any tests actually run (there's no doctest banner, which is printed first). This is an ASAN issue with ASLR that I also have locally on my system; it's spuriously failing test runs on master as well as on this branch. It's been reported in google/sanitizers#1614 and fixed in some versions of clang/gcc, but if you don't have the latest build then you're going to run into issues...

Locally, sudo sysctl -w vm.mmap_rnd_bits=28 fixes the issue completely - if the sync still sees it we may need to add this to the build steps before running any tests for the time being...
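For reference, a sketch of the workaround as it might appear in a CI build step (the placement and comment are illustrative; only the sysctl command itself is from the discussion above):

```shell
# Reduce ASLR entropy so that ASAN's shadow-memory mapping succeeds with
# older clang/gcc sanitizer runtimes (see google/sanitizers#1614).
# Must run before any sanitized test binaries; requires root.
sudo sysctl -w vm.mmap_rnd_bits=28
```

The setting is not persistent across reboots, which is fine for ephemeral CI runners but means local machines need it reapplied (or set via sysctl.conf).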

@zeux (Collaborator, Author) commented Mar 15, 2024

Judging from the references that are starting to appear on the linked issue, this started happening on GHA recently. Perhaps setting the number of randomized bits to 32 is a recent GHA-side or Ubuntu-side config change.

@zeux (Collaborator, Author) commented Mar 15, 2024

#1203

@vegorov-rbx (Collaborator) commented:

Thanks for figuring that out.

@vegorov-rbx merged commit a768311 into master on Mar 15, 2024 (8 checks passed) and deleted the cg-ltvtag branch on March 15, 2024 at 16:49.