Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AArch64] Fix pairing different types of registers when computing CSRs. #66642

Merged
merged 7 commits into from
Oct 17, 2023

Conversation

nocchijiang
Copy link
Contributor

If a function has odd number of same type of registers to save, and the calling convention also requires odd number of such type of CSRs, an FP register would be accidentally marked as saved when producePairRegisters returns true.

This patch also fixes the AArch64LowerHomogeneousPrologEpilog pass not handling AArch64::NoRegister; actually this pass must be fixed along with the register pairing so i can write a test for it.

@nocchijiang
Copy link
Contributor Author

nocchijiang commented Sep 18, 2023

@TNorthover could you please have a look at the patch and add reviewers for me (i don't have commit access)?

@kyulee-com @kyulee1 Trying to ping the original author of AArch64LowerHomogeneousPrologEpilog pass. Please also have a look at it when you are free.

@kyulee-com
Copy link
Contributor

kyulee-com commented Sep 26, 2023

Thanks for finding and fixing this issue!
Can you share some numbers on why this specialization is worth rather than bailing out since it's actually trying to handle the case where CSRs are not pairable (although the helper function names/checks are named after a pair)?
I also have a few questions on this approach:

  • How is CFI unwind info generated? Is it still valid? I'm not sure how it works with the compact unwind where CSRs are expected to be paired in order (without skipping one).
  • I saw xzr is handled at last after skipping a missing CSR. However, saving/restoring xzr is unnecessary, strictly speaking. Should we align the helper name that refelects xzr (although it's no-op) or should we use a single load/store as opposed to a pair load/store for the last element?

@nocchijiang
Copy link
Contributor Author

@kyulee-com

Thanks for finding and fixing this issue! Can you share some numbers on why this specialization is worth rather than bailing out since it's actually trying to handle the case where CSRs are not pairable (although the helper function names/checks are named after a pair)? I also have a few questions on this approach:

First of all let me thank you for the great work on the AArch64LowerHomogeneousPrologEpilog pass! I tried to enable the pass for some of the top apps from our company and noticed that it works better with Swift code. I found that Swift compiler tends to generate more functions for the same binary size compared to Clang, thus more frame setup/destroy instructions are outlined. 3.4% binary size benefit was observed in one of the tested apps which adopts Swift most heavily (roughly half of the codebase is consisted of Swift code); the respective number for other apps with much less Swift code is about 1.4%.

However, the app instantly crashed when I opened it. The crash log indicated that there were illegal instructions in the outlined helper functions. After some debugging around AArch64FrameLowering and the pass itself, I found that the pass was always assuming the input registers paired, while the functions with a parameter pointing to swifterror have a chance to create the unpaired cases. Actually these cases are not rare: by bailing out the problematic cases I observed that 0.14% binary size benefit was lost from the heavily-adopt-Swift app. This number is decent enough that it worth a proper fix in my opinion.

Moreover, I think it is fair to call it a bug, which should be addressed, for determineCalleeSaves trying to pair registers from different classes.

  • How is CFI unwind info generated? Is it still valid? I'm not sure how it works with the compact unwind where CSRs are expected to be paired in order (without skipping one).

As long as the compact unwind info is not generated for the unpaired cases I think we are safe. This patch is not changing the behavior of unwind info generation. Correct me if i am wrong.

  • I saw xzr is handled at last after skipping a missing CSR. However, saving/restoring xzr is unnecessary, strictly speaking. Should we align the helper name that refelects xzr (although it's no-op) or should we use a single load/store as opposed to a pair load/store for the last element?

I wanted to put minimized effort to make the pass work for the unpaired cases, so I decided to keep the load/store pair form. If this introduces harmful side effect that I am not aware of, please let me know. I'm OK to add XZR into the name if you prefer the verbosity & accurate naming.

@kyulee-com
Copy link
Contributor

This patch is not changing the behavior of unwind info generation.

Intuitively, this doesn't seem right as you also save and restore xzr at the bottom of stack.

If you compare your test before/after -homogeneous-prolog-epilog, it seems different codegen semantically.
For instance, it didn't save x29 originally while it now saves x29. As mentioned above, the bottom of stack seems different. Originally it had x28/x27 at sp+16 while it is now with x28/xzr at sp+0. So, the subsequent instruction str x21, [sp, #8] in the function body seems overlapped with the latter.

I suggest adding a real test case to confirm the change. Also suggest adding more comments as the underlying assumption was changed quite a bit. In fact, I prefer to have a right implementation for easier maintenance/understanding although it might lead to more substantial changes.

Another note/tip on debugging. Although the original implementation does not have a flag to expand the helper in place, I think the transformation should be still semantically valid even if shouldUseFrameHelper were false always, although I don't expect the same code sequence as the original code.

If a function has odd number of same type of registers to save, and
the calling convention also requires odd number of such type of CSRs,
an FP register would be accidentally marked as saved when
producePairRegisters returns true.

This patch also fixes the AArch64LowerHomogeneousPrologEpilog pass not
handling AArch64::NoRegister; actually this pass must be fixed along
with the register pairing so i can write a test for it.
@nocchijiang
Copy link
Contributor Author

For instance, it didn't save x29 originally while it now saves x29

I think this is the expected behavior. When producePairRegisters returns true, as long as LR is saved, FP will be unconditionally saved no matter if it is really needed (AArch64FrameLowering::hasFP). The same behavior can be observed on Darwin platform when switching off the homogeneous prolog epilog pass:

Sample IR:

declare void @bar(i32 %i)

define void @test(i32 %i) nounwind minsize {
  call void asm sideeffect "mov x0, #42", "~{x0},~{x19},~{x20},~{x21},~{x22},~{x23},~{x24},~{x25},~{x26},~{x27}"() nounwind
  call void @bar(i32 %i)
  ret void
}

Output:

	.section	__TEXT,__text,regular,pure_instructions
	.ios_version_min 7, 0
	.globl	_test                           ; -- Begin function test
	.p2align	2
_test:                                  ; @test
; %bb.0:
	stp	x28, x27, [sp, #-96]!           ; 16-byte Folded Spill
	stp	x26, x25, [sp, #16]             ; 16-byte Folded Spill
	mov	w8, w0
	stp	x24, x23, [sp, #32]             ; 16-byte Folded Spill
	stp	x22, x21, [sp, #48]             ; 16-byte Folded Spill
	stp	x20, x19, [sp, #64]             ; 16-byte Folded Spill
	stp	x29, x30, [sp, #80]             ; 16-byte Folded Spill
	; InlineAsm Start
	mov	x0, #42                         ; =0x2a
	; InlineAsm End
	mov	w0, w8
	bl	_bar
	ldp	x29, x30, [sp, #80]             ; 16-byte Folded Reload
	ldp	x20, x19, [sp, #64]             ; 16-byte Folded Reload
	ldp	x22, x21, [sp, #48]             ; 16-byte Folded Reload
	ldp	x24, x23, [sp, #32]             ; 16-byte Folded Reload
	ldp	x26, x25, [sp, #16]             ; 16-byte Folded Reload
	ldp	x28, x27, [sp], #96             ; 16-byte Folded Reload
	ret
                                        ; -- End function
.subsections_via_symbols

But I noticed that LR/FP is repeatedly saved in the helper function on Linux target because the lowering pass was assuming that LR would always be at an even index (0-based) of the CSRs list, which is broken by the odd CSRs cases. I decide to bail for the "odd index" cases (I don't think Swift is that popular in Android world).

So, the subsequent instruction str x21, [sp, #8] in the function body seems overlapped with the latter.

I believe that str x21, [sp, #8] writes x21 into the empty slot (where xzr was "saved") so the overlap is benign. I debugged around related code and found that computeFreeStackSlots (in PrologEpilogInserter.cpp) tries to put frame objects into the empty slots created by uneven CSRs. Anyway I decided to implement str/ldr in the lowering pass instead of storing/loading xzr.

@kyulee-com
Copy link
Contributor

kyulee-com commented Oct 16, 2023

Thanks for clearing this up! Overall it looks good to me modulo minor comments.

@nocchijiang
Copy link
Contributor Author

Fixed the typos and a bad test expectation.

Copy link
Contributor

@kyulee-com kyulee-com left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nocchijiang
Copy link
Contributor Author

@kyulee-com Could you please merge the PR for me since I don't have commit access? Or do you think it is necessary for someone else to review it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants