[NVPTX] Lower 16xi8 and 8xi8 stores efficiently #73646
180ee21 to 9d747dd
Minor nits. LGTM
6ab9db7 to 298e563
// i32 results instead of letting ReplaceLoadVector split it into smaller stores
// during legalization. This is done at dag-combine time, so that vector
// operations with i8 elements can be optimised away instead of being needlessly
// split during legalization, which involves storing to the stack and loading it
Nice. Legalizer assuming that stack loads/stores are cheap is indeed a rather bad misoptimization for NVPTX.
Note that this comment might be out of date: it looks copied from PerformLOADCombine, which was written before the stack optimizations were done.
Lower 16xi8 vector stores in NVPTX ISel efficiently using st.v4.b32 instead of multiple st.v4.u8, along the lines of vector loads and 8xf16. Similarly, lower 8xi8 stores using st.v2.u32.
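To make the idea concrete, here is a minimal, self-contained sketch of what "storing 16 i8 elements as four 32-bit words" means at the data level. It is not the PR's code and does not use any LLVM API: `packBytesToWords` is a hypothetical helper, and the little-endian byte order (lowest-addressed byte in the least significant bits) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, for illustration only (not part of LLVM).
// Packs groups of four bytes into 32-bit words. Byte i of each group lands in
// bits [8*i, 8*i+7], so writing the words to little-endian memory reproduces
// the original byte sequence.
std::vector<std::uint32_t> packBytesToWords(const std::vector<std::uint8_t> &bytes) {
  assert(bytes.size() % 4 == 0 && "byte count must be a multiple of 4");
  std::vector<std::uint32_t> words(bytes.size() / 4, 0);
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    // Shift each byte into its lane within the containing word.
    words[i / 4] |= static_cast<std::uint32_t>(bytes[i]) << (8 * (i % 4));
  }
  return words;
}
```

With 16 input bytes this yields four words, which corresponds to one four-element 32-bit vector store rather than several byte-vector stores; 8 input bytes yield two words.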
298e563 to c197301
This seems to have injected failures into Halide codegen; we are now getting runtime errors of the form
; CHECK-LABEL: .visible .func v8i8_store
define void @v8i8_store(ptr %a, <8 x i8> %v) {
; CHECK: st.v2.u32
store <8 x i8> %v, ptr %a
This is only correct if the pointer is aligned to a 4-byte boundary (IIUC), but AFAIK nothing in the IR up to this point promises that alignment.
You're right. Loads/stores that use larger types must be aligned appropriately.
We do use allowsMemoryAccessForAlignment in other places.
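As a rough illustration of the kind of guard being discussed, the sketch below gates the widening on the alignment that is actually known for the pointer. Everything here is an assumption for illustration: `canWidenI8Store` is a hypothetical stand-alone predicate, not the LLVM hook named above, and the required alignment is passed in as a parameter rather than asserted, since the exact requirement for the widened store should be taken from the target's rules.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate, for illustration only (not an LLVM API).
// Returns true when a store of `numBytes` i8 elements may be rewritten as a
// vector of 32-bit words, given the alignment known for the pointer.
// `requiredAlign` is supplied by the caller; this sketch does not decide what
// the target actually demands.
bool canWidenI8Store(unsigned numBytes, std::uint64_t knownAlign,
                     std::uint64_t requiredAlign) {
  assert(requiredAlign != 0 && "required alignment must be nonzero");
  // Only the two shapes from the PR title are considered: 8xi8 and 16xi8.
  const bool supportedWidth = (numBytes == 8 || numBytes == 16);
  // An unknown (zero) alignment never qualifies.
  const bool alignedEnough = knownAlign != 0 && knownAlign % requiredAlign == 0;
  return supportedWidth && alignedEnough;
}
```

A store such as the one in the `v8i8_store` test, where the pointer's known alignment is smaller than what the widened access needs, would be rejected by this check and left to the existing lowering.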
In that case, we should revert it if a fix-forward is not imminent (this is breaking all of Halide's CUDA tests).
This reverts commit 173fcf7. Needs to constrain the optimization to properly aligned loads/stores only. llvm#73646 (comment)
…4518) This reverts commit 173fcf7. We need to constrain the optimization to properly aligned loads/stores only. #73646 (comment)
LGTM once the alignment issue is addressed