[NVPTX] Lower 16xi8 and 8xi8 stores efficiently #73646

bondhugula · 2023-11-28T13:54:40Z

Lower 16xi8 vector stores in NVPTX ISel efficiently, using a single
st.v4.b32 instead of multiple st.v4.u8, along the lines of what is already
done for vector loads and 8xf16. Similarly, lower 8xi8 stores using st.v2.u32.
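
To make the intended transformation concrete, here is a minimal standalone sketch of the repacking idea: the bytes of a 16xi8 value are reinterpreted as four 32-bit words, so that one 4-wide word store can replace several byte-vector stores. This is not the code from the patch. `packBytesToWords` and `packExample` are hypothetical names, and the little-endian byte order is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of LLVM): pack 4*Words bytes into Words
// 32-bit words, with byte 0 of each group in the least significant position
// (little-endian layout is assumed here).
template <std::size_t Words>
std::array<std::uint32_t, Words>
packBytesToWords(const std::array<std::uint8_t, 4 * Words> &bytes) {
  std::array<std::uint32_t, Words> words{};
  for (std::size_t w = 0; w < Words; ++w)
    for (std::size_t b = 0; b < 4; ++b)
      words[w] |= static_cast<std::uint32_t>(bytes[4 * w + b]) << (8 * b);
  return words;
}

// Usage example: the sixteen bytes 0..15 become four words, which is the
// shape a 4 x 32-bit vector store would write.
inline void packExample() {
  std::array<std::uint8_t, 16> bytes{};
  for (std::size_t i = 0; i < bytes.size(); ++i)
    bytes[i] = static_cast<std::uint8_t>(i);
  const auto words = packBytesToWords<4>(bytes);
  assert(words[0] == 0x03020100u);
  assert(words[3] == 0x0F0E0D0Cu);
}
```

The 8xi8 case is the same helper instantiated with `Words = 2`.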

ldrumm

Minor nits. LGTM

llvm/test/CodeGen/NVPTX/vector-stores.ll

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

Artem-B · 2023-11-29T18:23:05Z

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

+// i32 results instead of letting ReplaceLoadVector split it into smaller stores
+// during legalization. This is done at dag-combine time, so that vector
+// operations with i8 elements can be optimised away instead of being needlessly
+// split during legalization, which involves storing to the stack and loading it


Nice. Legalizer assuming that stack loads/stores are cheap is indeed a rather bad misoptimization for NVPTX.

Note that this comment might be out of date: it looks like it was copied from PerformLOADCombine, which was written before the stack optimizations were done.

steven-johnson · 2023-12-05T19:07:30Z

This seems to have injected failures into Halide codegen; we are now getting runtime errors of the form CUDA_ERROR_MISALIGNED_ADDRESS for cuMemcpyDtoH() where we didn't before. It appears we are now emitting an aligned store instruction where we previously emitted an unaligned one. Can we get a revert of this pending further investigation, please?

steven-johnson · 2023-12-05T19:09:55Z

llvm/test/CodeGen/NVPTX/vector-stores.ll

+; CHECK-LABEL: .visible .func v8i8_store
+define void @v8i8_store(ptr %a, <8 x i8> %v) {
+  ; CHECK: st.v2.u32
+  store <8 x i8> %v, ptr %a


This is only correct if the pointer is aligned to a 4-byte boundary (IIUC), but AFAIK nothing in the IR up to this point promises that alignment.

You're right. Loads/stores that use larger types must be aligned appropriately.

We do use allowsMemoryAccessForAlignment in other places.

In that case, we should revert it if a fix-forward is not imminent (this is breaking all of Halide's CUDA tests).
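
For readers following the thread, the missing guard can be modelled outside of LLVM roughly as follows. This is only a sketch under stated assumptions, not the eventual fix: `canWidenByteVectorStore` is a hypothetical name, and it does not reproduce the `allowsMemoryAccessForAlignment` hook mentioned above. The thread talks about a 4-byte boundary; the sketch conservatively assumes that a vector store needs alignment to its full access size (8 bytes for 8xi8, 16 bytes for 16xi8), which should be checked against the PTX ISA.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (not LLVM API): decide whether a store of `numBytes`
// i8 elements with a known alignment of `alignBytes` may be rewritten as a
// single store of 32-bit words.
// Assumption of this sketch: the widened store must be aligned to the whole
// access size, which is stricter than the 4-byte figure quoted in the thread.
inline bool canWidenByteVectorStore(unsigned numBytes,
                                    std::uint64_t alignBytes) {
  // Alignments are expected to be non-zero powers of two.
  assert(alignBytes != 0 && (alignBytes & (alignBytes - 1)) == 0);
  // Only the two shapes discussed in this PR are considered: 8xi8 and 16xi8.
  if (numBytes != 8 && numBytes != 16)
    return false;
  // Under-aligned stores keep the existing byte-wise lowering.
  return alignBytes >= numBytes;
}
```

With this model, a 16xi8 store known to be 16-byte aligned is widened, while the same store with alignment 1 (the kind of case that appears to have tripped the Halide tests) is left alone.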

This reverts commit 173fcf7. Needs to constrain the optimization to properly aligned loads/stores only. llvm#73646 (comment)

pasaulais

LGTM once the alignment issue is addressed

bondhugula requested review from Artem-B and pasaulais November 28, 2023 13:54

bondhugula mentioned this pull request Nov 28, 2023

[NVPTX] Preserve v16i8 vector loads when legalizing #67322

Closed

bondhugula force-pushed the uday/nvptx_v16i8_vector_store branch 2 times, most recently from 180ee21 to 9d747dd Compare November 28, 2023 14:14

ldrumm approved these changes Nov 28, 2023


bondhugula force-pushed the uday/nvptx_v16i8_vector_store branch 2 times, most recently from 6ab9db7 to 298e563 Compare November 29, 2023 06:14

Artem-B approved these changes Nov 29, 2023

[NVPTX] Lower 16xi8 and 8xi8 stores efficiently

c197301

bondhugula force-pushed the uday/nvptx_v16i8_vector_store branch from 298e563 to c197301 Compare November 30, 2023 02:26

bondhugula merged commit 173fcf7 into llvm:main Dec 1, 2023
3 checks passed

steven-johnson reviewed Dec 5, 2023

Artem-B mentioned this pull request Dec 5, 2023

Revert "[NVPTX] Lower 16xi8 and 8xi8 stores efficiently (#73646)" #74518

Merged

Artem-B added a commit that referenced this pull request Dec 6, 2023

Revert "[NVPTX] Lower 16xi8 and 8xi8 stores efficiently (#73646)" (#7…

a2d3bb1

…4518) This reverts commit 173fcf7. We need to constrain the optimization to properly aligned loads/stores only. #73646 (comment)

pasaulais reviewed Dec 13, 2023
