Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVPTX] Improve lowering of v4i8 #67866

Merged
merged 13 commits into from
Oct 9, 2023
Merged

[NVPTX] Improve lowering of v4i8 #67866

merged 13 commits into from
Oct 9, 2023

Conversation

Artem-B
Copy link
Member

@Artem-B Artem-B commented Sep 29, 2023

Make it a legal type and plumb through lowering of relevant instructions.

Make it a legal type and plumb through lowering of relevant instructions.
Copy link
Contributor

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually trying out the patch in triton it causes some invalid ptx to get emitted (see comment)

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td Outdated Show resolved Hide resolved
Verified that NVPTX tests pass with ptxas being able to compiler PTX produced by
llc tests.
To make things work consisstently for v4i8, we need to implement other vector
ops.
@github-actions
Copy link

github-actions bot commented Oct 5, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@Artem-B
Copy link
Member Author

Artem-B commented Oct 5, 2023

I still need to test it on live code, though we do not have much code that would end up using v4i8.

The generated PTX checked in llvm/test/CodeGen/NVPTX/i8x4-instructions.ll could use extra scrutiny.

@ThomasRaoux
Copy link
Contributor

I ran the patch on our triton kernels and I don't see any functional problems left.

Copy link
Contributor

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it required quite a lot of cases to be handled :(
Thanks for doing this, it solves some of the problems triton had with latest LLVM. Changes look good to me.

llvm/test/CodeGen/NVPTX/i8x4-instructions.ll Show resolved Hide resolved
@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

I see one suspicious failure in tensorflow tests. I suspect I've messed something up in v4i8 comparison.

@Artem-B Artem-B requested a review from d0k October 6, 2023 17:43
@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

I see one suspicious failure in tensorflow tests. I suspect I've messed something up in v4i8 comparison.

Yup, there is a problem:

Successfully custom legalized node
 ... replacing: t10: v4i8 = BUILD_VECTOR Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>
     with:      t17: v4i8 = bitcast Constant:i32<-128>

@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

I see one suspicious failure in tensorflow tests. I suspect I've messed something up in v4i8 comparison.

Yup, there is a problem:

Successfully custom legalized node
 ... replacing: t10: v4i8 = BUILD_VECTOR Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>
     with:      t17: v4i8 = bitcast Constant:i32<-128>

Resolved by 9821e90

@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

Found another issue. We merge four independent byte loads with align 1 into a 32-bit load, which fails at runtime on misaligned pointers.

%t0 = type { [17 x i8] }

@shared_storage = linkonce_odr local_unnamed_addr addrspace(3) global %t0 undef, align 1

define <4 x i8> @in_v4i8(<4 x i8> %x, <4 x i8> %y) nounwind {
  %v = load <4 x i8>, ptr getelementptr inbounds (i8, ptr addrspacecast (ptr addrspace(3) @shared_storage to ptr), i64 9), align 1
  ret <4 x i8> %v
}
        mov.u64         %rd1, shared_storage;
        cvta.shared.u64         %rd2, %rd1;
        ld.u32  %r1, [%rd2+9];
        st.param.b32    [func_retval0+0], %r1;
        ret;

@Artem-B
Copy link
Member Author

Artem-B commented Oct 9, 2023

clang-format failure on GitHub is weird -- it just silently exits with an error.
I ran the same command locally and fixed one place it was not happy about.

The buildkite failure somewhere in RISC-V appears to be unrelated.

@Artem-B Artem-B merged commit cbafb6f into llvm:main Oct 9, 2023
2 of 3 checks passed
@abadams
Copy link

abadams commented Oct 15, 2023

I believe this may be causing failures for u/srem. See #69124

Edit: Also causing failures when you sign-extend the result of a <4 x i8> comparison.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants