
[Vectorize] Vectorization for __builtin_prefetch #66160

Open · wants to merge 1 commit into main

Conversation

m-saito-fj

Allow vectorization of loops containing __builtin_prefetch. Add a masked_prefetch intrinsic and a masked_gather_prefetch intrinsic for this purpose. Also, teach LoopVectorize to vectorize the prefetch intrinsic.
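For context, here is a minimal sketch of the kind of loop this change is meant to enable; the function and names are illustrative and not taken from the patch:

    // A loop of this shape is currently left scalar because of the
    // __builtin_prefetch call; with masked prefetch intrinsics the whole
    // body, including the prefetch, can be vectorized.
    void axpy_with_prefetch(double *x, const double *y, double a, int n) {
      for (int i = 0; i < n; ++i) {
        __builtin_prefetch(&x[i + 128]); // software prefetch ahead of the stream
        x[i] += a * y[i];
      }
    }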

@m-saito-fj
Author

I have created several patches that enable LoopVectorize to vectorize loops containing __builtin_prefetch on AArch64 and to generate SVE prefetch (prfb/prfh/prfw/prfd) instructions.
All Patches

This patch is one of them and mainly covers the IR part. With this patch alone, the prefetch intrinsic cannot be vectorized because isLegalMaskedPrefetch returns false by default, so I have not added a test to this patch.

Tests are attached to the other patches; if you want to refer to them, see the following links:
Test for masked prefetch
Test for masked gather prefetch

@m-saito-fj
Author

Gentle ping ...

@m-saito-fj
Author

ping.

Member

@rengolin left a comment


Some comments; I have not looked at all the code.

llvm/docs/LangRef.rst (review comment, outdated and resolved)
case Intrinsic::masked_gather_prefetch: {
const Value *Mask = Args[4];
bool VarMask = !isa<Constant>(Mask);
Align Alignment = cast<ConstantInt>(Args[1])->getAlignValue();
Member


You're not checking whether Args[1] is a ConstantInt. Is that guaranteed elsewhere?

Author


I think it is guaranteed. The masked_gather_prefetch intrinsic is created only by the IRBuilder::CreateMaskedGatherPrefetch function, and Args[1] is created as a ConstantInt there.
This is the same pattern as the masked_gather intrinsic, where Args[1] is not checked either.
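For illustration only, a defensive variant of the alignment extraction could look like the sketch below; it is not part of the patch and simply falls back to a minimal alignment if the operand were ever not a ConstantInt:

    // Hypothetical defensive form of the alignment extraction (not in the patch).
    const auto *AlignCI = dyn_cast<ConstantInt>(Args[1]);
    Align Alignment = AlignCI ? AlignCI->getAlignValue() : Align(1);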

Value *Locality,
const Twine &Name) {
auto *PtrTy = cast<PointerType>(Ptr->getType());
assert(Mask && "Mask should not be all-ones (null)");
Member


I don't get this message. I assumed all ones would be all lanes, and a null value would be all zeroes which would be no lanes (which is still a valid mask, just not useful for pre-fetching).

Author


Thanks for your comment.

I had simply done the same as masked load/masked store. If the mask is null, the mask can be considered unnecessary, so I think treating it as all ones is not a problem.

However, I agree the assert message as it was would be confusing, so I have fixed it:
assert message.

That comment is now outdated because I force-pushed; please see the link above.

auto *PtrsTy = cast<VectorType>(Ptrs->getType());
ElementCount NumElts = PtrsTy->getElementCount();

if (!Mask)
Member


Now I see where that comes from. I think this is a misleading tactic. I'd force the mask to have at least one value in it (if that's what you want) here, and not in the next call.

Author


Thanks for the comment.

I do the same as gather/scatter here.

If the mask is null, the mask can be considered unnecessary, so I think treating it as all ones is not a problem. I think it is fine as it is.
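For reference, the existing gather/scatter builders treat a null mask roughly as sketched below; assuming CreateMaskedGatherPrefetch follows the same pattern, passing no mask simply means all lanes are enabled:

    // Convention used by IRBuilderBase::CreateMaskedGather and similar builders:
    // a null Mask argument is expanded to an all-ones ("all lanes enabled") mask.
    if (!Mask)
      Mask = getAllOnesMask(NumElts);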

case Instruction::Call: {
if (!isa<PrefetchInst>(I))
return !VFDatabase::hasMaskedVariant(*(cast<CallInst>(I)), VF);
auto *Ptr = getPrefetchPointerOperand(I);
Member


Can't you implement hasMaskedVariant for prefetch?

Author


Thanks for the comment.

I don't think I can implement it. hasMaskedVariant checks whether the vector library in use provides a masked, vectorized implementation, and prefetch will never be provided by such a library because the prefetch intrinsic is lowered to a single instruction. For that reason, I did not consider using hasMaskedVariant.
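For comparison, legality of masked memory operations is queried through TargetTransformInfo hooks rather than through the vector function database; the isLegalMaskedPrefetch mentioned earlier presumably mirrors the shape of the existing hooks:

    // Existing TargetTransformInfo hooks of this shape (shown for comparison only):
    bool isLegalMaskedLoad(Type *DataType, Align Alignment) const;
    bool isLegalMaskedGather(Type *DataType, Align Alignment) const;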


github-actions bot commented Nov 30, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@m-saito-fj
Author

This patch conflicts with the main branch; it will be fixed and force-pushed after a rebase.

@m-saito-fj m-saito-fj force-pushed the vectorization-builtin-prefetch branch 4 times, most recently from 64541d2 to 6635fa0 Compare January 4, 2024 05:22
@m-saito-fj
Author

The rebase of this patch is complete.

@m-saito-fj m-saito-fj force-pushed the vectorization-builtin-prefetch branch 2 times, most recently from 3b8fb11 to 15c4966 Compare January 12, 2024 13:03
Allow vectorization of loops containing __builtin_prefetch. Add
masked_prefetch intrinsic and masked_gather_prefetch intrinsic for this
purpose. Also, add a process to vectorize prefetch intrinsic in
LoopVectorize.
@m-saito-fj m-saito-fj requested a review from fhahn January 15, 2024 04:01
@davemgreen
Collaborator

A couple of high level comments:

  • Is the main idea to make sure that we do vectorize the loop, or are the prefetches important for performance?
  • Under the assumption that prefetching works on cache lines, we could consider scalarizing the prefetches instead, so that just normal prefetches are emitted.
  • Or alternatively under the assumption that the users often put the prefetches in the wrong place, we could consider just dropping them from the vectorized loop.
  • Having said that, SVE does these instructions so it would make sense to add them.

I think it might make sense to have an initial patch that introduces the intrinsics+langref and adds the codegen support for them - both for generic targets and for SVE. Once that is in, it would make adding tests for vectorization easier.
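To make the scalarization alternative mentioned above concrete, here is an illustrative C-level view (made-up names, assuming a vectorization factor of 4): the vectorized loop keeps a single scalar prefetch per vector iteration, on the assumption that the addresses prefetched for one group usually share a cache line:

    // Illustrative only: one possible meaning of "scalarize the prefetches".
    void sum_groups_of_4(const float *a, float *out, int n) {
      float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      for (int i = 0; i + 3 < n; i += 4) {
        __builtin_prefetch(&a[i + 64]); // one scalar prefetch per vector iteration
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
      }
      *out = s0 + s1 + s2 + s3;
    }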

@m-saito-fj
Author

@davemgreen Thank you for your comment.

Is the main idea to make sure that we do vectorize the loop, or are the prefetches important for performance?

My main goal is to vectorize loops containing __builtin_prefetch. Certainly, there are several ways to vectorize a loop containing __builtin_prefetch. I implemented vectorization using the SVE instructions, but I think it would be better to also support the other approaches and provide an option to choose between them.

I think it might make sense to have an initial patch that introduces the intrinsics+langref and adds the codegen support for them - both for generic targets and for SVE. Once that is in, it would make adding tests for vectorization easier.

Implementation
The above implementation consists of three commits. The commits and the features each one adds are as follows:

  1. [Vectorize] Vectorization for __builtin_prefetch
    1.1. Add new vector prefetch intrinsic + langref
    1.2. Addition of prefetch intrinsic vectorization process to LoopVectorize
  2. [CodeGen] CodeGen for masked_prefetch and masked_gather_prefetch
    2.1. Codegen support for vector prefetch intrinsic
  3. [AArch64][Vectorize] Vectorization for __builtin_prefetch for AArch64
    3.1. Allow vectorization by LoopVectorize in AArch64
    3.2. Addition of Lowering process in CodeGen for AArch64

Does this mean it would be better to aim to merge 1.1, 2.1 and 3.2 in one patch first?

@davemgreen
Collaborator

Could we add 1.1 + 2.1 + tests for some architecture (maybe AArch64 without SVE) in one patch, with 3.1 added in a second? 1.2 + 3.2 + tests can then make up the third.

@m-saito-fj
Author

Could we add 1.1 + 2.1 + tests for some architecture (maybe AArch64 without SVE) in one patch, with 3.1 added in a second? 1.2 + 3.2 + tests can then make up the third.

Thank you for the suggestion.
I will proceed with what you suggested and will close this request after I create another pull request.
