
Conversation


@charithaintc charithaintc commented Sep 8, 2025

Add support for distributing the vector.multi_reduction operation across lanes in a warp. Currently only 2D to 1D reductions are supported. Given layouts for the source and accumulator vectors,

  • If the reduction dimension is distributed across lanes, the reduction is non-lane-local and is performed using warp shuffles. Here we simply rewrite the MultiDimReductionOp to a sequence of ReductionOps inside the warp op body; the actual distribution is done by the WarpOpReduction pattern.
  • If the reduction dimension is not distributed across lanes, the reduction is lane-local. In this case, we yield the source and accumulator vectors from the warp op and perform the lane-local reduction outside the warp op using a sequence of ReductionOps.

The PR also adds support for distributing vector.shape_cast based on layouts.
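The shuffle-based path (the first case above) can be illustrated with a small simulation. This is a conceptual sketch only, assuming a 16-lane warp and a sum reduction; the helper name `warp_reduce_sum` and the xor-butterfly schedule are illustrative stand-ins, not the actual WarpOpReduction lowering:

```python
# Conceptual simulation of a shuffle-based (non-lane-local) warp reduction.
# Each lane starts with one partial value; after log2(n) xor-shuffle rounds,
# every lane holds the full reduction result.
def warp_reduce_sum(lane_vals):
    vals = list(lane_vals)
    n = len(vals)  # warp size, assumed a power of two
    offset = n // 2
    while offset:
        # Models a shuffle_xor: lane i reads the value held by lane i ^ offset.
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals

print(warp_reduce_sum(range(16)))  # every lane ends with sum(0..15) = 120
```

In the lane-local case, by contrast, no shuffles are needed: each lane simply reduces the elements it privately owns after distribution.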

@charithaintc (Contributor, Author) commented:

@adam-smnk We decided to move the multi reduction distribution inside XeGPU subgroup distribution after some discussion. The reason is that certain cases of reduction require accessing the layouts of the reduction source, which upstream vector distribution does not allow us to do. I will close #154438.

@charithaintc charithaintc changed the title [mlir][xegpu] Add support for vector.multi_reduction SIMT distribution. [mlir][xegpu] Add support for vector.multi_reduction and vector.shape_cast SIMT distribution. Sep 8, 2025
// dimensions are not distributed.
- unsigned distributionStart = originalType.getRank() - laneLayout.size();
+ unsigned distributionStart =
+     originalType.getRank() - effectiveLaneLayout.size();
Contributor:

Should we assert `originalType.getRank() == effectiveLaneLayout.size()` here?

Contributor (Author) replied:

I think the caller should take care of that. This function simply distributes the innermost laneLayout.size() dimensions.
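The behavior described above can be sketched as follows. `distribute_shape` is a hypothetical Python stand-in for the shape logic of `getDistVectTypeBasedOnLaneLayout`, written purely for illustration:

```python
# Hypothetical sketch: distribute the innermost len(lane_layout) dimensions
# of a vector shape across lanes; outer dimensions are left untouched.
def distribute_shape(shape, lane_layout):
    start = len(shape) - len(lane_layout)
    out = list(shape)
    for i in range(start, len(shape)):
        lanes = lane_layout[i - start]
        # Check if the dimension can be distributed evenly.
        if out[i] % lanes != 0:
            return None  # caller (the pattern) must handle the failure
        out[i] //= lanes
    return out

print(distribute_shape([4, 32], [1, 16]))  # [4, 2]
print(distribute_shape([1, 2], [16, 1]))   # None: 1 is not divisible by 16
```

The second call models the unit-dim case discussed below: a dim of size 1 cannot be divided across 16 lanes, so the helper signals failure rather than asserting.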


// Check if the dimension can be distributed evenly.
- if (dim % laneLayout[i - distributionStart] != 0)
+ if (dim % effectiveLaneLayout[i - distributionStart] != 0)
Contributor:

How is this handled when the dim size is 1, e.g. as the result of a shape_cast?

Contributor (Author) replied:

Calling this function for such cases will result in failure. The caller (i.e., the pattern) should handle the error and decide how to proceed.

@adam-smnk (Contributor) commented:

@charithaintc Just as a side note, couldn't the shape_cast part of that PR still go through? It looked largely complete.
Either way, it's fine to have it here too if it's better, easier, etc.


charithaintc commented Sep 9, 2025

> @charithaintc Just as a side note, couldn't shape_cast part of that PR still go through? It looked largely complete. Either way, it's fine to have it here too if it's better, easier etc..

We found that shape_cast also needs to access xegpu layouts in most cases, so we cannot rely on the vector distribution infra (plus the pattern there is naive and does not fit our use cases). The shape_cast pattern here has a high pattern benefit, so it should go first. I think we will have to adopt this approach until we have something working.

if (!sourceLayout)
  return rewriter.notifyMatchFailure(
      warpOp, "the source of shape_cast op lacks distribution layout");
FailureOr<VectorType> sourceDistTypeOrFailure =
Contributor commented:

How does getDistVectTypeBasedOnLaneLayout work for layout_in0? Could you please add the following as a test case and handle it?

layout_in0 = #xegpu.layout<lane_layout = [16, 1], lane_data = [1, 1]>
%res0 = vector.shape_cast %in0 {layout_res0 = #xegpu.slice<#xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>, dims = [0]> }
: vector<1x2xf32> to vector<2xf32>

Or maybe we should limit shape_cast to only support casting a shorter vector (with the slice attribute) to a wider vector (with the parent attribute)?

In any case, we should check that inputLayout is a slice of resultLayout, or the reverse if we allow shape_cast to a narrower vector.

Contributor (Author) replied:

In this code example, getDistVectTypeBasedOnLaneLayout will return a failure: 1x2 is not distributable with lane layout [16, 1].

I added a check with isSliceOf. Please have a look.

I am also planning to add a few more test cases.
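The slice relation discussed above can be modeled conceptually: slicing a parent lane layout along `dims` drops those dimensions, so for a rank-increasing shape_cast the lower-rank layout should be recoverable from the higher-rank one this way. The helper below is an illustrative model of that idea, not the actual xegpu `isSliceOf` implementation:

```python
# Illustrative model: a slice layout keeps only the parent's lane-layout
# entries whose dimensions are NOT listed in dims.
def slice_lane_layout(parent_lane_layout, dims):
    dropped = set(dims)
    return [d for i, d in enumerate(parent_lane_layout) if i not in dropped]

# e.g. #xegpu.slice<#xegpu.layout<lane_layout = [1, 16], ...>, dims = [0]>
# would correspond to a 1-D lane layout of [16].
print(slice_lane_layout([1, 16], [0]))  # [16]
```

Under this model, the earlier `layout_in0 = [16, 1]` example fails both checks: it is neither distributable over a 1x2 source nor a parent whose dims=[0] slice matches the [16] result layout.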

@Jianhui-Li (Contributor) left a comment:

LGTM

@charithaintc charithaintc merged commit 9b0d7dd into llvm:main Sep 12, 2025
7 of 9 checks passed
return rewriter.notifyMatchFailure(
    warpOp, "shape_cast is rank increasing but result layout is not a "
            "slice of source layout");

@nbpatel (Contributor) commented Sep 12, 2025:

The shape_cast pattern needs to check that only unit dims can be squeezed/expanded.
