[MLIR][XeGPU] Introduce xegpu::uArch usage in target-sensitive passes
#163801
@@ -43,7 +43,12 @@ def XeGPUPropagateLayout : Pass<"xegpu-propagate-layout"> {
  let options = [Option<
    "printOnly", "print-analysis-only", "bool",
    /*default=*/"false",
    "Print the result of layout propagation analysis and exit.">];
    "Print the result of layout propagation analysis and exit.">,
    Option<
      "assumeUnrolled", "assume-unrolled", "bool",
|
Review comment: Can this option be an enumeration, so the propagation could be applied to "lane", "inst", and "subgroup" parameters? A higher level implies the lower levels will be propagated, so "assumeUnrolled = true" could be replaced with "level = lane" here, and the options become more extensible.
Reply: Not an enum, but a string (a rough sketch of such a string-valued option follows this hunk).
Reply: At this point, yes, we expect the user to set sg_layout/sg_data.
      /*default=*/"false",
      "If the input IR has SG-sized tiles matching instruction sizes, omit `inst_data`.">
  ];
}
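
If the boolean were replaced by the string-valued level discussed in the review, the option might look roughly like the sketch below. This is illustrative only: the option name, the C++ variable name, the default, and the description text are assumptions, not part of this patch.

    Option<
      "layoutLevel", "level", "std::string",
      /*default=*/"\"inst\"",
      "Lowest layout component to propagate: one of `subgroup`, `inst`, or `lane`.">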
|
|
def XeGPUWgToSgDistribute : Pass<"xegpu-wg-to-sg-distribute"> {

@@ -23,8 +23,6 @@
#include <map>
#include <string>

#define DEBUG_TYPE "xegpu-uarch"

using namespace mlir;
using namespace mlir::xegpu::uArch;
@@ -42,12 +40,61 @@ struct Xe2Plus : public uArch {
                   &instrs = {})
      : uArch(archName, archDescription, regInfo, cacheInfo, instrs),
        xeCore(xeCore) {}
  int getSubgroupSize() const override { return 16; }
  unsigned getPackedFormatBitSizeGatherScatter() const override { return 32; }
  unsigned getPackedFormatBitSize() const override { return 16; }
|
Review comment: Is getPackedFormatBitSize really getPackedFormatBitSizeDpasA?
Reply: Yes, it will be renamed. I also think it should be a member of the dpas instruction per uArch instance. We might want to split this PR into two parts to allow a substantial discussion of each: (1) the uArch modification and (2) the propagation option and uArch application in the passes.
Review comment: For C, is it the same as B (32)?
Reply: For C, the result is f32, so no packing is needed.
Review comment: For a generic lane data calculation for dpas operands, wouldn't the following format be desired in the dpas propagation? It is not so much about whether we actually consider "packing" C. (An illustrative sketch of one such calculation follows the struct below.)
  std::optional<unsigned> getPackedFormatBitSizeDpasB() const override {
    return 32;
  }
};
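
To make the packing discussion above concrete, here is a minimal sketch of how a per-lane packing factor for a DPAS operand could be derived from the packed-format bit sizes exposed by the uArch. Everything in it is illustrative: the helper name, the DpasOperand enum, and the convention that the inner lane_data equals packedBits / elemBits are assumptions, not necessarily what the propagation pass in this PR does.

    #include <algorithm>
    #include <cstdint>

    // Illustrative only: derive a per-lane packing factor for a DPAS operand
    // from the uArch packed-format bit size and the operand element bit width.
    enum class DpasOperand { A, B, C };

    inline int64_t dpasOperandLaneData(DpasOperand opnd, unsigned packedBits,
                                       unsigned elemBits) {
      // The accumulator/result is f32, so no packing is applied.
      if (opnd == DpasOperand::C)
        return 1;
      // A and B pack as many elements as fit into the packed format width,
      // e.g. 16-bit packing with f16 A elements gives 1, while 32-bit packing
      // with f16/bf16 B elements gives 2.
      return std::max<int64_t>(1, packedBits / elemBits);
    }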
|
|
//===----------------------------------------------------------------------===//
// uArch instructions
//===----------------------------------------------------------------------===//
struct StoreNdInstruction : public Instruction {
  StoreNdInstruction()
      : Instruction(InstructionKind::STORE_ND, InstructionScope::Subgroup) {}

  // Source:
  // https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroups.html#_add_a_new_section_6_13_x_sub_group_read_and_write_functions
  // Writes 1, 2, 4, or 8 uints of data for each work item in the sub-group to
  // the specified pointer.
  llvm::SmallVector<int> getSortedLaneVectorLengths() { return {1, 2, 4, 8}; }
};
|
|
struct LoadNdInstruction : public Instruction {
  LoadNdInstruction()
      : Instruction(InstructionKind::LOAD_ND, InstructionScope::Subgroup) {}

  // Source:
  // https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroups.html#_add_a_new_section_6_13_x_sub_group_read_and_write_functions
  // Reads 1, 2, 4, or 8 uints of data for each work item in the sub-group from
  // the specified pointer.
  llvm::SmallVector<int> getSortedLaneVectorLengths() { return {1, 2, 4, 8}; }
};
|
|
struct PrefetchNdInstruction : public Instruction {
  PrefetchNdInstruction()
      : Instruction(InstructionKind::PREFETCH_ND, InstructionScope::Subgroup) {}

  // Source:
  // https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_buffer_prefetch.html#_add_a_new_section_6_15_x_sub_group_prefetch_functions
  llvm::SmallVector<int> getSortedLaneVectorLengths(int elementBitwidth) {
    if (elementBitwidth == 8 || elementBitwidth == 16)
      return {1, 2, 4, 8, 16};
    else if (elementBitwidth == 32 || elementBitwidth == 64)
      return {1, 2, 4, 8};
    else
      llvm_unreachable(
          "Unsupported element bitwidth for PrefetchNdInstruction");
  }
};
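
The sorted capability lists above are presumably meant to be queried by the layout propagation code. As a hedged illustration (the helper below is not part of the patch), a consumer could pick the widest legal per-lane vector length that evenly divides the number of elements each lane has to move:

    #include "llvm/ADT/SmallVector.h"

    // Illustrative helper: choose the widest supported per-lane vector length.
    // `sortedLengths` is assumed to come from getSortedLaneVectorLengths() and
    // to be sorted in ascending order, as in the structs above.
    static int pickLaneVectorLength(const llvm::SmallVector<int> &sortedLengths,
                                    int elementsPerLane) {
      int best = 1;
      for (int len : sortedLengths)
        if (len <= elementsPerLane && elementsPerLane % len == 0)
          best = len; // ascending order, so the last match is the widest
      return best;
    }

    // For example, with {1, 2, 4, 8} and 16 elements per lane this returns 8.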
|
|
// struct to represent DPAS instruction
struct DPASInstruction : public Instruction, public MMAInstructionInterface {
  DPASInstruction()
      : Instruction(InstructionKind::DPAS, InstructionScope::Subgroup) {}
  // Source:
  // https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html

  // Override all virtuals from MMAInstructionInterface
  virtual llvm::SmallVector<std::pair<uint32_t, uint32_t>, 16>

@@ -72,6 +119,9 @@ struct DPASInstruction : public Instruction, public MMAInstructionInterface {
  virtual llvm::SmallVector<uint32_t, 8> getSupportedN(Type type) override;
};
|
|
//===----------------------------------------------------------------------===//
// uArch instructions
//===----------------------------------------------------------------------===//
struct PVCuArch : public Xe2Plus {
  // Maintains ownership of the instructions owned by PVCuArch
  llvm::SmallVector<std::shared_ptr<Instruction>, 8> owned_instructions;
@@ -101,9 +151,15 @@ struct PVCuArch : public Xe2Plus {
        CacheInfo(512 * 1024, 64, CacheHierarchyLevel::L2));

    // Add the instructions-
    auto dpas = std::make_shared<DPASInstruction>();
    instructions.emplace(dpas->getInstructionKind(), dpas);
    owned_instructions.push_back(dpas);
    llvm::SmallVector<std::shared_ptr<Instruction>> instructionsToAdd{
|
Review comment: nit: formatting.
        std::make_shared<DPASInstruction>(),
        std::make_shared<StoreNdInstruction>(),
        std::make_shared<LoadNdInstruction>(),
        std::make_shared<PrefetchNdInstruction>()};
    for (auto &inst : instructionsToAdd) {
      instructions.emplace(inst->getInstructionKind(), inst);
      owned_instructions.push_back(inst);
    }
  }
};
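
For orientation only, a consumer of a populated uArch might check the registered instructions roughly as sketched below; the getInstructions() accessor is hypothetical, since this hunk only shows the kind-to-instruction map being filled, not how it is exposed.

    // Hypothetical sketch: query whether a uArch instance registered a given
    // instruction kind. getInstructions() is an assumed accessor over the
    // `instructions` map populated in the constructors above.
    static bool supportsInstruction(const uArch &arch, InstructionKind kind) {
      const auto &instrMap = arch.getInstructions(); // assumed accessor
      return instrMap.count(kind) != 0;
    }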
|
|
@@ -139,10 +195,24 @@ struct BMGuArch : public Xe2Plus {
    owned_instructions.push_back(dpas);
  }
};
|
|
inline std::shared_ptr<uArch> getUArch(const std::string &archName) {
  if (archName == "pvc")
    return std::make_shared<PVCuArch>();
  else if (archName == "bmg")
    return std::make_shared<BMGuArch>();
  else
    return nullptr;
}
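
A minimal usage sketch of this factory from a target-sensitive pass; getUArch and getSubgroupSize come from this header, while the surrounding function and the chip-name string are illustrative:

    #include <memory>
    #include <string>

    // Illustrative only: resolve the uArch for a chip name and read one target
    // parameter. Unknown names return nullptr, so callers must check the result.
    static void configureForTarget(const std::string &chipName) {
      std::shared_ptr<mlir::xegpu::uArch::uArch> arch =
          mlir::xegpu::uArch::getUArch(chipName);
      if (!arch)
        return; // unsupported or unknown target
      int sgSize = arch->getSubgroupSize(); // 16 on Xe2+ (PVC, BMG)
      (void)sgSize;
    }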
|
|
} // namespace uArch
} // namespace xegpu
} // namespace mlir

//===----------------------------------------------------------------------===//
// Instruction implementations
//===----------------------------------------------------------------------===//

inline llvm::SmallVector<std::pair<uint32_t, uint32_t>, 16>
DPASInstruction::getSupportedShapes(Type dataType, MMAOpndKind matrixType) {
  auto combineVectors = [](const llvm::SmallVector<uint32_t, 8> &a,
Review comment: clean up?
Reply: I think so; the constructor was not used and its signature conflicted with the new inst_data one, so it can be removed altogether, until we need order somewhere.