Skip to content

feat: Support low precision#691

Open
FangRui0 wants to merge 4 commits into
hw-native-sys:mainfrom
FangRui0:support_low_precision
Open

feat: Support low precision#691
FangRui0 wants to merge 4 commits into
hw-native-sys:mainfrom
FangRui0:support_low_precision

Conversation

@FangRui0
Copy link
Copy Markdown
Contributor

@FangRui0 FangRui0 commented May 21, 2026

1、 alloc_tile tstore tload tcvt tprefetch op在a5上支持接受低精度类型
2、支持非mlir内置类型返回正确位宽

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements support for low-precision element types (FP8, FP4, etc.) on the A5 NPU architecture. It updates the PTO IR manual and adds design documentation, refactors type utility functions for storage size calculations, and enhances verifiers for tload, tstore, tprefetch, and tcvt to enforce architecture-specific constraints. Feedback highlights a bug in getPTOStorageElemByteSize that causes regressions for sub-byte types, the presence of absolute local paths in documentation, and the need to use isI8Like instead of isInteger(8) in verifiers to correctly support both signed and unsigned 8-bit integers.


unsigned mlir::pto::getPTOStorageElemByteSize(Type t) {
unsigned bitWidth = getPTOStorageElemBitWidth(t);
return bitWidth == 0 ? 0 : bitWidth / kBitsPerByte;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of getPTOStorageElemByteSize returns 0 for types with a bit width less than 8 (e.g., i1). In most NPU contexts, the smallest addressable storage unit is a byte, and sub-byte types like i1 are stored in a full byte. This causes regressions in several passes (like PTOIRTranslator and PTOPlanMemory) where a size of 0 now triggers failures or fatal errors instead of falling back to 1 byte. Using (bitWidth + kBitsPerByte - 1) / kBitsPerByte correctly calculates the storage size in bytes for all types.

Suggested change
return bitWidth == 0 ? 0 : bitWidth / kBitsPerByte;
return (bitWidth + kBitsPerByte - 1) / kBitsPerByte;

## Scope

This note records the current low-precision C++ API support status in
`/Users/fangrui/workspace/huawei/pto-isa` for the following PTO operations:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation contains absolute local file paths (e.g., /Users/fangrui/...). These should be replaced with relative paths or simply the repository name to ensure the documentation is portable and relevant for all contributors.

Suggested change
`/Users/fangrui/workspace/huawei/pto-isa` for the following PTO operations:
`.` for the following PTO operations:

Comment thread lib/PTO/IR/PTO.cpp

static bool isA5AccStorePreQuantDstType(Type srcElem, Type dstElem) {
if (srcElem.isInteger(32))
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check dstElem.isInteger(8) is too restrictive as it typically only matches signless 8-bit integers. The previous implementation used isI8Like(dstElem), which likely covers both signed and unsigned 8-bit integers (i8 and ui8). According to the manual (line 899), both i8 and ui8 should be supported here.

Suggested change
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16();
return isI8Like(dstElem) || dstElem.isF16() || dstElem.isBF16();

Comment thread lib/PTO/IR/PTO.cpp
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16();
if (!srcElem.isF32())
return false;
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16() ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the i32 source case, isInteger(8) should be replaced with isI8Like(dstElem) to correctly support both i8 and ui8 destination types as specified in the manual.

Suggested change
return dstElem.isInteger(8) || dstElem.isF16() || dstElem.isBF16() ||
return isI8Like(dstElem) || dstElem.isF16() || dstElem.isBF16() ||

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 21, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: feat: Support low precision #691 feat: Support low precision
  • Author: FangRui0
  • Base/Head: main / support_low_precision
  • Head SHA: bf044f285606
  • Trigger: PR 有新提交
  • Generated At: 2026-05-22T07:06:28Z
  • Previous Head SHA: 0c17c69c098f
  • Status: completed

Summary

检查到 2 个问题:新增的低精度 tstore preQuant 路径在 EmitC 阶段会失败,且 A5 vec tstore 没有校验低精度目的类型。

Findings

  1. P1 低精度 preQuant `tstore` 的新合法路径在 EmitC 降低时会直接失败 lib/PTO/Transforms/PTOToEmitC.cpp:4455

这个 PR 新增了 A5 acc + preQuantScalar -> hif8/f8 的合法场景,并在 test/lit/pto/tstore_low_precision_a5_valid.pto 里用 pto.declare_tile 构造了 src。但这条路径继续走到 EmitC 时,declare_tile 会先变成 declare_tile_memref,而 MemRefType 在类型转换里又被统一降成指针类型;随后这里的 PTOTStoreToTSTOREhasPreQuantScalar/reluPreMode/atomicType 任一生效时,强制要求 src/dst 都是 emitc::OpaqueType 才能拼模板参数。结果是新加的有效用例在真正生成 C++ 时会因为 src must be emitc::OpaqueType 卡住,属于确定的代码生成失败。

  1. P2 A5 vec `tstore` 只校验了源低精度类型,没有校验目的类型是否在支持集合内 lib/PTO/IR/PTO.cpp:2926

A5 vec tstore 这段现在只对 srcElem 调用了 isA5TLoadStoreTransferElemType,随后仅用字节数相等来约束 dstElem。这样一来,任何同字节宽但不在本 PR 新增支持列表里的 GM 元素类型都会被放过;例如某个 8-bit float 变体只要宽度是 1 byte,就能和低精度 tile 通过 verifier。后续要么会实例化出 PTO-ISA 不支持的 TSTORE 组合,要么退化成错误的按字节重解释,和这里新增的低精度支持约束不一致。

@FangRui0 FangRui0 force-pushed the support_low_precision branch 2 times, most recently from 4c5e02b to cc66d99 Compare May 21, 2026 07:02
@FangRui0
Copy link
Copy Markdown
Contributor Author

/review

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

  8771	  auto isAllowedDataElem = [&](mlir::Type t) -> bool {
  8772	    if (t.isF16() || t.isF32() || t.isBF16()) return true;
  8773	    if (auto it = mlir::dyn_cast<mlir::IntegerType>(t))
  8774	      return (it.getWidth() == 8 || it.getWidth() == 16 || it.getWidth() == 32);
  8775	    return false;
  8776	  };
  8777	  auto isAllowedIndexElem = [&](mlir::Type t) -> bool {
  8778	    if (auto it = mlir::dyn_cast<mlir::IntegerType>(t))
  8779	      return (it.getWidth() == 16 || it.getWidth() == 32);
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp | sed -n '1,220p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/repo succeeded in 0ms:
diff --git a/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp b/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp
index 1aad797..0f3dba4 100644
--- a/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp
+++ b/lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp
@@ -188,42 +188,44 @@ getMemrefSubViewBaseAddresses(memref::SubViewOp op, MemRefType sourceType,
     addresses.reserve(static_cast<size_t>(colSize));
     for (int64_t col = 0; col < colSize; ++col) {
       int64_t elemOffset =
           rowOffset * strides[0] + (colOffset + col) * strides[1];
       addresses.push_back(static_cast<uint64_t>(elemOffset * elemBytes));
     }
   } else {
     return std::nullopt;
   }
 
   return addresses;
 }
 
 // [辅助函数] 尝试从 Operation 中计算相对于 Source 的字节偏移量和新大小
 // 返回值: pair<offsetInBytes, sizeInBytes>
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/Transforms/PTOPlanMemory.cpp | sed -n '1,220p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/repo succeeded in 0ms:
diff --git a/lib/PTO/Transforms/PTOPlanMemory.cpp b/lib/PTO/Transforms/PTOPlanMemory.cpp
index 36ec6bb..cf6c25c 100644
--- a/lib/PTO/Transforms/PTOPlanMemory.cpp
+++ b/lib/PTO/Transforms/PTOPlanMemory.cpp
@@ -1,33 +1,35 @@
 // Copyright (c) 2026 Huawei Technologies Co., Ltd.
 // This program is free software, you can redistribute it and/or modify it under the terms and conditions of
 // CANN Open Software License Agreement Version 2.0 (the "License").
 // Please refer to the License for details. You may not use this file except in compliance with the License.
 // THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
 // INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
 // See LICENSE in the root of the software repository for the full text of the License.
 
 //===- PlanMemory.cpp ----Plan Buffer Memory Address ----------------------===//
 //===----------------------------------------------------------------------===//
 
 #include "PTOPlanMemory.h"
 
+#include "PTO/IR/PTOTypeUtils.h"
+
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/Transforms/PTOToEmitC.cpp | sed -n '1,260p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/repo succeeded in 0ms:
diff --git a/lib/PTO/Transforms/PTOToEmitC.cpp b/lib/PTO/Transforms/PTOToEmitC.cpp
index 6228c78..6fb8bfa 100644
--- a/lib/PTO/Transforms/PTOToEmitC.cpp
+++ b/lib/PTO/Transforms/PTOToEmitC.cpp
@@ -3364,108 +3364,82 @@ struct SubviewToEmitCPattern : public OpConversionPattern<memref::SubViewOp> {
     gtConstructorArgs.push_back(strideInstOp.getResult(0)); // 拿到 stride_inst 的 SSA Value
 
     rewriter.replaceOpWithNewOp<emitc::CallOpaqueOp>(
         op, 
         gtType, 
         gtCppType,
         /*args=*/ArrayAttr{}, 
         /*templateArgs=*/ArrayAttr{},
         /*operands=*/ValueRange(gtConstructorArgs)
     );
 
     return success();
   }
 };
 
ERROR: Selected model is at capacity. Please try a different model.
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260521_150503_pr691/codex_last_message.json
tokens used
83,078
===== END STAGE codex-review rc=1 @ 2026-05-21 15:07:33 =====

@FangRui0
Copy link
Copy Markdown
Contributor Author

/review

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

     if (!srcElem || !tmpElem || !dstElem || srcElem != dstElem || srcElem != tmpElem)
       return emitOpError() << "expects src, tmp, and dst to have the same element type";
-    unsigned elemBytes = srcElem.getIntOrFloatBitWidth() / 8;
+    unsigned elemBytes = getPTOStorageElemByteSize(srcElem);
+    if (elemBytes == 0)
+      return emitOpError() << "failed to get transpose element size";
     if (elemBytes != 1 && elemBytes != 2 && elemBytes != 4)
       return emitOpError() << "expects transpose element size to be 1, 2, or 4 bytes";
     auto isAllowedWidthType = [&](Type ty) {
       if (elemBytes == 4)
         return ty.isInteger(32) || ty.isF32();
       if (elemBytes == 2)
         return ty.isInteger(16) || ty.isF16() || ty.isBF16();
       return ty.isInteger(8);
     };
     if (!isAllowedWidthType(srcElem))
       return emitOpError() << "expects transpose element type to match the supported set for its width";
     auto checkAlignedMajor = [&](Type ty, StringRef name) -> LogicalResult {
       auto tb = mlir::dyn_cast<pto::TileBufType>(ty);
       if (!tb)
exec
/bin/bash -lc "git diff --unified=20 origin/main...HEAD -- lib/PTO/IR/PTO.cpp | sed -n '1115,1165p'" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 0ms:
-  }
+  int64_t elemBytes = static_cast<int64_t>(getElemByteSize(elemTy));
   if (elemBytes <= 0) return failure();
 
   if (fr == 1024) {
     innerRows = 16;
     innerCols = 16;
     return success();
   }
   if (fr == 32) {
     innerRows = 16;
     innerCols = 2;
     return success();
   }
   if (fr == 512) {
     if (sl == 1) {
       innerRows = 16;
       innerCols = 32 / elemBytes;
       return success();
     }
exec
/bin/bash -lc "grep -n \"verifyTFillPadLike\\|TFillPadOp::verify\\|TExtractOp::verify\\|TPadOp::verify\" -n lib/PTO/IR/PTO.cpp" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 0ms:
5050:mlir::LogicalResult mlir::pto::TExtractOp::verify() {
5552:static mlir::LogicalResult verifyTFillPadLike(Operation *op, Type srcTy, Type dstTy,
5646:mlir::LogicalResult mlir::pto::TFillPadOp::verify() {
5647:  return verifyTFillPadLike(getOperation(), getSrc().getType(), getDst().getType(),
5652:  return verifyTFillPadLike(getOperation(), getSrc().getType(), getDst().getType(),
5657:  return verifyTFillPadLike(getOperation(), getSrc().getType(), getDst().getType(),
exec
/bin/bash -lc "sed -n '5552,5675p' lib/PTO/IR/PTO.cpp" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 0ms:
static mlir::LogicalResult verifyTFillPadLike(Operation *op, Type srcTy, Type dstTy,
                                              bool allowDstExpand,
                                              llvm::StringRef opName) {
  if (!isPTOShapedLike(srcTy) || !isPTOShapedLike(dstTy))
    return op->emitError("expects src/dst to be PTO shaped-like types");

  auto srcShape = getShapeVec(srcTy);
  auto dstShape = getShapeVec(dstTy);
  if (srcShape.size() != 2 || dstShape.size() != 2)
    return op->emitError("expects rank-2 shaped types for src/dst");

  auto srcElem = getElemTy(srcTy);
  auto dstElem = getElemTy(dstTy);

  auto getElemBytes = [](mlir::Type t) -> int64_t {
    unsigned elemBytes = getPTOStorageElemByteSize(t);
    return elemBytes == 0 ? -1 : static_cast<int64_t>(elemBytes);
  };

  int64_t srcB = getElemBytes(srcElem);
exec
/bin/bash -lc "grep -R \"tfillpad\\|TFILLPAD\" -n docs include/PTO/IR/PTOOps.td test/lit/pto | head -120" in /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/repo succeeded in 1ms:

ERROR: Selected model is at capacity. Please try a different model.
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260521_153004_pr691/codex_last_message.json
tokens used
900,991
===== END STAGE codex-review rc=1 @ 2026-05-21 15:45:53 =====

@FangRui0 FangRui0 force-pushed the support_low_precision branch from cc66d99 to 417e9d7 Compare May 21, 2026 11:49
@FangRui0
Copy link
Copy Markdown
Contributor Author

/review

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre
Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

Summary

PR #691 has three real issues: FP4 tcvt shape validation is inconsistent with packed-dimension semantics, the new memref element interface is registered twice, and memref-form tprefetch skips the new arch/type checks.

Findings

  1. P1 FP4 `tcvt` accepts packed shapes that lower to mismatched tile extents lib/PTO/IR/PTO.cpp:4851

TCvtOp::verify still treats src/dst as shape-compatible only when their raw rows/cols are equal. That is wrong for the new packed FP4 types, because this PR also documents !pto.f4E1M2x2/!pto.f4E2M1x2 dimensions as packed extents rather than logical scalar extents. The mismatch is visible in lowering: EmitC expands the packed FP4 dimension by 2 (renderTileTemplateDim), so a verifier-accepted case like bf16 16x16 -> !pto.f4E1M2x2 16x16 lowers to TCVT between different tile extents (16x16 vs 16x32). That is a high-probability wrong-code / no-overload situation for the main feature added in this PR.

  1. P2 Dialect initialization registers `MemRefElementTypeInterface` twice for the new types lib/PTO/IR/PTO.cpp:2049

The three new low-precision types already declare MemRefElementTypeInterface in ODS, so the generated type classes register that interface statically. PTODialect::initialize() then attaches external MemRefElementTypeInterface models for the same types a second time. In MLIR this interface registration is expected to happen once per type; assertion-enabled builds typically abort during dialect initialization on the duplicate registration. That makes ptoas fail before parsing any IR in Debug/ASan-style CI configurations.

  1. P2 Memref-form `tprefetch` no longer enforces the new low-precision arch checks lib/PTO/IR/PTO.cpp:2404

TPrefetchOp::verify now routes through dispatchVerifierByArch, but dispatchVerifierByArch immediately returns success whenever any operand is a memref (shouldBypassDecodedMemrefVerifier). tprefetch is also the one verifier here that explicitly accepts memref forms, so the new low-precision A2/A3/A5 checks never run for that post-lowering form anymore. As a result, an unsupported memref-form pto.tprefetch (for example A3 + !pto.hif8) now passes verification and only fails later in codegen/runtime, which breaks the new “A5-only” contract for low-precision prefetch.

@FangRui0 FangRui0 force-pushed the support_low_precision branch from 0c17c69 to bf044f2 Compare May 22, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants