AMDGPU: Rename and add bf16 support for global_load_tr builtins #86202

changpeng · 2024-03-21T21:44:01Z

Make the name of a clang builtin as close to the mnemonic instruction name as possible. The data type suffix may not be enough to tell what instruction the builtin is going to produce.
This patch also add the bf16 support for global_load_tr_b128 builtins.

Make the name of a clang builtin as close to the mnemonic instruction name as possible. The data type suffix may not be enough to tell what instruction the builtin is going to produce. This patch also add the bf16 support for global_load_tr_b128 builtins.

llvmbot · 2024-03-21T21:44:32Z

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang-codegen

Author: Changpeng Fang (changpeng)

Changes

Make the name of a clang builtin as close to the mnemonic instruction name as possible. The data type suffix may not be enough to tell what instruction the builtin is going to produce.
This patch also add the bf16 support for global_load_tr_b128 builtins.

Full diff: https://github.com/llvm/llvm-project/pull/86202.diff

7 Files Affected:

(modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+9-7)
(modified) clang/lib/CodeGen/CGBuiltin.cpp (+22-12)
(modified) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx11-err.cl (+13-12)
(modified) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w32-err.cl (+6-5)
(modified) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w64-err.cl (+6-5)
(modified) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w32.cl (+19-19)
(modified) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w64.cl (+19-19)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 61ec8b79bf054d..4153b316c22b1d 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -432,13 +432,15 @@ TARGET_BUILTIN(__builtin_amdgcn_s_wakeup_barrier, "vi", "n", "gfx12-insts")
 TARGET_BUILTIN(__builtin_amdgcn_s_barrier_leave, "b", "n", "gfx12-insts")
 TARGET_BUILTIN(__builtin_amdgcn_s_get_barrier_state, "Uii", "n", "gfx12-insts")
 
-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v2i32, "V2iV2i*1", "nc", "gfx12-insts,wavefrontsize32")
-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v8i16, "V8sV8s*1", "nc", "gfx12-insts,wavefrontsize32")
-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v8f16, "V8hV8h*1", "nc", "gfx12-insts,wavefrontsize32")
-
-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_i32, "ii*1", "nc", "gfx12-insts,wavefrontsize64")
-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v4i16, "V4sV4s*1", "nc", "gfx12-insts,wavefrontsize64")
-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v4f16, "V4hV4h*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b64_v2i32, "V2iV2i*1", "nc", "gfx12-insts,wavefrontsize32")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v8i16, "V8sV8s*1", "nc", "gfx12-insts,wavefrontsize32")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v8f16, "V8hV8h*1", "nc", "gfx12-insts,wavefrontsize32")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v8bf16, "V8yV8y*1", "nc", "gfx12-insts,wavefrontsize32")
+
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b64_i32, "ii*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4i16, "V4sV4s*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4f16, "V4hV4h*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4bf16, "V4yV4y*1", "nc", "gfx12-insts,wavefrontsize64")
 
 //===----------------------------------------------------------------------===//
 // WMMA builtins.
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index e14e8908828218..2eaceeba617700 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -18531,35 +18531,45 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {ArgTy});
     return Builder.CreateCall(F, {Addr, Val, ZeroI32, ZeroI32, ZeroI1});
   }
-  case AMDGPU::BI__builtin_amdgcn_global_load_tr_i32:
-  case AMDGPU::BI__builtin_amdgcn_global_load_tr_v2i32:
-  case AMDGPU::BI__builtin_amdgcn_global_load_tr_v4f16:
-  case AMDGPU::BI__builtin_amdgcn_global_load_tr_v4i16:
-  case AMDGPU::BI__builtin_amdgcn_global_load_tr_v8f16:
-  case AMDGPU::BI__builtin_amdgcn_global_load_tr_v8i16: {
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_i32:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_v2i32:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4bf16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4f16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4i16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8bf16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8f16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8i16: {
 
     llvm::Type *ArgTy;
     switch (BuiltinID) {
-    case AMDGPU::BI__builtin_amdgcn_global_load_tr_i32:
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_i32:
       ArgTy = llvm::Type::getInt32Ty(getLLVMContext());
       break;
-    case AMDGPU::BI__builtin_amdgcn_global_load_tr_v2i32:
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_v2i32:
       ArgTy = llvm::FixedVectorType::get(
           llvm::Type::getInt32Ty(getLLVMContext()), 2);
       break;
-    case AMDGPU::BI__builtin_amdgcn_global_load_tr_v4f16:
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4bf16:
+      ArgTy = llvm::FixedVectorType::get(
+          llvm::Type::getBFloatTy(getLLVMContext()), 4);
+      break;
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4f16:
       ArgTy = llvm::FixedVectorType::get(
           llvm::Type::getHalfTy(getLLVMContext()), 4);
       break;
-    case AMDGPU::BI__builtin_amdgcn_global_load_tr_v4i16:
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4i16:
       ArgTy = llvm::FixedVectorType::get(
           llvm::Type::getInt16Ty(getLLVMContext()), 4);
       break;
-    case AMDGPU::BI__builtin_amdgcn_global_load_tr_v8f16:
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8bf16:
+      ArgTy = llvm::FixedVectorType::get(
+          llvm::Type::getBFloatTy(getLLVMContext()), 8);
+      break;
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8f16:
       ArgTy = llvm::FixedVectorType::get(
           llvm::Type::getHalfTy(getLLVMContext()), 8);
       break;
-    case AMDGPU::BI__builtin_amdgcn_global_load_tr_v8i16:
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8i16:
       ArgTy = llvm::FixedVectorType::get(
           llvm::Type::getInt16Ty(getLLVMContext()), 8);
       break;
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx11-err.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx11-err.cl
index f7afb7cb97edad..4363769b864571 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx11-err.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx11-err.cl
@@ -6,21 +6,22 @@
 typedef int    v2i   __attribute__((ext_vector_type(2)));
 typedef half   v8h   __attribute__((ext_vector_type(8)));
 typedef short  v8s   __attribute__((ext_vector_type(8)));
+typedef __bf16 v8bf16 __attribute__((ext_vector_type(8)));
 
 typedef half   v4h   __attribute__((ext_vector_type(4)));
 typedef short  v4s   __attribute__((ext_vector_type(4)));
+typedef __bf16 v4bf16 __attribute__((ext_vector_type(4)));
 
-
-
-void amdgcn_global_load_tr(global v2i* v2i_inptr, global v8s* v8s_inptr, global v8h* v8h_inptr,
-                           global int* int_inptr, global v4s* v4s_inptr, global v4h* v4h_inptr)
+void amdgcn_global_load_tr(global v2i* v2i_inptr, global v8s* v8s_inptr, global v8h* v8h_inptr, global v8bf16* v8bf16_inptr,
+                           global int* int_inptr, global v4s* v4s_inptr, global v4h* v4h_inptr, global v4bf16* v4bf16_inptr)
 {
-  v2i out_1 = __builtin_amdgcn_global_load_tr_v2i32(v2i_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v2i32' needs target feature gfx12-insts,wavefrontsize32}}
-  v8s out_2 = __builtin_amdgcn_global_load_tr_v8i16(v8s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v8i16' needs target feature gfx12-insts,wavefrontsize32}}
-  v8h out_3 = __builtin_amdgcn_global_load_tr_v8f16(v8h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v8f16' needs target feature gfx12-insts,wavefrontsize32}}
-
-  int out_4 = __builtin_amdgcn_global_load_tr_i32(int_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_i32' needs target feature gfx12-insts,wavefrontsize64}}
-  v4s out_5 = __builtin_amdgcn_global_load_tr_v4i16(v4s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v4i16' needs target feature gfx12-insts,wavefrontsize64}}
-  v4h out_6 = __builtin_amdgcn_global_load_tr_v4f16(v4h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v4f16' needs target feature gfx12-insts,wavefrontsize64}}
+  v2i out_1 = __builtin_amdgcn_global_load_tr_b64_v2i32(v2i_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b64_v2i32' needs target feature gfx12-insts,wavefrontsize32}}
+  v8s out_2 = __builtin_amdgcn_global_load_tr_b128_v8i16(v8s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v8i16' needs target feature gfx12-insts,wavefrontsize32}}
+  v8h out_3 = __builtin_amdgcn_global_load_tr_b128_v8f16(v8h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v8f16' needs target feature gfx12-insts,wavefrontsize32}}
+  v8bf16 o4 = __builtin_amdgcn_global_load_tr_b128_v8bf16(v8bf16_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v8bf16' needs target feature gfx12-insts,wavefrontsize32}}
+
+  int out_5 = __builtin_amdgcn_global_load_tr_b64_i32(int_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b64_i32' needs target feature gfx12-insts,wavefrontsize64}}
+  v4s out_6 = __builtin_amdgcn_global_load_tr_b128_v4i16(v4s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v4i16' needs target feature gfx12-insts,wavefrontsize64}}
+  v4h out_7 = __builtin_amdgcn_global_load_tr_b128_v4f16(v4h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v4f16' needs target feature gfx12-insts,wavefrontsize64}}
+  v4bf16 o8 = __builtin_amdgcn_global_load_tr_b128_v4bf16(v4bf16_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v4bf16' needs target feature gfx12-insts,wavefrontsize64}}
 }
-
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w32-err.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w32-err.cl
index 04ac0a66db7ce7..208f92fc5d44f3 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w32-err.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w32-err.cl
@@ -5,11 +5,12 @@
 
 typedef half   v4h   __attribute__((ext_vector_type(4)));
 typedef short  v4s   __attribute__((ext_vector_type(4)));
+typedef __bf16 v4bf16 __attribute__((ext_vector_type(4)));
 
-void amdgcn_global_load_tr(global int* int_inptr, global v4s* v4s_inptr, global v4h* v4h_inptr)
+void amdgcn_global_load_tr(global int* int_inptr, global v4s* v4s_inptr, global v4h* v4h_inptr, global v4bf16* v4bf16_inptr)
 {
-  int out_4 = __builtin_amdgcn_global_load_tr_i32(int_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_i32' needs target feature gfx12-insts,wavefrontsize64}}
-  v4s out_5 = __builtin_amdgcn_global_load_tr_v4i16(v4s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v4i16' needs target feature gfx12-insts,wavefrontsize64}}
-  v4h out_6 = __builtin_amdgcn_global_load_tr_v4f16(v4h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v4f16' needs target feature gfx12-insts,wavefrontsize64}}
+  int out_1 = __builtin_amdgcn_global_load_tr_b64_i32(int_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b64_i32' needs target feature gfx12-insts,wavefrontsize64}}
+  v4s out_2 = __builtin_amdgcn_global_load_tr_b128_v4i16(v4s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v4i16' needs target feature gfx12-insts,wavefrontsize64}}
+  v4h out_3 = __builtin_amdgcn_global_load_tr_b128_v4f16(v4h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v4f16' needs target feature gfx12-insts,wavefrontsize64}}
+  v4bf16 o4 = __builtin_amdgcn_global_load_tr_b128_v4bf16(v4bf16_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v4bf16' needs target feature gfx12-insts,wavefrontsize64}}
 }
-
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w64-err.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w64-err.cl
index 113b54b853a9f4..199146a9715da6 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w64-err.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-gfx12-w64-err.cl
@@ -6,11 +6,12 @@
 typedef int    v2i   __attribute__((ext_vector_type(2)));
 typedef half   v8h   __attribute__((ext_vector_type(8)));
 typedef short  v8s   __attribute__((ext_vector_type(8)));
+typedef __bf16 v8bf16 __attribute__((ext_vector_type(8)));
 
-void amdgcn_global_load_tr(global v2i* v2i_inptr, global v8s* v8s_inptr, global v8h* v8h_inptr)
+void amdgcn_global_load_tr(global v2i* v2i_inptr, global v8s* v8s_inptr, global v8h* v8h_inptr, global v8bf16* v8bf16_inptr)
 {
-  v2i out_1 = __builtin_amdgcn_global_load_tr_v2i32(v2i_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v2i32' needs target feature gfx12-insts,wavefrontsize32}}
-  v8s out_2 = __builtin_amdgcn_global_load_tr_v8i16(v8s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v8i16' needs target feature gfx12-insts,wavefrontsize32}}
-  v8h out_3 = __builtin_amdgcn_global_load_tr_v8f16(v8h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_v8f16' needs target feature gfx12-insts,wavefrontsize32}}
+  v2i out_1 = __builtin_amdgcn_global_load_tr_b64_v2i32(v2i_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b64_v2i32' needs target feature gfx12-insts,wavefrontsize32}}
+  v8s out_2 = __builtin_amdgcn_global_load_tr_b128_v8i16(v8s_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v8i16' needs target feature gfx12-insts,wavefrontsize32}}
+  v8h out_3 = __builtin_amdgcn_global_load_tr_b128_v8f16(v8h_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v8f16' needs target feature gfx12-insts,wavefrontsize32}}
+  v8bf16 o4 = __builtin_amdgcn_global_load_tr_b128_v8bf16(v8bf16_inptr); // expected-error{{'__builtin_amdgcn_global_load_tr_b128_v8bf16' needs target feature gfx12-insts,wavefrontsize32}}
 }
-
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w32.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w32.cl
index b5fcad68a47020..0035b16b902b68 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w32.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w32.cl
@@ -5,44 +5,44 @@
 typedef int    v2i   __attribute__((ext_vector_type(2)));
 typedef half   v8h   __attribute__((ext_vector_type(8)));
 typedef short  v8s   __attribute__((ext_vector_type(8)));
+typedef __bf16 v8bf16 __attribute__((ext_vector_type(8)));
 
-// Wave32
-
-//
-// amdgcn_global_load_tr
-//
-
-// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_v2i32(
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b64_v2i32(
 // CHECK-GFX1200-NEXT:  entry:
 // CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <2 x i32> @llvm.amdgcn.global.load.tr.v2i32(ptr addrspace(1) [[INPTR:%.*]])
 // CHECK-GFX1200-NEXT:    ret <2 x i32> [[TMP0]]
 //
-v2i test_amdgcn_global_load_tr_v2i32(global v2i* inptr)
+v2i test_amdgcn_global_load_tr_b64_v2i32(global v2i* inptr)
 {
-  return __builtin_amdgcn_global_load_tr_v2i32(inptr);
+  return __builtin_amdgcn_global_load_tr_b64_v2i32(inptr);
 }
 
-//
-// amdgcn_global_load_tr
-//
-
-// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_v8i16(
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b128_v8i16(
 // CHECK-GFX1200-NEXT:  entry:
 // CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <8 x i16> @llvm.amdgcn.global.load.tr.v8i16(ptr addrspace(1) [[INPTR:%.*]])
 // CHECK-GFX1200-NEXT:    ret <8 x i16> [[TMP0]]
 //
-v8s test_amdgcn_global_load_tr_v8i16(global v8s* inptr)
+v8s test_amdgcn_global_load_tr_b128_v8i16(global v8s* inptr)
 {
-  return __builtin_amdgcn_global_load_tr_v8i16(inptr);
+  return __builtin_amdgcn_global_load_tr_b128_v8i16(inptr);
 }
 
-// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_v8f16(
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b128_v8f16(
 // CHECK-GFX1200-NEXT:  entry:
 // CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <8 x half> @llvm.amdgcn.global.load.tr.v8f16(ptr addrspace(1) [[INPTR:%.*]])
 // CHECK-GFX1200-NEXT:    ret <8 x half> [[TMP0]]
 //
-v8h test_amdgcn_global_load_tr_v8f16(global v8h* inptr)
+v8h test_amdgcn_global_load_tr_b128_v8f16(global v8h* inptr)
 {
-  return __builtin_amdgcn_global_load_tr_v8f16(inptr);
+  return __builtin_amdgcn_global_load_tr_b128_v8f16(inptr);
 }
 
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b128_v8bf16(
+// CHECK-GFX1200-NEXT:  entry:
+// CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <8 x bfloat> @llvm.amdgcn.global.load.tr.v8bf16(ptr addrspace(1) [[INPTR:%.*]])
+// CHECK-GFX1200-NEXT:    ret <8 x bfloat> [[TMP0]]
+//
+v8bf16 test_amdgcn_global_load_tr_b128_v8bf16(global v8bf16* inptr)
+{
+  return __builtin_amdgcn_global_load_tr_b128_v8bf16(inptr);
+}
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w64.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w64.cl
index 9c48ac071b4d3f..6c025bb5a55a36 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w64.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-tr-w64.cl
@@ -4,44 +4,44 @@
 
 typedef half   v4h   __attribute__((ext_vector_type(4)));
 typedef short  v4s   __attribute__((ext_vector_type(4)));
+typedef __bf16 v4bf16 __attribute__((ext_vector_type(4)));
 
-// Wave64
-
-//
-// amdgcn_global_load_tr
-//
-
-// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_i32(
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b64_i32(
 // CHECK-GFX1200-NEXT:  entry:
 // CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call i32 @llvm.amdgcn.global.load.tr.i32(ptr addrspace(1) [[INPTR:%.*]])
 // CHECK-GFX1200-NEXT:    ret i32 [[TMP0]]
 //
-int test_amdgcn_global_load_tr_i32(global int* inptr)
+int test_amdgcn_global_load_tr_b64_i32(global int* inptr)
 {
-  return __builtin_amdgcn_global_load_tr_i32(inptr);
+  return __builtin_amdgcn_global_load_tr_b64_i32(inptr);
 }
 
-//
-// amdgcn_global_load_tr
-//
-
-// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_v4i16(
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b128_v4i16(
 // CHECK-GFX1200-NEXT:  entry:
 // CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <4 x i16> @llvm.amdgcn.global.load.tr.v4i16(ptr addrspace(1) [[INPTR:%.*]])
 // CHECK-GFX1200-NEXT:    ret <4 x i16> [[TMP0]]
 //
-v4s test_amdgcn_global_load_tr_v4i16(global v4s* inptr)
+v4s test_amdgcn_global_load_tr_b128_v4i16(global v4s* inptr)
 {
-  return __builtin_amdgcn_global_load_tr_v4i16(inptr);
+  return __builtin_amdgcn_global_load_tr_b128_v4i16(inptr);
 }
 
-// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_v4f16(
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b128_v4f16(
 // CHECK-GFX1200-NEXT:  entry:
 // CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <4 x half> @llvm.amdgcn.global.load.tr.v4f16(ptr addrspace(1) [[INPTR:%.*]])
 // CHECK-GFX1200-NEXT:    ret <4 x half> [[TMP0]]
 //
-v4h test_amdgcn_global_load_tr_v4f16(global v4h* inptr)
+v4h test_amdgcn_global_load_tr_b128_v4f16(global v4h* inptr)
 {
-  return __builtin_amdgcn_global_load_tr_v4f16(inptr);
+  return __builtin_amdgcn_global_load_tr_b128_v4f16(inptr);
 }
 
+// CHECK-GFX1200-LABEL: @test_amdgcn_global_load_tr_b128_v4bf16(
+// CHECK-GFX1200-NEXT:  entry:
+// CHECK-GFX1200-NEXT:    [[TMP0:%.*]] = tail call <4 x bfloat> @llvm.amdgcn.global.load.tr.v4bf16(ptr addrspace(1) [[INPTR:%.*]])
+// CHECK-GFX1200-NEXT:    ret <4 x bfloat> [[TMP0]]
+//
+v4bf16 test_amdgcn_global_load_tr_b128_v4bf16(global v4bf16* inptr)
+{
+  return __builtin_amdgcn_global_load_tr_b128_v4bf16(inptr);
+}

rampitec

Do you want to rename intrinsics as well? Because now intrinsic names do not match builtin names.

shiltian · 2024-03-21T22:36:49Z

clang/include/clang/Basic/BuiltinsAMDGPU.def

+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b64_i32, "ii*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4i16, "V4sV4s*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4f16, "V4hV4h*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4bf16, "V4yV4y*1", "nc", "gfx12-insts,wavefrontsize64")


Do we still want to keep the old builtins to maintain compatibility, though I doubt there is any legacy code using them?

There should not be legacy yet.

changpeng · 2024-03-21T23:09:07Z

Do you want to rename intrinsics as well? Because now intrinsic names do not match builtin names.

Do we have to match builtins with intrinsics? Renaming intrinsics here means we will have to duplicate the intrinsics.

rampitec · 2024-03-21T23:12:32Z

Do you want to rename intrinsics as well? Because now intrinsic names do not match builtin names.

Do we have to match builtins with intrinsics? Renaming intrinsics here means we will have to duplicate the intrinsics.

Is that because of the mangling?

changpeng · 2024-03-21T23:29:42Z

Do you want to rename intrinsics as well? Because now intrinsic names do not match builtin names.

Do we have to match builtins with intrinsics? Renaming intrinsics here means we will have to duplicate the intrinsics.

Is that because of the mangling?
Right. It was originally suggested to use a single instrinsic "load_lr". But eventually we use global_load_tr to indicate this is in global address space. If we want to rename intrinsics here, it should be global_load_tr_b64 and global_load_tr_b128.

We should rename intrinsic if users can use intrinsics directly. I think use-friendly is more important.

shiltian · 2024-03-22T00:26:47Z

Do you want to rename intrinsics as well? Because now intrinsic names do not match builtin names.

Do we have to match builtins with intrinsics? Renaming intrinsics here means we will have to duplicate the intrinsics.

Is that because of the mangling?
Right. It was originally suggested to use a single instrinsic "load_lr". But eventually we use global_load_tr to indicate this is in global address space. If we want to rename intrinsics here, it should be global_load_tr_b64 and global_load_tr_b128.

We should rename intrinsic if users can use intrinsics directly. I think use-friendly is more important.

I don't think intrinsics are meant for users. Builtins are the user-facing front. :-)

changpeng · 2024-03-22T00:55:33Z

I don't think intrinsics are meant for users. Builtins are the user-facing front. :-)

Then renaming the intrinsics should be at a relatively lower priority. We may do it in a separate patch once we have reached an agreement.

rampitec · 2024-03-22T06:43:39Z

I don't think intrinsics are meant for users. Builtins are the user-facing front. :-)

Depending on who you consider an user. Are folks writing MLIR generators users?

piotrAMD · 2024-03-22T06:51:36Z

The change LG - thanks for adding support for bf16.

Agreed that the intrinsics should match the builtins for consistency (now or in a follow-up commit).
These intrinsics were added for the upcoming generation - it should be fine to rename them at this stage.

arsenm · 2024-03-22T10:43:10Z

I don't think intrinsics are meant for users. Builtins are the user-facing front. :-)

Depending on who you consider an user. Are folks writing MLIR generators users?

They're consumers of an unstable API, changing intrinsics is fine

arsenm · 2024-03-22T10:45:29Z

clang/include/clang/Basic/BuiltinsAMDGPU.def

+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b64_i32, "ii*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4i16, "V4sV4s*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4f16, "V4hV4h*1", "nc", "gfx12-insts,wavefrontsize64")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4bf16, "V4yV4y*1", "nc", "gfx12-insts,wavefrontsize64")


Do we really need the f16/bf16 versions? You can always bitcast the i16 versions.

changpeng · 2024-03-22T15:16:59Z

[AMD Official Use Only - General] I am fine to remove f16/bf16 versions. Enumerating all possible types could be very painful. For example we gave up enumerating for B64, and ended up using v2i32 only. What do others think removing f16/bf16 versions? Thanks Get Outlook for iOS<https://aka.ms/o0ukef>

________________________________ From: Matt Arsenault ***@***.***> Sent: Friday, March 22, 2024 3:45:53 AM To: llvm/llvm-project ***@***.***> Cc: Fang, Changpeng ***@***.***>; Author ***@***.***> Subject: Re: [llvm/llvm-project] AMDGPU: Rename and add bf16 support for global_load_tr builtins (PR #86202) Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding. @arsenm commented on this pull request.

________________________________ In clang/include/clang/Basic/BuiltinsAMDGPU.def<#86202 (comment)>:

-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v2i32, "V2iV2i*1", "nc", "gfx12-insts,wavefrontsize32")

-TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v8i16, "V8sV8s*1", "nc", "gfx12-insts,wavefrontsize32") -TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v8f16, "V8hV8h*1", "nc", "gfx12-insts,wavefrontsize32") - -TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_i32, "ii*1", "nc", "gfx12-insts,wavefrontsize64") -TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v4i16, "V4sV4s*1", "nc", "gfx12-insts,wavefrontsize64") -TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_v4f16, "V4hV4h*1", "nc", "gfx12-insts,wavefrontsize64") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b64_v2i32, "V2iV2i*1", "nc", "gfx12-insts,wavefrontsize32") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v8i16, "V8sV8s*1", "nc", "gfx12-insts,wavefrontsize32") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v8f16, "V8hV8h*1", "nc", "gfx12-insts,wavefrontsize32") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v8bf16, "V8yV8y*1", "nc", "gfx12-insts,wavefrontsize32") + +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b64_i32, "ii*1", "nc", "gfx12-insts,wavefrontsize64") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4i16, "V4sV4s*1", "nc", "gfx12-insts,wavefrontsize64") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4f16, "V4hV4h*1", "nc", "gfx12-insts,wavefrontsize64") +TARGET_BUILTIN(__builtin_amdgcn_global_load_tr_b128_v4bf16, "V4yV4y*1", "nc", "gfx12-insts,wavefrontsize64") Do we really need the f16/bf16 versions? You can always bitcast the i16 versions. — Reply to this email directly, view it on GitHub<#86202 (review)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABALCLLYCW6QUD3CELVLSKDYZQDWDAVCNFSM6AAAAABFCIDLAKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSNJUGQ4TEOBYGM>. You are receiving this because you authored the thread.Message ID: ***@***.***>

changpeng · 2024-03-22T15:51:26Z

I am going to propose to rename intrinsics and remove f16/bf16 versions of builtins/intrinsics

…#86202) Make the name of a clang builtin as close to the mnemonic instruction name as possible. The data type suffix may not be enough to tell what instruction the builtin is going to produce. This patch also add the bf16 support for global_load_tr_b128 builtins.

llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen labels Mar 21, 2024

changpeng requested a review from piotrAMD March 21, 2024 21:44

changpeng requested review from rampitec, jayfoad and srpande March 21, 2024 21:44

rampitec reviewed Mar 21, 2024

View reviewed changes

shiltian reviewed Mar 21, 2024

View reviewed changes

changpeng requested a review from arsenm March 21, 2024 23:30

arsenm approved these changes Mar 22, 2024

View reviewed changes

arsenm reviewed Mar 22, 2024

View reviewed changes

changpeng merged commit 3054d0d into llvm:main Mar 22, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMDGPU: Rename and add bf16 support for global_load_tr builtins #86202

AMDGPU: Rename and add bf16 support for global_load_tr builtins #86202

changpeng commented Mar 21, 2024

llvmbot commented Mar 21, 2024 •

edited

rampitec left a comment

shiltian Mar 21, 2024

rampitec Mar 21, 2024

changpeng commented Mar 21, 2024

rampitec commented Mar 21, 2024

changpeng commented Mar 21, 2024

shiltian commented Mar 22, 2024

changpeng commented Mar 22, 2024 •

edited

rampitec commented Mar 22, 2024

piotrAMD commented Mar 22, 2024

arsenm commented Mar 22, 2024

arsenm Mar 22, 2024

changpeng commented Mar 22, 2024 via email

changpeng commented Mar 22, 2024

AMDGPU: Rename and add bf16 support for global_load_tr builtins #86202

AMDGPU: Rename and add bf16 support for global_load_tr builtins #86202

Conversation

changpeng commented Mar 21, 2024

llvmbot commented Mar 21, 2024 • edited

rampitec left a comment

Choose a reason for hiding this comment

shiltian Mar 21, 2024

Choose a reason for hiding this comment

rampitec Mar 21, 2024

Choose a reason for hiding this comment

changpeng commented Mar 21, 2024

rampitec commented Mar 21, 2024

changpeng commented Mar 21, 2024

shiltian commented Mar 22, 2024

changpeng commented Mar 22, 2024 • edited

rampitec commented Mar 22, 2024

piotrAMD commented Mar 22, 2024

arsenm commented Mar 22, 2024

arsenm Mar 22, 2024

Choose a reason for hiding this comment

changpeng commented Mar 22, 2024 via email

changpeng commented Mar 22, 2024

llvmbot commented Mar 21, 2024 •

edited

changpeng commented Mar 22, 2024 •

edited