Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPIR-V] ByteAddressBuffer Load should get vectorized #3372

Closed
Jasper-Bekkers opened this issue Jan 20, 2021 · 7 comments
Closed

[SPIR-V] ByteAddressBuffer Load should get vectorized #3372

Jasper-Bekkers opened this issue Jan 20, 2021 · 7 comments
Assignees
Labels
spirv Work related to SPIR-V

Comments

@Jasper-Bekkers
Copy link

struct PSInput
{
	float4 color : COLOR;
};

ByteAddressBuffer g_stuff;


float4 PSMain(PSInput input) : SV_TARGET
{
	//return mul(g_stuff.Load<Mat4x4Wrapper>(0).internal, input.color);
	//return mul(g_stuff[0].internal, input.color);
	return mul(g_stuff.Load<float4x4>(0), input.color);
}

Output:

; SPIR-V
; Version: 1.0
; Generator: Google spiregg; 0
; Bound: 89
; Schema: 0
               OpCapability Shader
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %PSMain "PSMain" %in_var_COLOR %out_var_SV_TARGET
               OpExecutionMode %PSMain OriginUpperLeft
               OpSource HLSL 650
               OpName %type_ByteAddressBuffer "type.ByteAddressBuffer"
               OpName %g_stuff "g_stuff"
               OpName %in_var_COLOR "in.var.COLOR"
               OpName %out_var_SV_TARGET "out.var.SV_TARGET"
               OpName %PSMain "PSMain"
               OpDecorate %in_var_COLOR Location 0
               OpDecorate %out_var_SV_TARGET Location 0
               OpDecorate %g_stuff DescriptorSet 0
               OpDecorate %g_stuff Binding 0
               OpDecorate %_runtimearr_uint ArrayStride 4
               OpMemberDecorate %type_ByteAddressBuffer 0 Offset 0
               OpMemberDecorate %type_ByteAddressBuffer 0 NonWritable
               OpDecorate %type_ByteAddressBuffer BufferBlock
       %uint = OpTypeInt 32 0
     %uint_0 = OpConstant %uint 0
     %uint_2 = OpConstant %uint 2
     %uint_1 = OpConstant %uint 1
%_runtimearr_uint = OpTypeRuntimeArray %uint
%type_ByteAddressBuffer = OpTypeStruct %_runtimearr_uint
%_ptr_Uniform_type_ByteAddressBuffer = OpTypePointer Uniform %type_ByteAddressBuffer
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_ptr_Input_v4float = OpTypePointer Input %v4float
%_ptr_Output_v4float = OpTypePointer Output %v4float
       %void = OpTypeVoid
         %17 = OpTypeFunction %void
%_ptr_Uniform_uint = OpTypePointer Uniform %uint
%mat4v4float = OpTypeMatrix %v4float 4
    %g_stuff = OpVariable %_ptr_Uniform_type_ByteAddressBuffer Uniform
%in_var_COLOR = OpVariable %_ptr_Input_v4float Input
%out_var_SV_TARGET = OpVariable %_ptr_Output_v4float Output
     %uint_3 = OpConstant %uint 3
     %uint_4 = OpConstant %uint 4
     %uint_5 = OpConstant %uint 5
     %uint_6 = OpConstant %uint 6
     %uint_7 = OpConstant %uint 7
     %uint_8 = OpConstant %uint 8
     %uint_9 = OpConstant %uint 9
    %uint_10 = OpConstant %uint 10
    %uint_11 = OpConstant %uint 11
    %uint_12 = OpConstant %uint 12
    %uint_13 = OpConstant %uint 13
    %uint_14 = OpConstant %uint 14
    %uint_15 = OpConstant %uint 15
     %PSMain = OpFunction %void None %17
         %33 = OpLabel
         %34 = OpLoad %v4float %in_var_COLOR
         %35 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_0
         %36 = OpLoad %uint %35
         %37 = OpBitcast %float %36
         %38 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_1
         %39 = OpLoad %uint %38
         %40 = OpBitcast %float %39
         %41 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_2
         %42 = OpLoad %uint %41
         %43 = OpBitcast %float %42
         %44 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_3
         %45 = OpLoad %uint %44
         %46 = OpBitcast %float %45
         %47 = OpCompositeConstruct %v4float %37 %40 %43 %46
         %48 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_4
         %49 = OpLoad %uint %48
         %50 = OpBitcast %float %49
         %51 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_5
         %52 = OpLoad %uint %51
         %53 = OpBitcast %float %52
         %54 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_6
         %55 = OpLoad %uint %54
         %56 = OpBitcast %float %55
         %57 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_7
         %58 = OpLoad %uint %57
         %59 = OpBitcast %float %58
         %60 = OpCompositeConstruct %v4float %50 %53 %56 %59
         %61 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_8
         %62 = OpLoad %uint %61
         %63 = OpBitcast %float %62
         %64 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_9
         %65 = OpLoad %uint %64
         %66 = OpBitcast %float %65
         %67 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_10
         %68 = OpLoad %uint %67
         %69 = OpBitcast %float %68
         %70 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_11
         %71 = OpLoad %uint %70
         %72 = OpBitcast %float %71
         %73 = OpCompositeConstruct %v4float %63 %66 %69 %72
         %74 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_12
         %75 = OpLoad %uint %74
         %76 = OpBitcast %float %75
         %77 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_13
         %78 = OpLoad %uint %77
         %79 = OpBitcast %float %78
         %80 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_14
         %81 = OpLoad %uint %80
         %82 = OpBitcast %float %81
         %83 = OpAccessChain %_ptr_Uniform_uint %g_stuff %uint_0 %uint_15
         %84 = OpLoad %uint %83
         %85 = OpBitcast %float %84
         %86 = OpCompositeConstruct %v4float %76 %79 %82 %85
         %87 = OpCompositeConstruct %mat4v4float %47 %60 %73 %86
         %88 = OpVectorTimesMatrix %v4float %34 %87
               OpStore %out_var_SV_TARGET %88
               OpReturn
               OpFunctionEnd

Compared to:

struct PSInput
{
	float4 color : COLOR;
};

struct Mat4x4Wrapper {
    float4x4 internal;
};

StructuredBuffer<float4x4> g_stuff;


float4 PSMain(PSInput input) : SV_TARGET
{
	//return mul(g_stuff.Load<Mat4x4Wrapper>(0).internal, input.color);
	return mul(g_stuff[0], input.color);
}
; SPIR-V
; Version: 1.0
; Generator: Google spiregg; 0
; Bound: 25
; Schema: 0
               OpCapability Shader
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %PSMain "PSMain" %in_var_COLOR %out_var_SV_TARGET
               OpExecutionMode %PSMain OriginUpperLeft
               OpSource HLSL 650
               OpName %type_StructuredBuffer_mat4v4float "type.StructuredBuffer.mat4v4float"
               OpName %g_stuff "g_stuff"
               OpName %in_var_COLOR "in.var.COLOR"
               OpName %out_var_SV_TARGET "out.var.SV_TARGET"
               OpName %PSMain "PSMain"
               OpDecorate %in_var_COLOR Location 0
               OpDecorate %out_var_SV_TARGET Location 0
               OpDecorate %g_stuff DescriptorSet 0
               OpDecorate %g_stuff Binding 0
               OpDecorate %_runtimearr_mat4v4float ArrayStride 64
               OpMemberDecorate %type_StructuredBuffer_mat4v4float 0 Offset 0
               OpMemberDecorate %type_StructuredBuffer_mat4v4float 0 MatrixStride 16
               OpMemberDecorate %type_StructuredBuffer_mat4v4float 0 RowMajor
               OpMemberDecorate %type_StructuredBuffer_mat4v4float 0 NonWritable
               OpDecorate %type_StructuredBuffer_mat4v4float BufferBlock
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
       %uint = OpTypeInt 32 0
     %uint_0 = OpConstant %uint 0
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%mat4v4float = OpTypeMatrix %v4float 4
%_runtimearr_mat4v4float = OpTypeRuntimeArray %mat4v4float
%type_StructuredBuffer_mat4v4float = OpTypeStruct %_runtimearr_mat4v4float
%_ptr_Uniform_type_StructuredBuffer_mat4v4float = OpTypePointer Uniform %type_StructuredBuffer_mat4v4float
%_ptr_Input_v4float = OpTypePointer Input %v4float
%_ptr_Output_v4float = OpTypePointer Output %v4float
       %void = OpTypeVoid
         %18 = OpTypeFunction %void
%_ptr_Uniform_mat4v4float = OpTypePointer Uniform %mat4v4float
    %g_stuff = OpVariable %_ptr_Uniform_type_StructuredBuffer_mat4v4float Uniform
%in_var_COLOR = OpVariable %_ptr_Input_v4float Input
%out_var_SV_TARGET = OpVariable %_ptr_Output_v4float Output
     %PSMain = OpFunction %void None %18
         %20 = OpLabel
         %21 = OpLoad %v4float %in_var_COLOR
         %22 = OpAccessChain %_ptr_Uniform_mat4v4float %g_stuff %int_0 %uint_0
         %23 = OpLoad %mat4v4float %22
         %24 = OpVectorTimesMatrix %v4float %21 %23
               OpStore %out_var_SV_TARGET %24
               OpReturn
               OpFunctionEnd

Run with dxc -spirv -Tps_6_5 -EPSMain test.hlsl

@Jasper-Bekkers
Copy link
Author

Jasper-Bekkers commented Feb 24, 2021

It looks like this issue is significantly worse then expected, since non-templated loads exhibit the same bad behavior:

http://shader-playground.timjones.io/1ff2f9b80ad8cb73d7cc64f60edb2804
http://shader-playground.timjones.io/88a55360960ebd185281324bb19511d5
http://shader-playground.timjones.io/91e117daf519511a033a9c3a8daf457d

The expected output would be to have one OpTypeVector of 4 elements, and one OpLoad on that.

As an example, see the following RDNA ISA as outputted from the RGA tool:

; -------- Disassembly --------------------
shader main
  asic(GFX10)
  type(CS)
  sgpr_count(6)
  vgpr_count(8)
  wave_size(64)

  s_inst_prefetch  0x0003                               // 000000000000: BFA00003
  s_getpc_b64   s[0:1]                                  // 000000000004: BE801F80
  s_mov_b32     s0, s2                                  // 000000000008: BE800302
  s_load_dwordx4  s[4:7], s[0:1], null                  // 00000000000C: F4080100 FA000000
  s_and_b32     s0, s3, lit(0x00fffffc)                 // 000000000014: 8700FF03 00FFFFFC
  v_mov_b32     v0, s0                                  // 00000000001C: 7E000200
  s_waitcnt     lgkmcnt(0)                              // 000000000020: BF8CC07F
  s_clause      0x0003                                  // 000000000024: BFA10003
  buffer_load_dword  v1, v0, s[4:7], 0 offen            // 000000000028: E0301000 80010100
  buffer_load_dword  v2, v0, s[4:7], 0 offen offset:4   // 000000000030: E0301004 80010200
  buffer_load_dword  v3, v0, s[4:7], 0 offen offset:8   // 000000000038: E0301008 80010300
  buffer_load_dword  v4, v0, s[4:7], 0 offen offset:12  // 000000000040: E030100C 80010400
  s_waitcnt     vmcnt(3)                                // 000000000048: BF8C3F73
  v_cvt_u32_f32  v1, v1                                 // 00000000004C: 7E020F01
  s_waitcnt     vmcnt(2)                                // 000000000050: BF8C3F72
  v_cvt_u32_f32  v2, v2                                 // 000000000054: 7E040F02
  s_waitcnt     vmcnt(1)                                // 000000000058: BF8C3F71
  v_cvt_u32_f32  v3, v3                                 // 00000000005C: 7E060F03
  s_waitcnt     vmcnt(0)                                // 000000000060: BF8C3F70
  v_cvt_u32_f32  v4, v4                                 // 000000000064: 7E080F04
  buffer_store_dword  v1, v0, s[4:7], 0 offen glc       // 000000000068: E0705000 80010100
  buffer_store_dword  v2, v0, s[4:7], 0 offen offset:4 glc // 000000000070: E0705004 80010200
  buffer_store_dword  v3, v0, s[4:7], 0 offen offset:8 glc // 000000000078: E0705008 80010300
  buffer_store_dword  v4, v0, s[4:7], 0 offen offset:12 glc // 000000000080: E070500C 80010400
  s_endpgm                                              // 000000000088: BF810000

Which also emits 4 buffer_load_dword and 4 buffer_store_dword whereas one buffer_store_dwordx4 would do. Notice also that even though the compiler correctly detects that this can be in a single clause, it still also emits 4 s_waitcnt ops. Arguably AMD should have a LoadStoreVectorizer pass similar to what's found in LLVM, however in practice it seems that NVIDIA also doesn't so such a pass and relies on the source compiler (DXC in this case) to do the right thing.

@Jasper-Bekkers Jasper-Bekkers changed the title [SPIR-V] Templated ByteAddressBuffer Load should get vectorized [SPIR-V] ByteAddressBuffer Load should get vectorized Feb 24, 2021
@jaebaek
Copy link
Collaborator

jaebaek commented Mar 2, 2021

@Jasper-Bekkers Thank you for reporting this issue. I will take a look and get back to you.

@MarijnS95
Copy link
Contributor

@jaebaek Thanks! Could this possibly have any relation to #3370?

@alan-baker
Copy link
Contributor

This is a limitation of SPIR-V. If the variable is typed to be a runtime array of uints (as it is here), then only a single uint can loaded at a time. There is a potential optimization of representing the variable as a runtime array of uint4s (or an aggregate of larger values), but that complicates the logic to calculate the addresses.

@jaebaek
Copy link
Collaborator

jaebaek commented Mar 3, 2021

I agree with @alan-baker . It works correctly but inefficient, which is a limitation of SPIR-V.

Since the given ByteAddressBuffer is a buffer of "bytes" (it generates uint buffer in SPIR-V), converting it into an arbitrary type using the templated load definitely needs some type-cast for each element. Currently, we do not have SPIR-V instructions for doing it efficiently.

@Jasper-Bekkers
Copy link
Author

Currently, we do not have SPIR-V instructions for doing it efficiently.

Might be better to close this issue then, and follow up directly within Khronos instead what do you think?

@jaebaek
Copy link
Collaborator

jaebaek commented Mar 8, 2021

Yes, we can close it and open another one when we have some solutions for this.

@jaebaek jaebaek closed this as completed Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spirv Work related to SPIR-V
Projects
None yet
Development

No branches or pull requests

5 participants