improve masked load/store for bytes and words. #2821
Here is a somewhat faster implementation:

```ispc
varying int32 FastLoadByte(const uniform int8 * const uniform arr, varying int32 Index)
{
    varying int32 DwInd = Index >> 2;                  // Index/4
    varying int32 Shift = ((uint32)Index << 30) >> 27; // (Index%4)*8
    varying uint32 Dword = ((uniform uint32 * uniform)&arr[0])[DwInd];
    return (Dword >> Shift) & 0xFF;
}
```

Compiled for AVX2, this produces:

```asm
vmovmskps eax, ymm1
vpbroadcastd ymm2, dword ptr [rip + .LCPI1_0] # ymm2 = [4294967292,4294967292,4294967292,4294967292,4294967292,4294967292,4294967292,4294967292]
vpand ymm3, ymm0, ymm2
cmp eax, 255
jne .LBB1_2
vpcmpeqd ymm1, ymm1, ymm1
.LBB1_2:
vpxor xmm2, xmm2, xmm2
vpgatherdd ymm2, ymmword ptr [rdi + ymm3], ymm1
vpslld ymm0, ymm0, 3
vpbroadcastd ymm1, dword ptr [rip + .LCPI1_1] # ymm1 = [24,24,24,24,24,24,24,24]
vpand ymm0, ymm0, ymm1
vpsrlvd ymm0, ymm2, ymm0
vpand ymm0, ymm0, ymmword ptr [rip + .LCPI1_2]
ret
```

This is hand-tuned, which I think should work for any mask:

```asm
vpand ymm3, ymm0, ymmword ptr [rip + .LCPI1_0] # [0xfffffffc,...]
vpxor xmm2, xmm2, xmm2
vpgatherdd ymm2, ymmword ptr [rdi + ymm3], ymm1
vpslld ymm0, ymm0, 30
vpsrld ymm0, ymm0, 27
vpsrlvd ymm0, ymm2, ymm0
vpand ymm0, ymm0, ymmword ptr [rip + .LCPI1_1] # [0xff,...]
ret
```

I don't know what the reason is behind the `vpxor xmm2, xmm2, xmm2` that zeroes the gather destination; if it isn't needed, this reduces further to:

```asm
vpand ymm2, ymm0, ymmword ptr [rip + .LCPI1_0] # [0xfffffffc,...]
vpgatherdd ymm2, ymmword ptr [rdi + ymm2], ymm1
vpslld ymm0, ymm0, 30
vpsrld ymm0, ymm0, 27
vpsrlvd ymm0, ymm2, ymm0
vpand ymm0, ymm0, ymmword ptr [rip + .LCPI1_1] # [0xff,...]
ret
```

The other two `vpand` instructions can also be replaced with shifts to remove the constants.
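A similar trick should extend to 16-bit words (my sketch, not from the thread; `FastLoadWord` is a hypothetical name, and it assumes the buffer is padded so whole dwords are always readable):

```ispc
varying int32 FastLoadWord(const uniform int16 * const uniform arr, varying int32 Index)
{
    varying int32 DwInd = Index >> 1;                  // Index/2
    varying int32 Shift = ((uint32)Index << 31) >> 27; // (Index%2)*16
    varying uint32 Dword = ((uniform uint32 * uniform)&arr[0])[DwInd];
    return (Dword >> Shift) & 0xFFFF;
}
```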
I think the generated code for `SetValue` can be improved:

```asm
SetValue___REFvytvyt: # @SetValue___REFvytvyt
vmovmskps eax, ymm1
cmp eax, 255
je .LBB1_2
vmovd xmm1, eax
vpbroadcastd ymm1, xmm1
vpand ymm1, ymm1, ymmword ptr [rip + .LCPI1_0]
vpxor xmm2, xmm2, xmm2
vpcmpeqd ymm1, ymm1, ymm2
vextracti128 xmm2, ymm1, 1
vpackssdw xmm1, xmm1, xmm2
vpacksswb xmm1, xmm1, xmm1
vmovq xmm2, qword ptr [rdi] # xmm2 = mem[0],zero
vpblendvb xmm0, xmm0, xmm2, xmm1
.LBB1_2:
vmovq qword ptr [rdi], xmm0
vzeroupper
ret
```

For the masked part, it's doing some extra work which I think can be more straightforward. Instead of building an inverted mask, the execution mask already in ymm1 can be used directly:

```asm
vextracti128 xmm2, ymm1, 1 # extract upper 128 bits from the mask
vpackssdw xmm1, xmm1, xmm2 # convert 256-bit mask to 128-bit mask
vpacksswb xmm1, xmm1, xmm1 # convert 128-bit mask to 64-bit mask.
vmovq xmm2, qword ptr [rdi] # xmm2 = mem[0],zero
vpblendvb xmm0, xmm2, xmm0, xmm1 # where the mask is set, take src (xmm0); otherwise keep dest (xmm2)
```

Together with the rest of the code:

```asm
SetValue___REFvytvyt: # @SetValue___REFvytvyt
vmovmskps eax, ymm1
cmp eax, 255
je .LBB1_2
vextracti128 xmm2, ymm1, 1
vpackssdw xmm1, xmm1, xmm2
vpacksswb xmm1, xmm1, xmm1
vmovq xmm2, qword ptr [rdi] # xmm2 = mem[0],zero
vpblendvb xmm0, xmm2, xmm0, xmm1
.LBB1_2:
vmovq qword ptr [rdi], xmm0
vzeroupper
ret
```

I tried a few ways and came up with this; it's still doing some redundant work. 😞

```ispc
inline void SetByte(int8& dest, int8 src)
{
    // fast path: every lane is active, so store the whole vector
    if ((((1 << TARGET_WIDTH) - 1) ^ lanemask()) == 0)
    {
        unmasked
        {
            dest = src;
        }
    }
    else
    {
        // build a per-lane selection mask, then blend and store unmasked
        uniform uint32<TARGET_WIDTH> bmask;
        unmasked
        {
            bmask[programIndex] = 0;
        }
        bmask[programIndex] = ~0; // masked write: only active lanes set their slot
        unmasked
        {
            dest = select(bmask[programIndex] == 0, dest, src);
        }
    }
}
```

```asm
SetByte___REFvyTvyT: # @SetByte___REFvyTvyT
vmovmskps eax, ymm1
cmp al, -1
je .LBB1_2
vpcmpeqd ymm2, ymm2, ymm2 # unnecessary
vxorps xmm3, xmm3, xmm3 # unnecessary
vblendvps ymm1, ymm3, ymm2, ymm1 # unnecessary
vpcmpeqd ymm1, ymm1, ymm3 # unnecessary
vextracti128 xmm2, ymm1, 1
vpackssdw xmm1, xmm1, xmm2
vpacksswb xmm1, xmm1, xmm1
vmovq xmm2, qword ptr [rdi] # xmm2 = mem[0],zero
vpblendvb xmm0, xmm0, xmm2, xmm1
.LBB1_2: # %common.ret
vmovq qword ptr [rdi], xmm0
vzeroupper
ret
```
I noticed that there is a `__mask` built-in. It looks like a very useful variable: unlike lanemask, it gives direct access to the execution mask. It also worked for non-debug builds and gave the best results so far. Please don't remove it! 😄

```ispc
void SetByte(int8& dest, int8 src)
{
    varying uint32 mask = __mask;
    unmasked
    {
        dest = mask ? src : dest;
    }
}
```

masked:

```asm
SetByte___REFvyTvyT: # @SetByte___REFvyTvyT
vpxor xmm2, xmm2, xmm2 # unnecessary
vpcmpeqd ymm1, ymm1, ymm2 # unnecessary
vextracti128 xmm2, ymm1, 1
vpackssdw xmm1, xmm1, xmm2
vpacksswb xmm1, xmm1, xmm1
vmovq xmm2, qword ptr [rdi] # xmm2 = mem[0],zero
vpblendvb xmm0, xmm0, xmm2, xmm1
vmovq qword ptr [rdi], xmm0
vzeroupper
ret
```

unmasked:

```asm
SetByte___UM_REFvyTvyT: # @SetByte___UM_REFvyTvyT
vmovlps qword ptr [rdi], xmm0
ret
```

This looks much cleaner; there are still two unnecessary instructions.
I have not read the whole thread but I need to comment about `__mask`. I would not discourage you from using it if you need it, but you should at least understand that it is defined differently for different targets. See:

Line 3008 in 45d66e9

Line 28 in 45d66e9
@nurmukhametov Thank you for pointing that out. I think I found a bug!

```ispc
void SetByte(int8& dest, int8 src)
{
#if TARGET_ELEMENT_WIDTH == 1
    typedef uint8 TMask;
#elif TARGET_ELEMENT_WIDTH == 2
    typedef uint16 TMask;
#elif TARGET_ELEMENT_WIDTH == 4
    typedef uint32 TMask;
#elif TARGET_ELEMENT_WIDTH == 8
    typedef uint64 TMask;
#else
#error "unknown mask"
#endif
    varying TMask mask = __mask;
    unmasked
    {
        dest = mask ? src : dest;
    }
}
```

I tested this for all cpu targets on godbolt and the output assembly seems right, except for one of them.
I used lanemask, but the compiler is not able to optimize it very well. I may try switching to int32 later; that quadruples the memory usage, but SIMD instructions are more friendly with it.

Tldr of this thread: I'm working with small types to perform some expensive computations. Unfortunately, sometimes ispc gives up and uses gather/scatter operations when working with types like int8 and int16.
This doesn't look right because `TARGET_ELEMENT_WIDTH` is not `ISPC_MASK_BITS`. I think idiomatically you need to write this code instead:
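A guess at what that idiomatic version looks like (my reconstruction, not the original snippet): the typedef is keyed off `ISPC_MASK_BITS`, ispc's predefined mask-width macro, instead of `TARGET_ELEMENT_WIDTH`.

```ispc
void SetByte(int8& dest, int8 src)
{
#if ISPC_MASK_BITS == 8
    typedef uint8 TMask;
#elif ISPC_MASK_BITS == 16
    typedef uint16 TMask;
#elif ISPC_MASK_BITS == 32
    typedef uint32 TMask;
#elif ISPC_MASK_BITS == 64
    typedef uint64 TMask;
#else
#error "unknown mask width"
#endif
    // __mask elements have the same width as the target's execution mask
    varying TMask mask = __mask;
    unmasked
    {
        dest = mask ? src : dest;
    }
}
```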
I see, yeah this looks correct and much cleaner. However, the issue is still there; it compiles to an empty function:

```asm
SetByte___REFvytvyt: # @SetByte___REFvytvyt
ret
```

Same issue with this code:

```ispc
void SetByte(int8& dest, int8 src)
{
    uniform uint32 mask = lanemask();
    unmasked
    {
        dest = select(((1 << programIndex) & mask) != 0, src, dest);
    }
}
```
I have created issue #2824 for that.
@MkazemAkhgary, sorry for commenting on the initial example late, but it looks like it would be better to write the code with a foreach loop. For any of the targets, it generates something like:

```asm
# avx2-i32x4
SetValue___REFvytvyt: # @SetValue___REFvytvyt
SetByte___REFvytvyt: # @SetByte___REFvytvyt
vmovss dword ptr [rdi], xmm0
ret
```
@nurmukhametov wow, very nice solution, thank you! I have to revisit some of my code to see how this performs, although it's strange that the compiler doesn't do it efficiently without the foreach loop.
@nurmukhametov looks like this is not doing a masked store :( It generates the same instruction as the unmasked function, which is somewhat unexpected.
If I remember correctly, there is no AVX2 instruction for doing a masked store of bytes or shorts, but it can be done with multiple instructions in a branchless way, as in the sketch below.
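For illustration, a sketch of that branchless pattern in ispc (my example, not from the thread): read the destination bytes unmasked, merge under the execution mask, and write everything back unmasked. It assumes all `TARGET_WIDTH` destination bytes are writable and not concurrently modified elsewhere.

```ispc
void BranchlessStoreByte(uniform int8 dest[], varying int8 src)
{
    varying int8 merged;
    unmasked
    {
        merged = dest[programIndex]; // plain vector load of the existing bytes
    }
    merged = src;                    // masked assignment: only active lanes change
    unmasked
    {
        dest[programIndex] = merged; // plain vector store of the blended bytes
    }
}
```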
Yeah, I missed the initial point completely, and I only now realized that. Anyway. As I understand, there are two problems with this code. There are also other problems with the efficiency of code generated for these small types.
Storing values in a divergent control flow can be inefficient for byte and word data types such as `bool`, `int8`, `int16`, `uint8` and `uint16`. Sometimes ispc may perform gather/scatter without emitting a performance warning; this can occur when working with the mentioned types in masked regions. Here is an example, compiled for AVX2:
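A minimal sketch of the kind of code in question (my reconstruction; the function and names are hypothetical). The masked `int8` store inside the branch is what can turn into a scatter:

```ispc
void Example(uniform int8 dest[], uniform int count)
{
    foreach (i = 0 ... count)
    {
        int8 v = dest[i];
        if (v > 0)           // divergent control flow
        {
            dest[i] = v + 1; // masked int8 store; may be emitted as a scatter
        }
    }
}
```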
A branchless load/blend/store optimization, as sketched earlier in this thread, is possible here, and it can also be applied to the other types (`uint8`, `int16`, `uint16`, `bool`); for AVX2, this would compile to a short sequence without gather/scatter.
In case of accessing byte/word arrays, a performance warning should be emitted when gather/scatter is present.
It would be beneficial to have a set of functions for loading from and storing to byte/word arrays that assume the array size is a multiple of 4 or 2. These functions could use more efficient instructions such as `vpmaskmov` or `vpgatherdd`. If the array size is known at compile time, the compiler could automatically use these functions. However, invoking these functions with an incorrect array size may result in a memory access violation.

An implementation that loads from an `int8` array when the index is not continuous is the `FastLoadByte` function shown in the first comment above. I think `vpmaskmov` can be used if the index is continuous, but I haven't been able to implement it yet.

The `assume` keyword could be used to inform the compiler of a safe upper bound, allowing it to avoid gather/scatter using `vpmaskmov`, or to perform a more efficient gather as shown in the `FastLoadByte` function; a sketch follows.
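For illustration, a sketch of how that hint might look. This is an assumption on my part: it presumes ispc's `assume()` standard-library function and reuses `FastLoadByte` from the first comment; `LoadByteBounded` and its signature are hypothetical.

```ispc
varying int32 LoadByteBounded(const uniform int8 * const uniform arr,
                              uniform int32 count, varying int32 index)
{
    // promise the compiler the buffer length is a multiple of 4, so the
    // dword containing any valid byte index is always safe to read
    assume((count & 3) == 0);
    return FastLoadByte(arr, index); // dword gather + shift, no byte gather
}
```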