Compute quickly the byte lengths without look-ups #12
added a commit
Nov 7, 2017
It is definitely susceptible to various bithacks. I used the permutation table since it's available, but the 2-bit fields could also be summed in a few instructions.
mod15 compiles to a multiply and some shifts and subtracts, so it may be decomposable into fewer instructions in this context.
Generally I would spread the bitfields out and use a multiply to sum them into something like the most significant byte; that's what I do in the compressor when I have the 2-bit fields already spread out. Your method of multiplying by 0x401 and masking looks promising -- they're far enough apart to sum into a nybble at that point. Multiplying by (1<<28)|(1<<24)|(1<<16)|(1<<12) should drop them on top of each other in the high nybble, if I've got those shifts right.
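To sanity-check those shifts, here is a minimal scalar sketch of the spread-and-gather idea. The mask `0x33033` and the function names are my assumptions, not something taken from the patch; the mask keeps one copy of each 2-bit field at bit positions 0, 4, 12 and 16 after the multiply by `0x401`. In 32-bit arithmetic the second multiply then stacks all four fields at bit 28, and the lower partial sums never exceed 6, so nothing carries into the high nybble.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of the four 2-bit fields of a key byte, without a lookup table.
   Assumption: mask 0x33033 selects f0@0, f2@4, f1@12, f3@16. */
static inline uint32_t key_field_sum(uint8_t key) {
    uint32_t spread = ((uint32_t)key * 0x401u) & 0x33033u;
    uint32_t gather = (1u << 28) | (1u << 24) | (1u << 16) | (1u << 12);
    /* Multiplication wraps mod 2^32; the sum ends up in the high nybble. */
    return (spread * gather) >> 28;
}

/* Each field encodes (byte count - 1), so four integers need sum + 4 bytes. */
static inline uint32_t key_data_length(uint8_t key) {
    return key_field_sum(key) + 4u;
}

/* Straightforward reference used only for checking. */
static uint32_t key_data_length_ref(uint8_t key) {
    uint32_t len = 0;
    for (int i = 0; i < 4; i++) {
        len += ((key >> (2 * i)) & 3u) + 1u;
    }
    return len;
}

/* Exhaustive check over all 256 key bytes. */
static void check_all_keys(void) {
    for (int k = 0; k < 256; k++) {
        assert(key_data_length((uint8_t)k) == key_data_length_ref((uint8_t)k));
    }
}
```

With that mask the shifts do line up: the exhaustive check passes for every key byte.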
@lemire @aqrit I started a branch to see if precomputing byte lengths 8-at-a-time and prefix summing would speed up the decode loop. The pointer arithmetic in each decode_avx seemed like a barrier to ILP, and this modification makes it possible to dispatch the 8 decode_avx calls independently.
However it has proved slower for the two methods I've tried (scalar shift/mask/add and a pshufb LUT, both similar to the blog article).
I suspect the pshufb's are already serialized (there is only one execution port for them, I believe), so the length processing is not on the critical path.
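For reference, the scalar shift/mask/add variant could look roughly like the sketch below. This is not the code from the branch; the function names and the choice of packing 8 key bytes into one `uint64_t` are assumptions for illustration. The prefix sum uses a multiply by `0x0101010101010101`, which is safe here because the largest running total is 8 * 16 = 128 and fits in a byte.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte data lengths (4..16) for 8 key bytes packed in a 64-bit word. */
static inline uint64_t lengths8(uint64_t keys) {
    const uint64_t m2 = 0x3333333333333333ULL;
    const uint64_t m4 = 0x0F0F0F0F0F0F0F0FULL;
    /* Add adjacent 2-bit fields: each nybble now holds a value 0..6. */
    uint64_t pairs = (keys & m2) + ((keys >> 2) & m2);
    /* Add the two nybbles of each byte: low nybble holds 0..12. */
    uint64_t sums = (pairs + (pairs >> 4)) & m4;
    /* Each of the four fields stores (byte count - 1). */
    return sums + 0x0404040404040404ULL;
}

/* Inclusive prefix sums: byte i holds the sum of lengths 0..i. */
static inline uint64_t prefix_sums8(uint64_t lengths) {
    return lengths * 0x0101010101010101ULL;
}

/* Compare both helpers against a plain loop for one packed key word. */
static void check_keys(uint64_t keys) {
    uint64_t lens = lengths8(keys);
    uint64_t sums = prefix_sums8(lens);
    unsigned running = 0;
    for (int i = 0; i < 8; i++) {
        unsigned key = (unsigned)((keys >> (8 * i)) & 0xFF);
        unsigned len = 4;
        for (int f = 0; f < 4; f++) {
            len += (key >> (2 * f)) & 3u;
        }
        running += len;
        assert(((lens >> (8 * i)) & 0xFF) == len);
        assert(((sums >> (8 * i)) & 0xFF) == running);
    }
}
```

The start offset of block i is the inclusive sum of block i-1 (that is, `prefix_sums8(lens) << 8`), so the 8 decode calls no longer depend on each other's pointer updates.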
could eventually become:
If the literal data bytes get loaded at the high-end of the xmmword...
then the lowest byte of the shuffle mask will also be the number of unused bytes in the data xmmword
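To illustrate that observation, here is a small scalar sketch that builds a pshufb-style mask under the assumption that the data bytes occupy the top of the 16-byte register. The function name and layout are hypothetical, not taken from the library. Lanes that must be zero use `0xFF`, since pshufb zeroes any lane whose selector has its high bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Build a shuffle mask for one key byte, assuming the `len` data bytes
   were loaded into bytes (16 - len)..15 of the xmmword.
   Returns len, the number of data bytes the key describes. */
static unsigned build_high_aligned_mask(uint8_t key, uint8_t mask[16]) {
    unsigned len = 0;
    for (int i = 0; i < 4; i++) {
        len += ((key >> (2 * i)) & 3u) + 1u;
    }
    unsigned src = 16u - len; /* index of the first data byte */
    for (int i = 0; i < 4; i++) {
        unsigned n = ((key >> (2 * i)) & 3u) + 1u; /* bytes for integer i */
        for (unsigned b = 0; b < 4; b++) {
            mask[4 * i + b] = (b < n) ? (uint8_t)src++ : (uint8_t)0xFF;
        }
    }
    return len;
}

/* The first integer always has at least one byte, so mask[0] is the index
   of the first data byte, which equals the count of unused low bytes. */
static void check_mask_property(void) {
    for (int k = 0; k < 256; k++) {
        uint8_t mask[16];
        unsigned len = build_high_aligned_mask((uint8_t)k, mask);
        assert(mask[0] == 16u - len);
    }
}
```

Under this layout `mask[0]` equals `16 - len` for every key, so the length could be recovered from the mask itself rather than from a separate table.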