New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute quickly the byte lengths without look-ups #12

Open
lemire opened this Issue Nov 7, 2017 · 7 comments

Comments

Projects
None yet
3 participants
@lemire
Owner

lemire commented Nov 7, 2017

Some look-ups could be efficiently replaced by fast instructions such as a pdep followed by a multiplication and a shift. It is unlikely to be generally faster than a look-up, but it might be worth exploring.

@KWillets

This comment has been minimized.

Show comment
Hide comment
@KWillets

KWillets Nov 7, 2017

Collaborator

The length is 1 + the rightmost index in the decoding shuffle, which is within the last four entries.

Collaborator

KWillets commented Nov 7, 2017

The length is 1 + the rightmost index in the decoding shuffle, which is within the last four entries.

lemire added a commit that referenced this issue Nov 7, 2017

@lemire

This comment has been minimized.

Show comment
Hide comment
@lemire

lemire Nov 7, 2017

Owner

As per this commit dac1b23 , @KWillets's approach can be enabled by defining a macro. I am concerned about making this a default without enough hard numbers.

Note that I have ported the approach to neon (it was trivial).

Owner

lemire commented Nov 7, 2017

As per this commit dac1b23 , @KWillets's approach can be enabled by defining a macro. I am concerned about making this a default without enough hard numbers.

Note that I have ported the approach to neon (it was trivial).

@aqrit

This comment has been minimized.

Show comment
Hide comment
@aqrit

aqrit Nov 23, 2017

Collaborator
// is mod15 faster than pdep?
len = ((key * 0x0401 & 0x00033033) % 0x0F) + 4; 

// simd
v = ((v >> 2) & 0x33333333) + (v & 0x33333333);
v = ((v >> 4) & 0x0F0F0F0F) + (v & 0x0F0F0F0F);
len4 = v + 0x04040404;
// todo: now extract each byte...
Collaborator

aqrit commented Nov 23, 2017

// is mod15 faster than pdep?
len = ((key * 0x0401 & 0x00033033) % 0x0F) + 4; 

// simd
v = ((v >> 2) & 0x33333333) + (v & 0x33333333);
v = ((v >> 4) & 0x0F0F0F0F) + (v & 0x0F0F0F0F);
len4 = v + 0x04040404;
// todo: now extract each byte...
@KWillets

This comment has been minimized.

Show comment
Hide comment
@KWillets

KWillets Nov 23, 2017

Collaborator

It is definitely susceptible to various bithacks. I used the permutation table since it's available, but the 2-bit fields could also be summed in a few instructions.

mod15 compiles to a multiply and some shifts and subtracts, so it may be decomposable into fewer instructions in this context.

Generally I would spread the bitfields out and use a multiply to sum them into something like the most significant byte; that's what I do in the compressor when I have the 2-bit fields already spread out. Your method of multiplying by 0x401 and masking looks promising -- they're far enough apart to sum into a nybble at that point. Multiplying by (1<<28)|(1<<24)|(1<<16)|(1<<12) should drop them on top of each other in the high nybble, if I've got those shifts right.

Collaborator

KWillets commented Nov 23, 2017

It is definitely susceptible to various bithacks. I used the permutation table since it's available, but the 2-bit fields could also be summed in a few instructions.

mod15 compiles to a multiply and some shifts and subtracts, so it may be decomposable into fewer instructions in this context.

Generally I would spread the bitfields out and use a multiply to sum them into something like the most significant byte; that's what I do in the compressor when I have the 2-bit fields already spread out. Your method of multiplying by 0x401 and masking looks promising -- they're far enough apart to sum into a nybble at that point. Multiplying by (1<<28)|(1<<24)|(1<<16)|(1<<12) should drop them on top of each other in the high nybble, if I've got those shifts right.

@lemire

This comment has been minimized.

Show comment
Hide comment
@lemire
Owner

lemire commented Nov 28, 2017

@KWillets

This comment has been minimized.

Show comment
Hide comment
@KWillets

KWillets May 8, 2018

Collaborator

@lemire @aqrit I started a branch to see if precomputing byte lengths 8-at-a-time and prefix summing would speed up the decode loop. The pointer arithmetic in each decode_avx seemed like a barrier to ILP, and this modification makes it possible to dispatch the 8 decode_avx calls independently.

However it has proved slower for the two methods I've tried (scalar shift/mask/add and a pshufb LUT, both similar to the blog article).

I suspect the pshufb's are already single-threaded (there is only one port for them, I believe), and the length processing is not the critical path.

Collaborator

KWillets commented May 8, 2018

@lemire @aqrit I started a branch to see if precomputing byte lengths 8-at-a-time and prefix summing would speed up the decode loop. The pointer arithmetic in each decode_avx seemed like a barrier to ILP, and this modification makes it possible to dispatch the 8 decode_avx calls independently.

However it has proved slower for the two methods I've tried (scalar shift/mask/add and a pshufb LUT, both similar to the blog article).

I suspect the pshufb's are already single-threaded (there is only one port for them, I believe), and the length processing is not the critical path.

@aqrit

This comment has been minimized.

Show comment
Hide comment
@aqrit

aqrit Sep 29, 2018

Collaborator

@KWillets suggestion:
len = pshuf[12 + (key >> 6)] + 1;

could eventually become:
len = 16 - pshuf[0];

If the literal data bytes get loaded at the high-end of the xmmword...
(rather than the low-end as is currently done)

then the lowest byte of the shuffle mask will also be the number of unused bytes in the data xmmword
(which can be converted to length at little to no additional cost)

gist here

Collaborator

aqrit commented Sep 29, 2018

@KWillets suggestion:
len = pshuf[12 + (key >> 6)] + 1;

could eventually become:
len = 16 - pshuf[0];

If the literal data bytes get loaded at the high-end of the xmmword...
(rather than the low-end as is currently done)

then the lowest byte of the shuffle mask will also be the number of unused bytes in the data xmmword
(which can be converted to length at little to no additional cost)

gist here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment