
rejigger bit counting intrinsics #898

Merged
merged 2 commits into lz4:fasterCount on Aug 24, 2020

Conversation

aqrit
Contributor

@aqrit aqrit commented Aug 12, 2020

Fix #867 (Improve compression speed on ZEN/ZEN2)
Optimize software fallback routines.
Delete some faulty (and dead?) MSVC big endian code.

@Cyan4973
Member

Could you publish some benchmark and/or profile results
showing that the new software fallback is more efficient?

@aqrit
Contributor Author

aqrit commented Aug 13, 2020

The new little endian fallback is unlikely to show up in any "fair" benchmark.
If we're against nonessential changes, I can switch it back to the DeBruijn method.

Static analysis (https://godbolt.org/z/cYqxTW) shows that the proposed method has lower latency.
It replaces the table lookup with a bitwise-and.

clang -O3 -fPIC
    DeBruijn:
        Instructions:      8
        Total Cycles:      15
    HSum:
        Instructions:      7
        Total Cycles:      10

clang -O3 -march=skylake
    DeBruijn:
        Instructions:      5
        Total Cycles:      13
    HSum:
        Instructions:      6
        Total Cycles:      9

LLVM-MCA assigns a latency of 5 cycles to the table lookup.
Even assigning only 2 cycles to the table lookup does not make the DeBruijn method's latency better
than the HSum method's.
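
For reference, a minimal sketch of the two methods under discussion, shown on 32-bit values for brevity. The names are mine and the DeBruijn constants are the standard Bit Twiddling Hacks ones, so the PR's actual code may differ in detail:

#include <stdint.h>
#include <assert.h>

/* DeBruijn method: isolate the lowest set bit, hash it through a
 * 32-entry position table, then convert the bit index to a byte count. */
static unsigned nb_common_bytes_debruijn(uint32_t val)
{
    static const unsigned char bitpos[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    assert(val != 0);
    return bitpos[((val & (0u - val)) * 0x077CB531U) >> 27] >> 3;
}

/* HSum method: set every bit at and below the lowest set bit, keep one
 * flag bit per byte, then sum the flags with a single multiply; the
 * table lookup is replaced by a bitwise-and. */
static unsigned nb_common_bytes_hsum(uint32_t val)
{
    const uint32_t m = 0x01010101U;
    assert(val != 0);
    val ^= val - 1;                       /* ones at and below the lowest set bit */
    return ((val & (m - 1)) * m) >> 24;   /* top byte sums the per-byte flags */
}

Both return the number of trailing zero bytes of val (0-3 at this width); the 64-bit variants are analogous, with 64-bit constants.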


I have no clue how a typical big endian machine behaves.

The proposed 64-bit big-endian method would be an "obvious" win
if the CPU has fast 64-bit multiplication and expensive branch misses
(see the branchy sketch at the end of this comment for the style it competes against).

The proposed 32-bit big-endian method has good reciprocal throughput - when unrolled on x64 - but
I am concerned about how it might perform in this context.
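
For contrast, a branchy byte count of the kind that suffers from branch misses looks roughly like this (a simplified illustration, not the actual lz4 fallback):

#include <stdint.h>

/* count leading zero bytes with a binary search; each test is a
 * conditional branch, costly when match lengths are unpredictable */
static unsigned nb_lead_zero_bytes_branchy(uint64_t val)
{
    unsigned r = 0;
    if (!(val >> 32)) { r = 4; val <<= 32; }
    if (!(val >> 48)) { r += 2; val <<= 16; }
    if (!(val >> 56)) { r += 1; }
    return r;
}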

@Cyan4973
Member

Cyan4973 commented Aug 17, 2020

The analysis is fine.
The provided figures, in instructions and cycles, are a good enough start.

This could be completed with more dynamic measurements, with a tool like perf, though I understand that it's more difficult to do, since this part of the algorithm is dwarfed by other costs in the compression routine.

> The new little endian fallback is unlikely to show up in any "fair" benchmark.

This part of the code, counting bytes to extract the match length, is among the hottest ones.
I would have expected that forcing the algorithm to use the software fallback
would make it possible to notice a difference, even a small one.

If I remember correctly, the difference between the hardware and software code paths for byte counting was measurable.

Also, it would provide a way to ensure that the code is indeed correct.

> If we're against nonessential changes

There is no such thing. If the benefit seems good enough, this code can be merged.
It's just that it wouldn't be the first time that a performance tweak that "seems fine" on paper ends up being strangely detrimental in practice. Hence, nothing can or should replace direct measurements.

Also, some of these "improvements" sometimes introduce other concerns, such as complexity, code size, or portability limitations.
The good news here is that your PR seems free of those concerns for the most part,
except maybe in one code path, where it adds a 128-byte table where none was previously present.

@aqrit
Contributor Author

aqrit commented Aug 17, 2020

make CFLAGS=-DLZ4_FORCE_SW_BITCOUNT
sudo perf record ./lz4 -b1 silesia.tar out
sudo perf report

LZ4_NbCommonBytes
original: 0.83%
patch: 0.66%

@Cyan4973
Member

This looks good.
I believe it's the right direction,
at the very least for the little-endian software fallback.

For the big-endian one,
I have some reservations about the 64-bit code,
because it adds a substantial lookup table (128 bytes).
I can imagine some push-back on this.

Hence I was wondering if an alternative using less table space would be achievable?
It looks as if the table could be folded in some way, due to large repetitions within columns.

MSVC debug mode complains
@aqrit aqrit marked this pull request as ready for review August 17, 2020 22:17
@aqrit
Contributor Author

aqrit commented Aug 17, 2020

Nothing good comes to mind.
There is no fast way to isolate the highest set bit or to smear it rightward.
In this case, the extracted bits get reversed (for no real reason), which turns it into a 7-bit count-trailing-zeros problem.
That could be solved in the usual ways (popcnt, DeBruijn, etc.), but it costs a lot of extra instructions.

The table size could be halved at the cost of 3 instructions (1 on the critical path):

static const unsigned char ctz7_tab[64] = {
	7, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
	5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
	6, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
	5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1
};
/* t holds one flag bit per byte, derived from val's zero-byte pattern */
const U64 mask = 0x0101010101010101ULL;
U64 t = (((val >> 8) - mask) | val) & mask;
/* the multiply gathers six of the flags, reversed, into the top bits;
   the final AND forces the result to 0 when the top byte is nonzero */
return ctz7_tab[(t * 0x0080402010080402ULL) >> 58] & ((t >> 56) - 1);
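
Since bit tricks like this are easy to get subtly wrong, here is a hypothetical self-check harness (the names are mine, not from the PR). The routine operates on the integer value, so the check runs on any host, not just a big-endian one:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t U64;

/* the halved-table routine from the comment above, wrapped as a function */
static unsigned nb_bytes_halved_tab(U64 val)
{
    static const unsigned char ctz7_tab[64] = {
        7, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
        5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
        6, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
        5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1
    };
    const U64 mask = 0x0101010101010101ULL;
    U64 t = (((val >> 8) - mask) | val) & mask;
    return (unsigned)(ctz7_tab[(t * 0x0080402010080402ULL) >> 58] & ((t >> 56) - 1));
}

/* reference: count leading zero bytes by scanning down from the MSB */
static unsigned nb_bytes_ref(U64 val)
{
    unsigned n = 0;
    while (n < 7 && !(val >> 56)) { val <<= 8; n++; }
    return n;
}

int main(void)
{
    long i;
    for (i = 0; i < 1000000; i++) {
        U64 v = ((U64)rand() << 42) ^ ((U64)rand() << 21) ^ (U64)rand();
        v >>= (rand() % 64);    /* exercise varied leading-zero counts */
        if (v == 0) continue;   /* the routine assumes a nonzero input */
        if (nb_bytes_halved_tab(v) != nb_bytes_ref(v)) {
            printf("mismatch: 0x%016llx\n", (unsigned long long)v);
            return 1;
        }
    }
    printf("all ok\n");
    return 0;
}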

@Cyan4973 Cyan4973 changed the base branch from dev to fasterCount August 24, 2020 22:10
@Cyan4973
Member

Integrating into a feature branch,
in order to adjust comments.

@Cyan4973 Cyan4973 merged commit ee0a3cf into lz4:fasterCount Aug 24, 2020