
rejigger bit counting intrinsics #898

Merged
merged 2 commits into lz4:fasterCount on Aug 24, 2020

Conversation

aqrit
Contributor

@aqrit aqrit commented Aug 12, 2020

Fix #867 (Improve compression speed on ZEN/ZEN2)
Optimize software fallback routines.
Delete some faulty (and dead?) MSVC big endian code.

@Cyan4973
Member

Could you publish some benchmark and/or profile results
showing that the new software fallback is more efficient?

@aqrit
Contributor Author

aqrit commented Aug 13, 2020

The new little endian fallback is unlikely to show up in any "fair" benchmark.
If we're against nonessential changes, I can switch it back to the DeBruijn method.

Static analysis (https://godbolt.org/z/cYqxTW) shows that the proposed method has lower latency.
It replaces the table lookup with a bitwise-and.

clang -O3 -fPIC
    DeBruijn:
        Instructions:      8
        Total Cycles:      15
    HSum:
        Instructions:      7
        Total Cycles:      10

clang -O3 -march=skylake
    DeBruijn:
        Instructions:      5
        Total Cycles:      13
    HSum:
        Instructions:      6
        Total Cycles:      9

LLVM-MCA assigns a latency of 5 cycles to the table lookup.
Even assigning only 2 cycles to the table lookup does not make the DeBruijn method's latency better
than the HSum method's.
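
For reference, a minimal sketch of the two methods under discussion, shown on 32-bit values for brevity. The names are mine and the DeBruijn constants are the standard Bit Twiddling Hacks ones, so the PR's actual code may differ in detail:

#include <stdint.h>
#include <assert.h>

/* DeBruijn method: isolate the lowest set bit, hash it through a
 * 32-entry position table, then convert the bit index to a byte count. */
static unsigned nb_common_bytes_debruijn(uint32_t val)
{
    static const unsigned char bitpos[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    assert(val != 0);
    return bitpos[((val & (0u - val)) * 0x077CB531U) >> 27] >> 3;
}

/* HSum method: set every bit at and below the lowest set bit, keep one
 * flag bit per byte, then sum the flags with a single multiply; the
 * table lookup is replaced by a bitwise-and. */
static unsigned nb_common_bytes_hsum(uint32_t val)
{
    const uint32_t m = 0x01010101U;
    assert(val != 0);
    val ^= val - 1;                       /* ones at and below the lowest set bit */
    return ((val & (m - 1)) * m) >> 24;   /* top byte sums the per-byte flags */
}

Both return the number of trailing zero bytes of val (0-3 at this width); the 64-bit variants are analogous, with 64-bit constants.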


I have no clue how a typical big endian machine behaves.

The proposed 64-bit big-endian method would be an "obvious" win
if the CPU has fast 64-bit multiplication and expensive branch misses
(see the branchy sketch at the end of this comment for the style it competes against).

The proposed 32-bit big-endian method has good reciprocal throughput - when unrolled on x64 - but
I am concerned about how it might perform in this context.
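
For contrast, a branchy byte count of the kind that suffers from branch misses looks roughly like this (a simplified illustration, not the actual lz4 fallback):

#include <stdint.h>

/* count leading zero bytes with a binary search; each test is a
 * conditional branch, costly when match lengths are unpredictable */
static unsigned nb_lead_zero_bytes_branchy(uint64_t val)
{
    unsigned r = 0;
    if (!(val >> 32)) { r = 4; val <<= 32; }
    if (!(val >> 48)) { r += 2; val <<= 16; }
    if (!(val >> 56)) { r += 1; }
    return r;
}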

@Cyan4973
Member

Cyan4973 commented Aug 17, 2020

The analysis is fine.
The provided figures, in instructions and cycles, are a good enough start.

This could be completed with more dynamic measurements, with a tool like perf, though I understand that it's more difficult to do, since this part of the algorithm is dwarfed by other costs in the compression routine.

> The new little endian fallback is unlikely to show up in any "fair" benchmark.

This part of the code, counting bytes to extract the match length, is among the hottest ones.
I would have expected that forcing the algorithm to use the software fallback
would make it possible to notice a difference, even a small one.

If I remember correctly, the difference between the hardware and software code paths for byte counting was measurable.

Also, it would provide a way to ensure that the code is indeed correct.

> If we're against nonessential changes

There is no such thing. If the benefit seems good enough, this code can be merged.
It's just that it wouldn't be the first time that a performance tweak that "seems fine" on paper ends up being strangely detrimental in practice. Hence, nothing can or should replace direct measurements.

Also, some of these "improvements" sometimes introduce other concerns, such as complexity, code size, or portability limitations.
The good news here is that your PR seems free of those concerns for the most part,
except maybe in one code path, where it adds a 128-byte table where none was previously present.

@aqrit
Contributor Author

aqrit commented Aug 17, 2020

make CFLAGS=-DLZ4_FORCE_SW_BITCOUNT
sudo perf record ./lz4 -b1 silesia.tar out
sudo perf report

LZ4_NbCommonBytes
original: 0.83%
patch: 0.66%

@Cyan4973
Member

This looks good.
I believe it's the right direction,
at the very least for the little-endian software fallback.

For the big-endian one,
I have some reservations about the 64-bit code,
because it adds a substantial lookup table (128 bytes).
I can imagine some push-back on this.

Hence I was wondering if an alternative using less table space would be achievable?
It looks as if the table could be folded in some way, due to large repetitions within columns.

MSVC debug mode complains
@aqrit aqrit marked this pull request as ready for review August 17, 2020 22:17
@aqrit
Contributor Author

aqrit commented Aug 17, 2020

Nothing good comes to mind.
There is no fast way to isolate the highest set bit or to smear it rightward.
In this case, the extracted bits get reversed (for no real reason), which turns it into a 7-bit count-trailing-zeros problem.
That could be solved in the usual ways (popcnt, DeBruijn, etc.), but it costs a lot of extra instructions.

The table size could be halved at the cost of 3 instructions (1 on the critical path):

static const unsigned char ctz7_tab[64] = {
	7, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
	5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
	6, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
	5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1
};
/* t holds one flag bit per byte, derived from val's zero-byte pattern */
const U64 mask = 0x0101010101010101ULL;
U64 t = (((val >> 8) - mask) | val) & mask;
/* the multiply gathers six of the flags, reversed, into the top bits;
   the final AND forces the result to 0 when the top byte is nonzero */
return ctz7_tab[(t * 0x0080402010080402ULL) >> 58] & ((t >> 56) - 1);
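
Since bit tricks like this are easy to get subtly wrong, here is a hypothetical self-check harness (the names are mine, not from the PR). The routine operates on the integer value, so the check runs on any host, not just a big-endian one:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t U64;

/* the halved-table routine from the comment above, wrapped as a function */
static unsigned nb_bytes_halved_tab(U64 val)
{
    static const unsigned char ctz7_tab[64] = {
        7, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
        5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
        6, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1,
        5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1
    };
    const U64 mask = 0x0101010101010101ULL;
    U64 t = (((val >> 8) - mask) | val) & mask;
    return (unsigned)(ctz7_tab[(t * 0x0080402010080402ULL) >> 58] & ((t >> 56) - 1));
}

/* reference: count leading zero bytes by scanning down from the MSB */
static unsigned nb_bytes_ref(U64 val)
{
    unsigned n = 0;
    while (n < 7 && !(val >> 56)) { val <<= 8; n++; }
    return n;
}

int main(void)
{
    long i;
    for (i = 0; i < 1000000; i++) {
        U64 v = ((U64)rand() << 42) ^ ((U64)rand() << 21) ^ (U64)rand();
        v >>= (rand() % 64);    /* exercise varied leading-zero counts */
        if (v == 0) continue;   /* the routine assumes a nonzero input */
        if (nb_bytes_halved_tab(v) != nb_bytes_ref(v)) {
            printf("mismatch: 0x%016llx\n", (unsigned long long)v);
            return 1;
        }
    }
    printf("all ok\n");
    return 0;
}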

@Cyan4973 Cyan4973 changed the base branch from dev to fasterCount August 24, 2020 22:10
@Cyan4973
Member

Integrating into a feature branch,
in order to adjust comments.

@Cyan4973 Cyan4973 merged commit ee0a3cf into lz4:fasterCount Aug 24, 2020