Add CRC32 calculation #5
I've just added a naive SSE4.2-based CRC32 calculation.
I think something like CRC32 would deserve its own benchmark file. I would generate some random data and then run CRC32 on it. Displaying the result as a bandwidth makes it comparable to all other benchmarks. Anything above 500 MB/s shouldn't impact gzip decompression (~250 MB/s) very much, except for non-compressed gzip files (> 4 GB/s).
This is under the assumption that CRC32 itself can be computed in parallel and that the separate results can then simply be combined; else I'd like to see CRC32 bandwidths of 10 GB/s and more... Also, I'm not sure whether SSE4 would be necessary. There are other algorithms like slice-by-N I would also try. But this would be something nice for the benchmark file: comparing different CRC32 implementations against each other.
It seems that my naive approach to accelerate CRC32 did not work.
How are you testing this? The CRC32 is only calculated for …
I'm using gcc for compiling. The result:

With `-P 0`, the result is approximately 2× as fast on my machine (I have 4 cores).
These are not the speeds I'd want to see :/. Could you please try to run with …?

Edit: OK, if you only have 4 (virtual) cores, i.e. 2 physical cores, then a speedup of 2 sounds realistic. A speedup of 2 might even be somewhat realistic for 4 cores, though. A direct comparison to gzip would be interesting. On my system, gzip can achieve ~200 MB/s per core.
Actually, I've branched my sse4 branch off your develop branch...
Yes, there is no profiling for … I guess you just have a "slow" CPU, similar to one of the server CPUs running at 2 GHz, which I used for my scaling benchmarks up to 128 cores. A direct comparison to … would be interesting. If …
Here are the gunzip results:
In that case, why is pragzip not complaining?
See my comments in your linked patch commit. You only added the CRC32 for non-compressed blocks. There might be no non-compressed blocks in your test data.
Furthermore, I'm sorry about that, but I think I mentioned that CRC32 is not used for the parallel code, and I wasn't sure whether it is used for the serial code... Turns out it isn't :( See pragzip.cpp:

```cpp
pragzip::GzipReader</* CRC32 */ false> gzipReader{ std::move( inputFile ) };
```

Before and after flipping that flag:

```
Decompressed in total 536870912 B in 2.21557 s -> 242.317 MB/s
> m pragzip && src/tools/pragzip -v -d -o /dev/null -P 1 test-files/small/small.gz
Decompressed in total 536870912 B in 2.81561 s -> 190.677 MB/s
```

22% slowdown caused by CRC32.
Ahh, I see... Anyway, I've recruited ChatGPT for help, which produced the following using the correct polynomial:

```cpp
#include <cstdint>
#include <cstddef>
#include <immintrin.h>

static constexpr uint32_t CRC_POLY = 0xEDB88320;

// Precalculate the CRC lookup table
static uint32_t CRC_TABLE[256];

static void init_crc_table() {
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t crc = i;
        for (int j = 0; j < 8; ++j) {
            if (crc & 1) {
                crc = CRC_POLY ^ (crc >> 1);
            } else {
                crc >>= 1;
            }
        }
        CRC_TABLE[i] = crc;
    }
}

// Update 8 bytes using AVX2 instructions
static inline __m256i crc32_avx2(__m256i crc, const uint8_t* data) {
    __m256i data_reg = _mm256_lddqu_si256((__m256i const*)data);
    // XOR the data with the current CRC value
    __m256i xor_data = _mm256_xor_si256(data_reg, crc);
    // Use a mask to select the correct element of the CRC table for each element of the vector
    __m256i mask = _mm256_set_epi32(0x03020100, 0x03020100, 0x03020100, 0x03020100,
                                    0x03020100, 0x03020100, 0x03020100, 0x03020100);
    __m256i lookup = _mm256_i32gather_epi32((const int*)CRC_TABLE,
                                            _mm256_and_si256(xor_data, _mm256_set1_epi32(0xFF)), 4);
    lookup = _mm256_permutevar8x32_epi32(lookup, mask);
    // Shift the CRC right by 8 bits
    crc = _mm256_srli_epi32(crc, 8);
    // XOR the lookup table with the shifted CRC value
    crc = _mm256_xor_si256(crc, lookup);
    return crc;
}

// Update the CRC value for the remaining data
uint32_t crc32_remainder_avx2(uint32_t crc, const uint8_t* data, size_t size) {
    __m256i crc_reg = _mm256_set1_epi32(crc ^ 0xFFFFFFFF);
    // Update 8 bytes at a time
    while (size >= 8) {
        crc_reg = crc32_avx2(crc_reg, data);
        data += 8;
        size -= 8;
    }
    // Extract the final CRC value from the vector
    crc_reg = _mm256_xor_si256(crc_reg, _mm256_set1_epi32(0xFFFFFFFF));
    uint32_t result = _mm256_extract_epi32(crc_reg, 0);
    return result;
}
```
As far as I understand, this SIMD version simply uses the AVX table lookup to speed up computation. But it would have to be benchmarked whether it is actually faster than using a larger table and no SIMD. Btw, you can write …
I'm not sure I follow. Alternatively, you could check in which loops …
I was thinking about …
Ah, OK. Well, the problem is the condition for when to trigger the CRC32 update. Only on wrap-around is insufficient, as you already pointed out. It seems to me that this can only be done from the calling sites of …
This can be used as a testing ground because the hashes are printed out, and for comparing benchmarks. |
Slice-by-N works wonders! It's mesmerizing to see that slice-by-N lookup tables can reach speeds similar to SIMD intrinsics. This isn't even the first or second LUT used in pragzip; I make heavy use of them everywhere. Since this project, I've started to see lookup tables / CPU caches as something akin to FPGAs: you can define any arbitrary byte-to-byte or even word-to-word mapping/operation with a sufficiently fast and large L1 cache.

```
> m benchmarkCRC32 && src/benchmarks/benchmarkCRC32
Initializing random data for benchmark... Done (1.43165 s)
[Compute CRC32 (LUT)]           ( min: 516.417, 517.3 +- 1.0, max: 519.571 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 4)]    ( min: 1402.55, 1414 +- 5, max: 1419.92 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 8)]    ( min: 2553.45, 2588 +- 19, max: 2618.4 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 12)]   ( min: 3602.26, 3760 +- 60, max: 3808.64 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 16)]   ( min: 3869.64, 3970 +- 50, max: 4038.77 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 20)]   ( min: 2586.97, 2627 +- 23, max: 2644.93 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 24)]   ( min: 2956.9 , 2988 +- 12, max: 2997.68 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 32)]   ( min: 2736.25, 2806 +- 29, max: 2828.43 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 64)]   ( min: 2104.77, 2139 +- 13, max: 2150.09 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (_mm_crc32_u32)] ( min: 5212.63, 5280 +- 50, max: 5351.18 ) MB/s -> Result: 0xAFDBD4A7
[Compute CRC32 (_mm_crc32_u64)] ( min: 9012.49, 9700 +- 400, max: 10155.2 ) MB/s -> Result: 0xAFDBD4A7
```

I wonder if some of those 32-bit operations could be implemented with 64-bit or even SIMD... The table lookup might even work with SIMD. With explicit loop unrolling (and without …)
Yes, impressive!
Because they compute the CRC-32C (Castagnoli), something completely different and therefore unusable for pragzip. It uses 0x82F63B78 as the generator polynomial while CRC-32 uses 0xEDB88320. |
This kind of CRC can be sped up with … Then again, this implementation in Rust "only" shows a bandwidth of 7.3 GB/s. This doesn't sound like that much of a deal compared to the 4.6 GB/s which I have reached with simple lookup tables. However, it might improve cache behavior by getting rid of those lookup tables. The lookup tables are of size …
Added with 08b453f. It adds ~5-6% overhead. Further to-dos (might create another issue to track those):
Similar to pugz, CRC32 is currently not implemented because it introduces performance and complexity overhead, and because, in my opinion, the fact that the end of the file can be reached is already quite a strong sanity check.
In order to parallelize the CRC32 computation, using the linearity of CRC32 to combine per-chunk results might work. The index could also store checksums for each deflate block or chunk to add finer-grained checks when the index exists.