
Optimize decompress generic #645

Merged: 10 commits into lz4:dev, Feb 12, 2019

@djwatson (Contributor) commented Jan 29, 2019:

The current decompress_generic code suffers from many blocked loads, as originally mentioned in issue #411. This patchset attempts to fix that issue. Similar issues are mentioned in pull #470, issue #126, and probably others.

The heart of the issue is that LZ4_wildCopy copies only 8 bytes at a time, which places dependent loads and stores close together and leads to loads blocked by in-flight stores. The farther apart we can spread these copies by extending the copy width, the faster things seem to go. Increasing the copy width to as much as 32 bytes results in an overall decompression speed improvement of roughly 10% on a Xeon and 20% on a modern AMD.

The primary code challenge, then, is that we have only 8 bytes of WILDCOPYLENGTH slack, when we'd really like far more. So we end up duplicating most of the decode loop to add a fastpath for a larger copy length; the last handful of bytes then uses the original loop.
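For reference, here is a minimal sketch of the wider copy, in the shape of the LZ4_wildCopy32 that the patchset introduces (BYTE and LZ4_FORCE_INLINE are lib/lz4.c internals; memcpy comes from <string.h>):

/* Like LZ4_wildCopy, but moves 32 bytes per iteration, so it may
 * overwrite up to 31 bytes past dstEnd; callers must guarantee that
 * much slack in the output buffer. */
LZ4_FORCE_INLINE void LZ4_wildCopy32(void* dstPtr, const void* srcPtr, void* dstEnd)
{
    BYTE* d = (BYTE*)dstPtr;
    const BYTE* s = (const BYTE*)srcPtr;
    BYTE* const e = (BYTE*)dstEnd;

    do { memcpy(d,s,16); memcpy(d+16,s+16,16); d+=32; s+=32; } while (d<e);
}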

I've split up the code into what I hope are readable small diffs. Testing has mostly consisted of 'make test' and an attempt at fuzzing with afl. It has not been tested on other arches, but it would be fairly easy to #ifdef the first loop away and fall back to the previous behavior if a regression shows up.

Benchmarks on an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, from lzbench. The final percentage is how much faster this patchset is vs. the current dev branch:

./lzbench -elz4 silesia/*

lz4 1.8.3                 297 MB/s  2598 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 328 MB/s  2452 MB/s     6428742  63.07 silesia/dickens    -5% 

lz4 1.8.3                 472 MB/s  2374 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 493 MB/s  2658 MB/s    26435667  51.61 silesia/mozilla        12%

lz4 1.8.3                 454 MB/s  2778 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 493 MB/s  3037 MB/s     5440937  54.57 silesia/mr             9%

lz4 1.8.3                 796 MB/s  3240 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 877 MB/s  4061 MB/s     5533040  16.49 silesia/nci            25%

lz4 1.8.3                 385 MB/s  2383 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 426 MB/s  2709 MB/s     4338918  70.53 silesia/ooffice        14%

lz4 1.8.3                 413 MB/s  2463 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 439 MB/s  2950 MB/s     5256666  52.12 silesia/osdb           20%

lz4 1.8.3                 303 MB/s  2181 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 331 MB/s  2182 MB/s     3181387  48.00 silesia/reymont        0%

lz4 1.8.3                 511 MB/s  2806 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 552 MB/s  3330 MB/s     7716839  35.72 silesia/samba          19%

lz4 1.8.3                 381 MB/s  2899 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 412 MB/s  3586 MB/s     6790273  93.63 silesia/sao            24%

lz4 1.8.3                 448 MB/s  2475 MB/s   100881331  47.60 silesia/silesia.tar
lz4 1.8.3                 474 MB/s  2784 MB/s   100881331  47.60 silesia/silesia.tar    12%

lz4 1.8.3                 344 MB/s  2395 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                 372 MB/s  2536 MB/s    20139988  48.58 silesia/webster        6%

lz4 1.8.3                1104 MB/s  7769 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                1032 MB/s  9056 MB/s     8390195  99.01 silesia/x-ray          17%

lz4 1.8.3                 715 MB/s  2932 MB/s     1227495  22.96 silesia/xml
lz4 1.8.3                 729 MB/s  3514 MB/s     1227495  22.96 silesia/xml            20%

An all-zeroes test is particularly ugly for the current code:


dd if=/dev/zero of=zerotest bs=1M count=2
dd if=/dev/random bs=65 count=1 >> zerotest
(Currently any match extending into the last 64 bytes uses the slow path, TODO)

lz4 1.8.3               15996 MB/s  4920 MB/s        8296   0.40 silesia/zerotest
lz4 1.8.3               16041 MB/s 10151 MB/s        8296   0.40 silesia/zerotest       106%

I also tested on an AMD 2400G; speedups vs. dev by file:

dickens          7%
mozilla          12%
mr               18%
nci              29%
ooffice          23%
osdb             28%
reymont          7%
samba            25%
sao              39%
silesia.tar      20%
webster          17%
xml              32%
x-ray            6%

The same issue likely exists in zstd (where it should be easier to fix, since parsing isn't mixed up with sequence execution).

@Cyan4973 (Member) commented Jan 29, 2019:

Looks great @djwatson!

I noticed some non-trivial differences in compression speed in your measurements.
Are they related to your patch too, or is it measurement noise?

@Cyan4973 (Member) commented Jan 29, 2019:

I confirm big decompression speed gains in benchmark measurements on silesia.
On my laptop, with clang, decompression speed went up from 2800 MB/s to 3000 MB/s.
On my desktop, with gcc, gains were more dramatic: from 3150 MB/s to 3700 MB/s.
Desktop + gcc8 + calgary: from 3200 MB/s to 3550 MB/s.

However, in all tests, compression speed was completely stable.
I don't yet understand why your measurements seem to show some differences on the compression side.

@djwatson (Author) commented Jan 29, 2019:

Any compression speed changes are noise; I didn't even notice that. It is a bit strange that they are all biased one way, I'll have to double-check.

I will work through the contbuild issues when I get a chance.

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from c1c2f5e to ab51e7b Jan 29, 2019

lib/lz4.c (outdated):
@@ -1434,6 +1504,26 @@ typedef enum { decode_full_block = 0, partial_decode = 1 } earlyEnd_directive;
#undef MIN
#define MIN(a,b) ( (a) < (b) ? (a) : (b) )

LZ4_FORCE_INLINE unsigned read_variable_length(const BYTE**ip, const BYTE* lencheck, int loop_check, int initial_check, int* error) {

@Cyan4973 (Member), Jan 29, 2019:

Could you document this function?
Especially its arguments and their roles?

@djwatson (Author), Jan 30, 2019:

Documented, and changed the error code to a typedef enum.
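For readers following along, here is a sketch of the documented helper, close to the shape it takes in the patch (the final version uses a typedef'd enum for the error code; a plain int flag is used here for brevity):

/* Reads the variable-length extension of a literal or match length:
 * each extra byte adds 0-255 to the length, and a byte of 255 means
 * another byte follows.
 *   ip            : input cursor, advanced past the length bytes
 *   lencheck      : first position at which *ip is no longer safe to read
 *   loop_check    : if nonzero, bounds-check ip on every iteration
 *   initial_check : if nonzero, bounds-check ip before the first read
 *   error         : set to nonzero on out-of-bounds input             */
LZ4_FORCE_INLINE unsigned
read_variable_length(const BYTE** ip, const BYTE* lencheck,
                     int loop_check, int initial_check, int* error)
{
    unsigned length = 0;
    unsigned s;
    if (initial_check && ((*ip) >= lencheck)) { *error = 1; return length; }
    do {
        s = **ip;
        (*ip)++;
        length += s;
        if (loop_check && ((*ip) >= lencheck)) { *error = 1; return length; }
    } while (s == 255);
    return length;
}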

@Cyan4973 (Member) commented Jan 29, 2019:

Thanks for this patch @djwatson, it's in pretty good shape!

Two remaining topics to get to the bottom of:

  • The main performance trick presented is to enlarge the stripe length, thus reducing the number of loop iterations, and hence of branches. But I see another modification in the code, with dedicated handling for overlapping matches. Both changes contribute to the same objective (decompression speed), yet they are distinct.
    • Is it possible to have a breakdown of the benefit of each part? I would expect larger stripes to dominate, but I could be wrong on that. Also, the new overlapping code is more complex, and I want to be convinced it "pays off" for its complexity. We have had bad surprises on this topic in the past.
  • Your performance tests show that the new decoder decompresses noticeably faster on recent Intel and AMD CPUs. But what about other architectures? Case in point, ARM devices, and even an aarch64 server if possible. We would like to be sure we don't degrade (too much) the performance of other architectures by overfitting too closely to current-generation x64 CPUs.
    • I can certainly have a direct look at this topic too, using some ARM-based smartphones around.
  • Nerd's question (optional): are you using perf or something similar to guide your investigation? If yes, do you have some numbers to share?

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from ab51e7b to 67e3099 Jan 30, 2019

@djwatson (Author) commented Jan 30, 2019:

  • Is it possible to have a breakdown of the benefit of each part? I would expect larger stripes to dominate, but I could be wrong on that. Also, the new overlapping code is more complex, and I want to be convinced it "pays off" for its complexity. We have had bad surprises on this topic in the past.

origin/dev:

lz4 1.8.3                 324 MB/s  2503 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 526 MB/s  2330 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 500 MB/s  2772 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 870 MB/s  3016 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 423 MB/s  2241 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 450 MB/s  2215 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 329 MB/s  2146 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 515 MB/s  2495 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 412 MB/s  2592 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 363 MB/s  2142 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1194 MB/s  8084 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 712 MB/s  2588 MB/s     1227495  22.96 silesia/xml

Without the last diff optimizing small offsets, so only increasing the copy size to 32 bytes:

lz4 1.8.3                 321 MB/s  2550 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 526 MB/s  2641 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 495 MB/s  3018 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 883 MB/s  3810 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 435 MB/s  2755 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 461 MB/s  2862 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 315 MB/s  2215 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 562 MB/s  3259 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 445 MB/s  3752 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 379 MB/s  2520 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1201 MB/s  8405 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 705 MB/s  3402 MB/s     1227495  22.96 silesia/xml

The whole series:

lz4 1.8.3                 318 MB/s  2918 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 498 MB/s  2717 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 491 MB/s  3372 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 837 MB/s  3937 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 425 MB/s  2821 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 453 MB/s  2937 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 314 MB/s  2488 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 551 MB/s  3517 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 443 MB/s  4002 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 376 MB/s  2746 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1222 MB/s  8647 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 717 MB/s  3610 MB/s     1227495  22.96 silesia/xml

So both the stripe copy size and the small-offset optimization seem to be helping. The stripe copy size helps more when there are large matches (xml) or large literal runs (sao).

  • Your performance tests show that the new decoder decompresses noticeably faster on recent Intel and AMD CPUs. But what about other architectures? Case in point, ARM devices, and even an aarch64 server if possible. We would like to be sure we don't degrade (too much) the performance of other architectures by overfitting too closely to current-generation x64 CPUs.

The only ARM device I own is a Raspberry Pi 2 (ARMv7), which shows a small win:

With this series, in MB/s:

mozilla 198
xml 168
sao 132
samba 152
x-ray 250
dickens 86

origin/dev

mozilla 196
xml 161
sao 140
samba 147
x-ray 243
dickens 86

I borrowed a switch with an ARMv8 Cortex-A72. It also looks like a small win (though not nearly as dramatic as x64):

With this series:

lz4 1.8.3                 198 MB/s  1669 MB/s     6428742  63.07 silesia/silesia/dickens
lz4 1.8.3                 276 MB/s  1976 MB/s    26435667  51.61 silesia/silesia/mozilla
lz4 1.8.3                 286 MB/s  1720 MB/s     5440937  54.57 silesia/silesia/mr
lz4 1.8.3                 478 MB/s  1951 MB/s     5533040  16.49 silesia/silesia/nci
lz4 1.8.3                 253 MB/s  2047 MB/s     4338918  70.53 silesia/silesia/ooffice
lz4 1.8.3                 236 MB/s  1712 MB/s     5256666  52.12 silesia/silesia/osdb
lz4 1.8.3                 256 MB/s  1711 MB/s     3181387  48.00 silesia/silesia/reymont
lz4 1.8.3                 321 MB/s  1916 MB/s     7716839  35.72 silesia/silesia/samba
lz4 1.8.3                 222 MB/s  1906 MB/s     6790273  93.63 silesia/silesia/sao
lz4 1.8.3                 214 MB/s  1644 MB/s    20139988  48.58 silesia/silesia/webster
lz4 1.8.3                 396 MB/s  3489 MB/s     8390195  99.01 silesia/silesia/x-ray
lz4 1.8.3                 525 MB/s  2280 MB/s     1227495  22.96 silesia/silesia/xml

origin/dev

lz4 1.8.3                 187 MB/s  1609 MB/s     6428742  63.07 silesia/silesia/dickens
lz4 1.8.3                 263 MB/s  1883 MB/s    26435667  51.61 silesia/silesia/mozilla
lz4 1.8.3                 269 MB/s  1641 MB/s     5440937  54.57 silesia/silesia/mr
lz4 1.8.3                 459 MB/s  1951 MB/s     5533040  16.49 silesia/silesia/nci
lz4 1.8.3                 238 MB/s  1856 MB/s     4338918  70.53 silesia/silesia/ooffice
lz4 1.8.3                 228 MB/s  1626 MB/s     5256666  52.12 silesia/silesia/osdb
lz4 1.8.3                 245 MB/s  1618 MB/s     3181387  48.00 silesia/silesia/reymont
lz4 1.8.3                 308 MB/s  1811 MB/s     7716839  35.72 silesia/silesia/samba
lz4 1.8.3                 197 MB/s  1793 MB/s     6790273  93.63 silesia/silesia/sao
lz4 1.8.3                 206 MB/s  1574 MB/s    20139988  48.58 silesia/silesia/webster
lz4 1.8.3                 380 MB/s  3273 MB/s     8390195  99.01 silesia/silesia/x-ray
lz4 1.8.3                 505 MB/s  2063 MB/s     1227495  22.96 silesia/silesia/xml

Unfortunately I don't have any phones I can test on at the moment.

  • Nerd's question (optional): are you using perf or something similar to guide your investigation? If yes, do you have some numbers to share?

As mentioned in #411, the ld_blocks.store_forward counter shows the general location of the issue. cycles:pp is helpful when optimizing. But even just counting how many times we need to read extra bytes for match lengths and literal lengths was useful. I also dumped histograms of the offset sizes for the small-offset optimization: offsets 1 and 2 are by far the most common offsets below 8, and the most important to optimize.

Here's a general perf stat dump of silesia.tar:

./lz4 --no-frame-crc silesia.tar

Original:

perf stat ./lz4 -f -d silesia.tar.lz4 /dev/null
Using perf wrapper that supports hot-text. Try perf.real if you encounter any issues.
silesia.tar.lz4      : decoded 211957760 bytes                                 

 Performance counter stats for './lz4 -f -d silesia.tar.lz4 /dev/null':

        154.027595      task-clock (msec)         #    0.998 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
              2145      page-faults               #    0.014 M/sec                  
         442290349      cycles                    #    2.872 GHz                      (74.68%)
         723758179      instructions              #    1.64  insn per cycle           (75.06%)
         116629064      branches                  #  757.196 M/sec                    (75.33%)
           3926800      branch-misses             #    3.37% of all branches          (74.93%)

       0.154356436 seconds time elapsed

This series:

perf stat ./lz4 -f -d silesia.tar.lz4 /dev/null
Using perf wrapper that supports hot-text. Try perf.real if you encounter any issues.
silesia.tar.lz4      : decoded 211957760 bytes                                 

 Performance counter stats for './lz4 -f -d silesia.tar.lz4 /dev/null':

        138.692866      task-clock (msec)         #    0.997 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
              2144      page-faults               #    0.015 M/sec                  
         398272806      cycles                    #    2.872 GHz                      (74.77%)
         676705413      instructions              #    1.70  insn per cycle           (74.77%)
         125361029      branches                  #  903.875 M/sec                    (75.03%)
           3056757      branch-misses             #    2.44% of all branches          (75.44%)

       0.139044307 seconds time elapsed

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from 67e3099 to ed673eb Jan 30, 2019

lib/lz4.c (outdated):
const BYTE* s = (const BYTE*)srcPtr;
BYTE* const e = (BYTE*)dstEnd;

memcpy(d,s,16); d+=16; s+=16;

@Cyan4973 (Member), Jan 30, 2019:

Question:
this function copies 16 bytes first,
and then, provided the length is large enough, proceeds with further stripes of 32 bytes.

Any reason it starts with 16 bytes instead of 32 bytes?
I guess you ran tests to reach this conclusion, and maybe could share some results?

@djwatson (Author), Jan 30, 2019:

Starting with 16, then doing a copy loop of 32:

lz4 1.8.3                 319 MB/s  2914 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 493 MB/s  2614 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 494 MB/s  3376 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 845 MB/s  3884 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 412 MB/s  2651 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 456 MB/s  2977 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 329 MB/s  2483 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 557 MB/s  3516 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 441 MB/s  3985 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 361 MB/s  2714 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1218 MB/s  8687 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 718 MB/s  3611 MB/s     1227495  22.96 silesia/xml

Going straight to a 32-byte copy loop:

lz4 1.8.3                 312 MB/s  3091 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 499 MB/s  2744 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 496 MB/s  3486 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 877 MB/s  4328 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 424 MB/s  2839 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 459 MB/s  3202 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 326 MB/s  2619 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 551 MB/s  3555 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 431 MB/s  4116 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 355 MB/s  2872 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1206 MB/s  9369 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 682 MB/s  3623 MB/s     1227495  22.96 silesia/xml

So yes, we should change it to 32 bytes, which makes sense: we only use this wildcopy when we are already copying >= 15 bytes.

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from ed673eb to 6d39c69 Jan 30, 2019

@Cyan4973 (Member) commented Jan 30, 2019:

Thanks for all these interesting data points @djwatson.
Indeed, the improvements related to overlapping copies are much larger than expected. It's great!

One last question:
the new decoder seems to completely remove the previous "2-stage shortcut", which is designed to special-case "common" small runs + small matches.

Is that intentional?

From a high level, your optimizations, which target long copies and overlapping copies, seem complementary to the shortcut, which specifically avoids those cases.

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from 6d39c69 to ee675fc Jan 30, 2019

@djwatson (Author) commented Jan 30, 2019:

One last question:
the new decoder seems to completely remove the previous "2-stage shortcut", which is designed to special-case "common" small runs + small matches.

Is that intentional?

Yes, this was intentional. As far as I can tell, the previous shortcut did three things:

  • For literal copy lengths that don't overflow, it did a memcpy of 16 instead of a variable copy by length. We still do this.

  • For match copy lengths that don't overflow, it did a memcpy of 8+8+2. We still do this, but as 16+2.

  • If both the match and literal lengths don't overflow, it checked the input and output sizes only once, combined. We also still do this: input overflow is only checked for the literal copy, and output overflow is checked once per loop if the literals don't overflow:

                if (op + length >= oend - FASTLOOP_SAFE_DISTANCE) {
                    goto safe_match_copy;
                }

See "decompress_generic: re-add fastpath".

And actually I believe this is an improvement on the previous shortcut, because even if the match length overflows, we still only check the output size once.

We did, however, remove the shortcut for the last 32 bytes.

@Cyan4973 (Member) commented Jan 30, 2019:

I just made a test of the proposed PR on an ARM platform.

> lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           2
Vendor ID:           Qualcomm
Model:               2
Model name:          Kryo
Stepping:            0x1
CPU max MHz:         2150.3999
CPU min MHz:         307.2000
BogoMIPS:            38.40
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32

Unfortunately, results are not so good on this platform.

Comparing djwatson:optimize_decompress_generic:

 1#silesia.tar       : 211948032 -> 100881000 (2.101), 160.6 MB/s , 759.6 MB/s
 1#calgary.tar       :   3265536 ->   1686059 (1.937), 137.4 MB/s , 681.1 MB/s
 1#dickens           :  10192446 ->   6428742 (1.585), 100.9 MB/s , 547.4 MB/s
 1#mozilla           :  51220480 ->  26435667 (1.938), 169.4 MB/s , 764.1 MB/s
 1#mr                :   9970564 ->   5440937 (1.833), 171.8 MB/s , 799.6 MB/s
 1#nci               :  33553445 ->   5533040 (6.064), 314.7 MB/s ,1058.9 MB/s
 1#ooffice           :   6152192 ->   4338918 (1.418), 141.0 MB/s , 684.3 MB/s
 1#osdb              :  10085684 ->   5256666 (1.919), 151.7 MB/s , 887.9 MB/s
 1#reymont           :   6627202 ->   3181387 (2.083), 107.5 MB/s , 463.3 MB/s
 1#samba             :  21606400 ->   7716839 (2.800), 195.9 MB/s , 793.7 MB/s
 1#sao               :   7251944 ->   6790273 (1.068), 138.5 MB/s ,1022.6 MB/s
 1#webster           :  41458703 ->  20139988 (2.059), 119.7 MB/s , 587.9 MB/s
 1#xml               :   5345280 ->   1227495 (4.355), 245.4 MB/s , 900.9 MB/s
 1#x-ray             :   8474240 ->   8390195 (1.010), 385.4 MB/s ,2625.9 MB/s

vs lz4:dev:

1#silesia.tar       : 211948032 -> 100881000 (2.101), 161.2 MB/s , 876.6 MB/s
1#calgary.tar       :   3265536 ->   1686059 (1.937), 134.4 MB/s , 832.7 MB/s
1#dickens           :  10192446 ->   6428742 (1.585), 103.1 MB/s , 774.6 MB/s
1#mozilla           :  51220480 ->  26435667 (1.938), 168.2 MB/s , 908.9 MB/s
1#mr                :   9970564 ->   5440937 (1.833), 167.7 MB/s , 996.4 MB/s
1#nci               :  33553445 ->   5533040 (6.064), 316.8 MB/s , 914.2 MB/s
1#ooffice           :   6152192 ->   4338918 (1.418), 139.7 MB/s , 764.0 MB/s
1#osdb              :  10085684 ->   5256666 (1.919), 148.2 MB/s , 837.2 MB/s
1#reymont           :   6627202 ->   3181387 (2.083), 105.0 MB/s , 679.7 MB/s
1#samba             :  21606400 ->   7716839 (2.800), 195.6 MB/s , 926.9 MB/s
1#sao               :   7251944 ->   6790273 (1.068), 138.7 MB/s , 968.2 MB/s
1#webster           :  41458703 ->  20139988 (2.059), 119.8 MB/s , 740.5 MB/s
1#xml               :   5345280 ->   1227495 (4.355), 247.2 MB/s , 927.7 MB/s
1#x-ray             :   8474240 ->   8390195 (1.010), 381.6 MB/s ,2762.5 MB/s

The difference is significant, and pretty consistent across multiple benchmark runs.
sao, nci and osdb stand out as wins for this patch, likely due to the presence of large copies.
But everything else suffers more or less substantially.

Possible reasons (hand-wavy):

  • maybe 32 bytes is a less optimal stripe length for this platform
  • maybe the shortcut was actually helpful; it could be just a question of instruction locality for the uop cache
  • in the termux environment, the local compiler is clang 7 (vs. gcc?)
@Cyan4973 (Member) commented Feb 1, 2019:

For match copy lengths that don't overflow, it did a memcpy of 8+8+2. We still do this, but as 16+2.

There is a bit more to this story.
By doing 8+8+2, the shortcut stays compatible
with offsets in the [8-15] range.

In contrast, doing 16+2 requires the offset to be at least 16.

The second version saves one load instruction per sequence,
but the first version makes more sequences compatible with the shortcut.
That might not seem like much, but it affects branch predictability.

The shortcut is expected to be useful "as often as possible".
Restricting the set of compatible sequences slightly reduces its predictability,
so there are a bit more branch misses, hence more pipeline flushes, etc.

With modern CPUs, the cost of branch unpredictability is much higher than the cost of an instruction.
So this favors the 8+8+2 trade-off.

@djwatson (Author) commented Feb 1, 2019:

I just made a test of the proposed PR on an ARM platform.
Unfortunately, results are not so good on this platform.

Unfortunately I don't have any Qualcomm ARM chips; I'll have to borrow one to dig into it.

Alternatively, we could #ifdef this to x86 only, but since there were some nice wins on larger ARM chips, it would be nice to fix it.

By doing 8+8+2, the shortcut stays compatible with offsets in the [8-15] range.
The second version saves one load instruction per sequence, but the first version makes more sequences compatible with the shortcut. That might not seem like much, but it affects branch predictability.

It's even slightly more complicated: the first version (8+8+2) can trigger the load-blocked-by-store condition, while doing it by 16 doesn't seem to. However, we don't do anything special to avoid this in memcpy_using_offset for offsets 8-15 (currently only 1, 2 and 4), although we could; for the shortcut it's probably too small to worry about anyway.

Benchmarks:

this branch:
lz4 1.8.3                 325 MB/s  3074 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 527 MB/s  2808 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 515 MB/s  3522 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 886 MB/s  4299 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 436 MB/s  2885 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 468 MB/s  3198 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 334 MB/s  2576 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 569 MB/s  3581 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 447 MB/s  4133 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 475 MB/s  3013 MB/s   100881331  47.60 silesia/silesia.tar
lz4 1.8.3                 379 MB/s  2857 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1236 MB/s  9404 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 729 MB/s  3616 MB/s     1227495  22.96 silesia/xml

      41681.143709      task-clock (msec)         #    0.991 CPUs utilized          
               505      context-switches          #    0.012 K/sec                  
                11      cpu-migrations            #    0.000 K/sec                  
            291159      page-faults               #    0.007 M/sec                  
      123153166169      cycles                    #    2.955 GHz                      (75.03%)
      275849355190      instructions              #    2.24  insn per cycle           (75.04%)
       49297484532      branches                  # 1182.729 M/sec                    (75.02%)
        1668515452      branch-misses             #    3.38% of all branches          (75.02%)

      42.073131221 seconds time elapsed


Adding this patch: 

diff --git a/lib/lz4.c b/lib/lz4.c
index ddaab77..ccc697e 100644
--- a/lib/lz4.c
+++ b/lib/lz4.c
@@ -1649,8 +1649,9 @@ LZ4_decompress_generic(
 
                 /* Fastpath check: Avoids a branch in LZ4_wildCopy32 if true */
                 if (!(dict == usingExtDict) || (match >= lowPrefix)) {
-                    if (offset >= 16) {
-                        memcpy(op, match, 16);
+                    if (offset >= 8) {
+                        memcpy(op, match, 8);
+                        memcpy(op+8, match+8, 8);
                         memcpy(op+16, match+16, 2);
                         op += length;
                         continue;

lz4 1.8.3                 329 MB/s  2991 MB/s     6428742  63.07 silesia/dickens
lz4 1.8.3                 536 MB/s  2867 MB/s    26435667  51.61 silesia/mozilla
lz4 1.8.3                 514 MB/s  3495 MB/s     5440937  54.57 silesia/mr
lz4 1.8.3                 901 MB/s  4321 MB/s     5533040  16.49 silesia/nci
lz4 1.8.3                 438 MB/s  2958 MB/s     4338918  70.53 silesia/ooffice
lz4 1.8.3                 469 MB/s  3150 MB/s     5256666  52.12 silesia/osdb
lz4 1.8.3                 333 MB/s  2574 MB/s     3181387  48.00 silesia/reymont
lz4 1.8.3                 568 MB/s  3571 MB/s     7716839  35.72 silesia/samba
lz4 1.8.3                 446 MB/s  3977 MB/s     6790273  93.63 silesia/sao
lz4 1.8.3                 499 MB/s  3067 MB/s   100881331  47.60 silesia/silesia.tar
lz4 1.8.3                 378 MB/s  2819 MB/s    20139988  48.58 silesia/webster
lz4 1.8.3                1234 MB/s  9311 MB/s     8390195  99.01 silesia/x-ray
lz4 1.8.3                 729 MB/s  3645 MB/s     1227495  22.96 silesia/xml


      41687.349560      task-clock (msec)         #    0.991 CPUs utilized          
               505      context-switches          #    0.012 K/sec                  
                 9      cpu-migrations            #    0.000 K/sec                  
            291158      page-faults               #    0.007 M/sec                  
      120422000483      cycles                    #    2.889 GHz                      (75.01%)
      274586415771      instructions              #    2.28  insn per cycle           (75.03%)
       47621821770      branches                  # 1142.357 M/sec                    (75.03%)
        1566369291      branch-misses             #    3.29% of all branches          (75.04%)

      42.076941335 seconds time elapsed

The overall benchmark numbers look neutral-ish to me, but indeed the total branches and branch misses drop a bit. I can add the above patch. Thanks!

Note that the offset is chosen by the compressor, so a favorDecSpeed compressor could potentially choose offsets >= 32 to avoid the load-blocked-by-store issue; it doesn't look like we do this currently.

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from ee675fc to b5c3500 Feb 1, 2019

@Cyan4973 (Member) commented Feb 1, 2019:

I noticed a few changes in the PR, so I decided to give it another benchmark go.

I confirm the great speed-up on x64. Tested with clang-7 and gcc-8; they both show equivalent benefits.

However, the situation on Qualcomm's aarch64 is still the same,
with many files showing large performance losses.

I guess it's not possible to correctly observe/debug the situation from x64.
You will likely need a similar platform to measure performance and optimize.

I can lend you a mobile aarch64 terminal with a termux setup for your convenience if you wish.
Another possibility is to request temporary access to an aarch64 server platform.
Both actions seem complementary, by the way; I would expect the two chips to behave differently, in spite of the same underlying aarch64 ISA.

Worst case would be to keep the new decoder for x86/x64 only,
but I worry about having 2 versions of the decoder to maintain.
The main reason we have a single LZ4_decompress_generic() with all modes converging into it is to avoid having multiple places to maintain and keep in sync. Better to avoid that again if possible.

@Cyan4973 (Member) commented Feb 1, 2019:

Worth noting:
on a different ARM terminal, the difference in perf is no longer detrimental.
It actually seems a little bit positive.
The CPU is different: it's an Exynos (the previous one was a Qualcomm).

> lscpu
Architecture:         aarch64
Byte Order:           Little Endian
CPU(s):               8
On-line CPU(s) list:  0
Off-line CPU(s) list: 1-7
Thread(s) per core:   1
Core(s) per socket:   1
Socket(s):            1
Vendor ID:            ARM
Model:                2
Model name:           Cortex-A53
Stepping:             r0p2
CPU max MHz:          1500.0000
CPU min MHz:          400.0000
Flags:                fp asimd aes pmull sha1 sha2 crc32

So: 2 ARM architectures, 2 different outcomes. That makes things a bit more complex to interpret...

Edit: for information, I restarted the benchmark on the previous phone and once again got the same performance results. I also tried switching between power mode and saving mode; the difference remains, at approximately the same magnitude.

djwatson added some commits Jan 24, 2019

decompress_generic: Refactor variable length fields
Make a helper function to read variable lengths for literals and
match lengths.
decompress_generic: Add a loop fastpath
Copy the main loop, and change the checks such that op is always less
than oend-SAFE_DISTANCE.  Currently these checks are added for the literal
copy length and for the match copy length.

Otherwise the first loop is exactly the same as the second.  Follow-on
diffs will optimize the first copy loop based on this new requirement.

I also tried making a separate inlineable function for the copy
loop (similar to the existing partialDecode flags, etc.), but I think the
changes are significant enough to warrant doubling the code, instead
pulling common functionality out into separate functions.

This is the basic transformation that will allow several following optimizations.
decompress_generic: optimize match copy
Add an LZ4_wildCopy16 that wildcopies, potentially smashing up
to 16 bytes past the end, and use it for the match copy.  On x64, this
avoids many blocked loads due to store forwarding, similar to issue #411.
decompress_generic: Optimize literal copies
Use LZ4_wildCopy16 for variable-length literals.  For literal counts that
fit in the flag byte, copy directly.  We can also omit oend checks, for
roughly the same reason as the previous shortcut: we check once that both
the match length and the literal length fit within FASTLOOP_SAFE_DISTANCE,
including the wildcopy distance.
decompress_generic: drop partial copy check in fast loop
We've already checked that we are more than FASTLOOP_SAFE_DISTANCE
away from the end, so this branch can never be true; we will have
already jumped to the second decode loop.
decompress_generic: re-add fastpath
This is the remainder of the original 'shortcut'.  If true, we can avoid
the loop in LZ4_wildCopy and copy directly instead.
decompress_generic: remove msan write
This store is also causing load-blocked-by-store issues; remove it.
The msan warning will have to be fixed another way, if it is still an issue.
decompress_generic: Unroll loops a bit more
Generally we want our wildcopy loops to look like the
memcpy loops from our libc, but without the final byte copy checks.
We can unroll a bit to make long copies even faster.

The only catch is that this affects the value of FASTLOOP_SAFE_DISTANCE.
decompress_generic: Add fastpath for small offsets
For small offsets of 1, 2, 4 and 8, we can build a single uint64_t
and then use it to do a memset() variation.  In particular, this makes
the somewhat-common RLE case (offset 1) about 2-4x faster than the previous
implementation: we not only avoid the load blocked by store, we
avoid the loads entirely.
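A minimal sketch of that idea, in the spirit of the patch's memcpy_using_offset (the helper name and exact layout here are illustrative, not the patch's literal code; BYTE is the lib/lz4.c typedef and memcpy/memset come from <string.h>):

/* For offsets 1, 2, 4 and 8, the repeating pattern fits in one 8-byte
 * unit: build it once, then stamp it across the output. The match copy
 * becomes a memset()-like splat with no further loads from the match. */
static void copy_small_offset(BYTE* op, const BYTE* match,
                              size_t length, size_t offset)
{
    BYTE v[8];
    switch (offset) {
    case 1:  memset(v, match[0], 8); break;                          /* RLE */
    case 2:  memcpy(v, match, 2); memcpy(v+2, v, 2); memcpy(v+4, v, 4); break;
    case 4:  memcpy(v, match, 4); memcpy(v+4, v, 4); break;
    default: memcpy(v, match, 8); break;                             /* offset 8 */
    }
    {   BYTE* const end = op + length;
        /* wild 8-byte stamps: may smash slack bytes past op+length,
         * which the fast loop's safety margin must absorb */
        do { memcpy(op, v, 8); op += 8; } while (op < end);
    }
}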
@djwatson (Author) commented Feb 8, 2019:

Restricted the new fastpath to x86 until we can track down the regression.

@Cyan4973 (Member) commented Feb 11, 2019:

Restricted the new fastpath to x86 until we can track down the regression.

Was this last statement supposed to be followed by additional patches?

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from b5c3500 to bfef92b Feb 11, 2019

@djwatson (Author) commented Feb 11, 2019:

Ah sorry, pushed the wrong branch :/

The last patch skips the first loop on non-x86 targets. I also kept the previous shortcut in the original loop. Thanks!

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from bfef92b to ed16483 Feb 11, 2019

@djwatson (Author) commented Feb 11, 2019:

Fixed the MSVC unreachable-code warning.

@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch 2 times, most recently from 92b1052 to 59ab4cd Feb 11, 2019

decompress_generic: Limit fastpath to x86
The new fastpath currently shows a regression on Qualcomm
ARM chips.  Restrict it to x86 for now.
@djwatson djwatson force-pushed the djwatson:optimize_decompress_generic branch from 59ab4cd to 5d7d116 Feb 11, 2019

@Cyan4973 (Member) commented Feb 12, 2019:

Thanks @djwatson!
It's a great speed optimization for the LZ4 decoder! Definitely worth being part of the next release!

I'll try to set aside some time later this month to investigate the issue on Qualcomm's chips.

@Cyan4973 Cyan4973 merged commit d85bdb4 into lz4:dev Feb 12, 2019

2 checks passed: the AppVeyor build succeeded, and the Travis CI build passed.
@Cyan4973 (Member) commented Apr 2, 2019:

OK, I'm back on the lz4 code base, with the objective of the v1.9.0 release.

One major point of this last sprint is checking this optimization path on ARM, and more specifically on aarch64. The path has been gated to be enabled on x86/x64 only, so it needs to be force-enabled for other targets such as aarch64. To this end, I created a new build macro, LZ4_FAST_DEC_LOOP, which makes it possible to control the decoder loop behavior from the command line at compile time.
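The gating roughly takes this shape (a sketch; the exact architecture test in lib/lz4.c may differ):

/* LZ4_FAST_DEC_LOOP can be forced from the compile command line
 * (e.g. -DLZ4_FAST_DEC_LOOP=1); otherwise it defaults to on only
 * where the fast loop is a known win. */
#ifndef LZ4_FAST_DEC_LOOP
#  if defined(__i386__) || defined(__x86_64__)
#    define LZ4_FAST_DEC_LOOP 1
#  else
#    define LZ4_FAST_DEC_LOOP 0
#  endif
#endif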

First results, on Qualcomm Kryo 2 using clang version 7.0.1:
normal (gated off):

 1#calgary.tar       :   3265536 ->   1686059 (1.937), 139.1 MB/s , 841.8 MB/s

force-enabled :

 1#calgary.tar       :   3265536 ->   1686059 (1.937), 139.8 MB/s , 644.2 MB/s

I ran the test multiple times with consistent results. That's clearly a large regression.

Interestingly, gcc v8.3.0 has recently been fixed for the termux environment on Android, so it's another viable option. And with this compiler, I cannot measure any significant difference.

normal (gated off):

 1#calgary.tar       :   3265536 ->   1686059 (1.937), 113.3 MB/s , 815.3 MB/s

force-enabled :

 1#calgary.tar       :   3265536 ->   1686059 (1.937), 109.7 MB/s , 823.6 MB/s

clang outcome != gcc outcome. Since there is no gain either way, keeping this optimization gated off on aarch64 seems a good choice for the time being.

Now, on to investigating more finely...

const BYTE* s = (const BYTE*)srcPtr;
BYTE* const e = (BYTE*)dstEnd;

do { memcpy(d,s,16); memcpy(d+16,s+16,16); d+=32; s+=32; } while (d<e);

@Cyan4973 (Member), Apr 3, 2019:

I'll document that this function makes 2 copies of 16 bytes instead of a single copy of 32 bytes,
because it must be compatible with offset >= 16.

It probably doesn't make a difference on most architectures nowadays,
but as AVX 256-bit registers become the expectation in the x64 world,
copying 32 bytes wide might become advantageous at some point, so the question will come.
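A sketch of what that future variant could look like (hypothetical; it would require offset >= 32, so it is not a drop-in replacement):

/* Single-stripe variant: one 32-byte memcpy per iteration, which a
 * compiler may lower to one AVX 256-bit load/store pair. Each 16-byte
 * memcpy in the current form is safe for offset >= 16; reading 32
 * source bytes at once is only safe for offset >= 32. */
static void LZ4_wildCopy32_single(BYTE* d, const BYTE* s, BYTE* const e)
{
    do { memcpy(d, s, 32); d += 32; s += 32; } while (d < e);
}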

cpy = op + length;

/* partialDecoding : may not respect endBlock parsing restrictions */
assert(op<=oend);

@Cyan4973 (Member), Apr 3, 2019:

Considering that LZ4_wildCopy32 starts by blindly copying 32 bytes at the op position, I guess this condition op<=oend is not enough to avoid a buffer overflow.

Note that I'm pretty sure the condition is met; it's just the assert() itself that feels incorrect.

/* copy match within block */
cpy = op + length;

/* partialDecoding : may not respect endBlock parsing restrictions */

@Cyan4973 (Member), Apr 3, 2019:

I guess this comment is a leftover.

memcpy(v, srcPtr, 4);
memcpy(&v[4], srcPtr, 4);
goto copy_loop;
case 3:

@Cyan4973 (Member), Apr 3, 2019:

I guess this can be simplified to default:

@terrelln (Contributor) commented Apr 3, 2019:

Seeing as the performance is fine on gcc and terrible on clang, I would bet it's a compiler problem: something like a bad inlining decision, or a failure to constant-fold one of the "template parameters".

If you can narrow down which part of the patch is causing the regression, I bet it can be improved by rewriting the same or similar logic in a slightly different way.

@Cyan4973 (Member) commented Apr 3, 2019:

I agree @terrelln.

Narrowing down the issue may help us understand what works better for the compiler, eventually uncovering an even more favorable trade-off.

@Cyan4973 (Member) commented on lib/lz4.c in faac110, Apr 17, 2019:

It seems we missed that one.

That cannot work with LZ4_decompress_fast(),
translated into !endOnInput here.

The reason is that LZ4_decompress_fast() doesn't know its input size (that's why it's unsafe).
So it relies on a stream property to remain correct for its memory accesses,
as long as the compressed data is valid (it cannot defend itself against invalid data).

The stream property guarantees that the last 5 bytes are literals.
This basically means that it can copy up to 8 bytes from the input "blindly", without risk of reading beyond the input buffer (assuming the input is valid and contains exactly the compressed size).
But it cannot copy more: it would risk reading bytes beyond the input buffer.

This issue triggered asan bug reports on upgrading to v1.9.0.

In hindsight, a larger problem is that we were missing a test able to catch this issue.
That's because LZ4_decompress_fast() was tested into a pre-allocated, over-sized buffer, so it never read beyond the buffer's boundaries. This is now fixed.

It also means LZ4_decompress_fast() cannot copy more than 8 bytes at a time,
which is most likely detrimental to its speed.
This further reinforces the idea that LZ4_decompress_fast() must be deprecated now.

(Member) replied Apr 19, 2019:

Another problem is that LZ4_wildCopy32() writes 32 bytes at a time,
hence up to 32 bytes after the limit.

However, the test only checks that ip+length<=iend-(2+1+LASTLITERALS),
which translates to ip+length<=iend-8.
Hence, the copy may read up to 24 bytes beyond the input buffer.

Fuzzer tests have been updated to detect such issues in the future.

The fix is simple: test that ip+length<=iend-32.

parheliamm pushed a commit to parheliamm/lz4 that referenced this pull request May 24, 2019

Chenxi Mao
LZ4_decompress_generic: performance improvement on AARCH64
LZ4_FAST_DEC_LOOP (lz4#645, lz4#707) was introduced to improve decompression
performance not only on x86 but also on ARM64.

On my ARM64 target (Cortex-A53), the LZ4_FAST_DEC_LOOP feature shows a
performance downgrade on the lzbench test cases.

lzbench 1.7.3 (64-bit Linux)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   1919 MB/s  1924 MB/s    10192446 100.00 /sdcard/silesia/dickens
lz4 1.8.3                  80 MB/s   448 MB/s     6428742  63.07 /sdcard/silesia/dickens
lz4 1.8.3                 120 MB/s   562 MB/s    26435667  51.61 /sdcard/silesia/mozilla
lz4 1.8.3                 127 MB/s   556 MB/s     5440937  54.57 /sdcard/silesia/mr
lz4 1.8.3                 321 MB/s   601 MB/s     5533040  16.49 /sdcard/silesia/nci
lz4 1.8.3                  83 MB/s   546 MB/s     4338918  70.53 /sdcard/silesia/ooffice
lz4 1.8.3                  87 MB/s   535 MB/s     5256666  52.12 /sdcard/silesia/osdb
lz4 1.8.3                 115 MB/s   417 MB/s     3181387  48.00 /sdcard/silesia/reymont
lz4 1.8.3                 178 MB/s   572 MB/s     7716839  35.72 /sdcard/silesia/samba
lz4 1.8.3                  68 MB/s   600 MB/s     6790273  93.63 /sdcard/silesia/sao
lz4 1.8.3                 103 MB/s   483 MB/s    20139988  48.58 /sdcard/silesia/webster
lz4 1.8.3                 152 MB/s  1064 MB/s     8390195  99.01 /sdcard/silesia/x-ray
lz4 1.8.3                 244 MB/s   583 MB/s     1227495  22.96 /sdcard/silesia/xml

The current lz4-1.9.x dev branch has an obvious downgrade on the dickens case.
lzbench 1.7.3 (64-bit Linux)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   1913 MB/s  1934 MB/s    10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x                  80 MB/s   436 MB/s     6428742  63.07 /sdcard/silesia/dickens
lz4 1.9.x                 120 MB/s   565 MB/s    26435667  51.61 /sdcard/silesia/mozilla
lz4 1.9.x                 127 MB/s   558 MB/s     5440937  54.57 /sdcard/silesia/mr
lz4 1.9.x                 319 MB/s   623 MB/s     5533040  16.49 /sdcard/silesia/nci
lz4 1.9.x                  82 MB/s   544 MB/s     4338918  70.53 /sdcard/silesia/ooffice
lz4 1.9.x                  87 MB/s   548 MB/s     5256666  52.12 /sdcard/silesia/osdb
lz4 1.9.x                 113 MB/s   404 MB/s     3181387  48.00 /sdcard/silesia/reymont
lz4 1.9.x                 176 MB/s   581 MB/s     7716839  35.72 /sdcard/silesia/samba
lz4 1.9.x                  68 MB/s   597 MB/s     6790273  93.63 /sdcard/silesia/sao
lz4 1.9.x                 103 MB/s   481 MB/s    20139988  48.58 /sdcard/silesia/webster
lz4 1.9.x                 152 MB/s  1073 MB/s     8390195  99.01 /sdcard/silesia/x-ray
lz4 1.9.x                 242 MB/s   597 MB/s     1227495  22.96 /sdcard/silesia/xml
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)

After applying the patch, the overall test results improved on my ARM64 device.
lzbench 1.7.3 (64-bit Linux)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   1909 MB/s  1916 MB/s    10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x                  82 MB/s   446 MB/s     6428742  63.07 /sdcard/silesia/dickens
lz4 1.9.x                 125 MB/s   571 MB/s    26435667  51.61 /sdcard/silesia/mozilla
lz4 1.9.x                 131 MB/s   565 MB/s     5440937  54.57 /sdcard/silesia/mr
lz4 1.9.x                 327 MB/s   626 MB/s     5533040  16.49 /sdcard/silesia/nci
lz4 1.9.x                  87 MB/s   551 MB/s     4338918  70.53 /sdcard/silesia/ooffice
lz4 1.9.x                  90 MB/s   556 MB/s     5256666  52.12 /sdcard/silesia/osdb
lz4 1.9.x                 121 MB/s   412 MB/s     3181387  48.00 /sdcard/silesia/reymont
lz4 1.9.x                 185 MB/s   584 MB/s     7716839  35.72 /sdcard/silesia/samba
lz4 1.9.x                  73 MB/s   601 MB/s     6790273  93.63 /sdcard/silesia/sao
lz4 1.9.x                 109 MB/s   489 MB/s    20139988  48.58 /sdcard/silesia/webster
lz4 1.9.x                 164 MB/s  1067 MB/s     8390195  99.01 /sdcard/silesia/x-ray
lz4 1.9.x                 249 MB/s   603 MB/s     1227495  22.96 /sdcard/silesia/xml

To make sure this change doesn't have side effects, it is only enabled on ARM64
with GCC builds.
