-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LZ4_decompress_generic: perofmrance improvement on AARCH64 #713
Conversation
BTW, I port my change to kernel lz4 module, the test result improved on ZRAM but still downgrade compare with 1.8.3 in kernel. |
Hi @Cyan4973: |
Hi @Cyan4973 : Chenxi |
I can build lzbench for aarch64 via clang. v1.9.x-with patch It is close to 1.8.3 but still has downgrade. so FAST_DEC_LOOP still enable only on GCC. |
@parheliamm , I will have access to 2 ARM64 systems soon. |
v[1] = srcPtr[0]; | ||
v[2] = srcPtr[0]; | ||
v[3] = srcPtr[0]; | ||
memcpy(v+4, v, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This modification is interesting.
It's correct I think, just it's surprising it ends up being faster than memset()
.
I guess it might depend on the compiler, and its ability to properly optimize memset()
for a constant length.
Anyway, it would be good to try to isolate how much gains are achieved through this change alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Cyan4973 :
Please refer to below dump from ARM64:
The GCC invoke memset API directlty.
case 1:
memset(v, *srcPtr, 8);
390: f94013e0 ldr x0, [sp,#32]
394: 39400000 ldrb w0, [x0]
398: 2a0003e1 mov w1, w0
39c: 9100e3e0 add x0, sp, #0x38
3a0: d2800102 mov x2, #0x8 // #8
3a4: 94000000 bl 0 <memset>
goto copy_loop;
It is need more time from gprof:
2.22 2.14 0.05 memset
On the other hand, X86 GCC invoke the memset with avx instruction which provide much better performance than generic implementation.
1.05 0.94 0.01 __memset_avx2
@@ -1708,10 +1719,9 @@ LZ4_decompress_generic( | |||
/* get matchlength */ | |||
length = token & ML_MASK; | |||
|
|||
if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) { goto _output_error; } /* Error : offset outside buffers */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I understand correctly,
this check has been moved forward into the code, and replaced by 2 instances of the same test.
While I guess it works correctly, I wonder what's the need for it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Cyan4973 :
If you profiling this condition check invoked number comparison between 1.9.x and 1.8.3.
You will see the condition check difference via lzbench with dickens test case.
v1.8.3 34959
v.1.9.x 1055885
After investigate the code, we could see the difference.
v.1.8.3 SKIP the condition check if
if condition is true in:
https://github.com/lz4/lz4/blob/v1.8.3/lib/lz4.c#L1463
AND below condition is true
https://github.com/lz4/lz4/blob/v1.8.3/lib/lz4.c#L1478
The "if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) goto _output_error; / Error : offset outside buffers /" should not be invoked.
v1.9.3
The "if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) goto _output_error; /* Error : offset outside buffers */" code will be invoked in every loop which lead to downgrade.
So the fix would be move this check to specific condition to avoid useless condition check.
After this change, the call number is same as v1.8.3
Chenxi
LZ4_FAST_DEC_LOOP #645 #707 was introduced to improve decompress performance not only X86 but also ARM64. On my ARM64 target(Cortex-a53) with LZ4_FAST_DEC_LOOP feature has performance downgrade on lzbench test case. Here are the change for ARM64 1. using variable assignment isstead of memset on ARM64. 2. Add offset >= 16 check and memory copy. Generic: 3. reduce buffer outsiite check number. After this change, both clang and GCC dec performance are imrpoved. Test result: lz4 1.8.3 81 MB/s 447 MB/s 6428742 63.07 /sdcard/silesia/dickens lz4 1.9.x 80 MB/s 435 MB/s 6428742 63.07 /sdcard/silesia/dickens lz4 1.9.x-pathced 82 MB/s 455 MB/s 6428742 63.07 /sdcard/silesia/dickens
I've been testing these proposed changes on a Qualcom Kryo 2 cpu ( In comparing performance with Note that, on this cpu at least, |
You meant to say GCC build is also has downgrade via your local test? My test enviornment:
Linaro GCC-6.3 or GCC7 |
Clang with FAST_DEC lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski dev branch: |
Yes. My feeling is that the Therefore, we can separately observe radically different outcome for the same code modification. I'll have to be cautious on this front, and only consider changes with clear performance benefits across the board. This can happen sometimes, when removing a useless / redundant operation for example. Another good example is offering code simplification at no performance cost. I'll have to pass, as this patch does not belong to any of these categories, unfortunately. |
I thought you should review the checkoffset change which should be a bug fix for fast_dec_loop. |
I thought the |
Yes, as I mentioned previously, the checkoffset invoke number is much more than v1.8.3
If you don't want to enable aarch64 for fastloop, we could pick up checkoffset change to dev branch. Chenxi |
OK, so this is a performance enhancement suggestion, not a correctness issue. However, it seems to belong to the category "reduce nb of operations", so could end up being a general benefit on any platform. I'll have to test this change in isolation though, and come back with numbers. |
One more comments, I am not sure whether X86 could see the benefit directly. I can push a new pull request for this change today. |
Sure ! I'll try all architectures and compilers with the new PR. I only have |
new request #719 |
LZ4_FAST_DEC_LOOP #645 #707 was introduced to improve decompress
performance not only X86 but also ARM64.
On my ARM64 target(Cortex-a53) with LZ4_FAST_DEC_LOOP feature has
performance downgrade on lzbench test case.
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1919 MB/s 1924 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.8.3 80 MB/s 448 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.8.3 120 MB/s 562 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.8.3 127 MB/s 556 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.8.3 321 MB/s 601 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.8.3 83 MB/s 546 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.8.3 87 MB/s 535 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.8.3 115 MB/s 417 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.8.3 178 MB/s 572 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.8.3 68 MB/s 600 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.8.3 103 MB/s 483 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.8.3 152 MB/s 1064 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.8.3 244 MB/s 583 MB/s 1227495 22.96 /sdcard/silesia/xml
Current Lz4-1.9.x dev branch has obvious downgrade on dickens case.
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1913 MB/s 1934 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x 80 MB/s 436 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 120 MB/s 565 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 127 MB/s 558 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 319 MB/s 623 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 82 MB/s 544 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 87 MB/s 548 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 113 MB/s 404 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 176 MB/s 581 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 68 MB/s 597 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 103 MB/s 481 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 152 MB/s 1073 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 242 MB/s 597 MB/s 1227495 22.96 /sdcard/silesia/xml
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)
After apply the patch, the overall test result improvemed on my ARM64 device.
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1909 MB/s 1916 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x dickens 82 MB/s 446 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 125 MB/s 571 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 131 MB/s 565 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 327 MB/s 626 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 87 MB/s 551 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 90 MB/s 556 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 121 MB/s 412 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 185 MB/s 584 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 73 MB/s 601 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 109 MB/s 489 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 164 MB/s 1067 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 249 MB/s 603 MB/s 1227495 22.96 /sdcard/silesia/xml
To make sure this change don't have side effect, this change only enabled on ARM64
with GCC build.