Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LZ4_decompress_generic: perofmrance improvement on AARCH64 #713

Closed
wants to merge 1 commit into from
Closed

LZ4_decompress_generic: perofmrance improvement on AARCH64 #713

wants to merge 1 commit into from

Conversation

parheliamm
Copy link

LZ4_FAST_DEC_LOOP #645 #707 was introduced to improve decompress
performance not only X86 but also ARM64.

On my ARM64 target(Cortex-a53) with LZ4_FAST_DEC_LOOP feature has
performance downgrade on lzbench test case.

lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1919 MB/s 1924 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.8.3 80 MB/s 448 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.8.3 120 MB/s 562 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.8.3 127 MB/s 556 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.8.3 321 MB/s 601 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.8.3 83 MB/s 546 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.8.3 87 MB/s 535 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.8.3 115 MB/s 417 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.8.3 178 MB/s 572 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.8.3 68 MB/s 600 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.8.3 103 MB/s 483 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.8.3 152 MB/s 1064 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.8.3 244 MB/s 583 MB/s 1227495 22.96 /sdcard/silesia/xml

Current Lz4-1.9.x dev branch has obvious downgrade on dickens case.
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1913 MB/s 1934 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x 80 MB/s 436 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 120 MB/s 565 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 127 MB/s 558 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 319 MB/s 623 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 82 MB/s 544 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 87 MB/s 548 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 113 MB/s 404 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 176 MB/s 581 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 68 MB/s 597 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 103 MB/s 481 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 152 MB/s 1073 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 242 MB/s 597 MB/s 1227495 22.96 /sdcard/silesia/xml
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)

After apply the patch, the overall test result improvemed on my ARM64 device.
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1909 MB/s 1916 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x dickens 82 MB/s 446 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 125 MB/s 571 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 131 MB/s 565 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 327 MB/s 626 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 87 MB/s 551 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 90 MB/s 556 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 121 MB/s 412 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 185 MB/s 584 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 73 MB/s 601 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 109 MB/s 489 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 164 MB/s 1067 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 249 MB/s 603 MB/s 1227495 22.96 /sdcard/silesia/xml

To make sure this change don't have side effect, this change only enabled on ARM64
with GCC build.

lib/lz4.c Outdated Show resolved Hide resolved
lib/lz4.c Show resolved Hide resolved
@parheliamm
Copy link
Author

BTW, I port my change to kernel lz4 module, the test result improved on ZRAM but still downgrade compare with 1.8.3 in kernel.
The overall test result is patched version still a little downgrade 1%~5% compare with 1.8.3.
I plan to disassemble obj code to check it.

@parheliamm
Copy link
Author

Hi @Cyan4973:
I though I found the root cause for the ARM64 device via profiling tools.
1. invoke kernel memcpy api if size is 32 on ARM64.
2. using variable assignment isstead of memset on ARM64.
3. Add offset >= 16 check and memory copy.
On the othre hand, I found the "buffer outsiite check" is much more invoked than 1.8.3.
I thought it should be a bug. I try to fix it and invoke number is back to normal after modification.
Pleae review it.
The ARM64 LZ4 decompress performance obviously improved on benchmark.

@parheliamm
Copy link
Author

Hi @Cyan4973 :
BTW, clang downgrade seems like resolved as well.
I don't have environment for clang build.
I did a test on my clang build kernel. The zram performance seems like back to normal.
Would you please double check on your side?
If clang downgrade issue fixed, we can remove clang definition check.

Chenxi

@parheliamm
Copy link
Author

I can build lzbench for aarch64 via clang.
According to result, clang build still has small downgrade compare with 1.8.3,
v.1.8.3
lzbench-1.8.3 -elz4 /sdcard/silesia/*
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1177 MB/s 1038 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.8.3 64 MB/s 411 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.8.3 93 MB/s 534 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.8.3 95 MB/s 498 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.8.3 233 MB/s 572 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.8.3 65 MB/s 522 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.8.3 71 MB/s 507 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.8.3 91 MB/s 358 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.8.3 134 MB/s 526 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.8.3 50 MB/s 539 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.8.3 76 MB/s 418 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.8.3 112 MB/s 857 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.8.3 168 MB/s 508 MB/s 1227495 22.96 /sdcard/silesia/xml

v1.9.x-with patch
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1188 MB/s 1037 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x 56 MB/s 410 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 82 MB/s 532 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 80 MB/s 493 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 207 MB/s 562 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 56 MB/s 517 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 62 MB/s 508 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 79 MB/s 358 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 108 MB/s 495 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 43 MB/s 528 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 67 MB/s 418 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 100 MB/s 843 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 145 MB/s 507 MB/s 1227495 22.96 /sdcard/silesia/xml
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)

It is close to 1.8.3 but still has downgrade. so FAST_DEC_LOOP still enable only on GCC.
I will update patch soon.

@Cyan4973
Copy link
Member

@parheliamm , I will have access to 2 ARM64 systems soon.
I should be able to test both gcc and clang if all goes according to plan.

lib/lz4.c Outdated Show resolved Hide resolved
v[1] = srcPtr[0];
v[2] = srcPtr[0];
v[3] = srcPtr[0];
memcpy(v+4, v, 4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modification is interesting.
It's correct I think, just it's surprising it ends up being faster than memset().
I guess it might depend on the compiler, and its ability to properly optimize memset() for a constant length.
Anyway, it would be good to try to isolate how much gains are achieved through this change alone.

Copy link
Author

@parheliamm parheliamm May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Cyan4973 :
Please refer to below dump from ARM64:
The GCC invoke memset API directlty.

case 1:
    memset(v, *srcPtr, 8);
 390:	f94013e0 	ldr	x0, [sp,#32]
 394:	39400000 	ldrb	w0, [x0]
 398:	2a0003e1 	mov	w1, w0
 39c:	9100e3e0 	add	x0, sp, #0x38
 3a0:	d2800102 	mov	x2, #0x8                   	// #8
 3a4:	94000000 	bl	0 <memset>
    goto copy_loop;

It is need more time from gprof:
2.22 2.14 0.05 memset
On the other hand, X86 GCC invoke the memset with avx instruction which provide much better performance than generic implementation.
1.05 0.94 0.01 __memset_avx2

@@ -1708,10 +1719,9 @@ LZ4_decompress_generic(
/* get matchlength */
length = token & ML_MASK;

if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) { goto _output_error; } /* Error : offset outside buffers */
Copy link
Member

@Cyan4973 Cyan4973 May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I understand correctly,
this check has been moved forward into the code, and replaced by 2 instances of the same test.

While I guess it works correctly, I wonder what's the need for it ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Cyan4973 :
If you profiling this condition check invoked number comparison between 1.9.x and 1.8.3.
You will see the condition check difference via lzbench with dickens test case.
v1.8.3 34959
v.1.9.x 1055885

After investigate the code, we could see the difference.
v.1.8.3 SKIP the condition check if
if condition is true in:
https://github.com/lz4/lz4/blob/v1.8.3/lib/lz4.c#L1463
AND below condition is true
https://github.com/lz4/lz4/blob/v1.8.3/lib/lz4.c#L1478
The "if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) goto _output_error; / Error : offset outside buffers /" should not be invoked.

v1.9.3
The "if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) goto _output_error; /* Error : offset outside buffers */" code will be invoked in every loop which lead to downgrade.
So the fix would be move this check to specific condition to avoid useless condition check.

After this change, the call number is same as v1.8.3
Chenxi

LZ4_FAST_DEC_LOOP #645 #707 was introduced to improve decompress
performance not only X86 but also ARM64.

On my ARM64 target(Cortex-a53) with LZ4_FAST_DEC_LOOP feature has
performance downgrade on lzbench test case.

Here are the change for ARM64
1. using variable assignment isstead of memset on ARM64.
2. Add offset >= 16 check and memory copy.

Generic:
3. reduce buffer outsiite check number.

After this change, both clang and GCC dec performance are imrpoved.

Test result:
lz4 1.8.3                  81 MB/s   447 MB/s     6428742  63.07 /sdcard/silesia/dickens
lz4 1.9.x                  80 MB/s   435 MB/s     6428742  63.07 /sdcard/silesia/dickens
lz4 1.9.x-pathced          82 MB/s   455 MB/s     6428742  63.07 /sdcard/silesia/dickens
@Cyan4973
Copy link
Member

Cyan4973 commented May 30, 2019

I've been testing these proposed changes on a Qualcom Kryo 2 cpu (aarch64), compiling using both clang v8.0 and gcc v8.3.

In comparing performance with dev branch,
clang results are virtually unchanged, which I suspect is normal since this patch targets gcc only,
but gcc results are unfortunately down, and by a very large margin.
We are talking 20-30% slower decoding speed, depending on file.

Note that, on this cpu at least, dev branch produces substantially better decompression speed than v1.8.3.

@parheliamm
Copy link
Author

parheliamm commented May 30, 2019

I've been testing these proposed changes on a Qualcom Kryo 2 cpu (aarch64), compiling using both clang v8.0 and gcc v8.3.

In comparing performance with dev branch,
clang results are virtually unchanged, which I suspect is normal since this patch targets gcc only,
but gcc results are unfortunately down, and by a very large margin.
We are talking 20-30% slower decoding speed, depending on file.

Note that, on this cpu at least, dev branch produces substantially better decompression speed than v1.8.3.

You meant to say GCC build is also has downgrade via your local test?
I am using lzbench with silesia for verification.
If possible would you please send your benchmark binary to me for verificaiton?
I can send my lzbench as well.

My test enviornment:

  1. QCom SDM630 (A53)
  2. Qcom SDM855 (A75) ==> TBD

Linaro GCC-6.3 or GCC7
Ubuntu Clang-6.0

@parheliamm
Copy link
Author

parheliamm commented May 30, 2019

Clang with FAST_DEC
On my test build, clang with fast_dec_loop improved as well on patchset bfdb5db
QCOM SDM630(A53) Ubuntu-16.04 Clang-6.0 via lzbench.

lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1851 MB/s 1868 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x 72 MB/s 452 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 107 MB/s 568 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 110 MB/s 566 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 288 MB/s 610 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 74 MB/s 548 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 79 MB/s 554 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 103 MB/s 416 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 158 MB/s 578 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 60 MB/s 599 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 92 MB/s 485 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 141 MB/s 1079 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 218 MB/s 599 MB/s 1227495 22.96 /sdcard/silesia/xml

dev branch:
lzbench 1.7.3 (64-bit Linux) Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 1896 MB/s 1904 MB/s 10192446 100.00 /sdcard/silesia/dickens
lz4 1.9.x 72 MB/s 449 MB/s 6428742 63.07 /sdcard/silesia/dickens
lz4 1.9.x 108 MB/s 562 MB/s 26435667 51.61 /sdcard/silesia/mozilla
lz4 1.9.x 112 MB/s 554 MB/s 5440937 54.57 /sdcard/silesia/mr
lz4 1.9.x 286 MB/s 587 MB/s 5533040 16.49 /sdcard/silesia/nci
lz4 1.9.x 75 MB/s 547 MB/s 4338918 70.53 /sdcard/silesia/ooffice
lz4 1.9.x 80 MB/s 542 MB/s 5256666 52.12 /sdcard/silesia/osdb
lz4 1.9.x 104 MB/s 416 MB/s 3181387 48.00 /sdcard/silesia/reymont
lz4 1.9.x 158 MB/s 568 MB/s 7716839 35.72 /sdcard/silesia/samba
lz4 1.9.x 60 MB/s 589 MB/s 6790273 93.63 /sdcard/silesia/sao
lz4 1.9.x 93 MB/s 477 MB/s 20139988 48.58 /sdcard/silesia/webster
lz4 1.9.x 143 MB/s 1080 MB/s 8390195 99.01 /sdcard/silesia/x-ray
lz4 1.9.x 217 MB/s 590 MB/s 1227495 22.96 /sdcard/silesia/xml
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)

@Cyan4973
Copy link
Member

You meant to say GCC build is also has downgrade via your local test?

Yes.

My feeling is that the aarch64 environment is not uniform,
and that a performance change observed in one environment
does not guarantee an equivalent change in another aarch64 environment.

Therefore, we can separately observe radically different outcome for the same code modification.

I'll have to be cautious on this front, and only consider changes with clear performance benefits across the board. This can happen sometimes, when removing a useless / redundant operation for example. Another good example is offering code simplification at no performance cost.

I'll have to pass, as this patch does not belong to any of these categories, unfortunately.

@Cyan4973 Cyan4973 closed this May 30, 2019
@parheliamm
Copy link
Author

I thought you should review the checkoffset change which should be a bug fix for fast_dec_loop.

@Cyan4973
Copy link
Member

I thought the checkoffset change was a performance enhancement one ?
You mean it affects correctness too ?

@parheliamm
Copy link
Author

parheliamm commented May 31, 2019

I thought the checkoffset change was a performance enhancement one ?
You mean it affects correctness too ?

Yes, as I mentioned previously, the checkoffset invoke number is much more than v1.8.3
My pull request has 2 parts.

  1. aarch64 changes
  2. checkoffset calls reductions which could improve performance obviously especially on aarch64

If you don't want to enable aarch64 for fastloop, we could pick up checkoffset change to dev branch.

Chenxi

@Cyan4973
Copy link
Member

OK, so this is a performance enhancement suggestion, not a correctness issue.

However, it seems to belong to the category "reduce nb of operations", so could end up being a general benefit on any platform.

I'll have to test this change in isolation though, and come back with numbers.

@parheliamm
Copy link
Author

parheliamm commented May 31, 2019

OK, so this is a performance enhancement suggestion, not a correctness issue.

However, it seems to belong to the category "reduce nb of operations", so could end up being a general benefit on any platform.

I'll have to test this change in isolation though, and come back with numbers.

One more comments, I am not sure whether X86 could see the benefit directly.
aarch64 with gcc build would be better choice.

I can push a new pull request for this change today.

@Cyan4973
Copy link
Member

Cyan4973 commented May 31, 2019

Sure !

I'll try all architectures and compilers with the new PR.

I only have perf for x64 though.

@parheliamm
Copy link
Author

new request #719

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants