Enable fast decoding on Apple/AArch64 builds (18-25% faster decompression) #1040

zeux · 2021-11-22T19:27:44Z

This makes decoding significantly faster on M1; measured on compressed source
code, decompressing 294 MB to 1301 MB takes 513 ms (2.53 GB/s) before, and
406 ms (3.2 GB/s) after this change on M1 Pro.
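As a sanity check on the figures above, the throughput and the relative gain can be recomputed from the quoted sizes and timings. This is a minimal sketch in C, assuming MB means 10^6 bytes and GB means 10^9 bytes (so MB per ms is numerically GB/s); the function names are hypothetical and not part of lz4.

```c
/* Hypothetical helpers for checking the quoted numbers; not part of lz4. */

/* Throughput in GB/s for `megabytes` of output produced in `millis` ms.
 * Assumes decimal units, so MB/ms is numerically equal to GB/s. */
double throughput_gbps(double megabytes, double millis)
{
    return megabytes / millis;
}

/* Relative throughput gain in percent when the same work drops from
 * `before_ms` to `after_ms`. */
double speedup_percent(double before_ms, double after_ms)
{
    return (before_ms / after_ms - 1.0) * 100.0;
}
```

With the numbers from the description, `throughput_gbps(1301, 513)` is about 2.54 and `throughput_gbps(1301, 406)` is about 3.20, which matches the quoted 2.53 GB/s and 3.2 GB/s up to rounding, and `speedup_percent(513, 406)` comes out at roughly 26%.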

There's no way to check if the target architecture is M1 specifically but the
gains are likely to be similar on recent iterations of Apple processors, and
the original performance issue was probably more specific to Qualcomm.


Cyan4973 · 2021-11-22T21:35:11Z

Well, hopefully, the condition defined(__aarch64__) && defined(__APPLE__) should make it more specific to Apple Silicon implementations of ARM64, and therefore avoid bringing in Qualcomm's (and possibly other vendors) issues.
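The compile-time check mentioned here can be illustrated roughly as follows. This is only a sketch of the condition as quoted in the comment: the macro `SKETCH_FAST_DEC` and the function name are hypothetical, and the other target branches that the real source may contain are deliberately left out.

```c
/* Hypothetical macro name, used only for this sketch. */
#if defined(__aarch64__) && defined(__APPLE__)
#  define SKETCH_FAST_DEC 1   /* 64-bit ARM target on an Apple platform */
#else
#  define SKETCH_FAST_DEC 0   /* everything else, for the purposes of this sketch */
#endif

/* Returns 1 when the Apple/AArch64 branch was selected at compile time. */
int sketch_fast_dec_enabled(void)
{
    return SKETCH_FAST_DEC;
}
```

Because both macros are predefined by the compiler for the target, the choice is made per build rather than per device: any 64-bit ARM Apple target takes the first branch, whichever chip it runs on.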

terrelln · 2021-11-22T21:42:12Z

I measured on my M1 MacBook Air, and see decompression speed go from 3.9 GB/s -> 4.6 GB/s.

> Well, hopefully, the condition defined(__aarch64__) && defined(__APPLE__) should make it more specific to Apple Silicon implementations of ARM64, and therefore avoid bringing in Qualcomm's (and possibly other vendors) issues.

This will also enable it for older iPhones, which ran on Qualcomm chips. I'm not sure how much we care about that. Or even if older iPhones had the mentioned performance issues. It may have been Android devices.

zeux · 2021-11-22T22:36:00Z

I wouldn't be too worried about this - I don't think Apple ever used Qualcomm CPUs, and the last ARM Cortex CPU they used was in the Apple A5 (launched in 2011 and discontinued in 2016). It's hard to know for certain without measuring, but I wouldn't expect this to cause a regression on Apple's more recent hardware.

Cyan4973 · 2021-11-23T00:48:50Z

Thanks @zeux, this looks like a good trade off


Cyan4973 requested a review from terrelln November 22, 2021 21:27

Cyan4973 approved these changes Nov 22, 2021


zeux changed the title ~~Enable fast decoding on Apple/AArch64 builds (25% faster decompression)~~ Enable fast decoding on Apple/AArch64 builds (18-25% faster decompression) Nov 22, 2021

Cyan4973 merged commit db57809 into lz4:dev Nov 23, 2021

zeux deleted the m1-fastdec branch November 23, 2021 04:45

zeux added a commit to zeux/qgrep that referenced this pull request Nov 24, 2021

extern: Update lz4 to latest

c21defe

This includes the M1 optimization PR: lz4/lz4#1040. As a result, qgrep bruteforce queries run 10-15% faster on M1 Pro.
