enable LZ4_FAST_DEC_LOOP build macro on aarch64 by default #707

prekageo · 2019-05-06T14:30:20Z

Pull request #645 introduced the build macro LZ4_FAST_DEC_LOOP, which by default enables an optimization only on x86/x64.

I propose enabling this optimization on aarch64 as well. Here are the benchmark results for this pull request running on an a1.4xlarge AWS EC2 instance. In each pair of rows, the first is the current dev branch and the second is this patchset; the final percentage is how much faster the patchset decompresses than the dev branch.

./lzbench -elz4 silesia/*

lz4 1.8.3                 148 MB/s  1335 MB/s     6428742  63.07 silesia/dickens    
lz4 1.8.3                 147 MB/s  1319 MB/s     6428742  63.07 silesia/dickens    -1%

lz4 1.8.3                 229 MB/s  1503 MB/s    26435667  51.61 silesia/mozilla    
lz4 1.8.3                 229 MB/s  1630 MB/s    26435667  51.61 silesia/mozilla    8%

lz4 1.8.3                 216 MB/s  1350 MB/s     5440937  54.57 silesia/mr         
lz4 1.8.3                 216 MB/s  1416 MB/s     5440937  54.57 silesia/mr         5%

lz4 1.8.3                 406 MB/s  1686 MB/s     5533040  16.49 silesia/nci        
lz4 1.8.3                 408 MB/s  1766 MB/s     5533040  16.49 silesia/nci        5%

lz4 1.8.3                 190 MB/s  1314 MB/s     4338918  70.53 silesia/ooffice    
lz4 1.8.3                 193 MB/s  1481 MB/s     4338918  70.53 silesia/ooffice    13%

lz4 1.8.3                 198 MB/s  1238 MB/s     5256666  52.12 silesia/osdb       
lz4 1.8.3                 199 MB/s  1433 MB/s     5256666  52.12 silesia/osdb       16%

lz4 1.8.3                 167 MB/s  1194 MB/s     3181387  48.00 silesia/reymont    
lz4 1.8.3                 168 MB/s  1137 MB/s     3181387  48.00 silesia/reymont    -5%

lz4 1.8.3                 265 MB/s  1493 MB/s     7716839  35.72 silesia/samba      
lz4 1.8.3                 262 MB/s  1565 MB/s     7716839  35.72 silesia/samba      5%

lz4 1.8.3                 191 MB/s  1379 MB/s     6790273  93.63 silesia/sao        
lz4 1.8.3                 205 MB/s  1570 MB/s     6790273  93.63 silesia/sao        14%

lz4 1.8.3                 173 MB/s  1296 MB/s    20139988  48.58 silesia/webster    
lz4 1.8.3                 173 MB/s  1350 MB/s    20139988  48.58 silesia/webster    4%

lz4 1.8.3                 379 MB/s  2524 MB/s     8390195  99.01 silesia/x-ray      
lz4 1.8.3                 392 MB/s  2675 MB/s     8390195  99.01 silesia/x-ray      6%

lz4 1.8.3                 335 MB/s  1412 MB/s     1227495  22.96 silesia/xml        
lz4 1.8.3                 336 MB/s  1511 MB/s     1227495  22.96 silesia/xml        7%
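
For reference, the percentage column can be reproduced from the two decompression figures in each pair. A minimal sketch in C; the function name `speedup_percent` is made up for illustration and is not part of lzbench or lz4:

```c
#include <assert.h>

/* Percent change in decompression speed of the patched build relative to
 * the dev build. Both arguments are throughputs in MB/s, as printed in the
 * decompression column of the results above. */
double speedup_percent(double dev_mbps, double patched_mbps)
{
    assert(dev_mbps > 0.0);   /* a zero baseline would divide by zero */
    return (patched_mbps - dev_mbps) / dev_mbps * 100.0;
}
```

For silesia/osdb, `speedup_percent(1238, 1433)` comes out at roughly 15.75, which the table rounds to 16%.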

Cyan4973 · 2019-05-06T17:30:13Z

Your results are in line with several of our observations.

However, the issue is that aarch64 is not a "unified" world, and the outcome varies depending on the exact chipset and compiler.

In general, server-class aarch64 chips tend to benefit, but on mobile-class chips it depends. A particularly bad case occurs when compiling with clang for a mobile Qualcomm chipset, where performance drops by up to 30%. On the same chipset, gcc performance is neutral, while on a different mobile chipset (Exynos), the same clang compiler gives some small speed benefits.

So I believe we need something more precise than just aarch64, which covers too large a family of cases.

prekageo · 2019-05-06T20:14:56Z

I see your point. What if we enable the build macro for gcc && aarch64?

Cyan4973 · 2019-05-06T20:50:48Z

Well, at least it would match our experiments.
This doesn't guarantee that it's always a good choice, but I guess we have to start "somewhere".
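
To make the gcc && aarch64 idea concrete, here is a rough sketch of what such a guard could look like. It is an illustration under assumptions, not necessarily the exact code in the patch: `__i386__`, `__x86_64__`, `__aarch64__`, `__GNUC__` and `__clang__` are the usual compiler-predefined macros, and `fast_dec_loop_enabled` is a hypothetical helper added only so the result can be inspected.

```c
/* Sketch only: pick a default for LZ4_FAST_DEC_LOOP unless the build
 * already set it (for example with -DLZ4_FAST_DEC_LOOP=0). */
#ifndef LZ4_FAST_DEC_LOOP
#  if defined(__i386__) || defined(__x86_64__)
#    define LZ4_FAST_DEC_LOOP 1   /* x86/x64 default mentioned in this PR */
#  elif defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
     /* clang also defines __GNUC__, so it has to be excluded explicitly
      * to restrict the default to real gcc builds. */
#    define LZ4_FAST_DEC_LOOP 1
#  else
#    define LZ4_FAST_DEC_LOOP 0
#  endif
#endif

/* Hypothetical helper: reports which default the preprocessor selected. */
int fast_dec_loop_enabled(void)
{
    return LZ4_FAST_DEC_LOOP;
}
```

Because the whole block sits under `#ifndef`, anyone who measures a different result on their chipset and compiler can still force the macro either way from the build command line.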

parheliamm · 2019-05-13T07:15:18Z

I will try this on the kernel LZ4 module with a GCC build to see how it behaves.

prekageo · 2019-05-13T13:39:10Z

Sounds like a good idea. Let us know how it goes.
@Cyan4973: did you have some time to review the updated patch?

Cyan4973 · 2019-05-14T03:35:04Z

The patch looks fine, @prekageo.
Sorry for the delay, I'm a bit overwhelmed these days.

Upstream lz4 mentioned a performance regression on Qualcomm SoCs when built with Clang, but not with GCC [1]. However, according to my testing on sm8350 with LLVM Clang 15, this patch does offer a nice 10% boost in decompression, so enable the fast dec loop for Clang as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
    --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
    --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>

Upstream lz4 mentioned a performance regression on Qualcomm SoCs when built with Clang, but not with GCC [1]. However, according to my testing on sdm845 with LLVM Clang 15, this patch does offer a nice 5-10% boost in decompression, so enable the fast dec loop for Clang as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
    --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
    --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1282k, BW=5006MiB/s (5249MB/s)(4000MiB/799msec)
- lz4 fast dec: read: IOPS=1382k, BW=5398MiB/s (5660MB/s)(4000MiB/741msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>

enable LZ4_FAST_DEC_LOOP build macro on aarch64/GCC by default

605d811

Cyan4973 approved these changes May 14, 2019

Cyan4973 merged commit df24514 into lz4:dev May 14, 2019

parheliamm mentioned this pull request May 24, 2019

LZ4_decompress_generic: perofmrance improvement on AARCH64 #713

Closed
