enable LZ4_FAST_DEC_LOOP build macro on aarch64 by default #707

Merged 1 commit into lz4:dev on May 14, 2019

Conversation

@prekageo
Contributor

prekageo commented May 6, 2019

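Pull request #645 introduced the build macro LZ4_FAST_DEC_LOOP, which by default enables an optimization only on x86/x64.

I propose enabling this optimization on aarch64 as well. Here are the benchmark results for this pull request on an a1.4xlarge AWS EC2 instance. The columns are compressor, compression speed, decompression speed, compressed size, ratio (compressed size as a percentage of the original), and file name. In each pair, the first row is the current dev branch and the second is this patch; the final percentage is how much faster this patch decompresses than the current dev branch.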

./lzbench -elz4 silesia/*

lz4 1.8.3                 148 MB/s  1335 MB/s     6428742  63.07 silesia/dickens    
lz4 1.8.3                 147 MB/s  1319 MB/s     6428742  63.07 silesia/dickens    -1%

lz4 1.8.3                 229 MB/s  1503 MB/s    26435667  51.61 silesia/mozilla    
lz4 1.8.3                 229 MB/s  1630 MB/s    26435667  51.61 silesia/mozilla    8%

lz4 1.8.3                 216 MB/s  1350 MB/s     5440937  54.57 silesia/mr         
lz4 1.8.3                 216 MB/s  1416 MB/s     5440937  54.57 silesia/mr         5%

lz4 1.8.3                 406 MB/s  1686 MB/s     5533040  16.49 silesia/nci        
lz4 1.8.3                 408 MB/s  1766 MB/s     5533040  16.49 silesia/nci        5%

lz4 1.8.3                 190 MB/s  1314 MB/s     4338918  70.53 silesia/ooffice    
lz4 1.8.3                 193 MB/s  1481 MB/s     4338918  70.53 silesia/ooffice    13%

lz4 1.8.3                 198 MB/s  1238 MB/s     5256666  52.12 silesia/osdb       
lz4 1.8.3                 199 MB/s  1433 MB/s     5256666  52.12 silesia/osdb       16%

lz4 1.8.3                 167 MB/s  1194 MB/s     3181387  48.00 silesia/reymont    
lz4 1.8.3                 168 MB/s  1137 MB/s     3181387  48.00 silesia/reymont    -5%

lz4 1.8.3                 265 MB/s  1493 MB/s     7716839  35.72 silesia/samba      
lz4 1.8.3                 262 MB/s  1565 MB/s     7716839  35.72 silesia/samba      5%

lz4 1.8.3                 191 MB/s  1379 MB/s     6790273  93.63 silesia/sao        
lz4 1.8.3                 205 MB/s  1570 MB/s     6790273  93.63 silesia/sao        14%

lz4 1.8.3                 173 MB/s  1296 MB/s    20139988  48.58 silesia/webster    
lz4 1.8.3                 173 MB/s  1350 MB/s    20139988  48.58 silesia/webster    4%

lz4 1.8.3                 379 MB/s  2524 MB/s     8390195  99.01 silesia/x-ray      
lz4 1.8.3                 392 MB/s  2675 MB/s     8390195  99.01 silesia/x-ray      6%

lz4 1.8.3                 335 MB/s  1412 MB/s     1227495  22.96 silesia/xml        
lz4 1.8.3                 336 MB/s  1511 MB/s     1227495  22.96 silesia/xml        7%
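
The trailing percentages can be reproduced from the decompression column of each pair. The following is a minimal sketch, assuming the figure is the relative change in decompression speed rounded to the nearest whole percent; `speedup_percent` is a hypothetical helper and not part of lzbench or lz4.

```c
/* Relative change in decompression speed between two runs, rounded to the
 * nearest whole percent. Both speeds are in MB/s; `dev` must be non-zero.
 * Hypothetical helper for illustration only. */
int speedup_percent(double dev, double patched)
{
    double pct = (patched - dev) / dev * 100.0;

    /* Round half away from zero without needing <math.h>. */
    return (int)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

For silesia/mozilla, `speedup_percent(1503, 1630)` gives 8, and for silesia/reymont, `speedup_percent(1194, 1137)` gives -5, which matches the figures listed above.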

@Cyan4973
Member

Cyan4973 commented May 6, 2019

Your results are in line with several of our observations.

However, aarch64 is not a "unified" world: the outcome varies with the exact chipset and compiler.

In general, server-class aarch64 chips tend to benefit, while results on mobile-class chips vary. A particularly bad case is compiling with clang for a mobile Qualcomm chipset, where performance drops by up to 30%. On the same chipset, gcc performance is neutral, while on a different mobile chipset (Exynos), the same clang compiler gives a small speed benefit.

So I believe we need a more precise condition than just aarch64, which covers too large a family of cases.

@prekageo
Contributor Author

prekageo commented May 6, 2019

I see your point. What if we enable the build macro only for gcc && aarch64?
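
A "gcc && aarch64" condition could be written in the preprocessor roughly as follows. This is a minimal sketch, not necessarily the patch that was merged. It assumes the default is chosen inside an `#ifndef` guard, so that an explicit `-DLZ4_FAST_DEC_LOOP=0` or `-DLZ4_FAST_DEC_LOOP=1` on the compiler command line still takes precedence, and it relies only on the compiler-predefined macros `__i386__`, `__x86_64__`, `__aarch64__`, `__GNUC__` and `__clang__`. The function `fast_dec_loop_default` is hypothetical and exists only to make the selected value observable.

```c
/* Pick a default only if the builder has not set the macro explicitly,
 * e.g. with -DLZ4_FAST_DEC_LOOP=0 on the compiler command line. */
#ifndef LZ4_FAST_DEC_LOOP
#  if defined(__i386__) || defined(__x86_64__)
#    define LZ4_FAST_DEC_LOOP 1   /* existing default: x86/x64 */
#  elif defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
     /* clang also defines __GNUC__, so it has to be excluded explicitly */
#    define LZ4_FAST_DEC_LOOP 1   /* proposed: aarch64 built with gcc */
#  else
#    define LZ4_FAST_DEC_LOOP 0   /* everything else keeps the regular loop */
#  endif
#endif

/* Hypothetical helper: reports which value this build ended up with. */
int fast_dec_loop_default(void)
{
    return LZ4_FAST_DEC_LOOP;
}
```

Because clang defines `__GNUC__` for compatibility, testing `__GNUC__` alone would not separate the two compilers; the extra `!defined(__clang__)` check is what restricts the aarch64 default to gcc.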

@Cyan4973
Member

Cyan4973 commented May 6, 2019

Well, at least it would match our experiments.
This doesn't guarantee that it's always a good choice, but I guess we have to start "somewhere".

@parheliamm

I will try this on the kernel LZ4 module with a GCC build to see how it behaves.

@prekageo
Contributor Author

Sounds like a good idea. Let us know how it goes.
@Cyan4973: have you had time to review the updated patch?

@Cyan4973
Member

The patch looks fine, @prekageo.
Sorry for the delay, I'm a bit overwhelmed these days.

@Cyan4973 Cyan4973 merged commit df24514 into lz4:dev May 14, 2019
EcrosoftXiao pushed a commit to EcrosoftXiao/kernel_xiaomi_mars that referenced this pull request Jun 28, 2022
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
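
The fio figures in the commit message above are internally consistent and can be cross-checked with a little arithmetic. The following is a rough sketch, assuming the 4000MiB total comes from `--numjobs=4` jobs each reading `--size=1000M`, and that fio reports MiB/s in units of 2^20 bytes and MB/s in units of 10^6 bytes. Both function names are hypothetical.

```c
/* Bandwidth in MiB/s from the total amount read (MiB) and the runtime (ms).
 * Hypothetical helper for illustration only. */
double bandwidth_mib_per_s(double total_mib, double runtime_ms)
{
    return total_mib / (runtime_ms / 1000.0);
}

/* Convert MiB/s (2^20 bytes) to MB/s (10^6 bytes). */
double mib_per_s_to_mb_per_s(double mib_per_s)
{
    return mib_per_s * 1048576.0 / 1000000.0;
}
```

With the vanilla run, `bandwidth_mib_per_s(4000, 622)` is about 6431 MiB/s and `mib_per_s_to_mb_per_s(6431)` is about 6743 MB/s, which agrees with the reported line. By the same arithmetic, 6932 MiB/s against 6431 MiB/s is roughly 7.8% higher bandwidth for this particular pair of runs.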
jjpprrrr added a commit to PixelExperience-Devices/kernel_xiaomi_polaris that referenced this pull request Jun 28, 2022
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sdm845 with LLVM Clang 15, this patch does offer a nice
5-10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1282k, BW=5006MiB/s (5249MB/s)(4000MiB/799msec)
- lz4 fast dec: read: IOPS=1382k, BW=5398MiB/s (5660MB/s)(4000MiB/741msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
sileshn pushed a commit to sileshn/android_kernel_oneplus_sdm845 that referenced this pull request Jul 1, 2022
unknownbaka pushed a commit to unknownbaka/utopia_kernel_polaris that referenced this pull request Jul 2, 2022
unknownbaka pushed a commit to unknownbaka/utopia_kernel_polaris that referenced this pull request Jul 4, 2022
Kurosweet04 pushed a commit to KuroSeinenbutV2/Quantum-Kyaru-sm6150 that referenced this pull request Jul 4, 2022
Kyuofox pushed a commit to xdroid-devices/xd_kernel_xiaomi_dipper that referenced this pull request Jul 6, 2022
Ultra119 pushed a commit to Ultra119/kernel_lenovo_sdm710 that referenced this pull request Jul 6, 2022
Ultra119 pushed a commit to Ultra119/kernel_lenovo_sdm710 that referenced this pull request Jul 7, 2022
Kyuofox pushed a commit to Kyuofox/android_kernel_xiaomi_sdm845 that referenced this pull request Jul 8, 2022
Kyuofox pushed a commit to Kyuofox/android_kernel_xiaomi_sdm845 that referenced this pull request Jul 9, 2022
Kyuofox pushed a commit to Kyuofox/android_kernel_xiaomi_sdm845 that referenced this pull request Jul 10, 2022
Kyuofox pushed a commit to Kyuofox/android_kernel_xiaomi_sdm845 that referenced this pull request Jul 13, 2022
Kaz205 pushed a commit to Kaz205/renoir that referenced this pull request Jul 15, 2022
sabarop pushed a commit to sabarop/kernel_xiaomi_sdm660 that referenced this pull request Jun 15, 2024
Change-Id: Ie5c1671068770758d0557f3ec00f1e7545d28b4e
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
ptxxp pushed a commit to ptxxp/qcom_sdm845 that referenced this pull request Jun 15, 2024
clarencelol pushed a commit to clarencekopitiam/kernel_xiaomi_sm6250 that referenced this pull request Jun 15, 2024
ElectroPerf pushed a commit to aospa-pong/msm-5.10 that referenced this pull request Jun 16, 2024
Change-Id: Ib284c1688942109ec12ccf62998d70f55cbc7296
ElectroPerf pushed a commit to aospa-pong/msm-5.10 that referenced this pull request Jun 16, 2024
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
clarencelol pushed a commit to clarencekopitiam/kernel_xiaomi_sm6250 that referenced this pull request Jun 16, 2024
ptxxp pushed a commit to ptxxp/qcom_sdm845 that referenced this pull request Jun 16, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sdm845 with LLVM Clang 15, this patch does offer a nice
5-10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-word zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1282k, BW=5006MiB/s (5249MB/s)(4000MiB/799msec)
- lz4 fast dec: read: IOPS=1382k, BW=5398MiB/s (5660MB/s)(4000MiB/741msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
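
As a quick sanity check on the numbers quoted in these commit messages, the relative gain can be recomputed from the fio bandwidth figures. The snippet below is only an illustrative sketch: the helper name `speedup_percent` and the `RESULTS` table are made up for this example, and the inputs are the MiB/s values copied from the results above. On these figures the gain works out to roughly 7.8% on both SoCs, somewhat below the "10%" in the sm8350 message and inside the "5-10%" range in the sdm845 message.

```python
def speedup_percent(baseline: float, patched: float) -> float:
    """Relative improvement of `patched` over `baseline`, in percent."""
    return (patched - baseline) / baseline * 100.0


# Read bandwidth in MiB/s, copied from the fio results quoted above:
# (vanilla lz4, lz4 with the fast decode loop enabled).
RESULTS = {
    "sm8350": (6431, 6932),
    "sdm845": (5006, 5398),
}

for soc, (vanilla, fast_dec) in RESULTS.items():
    print(f"{soc}: {speedup_percent(vanilla, fast_dec):.1f}% faster")
```

The same calculation on the reported IOPS (1646k vs. 1775k, 1282k vs. 1382k) gives essentially the same ratio, which is expected since every request uses a fixed 4K block size.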
Pzqqt pushed a commit to Pzqqt/android_kernel_xiaomi_marble that referenced this pull request Jun 16, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
chiteroman pushed a commit to chiteroman/kernel_vayu that referenced this pull request Jun 16, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
ptxxp pushed a commit to ptxxp/qcom_sdm845 that referenced this pull request Jun 17, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sdm845 with LLVM Clang 15, this patch does offer a nice
5-10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1282k, BW=5006MiB/s (5249MB/s)(4000MiB/799msec)
- lz4 fast dec: read: IOPS=1382k, BW=5398MiB/s (5660MB/s)(4000MiB/741msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
clarencelol pushed a commit to clarencekopitiam/kernel_xiaomi_sm6250 that referenced this pull request Jun 17, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
herobuxx pushed a commit to liliumproject/kernel_xiaomi_vayre that referenced this pull request Jun 19, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Change-Id: Ie5c1671068770758d0557f3ec00f1e7545d28b4e
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: HeroBuxx <me@herobuxx.me>
TIMISONG-dev pushed a commit to TIMISONG-dev/kernel_xiaomi_sm8250 that referenced this pull request Jun 19, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Change-Id: Ie5c1671068770758d0557f3ec00f1e7545d28b4e
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
TIMISONG-dev pushed a commit to TIMISONG-dev/kernel_xiaomi_sm8250 that referenced this pull request Jun 20, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Change-Id: Ie5c1671068770758d0557f3ec00f1e7545d28b4e
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
TogoFire pushed a commit to dev-sm8350/kernel_oneplus_sm8350 that referenced this pull request Jun 21, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Change-Id: I94a04ce371b2459db59be44e35fbaae14f35b941
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
(cherry picked from commit 9debe32)
Signed-off-by: TogoFire <togofire@mailfence.com>
Hanamizaki pushed a commit to Hanamizaki/android_kernel_oneplus_sm8450 that referenced this pull request Jun 21, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Tkiliay pushed a commit to Amiya-project/android_kernel_xiaomi_sm8250 that referenced this pull request Jun 22, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
henricaodopao1 pushed a commit to Zenfone-5-X00QD-4-19/kernel_asus_sdm660 that referenced this pull request Jun 22, 2024
Upstream lz4 mentioned a performance regression on Qualcomm SoCs
when built with Clang, but not with GCC [1]. However, according to my
testing on sm8350 with LLVM Clang 15, this patch does offer a nice
10% boost in decompression, so enable the fast dec loop for Clang
as well.

Testing procedure:
- pre-fill zram with 1GB of real-world zram data dumped under memory
  pressure, for example
  $ dd if=/sdcard/zram.test of=/dev/block/zram0 bs=1m count=1000
- $ fio --readonly --name=randread --direct=1 --rw=randread \
  --ioengine=psync --randrepeat=0 --numjobs=4 --iodepth=1 \
  --group_reporting=1 --filename=/dev/block/zram0 --bs=4K --size=1000M

Results:
- vanilla lz4: read: IOPS=1646k, BW=6431MiB/s (6743MB/s)(4000MiB/622msec)
- lz4 fast dec: read: IOPS=1775k, BW=6932MiB/s (7269MB/s)(4000MiB/577msec)

[1] lz4/lz4#707

Change-Id: Ie5c1671068770758d0557f3ec00f1e7545d28b4e
Signed-off-by: Chenyang Zhong <zhongcy95@gmail.com>
Signed-off-by: HeroBuxx <me@herobuxx.me>