Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable LZ4_FAST_DEC_LOOP build macro on aarch64 by default #707

Merged
merged 1 commit into from May 14, 2019

Conversation

Projects
None yet
3 participants
@prekageo
Copy link
Contributor

commented May 6, 2019

Pull request #645 has introduced the build macro LZ4_FAST_DEC_LOOP which by default enables an optimization only for x86/x64.

I propose to enable this optimization for aarch64 as well. Here are the benchmark results for this pull request running on a1.4xlarge AWS EC2 instance. The final percent is how much faster this patchset is vs. the current dev branch.

./lzbench -elz4 silesia/*

lz4 1.8.3                 148 MB/s  1335 MB/s     6428742  63.07 silesia/dickens    
lz4 1.8.3                 147 MB/s  1319 MB/s     6428742  63.07 silesia/dickens    -1%

lz4 1.8.3                 229 MB/s  1503 MB/s    26435667  51.61 silesia/mozilla    
lz4 1.8.3                 229 MB/s  1630 MB/s    26435667  51.61 silesia/mozilla    8%

lz4 1.8.3                 216 MB/s  1350 MB/s     5440937  54.57 silesia/mr         
lz4 1.8.3                 216 MB/s  1416 MB/s     5440937  54.57 silesia/mr         5%

lz4 1.8.3                 406 MB/s  1686 MB/s     5533040  16.49 silesia/nci        
lz4 1.8.3                 408 MB/s  1766 MB/s     5533040  16.49 silesia/nci        5%

lz4 1.8.3                 190 MB/s  1314 MB/s     4338918  70.53 silesia/ooffice    
lz4 1.8.3                 193 MB/s  1481 MB/s     4338918  70.53 silesia/ooffice    13%

lz4 1.8.3                 198 MB/s  1238 MB/s     5256666  52.12 silesia/osdb       
lz4 1.8.3                 199 MB/s  1433 MB/s     5256666  52.12 silesia/osdb       16%

lz4 1.8.3                 167 MB/s  1194 MB/s     3181387  48.00 silesia/reymont    
lz4 1.8.3                 168 MB/s  1137 MB/s     3181387  48.00 silesia/reymont    -5%

lz4 1.8.3                 265 MB/s  1493 MB/s     7716839  35.72 silesia/samba      
lz4 1.8.3                 262 MB/s  1565 MB/s     7716839  35.72 silesia/samba      5%

lz4 1.8.3                 191 MB/s  1379 MB/s     6790273  93.63 silesia/sao        
lz4 1.8.3                 205 MB/s  1570 MB/s     6790273  93.63 silesia/sao        14%

lz4 1.8.3                 173 MB/s  1296 MB/s    20139988  48.58 silesia/webster    
lz4 1.8.3                 173 MB/s  1350 MB/s    20139988  48.58 silesia/webster    4%

lz4 1.8.3                 379 MB/s  2524 MB/s     8390195  99.01 silesia/x-ray      
lz4 1.8.3                 392 MB/s  2675 MB/s     8390195  99.01 silesia/x-ray      6%

lz4 1.8.3                 335 MB/s  1412 MB/s     1227495  22.96 silesia/xml        
lz4 1.8.3                 336 MB/s  1511 MB/s     1227495  22.96 silesia/xml        7%
@Cyan4973

This comment has been minimized.

Copy link
Member

commented May 6, 2019

Your results are in line with several of our observations.

However, the issue is, aarch64 is not a "unified" world, and outcome varies, depending on exact chipset and compiler.

In general, server-class aarch64 tend to benefit, but mobile-class depends. A particular bad case is obtained when compiling with clang on mobile Qualcomm chipset. In which case, performance gets down by up to 30%. But on the same chipset, gcc performance is neutral. While on a different mobile chipset (Exynos), the same clang compiler gives some small speed benefits.

So I believe we need something more accurate than just aarch64, which encompasses a too large family of cases.

@prekageo

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

I see your point. What about if we enable the build macro for gcc && aarch64?

@Cyan4973

This comment has been minimized.

Copy link
Member

commented May 6, 2019

Well, at least it would match our experiments.
This doesn't guarantee that it's always a good choice, but I guess we have to start "somewhere".

@prekageo prekageo force-pushed the prekageo:dev branch from 7fb6215 to 605d811 May 7, 2019

@parheliamm

This comment has been minimized.

Copy link

commented May 13, 2019

I will try this on kernel Lz4 module with GCC build to see the behavior.

@prekageo

This comment has been minimized.

Copy link
Contributor Author

commented May 13, 2019

Sounds like a good idea. Let us know how it goes.
@Cyan4973: did you have some time to review the updated patch?

@Cyan4973

This comment has been minimized.

Copy link
Member

commented May 14, 2019

The patch looks fine @prekageo .
Sorry for the delay, I'm a bit overwhelmed these days.

@Cyan4973 Cyan4973 merged commit df24514 into lz4:dev May 14, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.