Learning About Vector API

Proof-of-concept test using the new Vector API (JEP 338). Vectorized code is compared against already-optimized code from EJML and BoofCV. The following operations are covered:

Matrix Multiplication IKJ Order (double)
Image Convolution (float)
Image Thresholding (unsigned byte)
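
For readers unfamiliar with the term, "IKJ order" refers to how the three loops of the matrix product C = A*B are nested. The following is a minimal scalar sketch, assuming row-major `double[]` storage. The class and method names are hypothetical, and this is not the code from EJML or from this repository.

```java
/** Illustrative only: scalar IKJ matrix multiply on row-major flat arrays. */
class IkjMultiplySketch {
    /**
     * Computes C = A * B, where A is (rowsA x colsA) and B is (colsA x colsB).
     * All matrices are stored row-major in one-dimensional arrays.
     */
    static double[] multiplyIkj(double[] a, double[] b, int rowsA, int colsA, int colsB) {
        double[] c = new double[rowsA * colsB]; // Java zero-initializes new arrays
        for (int i = 0; i < rowsA; i++) {
            int rowC = i * colsB;
            for (int k = 0; k < colsA; k++) {
                double aik = a[i * colsA + k]; // fixed for the whole inner loop
                int rowB = k * colsB;
                // Innermost loop walks B and C contiguously
                for (int j = 0; j < colsB; j++) {
                    c[rowC + j] += aik * b[rowB + j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4}; // 2x2
        double[] b = {5, 6, 7, 8}; // 2x2
        System.out.println(java.util.Arrays.toString(multiplyIkj(a, b, 2, 2, 2)));
    }
}
```

Because the innermost loop reads a row of B and updates a row of C at consecutive indexes, it is the kind of loop that maps naturally onto SIMD lanes.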

To run the benchmark, type the command below. The first time you run it there will be a lot of downloads. If you don't have JDK 16 installed, it will be downloaded for you automatically. Once the actual benchmark starts running, it takes about 12 minutes to complete.

./gradlew runtimeBenchmark

If you load this up in your favorite IDE (in my case IntelliJ), you're highly likely to experience issues. This project uses a bleeding-edge version of Gradle with a bleeding-edge JDK and a new API.

Learning About Vector API

https://richardstartin.github.io/posts/vectorised-algorithms-in-java

Results

Setup

OpenJDK 64-Bit Server VM AdoptOpenJDK (build 16+36, mixed mode, sharing)
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Ubuntu 18.04.5 LTS

Summary

Operation                    | Data |     Size     |  Relative   | Notes
                             | Type |              | Performance |
-----------------------------------------------------------------------------------------
Matrix Mult IKJ Real         |   D  | Large Matrix |    1.84     | [1]
Matrix Mult IKJ Real         |   D  | Small Matrix |    0.86     | [2]
Matrix Mult IKJ Complex      |   D  | Large Matrix |             | Vector code needed
Matrix Mult IKJ Complex      |   D  | Small Matrix |             | Vector code needed
Image 1D Conv                |   F  | Large kernel |    1.82     |
Image 1D Conv                |   F  | Small kernel |    1.86     |
Image 1D Conv  BoofCV        |   F  | Small kernel |    0.41     | [3] Compared to unrolled
Image 1D Mean                |   F  |              |             | Vector code needed
Image Threshold              |  U8  |              |    6.78     | [4]
Image Histogram              |  U16 |              |             | Vector code needed
YUV 420 888 to RGB           |   D  |              |             | Vector code needed
Image Debayer                |   D  |              |             | Vector code needed

Unless otherwise stated, relative performance is the baseline code's time divided by the vectorized code's time. Values > 1 mean the vectorized code was faster and values < 1 mean it was slower. In some cases unrolled code from EJML and BoofCV has been included to provide a point of comparison.
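
As a worked example of that ratio, the sketch below recomputes three of the summary entries from the raw JMH scores in this README. The class and method names are hypothetical and exist only for illustration.

```java
/** Illustrative only: how the "Relative Performance" column can be derived. */
class RelativePerformanceSketch {
    /** Baseline time divided by vectorized time; a value above 1 means the vectorized code was faster. */
    static double relative(double baselineNsPerOp, double vectorNsPerOp) {
        return baselineNsPerOp / vectorNsPerOp;
    }

    public static void main(String[] args) {
        // Scores in ns/op, copied from the JMH output in this README
        System.out.printf("matrix_mult vs matrix_mult_vectors (1000): %.2f%n",
                relative(606881005.900, 329232594.200));
        System.out.printf("matrix_mult vs matrix_mult_vectors (4):    %.2f%n",
                relative(104.410, 121.820));
        System.out.printf("image_threshold vs image_threshold_vector_v2: %.2f%n",
                relative(345424.878, 50925.242));
    }
}
```

These three ratios round to the 1.84, 0.86, and 6.78 shown in the summary table.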

Benchmark                                       (kernelSize)  (size)  Mode  Cnt           Score           Error  Units
BenchmarkOperations.convolve_horizontal                    5     N/A  avgt    5     8080015.121 ±    169251.559  ns/op
BenchmarkOperations.convolve_horizontal                   31     N/A  avgt    5    24767084.561 ±    462767.053  ns/op
BenchmarkOperations.convolve_horizontal_boofcv             5     N/A  avgt    5     1775128.816 ±     13315.269  ns/op
BenchmarkOperations.convolve_horizontal_boofcv            31     N/A  avgt    5    24833110.727 ±    265814.061  ns/op
BenchmarkOperations.convolve_horizontal_vector             5     N/A  avgt    5     4351633.285 ±     29120.843  ns/op
BenchmarkOperations.convolve_horizontal_vector            31     N/A  avgt    5    13615354.944 ±    263422.696  ns/op
BenchmarkOperations.image_threshold                      N/A     N/A  avgt    5      345424.878 ±      6195.410  ns/op
BenchmarkOperations.image_threshold_vector_v1            N/A     N/A  avgt    5      580158.660 ±      8812.190  ns/op
BenchmarkOperations.image_threshold_vector_v2            N/A     N/A  avgt    5       50925.242 ±      2032.203  ns/op
BenchmarkOperations.matrix_mult                          N/A       4  avgt    5         104.410 ±         5.414  ns/op
BenchmarkOperations.matrix_mult                          N/A    1000  avgt    5   606881005.900 ±   3875032.130  ns/op
BenchmarkOperations.matrix_mult_complex                  N/A       4  avgt    5         202.456 ±         1.172  ns/op
BenchmarkOperations.matrix_mult_complex                  N/A    1000  avgt    5  1616543112.600 ± 500480900.857  ns/op
BenchmarkOperations.matrix_mult_ejml                     N/A       4  avgt    5          84.286 ±         2.964  ns/op
BenchmarkOperations.matrix_mult_ejml                     N/A    1000  avgt    5   611016316.100 ±  13669452.444  ns/op
BenchmarkOperations.matrix_mult_vectors                  N/A       4  avgt    5         121.820 ±         1.471  ns/op
BenchmarkOperations.matrix_mult_vectors                  N/A    1000  avgt    5   329232594.200 ±  10054084.639  ns/op
BenchmarkOperations.mean_horizontal                      N/A     N/A  avgt    5     2188929.877 ±     19603.001  ns/op

[1] I would expect a well-written C++ port of that same function to run about 2.5x faster than pure Java on large matrices. That's about the performance difference you get when you compare the top-performing pure Java libraries against Eigen or LAPACK. The code used is designed for medium-sized matrices.

[2] This result isn't surprising. Optimizing for small matrices requires a very different approach than optimizing for large ones. One potential improvement for the Vector API would be to allow recycling of memory. More hand optimization of the loops could reduce the gap. While the current API is easy to use, it's clobbering the innermost loop with calls to new. That's a big no-no when writing high-performance code. I could be wrong; maybe there's some specialized code that recognizes what's going on and recycles memory. Small-matrix performance is critical in computer vision and signal processing.

[3] BoofCV includes code that, when the kernel is small, invokes an unrolled implementation. This typically results in a massive speedup. I wish the JVM were better at recognizing when to unroll a loop, so I wouldn't need to write all this auto-generated code.
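
To illustrate what "unrolled" means here, the sketch below contrasts a generic 1D horizontal convolution with a version hand-unrolled for a 5-element kernel. It is a simplified illustration with hypothetical names, not BoofCV's generated code. Borders are skipped by only computing outputs where the kernel fits entirely inside the row, and the kernel is applied without flipping.

```java
/** Illustrative only: generic vs. hand-unrolled 1D convolution of a single row. */
class UnrolledConvolveSketch {
    /** Generic version: works for any kernel length. Assumes in.length >= kernel.length. */
    static float[] convolveGeneric(float[] in, float[] kernel) {
        float[] out = new float[in.length - kernel.length + 1];
        for (int i = 0; i < out.length; i++) {
            float sum = 0;
            for (int k = 0; k < kernel.length; k++) {
                sum += in[i + k] * kernel[k];
            }
            out[i] = sum;
        }
        return out;
    }

    /** Unrolled for exactly 5 taps: coefficients live in locals and the inner loop is gone. */
    static float[] convolveUnrolled5(float[] in, float[] kernel) {
        if (kernel.length != 5) {
            throw new IllegalArgumentException("kernel must have 5 elements");
        }
        float k0 = kernel[0], k1 = kernel[1], k2 = kernel[2], k3 = kernel[3], k4 = kernel[4];
        float[] out = new float[in.length - 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = in[i] * k0 + in[i + 1] * k1 + in[i + 2] * k2
                    + in[i + 3] * k3 + in[i + 4] * k4;
        }
        return out;
    }
}
```

The unrolled version has to be written, or generated, once per kernel size, which is the maintenance cost the note above complains about.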

[4] The Vector API doesn't support unsigned bytes yet, and the vector implementation fails the unit test. Based on comments in the JDK, it looks like that will be added.
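
The underlying difficulty is that Java's `byte` is signed, so an 8-bit pixel value of 200 is stored as -56 and a direct comparison gives the wrong answer. The scalar sketch below shows the issue. The names and the "1 where pixel <= threshold" convention are assumptions made for illustration; the benchmark code in this repository may be organized differently.

```java
/** Illustrative only: why thresholding unsigned 8-bit pixels needs care in Java. */
class UnsignedThresholdSketch {
    /** Treats each byte as 0..255 by masking before the comparison. */
    static byte[] thresholdUnsigned(byte[] pixels, int threshold) {
        byte[] out = new byte[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = (byte) ((pixels[i] & 0xFF) <= threshold ? 1 : 0);
        }
        return out;
    }

    /** Compares the signed interpretation; wrong for pixel values above 127. */
    static byte[] thresholdSigned(byte[] pixels, int threshold) {
        byte[] out = new byte[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = (byte) (pixels[i] <= threshold ? 1 : 0);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] pixels = {(byte) 10, (byte) 100, (byte) 200};
        System.out.println(java.util.Arrays.toString(thresholdUnsigned(pixels, 100)));
        System.out.println(java.util.Arrays.toString(thresholdSigned(pixels, 100)));
    }
}
```

With a threshold of 100, the masked version marks the pixel 200 as above the threshold, while the signed comparison sees -56 and marks it as below.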

Author: Peter Abeles

https://twitter.com/NotSoOptimal
