
First SVE kernels #556

Merged
merged 78 commits into from
Mar 31, 2022

Conversation

AntonioNoack
Contributor

Hello,

I'm currently working on SVE support for the TPP kernels. I've heard from Alex Breuer that there will be a second team working on SVE, so I shouldn't wait to push my progress until all kernels are working.

I've currently implemented set-zero, copy, increment, square, 1/x, and sqrt.
This pull request also adds a few SVE instructions for these that have not been part of LIBXSMM so far.

As a testing suite for performance and correctness, I added a samples/sve directory, but Mr. Breuer told me that it may be possible to integrate the tests into the samples/eltwise directory in the future.

Known/found issues:

  • the A64FX architecture is not automatically detected
  • only the unary SVE kernels that I've implemented currently work on the A64FX architecture
  • the 1/x function for ASIMD seems to be only an approximation
  • when switching architectures, the kernel cache is not cleared automatically

Thank you,
Antonio Noack

@FreddieWitherden
Contributor

Some of the instructions overlap with PR #552. I am also led to believe that several of the added intrinsics have extremely poor performance on the A64FX (sqrt in particular).

samples/sve/sve.cpp (review thread, resolved)
src/generator_aarch64_instructions.c (review thread, resolved)
src/generator_aarch64_instructions.c (review thread, resolved)
src/generator_aarch64_instructions.h (review thread, resolved)
src/generator_mateltwise.c (review thread, resolved)
/* First, determine the vlen compute based on the add the architecture check in here D-, move it to the microkernel config */
if ( LIBXSMM_DATATYPE_I8 == LIBXSMM_GETENUM_INP( i_mateltwise_desc->datatype2 )) {
Collaborator

I guess with these changes all the unit tests in samples/eltwise pass for ASIMD

@alheinecke
Collaborator

I still see several items of the discussion open; can we please address them, and then I'll do another review.

@AntonioNoack
Contributor Author

@alheinecke what do you suggest regarding the precision issues of 1/x? I found the same issues for 1/sqrt.
Both functions currently use an estimate-only instruction on ASIMD and have relative errors of about 5% without Newton iterations.

For SVE, I've implemented a single iteration for 1/x, which gives close-to-perfect results.
For 1/sqrt, the error is reduced roughly 50x with every iteration (fp32; <= 3 iterations, a 4th doesn't bring improvement).

What is the target for the relative error within LIBXSMM? Or should the precision maybe be user-configurable?

@AntonioNoack
Contributor Author

I hope most things are resolved now :)

Collaborator

@alheinecke left a comment

Can we move the performance tester to the eltwise perf tester? That would also benefit x86 and NEON.

samples/sve/sve_performance_test.c (review thread, resolved)
@alheinecke
Collaborator

Regarding Newton iterations: even x86 doesn't use all bits... I think we need a comparison across all of them (right now we just use approximations everywhere, and we are not even numerically compatible between Intel and AMD chips), but for DL applications we are fine. We have flags, so we could specify ULP here; that would be exactly what you suggest with user-defined precision. That means if the flag is not set, it's just an approximation, and otherwise we can specify the number of Newton steps or a ULP target etc. (to be defined).

@AntonioNoack
Contributor Author

For the iterations/ULP precision, I've added the function libxsmm_get_ulp_precision() to generator_common.c. I've placed it in common because other kernels and other architectures might use it as well.
The function uses the newly defined flag LIBXSMM_ULP_PRECISION, which would be an estimated target precision.

As a default, I've currently set 1e4, which kind of expresses that the default implementation doesn't care that much about high precision. I'm just not sure about that magic value. Should I/we use infinity instead?

In my mind, infinity would be a little awkward, because it might imply that the user doesn't care about the result at all (which would be false, of course).

Collaborator

@alheinecke left a comment

Some minor comments, and I would prefer integrating the benchmark directly into the unary correctness tester

src/generator_aarch64_instructions.h (review thread, resolved)
src/generator_common.c (review thread, resolved)
…struction aliases without P for predicated functions
Collaborator

@alheinecke left a comment

As mentioned in the issue you opened, and after reviewing this one, it might be good to open 2 PRs against master ASAP:
a) the CPUID changes and the broken incl. (I can also fix with a commit if you don't want credit for it)
b) the benchmarking infrastructure for eltwise. We have more changes coming as we make all tests weakly typed right now, so there is another round of changes needed. We hope to have the weakly typed eltwise tester in master by Sunday.

@alheinecke
Collaborator

Are we ready to resolve the conflicts and attempt a merge?

@AntonioNoack
Contributor Author

Yes :). I already did, but I am not yet 100% sure that everything still works. It compiles, and sqrt still works (so maybe it's fine).

@AntonioNoack
Contributor Author

About my performance tests, I am not really sure. Because there were tons of changes in the correctness checks, I just replaced them with what was in the aarch64_eltwise branch (except that I didn't delete eltwise_perf_tester.h yet).

@AntonioNoack
Contributor Author

AntonioNoack commented Mar 24, 2022

For example, I saw abe03b5, and it would have to be integrated back in again.
Which probably means that the switch-case statement (over the kernel types, around line 660 in generator_mateltwise_unary_binary_aarch64.c) should become an else-if cascade again too.

@AntonioNoack
Contributor Author

I've fixed the tanh issue and the switch-case-to-else-if issue.
I also added the tests back in, as a part of eltwise_common.h as discussed.

I know the tests would have been better as a separate pull request, but it's quite tedious to find out which exact commands to use to create a branch from a remote repo 😅 (and unfortunately, last time I didn't write it down). If it needs to be separate, so you can merge the SVE stuff into aarch64_eltwise and the tests into main, I could take a look at it as well.

Collaborator

@alheinecke left a comment

Please address the minor issues, and please put a summary comment on what doesn't yet work on SVE

@@ -381,7 +373,7 @@ int main(int argc, char* argv[])
}
for (_i = 0; _i < m; _i++) {
i = _i;
if (bcast_factor > 0) {
if (bcast_factor > 1) {
Collaborator

Was this a bug?

Contributor Author

@AntonioNoack Mar 31, 2022

No, it is a beautification/simplification.
We can skip dividing by 1 (because it doesn't change anything), so we only divide if bcast_factor > 1 (i.e. >= 2).

@@ -390,73 +382,62 @@ int main(int argc, char* argv[])
vec_in = inp_matrix2;
}

if (op != OP_COPY) {
Collaborator

I'm not sure I understand the point of all these changes; can you elaborate, please?

Contributor Author

The purpose of these changes is purely cosmetic. The old if-else layout was something like

if(!copy){
  if(add) { ... }
  if(mul){ ... }
  if(max){ ... }
} else { copy }

and I replaced it with

if(copy){}
else if(add){}
else if(mul){}
else if(max){}

These branches are all mutually exclusive, and I think my layout better represents that "copy" is just a type like the others (e.g. "mul" or "add"). The first layout makes "copy" look special and suggests that multiple cases might be true (which is not the case).

if( (l_vec_instr & LIBXSMM_AARCH64_INSTR_SVE_IS_DESTRUCTIVE) == LIBXSMM_AARCH64_INSTR_SVE_IS_DESTRUCTIVE ){
if( i_vec_reg_src_0 != i_vec_reg_dst ){
if(i_vec_reg_src_1 == i_vec_reg_dst &&
(l_vec_instr == LIBXSMM_AARCH64_INSTR_SVE_FMAX_V_P ||
Collaborator

Is there a way to avoid checking the instruction type?

Contributor Author

Kind of.
We can force the programmer to use the correct argument order, and just crash if it is incorrect.

But I'd prefer that we detect when it doesn't matter (and that for div another instruction can be used), and continue.

In the reduce kernels, there are a lot of calls to these instructions where ASIMD didn't care about argument order. It might be fixable in those kernels, but every case would have to be checked.

/* arguments can be switched around */
/* we assign 0 <- 1 anyways, so we just skip that */
} else if(i_vec_reg_src_1 == i_vec_reg_dst &&
(l_vec_instr == LIBXSMM_AARCH64_INSTR_SVE_FDIV_V_P ||
Collaborator

ditto

Contributor Author

ditto (same arguments as for fmul/fmax/fmin)

@AntonioNoack
Contributor Author

Summary on what is (or should be) working:

Unary:

  • copy
  • xor (set zero)
  • inc
  • neg
  • 1/x
  • 1/sqrt(x)
  • sqrt(x)
  • tanh, tanh-inv
  • sigmoid, sigmoid-inv
  • exponential
  • gelu, gelu-inv (currently only on 512 bit vector length; should be relatively easy to adapt with TBX)
  • elu, elu-inv
  • relu, relu-inv
  • leaky-relu, leaky-relu-inv
  • dropout, dropout-inv

Binary:

  • add
  • mul
  • sub
  • div
  • muladd

Reduce:

  • sum, sum-squared, max

For all kernels, SVE should have the same support for additional flags, and should support everything that ASIMD supports.

Examples of what neither ASIMD nor SVE support:

  • reduction with MIN (should be relatively easy to add, if needed)
  • binary type DOT
  • double precision in some cases

@alheinecke alheinecke merged commit 0080d03 into libxsmm:aarch64_eltwise Mar 31, 2022
@alheinecke
Collaborator

Thanks for your contribution!

alheinecke added a commit that referenced this pull request Apr 14, 2022
* First SVE kernels (#556)

* test whether libxsmm compiles files automatically; also added files for first sve functions

* working on sve; also improved readability of some asimd functions

* first sve commands integrated into TPP workflow: load+store for copy

* implemented SVE kernels: set zero, x², increment (v+=1), reciprocal; added benchmarking

* added sqrt sve kernel; fixed benchmark: libxsmm was not switching architectures, somehow

* removed those functions, that already existed in libxsmm

* added from-x86-missing-code back in

* undoing my simplification, as this complexity may be needed in the future

* removed switch-case from instruction-builder: replaced it with flags

* fixed instruction generation for sqrt and inc

* implemented 1/sqrt and found another precision issue

* tested more iterations for 1/sqrt

* converted performance tests to C, and patched Makefile for it

* added missing x86 block back in

* fixed typo & changed order for sve-add-with-immediate

* moved performance tests into eltwise directory, and made them work with other architectures as well

* added ulp-precision standard/function for future kernels

* added clamp that may be necessary, if the log-result would be negative

* worked a bit on getting the broadcast working for sve; removed old instruction aliases without P for predicated functions

* rewrote benchmark to be a part of the correctness tests; added it to unary-simple and binary-simple

* changed ulp_precision to only have 3 states instead of being a continuous value

* renamed eltwise_perf_tester.c to .h

* converted the benchmark defines into macro arguments

* added a README, so others can use these tests easier

* marked environment variables in bold for better readability

* tried to understand and implement bcastload; found bug in ASIMD implementation and fixed it (when i_masked_elems=2, it set elem 1&2 to zero, not 2&3)

* implemented bcastload

* fixed speedup-log indentation

* preparing for more supported sve-kernels

* reverted change (I was incorrect about the bug)

* added some instructions for exp(x)

* reverted change (wasn't actually a bug, I was wrong)

* implemented exp(x) for SVE

* set predicate register for exp(x)

* implemented tanh(x) for SVE

* implemented tanh_inv for sve

* fixed instructions in tanh,tanh-inv,exp,exp-inv

* started GeLU, implemented Xoshiro for Dropout

* added performance tests to eltwise/unary/relu

* replaced 0/1/2 with named constants for readability

* improving the relu tests a bit more: added forgotten row to tests & fixed spelling mistake

* implemented 2bytemult-flag (imperfect at the moment)

* implemented 2bytemult-flag (imperfect at the moment)

* adjusted instruction names for spgemm_csr_asparse

* worked on gelu/leaky relu a little

* added performance tests to dropout

* fixed elu

* merged load/store functions for masking

* refactoring: splitting math functions, which contain asimd & sve types

* fixed small memory leak in testing & removed warnings

* made loading/storing in sve more efficient by setting the predicate registers less often

* fixed typo & gave reduce ops names

* working on gelu and dropout

* added missing functions in header file

* fixed two issues that resulted from the merge

* fixed two issues from merging: unused variables & invalid include

* fixed SVE LSR instruction encoding, got GeLU-inverse working

* GeLU is now working for SVE

* implemented dropout for SVE - working except for masks in forward-mode; added memcpy sve function for ReLU fix

* fixed dropout-forward-mask-issue: the masks need to be tested before the benchmark runs a random number of turns, and changes the seed state every time...

* multiple tries to get quick loading/storing of masks working with SVE

* removed a non-working just-tested implementation

* improved inc-sve performance, and added approximate sqrt for better performance

* worked on reduce kernels for sve and got the first working; also wrote a small line of help for opreduce-test

* removed debug printfs, tested reduce kernels & fixed issue with max-type

* removed debug printfs, tested reduce kernels & fixed issue with max-type

* found & fixed another sve-reduce bug (was caused by max of neg numbers using zero-inited elements)

* adjusted license notes

* converted huge switch-statement back to if-else-cascade

* added changes to asimd-1/x and tanh, that were committed to master for better accuracy

* added performance tests back in

Co-authored-by: Antonio Noack <ri-anoack@c5n0.head.gw4.metoffice.gov.uk>

* attempt to fix CI issues

* more CI fixes

* more CI fixes

* Fixing bug in dropout eltwise test (#619)

SVE would otherwise be detected as vector length 4. This should theoretically be correct for all architectures, but I'm not sure for x86.

* Changed some ENUM_INP/OUT to the new API call

* Fixed compilation issue

* fixed outstanding issues after merge of master into aarch64_eltwise

* updated SVE test script

* attempt to apply #628 manually

* updated aarch64_sve scripts

Co-authored-by: Antonio <antonio-noack@gmx.de>
Co-authored-by: Antonio Noack <ri-anoack@c5n0.head.gw4.metoffice.gov.uk>
Co-authored-by: egeorgan <evangelos.georganas@intel.com>