Intrinsic for the high-performance templated operator [x86] #1047

camierjs · 2019-08-28T17:50:00Z

Enables the performance templated classes to use specific vector intrinsics on the following architectures:

x86 (SSE/AVX/AVX2/AVX512),
Power8 & Power9 (VSX), tested on Lassen with xlc++,
BG/Q (QPX), tested on Vulcan.

It can be enabled with MFEM_USE_SIMD=YES.

Results on Vulcan, master:

Vulcan, x86, with the SIMD enabled:

Vulcan, x86, with SIMD and JIT to use vectorization only when needed:

| PR | Author | Editor | Reviewers | Assignment | Approval | Merge |
| --- | --- | --- | --- | --- | --- | --- |
| #1047 | @camierjs | @tzanio | @tzanio + @v-dobrev | 03/12/20 | 05/18/20 | 05/25/20 |

tzanio · 2020-05-18T23:54:52Z

Merged in next for testing...

tzanio · 2020-05-19T15:07:19Z

There are errors from this PR in the autotest runs on tux429, e.g.

nvcc  -O3 -std=c++11 -x=cu --expt-extended-lambda -arch=sm_60 -ccbin g++ -Xcompiler="-march=native  -Wall" -I../.. -I../../../occa/include -I../../../libCEED/include -I../../../raja/include -Xcompiler=-fopenmp ex1.cpp -o ex1 -L../.. -lmfem -Xlinker=-rpath,../../../occa/lib -L../../../occa/lib -locca -Xlinker=-rpath,../../../libCEED/lib -L../../../libCEED/lib -lceed -Xlinker=-rpath,../../../raja/lib -L../../../raja/lib -lRAJA   -lrt
../../linalg/ttensor.hpp(300): error: "mfem::AutoSIMD<double, 8, 64> &(int)" contains a vector, which is not supported in device code
          detected during:
            instantiation of class "mfem::TVector<S, data_t, align> [with S=750, data_t=mfem::AutoSIMD<double, 8, 64>, align=false]" 
(350): here
            instantiation of class "mfem::TMatrix<N1, N2, data_t, align> [with N1=125, N2=6, data_t=mfem::AutoSIMD<double, 8, 64>, align=false]" 
../../general/mem_manager.hpp(780): here
            instantiation of "void mfem::Memory<T>::Delete() [with T=mfem::TMatrix<125, 6, mfem::AutoSIMD<double, 8, 64>, false>]" 
../../fem/tbilinearform.hpp(129): here
            instantiation of "mfem::TBilinearForm<meshType, solFESpace, IR, IntegratorType, solVecLayout_t, complex_t, real_t, impl_traits_t>::~TBilinearForm() [with meshType=mesh_t, solFESpace=sol_fes_t, IR=int_rule_t, IntegratorType=integ_t, solVecLayout_t=mfem::ScalarLayout, complex_t=double, real_t=double, impl_traits_t=mfem::AutoSIMDTraits<double, double>]" 
../../fem/tbilinearform.hpp(125): here
            instantiation of "mfem::TBilinearForm<meshType, solFESpace, IR, IntegratorType, solVecLayout_t, complex_t, real_t, impl_traits_t>::TBilinearForm(const IntegratorType &, const mfem::FiniteElementSpace &) [with meshType=mesh_t, solFESpace=sol_fes_t, IR=int_rule_t, IntegratorType=integ_t, solVecLayout_t=mfem::ScalarLayout, complex_t=double, real_t=double, impl_traits_t=mfem::AutoSIMDTraits<double, double>]" 
ex1.cpp(288): here

v-dobrev · 2020-05-19T16:05:44Z

I'm not sure why, on my desktop, I did not get this error.

It may be okay to remove the MFEM_HOST_DEVICE from the methods that generate the error:

mfem/linalg/ttensor.hpp, lines 300 to 301 in 069ad59:

   MFEM_HOST_DEVICE data_t &operator[](int i) { return data[i]; }
   MFEM_HOST_DEVICE const data_t &operator[](int i) const { return data[i]; }
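
A minimal sketch of why dropping the qualifier may help, assuming a simplified stand-in for `TVector` and a hypothetical `SKETCH_HOST_DEVICE` macro in place of `MFEM_HOST_DEVICE` (neither name comes from MFEM). Without the qualifier the accessors are host-only, so nvcc should not need to generate device code for a signature that involves a SIMD vector type:

```cpp
#include <cassert>

// Hypothetical stand-in for MFEM_HOST_DEVICE: expands to the CUDA
// qualifiers only when compiling with nvcc, and to nothing otherwise.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Simplified, hypothetical stand-in for TVector: a fixed-size array of data_t.
template <int S, typename data_t = double>
struct SketchVector
{
   data_t data[S];

   // Host-only accessors: no SKETCH_HOST_DEVICE qualifier here, so they are
   // never compiled as device code, even when data_t wraps a SIMD register.
   data_t &operator[](int i) { return data[i]; }
   const data_t &operator[](int i) const { return data[i]; }

   // A method whose signature involves no vector type can keep the qualifier.
   SKETCH_HOST_DEVICE int Size() const { return S; }
};

// Small host-side usage example: fill a vector and sum its entries.
inline double sketch_sum3()
{
   SketchVector<3> v;
   v[0] = 1.0; v[1] = 2.0; v[2] = 3.0;
   return v[0] + v[1] + v[2];
}
```

The trade-off in this sketch is that `operator[]` can no longer be called from device kernels, which is acceptable only if the templated classes are used on the host.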

v-dobrev · 2020-05-20T04:27:34Z

Re-merged in next.

tzanio · 2020-05-20T15:25:03Z

The tux429 runs look OK now, but there are new errors on tux426 ☹️

g++  -O3 -std=c++11 -I..  ex21.cpp -o ex21 -L.. -lmfem -lrt
g++ -g -c /home/kolev1/autotest-mfem/tux426/mfem/tests/unit/unit_test_main.cpp  -O3 -std=c++11 -I../..  -I. -I../.. -o unit_test_main.o
make[1]: Entering directory `/home/kolev1/autotest-mfem/tux426/mfem/miniapps/performance'
g++  -O3 -std=c++11 -march=native -pedantic -Wall -I../..  ex1.cpp -o ex1 -L../.. -lmfem -lrt
In file included from ../../linalg/../linalg/simd/x86.hpp:19:0,
                 from ../../linalg/../linalg/simd.hpp:25,
                 from ../../linalg/ttensor.hpp:16,
                 from ../../mfem-performance.hpp:19,
                 from ex1.cpp:32:
../../linalg/../linalg/simd/m512.hpp: In member function ‘mfem::AutoSIMD<double, 8, 64> mfem::AutoSIMD<double, 8, 64>::operator-() const’:
../../linalg/../linalg/simd/m512.hpp:115:58: error: ‘_mm512_xor_pd’ was not declared in this scope
       r.m512d = _mm512_xor_pd(_mm512_set1_pd(-0.0), m512d);
                                                          ^
make[1]: *** [ex1] Error 1
make[1]: Leaving directory `/home/kolev1/autotest-mfem/tux426/mfem/miniapps/performance'

v-dobrev · 2020-05-20T16:39:47Z

It's good we are testing AVX512 too. 😄

It looks like this particular intrinsic is available only when __AVX512DQ__ is defined. I'm trying to figure out what alternative intrinsic can be used when __AVX512DQ__ is not defined.
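
A minimal sketch of how such a check could look at compile time, assuming the standard `__AVX512F__` and `__AVX512DQ__` predefined macros; the enum and function names are hypothetical and only illustrate the three cases:

```cpp
#include <cassert>

// Hypothetical labels for the three situations discussed above.
enum class NegatePath
{
   NoAVX512,     // AVX-512 is not enabled at all
   AVX512FOnly,  // AVX512F without AVX512DQ: _mm512_xor_pd is not declared
   AVX512DQ      // AVX512DQ enabled: _mm512_xor_pd can be used
};

// Report which path the current compiler flags select.
inline NegatePath sketch_negate_path()
{
#if defined(__AVX512F__) && defined(__AVX512DQ__)
   return NegatePath::AVX512DQ;
#elif defined(__AVX512F__)
   return NegatePath::AVX512FOnly;
#else
   return NegatePath::NoAVX512;
#endif
}
```

With `-march=native`, the result depends on the build machine, which would be consistent with the error appearing on tux426 but not elsewhere.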

v-dobrev · 2020-05-20T16:45:58Z

I guess we can always fall back on return (0.0-(*this)); which will be the same as

   AutoSIMD<double,8,64> r;
   r.m512d = _mm512_sub_pd(_mm512_set1_pd(0.0), m512d);
   return r;

However, there may be a better/faster alternative.
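
One detail worth checking with the subtraction fallback is the sign of zero. The scalar sketch below (hypothetical function names, assuming IEEE 754 doubles, the default rounding mode, and no fast-math flags) compares `0.0 - x` with a plain sign flip:

```cpp
#include <cassert>
#include <cmath>

// Scalar analog of _mm512_sub_pd(_mm512_set1_pd(0.0), x).
inline double sketch_negate_by_subtraction(double x) { return 0.0 - x; }

// Plain unary minus, which flips the sign bit.
inline double sketch_negate_by_sign_flip(double x) { return -x; }

// Returns true if the two approaches agree on both value and sign bit.
inline bool sketch_same_result(double x)
{
   const double a = sketch_negate_by_subtraction(x);
   const double b = sketch_negate_by_sign_flip(x);
   return a == b && std::signbit(a) == std::signbit(b);
}
```

Under these assumptions the two agree for nonzero inputs, but for `x = +0.0` the subtraction yields `+0.0` while the sign flip yields `-0.0`.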

v-dobrev · 2020-05-20T17:19:51Z

Here is one alternative:
https://github.com/vectorclass/version2/blob/c1e56e55371eab3e423efd90f4bdd66bef0d75b2/vectorf512.h#L900-L904

which expanded becomes:

_mm512_castsi512_pd(
   _mm512_xor_epi32(
      _mm512_castpd_si512(a),
      _mm512_set1_epi64(0x8000000000000000)));

I'll push this in a moment.

@tzanio, can you try it on tux426?
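
The expression above flips the sign bit of each lane by reinterpreting the doubles as integers. A scalar sketch of the same idea, assuming 64-bit IEEE 754 doubles (the function name is hypothetical, and this is an illustration rather than the code pushed to the branch):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes 64-bit doubles");

// Negate a double by XOR-ing its bit pattern with the sign-bit mask.
inline double sketch_negate_by_xor(double x)
{
   std::uint64_t bits;
   std::memcpy(&bits, &x, sizeof bits);   // analog of _mm512_castpd_si512
   bits ^= UINT64_C(0x8000000000000000);  // analog of the XOR with the mask
   std::memcpy(&x, &bits, sizeof x);      // analog of _mm512_castsi512_pd
   return x;
}
```

Because only the sign bit changes, this matches unary minus for every input, including zeros, and it needs only integer XOR, which is part of AVX512F.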

v-dobrev · 2020-05-21T04:14:50Z

Re-merged in next.

tzanio · 2020-05-21T17:38:59Z

Re-merged in next for testing ...

camierjs added 30 commits September 19, 2017 15:39

x86 intrinsic for the high-performance templated operator

1fed674

x86 scalar/sse/avx/avx2/avx512 header files

8c10419

Merge branch 'master' of https://github.com/mfem/mfem into okina

fffcf07

Merge branch 'master' of https://github.com/mfem/mfem into okina

6709f15

GCC, ICC & Clang alignment sanitization for TBilinearForm root class

bf8ac96

Cleanup & MFEM_USE_X86INTRIN ifdefs

af53d47

Merge branch 'master' into x86

8bb7165

[x86] merge addon to get tensor alignment, not yet matrix-free

4a234bc

[x86] before SIMD/BATCH patch

e192295

[batch] applied

89db8f1

[simd] auto working with posix_memalign

eff3732

[simd] ex1 w/ & w/o x86

8af1fe9

[x86] cleanup

bccbe14

[x86] INSTALL and force inline for the auto class

299e555

[x86] - MFEM_USE_X86INTRIN makefile if for GCC's param max-completely-peel-times option - Brought bp1p from github.com/CEED/benchmarks/blob/master/tests/mfem_bps to test the kernels

0f9a7fe

[x86] miss miniapps/performance/ex1.cpp

5d7177f

[x86] add inline MFEM_ALWAYS_INLINE in each simd header files

65778b5

[x86] perf vs master

7981683

[x86] makefile cleanup

d565111

[vsx] (2x) double vector for __VSX__ Power8 architecture

ec9a490

[BG/Q] QPX vectorization

0042234

[qpx] changed / to vec_swdiv and remove #warnings

c0ba409

Merge branch 'master' into x86

127a60e

qpx, qpx64 & ex1 SIMD vs scalar test

02e9929

qpx64 size fix

4fc053f

okrtc + ex1

091cedf

ex1rtc defines tweaks

a9935c7

Trying runtime compilation on miniapps/performance/ex1.cpp

965e5c8

Merge branch 'master' into x86

d53c004

Merge branch 'master' into x86

19003df

v-dobrev and others added 3 commits May 18, 2020 10:56

In the performance miniapps, print the SIMD width in terms of "doubles".

1be972b

Merge pull request #1485 from mfem/x86-updates

Some updates to the SIMD intrinsics branch

b86a6ab

Merge branch 'master' into x86

0a10196

tzanio self-requested a review May 18, 2020 22:47

v-dobrev approved these changes May 18, 2020

minor

069ad59

tzanio approved these changes May 18, 2020

tzanio added the in-next label May 18, 2020

v-dobrev added 4 commits May 19, 2020 20:27

Merge branch 'master' into x86

Resolved conflicts: CHANGELOG

5a37ae4

Remove repeated CHANGELOG entry.

5f0ff9d

Remove the MFEM_HOST_DEVICE specifiers from the operator[] methods in class TVector -- this was causing compilation errors when CUDA is enabled.

b28b537

Restore a CHANGELOG entry.

44837bb

v-dobrev added 2 commits May 20, 2020 10:20

Alternative definition for unary minus with AVX512 when AVX512DQ is not available.

7206a67

In the miniapps/performance makefile, run compiler auto-detection only when needed.

13a2b36

Fix for unary minus with AVX512 when AVX512DQ is not available.

51dfbfe

Small tweak to avoid using a function before it is declared.

6cb5a2a

tzanio merged commit 63b73b2 into master May 25, 2020

Pull Requests automation moved this from Review Now to Merged May 25, 2020

tzanio deleted the x86 branch May 25, 2020 17:06

tzanio mentioned this pull request Jun 3, 2020

Update templated code doc with intrinsics mfem/web#55

Closed
