Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intrinsic for the high-performance templated operator [x86] #1047

Merged
merged 104 commits into from
May 25, 2020
Merged

Intrinsic for the high-performance templated operator [x86] #1047

merged 104 commits into from
May 25, 2020

Conversation

camierjs
Copy link
Member

@camierjs camierjs commented Aug 28, 2019

Enables the performance templated classes to use specific vector intrinsics on the following architectures:

  • x86 (SSE/AVX/AVX2/AVX512),
  • Power8 & Power9 (VSX), tested on Lassen with xlc++,
  • BG/Q (QPX), tested on Vulcan.

It can be enabled with MFEM_USE_SIMD=YES.

Results on Vulcan, master:
plot2_MFEM_bp1_v1_vulcan_xlc_N016_pn32

Vulcan, x86, with the SIMD enabled:
plot2_MFEM_bp1_v1_vulcan_xlc_N016_pn32_x86

Vulcan, x86, with SIMD and JIT to use vectorization only when needed:
plot2_MFEM_bp1_v1_vulcan_xlc_N016_pn32_x86_hybrid

PR Author Editor Reviewers Assignment Approval Merge
#1047 @camierjs @tzanio @tzanio + @v-dobrev 03/12/20 05/18/20 05/25/20

camierjs added 30 commits September 19, 2017 15:39
…-peel-times option

      - Brought bp1p from github.com/CEED/benchmarks/blob/master/tests/mfem_bps to test the kernels
@tzanio tzanio self-requested a review May 18, 2020 22:47
@tzanio tzanio added the in-next label May 18, 2020
@tzanio
Copy link
Member

tzanio commented May 18, 2020

Merged in next for testing...

@tzanio
Copy link
Member

tzanio commented May 19, 2020

There are errors from this PR in the autotest runs on tux429, e.g.

nvcc  -O3 -std=c++11 -x=cu --expt-extended-lambda -arch=sm_60 -ccbin g++ -Xcompiler="-march=native  -Wall" -I../.. -I../../../occa/include -I../../../libCEED/include -I../../../raja/include -Xcompiler=-fopenmp ex1.cpp -o ex1 -L../.. -lmfem -Xlinker=-rpath,../../../occa/lib -L../../../occa/lib -locca -Xlinker=-rpath,../../../libCEED/lib -L../../../libCEED/lib -lceed -Xlinker=-rpath,../../../raja/lib -L../../../raja/lib -lRAJA   -lrt
../../linalg/ttensor.hpp(300): error: "mfem::AutoSIMD<double, 8, 64> &(int)" contains a vector, which is not supported in device code
          detected during:
            instantiation of class "mfem::TVector<S, data_t, align> [with S=750, data_t=mfem::AutoSIMD<double, 8, 64>, align=false]" 
(350): here
            instantiation of class "mfem::TMatrix<N1, N2, data_t, align> [with N1=125, N2=6, data_t=mfem::AutoSIMD<double, 8, 64>, align=false]" 
../../general/mem_manager.hpp(780): here
            instantiation of "void mfem::Memory<T>::Delete() [with T=mfem::TMatrix<125, 6, mfem::AutoSIMD<double, 8, 64>, false>]" 
../../fem/tbilinearform.hpp(129): here
            instantiation of "mfem::TBilinearForm<meshType, solFESpace, IR, IntegratorType, solVecLayout_t, complex_t, real_t, impl_traits_t>::~TBilinearForm() [with meshType=mesh_t, solFESpace=sol_fes_t, IR=int_rule_t, IntegratorType=integ_t, solVecLayout_t=mfem::ScalarLayout, complex_t=double, real_t=double, impl_traits_t=mfem::AutoSIMDTraits<double, double>]" 
../../fem/tbilinearform.hpp(125): here
            instantiation of "mfem::TBilinearForm<meshType, solFESpace, IR, IntegratorType, solVecLayout_t, complex_t, real_t, impl_traits_t>::TBilinearForm(const IntegratorType &, const mfem::FiniteElementSpace &) [with meshType=mesh_t, solFESpace=sol_fes_t, IR=int_rule_t, IntegratorType=integ_t, solVecLayout_t=mfem::ScalarLayout, complex_t=double, real_t=double, impl_traits_t=mfem::AutoSIMDTraits<double, double>]" 
ex1.cpp(288): here

@v-dobrev
Copy link
Member

v-dobrev commented May 19, 2020

I'm not sure why, on my desktop, I did not get this error.

It maybe okay to remove the MFEM_HOST_DEVICE from the methods that generate the error:

mfem/linalg/ttensor.hpp

Lines 300 to 301 in 069ad59

MFEM_HOST_DEVICE data_t &operator[](int i) { return data[i]; }
MFEM_HOST_DEVICE const data_t &operator[](int i) const { return data[i]; }

@v-dobrev
Copy link
Member

Re-merged in next.

@tzanio
Copy link
Member

tzanio commented May 20, 2020

The tux429 runs look OK now, but there are new errors on tux426 ☹️

g++  -O3 -std=c++11 -I..  ex21.cpp -o ex21 -L.. -lmfem -lrt
g++ -g -c /home/kolev1/autotest-mfem/tux426/mfem/tests/unit/unit_test_main.cpp  -O3 -std=c++11 -I../..  -I. -I../.. -o unit_test_main.o
make[1]: Entering directory `/home/kolev1/autotest-mfem/tux426/mfem/miniapps/performance'
g++  -O3 -std=c++11 -march=native -pedantic -Wall -I../..  ex1.cpp -o ex1 -L../.. -lmfem -lrt
In file included from ../../linalg/../linalg/simd/x86.hpp:19:0,
                 from ../../linalg/../linalg/simd.hpp:25,
                 from ../../linalg/ttensor.hpp:16,
                 from ../../mfem-performance.hpp:19,
                 from ex1.cpp:32:
../../linalg/../linalg/simd/m512.hpp: In member function ‘mfem::AutoSIMD<double, 8, 64> mfem::AutoSIMD<double, 8, 64>::operator-() const’:
../../linalg/../linalg/simd/m512.hpp:115:58: error: ‘_mm512_xor_pd’ was not declared in this scope
       r.m512d = _mm512_xor_pd(_mm512_set1_pd(-0.0), m512d);
                                                          ^
make[1]: *** [ex1] Error 1
make[1]: Leaving directory `/home/kolev1/autotest-mfem/tux426/mfem/miniapps/performance'

@v-dobrev
Copy link
Member

It's good we are testing AVX512 too. 😄

It looks like this particular intrinsic is available only when __AVX512DQ__ is defined. I'm trying to figure out what alternative intrinsic can be used when __AVX512DQ__ is not defined.

@v-dobrev
Copy link
Member

I guess we can always fall back on return (0.0-(*this)); which will be the same as

   AutoSIMD<double,8,64> r;
   r.m512d = _mm512_sub_pd(_mm512_set1_pd(0.0),v.m512d);
   return r;

However, there may be a better/faster alternative.

@v-dobrev
Copy link
Member

Here is one alternative:
https://github.com/vectorclass/version2/blob/c1e56e55371eab3e423efd90f4bdd66bef0d75b2/vectorf512.h#L900-L904

which expanded becomes:

_mm512_castsi512_pd(
   _mm512_xor_epi32(
      _mm512_castpd_si512(a),
      _mm512_set1_epi64(0x8000000000000000)));

I'll push this in a moment.

@tzanio, can you try it on tux426?

@v-dobrev
Copy link
Member

Re-merged in next.

@tzanio
Copy link
Member

tzanio commented May 21, 2020

Re-merged in next for testing ...

@tzanio tzanio merged commit 63b73b2 into master May 25, 2020
Pull Requests automation moved this from Review Now to Merged May 25, 2020
@tzanio tzanio deleted the x86 branch May 25, 2020 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Pull Requests
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

4 participants