
how to solve cross compile EMLL problem below #9

Closed
xuhaoguang opened this issue Oct 9, 2021 · 18 comments

@xuhaoguang

```
/EMLL/src/arm_neon/ARMCompareAndSwap.c:1:0: error: invalid feature modifier in '-march=armv8.2-a+dotprod+fp16'
 /*****************************************************************************/
CMakeFiles/eml-armneon.dir/build.make:62: recipe for target 'CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o' failed
make[2]: *** [CMakeFiles/eml-armneon.dir/src/arm_neon/ARMCompareAndSwap.c.o] Error 1
CMakeFiles/Makefile2:109: recipe for target 'CMakeFiles/eml-armneon.dir/all' failed
make[1]: *** [CMakeFiles/eml-armneon.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
```

@netease-youdao
Owner

It seems that your toolchain doesn't support the ARMv8.2-A architecture.

You can try with this EMLL source package (fp16 GEMM and sdot/udot disabled):
EMLL.tar.gz

@xuhaoguang
Author

Will the performance of EMLL with fp16 GEMM and sdot/udot disabled be much worse than with them enabled?

@netease-youdao
Owner

It depends on the CPU type. For aarch64 processors supporting the armv8.2-a dot-product extension, like Cortex-A55/A75/A76/A77/A78, you may see performance degradation in (u)int8 -> (u)int32 GEMM tasks, but the speed of fp32 GEMM will not be affected. For other processors (Cortex-A53/A35/A72) there's no difference.

@xuhaoguang
Author

Thanks very much. I will compare the performance of EMLL and OpenBLAS on my device, and I'll consult you if other problems come up.

@xuhaoguang
Author

How can I make C row-major in sgemm(A, B, C), without a manual conversion after calling sgemm?

@netease-youdao
Owner

If C is row-major, calling sgemm(!b_rowmajor, !a_rowmajor, B, A, C, N, M, K, beta, num_threads) will do the job.

@xuhaoguang
Author

EMLL's sgemm doesn't support CblasTrans for A/B, so do we need to transpose manually before calling the sgemm function?

@netease-youdao
Owner

Let C[MxN] = A[MxK] B[KxN]. Here is a summary of how to call sgemm for all combinations of matrix orders (NO NEED FOR additional transposition work):

| A | B | C | how to call |
| --- | --- | --- | --- |
| row major | row major | row major | `sgemm(0, 0, B, A, C, N, M, K, beta, num_threads)` |
| row major | row major | column major | `sgemm(1, 1, A, B, C, M, N, K, beta, num_threads)` |
| column major | row major | row major | `sgemm(0, 1, B, A, C, N, M, K, beta, num_threads)` |
| column major | row major | column major | `sgemm(0, 1, A, B, C, M, N, K, beta, num_threads)` |
| row major | column major | row major | `sgemm(1, 0, B, A, C, N, M, K, beta, num_threads)` |
| row major | column major | column major | `sgemm(1, 0, A, B, C, M, N, K, beta, num_threads)` |
| column major | column major | row major | `sgemm(1, 1, B, A, C, N, M, K, beta, num_threads)` |
| column major | column major | column major | `sgemm(0, 0, A, B, C, M, N, K, beta, num_threads)` |

@xuhaoguang
Author

Thanks. I mean that EMLL's sgemm has no "CblasTrans" parameter for matrix B like the OpenBLAS sgemm call below, only the row-major/column-major flags:

```c
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
```

@netease-youdao
Owner

The storage orders of matrices A-C can be determined from the input parameters to cblas_sgemm:

| layout | transa | transb | order of A | order of B | order of C |
| --- | --- | --- | --- | --- | --- |
| CblasColMajor | CblasNoTrans | CblasNoTrans | column major | column major | column major |
| CblasColMajor | CblasTrans | CblasNoTrans | row major | column major | column major |
| CblasColMajor | CblasNoTrans | CblasTrans | column major | row major | column major |
| CblasColMajor | CblasTrans | CblasTrans | row major | row major | column major |
| CblasRowMajor | CblasNoTrans | CblasNoTrans | row major | row major | row major |
| CblasRowMajor | CblasTrans | CblasNoTrans | column major | row major | row major |
| CblasRowMajor | CblasNoTrans | CblasTrans | row major | column major | row major |
| CblasRowMajor | CblasTrans | CblasTrans | column major | column major | row major |

@netease-youdao
Owner

Please note that EMLL doesn't support padding currently, which means
(1) lda must be K for row-major A, or M for column-major A;
(2) ldb must be N for row-major B, or K for column-major B;
(3) ldc must be N for row-major C, or M for column-major C.

And currently EMLL doesn't support alpha != 1.

@xuhaoguang
Author

xuhaoguang commented Oct 12, 2021

Let C[MxN] = A[MxK] B[KxN], with A/B/C all row-major.
If I call sgemm(0, 0, A_f, B_f, C_f, M, N, K, 0, 3), my program runs normally but the result is incorrect.
But if I call sgemm(0, 0, B_f, A_f, C_f, N, M, K, 0, 3), my program crashes with a coredump; it looks like it runs out of memory.

```
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 9937]
0x0000007fb7fda8bc in do_lookup_x () from /lib/ld-linux-aarch64.so.1
(gdb) bt
#0  0x0000007fb7fda8bc in do_lookup_x () from /lib/ld-linux-aarch64.so.1
#1  0x0000007fb7fdb094 in _dl_lookup_symbol_x () from /lib/ld-linux-aarch64.so.1
#2  0x0000007fb7fde36c in _dl_fixup () from /lib/ld-linux-aarch64.so.1
#3  0x0000007fb7fe3ee4 in _dl_runtime_resolve () from /lib/ld-linux-aarch64.so.1
#4  0x0000007fa653f4d0 in sgemm._omp_fn () from ./libproject.so
#5  0x0000007fa6146ee4 in gomp_thread_start () from /lib/libgomp.so.1
#6  0x0000007fb7d81f4c in start_thread () from /lib/libpthread.so.0
#7  0x0000007fb7cee190 in thread_start () from /lib/libc.so.6
```

So I don't know why this phenomenon occurs

@netease-youdao
Owner

Please show your test code (and, if possible, the compiled executable) to help us track down the problem :)

@xuhaoguang
Author

```c
int emll_sgemm_thread_count = 0;
if (transposed_a == DU_NOTRANS && transposed_b == DU_NOTRANS) {
    // output: A[13, 384], B[384, 384], C[13, 384]
    fprintf(stderr, "2222 A[%d, %d], B[%d, %d], C[%d, %d]", a->_n, a->_m, b->_n, b->_m, c->_n, c->_m);

    // the call below runs normally, but the result is incorrect
    sgemm(0, 0, (DTYPE*)a->_data, (DTYPE*)b->_data, (DTYPE*)c->_data, a->_n, b->_m, a->_m, beta, emll_sgemm_thread_count);

    // the call below causes a coredump
    //sgemm(0, 0, (DTYPE*)b->_data, (DTYPE*)a->_data, (DTYPE*)c->_data, b->_m, a->_n, a->_m, beta, emll_sgemm_thread_count);

    fprintf(stderr, "finish sgemm multiply\n");
}
```

device cpu: https://www.allwinnertech.com/index.php?c=product&a=index&id=92

@xuhaoguang
Author

Is there a WeChat communication group for EMLL?

@xuhaoguang
Author

Here is the gdb info for the coredump (screenshot attached in the original issue).

@netease-youdao
Owner

netease-youdao commented Oct 13, 2021

This looks like a thread-local storage issue. You can try modifying the code as suggested in #8 to move the buffers from TLS to the stack, or set the environment variable OMP_STACKSIZE to increase the stack size allotted to child threads.
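For the second workaround, the environment variable only needs to be set in the shell before launching the process; the binary name below is a placeholder, not from the original issue:

```shell
# Give each OpenMP child thread a larger stack, e.g. 16 MB
# (OMP_STACKSIZE accepts B/K/M/G suffixes per the OpenMP spec):
export OMP_STACKSIZE=16M
./your_program   # placeholder for the executable that crashed
```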

@netease-youdao
Owner

netease-youdao commented Oct 13, 2021

Also, the file include/common/CommonSkinnyGer.h needs a similar modification to move its buffer from TLS to the stack:

```diff
diff --git a/include/common/CommonSkinnyGer.h b/include/common/CommonSkinnyGer.h
index b0af350..802ebf6 100644
--- a/include/common/CommonSkinnyGer.h
+++ b/include/common/CommonSkinnyGer.h
@@ -326,8 +326,6 @@ static inline void inline_##gemm##_acolmajor_bskinny_beta_##n_dim(\
   k_mask, m_mask, stack_size, atype, btype) \
 GEMM_SKINNY_GER_BETA_FUNC(gemm, n_dim)\
 GEMM_SKINNY_GER_INLINE_FUNCS(gemm, n_dim, k_mask, m_mask)\
-__attribute__((aligned(4096))) static __thread gemm##_skinnyger_cscalar\
-  gemm##_acolmajor_bskinny_a##atype##_b##btype##_##n_dim##_cscratch[stack_size];\
 GEMM_SKINNY_GER_INLINE_DEPACK_FUNC(gemm, m_mask, n_dim)\
 void gemm##_acolmajor_bskinny_a##atype##_b##btype##_n##n_dim(\
   const gemm##_skinnyger_ascalar *A,\
@@ -335,6 +333,9 @@ void gemm##_acolmajor_bskinny_a##atype##_b##btype##_n##n_dim(\
   gemm##_skinnyger_cscalar *C,\
   uint32_t M, uint32_t K, uint8_t b_c_order,\
   gemm##_skinnyger_cscalar beta_inp) {\
+\
+  __attribute__((aligned(4096))) gemm##_skinnyger_cscalar\
+    gemm##_acolmajor_bskinny_a##atype##_b##btype##_##n_dim##_cscratch[stack_size];\
 \
   const bool b_rowmajor = b_c_order & 1;\
   const bool c_rowmajor = b_c_order & 2;\
@@ -431,6 +432,8 @@ void gemm##_acolmajor_bskinny_a##atype##_b##btype##_n##n_dim##_omp(\
   omp_set_num_threads(num_threads);\
   _Pragma("omp parallel")\
   {\
+    __attribute__((aligned(4096))) gemm##_skinnyger_cscalar\
+      gemm##_acolmajor_bskinny_a##atype##_b##btype##_##n_dim##_cscratch[stack_size];\
     const gemm##_skinnyger_ascalar * const A = task_info.m_A;\
     const gemm##_skinnyger_bscalar * const B = task_info.m_B;\
     gemm##_skinnyger_cscalar * const C = task_info.m_C;\
```
