
Bad speed for f32:s8:f32 matmul #1893

Closed
WilliamTambellini opened this issue May 3, 2024 · 11 comments
@WilliamTambellini
Contributor

Hello oneDNN team,
Just wondering if we are missing something:

$ ONEDNN_VERBOSE=1 OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=f32:s8:f32 --attr-fpmath=strict:true --attr-scales=wei:common:1.25:f32,wei:per_oc:f32,wei:per_ocic:f32:1x1 --attr-zero-points=wei:common:-1:s8,wei:per_oc:s8,wei:per_ocic:s8:1x1  16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.8811
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.487793
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s8 ,,16x100x640:1x640x1920,58724.3

Ran on a recent CPU:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8488C
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            8
    BogoMIPS:            4800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_
                         perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe p
                         opcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_
                         adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec x
                         getbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme 
                         avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

@WilliamTambellini changed the title from "Bad speed for f32:s8:f2 matmul" to "Bad speed for f32:s8:f32 matmul" on May 3, 2024
@igorsafo
Contributor

igorsafo commented May 3, 2024

Hi @WilliamTambellini ,
Could you please run oneDNN with ONEDNN_VERBOSE=all? It should help a lot with information on why the optimized versions were skipped.
From what I see, s8 zero points have very limited support, so this might be the reason. Please try s32 as the zero-point data type.

@WilliamTambellini
Contributor Author

Tks @igorsafo

$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=f32:s8:f32 --attr-fpmath=f32:true --attr-scales=wei:common:1.25:f32,wei:per_oc:f32,wei:per_ocic:f32:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.000976562
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0129395
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.89478
onednn_verbose,primitive,create:cache_miss,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.0100098
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.497803
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,58143.1

Does the fpmath mode really matter?

@igorsafo
Contributor

igorsafo commented May 3, 2024

Thanks for the verbose log.
Yes, it is required to enforce floating-point computation for an integral primitive (a primitive is considered integral if its weights are integer).
Here is an example of weights decompression: https://oneapi-src.github.io/oneDNN/page_weights_decompression_matmul_cpp.html#doxid-weights-decompression-matmul-cpp
Here is the documentation page for fpmath: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#enforcing-the-floating-point-math-mode-to-an-integral-primitive
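
For a concrete picture, here is a minimal C++ sketch of that setup (not the full tutorial from the link above; the shapes and the per-OC scale mask are illustrative assumptions, and error handling is omitted):

#include "dnnl.hpp"
using namespace dnnl;

// Sketch: build an f32 x s8 -> f32 matmul where the s8 weights are
// decompressed on the fly and the math runs in floating point.
matmul make_decompression_matmul(engine &eng) {
    const memory::dim M = 100, K = 640, N = 1920; // illustrative shapes
    memory::desc src_md({1, M, K}, memory::data_type::f32, memory::format_tag::any);
    memory::desc wei_md({1, K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({1, M, N}, memory::data_type::f32, memory::format_tag::any);

    primitive_attr attr;
    // The key step: apply_to_int = true enforces the floating-point math
    // mode on this integral (integer-weights) primitive.
    attr.set_fpmath_mode(fpmath_mode::strict, /*apply_to_int=*/true);
    // Per-output-channel dequantization scales on the weights
    // (mask = 1 << 2 selects the N dimension of the 3D weights tensor).
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 2);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    return matmul(pd);
}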

@igorsafo igorsafo self-assigned this May 3, 2024
@WilliamTambellini
Contributor Author

Tks. We have followed these examples, but atm the speed is still bad even with bf16 src/dst:

$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:bf16,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.00195312
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0119629
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.85718
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.0888672
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.393799
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,60269.7

Seen in your src code:

const bool is_bf16_with_int_wei = src_dt == bf16 && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);

@jkbeavers

If possible, I think it would be helpful for the example page (or the doc on the matmul primitive page) to specify that this is only supported by a non-reference implementation for bf16:s8 to bf16 or f32.

Still, it's hard to see what is going wrong when using these data types, given that brgemm_matmul.cpp seems to report the datatype combination as invalid:

const bool is_bf16_with_int_wei = src_dt == bf16 && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);
...
const bool problem_dt_correct = one_of(true, is_int8, is_bf16, is_f32, is_f16, is_bf16_with_int_wei);
...
VDISPATCH_MATMUL(problem_dt_correct, VERBOSE_UNSUPPORTED_DT_CFG);

See

onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99

from

ONEDNN_VERBOSE=all OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=strict:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99

@xuxinzen
Contributor

xuxinzen commented May 4, 2024

Hi! I see you're using v3.4.0, while the optimized version is in v3.5. Could you please update the version and check again?
Note: for zero points, we only support the common policy in the optimized version at this time.
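
As a hedged sketch of what the "common" policy means in attribute terms (mask = 0, i.e. one zero point shared by the whole weights tensor; a non-zero per-channel mask would currently fall back to the reference kernel):

#include "dnnl.hpp"
using namespace dnnl;

// Sketch: attributes for weights decompression with a *common* zero point.
primitive_attr make_common_zp_attr() {
    primitive_attr attr;
    // Enforce floating-point math on the integer-weights matmul.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // mask = 0 -> one scale / one zero point for the whole weights tensor,
    // i.e. the "common" policy the optimized implementation supports.
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 0);
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, 0);
    return attr;
}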

@WilliamTambellini
Contributor Author

WilliamTambellini commented May 6, 2024

Tks @xuxinzen: we'll try again with the main branch.
@vpirogov, could you tell us when oneDNN 3.5 will be released?
Best

@jkbeavers

jkbeavers commented May 6, 2024

Aha, yes, that was certainly the main issue. Thanks @xuxinzen -- I now see the diff on brgemm_matmul.cpp that previously did not accept these data types for weight decompression.

We were copying some values for --attr-zero-points from recent commits, which was also a problem due to a zero point of -1; we needed to specify 0.

./benchdnn --mode=P --matmul --dt=bf16:u8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.5.0 (commit 242d4d9222cf7162927de60116bafc646bd0941a)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx_fp16,undef,src_bf16:a:any:any::f0 wei_u8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:101
onednn_verbose,primitive,create:cache_miss,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.331787
onednn_verbose,primitive,create:cache_hit,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.00219727
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,0.124023
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,2.05298
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.0400391
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.00292969
onednn_verbose,primitive,exec,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.304932

Now that I see this working I wanted to confirm expected behavior for 3.4 and 3.5:

  1. Is the reference (slow) implementation the only one available in 3.4?
  2. Are there plans for a f32:s8:f32 (or u8) brgemm implementation in 3.5?
  3. Is there support for this in the graph API?

@xuxinzen
Contributor

xuxinzen commented May 6, 2024

  1. Yes, the reference implementation is the only one available in 3.4.
  2. As far as I know, we do not have any plans for f32:s8:f32 at this time, at least not for this quarter.
  3. As for the operations supported by the graph API, we do not support weights decompression yet.

@vpirogov vpirogov added this to the v3.5 milestone May 7, 2024
@vpirogov
Member

vpirogov commented May 7, 2024

@WilliamTambellini, will do! The release is planned for May 30. You can find the oneDNN release schedule here.

@jkbeavers

Thanks for all the help @xuxinzen and @vpirogov ! You can go ahead and close this.
