Bad speed for f32:s8:f32 matmul #1893
Hi @WilliamTambellini,
Thanks @igorsafo.
Does the fpmath mode really matter?
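For context, this is the attribute in question; a minimal sketch against the v3.x C++ API (the second argument of `set_fpmath_mode` is what opts integer weights into decompression):

```cpp
#include "dnnl.hpp"

// Build a primitive_attr requesting bf16 math for f32 computation; with
// apply_to_int set to true, the down-conversion also applies to integer
// weights (on-the-fly weight decompression). Per oneDNN v3.x dnnl.hpp.
dnnl::primitive_attr make_decompression_attr() {
    dnnl::primitive_attr attr;
    attr.set_fpmath_mode(dnnl::fpmath_mode::bf16, /*apply_to_int=*/true);
    return attr;
}
```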
Thanks for the verbose log.
Thanks. We have followed these examples, but at the moment the speed is still bad even with bf16 src/dst. Roughly the following setup (a minimal sketch with placeholder shapes, not our exact code):
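```cpp
#include "dnnl.hpp"
using namespace dnnl;

// bf16 x s8 -> bf16 matmul with weight decompression via the fpmath
// attribute. Shapes and engine index are placeholders.
int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 128, K = 512, N = 256;
    memory::desc src_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::bf16, memory::format_tag::ab);

    primitive_attr attr;
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul mm(pd);
    // Memory allocation, weight reorder to pd.weights_desc(), and
    // execution with a stream are omitted here.
    return 0;
}
```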
Seen in your source code:
If possible, I think it would be helpful for the example page (or the docs on the matmul primitive page) to state that bf16:s8 into bf16 or f32 is the only combination supported by a non-reference implementation. Otherwise it is hard to see what is going wrong when using these datatypes, given that brgemm_matmul.cpp appears to reject the datatype combination as invalid.
See
from
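One way to confirm which implementation was dispatched (a sketch; `pd` is a hypothetical `matmul::primitive_desc` built as above):

```cpp
#include <cstdio>
#include "dnnl.hpp"

// Print the name of the implementation oneDNN dispatched to. Reference
// implementations typically have "ref" in the name; the optimized
// brgemm-based x64 ones typically contain "brg".
void print_impl(const dnnl::matmul::primitive_desc &pd) {
    std::printf("impl: %s\n", pd.impl_info_str());
}
```

Running with the `ONEDNN_VERBOSE=1` environment variable prints the same implementation name at primitive creation and execution time.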
Hi! I saw you're using v3.4.0, while the optimized version is in v3.5. Could you please update the version and check again?
Aha, yes, that was certainly the main issue. Thanks @xuxinzen -- I now see the diff on brgemm_matmul.cpp not accepting these datatypes for weight decompression. We were copying some values for
Now that I see this working, I wanted to confirm the expected behavior for 3.4 and 3.5:
@WilliamTambellini, will do! The release is planned for May 30. You can find the oneDNN release schedule here.
Hello oneDNN team,
Just wondering if we are missing something:
Ran on a recent CPU. The setup in question, sketched with placeholder shapes (f32 source and destination, s8 weights, fpmath attribute as above; not the exact benchmark code):
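```cpp
#include "dnnl.hpp"

// f32 x s8 -> f32 matmul, the combination from the title. Weights are
// decompressed on the fly via the fpmath attribute. Shapes are placeholders.
dnnl::matmul::primitive_desc make_f32_s8_f32_matmul(const dnnl::engine &eng) {
    using namespace dnnl;
    const memory::dim M = 128, K = 512, N = 256;
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);
    primitive_attr attr;
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```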