[ARK] Support gemm using sycl-tla#1968

Draft
Zhenzhong1 wants to merge 7 commits into main from zhenzhong/sycltla-gemm

Zhenzhong1 (Contributor) commented Jun 30, 2026

torch.float16

auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-1] 
  [oneDNN]                  :    0.187 ms     0.179 TFLOPS
  [matmul_sycl_tla_fused]   :    0.108 ms     0.311 TFLOPS  speedup= 1.74x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-8] 
  [oneDNN]                  :    0.082 ms     3.259 TFLOPS
  [matmul_sycl_tla_fused]   :    0.114 ms     2.352 TFLOPS  speedup= 0.72x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-16] 
  [oneDNN]                  :    0.081 ms     6.591 TFLOPS
  [matmul_sycl_tla_fused]   :    0.114 ms     4.721 TFLOPS  speedup= 0.72x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-32] 
  [oneDNN]                  :    0.084 ms    12.727 TFLOPS
  [matmul_sycl_tla_fused]   :    0.113 ms     9.500 TFLOPS  speedup= 0.75x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-128] 
  [oneDNN]                  :    0.133 ms    32.222 TFLOPS
  [matmul_sycl_tla_fused]   :    0.127 ms    33.801 TFLOPS  speedup= 1.05x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-1024] 
  [oneDNN]                  :    0.425 ms    80.767 TFLOPS
  [matmul_sycl_tla_fused]   :    0.455 ms    75.486 TFLOPS  speedup= 0.93x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-2048] 
  [oneDNN]                  :    0.774 ms    88.751 TFLOPS
  [matmul_sycl_tla_fused]   :    0.878 ms    78.234 TFLOPS  speedup= 0.88x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt0-4096-4096-4096] 
  [oneDNN]                  :    1.494 ms    91.973 TFLOPS
  [matmul_sycl_tla_fused]   :    1.645 ms    83.568 TFLOPS  speedup= 0.91x

torch.bfloat16

auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-1] 
  [oneDNN]                  :    0.161 ms     0.208 TFLOPS
  [matmul_sycl_tla_fused]   :    0.107 ms     0.314 TFLOPS  speedup= 1.51x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-8] 
  [oneDNN]                  :    0.083 ms     3.245 TFLOPS
  [matmul_sycl_tla_fused]   :    0.107 ms     2.503 TFLOPS  speedup= 0.77x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-16] 
  [oneDNN]                  :    0.082 ms     6.567 TFLOPS
  [matmul_sycl_tla_fused]   :    0.111 ms     4.835 TFLOPS  speedup= 0.74x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-32] 
  [oneDNN]                  :    0.083 ms    12.903 TFLOPS
  [matmul_sycl_tla_fused]   :    0.112 ms     9.625 TFLOPS  speedup= 0.75x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-128] 
  [oneDNN]                  :    0.120 ms    35.932 TFLOPS
  [matmul_sycl_tla_fused]   :    0.120 ms    35.670 TFLOPS  speedup= 0.99x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-1024] 
  [oneDNN]                  :    0.409 ms    83.996 TFLOPS
  [matmul_sycl_tla_fused]   :    0.453 ms    75.779 TFLOPS  speedup= 0.90x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-2048] 
  [oneDNN]                  :    0.807 ms    85.108 TFLOPS
  [matmul_sycl_tla_fused]   :    0.868 ms    79.201 TFLOPS  speedup= 0.93x

PASSED
auto_round_extension/ark/test/test_matmul.py::test_xpu_compare_dnnl_vs_sycl_tla[dt1-4096-4096-4096] 
  [oneDNN]                  :    1.495 ms    91.915 TFLOPS
  [matmul_sycl_tla_fused]   :    1.618 ms    84.918 TFLOPS  speedup= 0.92x
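
For readers checking the numbers, here is a minimal sketch of how the TFLOPS and speedup columns appear to be derived. It assumes the usual 2*M*N*K FLOP count for a GEMM, and it assumes the test ID `dtX-4096-4096-M` encodes two 4096-sized dimensions (N and K) followed by M. Neither assumption is confirmed by the test source, but both reproduce the logged values. The helper names `gemm_tflops` and `speedup` are hypothetical and are not taken from `test_matmul.py`.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput in TFLOPS, assuming 2*M*N*K floating-point ops per GEMM."""
    flops = 2.0 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio > 1 means the candidate kernel is faster than the baseline."""
    return baseline_ms / candidate_ms


# float16 row with M=1, N=K=4096 (oneDNN at 0.187 ms in the log above)
print(f"{gemm_tflops(1, 4096, 4096, 0.187):.3f} TFLOPS")  # 0.179 TFLOPS
print(f"speedup = {speedup(0.187, 0.108):.2f}x")
```

Because the log prints times rounded to three decimals, a speedup recomputed this way can differ from the logged value in the last digit (for example 1.73x here versus the logged 1.74x).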

Zhenzhong1 and others added 7 commits June 30, 2026 09:43
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>