[MLAS] add q4 quantize and transpose kernel to support MatMulNBits QDQ fuse #21054
Conversation
Force-pushed from 14fcdb7 to ed9421a.
Force-pushed from a4fe4c5 to 44c0115.
Description
Benchmark
Results were attached for two configurations (the second exercises a K dimension that is not divisible by the block size):
- <1024 x 4096 input, 64 quant block, 8 threads>
- <1024 x 4095 input, 64 quant block, 8 threads>
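For context, the kernel being benchmarked performs blockwise 4-bit quantization: each weight column is split into blocks of `block_size` values along K, and each block gets its own scale and zero point. Below is a minimal numpy sketch of one plausible formulation; the function name and the exact rounding/zero-point convention are illustrative, not the MLAS implementation, which operates on packed buffers and also transposes the output.

```python
import numpy as np

def quantize_q4_blockwise(weights: np.ndarray, block_size: int = 64):
    """Illustrative asymmetric 4-bit blockwise quantization along K (axis 0)."""
    k, n = weights.shape
    # Pad K so it divides evenly into blocks (covers the 4095-row case above).
    k_blocks = (k + block_size - 1) // block_size
    padded = np.zeros((k_blocks * block_size, n), dtype=weights.dtype)
    padded[:k, :] = weights
    blocks = padded.reshape(k_blocks, block_size, n)

    # Per-block min/max -> scale and zero point for unsigned 4-bit [0, 15].
    bmin = np.minimum(blocks.min(axis=1), 0.0)
    bmax = np.maximum(blocks.max(axis=1), 0.0)
    scales = (bmax - bmin) / 15.0
    scales[scales == 0.0] = 1.0  # avoid division by zero for all-zero blocks
    zero_points = np.clip(np.round(-bmin / scales), 0, 15).astype(np.uint8)

    # Quantize: q = round(w / scale) + zero_point, clamped to [0, 15].
    q = np.clip(np.round(blocks / scales[:, None, :]) + zero_points[:, None, :], 0, 15)
    return q.astype(np.uint8), scales, zero_points
```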
Motivation and Context
The MatMulNBits tool chain currently only supports converting a MatMul op directly to a MatMulNBits op, and MatMulNBits is not a standard ONNX op.
Therefore, the tool chain needs to support converting MatMul to Q/DQ format; a later transform step then fuses DQ + MatMul into MatMulNBits. The tensors stored in the DQ node are the quantized constants, and they carry over into the MatMulNBits node.
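To make the pattern concrete, here is a hedged sketch of the two graph forms built with `onnx.helper`. Tensor names and shapes are hypothetical; blocked `DequantizeLinear` requires opset >= 21, and `MatMulNBits` lives in the `com.microsoft` domain.

```python
from onnx import helper

# Q/DQ form emitted by the tool chain: the quantized weight is the
# initializer of a DequantizeLinear node feeding a plain ONNX MatMul.
dq = helper.make_node(
    "DequantizeLinear",
    inputs=["W_q4", "W_scales", "W_zero_points"],
    outputs=["W_dq"],
    axis=0,
    block_size=64,
)
matmul = helper.make_node("MatMul", inputs=["A", "W_dq"], outputs=["Y"])

# After the fusion transform, the same quantized constants move onto a
# single contrib MatMulNBits node (K, N, bits, block_size are attributes).
fused = helper.make_node(
    "MatMulNBits",
    inputs=["A", "W_q4", "W_scales", "W_zero_points"],
    outputs=["Y"],
    domain="com.microsoft",
    K=4096, N=4096, bits=4, block_size=64,
)
```

Note that the activation input A stays in float in both forms; only the weight side is quantized, which is why the DQ initializers can be transplanted directly onto the fused node.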