
Conversation

@LeiWang1999
Contributor

This pull request makes changes across several files to improve error handling, refine the GPU matrix multiplication logic, and update the integration benchmarks. The most important changes are raising the maximum error-message length, tightening the block-reduction scheduling in the MMA dequantize path, and modernizing the BitNet benchmark.

Error Handling Improvements:

  • Increased MAX_ERROR_MESSAGE_LENGTH from 200 to 500 in bitblas/common.py.
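A minimal sketch of how a cap like MAX_ERROR_MESSAGE_LENGTH might be applied; the constant name matches bitblas/common.py, but the helper function is illustrative, not code from the repository.

```python
# Hypothetical helper showing the effect of the larger cap; only the constant
# name MAX_ERROR_MESSAGE_LENGTH comes from bitblas/common.py.
MAX_ERROR_MESSAGE_LENGTH = 500  # raised from 200 in this PR

def truncate_error_message(message: str) -> str:
    """Clip overly long error messages, appending an ellipsis marker."""
    if len(message) <= MAX_ERROR_MESSAGE_LENGTH:
        return message
    return message[:MAX_ERROR_MESSAGE_LENGTH] + "..."
```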

GPU Matrix Multiplication Logic Enhancements:

  • Refined the condition that checks block_reduction_depth, defaulting it to 1 when block_reduction_depth is None, in bitblas/gpu/matmul_mma_dequantize.py.
  • Updated the thread-binding and loop-splitting logic based on the reduce_k value in bitblas/gpu/matmul_mma_dequantize.py.
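A hedged sketch of the two patterns described above: falling back to a depth of 1 when block_reduction_depth is unset, and splitting a reduction loop by a reduce_k factor. The names mirror the PR description; the functions are illustrative stand-ins, not the actual TIR scheduling code in matmul_mma_dequantize.py.

```python
# Illustrative sketch only; the real logic operates on TVM TIR schedules.

def resolve_block_reduction_depth(block_reduction_depth):
    # Use a depth of 1 when the caller leaves it unspecified (None).
    return 1 if block_reduction_depth is None else block_reduction_depth

def split_reduction_loop(extent: int, reduce_k: int):
    # Split a loop of `extent` iterations into (outer, inner) parts, where
    # the inner part of size `reduce_k` would be bound to threads.
    assert extent % reduce_k == 0, "extent must be divisible by reduce_k"
    return extent // reduce_k, reduce_k
```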

Integration Benchmark Updates:

  • Updated integration benchmarks to use model.quantize() and torch.compile(model) in integration/BitNet/benchmark_inference_latency.py.

Import Optimization:

  • Optimized imports in integration/pytorch/bitblas_linear.py by updating the import statement for MatmulConfig and Matmul.

Submodule Update:

  • Updated the submodule commit for 3rdparty/tvm.

Refs issue #218.

@LeiWang1999 LeiWang1999 marked this pull request as ready for review October 11, 2024 11:06
@LeiWang1999 LeiWang1999 merged commit 5f10b44 into microsoft:main Oct 11, 2024
6 checks passed
