
@LeiWang1999
Contributor

This pull request introduces several enhancements and bug fixes to the bitblas and testing/python/tilelang modules: new tensor core intrinsic emitter classes, improved matrix multiplication test functions, and updated memory allocation in the test files.

Enhancements to tensor core intrinsics:

  • Added INT4TensorCoreIntrinEmitter and INT4TensorCoreIntrinEmitterWithLadderTransform classes in bitblas/tl/macro_generator.py to support matrix multiplication with the int4 data type.
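Dedicated int4 emitters are needed because two 4-bit values share one byte of storage. As a minimal pure-Python sketch of that packing scheme (the real emitters generate TIR tensor core intrinsics; the low/high-nibble convention below is an assumption for illustration only):

```python
def pack_int4(values):
    """Pack pairs of signed 4-bit integers (range -8..7) into bytes.

    Low nibble holds the even-indexed element, high nibble the odd-indexed
    one -- a common convention, assumed here for illustration.
    """
    assert len(values) % 2 == 0
    packed = []
    for lo, hi in zip(values[::2], values[1::2]):
        packed.append(((hi & 0xF) << 4) | (lo & 0xF))
    return packed

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed 4-bit values from bytes."""
    def to_signed(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    values = []
    for byte in packed:
        values.append(to_signed(byte & 0xF))         # low nibble
        values.append(to_signed((byte >> 4) & 0xF))  # high nibble
    return values
```

Because each byte carries two logical elements, index arithmetic and fragment sizes in the int4 emitters differ from the int8/fp16 paths, which is why separate classes are introduced rather than a flag on the existing emitter.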

Improvements to matrix multiplication functions:

  • Updated the tl_matmul_with_ladder_weight_only_transform and tl_matmul_with_ladder_weight_only_transform_block_reduce_int4 functions in testing/python/tilelang/test_tilelang_macro_gemm.py to use separate local fragment sizes for the A, B, and C matrices.
  • Modified the main function in testing/python/tilelang/test_tilelang_macro_gemm.py to allocate local memory using the new local-size variables.
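Separate local sizes are needed because, in warp-level MMA, each of the 32 threads holds a fixed per-thread slice of every fragment, and the A, B, and C slices generally differ with the tile shapes. A sketch of the arithmetic, with a hypothetical m16n16k16 micro-tile (the shapes are assumptions, not taken from the PR):

```python
warp_size = 32  # threads per warp on NVIDIA GPUs

# Hypothetical micro-tile shape; the real values depend on the MMA
# instruction the kernel targets.
micro_size_x, micro_size_y, micro_size_k = 16, 16, 16

# Per-thread register counts: one warp cooperatively holds one micro-tile.
local_size_a = (micro_size_x * micro_size_k) // warp_size  # A fragment
local_size_b = (micro_size_y * micro_size_k) // warp_size  # B fragment
local_size_c = (micro_size_x * micro_size_y) // warp_size  # C accumulator

print(local_size_a, local_size_b, local_size_c)
```

With a single shared local size, any tile shape where these three values diverge would over- or under-allocate registers; tracking them separately is what the updated test functions do.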

Updates to utility functions:

  • Enhanced the make_swizzle_layout function in bitblas/tl/utils.py with an optional is_smooth parameter that controls whether the swizzled or the smooth (unpermuted) layout is produced.
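A swizzle layout permutes the column index within each shared-memory row so that threads in a warp hit distinct banks. A minimal XOR-swizzle sketch, purely illustrative (the real make_swizzle_layout builds a TIR layout over vectorized chunks, and the exact semantics of is_smooth are an assumption here):

```python
def swizzle(row, col, banks=8):
    """XOR-based swizzle: permute columns within a row by the row index.

    Illustrative stand-in for a shared-memory swizzle; production layouts
    permute vectorized 128-bit chunks rather than single elements.
    """
    return row, (col ^ row) % banks

def smooth(row, col, banks=8):
    """'Smooth' layout: identity mapping, no permutation."""
    return row, col
```

XOR with a fixed row index is a bijection on the column range, so each row still touches every bank exactly once; the smooth variant keeps addresses contiguous, which is what ladder-transformed data typically requires.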

Subproject updates:

  • Updated the subproject commit for 3rdparty/tvm.

Refactor tensor core memory allocation in MatmulFineGrainScheduler

- Adjusted the local fragment sizes for tensor core memory allocation in the MatmulFineGrainScheduler class.
- Updated the allocation sizes for A_local, B_local, and C_local variables based on the new fragment sizes.
- The changes ensure efficient memory utilization and improve performance.

Refactor tensor core memory allocation in MatmulDequantizeFineGrainedScheduler

- Modified the fragment sizes for tensor core memory allocation in the MatmulDequantizeFineGrainedScheduler class.
- Updated the allocation sizes for A_frag, B_frag, and C_frag variables based on the new fragment sizes.
- The changes optimize memory usage and enhance the efficiency of the dequantization process.

Refactor tensor core memory allocation in MatmulDequantizeWeightPropagationScheduler

- Adjusted the fragment sizes for tensor core memory allocation in the MatmulDequantizeWeightPropagationScheduler class.
- Updated the allocation sizes for A_frag, B_frag, B_dequantize_frag, and C_frag variables based on the new fragment sizes.
- The changes improve memory utilization and optimize the weight propagation process.
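The extra B_dequantize_frag exists because the packed and dequantized B fragments have different element counts: with 4-bit weights in 8-bit storage, each storage element carries two logical elements. A sketch of the sizing relation (the fragment size is a hypothetical value for a 16x16 tile over a 32-thread warp, not taken from the PR):

```python
storage_bits = 8   # weights stored in int8 storage elements
weight_bits = 4    # logical int4 weights
num_elems_per_byte = storage_bits // weight_bits  # elements per storage unit

# Hypothetical per-thread dequantized fragment size (16*16 // 32).
local_size_b_dequant = 8

# The packed fragment is correspondingly smaller.
local_size_b_packed = local_size_b_dequant // num_elems_per_byte

print(local_size_b_packed, local_size_b_dequant)
```

Sizing B_frag by the packed count and B_dequantize_frag by the unpacked count avoids allocating registers for elements that never exist in the respective representation.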
@LeiWang1999 LeiWang1999 merged commit e94f65d into microsoft:main Nov 1, 2024
5 of 6 checks passed