Conversation

@LeiWang1999 (Contributor)
Clean up the interface.

With GEMM_SS, a GEMM kernel can be written as:

    ptx_macro_generator = TensorCorePTXMacroGenerator(
        a_dtype=dtypeAB, b_dtype=dtypeAB, accum_dtype=accum_dtype,
        a_transposed=False, b_transposed=True, block_row_warps=block_row_warps,
        block_col_warps=block_col_warps, warp_row_tiles=warp_row_tiles,
        warp_col_tiles=warp_col_tiles, chunk=chunk, threads=threads
    )
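
    # The generator above emits warp-level Tensor Core (PTX mma) macros,
    # specialized for the tile shape, dtypes, and thread count configured here.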
    @T.prim_func
    def main(
        A: T.Buffer(A_shape, dtypeAB),
        B: T.Buffer(B_shape, dtypeAB),
        C: T.Buffer((M, N), dtypeC),
    ):
        # Launch a 2D grid: bx tiles the N dimension, by tiles the M dimension.
        with T.Kernel(
            T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads
        ) as (bx, by):

            # Shared-memory tiles for the A/B operands and the output epilogue.
            A_shared = T.alloc_shared(A_shared_shape, dtypeAB, scope=shared_scope)
            B_shared = T.alloc_shared(B_shared_shape, dtypeAB, scope=shared_scope)
            C_shared = T.alloc_shared(C_shared_shape, dtypeC, scope=shared_scope)
            # Per-thread accumulator fragments for the Tensor Core tiles.
            C_local = T.alloc_fragment(warp_rows * warp_cols * local_size, accum_dtype, scope="local")
            # Flat thread index; the PTX macros derive warp/lane offsets from it.
            thread_bindings = T.thread_binding(0, threads, "threadIdx.x")

            # Swizzle the shared-memory layouts to avoid bank conflicts.
            T.annotate_layout(
                {
                    A_shared: make_swizzle_layout(A_shared),
                    B_shared: make_swizzle_layout(B_shared),
                }
            )

            # Zero-initialize the accumulator fragments.
            for i in T.serial(warp_rows * warp_cols * local_size):
                C_local[i] = 0

            # Software-pipelined main loop over K tiles.
            for ko in T.Pipelined(K // block_K, num_stages=stage - 1):
                # TODO(lei): storage sync should be injected automatically by TVM Pass
                T.tvm_storage_sync("shared")

                # Load A into shared memory
                for i, k in T.Parallel(block_M, block_K):
                    A_shared[i, k] = A[by * block_M + i, ko * block_K + k]

                # Load B into shared memory
                for j, k in T.Parallel(block_N, block_K):
                    B_shared[j, k] = B[bx * block_N + j, ko * block_K + k]

                # TODO(lei): storage sync should be injected automatically by TVM Pass
                T.tvm_storage_sync("shared")

                # Perform the Tensor Core GEMM; GEMM_SS reads both operands from
                # shared memory (the macro takes the generator as its first argument).
                ptx_macro_generator.GEMM_SS(
                    ptx_macro_generator,
                    A_shared,
                    B_shared,
                    C_local,
                    thread_bindings=thread_bindings,
                )

            # Store the accumulator fragments back to shared memory.
            ptx_macro_generator.STMATRIX(
                ptx_macro_generator,
                C_local,
                C_shared,
                thread_bindings=thread_bindings,
            )

            # Write C back to global memory, unpacking C_shared's
            # (micro-tile row, micro-tile col, intra-tile row, intra-tile col) layout.
            for i, j in T.Parallel(block_M, block_N):
                C[by * block_M + i, bx * block_N + j] = C_shared[
                    i // micro_size_x, j // micro_size_y,
                    i % micro_size_x, j % micro_size_y,
                ]
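
For context, a kernel like this is typically lowered and validated through the TL profiler. The snippet below is a minimal sketch, not part of this PR: it assumes the tvm.tl toolchain bundled with BitBLAS (TL.lower, TL.Profiler, TL.TensorSupplyType) and a hypothetical matmul(M, N, K) factory that returns the main prim_func above; treat those entry points as assumptions.

    import torch
    from tvm import tl as TL  # TL toolchain bundled with BitBLAS; import path is an assumption

    # Hypothetical factory that builds the prim_func above for a given problem size.
    program = matmul(1024, 1024, 1024)

    # Lower the TL prim_func to a runnable module.
    mod, params = TL.lower(program)

    # Validate with integer inputs; buffer index 2 (C) is treated as the output.
    profiler = TL.Profiler(mod, params, [2], TL.TensorSupplyType.Integer)

    # B is stored transposed (b_transposed=True), so the reference is A @ B.T
    # (assuming dtypeC is float16 here).
    def ref_program(A, B):
        return (A.to(torch.float32) @ B.to(torch.float32).T).to(torch.float16)

    profiler.assert_allclose(ref_program)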

LeiWang1999 merged commit b9fab25 into microsoft:main on Sep 6, 2024.