[TL] Allow T.clear be applied on a "local" Buffer and improve L2 Swizzle #178

LeiWang1999 · 2024-09-06T10:48:16Z

To use L2 swizzle:

    @T.prim_func
    def main(
        A: T.Buffer(A_shape, dtypeAB),
        B: T.Buffer(B_shape, dtypeAB),
        C: T.Buffer((M, N), dtypeC),
    ):
        with T.Kernel(
            T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads
        ) as (bx, by):

            A_shared = T.alloc_shared(A_shared_shape, dtypeAB, scope=shared_scope)
            B_shared = T.alloc_shared(B_shared_shape, dtypeAB, scope=shared_scope)
            C_shared = T.alloc_shared(C_shared_shape, dtypeC, scope=shared_scope)
            A_local = T.alloc_fragment((warp_rows * local_size), dtypeAB, scope="local")
            B_local = T.alloc_fragment((warp_cols * local_size), dtypeAB, scope="local")
            C_local = T.alloc_fragment((warp_rows * warp_cols * local_size), accum_dtype, scope="local")
            thread_bindings = T.thread_binding(0, threads, "threadIdx.x")

            T.annotate_layout(
                {
                    A_shared: make_swizzle_layout(A_shared),
                    B_shared: make_swizzle_layout(B_shared),
                }
            )

            T.use_swizzle(panel_size=10)
            ...
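
For readers unfamiliar with the idea behind `T.use_swizzle(panel_size=10)`, here is a minimal sketch of one common panel-based block rasterization. It is an illustration under stated assumptions, not the mapping TL actually emits: the function name `rasterize`, the column-major walk inside a panel, and the example grid sizes are all hypothetical. The point is that consecutive block ids stay inside a narrow band of `panel_size` tile rows, so blocks that run at about the same time tend to reuse the same tiles of `A` and `B` from L2.

```python
def rasterize(block_id, grid_x, grid_y, panel_size):
    """Map a linear launch-order block id to (bx, by) tile coordinates.

    Hypothetical scheme: the grid is split into horizontal panels of
    `panel_size` tile rows. Within a panel, blocks walk down the rows of
    one tile column before moving on to the next column.
    """
    blocks_per_panel = panel_size * grid_x
    panel_idx = block_id // blocks_per_panel
    offset = block_id % blocks_per_panel
    # The last panel may be shorter when grid_y is not a multiple of panel_size.
    rows_in_panel = min(panel_size, grid_y - panel_idx * panel_size)
    bx = offset // rows_in_panel
    by = panel_idx * panel_size + offset % rows_in_panel
    return bx, by


if __name__ == "__main__":
    grid_x, grid_y, panel_size = 4, 5, 2
    order = [rasterize(i, grid_x, grid_y, panel_size) for i in range(grid_x * grid_y)]
    # The first six blocks cover a 2-row band, column by column.
    print(order[:6])  # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

Every tile is still visited exactly once; only the visiting order changes. Which `panel_size` works best likely depends on the tile sizes and the L2 capacity of the target GPU.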

To initialize a local register buffer:

C_local = T.alloc_fragment((warp_rows * warp_cols * local_size), accum_dtype, scope="local")
T.clear(C_local)

The generated code is:

float C_local[32];
for (int i = 0; i < 8; ++i) {
    *(float4*)(C_local + (i * 4)) = make_float4(0.000000e+00f, 0.000000e+00f, 0.000000e+00f, 0.000000e+00f);
}

Note: vectorized initialization appears to be more efficient in my test case.
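
The loop bound of 8 in the generated code follows from simple arithmetic: 32 `float` elements cleared with 4-lane `float4` stores need 32 / 4 = 8 iterations. The sketch below reproduces that calculation. The helper `vectorized_clear_plan` is hypothetical and is not part of TL; it assumes the widest store is 128 bits (the size of a `float4`) and that the lane count is halved until it divides the buffer length evenly.

```python
def vectorized_clear_plan(num_elems, elem_bits, max_vector_bits=128):
    """Return (lanes, iterations) for clearing `num_elems` elements.

    Assumption for illustration: start from the widest vector that fits in
    `max_vector_bits` and halve the lane count until it divides `num_elems`.
    """
    lanes = max(max_vector_bits // elem_bits, 1)
    while lanes > 1 and num_elems % lanes != 0:
        lanes //= 2
    return lanes, num_elems // lanes


if __name__ == "__main__":
    # 32 float32 accumulators, as in `float C_local[32]`.
    lanes, iterations = vectorized_clear_plan(32, 32)
    print(lanes, iterations)  # 4 8
```

Under the same assumptions, a buffer whose length is not a multiple of 4 would fall back to narrower stores, for example `vectorized_clear_plan(6, 32)` gives 2 lanes and 3 iterations.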


LeiWang1999 added 30 commits July 5, 2024 08:54

Refactor BatchMatMulEmitter and BatchMatMulSelector for improved readability and maintainability

d8884e6

Refactor import statements for improved readability and maintainability

fc84173

Refactor import statements for improved readability and maintainability

02f64de

disable failure email for ci

397eee6

remove email notifications.

20f6ad1

move relax pass from testing to mlc_llm

b93c394

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

ba6a6df

Refactor scripts with se check_eual_ref_scripts_with_emitter function

257693a

Lint Fix

9bb7f49

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

39e7614

Refactor scripts with se check_eual_ref_scripts_with_emitter function

93eb5a5

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

72b9740

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

5b65979

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into main

d9bd479

buf fix for matrix support

99515cb

lint fix

14406ef

dispatch tensor core based on shapes

d30ec4f

update install commands

fde4029

import scripts

6a04749

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into docs

9d90c40

remove shared mem hack

9ef14e9

revert change for swizzling

63f363e

bug fix

b29c66c

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into docs

4643dd9

tl examples

28beb13

Enhance Swizzle

c0b476f

lint fix

2bf14a8

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into tl-layout

52accbf

test fix

19aa985

lint fix

ef8f93c

LeiWang1999 added 16 commits September 3, 2024 07:32

optimize layout

4015cc4

update tl utils.

5c5880c

macro optimization

1042ffd

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into tl-layout

1ecd76e

test fix

7bb21e7

gemm_ss

6a22442

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into tl-layout

b9ea093

doc fix

e9b56b4

lint fix

3eb6888

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into tl-layout

5322785

lint fix

6f18d15

remove debug print

187f448

remove debug print

e1fac68

vectorization init

4f25626

Merge branch 'main' of https://github.com/Microsoft/BitBLAS into tl-layout

2686030

lint fix

23a8e8b

LeiWang1999 merged commit 5e3da9b into microsoft:main Sep 6, 2024
