For example, we can handle a small M with a large block_M using the following solution:
    for i, j in T.Parallel(block_M, block_N):
        m, n = by * block_M + i, bx * block_N + j
        if m < M and n < N:
            C[m, n] = C_shared[
                i // micro_size_x,
                j // micro_size_y,
                i % micro_size_x,
                j % micro_size_y,
            ]
It would be great if T.Parallel could automatically analyze the memory accesses and inject the if-then-else node itself.
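To make the effect of the manual guard concrete, here is a rough plain-Python emulation of the guarded write-back for a single block. It is only a sketch, not TileLang code: `make_shared` and `guarded_store` are hypothetical helper names, nested lists stand in for the buffers, and `by`/`bx` are assumed to be the block indices.

```python
def make_shared(block_M, block_N, micro_size_x, micro_size_y):
    # Hypothetical stand-in for C_shared with the 4D layout
    # [i // micro_size_x][j // micro_size_y][i % micro_size_x][j % micro_size_y].
    # Each element stores i * block_N + j so the mapping is easy to check.
    return [
        [
            [
                [(a * micro_size_x + c) * block_N + (b * micro_size_y + d)
                 for d in range(micro_size_y)]
                for c in range(micro_size_x)
            ]
            for b in range(block_N // micro_size_y)
        ]
        for a in range(block_M // micro_size_x)
    ]


def guarded_store(C_shared, M, N, block_M, block_N,
                  micro_size_x, micro_size_y, by=0, bx=0):
    # Hypothetical emulation of the T.Parallel loop: every (i, j) in the
    # block is visited, but only in-bounds (m, n) are written to C.
    C = [[None] * N for _ in range(M)]
    for i in range(block_M):
        for j in range(block_N):
            m, n = by * block_M + i, bx * block_N + j
            if m < M and n < N:  # the guard we would like injected automatically
                C[m][n] = C_shared[i // micro_size_x][j // micro_size_y][
                    i % micro_size_x][j % micro_size_y]
    return C


# M = 3 is much smaller than block_M = 8; only the valid 3 x 5 region is stored.
C = guarded_store(make_shared(8, 8, 4, 4), M=3, N=5,
                  block_M=8, block_N=8, micro_size_x=4, micro_size_y=4)
```

In this emulation, dropping the `if` raises an `IndexError` as soon as `m` reaches `M`; in a real kernel the same access would be an out-of-bounds write, which is why the guard is needed.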