The current global-to-shared load uses a fixed 16x16 base tile to match TensorCore's warp-tile requirement. This becomes inefficient when the problem size is large enough to support a larger warp tile, which would allow better-coalesced global memory accesses.