- Re-implement the loader to issue cp.async directly instead of using CuTe for ColMajor layout.
- Re-implement the storer to issue the corresponding PTX for storing, avoiding CuTe for ColMajor layout.
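
As a rough illustration of what issuing the PTX directly could look like, here is a minimal sketch, not the actual implementation. The helper names (`cp_async_16B`, `cp_async_commit_and_wait_all`, `st_global_16B`), the kernel `copy_colmajor`, the 16-byte access width, and the choice of `st.global.v4.b32` as the store instruction are all assumptions made for the example. It assumes an sm_80 or newer GPU, a column-major float matrix whose row count and leading dimension are multiples of 4, and 16-byte-aligned base pointers.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: 16-byte asynchronous copy, global -> shared (sm_80+).
__device__ __forceinline__ void cp_async_16B(void* smem_dst, const void* gmem_src) {
    // cp.async expects a shared-state-space address, not a generic pointer.
    uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    // gmem_src is assumed to point into global memory.
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_addr), "l"(gmem_src) : "memory");
}

// Hypothetical helper: commit the outstanding copies and wait for all of them.
__device__ __forceinline__ void cp_async_commit_and_wait_all() {
    asm volatile("cp.async.commit_group;\n" ::: "memory");
    asm volatile("cp.async.wait_group 0;\n" ::: "memory");
}

// Hypothetical helper: 16-byte vectorized store to global memory.
// The choice of st.global.v4.b32 is an assumption for this sketch.
__device__ __forceinline__ void st_global_16B(void* gmem_dst, const uint4& v) {
    asm volatile("st.global.v4.b32 [%0], {%1, %2, %3, %4};\n"
                 :: "l"(gmem_dst), "r"(v.x), "r"(v.y), "r"(v.z), "r"(v.w) : "memory");
}

// Copies a column-major M x N float matrix through shared memory.
// Each thread moves 4 consecutive rows of one column (contiguous in ColMajor).
template <int kThreads>
__global__ void copy_colmajor(const float* __restrict__ src, float* __restrict__ dst,
                              int M, int N, int ld) {
    __shared__ __align__(16) float tile[kThreads * 4];
    const int vec_per_col = M / 4;
    const int vec_id = blockIdx.x * kThreads + threadIdx.x;
    if (vec_id < vec_per_col * N) {
        const int col = vec_id / vec_per_col;
        const int row = (vec_id % vec_per_col) * 4;
        const size_t offset = static_cast<size_t>(col) * ld + row;
        float* smem = &tile[threadIdx.x * 4];
        cp_async_16B(smem, src + offset);
        cp_async_commit_and_wait_all();
        // Each thread reads back only the 16 bytes it copied itself,
        // so no __syncthreads() is needed here.
        const uint4 v = *reinterpret_cast<const uint4*>(smem);
        st_global_16B(dst + offset, v);
    }
}

int main() {
    const int M = 64, N = 8, ld = M;
    const size_t count = static_cast<size_t>(ld) * N;
    float *src = nullptr, *dst = nullptr;
    cudaMallocManaged(&src, count * sizeof(float));
    cudaMallocManaged(&dst, count * sizeof(float));
    for (size_t i = 0; i < count; ++i) { src[i] = static_cast<float>(i); dst[i] = -1.0f; }

    constexpr int kThreads = 128;
    const int vecs = (M / 4) * N;
    copy_colmajor<kThreads><<<(vecs + kThreads - 1) / kThreads, kThreads>>>(src, dst, M, N, ld);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (size_t i = 0; ok && i < count; ++i) ok = (dst[i] == src[i]);
    printf(ok ? "copy matched\n" : "copy failed\n");
    cudaFree(src);
    cudaFree(dst);
    return ok ? 0 : 1;
}
```

This should build with something like `nvcc -arch=sm_80`. A real loader would likely keep several commit groups in flight and wait with a non-zero `cp.async.wait_group` count to overlap copies with compute; the sketch waits immediately to stay short.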