cutlass calculate matrix size problem #75
This is expected behavior. As a performance optimization, most kernels
targeting Tensor Cores require 128b aligned memory accesses. This enables
vectorized memory accesses, fewer address instructions, and greater
efficiency. It is a documented requirement in cuBLAS, described in
greater detail here:
<https://docs.nvidia.com/cuda/cublas/index.html#tensorop-restrictions>.
Here are some additional suggestions for higher performance on the Turing
architecture:
1. Target Tensor Cores natively with `cutlass::arch::OpClassTensorOp`
rather than WMMA. This utilizes features new in CUDA 10.2.
using Gemm = cutlass::gemm::device::Gemm<
int8_t,
cutlass::layout::RowMajor,
int8_t,
cutlass::layout::ColumnMajor,
ElementOutput,
cutlass::layout::RowMajor,
ElementAccumulator,
cutlass::arch::OpClassTensorOp,
cutlass::arch::Sm75,
cutlass::gemm::GemmShape<128, 128, 32>,
cutlass::gemm::GemmShape<64, 64, 32>,
cutlass::gemm::GemmShape<8, 8, 16>
>;
2. Adapt your application to structure the A matrix as row-major.
On Mon, Dec 23, 2019 at 5:36 AM wanghr323 wrote:
I wrote a function using CUTLASS to test the performance of CUTLASS
calculation (int8, int8 to int), but I have now found a problem. M, N, and
K in my parameters cannot be selected at random; N and K must be
multiples of 16. Choosing other values causes an error. Is there
something wrong with how I wrote this function?
int Int8Operator:: cutlass_gemm32I_tensorop(const CBLAS_TRANSPOSE TransA,
const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
const void *alpha, const void* A, const void* B, const void *beta,
void* C, cublasGemmAlgo_t algo /* not used */)
{
using A_Major = cutlass::layout::ColumnMajor;
using B_Major = cutlass::layout::ColumnMajor;
using ElementOutput = int32_t;
using ElementAccumulator = int32_t;
int lda = (TransA == CblasNoTrans) ? K : M;
int ldb = (TransB == CblasNoTrans) ? N : K;
int ldc = N;
using Gemm = cutlass::gemm::device::Gemm<
int8_t,
A_Major,
int8_t,
B_Major,
ElementOutput,
cutlass::layout::RowMajor,
ElementAccumulator,
cutlass::arch::OpClassWmmaTensorOp,
cutlass::arch::Sm75,
cutlass::gemm::GemmShape<128, 128, 32>,
cutlass::gemm::GemmShape<64, 64, 32>,
cutlass::gemm::GemmShape<16, 16, 16>,
cutlass::epilogue::thread::LinearCombination<
ElementOutput,
128 / cutlass::sizeof_bits<ElementOutput>::value,
ElementAccumulator,
ElementAccumulator
>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
2
>;
Gemm gemm_op;
int alpha_ = *(static_cast<const int*>(alpha));
int beta_ = *(static_cast<const int*>(beta));
cutlass::Status status = gemm_op({
{M, N, K},
{static_cast<const int8_t *>(A), lda},
{static_cast<const int8_t *>(B), ldb},
{static_cast<int*>(C), ldc},
{static_cast<int*>(C), ldc},
{alpha_,beta_}
});
if (status != cutlass::Status::kSuccess) {
return cudaErrorUnknown;
}
return cudaSuccess;
}
OK, that is to say, at least two of M, N, and K should be multiples of 16.
Thank you for your reply, Kerr. I now have a requirement in my job: calculating C (int) = A (int8) × B (int8), where I want A, B, and C to be row-major matrices; the size of A is M × K, the size of B is K × N, and the size of C is M × N.
Here are three possible recourses:

1.) Padding. Size the matrices such that they are divisible by 16 elements and initialize the extra elements with zero.

2.) Reduce the alignment requirement at the expense of performance. The device-level GEMM API accepts an admittedly long list of template arguments, including the alignment constraints: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L201

using Gemm = cutlass::gemm::device::Gemm<
int8_t,
cutlass::layout::RowMajor,
int8_t,
cutlass::layout::ColumnMajor,
ElementOutput,
cutlass::layout::RowMajor,
ElementAccumulator,
cutlass::arch::OpClassTensorOp,
cutlass::arch::Sm75,
cutlass::gemm::GemmShape<128, 128, 64>,
cutlass::gemm::GemmShape<64, 64, 64>,
cutlass::gemm::GemmShape<8, 8, 16>,
cutlass::epilogue::thread::LinearCombination<
ElementOutput,
1, // alignment of C in units of number of elements
ElementAccumulator,
ElementAccumulator
>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
2,
1, // alignment of A in units of number of elements
1 // alignment of B in units of number of elements
>;

3.) Use the integer-valued SIMT kernels. You may consider using a kernel targeting integer dot product "dp4" instructions, first available in the Pascal microarchitecture and beyond. Here is the definition syntax, visible in unit tests for these kernels:

// Output data type - may be int8_t or int32_t
using ElementOutput = int8_t;
// Accumulator data type
using ElementAccumulator = int32_t;
// Scalar data type
using ElementCompute = float;
// Instruction shape - describes a 1x1x4 dot product computed by
// the "dp4" instruction.
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 4>;
// Threadblock and warp tile shapes (example values; tune for your problem)
using ThreadBlockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape = cutlass::gemm::GemmShape<32, 64, 32>;
using Gemm = cutlass::gemm::device::Gemm<
int8_t,
cutlass::layout::ColumnMajor,
int8_t,
cutlass::layout::ColumnMajor,
ElementOutput,
cutlass::layout::RowMajor,
int32_t,
cutlass::arch::OpClassSimt,
cutlass::arch::Sm61,
ThreadBlockShape,
WarpShape,
InstructionShape
>;

There is no restriction on M, N, or K, but the matrices themselves must be 32b aligned. That is, pointers and leading dimensions must be divisible by 4 bytes.
Thank you for your help. I will close the question.