
cublaslt_gemm microbenchmark fails when running with large matrix sizes #497

Closed
nishshah0 opened this issue Mar 25, 2023 · 5 comments

@nishshah0

When I run the cublaslt_gemm microbenchmark with B=64, M=8192, K=8192, N=8192, it fails in cudaMalloc.

I debugged this and found that the products B * M * K, B * M * N, and B * K * N are each more than 4 GB in BF16, so a 32-bit int data type cannot hold the result of the multiplication. I managed to make local changes to fix this, but once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED.
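
A minimal sketch of the overflow and the kind of fix described above (variable names are illustrative, not the actual benchmark code):

```cpp
#include <cstddef>
#include <cuda_bf16.h>
#include <cuda_runtime.h>

int main() {
    int b = 64, m = 8192, k = 8192;
    // b * m * k = 2^32 elements, so the product overflows when computed
    // in 32-bit int arithmetic before it is widened for cudaMalloc:
    //   size_t bytes = b * m * k * sizeof(__nv_bfloat16);  // overflows
    size_t bytes = static_cast<size_t>(b) * m * k * sizeof(__nv_bfloat16);  // 8 GiB
    void *d_A = nullptr;
    cudaError_t err = cudaMalloc(&d_A, bytes);
    return err == cudaSuccess ? 0 : 1;
}
```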

These are the steps to reproduce the error:
cd superbench/benchmarks/microbenchmarks/cublaslt_gemm
cmake -S ./
make
./cublaslt_gemm -b 64 -m 8192 -k 8192 -n 8192 -i 1000 -t bf16

@abuccts
Member

abuccts commented Mar 25, 2023

There's a size issue indeed; can you try whether 857a8ba fixes your case?

For the CUBLAS_STATUS_NOT_INITIALIZED issue, can you run successfully with a smaller batch? What's the total memory size of your GPU? It may be insufficient for such a large batch GEMM.

@nishshah0
Author

It does work for smaller batch sizes. With batch = 64, the A, B, and C matrices are each 8 GB, for a total of 24 GB. The GPU I am running on has 80 GB of memory, so this is well below its capacity. I was able to verify using a different tool that batch = 64 does work, but it does not with SuperBench. It would be good to get it fixed in SuperBench so I can leverage the FP8 support.
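
As a quick check of that arithmetic (BF16 is 2 bytes per element):

```cpp
#include <cstdio>

int main() {
    // Each of A, B, and C: 64 batches of 8192 x 8192 BF16 (2-byte) elements.
    const unsigned long long per_matrix = 64ULL * 8192 * 8192 * 2;   // bytes
    std::printf("per matrix: %llu GiB, total: %llu GiB\n",
                per_matrix >> 30, (3 * per_matrix) >> 30);           // 8, 24
    return 0;
}
```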

@abuccts
Member

abuccts commented Mar 29, 2023

I just tried 857a8ba on DGX H100 (driver 525.85.12, CUDA 12.0, Docker image superbench/superbench:v0.7.0-cuda11.8) and confirmed it works fine with batch sizes 64 and 128 by running /opt/superbench/bin/cublaslt_gemm -b [64/128] -m 8192 -k 8192 -n 8192 -i 1000 -t [bf16/fp16].

> I was able to verify using a different tool that batch = 64 does work

Does this tool leverage cutlass, cublas, or cublaslt?

> But once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED

I don't think cublaslt_gemm calls cublasCreate; do you mean cublasLtCreate?

Because CUBLAS_STATUS_NOT_INITIALIZED may indicate errors in the CUDA Runtime API calls made by the cuBLASLt routine, or in the hardware setup, and I cannot reproduce the issue, can you set CUBLASLT_LOG_LEVEL=5 when running cublaslt_gemm to see whether you get more information from your environment?

It would also be great if you could share more hardware/software information or try with the above driver/CUDA/image versions.

@nishshah0
Author

Verified that the patch works. I was also able to run batch = 64 successfully; it turned out to be an issue with the device.

I am attaching a patch with changes that helped me use the utility better:

  1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred (see the sketch after this list).
  2. Added support to specify the device from the command line, so I can script around the tool and control which device is used.
  3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.
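
For item 1, the approach is the usual CUDA error-check macro; a sketch of the idea (not the exact patch contents):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report the failing call, file, and line instead of throwing a bare
// std::logic_error with no location information.
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,    \
                         __LINE__, #call, cudaGetErrorString(err_));    \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc(&d_A, bytes));
```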

If you find it useful, feel free to check it in. I am unable to push the branch with these changes.
cublaslt_gemm_upgrade.patch

cp5555 closed this as completed Mar 31, 2023
@abuccts
Member

abuccts commented Apr 4, 2023

> Verified that the patch works. I was also able to run batch = 64 successfully; it turned out to be an issue with the device.
>
> I am attaching a patch with changes that helped me use the utility better:
>
>   1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred.
>   2. Added support to specify the device from the command line, so I can script around the tool and control which device is used.
>   3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.
>
> If you find it useful, feel free to check it in. I am unable to push the branch with these changes. cublaslt_gemm_upgrade.patch

Hi @nishshah-msft, I have created #503 to merge the size fix.

For your patch:

  1. The PR will also refine the error message with the function name and line number; thanks for your contribution!
  2. You can always use CUDA_VISIBLE_DEVICES to specify the device you want to use in any CUDA program, so adding a command-line argument seems unnecessary (see the sketch after this list).
  3. If you run the benchmark through the SuperBench CLI instead of executing the binary directly, you will get parsed results with the correct meanings; see the cublaslt-gemm metrics.
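
For item 2, the two approaches end up in the same place; a sketch with a hypothetical positional device argument (not SuperBench's actual interface):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    // In-program selection, as a command-line device flag would do:
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    cudaSetDevice(device);
    // The same effect without any code changes:
    //   CUDA_VISIBLE_DEVICES=3 ./cublaslt_gemm ...
    // remaps physical GPU 3 to device 0 inside the process.
    return 0;
}
```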
