
cublaslt_gemm microbenchmark fails when running with large matrix sizes #497

Closed
nishshah0 opened this issue Mar 25, 2023 · 5 comments

@nishshah0

When I run the cublaslt_gemm microbenchmark with B=64, M=8192, K=8192, N=8192, it fails in cudaMalloc.

I debugged this and found that the products B * M * K, B * M * N, and B * K * N are each more than 4 GB in BF16, so a 32-bit int data type cannot hold the result of the multiplication. I managed to make local changes to fix this, but once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED.
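
A minimal sketch of the overflow and the kind of fix described above (variable names are illustrative, not the actual benchmark code):

```cpp
#include <cstddef>
#include <cuda_bf16.h>
#include <cuda_runtime.h>

int main() {
    int b = 64, m = 8192, k = 8192;
    // b * m * k = 2^32 elements, so the product overflows when computed
    // in 32-bit int arithmetic before it is widened for cudaMalloc:
    //   size_t bytes = b * m * k * sizeof(__nv_bfloat16);  // overflows
    size_t bytes = static_cast<size_t>(b) * m * k * sizeof(__nv_bfloat16);  // 8 GiB
    void *d_A = nullptr;
    cudaError_t err = cudaMalloc(&d_A, bytes);
    return err == cudaSuccess ? 0 : 1;
}
```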

These are the steps to reproduce the error:
cd superbench/benchmarks/microbenchmarks/cublaslt_gemm
cmake -S ./
make
./cublaslt_gemm -b 64 -m 8192 -k 8192 -n 8192 -i 1000 -t bf16

@abuccts
Member

abuccts commented Mar 25, 2023

There's a size issue indeed; can you try whether 857a8ba fixes your case?

For the CUBLAS_STATUS_NOT_INITIALIZED issue, can you run successfully with a smaller batch? What's the total memory size of your GPU? It may be insufficient for such a large batch GEMM.

@nishshah0
Author

It does work for smaller batch sizes. With batch = 64, the A, B, and C matrices are each 8 GB, for a total of 24 GB. The GPU I am running on has 80 GB of memory, so this is well below its capacity. I was able to verify using a different tool that batch = 64 does work, but it does not with SuperBench. It would be good to get it fixed in SuperBench so I can leverage the FP8 support.
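
As a quick check of that arithmetic (BF16 is 2 bytes per element):

```cpp
#include <cstdio>

int main() {
    // Each of A, B, and C: 64 batches of 8192 x 8192 BF16 (2-byte) elements.
    const unsigned long long per_matrix = 64ULL * 8192 * 8192 * 2;   // bytes
    std::printf("per matrix: %llu GiB, total: %llu GiB\n",
                per_matrix >> 30, (3 * per_matrix) >> 30);           // 8, 24
    return 0;
}
```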

@abuccts
Member

abuccts commented Mar 29, 2023

I just tried 857a8ba on DGX H100 (driver 525.85.12, CUDA 12.0, Docker image superbench/superbench:v0.7.0-cuda11.8) and confirmed it works fine with batch sizes 64 and 128 by running /opt/superbench/bin/cublaslt_gemm -b [64/128] -m 8192 -k 8192 -n 8192 -i 1000 -t [bf16/fp16].

> I was able to verify using a different tool that batch = 64 does work

Does this tool leverage cutlass, cublas, or cublaslt?

> But once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED

I don't think cublaslt_gemm calls cublasCreate; do you mean cublasLtCreate?

Because CUBLAS_STATUS_NOT_INITIALIZED may indicate errors in the CUDA Runtime API calls made by the cuBLASLt routine, or in the hardware setup, and I cannot reproduce the issue, can you set CUBLASLT_LOG_LEVEL=5 when running cublaslt_gemm to see whether you get more information from your environment?

It would also be great if you could share more hardware/software information or try with the above driver/CUDA/image versions.

@nishshah0
Author

Verified that the patch works. I was also able to run batch = 64 successfully; it turned out to be an issue with the device.

I am attaching a patch with changes that helped me use the utility better:

  1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred (see the sketch after this list).
  2. Added support to specify the device from the command line, so I can script around the tool and control which device is used.
  3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.
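
For item 1, the approach is the usual CUDA error-check macro; a sketch of the idea (not the exact patch contents):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report the failing call, file, and line instead of throwing a bare
// std::logic_error with no location information.
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,    \
                         __LINE__, #call, cudaGetErrorString(err_));    \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc(&d_A, bytes));
```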

If you find it useful, feel free to check it in. I am unable to push the branch with these changes.
cublaslt_gemm_upgrade.patch

cp5555 closed this as completed Mar 31, 2023
@abuccts
Member

abuccts commented Apr 4, 2023

> Verified that the patch works. I was also able to run batch = 64 successfully; it turned out to be an issue with the device.
>
> I am attaching a patch with changes that helped me use the utility better:
>
>   1. When an error occurs, it prints exactly which function and line caused the error, instead of throwing std::logic_error, which carries no information about where the error occurred.
>   2. Added support to specify the device from the command line, so I can script around the tool and control which device is used.
>   3. Added a header printf to let the user know what each output column means, instead of having to look at the source file.
>
> If you find it useful, feel free to check it in. I am unable to push the branch with these changes. cublaslt_gemm_upgrade.patch

Hi @nishshah-msft, I have created #503 to merge the size fix.

For your patch:

  1. The PR will also refine the error message with the function name and line number; thanks for your contribution!
  2. You can always use CUDA_VISIBLE_DEVICES to specify the device you want to use in any CUDA program, so adding a command-line argument seems unnecessary (see the sketch after this list).
  3. If you run the benchmark through the SuperBench CLI instead of executing the binary directly, you will get parsed results with the correct meanings; see the cublaslt-gemm metrics.
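
For item 2, the two approaches end up in the same place; a sketch with a hypothetical positional device argument (not SuperBench's actual interface):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    // In-program selection, as a command-line device flag would do:
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    cudaSetDevice(device);
    // The same effect without any code changes:
    //   CUDA_VISIBLE_DEVICES=3 ./cublaslt_gemm ...
    // remaps physical GPU 3 to device 0 inside the process.
    return 0;
}
```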
