cublaslt_gemm microbenchmark fails when running with large matrix sizes #497
Comments
There's a size issue indeed. Can you try whether 857a8ba fixes your case? For the …
It does work for smaller batch sizes. With batch = 64, the A, B, and C matrices are 8 GB each, 24 GB in total, and the GPU I am running on has 80 GB of memory, so this is well below its capacity. I was able to verify with a different tool that batch = 64 does work, but it does not with superbench. It would be good to get it fixed in superbench so I can leverage the FP8 support.
I just tried 857a8ba on DGX H100 (driver 525.85.12, CUDA 12.0, Docker image superbench/superbench:v0.7.0-cuda11.8) and confirmed it works fine with batch sizes 64 and 128 by running the reproduction command below.
Does this tool leverage cutlass, cublas, or cublaslt?
I don't think it uses CUTLASS; it calls cuBLASLt directly, see superbenchmark/superbench/benchmarks/micro_benchmarks/cublaslt_gemm/cublaslt_utils.cc, line 8 in 857a8ba.
It would also be great if you could share more hardware/software information, or try with the above driver/CUDA/image versions.
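For readers unfamiliar with the distinction, here is a minimal sketch of the cuBLASLt call sequence a benchmark like this typically uses, assuming a single non-batched BF16 GEMM with FP32 accumulation; the function name and column-major layouts are illustrative, not the repo's actual code:

```cpp
// Illustrative sketch only, not the benchmark's actual code.
// Shows the cuBLASLt (not plain cuBLAS or CUTLASS) call sequence for one
// BF16 GEMM with FP32 accumulation. Error checking omitted for brevity.
#include <cublasLt.h>
#include <cuda_runtime.h>

void gemm_bf16(const void* A, const void* B, void* C,
               int m, int n, int k, cudaStream_t stream) {
  cublasLtHandle_t handle;
  cublasLtCreate(&handle);

  // Operation descriptor: FP32 compute type, FP32 alpha/beta scale type.
  cublasLtMatmulDesc_t op;
  cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // Matrix layouts (column-major, leading dimension = number of rows).
  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_16BF, m, k, m);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_16BF, k, n, k);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_16BF, m, n, m);

  float alpha = 1.0f, beta = 0.0f;
  // algo = nullptr lets cuBLASLt pick an algorithm heuristically;
  // no workspace is provided in this sketch.
  cublasLtMatmul(handle, op, &alpha, A, a_desc, B, b_desc,
                 &beta, C, c_desc, C, c_desc,
                 nullptr, nullptr, 0, stream);

  cublasLtMatrixLayoutDestroy(c_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op);
  cublasLtDestroy(handle);
}
```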
Verified that the patch works. I was also able to run batch = 64 successfully; it turned out to be an issue with the device. I am attaching a patch with changes that helped me use the utility better:
If you find it useful, feel free to check it in. I am unable to push the branch with these changes.
Hi @nishshah-msft, I have created #503 to merge the size fix. For your patch: …
When I run the cublaslt_gemm microbenchmark with B=64, M=8192, K=8192, N=8192, it fails in cudaMalloc.
I debugged this and found that B * M * K, B * M * N, and B * K * N each come to 2^32 (64 × 8192 × 8192), which is more than a 32-bit int can hold, so the computed allocation size overflows. I managed to fix this with local changes, but once past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED.
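To make the overflow concrete, here is a hedged sketch, assuming the size computation originally used a 32-bit int; the variable names are illustrative, not the benchmark's actual code:

```cpp
// Illustrative sketch of the 32-bit size overflow, not the benchmark's code.
// With b=64 and m=k=n=8192, each element-count product is exactly 2^32.
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_bf16.h>

int main() {
  const size_t b = 64, m = 8192, k = 8192;

  // int count = b * m * k;  // truncates: 2^32 does not fit in 32 bits
  size_t count = b * m * k;                      // 4294967296 elements
  size_t bytes = count * sizeof(__nv_bfloat16);  // 8 GiB for the A matrix

  void* d_a = nullptr;
  cudaError_t err = cudaMalloc(&d_a, bytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc(%zu) failed: %s\n",
            bytes, cudaGetErrorString(err));
    return 1;
  }
  printf("allocated %zu bytes\n", bytes);
  cudaFree(d_a);
  return 0;
}
```

Note that the subsequent CUBLAS_STATUS_NOT_INITIALIZED is a separate problem; as the comments above note, it turned out to be a device issue rather than a bug in the benchmark.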
These are the steps to reproduce the error:
```
cd superbench/benchmarks/micro_benchmarks/cublaslt_gemm
cmake -S ./
make
./cublaslt_gemm -b 64 -m 8192 -k 8192 -n 8192 -i 1000 -t bf16
```
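For the cublasCreate failure specifically, a quick sanity check like the following can help distinguish a cuBLAS problem from a broken device or CUDA context (a hedged sketch, not part of the benchmark):

```cpp
// Hedged diagnostic sketch: CUBLAS_STATUS_NOT_INITIALIZED from handle
// creation usually points to a CUDA context/device problem rather than
// a library bug, which matches the device issue reported in this thread.
#include <cstdio>
#include <cublasLt.h>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaError_t cerr = cudaGetDeviceCount(&n);
  if (cerr != cudaSuccess || n == 0) {
    fprintf(stderr, "CUDA device problem: %s\n", cudaGetErrorString(cerr));
    return 1;
  }
  cublasLtHandle_t handle;
  cublasStatus_t st = cublasLtCreate(&handle);
  if (st != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "cublasLtCreate failed with status %d\n", (int)st);
    return 1;
  }
  cublasLtDestroy(handle);
  printf("cuBLASLt initialized OK on %d device(s)\n", n);
  return 0;
}
```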