Weird performance when using shared memory in GEMV #12
Comments
GEMV is different from GEMM, in that it is bandwidth-bound instead of compute-bound. This requires different kinds of GPU kernels to reach good performance. Your best bet is copying the GEMV kernels from LLaMA.cpp instead of the code from MFA.
I looked at your issue a bit more closely, and I don't think I properly addressed your questions. I might get around to answering them another time. Generally, here are some quick tips for optimization.

First, always perform multiple trials, with ~100 commands encoded into a single command encoder. Divide the total execution time by 100 to get the per-kernel execution time. Also, tie the performance metric to something physical: what percent of maximum bandwidth is the kernel reaching? Saying one kernel is faster than another provides no physical grounding; perhaps both only reach 10% of maximum performance.

Second, most optimizations people attempt will fail. You need a method to quickly evaluate whether a change consistently improves performance, or whether suspected speedups are simply random noise. Start with a well-known implementation and make changes line by line if possible.

Finally, rigorously test that your kernel is producing correct output. I made this mistake with a kernel for LLaMA.cpp, where I erroneously reported ~95% of bandwidth. Only later, when the reported speed exceeded 100% of bandwidth on somebody else's machine, did I realize there was an unidentified bug.
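A minimal sketch of the benchmarking arithmetic described above, in Python rather than the Metal host code itself. The matrix size, trial count, elapsed time, and 200 GB/s peak figure are all hypothetical placeholders; substitute your own measurements.

```python
# Hypothetical numbers for illustration only; substitute your own measurements.
TRIALS = 100  # commands encoded into one command encoder

def per_kernel_time(total_seconds, trials=TRIALS):
    """Amortize encoding/launch overhead by dividing total GPU time by trials."""
    return total_seconds / trials

def bandwidth_utilization(bytes_moved, kernel_seconds, peak_bytes_per_s):
    """Tie the metric to something physical: fraction of peak bandwidth."""
    return (bytes_moved / kernel_seconds) / peak_bytes_per_s

# Example: a GEMV over a 4096 x 4096 FP16 matrix on a 200 GB/s machine.
bytes_moved = 4096 * 4096 * 2    # the matrix dominates memory traffic in GEMV
t = per_kernel_time(0.025)       # assume 25 ms total for 100 encoded kernels
util = bandwidth_utilization(bytes_moved, t, 200e9)
print(f"{util:.1%} of peak bandwidth")  # prints "67.1% of peak bandwidth"
```

The point is the ratio at the end: a number like 67% of peak is meaningful on its own, whereas a raw millisecond figure is not.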
Thank you for your reply. I found that the M2 Ultra has 800 GB/s of bandwidth across its GPU cores, which I take to mean the rate of data transfer from global memory to registers. So when I use shared memory to optimize global memory access, there is only a small gain, because the bandwidth gap between global and shared memory is small; the gain is not large enough to cover the barrier-sync latency of using shared memory. That may be why the sram_GEMV kernel is no faster, or even slower. I will try to test on another machine such as the M2 Pro (which has 19 GPU cores with 200 GB/s of bandwidth). The bandwidth figures are from: https://github.com/mikeroyal/Apple-Silicon-Guide
The ratio of local to global memory bandwidth is the same across all Apple GPU architectures. In fact, all major performance parameters are the same, except absolute performance. This makes analysis easier: execution speed scales directly with GPU core count.
So what you mean is that the M2 Ultra's 800 GB/s and the M2 Pro's 200 GB/s are related only to their GPU core counts, i.e. how many bytes can be loaded in one wave by all GPU cores combined? And bytes/cycle is the same across all Apple GPU architectures?
Yes. Bytes/core-cycle and similar metrics are normalized by:

- the number of GPU cores
- the clock speed

When you account for those two factors, a lot of characteristics match across GPUs in the same family. In fact, I found almost identical quantities across different hardware vendors, and even numbers that are the same in a CPU core. For example, 64 bytes/core-cycle from L1 and 32 bytes/core-cycle from L2. Both a CPU core and a GPU core have an I/O bus with the same number of bits. The difference is that GPU cores have many more transistors for math/computation.
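As a quick sketch of this normalization: the 200 GB/s and 19-core figures for the M2 Pro come from this thread, but the ~1.4 GHz GPU clock below is an assumption for illustration, not a verified spec.

```python
# Normalize absolute bandwidth into bytes per core per clock cycle.
def bytes_per_core_cycle(bandwidth_bytes_per_s, num_cores, clock_hz):
    return bandwidth_bytes_per_s / (num_cores * clock_hz)

# M2 Pro: 200 GB/s over 19 cores (from this thread); ~1.4 GHz clock assumed.
m2_pro = bytes_per_core_cycle(200e9, 19, 1.4e9)
print(f"~{m2_pro:.1f} bytes/core-cycle")
```

The same calculation on a different chip in the family should land on roughly the same per-core-cycle number, which is the sense in which only core count and clock separate the parts.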
Oh, I see... thank you. I have one more question. According to your code, it may be possible to load a vector and broadcast it into a simdgroup_matrix. As you can see, the bottleneck of GEMV is bandwidth. Have you ever tried loading the vector into a simdgroup_matrix? Can simdgroup_load achieve higher bandwidth or not? (I think not, since there seems to be no special cache for simdgroup_matrix load/store.)
SIMD-group matrix multiply is for GEMM, which is compute-bound. GEMM and GEMV have very different performance characteristics. For GEMV (matrix times vector), the bottleneck is reading the matrix from memory. The matrix is only read a single time, so you want to maximize the bandwidth of reading it. The vector is read several times; usually you can arrange the GPU threads so the L1 subsystem coalesces memory reads to the vector. For GEMM, both the LHS and RHS are read multiple times. To minimize the number of read operations, you have to store the matrices in a local piece of SRAM. The data transfers between GPU threads go in directions that aren't as straightforward as in GEMV. This is where hardware acceleration helps: SIMD matrix multiply. There are also instructions to minimize the latency when transferring matrix data from RAM to SRAM: SIMD async copy. These instructions only provide speedup for matrix-matrix multiplication (GEMM).
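The compute-bound vs. bandwidth-bound distinction above can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (FLOPs per byte of memory traffic). This is a standard roofline-style estimate, not code from the repository; FP32 elements and ideal reuse for GEMM are assumed.

```python
# Arithmetic intensity: FLOPs per byte of unavoidable memory traffic.
def gemv_intensity(m, k, elem_bytes=4):
    flops = 2 * m * k                            # one multiply-add per matrix element
    bytes_moved = (m * k + k + m) * elem_bytes   # matrix + input vector + output vector
    return flops / bytes_moved

def gemm_intensity(m, n, k, elem_bytes=4):
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * elem_bytes  # assumes ideal reuse in SRAM
    return flops / bytes_moved

print(gemv_intensity(4096, 4096))        # ~0.5 FLOP/byte: limited by memory bandwidth
print(gemm_intensity(4096, 4096, 4096))  # ~683 FLOPs/byte: limited by the ALUs
```

GEMV's intensity is fixed near 0.5 FLOP/byte no matter how clever the kernel is, which is why SIMD matrix hardware and async copies pay off for GEMM but not GEMV.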
I tried to optimize GEMV using shared memory to speed up I/O. Theoretically speaking, GEMV with SRAM should have better bandwidth, but here comes a weird performance result.
Device: M2 Ultra 128GB
Kernel cost measured as: GPUEndTime - GPUStartTime
warpPerBlock: 4, GridSize: {UP_ROUND(K, warpPerBlock), 1, 1}, GroupSize: {32 * warpPerBlock, 1, 1}
Question:
Thank you for your help!