Weird performance when using shared memory in GEMV #12

Closed
FdyCN opened this issue Dec 11, 2023 · 8 comments

@FdyCN

FdyCN commented Dec 11, 2023

I am trying to optimize GEMV using shared memory to speed up I/O. Theoretically speaking, a GEMV that stages data in SRAM should get better bandwidth. BUT here comes a weird performance result.

Device: M2 Ultra, 128 GB
Kernel time measured as: GPUEndTime - GPUStartTime

  1. First, I built an Xcode Metal project with both the original GEMV (your code) and the SRAM GEMV (my code), and found the SRAM GEMV to be about 30% faster than the original:
// transA = false, transB = true.
// This speedup is what I hoped for.
gemv [1,2048] @ [4096,2048] **0.098 ms (original) --> 0.068 ms (sram)**
gemv [1,2048] @ [11001,2048] **0.271 ms (original) --> 0.195 ms (sram)**
  2. Second, I added my SRAM GEMV kernel to your project (because I want to combine them into one metallib) and called both from my other C++/Objective-C project. Then the strange thing happens:
// The original kernel is much faster than in the Xcode project test, and now even faster than the SRAM GEMV.
gemv [1,2048] @ [4096,2048] **0.040 ms (original) vs. 0.047 ms (sram)**
gemv [1,2048] @ [11001,2048] **0.175 ms (original) vs. 0.173 ms (sram)**
  3. My kernel code is below. Launch configuration: warpPerBlock = 4, gridSize = {UP_ROUND(K, warpPerBlock), 1, 1}, groupSize = {32 * warpPerBlock, 1, 1}.

```metal
// ONLY support M = 1, tranA = false, transB = true now.
template <typename T, int Align>
void _gemv_sram_impl(device T *A [[buffer(0)]],
                device T *B [[buffer(1)]],
                device T *C [[buffer(2)]],
                device void *D [[buffer(3), function_constant(use_activation)]],
                
                threadgroup T *threadgroup_block [[threadgroup(0)]],
                constant ulong4 *matrix_offsets [[buffer(10), function_constant(batched)]],
                constant uint *activation_type [[buffer(13), function_constant(fused_activation)]],
                uint3 gid [[threadgroup_position_in_grid]],
                ushort warp_num [[dispatch_simdgroups_per_threadgroup]],
                ushort sidx [[simdgroup_index_in_threadgroup]],
                ushort lane_id [[thread_index_in_simdgroup]])
{
    if (gid.x * warp_num + sidx >= N || gid.y >= M) return;
    if (batched) {
        // TODO: Re-compute every inner loop iteration for FP64 accumulate.
        ulong3 offsets = matrix_offsets[gid.z].xyz;
        A = (device T*)((device uchar*)A + offsets[0]);
        B = (device T*)((device uchar*)B + offsets[1]);
        C = (device T*)((device uchar*)C + offsets[2]);
    }
    
    B += gid.x * warp_num * K;
    
    C += gid.y * N  + gid.x * warp_num + sidx;
    T acc_sum = 0;
    device vec<T, Align> * Aalign = (device vec<T, Align> *)A;
    device vec<T, Align> * Balign = (device vec<T, Align> *)B;
    // move data into smem
    threadgroup vec<T, Align> * smem = (threadgroup vec<T, Align> *)threadgroup_block;
    for (uint k = sidx * 32 + lane_id; k < K / Align; k += 32 * warp_num) {
        smem[k] = Aalign[k];
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    
    for (uint k = lane_id; k < K / Align; k += 32) {
        device vec<T, Align> * BalignSIMD = Balign + K / Align * sidx;
        for (uint i = 0; i < Align; ++i) {
            acc_sum += smem[k][i] * BalignSIMD[k][i];
        }
    }
    T all_sum = simd_sum(acc_sum);
    if (lane_id == 0) {
        device T* BWarp = B + sidx * K;
        for (uint k = Align * (K / Align); k < K; ++k) {
            all_sum += A[k] * BWarp[k];
        }
        if (use_bias) {
            // not supported now...
        }
        if (fused_activation) {
            // not supported now...
        }
        *C = all_sum;
    }
}
```

Question:

  1. Why is there a difference between the Xcode testbed and the metallib call? The SRAM GEMV performs basically the same in both, but the original GEMV is much faster when called through the metallib.
  2. Is there any compiler optimization I missed for the SRAM GEMV?
  3. Given my code and the situation described above, could you give me some advice on the potential cause of the performance gap?

Thank you for your help!

@philipturner
Owner

GEMV is different from GEMM, in that it is bandwidth-bound instead of compute-bound. This requires a different type of GPU kernel to reach good performance. Your best bet is copying the GEMV kernels from LLaMA.cpp instead of the code from MFA.

@philipturner
Owner

I looked at your issue a bit more closely, and I don't think I properly addressed your questions. I might get around to answering your questions another time.

Generally, here are some quick tips for optimization. Always perform multiple trials, with ~100 commands encoded into a single command encoder. Divide the total execution time by 100 to get the per-kernel execution time. Also, tie the performance metric to something physical. What percent of max bandwidth is it reaching? Just saying one kernel is faster than the other provides no physical grounding. Perhaps both only reach 10% of maximum performance.
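
As a minimal sketch of that bookkeeping (the helper and its parameters are hypothetical, and the command encoding itself is omitted), assuming `GPUEndTime - GPUStartTime` was captured for ~100 identical dispatches encoded back-to-back into one command buffer:

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical helper: convert a raw GPU timestamp delta into per-kernel time
// and percent of peak bandwidth. For GEMV (M = 1), the matrix dominates traffic,
// so bytes moved ~= (N*K + K + N) * sizeof(element).
void report_gemv_bandwidth(double gpuTimeSeconds, int trials,
                           uint64_t N, uint64_t K, uint64_t elementSize,
                           double peakBandwidth /* e.g. 800e9 for M2 Ultra */) {
  double perKernel = gpuTimeSeconds / trials;   // encode ~100 dispatches, divide by the trial count
  double bytes     = double((N * K + K + N) * elementSize);
  double achieved  = bytes / perKernel;         // bytes per second
  std::printf("%.3f ms/kernel, %.1f GB/s, %.1f%% of peak\n",
              perKernel * 1e3, achieved / 1e9, 100.0 * achieved / peakBandwidth);
}

// Example: gemv [1,2048] @ [11001,2048] in FP16, 100 trials taking 17.5 ms of GPU time total:
// report_gemv_bandwidth(0.0175, 100, 11001, 2048, 2, 800e9);
```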

Second, most optimizations people attempt will fail. You need a method to quickly evaluate whether a change consistently improves performance, or whether suspected speedups are simply random noise. Start with a well-known implementation and make changes line-by-line if possible. Also, rigorously test that your kernel is producing correct output. I made a mistake with a kernel for LLaMA.cpp, where I erroneously reported ~95% bandwidth. Only later did I realize there was an unidentified bug, when the reported speed exceeded 100% of bandwidth on somebody else's machine.

@FdyCN
Author

FdyCN commented Dec 12, 2023

Thank you for your reply. I found that the M2 Ultra has 800 GB/s of GPU memory bandwidth, which I assume means transfers from global memory to registers. So when I use shared memory to optimize global memory access, the gain is small, because the bandwidth gap between global and shared memory is small; the gain is not large enough to cover the barrier-sync latency introduced by shared memory. That would explain why the SRAM GEMV is not faster, and is sometimes even slower. I guess that might be the reason.

I will try to test on another machine such as the M2 Pro (which has a 19-core GPU with 200 GB/s bandwidth), and I will post the results later.

The bandwidth figures are from: https://github.com/mikeroyal/Apple-Silicon-Guide

@philipturner
Owner

The ratio of local to global memory bandwidth is the same across all Apple GPU architectures. In fact, all of the major performance parameters are the same, except absolute performance. That makes analysis easier: execution speed scales directly with GPU core count.

@FdyCN
Author

FdyCN commented Dec 13, 2023

So what you mean is that the M2 Ultra's 800 GB/s and the M2 Pro's 200 GB/s are determined only by their GPU core counts, i.e. how many bytes all the GPU cores can load in one wave? And the bytes/cycle is the same across all Apple GPU architectures?

@philipturner
Owner

Yes. Bytes/core-cycle and similar metrics are normalized by:

  • (1) number of GPU cores
  • (2) number of GHz

When you account for those two factors, a lot of characteristics match across GPUs in the same family. In fact, I found almost identical quantities across different hardware vendors. Some of the numbers are even the same as in a CPU core, for example 64 bytes/core-cycle from L1 and 32 bytes/core-cycle from L2. Both a CPU core and a GPU core have an I/O bus with the same number of bits; the difference is that GPU cores have many more transistors for math/computation.
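
As a back-of-the-envelope illustration of that normalization, using the M2 Pro figures quoted later in this thread (the GPU clock value is my assumption, not a number from this discussion):

```cpp
#include <cstdio>

int main() {
  // Normalize global-memory bandwidth to bytes per GPU-core-cycle.
  const double bandwidth = 200e9;  // bytes/s (M2 Pro, per the figures in this thread)
  const double coreCount = 19;     // GPU cores (M2 Pro)
  const double clockHz   = 1.4e9;  // approximate M2-family GPU clock -- an assumption
  std::printf("%.1f bytes/core-cycle\n", bandwidth / (coreCount * clockHz));  // ~7.5
  return 0;
}
```

Scaling by core count, the top-end 76-core M2 Ultra at 800 GB/s comes out to roughly the same per-core figure, which is the normalization point being made above.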

@FdyCN
Author

FdyCN commented Dec 14, 2023

Oh, I see. Thank you. I have one more question. Looking at your code, it seems possible to load the vector and broadcast it into a simdgroup_matrix. As you said, the bottleneck of GEMV is bandwidth. Have you ever tried loading the vector into a simdgroup_matrix? Can simdgroup_load reach higher bandwidth or not (I think not, since there seems to be no special cache for simdgroup_matrix load/store)?
On the other hand, using simdgroup_matrix would waste a lot of computation: [1,8] @ [8,8] becomes [8,8] @ [8,8], 7x more computation than the original GEMV.

@philipturner
Owner

SIMD-group matrix multiply is for GEMM, which is compute-bound. GEMM and GEMV have very different performance characteristics. For GEMV (MAT x VEC), the bottleneck is reading the matrix from memory. The matrix is only read a single time, so you want to maximize the bandwidth of reading the matrix. The vector is read several times. Usually you can arrange the GPU threads so the L1 subsystem coalesces memory reads to the vector.
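
To make that concrete, here is a minimal illustrative sketch (not the MFA or LLaMA.cpp kernel) of a bandwidth-oriented GEMV in Metal Shading Language, where each SIMD-group strides across one row of the matrix and the vector is simply re-read through the cache rather than staged in threadgroup memory; the names and launch configuration are mine:

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative sketch only: C[1,N] = A[1,K] * B[N,K]^T (transB = true), half precision.
// Assumes K is a multiple of 4. One SIMD-group (32 threads) computes one element of C.
// Example launch: threadgroups = {ceil(N / simdgroupsPerThreadgroup), 1, 1},
//                 threadsPerThreadgroup = {32 * simdgroupsPerThreadgroup, 1, 1}.
kernel void gemv_rowwise(device const half *A [[buffer(0)]],
                         device const half *B [[buffer(1)]],
                         device half *C       [[buffer(2)]],
                         constant uint &N     [[buffer(3)]],
                         constant uint &K     [[buffer(4)]],
                         uint tg_id    [[threadgroup_position_in_grid]],
                         ushort simds  [[simdgroups_per_threadgroup]],
                         ushort sidx   [[simdgroup_index_in_threadgroup]],
                         ushort lane   [[thread_index_in_simdgroup]])
{
    uint row = tg_id * simds + sidx;   // which row of B / element of C this SIMD-group owns
    if (row >= N) return;

    device const half4 *a = (device const half4 *)A;
    device const half4 *b = (device const half4 *)(B + row * K);

    // Adjacent lanes read adjacent half4 chunks of the matrix row, so the large
    // matrix reads coalesce; the small vector A is re-read but stays hot in cache.
    float acc = 0.0f;
    for (uint k = lane; k < K / 4; k += 32) {
        half4 av = a[k];
        half4 bv = b[k];
        acc += float(av.x) * float(bv.x) + float(av.y) * float(bv.y)
             + float(av.z) * float(bv.z) + float(av.w) * float(bv.w);
    }
    float total = simd_sum(acc);       // reduce the partial sums across the 32 lanes
    if (lane == 0) {
        C[row] = half(total);
    }
}
```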

For GEMM, both the LHS and RHS are read multiple times. To minimize the number of read operations, you have to store the matrices in a local piece of SRAM. The data moves between GPU threads in patterns that aren't as straightforward as GEMV. This is where hardware acceleration is helpful: SIMD-group matrix multiply. There are also instructions to minimize the latency of transferring a matrix from RAM to SRAM: SIMD-group async copy. These instructions only provide a speedup for matrix-matrix multiplication (GEMM).
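
For contrast, a hedged sketch of the GEMM-side path those instructions enable (the names are illustrative; the tiling, bounds handling, and async copies are omitted):

```metal
#include <metal_stdlib>
using namespace metal;

// Fragment only: accumulate one 8x8 tile of C from 8x8 tiles of A and B using the
// SIMD-group matrix hardware. `a_tile` and `b_tile` are assumed to already point at
// the current tiles, with `lda` / `ldb` elements per row; a real GEMM kernel wraps
// this in a loop over K and typically stages the tiles in threadgroup memory.
void accumulate_tile(const device float *a_tile, uint lda,
                     const device float *b_tile, uint ldb,
                     thread simdgroup_float8x8 &c_acc)
{
    simdgroup_float8x8 a;
    simdgroup_float8x8 b;
    simdgroup_load(a, a_tile, lda);                     // cooperative 8x8 load by the SIMD-group
    simdgroup_load(b, b_tile, ldb);
    simdgroup_multiply_accumulate(c_acc, a, b, c_acc);  // c_acc += a * b
}
```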
