Weird performance when using shared memory in GEMV #12

Closed
FdyCN opened this issue Dec 11, 2023 · 8 comments

@FdyCN

FdyCN commented Dec 11, 2023

I am trying to optimize GEMV using shared memory to speed up I/O. Theoretically speaking, a GEMV that stages data in SRAM should get better bandwidth. BUT here comes a weird performance result.

Device: M2 Ultra, 128 GB
Kernel time measured as: GPUEndTime - GPUStartTime

  1. First, I built an Xcode Metal project with both the original GEMV (your code) and the SRAM GEMV (my code), and found the SRAM GEMV to be about 30% faster than the original:
// transA = false, transB = true.
// This speedup is what I hoped for.
gemv [1,2048] @ [4096,2048] **0.098 ms (original) --> 0.068 ms (sram)**
gemv [1,2048] @ [11001,2048] **0.271 ms (original) --> 0.195 ms (sram)**
  2. Second, I added my SRAM GEMV kernel to your project (because I want to combine them into one metallib) and called both from my other C++/Objective-C project. Then the strange thing happens:
// The original kernel is much faster than in the Xcode project test, and now even faster than the SRAM GEMV.
gemv [1,2048] @ [4096,2048] **0.040 ms (original) vs. 0.047 ms (sram)**
gemv [1,2048] @ [11001,2048] **0.175 ms (original) vs. 0.173 ms (sram)**
  3. My kernel code is below. Launch configuration: warpPerBlock = 4, gridSize = {UP_ROUND(K, warpPerBlock), 1, 1}, groupSize = {32 * warpPerBlock, 1, 1}.

```metal
// ONLY support M = 1, tranA = false, transB = true now.
template <typename T, int Align>
void _gemv_sram_impl(device T *A [[buffer(0)]],
                device T *B [[buffer(1)]],
                device T *C [[buffer(2)]],
                device void *D [[buffer(3), function_constant(use_activation)]],
                
                threadgroup T *threadgroup_block [[threadgroup(0)]],
                constant ulong4 *matrix_offsets [[buffer(10), function_constant(batched)]],
                constant uint *activation_type [[buffer(13), function_constant(fused_activation)]],
                uint3 gid [[threadgroup_position_in_grid]],
                ushort warp_num [[dispatch_simdgroups_per_threadgroup]],
                ushort sidx [[simdgroup_index_in_threadgroup]],
                ushort lane_id [[thread_index_in_simdgroup]])
{
    if (gid.x * warp_num + sidx >= N || gid.y >= M) return;
    if (batched) {
        // TODO: Re-compute every inner loop iteration for FP64 accumulate.
        ulong3 offsets = matrix_offsets[gid.z].xyz;
        A = (device T*)((device uchar*)A + offsets[0]);
        B = (device T*)((device uchar*)B + offsets[1]);
        C = (device T*)((device uchar*)C + offsets[2]);
    }
    
    B += gid.x * warp_num * K;
    
    C += gid.y * N  + gid.x * warp_num + sidx;
    T acc_sum = 0;
    device vec<T, Align> * Aalign = (device vec<T, Align> *)A;
    device vec<T, Align> * Balign = (device vec<T, Align> *)B;
    // move data into smem
    threadgroup vec<T, Align> * smem = (threadgroup vec<T, Align> *)threadgroup_block;
    for (uint k = sidx * 32 + lane_id; k < K / Align; k += 32 * warp_num) {
        smem[k] = Aalign[k];
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    
    for (uint k = lane_id; k < K / Align; k += 32) {
        device vec<T, Align> * BalignSIMD = Balign + K / Align * sidx;
        for (uint i = 0; i < Align; ++i) {
            acc_sum += smem[k][i] * BalignSIMD[k][i];
        }
    }
    T all_sum = simd_sum(acc_sum);
    if (lane_id == 0) {
        device T* BWarp = B + sidx * K;
        for (uint k = Align * (K / Align); k < K; ++k) {
            all_sum += A[k] * BWarp[k];
        }
        if (use_bias) {
            // not supported now...
        }
        if (fused_activation) {
            // not supported now...
        }
        *C = all_sum;
    }
}
```

Question:

  1. Why is there a difference between the Xcode testbed and the metallib call? The SRAM GEMV performs basically the same in both, but the original GEMV is much faster when called through the metallib.
  2. Is there any compiler optimization I missed for the SRAM GEMV?
  3. Given my code and the situation described above, could you give me some advice on the potential cause of the performance gap?

Thank you for your help!

@philipturner
Owner

GEMV is different from GEMM, in that it is bandwidth-bound instead of compute-bound. This requires a different type of GPU kernel to reach good performance. Your best bet is copying the GEMV kernels from LLaMA.cpp instead of the code from MFA.

@philipturner
Owner

I looked at your issue a bit more closely, and I don't think I properly addressed your questions. I might get around to answering your questions another time.

Generally, here are some quick tips for optimization. Always perform multiple trials, with ~100 commands encoded into a single command encoder. Divide the total execution time by 100 to get the per-kernel execution time. Also, tie the performance metric to something physical. What percent of max bandwidth is it reaching? Just saying one kernel is faster than the other provides no physical grounding. Perhaps both only reach 10% of maximum performance.
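
As a minimal sketch of that bookkeeping (the helper and its parameters are hypothetical, and the command encoding itself is omitted), assuming `GPUEndTime - GPUStartTime` was captured for ~100 identical dispatches encoded back-to-back into one command buffer:

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical helper: convert a raw GPU timestamp delta into per-kernel time
// and percent of peak bandwidth. For GEMV (M = 1), the matrix dominates traffic,
// so bytes moved ~= (N*K + K + N) * sizeof(element).
void report_gemv_bandwidth(double gpuTimeSeconds, int trials,
                           uint64_t N, uint64_t K, uint64_t elementSize,
                           double peakBandwidth /* e.g. 800e9 for M2 Ultra */) {
  double perKernel = gpuTimeSeconds / trials;   // encode ~100 dispatches, divide by the trial count
  double bytes     = double((N * K + K + N) * elementSize);
  double achieved  = bytes / perKernel;         // bytes per second
  std::printf("%.3f ms/kernel, %.1f GB/s, %.1f%% of peak\n",
              perKernel * 1e3, achieved / 1e9, 100.0 * achieved / peakBandwidth);
}

// Example: gemv [1,2048] @ [11001,2048] in FP16, 100 trials taking 17.5 ms of GPU time total:
// report_gemv_bandwidth(0.0175, 100, 11001, 2048, 2, 800e9);
```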

Second, most optimizations people attempt will fail. You need a method to quickly evaluate whether a change consistently improves performance, or whether suspected speedups are simply random noise. Start with a well-known implementation and make changes line-by-line if possible. Also, rigorously test that your kernel is producing correct output. I made a mistake with a kernel for LLaMA.cpp, where I erroneously reported ~95% bandwidth. Only later did I realize there was an unidentified bug, when the reported speed exceeded 100% of bandwidth on somebody else's machine.

@FdyCN
Author

FdyCN commented Dec 12, 2023

Thank you for your reply. I found that the M2 Ultra has 800 GB/s of GPU memory bandwidth, which I assume means transfers from global memory to registers. So when I use shared memory to optimize global memory access, the gain is small, because the bandwidth gap between global and shared memory is small; the gain is not large enough to cover the barrier-sync latency introduced by shared memory. That would explain why the SRAM GEMV is not faster, and is sometimes even slower. I guess that might be the reason.

I will try to test on another machine such as the M2 Pro (which has a 19-core GPU with 200 GB/s bandwidth), and I will post the results later.

The bandwidth figures are from: https://github.com/mikeroyal/Apple-Silicon-Guide

@philipturner
Owner

The ratio of local to global memory bandwidth is the same across all Apple GPU architectures. In fact, all of the major performance parameters are the same, except absolute performance. That makes analysis easier: execution speed scales directly with GPU core count.

@FdyCN
Author

FdyCN commented Dec 13, 2023

So what you mean is that the M2 Ultra's 800 GB/s and the M2 Pro's 200 GB/s are determined only by their GPU core counts, i.e. how many bytes all the GPU cores can load in one wave? And the bytes/cycle is the same across all Apple GPU architectures?

@philipturner
Owner

Yes. Bytes/core-cycle and similar metrics are normalized by:

  • (1) number of GPU cores
  • (2) number of GHz

When you account for those two factors, a lot of characteristics match across GPUs in the same family. In fact, I found almost identical quantities across different hardware vendors. Some of the numbers are even the same as in a CPU core, for example 64 bytes/core-cycle from L1 and 32 bytes/core-cycle from L2. Both a CPU core and a GPU core have an I/O bus with the same number of bits; the difference is that GPU cores have many more transistors for math/computation.
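
As a back-of-the-envelope illustration of that normalization, using the M2 Pro figures quoted later in this thread (the GPU clock value is my assumption, not a number from this discussion):

```cpp
#include <cstdio>

int main() {
  // Normalize global-memory bandwidth to bytes per GPU-core-cycle.
  const double bandwidth = 200e9;  // bytes/s (M2 Pro, per the figures in this thread)
  const double coreCount = 19;     // GPU cores (M2 Pro)
  const double clockHz   = 1.4e9;  // approximate M2-family GPU clock -- an assumption
  std::printf("%.1f bytes/core-cycle\n", bandwidth / (coreCount * clockHz));  // ~7.5
  return 0;
}
```

Scaling by core count, the top-end 76-core M2 Ultra at 800 GB/s comes out to roughly the same per-core figure, which is the normalization point being made above.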

@FdyCN
Author

FdyCN commented Dec 14, 2023

Oh, I see. Thank you. I have one more question. Looking at your code, it seems possible to load the vector and broadcast it into a simdgroup_matrix. As you said, the bottleneck of GEMV is bandwidth. Have you ever tried loading the vector into a simdgroup_matrix? Can simdgroup_load reach higher bandwidth or not (I think not, since there seems to be no special cache for simdgroup_matrix load/store)?
On the other hand, using simdgroup_matrix would waste a lot of computation: [1,8] @ [8,8] becomes [8,8] @ [8,8], 7x more computation than the original GEMV.

@philipturner
Owner

SIMD-group matrix multiply is for GEMM, which is compute-bound. GEMM and GEMV have very different performance characteristics. For GEMV (MAT x VEC), the bottleneck is reading the matrix from memory. The matrix is only read a single time, so you want to maximize the bandwidth of reading the matrix. The vector is read several times. Usually you can arrange the GPU threads so the L1 subsystem coalesces memory reads to the vector.
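
To make that concrete, here is a minimal illustrative sketch (not the MFA or LLaMA.cpp kernel) of a bandwidth-oriented GEMV in Metal Shading Language, where each SIMD-group strides across one row of the matrix and the vector is simply re-read through the cache rather than staged in threadgroup memory; the names and launch configuration are mine:

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative sketch only: C[1,N] = A[1,K] * B[N,K]^T (transB = true), half precision.
// Assumes K is a multiple of 4. One SIMD-group (32 threads) computes one element of C.
// Example launch: threadgroups = {ceil(N / simdgroupsPerThreadgroup), 1, 1},
//                 threadsPerThreadgroup = {32 * simdgroupsPerThreadgroup, 1, 1}.
kernel void gemv_rowwise(device const half *A [[buffer(0)]],
                         device const half *B [[buffer(1)]],
                         device half *C       [[buffer(2)]],
                         constant uint &N     [[buffer(3)]],
                         constant uint &K     [[buffer(4)]],
                         uint tg_id    [[threadgroup_position_in_grid]],
                         ushort simds  [[simdgroups_per_threadgroup]],
                         ushort sidx   [[simdgroup_index_in_threadgroup]],
                         ushort lane   [[thread_index_in_simdgroup]])
{
    uint row = tg_id * simds + sidx;   // which row of B / element of C this SIMD-group owns
    if (row >= N) return;

    device const half4 *a = (device const half4 *)A;
    device const half4 *b = (device const half4 *)(B + row * K);

    // Adjacent lanes read adjacent half4 chunks of the matrix row, so the large
    // matrix reads coalesce; the small vector A is re-read but stays hot in cache.
    float acc = 0.0f;
    for (uint k = lane; k < K / 4; k += 32) {
        half4 av = a[k];
        half4 bv = b[k];
        acc += float(av.x) * float(bv.x) + float(av.y) * float(bv.y)
             + float(av.z) * float(bv.z) + float(av.w) * float(bv.w);
    }
    float total = simd_sum(acc);       // reduce the partial sums across the 32 lanes
    if (lane == 0) {
        C[row] = half(total);
    }
}
```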

For GEMM, both the LHS and RHS are read multiple times. To minimize the number of read operations, you have to store the matrices in a local piece of SRAM. The data moves between GPU threads in patterns that aren't as straightforward as GEMV. This is where hardware acceleration is helpful: SIMD-group matrix multiply. There are also instructions to minimize the latency of transferring a matrix from RAM to SRAM: SIMD-group async copy. These instructions only provide a speedup for matrix-matrix multiplication (GEMM).
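
For contrast, a hedged sketch of the GEMM-side path those instructions enable (the names are illustrative; the tiling, bounds handling, and async copies are omitted):

```metal
#include <metal_stdlib>
using namespace metal;

// Fragment only: accumulate one 8x8 tile of C from 8x8 tiles of A and B using the
// SIMD-group matrix hardware. `a_tile` and `b_tile` are assumed to already point at
// the current tiles, with `lda` / `ldb` elements per row; a real GEMM kernel wraps
// this in a loop over K and typically stages the tiles in threadgroup memory.
void accumulate_tile(const device float *a_tile, uint lda,
                     const device float *b_tile, uint ldb,
                     thread simdgroup_float8x8 &c_acc)
{
    simdgroup_float8x8 a;
    simdgroup_float8x8 b;
    simdgroup_load(a, a_tile, lda);                     // cooperative 8x8 load by the SIMD-group
    simdgroup_load(b, b_tile, ldb);
    simdgroup_multiply_accumulate(c_acc, a, b, c_acc);  // c_acc += a * b
}
```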
