
GPPerf

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Our goal is to predict the runtime and energy consumption of SGEMM on an NVIDIA GPU for different matrix sizes, block sizes, and tile sizes. We implemented a naive tiled matrix multiplication kernel and used it to gather data for different tile sizes, and we used CUTLASS to gather data for more advanced configurations. We then trained a model to predict the performance and energy consumption of SGEMM for a given configuration.

Docs

Installation

Our analysis used the cuda5 node. For the Ada Lovelace GPU on cuda5, we have already built the cutlass_profiler binary, so you can use it directly to profile the kernels and skip the installation process. If you want to build the cutlass_profiler binary yourself, follow the instructions below: clone the CUTLASS repo, then build it with the following commands. (These work for the Ada Lovelace architecture; on other nodes, change the architecture flag.)

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass

export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

mkdir build && cd build

cmake .. -DCUTLASS_NVCC_ARCHS=89             # compiles for the NVIDIA Ada Lovelace GPU architecture

make cutlass_profiler -j12

The commands are adapted from the cutlass quick start guide.

Usage

Profiling using the cutlass profiler

export CUTLASS_PROFILER="YOUR_CUTLASS_DIRECTORY/build/tools/profiler/cutlass_profiler" # set YOUR_CUTLASS_DIRECTORY to your CUTLASS checkout, so the full path is e.g. /home/username/cutlass/build/tools/profiler/cutlass_profiler

bash prof.sh
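Under the hood, prof.sh sweeps problem sizes by repeatedly invoking cutlass_profiler. A single invocation looks roughly like the following (a sketch; the exact sweep and flags are defined in prof.sh, and the output path here is illustrative):

$CUTLASS_PROFILER --kernels=sgemm --m=512 --n=512 --k=512 --output=results.csv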

The results are saved to a results.csv file inside a timestamped output directory, e.g. cutlass_profiling_20241118_191220/results.csv. The results are then cleaned and reformatted as follows.

Timestamp,M,N,K,Kernel Name,Layout,Blocksize1,Blocksize2,Blocksize3,Stage,Combination Type,Alpha,Beta,Runtime,Power,SM Clock,Mem Clock,Temp,GPU Util,Mem Util,GPU Name,Version,Max Mem Clock,Max Graphics Clock,Power Limit,State,Total Memory,Free Memory,Used Memory,GPU Util1,Mem Util2,Kernel Name,Arithmetic Intensity,Uses Shared Memory,Computation Pattern,Energy,TFlops
1731372844,512,512,512,cutlass_simt_sgemm_128x128_8x2_nn_align1,nn,64,64,32,2,linear_combination,1,0,0.04352,70.17,2760,10251,47,100%,0%,NVIDIA GeForce RTX 4070,8.9,10501 MHz,3105 MHz,200.00 W,Default,12282 MiB,11827 MiB,176 MiB,100%,0%,cutlass_simt_sgemm_128x128_8x2_nn_align1,85.33,0,GEMM,87902.151,6168.094
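Some of the derived columns can be reproduced from the raw fields. As a sanity check, here is a sketch that recomputes the arithmetic intensity and throughput of the sample row, assuming float32 operands (4 bytes each) and the standard 2·M·N·K FLOP count for GEMM:

# Sketch: reproduce the derived columns for the sample row above
# (assumes float32 operands and 2*M*N*K FLOPs per GEMM).
M = N = K = 512
runtime_ms = 0.04352                           # "Runtime" column

flops = 2 * M * N * K                          # multiply-adds counted as 2 FLOPs
bytes_moved = 4 * (M * N + N * K + M * K)      # one pass over each matrix

arithmetic_intensity = flops / bytes_moved     # ~85.33, matches the table
gflops = flops / (runtime_ms * 1e-3) / 1e9     # ~6168, matches the TFlops column

print(arithmetic_intensity, gflops)

Note that 2·512³ FLOPs in 0.04352 ms is about 6.17 TFLOP/s, so the "TFlops" column value of 6168.094 appears to be recorded in GFLOP/s.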

Then run the model to predict the performance and energy of the kernel.

python model.py
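model.py trains the predictor on the cleaned CSV. The actual feature set and estimator are defined in that script; purely as an illustration, a minimal sketch of this kind of tabular regression (hypothetical feature and target choices, using scikit-learn's RandomForestRegressor rather than whatever model.py actually uses) might look like:

# Hypothetical sketch of the modeling step -- see model.py for the real
# feature set and estimator. Column names follow the cleaned CSV above.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("results.csv")
features = df[["M", "N", "K", "Blocksize1", "Blocksize2", "Blocksize3", "Stage"]]
targets = df[["Runtime", "Energy"]]

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2)
model = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)
print("R^2 on held-out configs:", model.score(X_test, y_test))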

Profiling using our tiled matrix multiplication kernel

nvcc matmul.cu -lcublas -o matmul.o
./matmul.o 512 512 512 8 #M N K TILE_SIZE

Sample output:

Running tiled matrix multiplication with M=512, N=512, K=512, TILE_SIZE=8
Tiled MM result sample: 1024 1024 1024 1024 1024
cuBLAS result sample: 1024 1024 1024 1024 1024
Execution time: 5.63299 ms

The "Tiled MM result sample" are results from our kernel, and the "cuBLAS result sample" are results from cuBLAS, which serves as a verification of the correctness of our kernel. The "Execution time" is the runtime of our kernel. We used the cudaEventRecord API to measure the runtime.

You can use the script matmul_runtime_prof.sh to gather the runtime data, and matmul_power_prof.sh to gather the power data. The runtime data is stored in "execution_times.csv" with the following format:

M,N,K,Tile Size,Execution Time
512,512,512,8,6.56141
...

The power data is stored in "power_usage_results.csv" with the following format:

M,N,K,Tile Size,Average Power Usage (W)
512,512,512,1,32.038
...
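Because power and runtime are gathered by separate scripts, a per-configuration energy estimate can be formed by joining the two CSVs and applying E = P × t. A sketch (assuming each (M, N, K, Tile Size) combination appears once per file):

# Sketch: estimate per-configuration energy by joining the two CSVs.
# Assumes one row per (M, N, K, Tile Size) in each file.
import pandas as pd

runtimes = pd.read_csv("execution_times.csv")
power = pd.read_csv("power_usage_results.csv")

df = runtimes.merge(power, on=["M", "N", "K", "Tile Size"])
# E = P * t: watts * milliseconds -> millijoules
df["Energy (mJ)"] = df["Average Power Usage (W)"] * df["Execution Time"]
print(df.head())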
