leet_gpu_solution

leet_gpu_solution is a growing collection of optimized CUDA implementations for the challenges at https://leetgpu.com/challenges.
Each problem is worked step by step from a basic baseline to an optimized kernel, demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration along the way. This makes the project a practical learning resource for anyone studying GPU programming, CUDA performance, or Nsight profiling.


Supported Solutions

| Module | Description | Optimization Techniques | Notable Speedup vs. Baseline | Status |
| --- | --- | --- | --- | --- |
| convolution_1d | Convolution of a 1D vector with a 1D kernel | Shared memory | 1.40× on input_size = 1,500,000, kernel = 2047 | ✅ Implemented |
| convolution_2d | Convolution of a 2D matrix with a 2D kernel | Shared memory, tiling, im2col | 52.0× on 3072×3072 image (31×31 kernel) | ✅ Implemented |
| matrix_multiplication | 2D matrix × 2D matrix multiplication | Shared-memory tiling, float4 vectorized I/O, WMMA/Tensor Cores, loop unrolling | 4.08× on M = N = 2048 | ✅ Implemented |
| matrix_transpose | Transpose of a 2D matrix | Shared-memory tiling, coalesced memory access, bank-conflict avoidance | 1.23× on M = N = 10,000 | ✅ Implemented |
| quantized_matrix_multiplication | INT8 quantized matrix × matrix multiplication | Shared-memory tiling, INT4 packing, QuantizeMultiplier requantization, Tensor Core acceleration | 5.62× on M = N = 2048 | ✅ Implemented |
| softmax | Exponential normalization across vectors or matrix rows | Warp-level reduction, online softmax, shared memory | 2.06× on N = 1,048,576 | ✅ Implemented |
| sparse_matrix_vector_multiplication | Sparse matrix × dense vector multiplication | CSR / ELL / Block-ELL / Merge-Path formats, merge-path SpMV | 1.2× on total (setup + runtime) latency; 316× on SpMV kernel latency alone, at M = N = 4096 (60% sparsity) | ✅ Implemented |
| vector_addition | Sum of two vectors | float4 access | 1.03× on 100,000,000 elements | ✅ Implemented |
| vector_reduction | Sum reduction over large arrays | Shared memory, warp shuffle, sequential access pattern, loop unrolling | 4.29× on N = 1,073,741,824 | ✅ Implemented |

NOTE: A more detailed README is provided in each problem's folder.
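
For a flavor of the techniques listed above, here is a minimal warp-shuffle reduction sketch. It is an illustrative example of the technique only, not the repository's actual vector_reduction kernel:

```cuda
// Illustrative warp-shuffle sum reduction (assumes blockDim.x is a multiple of 32).
__inline__ __device__ float warpReduceSum(float val) {
    // Each step halves the number of contributing lanes; __shfl_down_sync
    // reads a value from a higher lane within the same warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceSum(const float* in, float* out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: sequential, coalesced access pattern.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    sum = warpReduceSum(sum);
    // Lane 0 of each warp adds its partial sum; *out must be zeroed beforehand.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);
}
```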

Benchmarking Note

All latency values in this repository are collected using Nsight Compute (gpu__time_duration.sum), which reports pure GPU kernel execution time.

These measurements exclude host-side setup, CUDA context initialization, and driver launch overhead.
Therefore, the reported numbers represent kernel latency (device-only performance), not full end-to-end or cold-start inference latency.
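
For reference, this metric can be collected from the command line with Nsight Compute, e.g. `ncu --metrics gpu__time_duration.sum ./your_binary` (the binary name here is a placeholder).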

Upcoming Work

CUTLASS / cuBLAS Comparison

  • Benchmark matrix_multiplication and quantized_matrix_multiplication against NVIDIA cuBLAS and CUTLASS kernels.
  • Evaluate both FP32 and INT8 performance on identical input sizes.
  • Measure kernel-only time and end-to-end latency.
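
As a sketch of how the FP32 side of this comparison might look, the snippet below runs a single cuBLAS SGEMM at M = N = K = 2048, matching the table above. It is a hypothetical harness, not the repository's benchmark code:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 2048;  // M = N = K = 2048, as in the table above
    float *A, *B, *C;
    // Contents left uninitialized; fine for a pure timing sketch.
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major SGEMM: C = alpha * A * B + beta * C.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);

    // Kernel-only time would come from Nsight Compute as described above;
    // cudaEvent timing around the call gives a rough end-to-end number.
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```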

cuSPARSE Comparison

  • Benchmark all SpMV variants (CSR, Vector CSR, ELLPACK, Block-ELL, Merge-Path) against cuSPARSE reference implementations.
  • Separate setup cost (memory allocation, temp buffer query) from runtime cost (actual SpMV kernel).
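
A minimal sketch of how that setup/runtime split might be expressed with the cuSPARSE generic API is shown below; the function and its arguments are illustrative placeholders, not the repository's code:

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

// Hypothetical CSR SpMV wrapper: y = A * x, setup and runtime phases separated.
void spmv_csr(cusparseHandle_t handle,
              int rows, int cols, int nnz,
              int* dRowOffsets, int* dCols, float* dVals,
              float* dX, float* dY) {
    const float alpha = 1.0f, beta = 0.0f;

    // ---- Setup cost: descriptor creation, temp-buffer query, allocation ----
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateCsr(&matA, rows, cols, nnz, dRowOffsets, dCols, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, cols, dX, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, rows, dY, CUDA_R_32F);

    size_t bufSize = 0;
    void* dBuf = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    // ---- Runtime cost: the actual SpMV kernel, timed separately ----
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
}
```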

Architectural Scaling

  • Current architecture: NVIDIA Ada (SM 89).
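
For reference, code for this target is typically compiled with `nvcc -arch=sm_89`; the repository's exact build flags are not stated here, so this is an assumption.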
