leet_gpu_solution

leet_gpu_solution is a growing collection of optimized CUDA implementations for the challenges at https://leetgpu.com/challenges.
Each problem is worked step by step from a basic baseline to an optimized kernel, demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration along the way. This makes the project a practical learning resource for anyone studying GPU programming, CUDA performance, or Nsight profiling.


Supported Solutions

| Module | Description | Optimization Techniques | Notable Speedup vs. Baseline | Status |
| --- | --- | --- | --- | --- |
| convolution_1d | Convolution of a 1D vector with a 1D kernel | Shared memory | 1.40× on input_size = 1,500,000, kernel = 2047 | ✅ Implemented |
| convolution_2d | Convolution of a 2D matrix with a 2D kernel | Shared memory, tiling, im2col | 52.0× on 3072×3072 image (31×31 kernel) | ✅ Implemented |
| matrix_multiplication | 2D matrix × 2D matrix multiplication | Shared-memory tiling, float4 vectorized I/O, WMMA/Tensor Cores, loop unrolling | 4.08× on M = N = 2048 | ✅ Implemented |
| matrix_transpose | Transpose of a 2D matrix | Shared-memory tiling, coalesced memory access, bank-conflict avoidance | 1.23× on M = N = 10,000 | ✅ Implemented |
| quantized_matrix_multiplication | INT8 quantized matrix × matrix multiplication | Shared-memory tiling, INT4 packing, QuantizeMultiplier requantization, Tensor Core acceleration | 5.62× on M = N = 2048 | ✅ Implemented |
| softmax | Exponential normalization across vectors or matrix rows | Warp-level reduction, online softmax, shared memory | 2.06× on N = 1,048,576 | ✅ Implemented |
| sparse_matrix_vector_multiplication | Sparse matrix × dense vector multiplication | CSR / ELL / Block-ELL / Merge-Path formats, merge-path SpMV | 1.2× on total (setup + runtime) latency; 316× on SpMV kernel latency alone, at M = N = 4096 (60% sparsity) | ✅ Implemented |
| vector_addition | Sum of two vectors | float4 access | 1.03× on 100,000,000 elements | ✅ Implemented |
| vector_reduction | Sum reduction over large arrays | Shared memory, warp shuffle, sequential access pattern, loop unrolling | 4.29× on N = 1,073,741,824 | ✅ Implemented |

NOTE: A more detailed README is provided in each problem's folder.
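
For a flavor of the techniques listed above, here is a minimal warp-shuffle reduction sketch. It is an illustrative example of the technique only, not the repository's actual vector_reduction kernel:

```cuda
// Illustrative warp-shuffle sum reduction (assumes blockDim.x is a multiple of 32).
__inline__ __device__ float warpReduceSum(float val) {
    // Each step halves the number of contributing lanes; __shfl_down_sync
    // reads a value from a higher lane within the same warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceSum(const float* in, float* out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: sequential, coalesced access pattern.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    sum = warpReduceSum(sum);
    // Lane 0 of each warp adds its partial sum; *out must be zeroed beforehand.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);
}
```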

Benchmarking Note

All latency values in this repository are collected using Nsight Compute (gpu__time_duration.sum), which reports pure GPU kernel execution time.

These measurements exclude host-side setup, CUDA context initialization, and driver launch overhead.
Therefore, the reported numbers represent kernel latency (device-only performance), not full end-to-end or cold-start inference latency.
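
For reference, this metric can be collected from the command line with Nsight Compute, e.g. `ncu --metrics gpu__time_duration.sum ./your_binary` (the binary name here is a placeholder).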

Upcoming Work

CUTLASS / cuBLAS Comparison

  • Benchmark matrix_multiplication and quantized_matrix_multiplication against NVIDIA cuBLAS and CUTLASS kernels.
  • Evaluate both FP32 and INT8 performance on identical input sizes.
  • Measure kernel-only time and end-to-end latency.
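
As a sketch of how the FP32 side of this comparison might look, the snippet below runs a single cuBLAS SGEMM at M = N = K = 2048, matching the table above. It is a hypothetical harness, not the repository's benchmark code:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 2048;  // M = N = K = 2048, as in the table above
    float *A, *B, *C;
    // Contents left uninitialized; fine for a pure timing sketch.
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major SGEMM: C = alpha * A * B + beta * C.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);

    // Kernel-only time would come from Nsight Compute as described above;
    // cudaEvent timing around the call gives a rough end-to-end number.
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```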

cuSPARSE Comparison

  • Benchmark all SpMV variants (CSR, Vector CSR, ELLPACK, Block-ELL, Merge-Path) against cuSPARSE reference implementations.
  • Separate setup cost (memory allocation, temp buffer query) from runtime cost (actual SpMV kernel).
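
A minimal sketch of how that setup/runtime split might be expressed with the cuSPARSE generic API is shown below; the function and its arguments are illustrative placeholders, not the repository's code:

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

// Hypothetical CSR SpMV wrapper: y = A * x, setup and runtime phases separated.
void spmv_csr(cusparseHandle_t handle,
              int rows, int cols, int nnz,
              int* dRowOffsets, int* dCols, float* dVals,
              float* dX, float* dY) {
    const float alpha = 1.0f, beta = 0.0f;

    // ---- Setup cost: descriptor creation, temp-buffer query, allocation ----
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateCsr(&matA, rows, cols, nnz, dRowOffsets, dCols, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, cols, dX, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, rows, dY, CUDA_R_32F);

    size_t bufSize = 0;
    void* dBuf = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    // ---- Runtime cost: the actual SpMV kernel, timed separately ----
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
}
```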

Architectural Scaling

  • Current architecture: NVIDIA Ada (SM 89).
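
For reference, code for this target is typically compiled with `nvcc -arch=sm_89`; the repository's exact build flags are not stated here, so this is an assumption.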
