
Attention in CUDA C

A forward-pass implementation in CUDA C of a simplified attention layer (no layer norm, no mask) that beats PyTorch's performance in the forward pass.

The layer decomposes into three operations, which we implement as separate kernels (a minimal sketch of how they compose follows the list):

  • Matmul
  • Softmax
  • Transpose (can be fused into the matmul)
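
The full pass computes O = softmax(Q K^T / sqrt(d)) V. Below is a minimal naive sketch of how the three kernels compose, with the transpose fused into the first matmul; the kernel names, signatures, and launch configuration are illustrative assumptions, not the repo's actual code, whose kernels come in progressively optimized variants.

    // attention_sketch.cu -- naive reference composition of the three kernels
    #include <cuda_runtime.h>
    #include <math.h>

    // S = Q * K^T * scale. The transpose of K is fused into the matmul by
    // indexing K row-wise: Q is (N x d), K is (N x d), S is (N x N).
    __global__ void matmul_qkt(const float *Q, const float *K, float *S,
                               int N, int d, float scale) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < d; ++k)
                acc += Q[row * d + k] * K[col * d + k];
            S[row * N + col] = acc * scale;
        }
    }

    // Numerically stable row-wise softmax over an (N x N) matrix, one
    // thread per row: subtract the row max, exponentiate, normalize.
    __global__ void softmax_rows(float *S, int N) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= N) return;
        float mx = -INFINITY;
        for (int j = 0; j < N; ++j) mx = fmaxf(mx, S[row * N + j]);
        float sum = 0.0f;
        for (int j = 0; j < N; ++j) {
            float e = expf(S[row * N + j] - mx);
            S[row * N + j] = e;
            sum += e;
        }
        for (int j = 0; j < N; ++j) S[row * N + j] /= sum;
    }

    // O = S * V with S (N x N) and V (N x d).
    __global__ void matmul_sv(const float *S, const float *V, float *O,
                              int N, int d) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < d) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += S[row * N + k] * V[k * d + col];
            O[row * d + col] = acc;
        }
    }

    // Host-side composition: O = softmax(Q K^T / sqrt(d)) V.
    // Q, K, V, O are device pointers; S is (N x N) scratch space.
    void attention_forward(const float *Q, const float *K, const float *V,
                           float *S, float *O, int N, int d) {
        dim3 block(16, 16);
        dim3 gridS((N + 15) / 16, (N + 15) / 16);
        matmul_qkt<<<gridS, block>>>(Q, K, S, N, d, 1.0f / sqrtf((float)d));
        softmax_rows<<<(N + 255) / 256, 256>>>(S, N);
        dim3 gridO((d + 15) / 16, (N + 15) / 16);
        matmul_sv<<<gridO, block>>>(S, V, O, N, d);
    }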

Each kernel is implemented in several variants of increasing complexity and performance. For testing, we bind the kernels into PyTorch and call them from there; a sketch of such a binding follows.
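
As a hedged illustration of what such a binding can look like, here is a standalone C++/CUDA extension that exposes a naive row-wise softmax to Python via torch/extension.h. The function and module names are hypothetical, not the repo's.

    // softmax_binding.cu -- hypothetical PyTorch extension binding
    #include <torch/extension.h>
    #include <math.h>

    // Naive row-wise softmax, one thread per row (same scheme as above).
    __global__ void softmax_rows_kernel(float *S, int rows, int cols) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= rows) return;
        float mx = -INFINITY;
        for (int j = 0; j < cols; ++j) mx = fmaxf(mx, S[r * cols + j]);
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j) {
            float e = expf(S[r * cols + j] - mx);
            S[r * cols + j] = e;
            sum += e;
        }
        for (int j = 0; j < cols; ++j) S[r * cols + j] /= sum;
    }

    // Wrapper visible from Python: validates the tensor, launches the kernel.
    torch::Tensor softmax_fwd(torch::Tensor x) {
        TORCH_CHECK(x.is_cuda(), "expected a CUDA tensor");
        TORCH_CHECK(x.dtype() == torch::kFloat32, "expected float32");
        TORCH_CHECK(x.dim() == 2, "expected a 2-D tensor");
        auto out = x.contiguous().clone();  // kernel works in place on a copy
        int rows = out.size(0), cols = out.size(1);
        softmax_rows_kernel<<<(rows + 255) / 256, 256>>>(
            out.data_ptr<float>(), rows, cols);
        return out;
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        m.def("softmax_fwd", &softmax_fwd, "naive row-wise softmax (CUDA)");
    }

Such a file can be JIT-compiled from Python with torch.utils.cpp_extension.load(name="softmax_ext", sources=["softmax_binding.cu"]) and the result checked against torch.softmax(x, dim=-1).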

Benchmarking

  • To profile the individual kernels, run the bench.py file in the respective kernel's directory.
  • To benchmark the full attention pass, run the outermost bench.py file. (The underlying timing pattern is sketched below.)
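
The repo's bench.py scripts drive the kernels from Python; for readers timing kernels directly in CUDA C, the standard event-based pattern boils down to the following generic sketch (the dummy kernel is a placeholder, not taken from this repo).

    // bench_sketch.cu -- basic CUDA-event timing pattern
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void dummy_kernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;  // stand-in for a real kernel
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMalloc(&x, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dummy_kernel<<<(n + 255) / 256, 256>>>(x, n);  // warm-up launch
        cudaEventRecord(start);
        dummy_kernel<<<(n + 255) / 256, 256>>>(x, n);  // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                    // wait for completion

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(x);
        return 0;
    }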

About

We build a Transformer from the ground up!
