Parallel Prefix Sum (Scan) with CUDA

PyTorch Usage Note

Installation

python setup.py install

Usage

import torch
from prefix_sum import prefix_sum_cpu, prefix_sum_cuda

# CUDA version: input is a torch.cuda.IntTensor and num_elements is an integer.
# Allocate the output tensor on the GPU before the call, e.g.
# output = torch.zeros((num_elements,), dtype=torch.int, device=torch.device('cuda'))
prefix_sum_cuda(input, num_elements, output)

# CPU version: same call, except that input and output are both torch.IntTensor (CPU tensors).
prefix_sum_cpu(input, num_elements, output)
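
For checking results, a plain-Python reference can be handy. The sketch below assumes the functions above compute an exclusive scan (as the original README further down describes), so that output[i] is the sum of all input elements before index i. The helper name `exclusive_scan_reference` is hypothetical and not part of this package.

```python
from itertools import accumulate


def exclusive_scan_reference(values):
    """Return the exclusive prefix sum: out[i] == sum(values[:i])."""
    # accumulate(..., initial=0) yields len(values) + 1 running totals;
    # dropping the last one leaves the exclusive scan.
    return list(accumulate(values, initial=0))[:-1]


print(exclusive_scan_reference([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Comparing `output.tolist()` against this reference on a small input is one way to confirm the extension was built correctly.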

Original README

My implementation of parallel exclusive scan in CUDA, following this NVIDIA paper.

Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.
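
To make the idea concrete, here is a sequential Python simulation of the work-efficient (Blelloch-style) exclusive scan that the NVIDIA write-up builds toward. It is only a sketch: it assumes a power-of-two input length, runs the per-level loops serially where a GPU would run them in parallel, and is not taken from the kernels in this repository.

```python
def blelloch_exclusive_scan(values):
    """Work-efficient exclusive scan, simulated sequentially."""
    n = len(values)
    # Assumption of this sketch: the length is a power of two.
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    data = list(values)

    # Up-sweep (reduce): build partial sums in place, doubling the stride.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data


print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Each iteration of an inner `for` loop touches disjoint elements, which is why a GPU can assign one thread per iteration and synchronize only between levels.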

This implementation can handle very large vectors of arbitrary length because the scan function is defined recursively.
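
The recursive structure can be sketched in Python as follows. This is an illustration of the general multi-block technique, not the repository's code: each block is scanned independently, the per-block totals are scanned by a recursive call, and the resulting offsets are added back. The names `recursive_exclusive_scan`, `scan_block`, and the tiny `BLOCK_SIZE` are hypothetical choices made for readability.

```python
from itertools import accumulate

BLOCK_SIZE = 4  # hypothetical; kept tiny so the recursion is easy to follow


def scan_block(block):
    """Exclusive scan of one block (stands in for a single-block kernel)."""
    return list(accumulate(block, initial=0))[:-1]


def recursive_exclusive_scan(values, block_size=BLOCK_SIZE):
    assert block_size >= 2, "block_size must be at least 2 to terminate"
    # Base case: the input fits in one block.
    if len(values) <= block_size:
        return scan_block(values)

    # Scan every block independently and record each block's total.
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    scanned = [scan_block(block) for block in blocks]
    block_sums = [sum(block) for block in blocks]

    # Recursively scan the block totals to get each block's starting offset.
    offsets = recursive_exclusive_scan(block_sums, block_size)

    # Add the offset of each block to all of its scanned elements.
    return [x + offset
            for block, offset in zip(scanned, offsets)
            for x in block]


print(recursive_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
```

Because the list of block totals shrinks by a factor of `block_size` at every level, the recursion depth grows only logarithmically with the input length.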

Performance is increased with a memory-bank conflict avoidance optimization (BCAO).
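
The effect of padding can be illustrated with a small Python model. This is a sketch under stated assumptions, not the repository's macro: it assumes 32 shared-memory banks with one word per bank, 32 threads accessing memory together, and the common padding rule of adding one extra slot for every 32 elements. The names `padded` and `worst_conflict` are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32      # assumption: 32 banks, one word wide
LOG_NUM_BANKS = 5   # log2(NUM_BANKS)


def padded(index):
    """Shift an index by one slot for every NUM_BANKS elements before it."""
    return index + (index >> LOG_NUM_BANKS)


def worst_conflict(stride, pad, threads=32):
    """Largest number of threads that hit the same bank in one up-sweep step."""
    banks = Counter()
    for t in range(threads):
        # Element written by thread t during the up-sweep at this stride.
        index = 2 * stride * (t + 1) - 1
        if pad:
            index = padded(index)
        banks[index % NUM_BANKS] += 1
    return max(banks.values())


for stride in (1, 16):
    print(stride, worst_conflict(stride, pad=False), worst_conflict(stride, pad=True))
# 1 2 1
# 16 32 1
```

In this model, at stride 16 every thread lands on the same bank without padding, so the accesses would be serialized, while the padded indices spread across all 32 banks.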

See the timings for a performance comparison between:

Sequential scan run on the CPU
Parallel scan run on the GPU
Parallel scan with BCAO

For a vector of 10 million entries:

  CPU      : 20749 ms
  GPU      : 7.860768 ms
  GPU BCAO : 4.304064 ms

Test hardware: Intel Core i5-4670K @ 3.4 GHz, NVIDIA GeForce GTX 760
