CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

Michael Willett
Tested on: Windows 10, I5-4690k @ 3.50GHz 8.00GB, GTX 750-TI 2GB (Personal Computer)

## Introduction: Parallel Algorithms The goal of this project was to implement several versions of basic GPU algorithms to become familiar with parallel programming ideas, as well as compare the performance overhead of memory management on GPU vs traditional CPU implementations. ## Section 1: Scanning, Stream Compaction, and Sorting Three basic algorithms were implemented in this project

Scan - consective sum of all elements in the array prior to the current index.
Compaction - removal of all elements in an array that meet some boolean criteria.
Sort - arrange all elements in an array from low to high using a radix sort implementation.

Two implementations of GPU based scan were implemented: a naive version where elements were added in pairs until all sums were computed, and a work efficient version where the total number of operators is reduced at the cost of performing two separate stages of analysis. In the implemented version, the work-efficient scan performs slightly worse than the naive approach from the extra down-sweep phase, however, there is room for improvement with smarter array indexing to reduce the total number of GPU kernel calls. Both algorithms run in O(log(N)) which is theoretically faster than CPU implementations at O(N), but performance analysis shows that this is only significant for large arrays due to array access overhead on the GPU level.

Stream compaction shows similar trends in computation complexity in GPU vs. CPU performance, but catches up to CPU faster than the scan implementations since GPU accelerated reording of arrays is significantly faster than CPU implementations even for relatively small arrays.

The radix sort agorithm iterates over each bit in the input value from least significant to most significant, reordering the array at each step. Again since the reording step is highly efficient on a GPU, the major overhead is due to memory access time.

## Section 2: Verification and Performance Analysis *Code Correctness for Scan, Compact, and Sort Implementations:*

****************
** SCAN TESTS **
****************
    [  38  19  38  37   5  47  15  35   0  12   3   0  42 ...  35   0 ]
==== cpu scan, power-of-two ====
    [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
==== cpu scan, non-power-of-two ====
    passed
==== naive scan, power-of-two ====
    [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
    passed
==== naive scan, non-power-of-two ====
    passed
==== work-efficient scan, power-of-two ====
    [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
    passed
==== work-efficient scan, non-power-of-two ====
    passed
==== thrust scan, power-of-two ====
    [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
    passed
==== thrust scan, non-power-of-two ====
    passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
    [   2   3   2   1   3   1   1   1   2   0   1   0   2 ...   1   0 ]
==== cpu compact without scan, power-of-two ====
    [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   1   1 ]
    passed
==== cpu compact without scan, non-power-of-two ====
    [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   3   3 ]
    passed
==== cpu compact with scan ====
    [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   1   1 ]
    passed
==== work-efficient compact, power-of-two ====
    passed
==== work-efficient compact, non-power-of-two ====
    passed

**********************
** RADIX SORT TESTS **
**********************
==== radix sort, power-of-two ====
    [  38 7719 21238 2437 8855 11797 8365 32285 ]
    [  38 2437 7719 8365 8855 11797 21238 32285 ]
    passed
==== radix sort, non-power-of-two ====
    [  38 7719 21238 2437 8855 11797 8365 ]
    [  38 2437 7719 8365 8855 11797 21238 ]
    passed

Sort passed 1000/1000 randomly generated verification tests.

Benchmarks

All performance measurements were averaged over 1000 samples on a single dataset Scan Performance Compact Performance

We can quickly see from the above charts that run time for the various GPU implementations for scan, compact, and stort all have logarithmic growth, while CPU implementations observe polynomial growth (for scan and compaction, CPU growth is linear). As a result, for smaller arrays the overhead of memory access and array manipulation results in significantly worse execution time, and it isn't until arrays of about ~50,000 elements that performance is comparable. Unfortunately due to implementation choices, the GPU implemenations crash for arrays larger than 2¹⁶, so performance analysis could not be run for these large arrays, and estimates must be extrapolated from the current dataset.

Scan Kernal Calls The above figure shows the room for improvement of the efficient scan quite easily. Compared to the naive scan steps, the up-sweep and down-sweep of the work-efficient implementation do not utilize nearly as much of the GPU per call as the naive scan. Saturating the GPU here could yeild significant improvements by reducing the total number of kernal calls required.

For the thrust library benchmarks, it is very interesting to see what appears to be standard linear growth due to the performance loss in large arrays. Preliminary research on the thrust sort implementation implies that for basic sorting of primitives like integers used in this example, it should also be using a radix sort implemented on the GPU. The charts suggest, however, that thrust either not properly parallelizing the data, or it is running on the CPU. Documentation states that method overloads that accept arrays located on the host will automatically handle data transfer to and from the GPU, but that may not be the case. Additional work to instantiate the data on the GPU prior to the sort results in an Abort() call from the thrust library, so additional investigation would be merited.

## Section 3: Development Process Development was fairly straight forward algorithmic implementation. Future work could be done in the work effecient scan implementation to better handle launching kernal functions at the high depth levels when only a couple of sums are being calculated for the whole array. This should improve peformance to be faster than the naive approach, but I exepect it to still be outperformed by the thrust implementation as well as the CPU version for arrays with fewer than 30,000 elements.

It was worth noting there was a bug in using std::pow for calculating the array index in each kernal invocation. For some unknown reason, it consisantly produced erronius values at 2¹¹, or 2048. This is odd since variables were being cast to double precision before the computation, but the reverse cast was incorrect. This bug may be compiler specific as running the index calculation on a separate machine result in accurate indexing in this range. Simply changing the code to use bitshift operations cleared the error entirely.

## Appendix: Build Instructions **CMakeLists.txt modified to include new sort class and update compute compatability**

src/ contains the source code.

CMake note: Do not change any build settings or add any files to your project directly (in Visual Studio, Nsight, etc.) Instead, edit the src/CMakeLists.txt file. Any files you add must be added here. If you edit it, just rebuild your VS/Nsight project to make it update itself.

Windows

In Git Bash, navigate to your cloned project directory.
Create a build directory: mkdir build
- (This "out-of-source" build makes it easy to delete the build directory and try again if something goes wrong with the configuration.)
Navigate into that directory: cd build
Open the CMake GUI to configure the project:
- cmake-gui .. or "C:\Program Files (x86)\cmake\bin\cmake-gui.exe" ..
  - Don't forget the .. part!
- Make sure that the "Source" directory is like .../Project2-Stream-Compaction.
- Click Configure. Select your version of Visual Studio, Win64. (NOTE: you must use Win64, as we don't provide libraries for Win32.)
- If you see an error like CUDA_SDK_ROOT_DIR-NOTFOUND, set CUDA_SDK_ROOT_DIR to your CUDA install path. This will be something like: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v7.5
- Click Generate.
If generation was successful, there should now be a Visual Studio solution (.sln) file in the build directory that you just created. Open this. (from the command line: explorer *.sln)
Build. (Note that there are Debug and Release configuration options.)
Run. Make sure you run the cis565_ target (not ALL_BUILD) by right-clicking it and selecting "Set as StartUp Project".
- If you have switchable graphics (NVIDIA Optimus), you may need to force your program to run with only the NVIDIA card. In NVIDIA Control Panel, under "Manage 3D Settings," set "Multi-display/Mixed GPU acceleration" to "Single display performance mode".

OS X & Linux

It is recommended that you use Nsight.

Open Nsight. Set the workspace to the one containing your cloned repo.
File->Import...->General->Existing Projects Into Workspace.
- Select the Project 0 repository as the root directory.
Select the cis565- project in the Project Explorer. From the Project menu, select Build All.
- For later use, note that you can select various Debug and Release build configurations under Project->Build Configurations->Set Active....
If you see an error like CUDA_SDK_ROOT_DIR-NOTFOUND:
- In a terminal, navigate to the build directory, then run: cmake-gui ..
- Set CUDA_SDK_ROOT_DIR to your CUDA install path. This will be something like: /usr/local/cuda
- Click Configure, then Generate.
Right click and Refresh the project.
From the Run menu, Run. Select "Local C/C++ Application" and the cis565_ binary.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
cmake		cmake
imgs		imgs
src		src
stream_compaction		stream_compaction
.cproject		.cproject
.gitignore		.gitignore
.project		.project
CMakeLists.txt		CMakeLists.txt
GNUmakefile		GNUmakefile
INSTRUCTION.md		INSTRUCTION.md
README.md		README.md
benchmark data.xlsx		benchmark data.xlsx
cis565_stream_compaction_test.launch		cis565_stream_compaction_test.launch

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Stream Compaction