This is my version of a compression algorithm based on run-length encoding (RLE), implemented for NVIDIA cards with CUDA compute capability.
I based my work on:
While the compression algorithm may look fairly complex, it goes as follows (a *run* is a sequence of repeated characters):
- Compute a mask from the input data: put 1 where a new run starts, 0 otherwise.
- Compute an inclusive prefix sum (scan) of the mask, which encodes the output location of each compressed pair.
- Compute the compacted mask, which encodes the starting index of every run.
- Finally, compute the occurrences (the differences between consecutive elements of the compacted mask) and write each run's character to the output array (its position is found by using the scanned mask value as an index).
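The steps above can be sketched as a sequential CPU reference (function and variable names here are illustrative, not taken from the actual CUDA code; on the GPU each loop body would run as an independent thread):

```python
from itertools import accumulate

def rle_compress(data):
    n = len(data)
    # Step 1: mask -- 1 where a new run starts, 0 otherwise.
    mask = [1] + [1 if data[i] != data[i - 1] else 0 for i in range(1, n)]
    # Step 2: inclusive prefix sum of the mask; the last element is the
    # total number of runs, and each value gives the output slot of its run.
    scanned = list(accumulate(mask))
    num_runs = scanned[-1]
    # Step 3: compacted mask -- the starting index of every run, with a
    # sentinel n appended so run lengths fall out as differences.
    starts = [0] * (num_runs + 1)
    for i in range(n):
        if mask[i]:
            starts[scanned[i] - 1] = i
    starts[num_runs] = n
    # Step 4: occurrences (run lengths) and the character of each run.
    counts = [starts[r + 1] - starts[r] for r in range(num_runs)]
    chars = [data[starts[r]] for r in range(num_runs)]
    return chars, counts

# rle_compress("aaabbc") -> (['a', 'b', 'c'], [3, 2, 1])
```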
The decompression algorithm is more straightforward:
- Compute an exclusive prefix sum of the occurrences array; its elements point to the starting position of each run.
- Reconstruct the original data; in my approach each thread is responsible for filling the output array for one run. (This may be an inefficient approach for highly RLE-compressible data, since run lengths then vary widely across threads.)
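A matching CPU sketch of the two decompression steps (again, names are illustrative; the sequential loop over runs stands in for one GPU thread per run):

```python
from itertools import accumulate

def rle_decompress(chars, counts):
    # Step 1: exclusive prefix sum of the run lengths -- each element
    # is the output position where the corresponding run starts.
    starts = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    # Step 2: one "thread" per run fills its whole stretch of the output.
    for r in range(len(chars)):
        for j in range(starts[r], starts[r] + counts[r]):
            out[j] = chars[r]
    return out

# rle_decompress(['a', 'b', 'c'], [3, 2, 1]) -> ['a','a','a','b','b','c']
```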
Comparison with a simple sequential CPU RLE compression implementation (averages for a 256-colour 11520x6480 bitmap):
| Device | Compression time | Decompression time |
|---|---|---|
| CPU | 2229 ms | 7481 ms |
| GPU | 955 ms | 320 ms |