This is my version of a compression algorithm based on run-length encoding (RLE), implemented for NVIDIA cards with CUDA compute capability.
I based my work on:
While the compression algorithm may look fairly complex, it goes as follows (a *run* is a sequence of repeated characters):
- Compute a mask from the input data: put 1 where a new run starts, 0 otherwise.
- Compute an inclusive prefix sum (scan) of the mask, which encodes the output location of each compressed pair.
- Compute the compacted mask, which encodes the starting index of every run.
- Finally, compute the occurrences (the differences between consecutive elements of the compacted mask) and write each run's character to the output array (its position is found by using the scanned mask value as an index).
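The steps above can be sketched as a sequential CPU reference (function and variable names here are illustrative, not taken from the actual CUDA code; on the GPU each loop body would run as an independent thread):

```python
from itertools import accumulate

def rle_compress(data):
    n = len(data)
    # Step 1: mask -- 1 where a new run starts, 0 otherwise.
    mask = [1] + [1 if data[i] != data[i - 1] else 0 for i in range(1, n)]
    # Step 2: inclusive prefix sum of the mask; the last element is the
    # total number of runs, and each value gives the output slot of its run.
    scanned = list(accumulate(mask))
    num_runs = scanned[-1]
    # Step 3: compacted mask -- the starting index of every run, with a
    # sentinel n appended so run lengths fall out as differences.
    starts = [0] * (num_runs + 1)
    for i in range(n):
        if mask[i]:
            starts[scanned[i] - 1] = i
    starts[num_runs] = n
    # Step 4: occurrences (run lengths) and the character of each run.
    counts = [starts[r + 1] - starts[r] for r in range(num_runs)]
    chars = [data[starts[r]] for r in range(num_runs)]
    return chars, counts

# rle_compress("aaabbc") -> (['a', 'b', 'c'], [3, 2, 1])
```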
The decompression algorithm is more straightforward:
- Compute an exclusive prefix sum of the occurrences array; its elements point to the starting position of each run.
- Reconstruct the original data; in my approach each thread is responsible for filling the output array for one run. (This may be an inefficient approach for highly RLE-compressible data, since run lengths then vary widely across threads.)
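A matching CPU sketch of the two decompression steps (again, names are illustrative; the sequential loop over runs stands in for one GPU thread per run):

```python
from itertools import accumulate

def rle_decompress(chars, counts):
    # Step 1: exclusive prefix sum of the run lengths -- each element
    # is the output position where the corresponding run starts.
    starts = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    # Step 2: one "thread" per run fills its whole stretch of the output.
    for r in range(len(chars)):
        for j in range(starts[r], starts[r] + counts[r]):
            out[j] = chars[r]
    return out

# rle_decompress(['a', 'b', 'c'], [3, 2, 1]) -> ['a','a','a','b','b','c']
```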
Comparison with a simple sequential CPU RLE compression implementation (averages for a 256-colour 11520x6480 bitmap):
| Device | Compression time | Decompression time |
|---|---|---|
| CPU | 2229 ms | 7481 ms |
| GPU | 955 ms | 320 ms |