This repo evaluates the performance of portable libraries for constructing the suffix array (SA) of string sets. These libraries only include a few source files and have no dependencies. They can be building blocks of larger projects.
There are different ways to define the SA of a string set. We
here focus on the most common definition as follows. Let
Few SA construction libraries directly support string sets. Nonetheless, we can
achieve the goal for any libraries that support integer alphabets, such as
libsais, by converting
Then the SA of
To alleviate the issue, I developed ksa in 2011 by adapting an old
version of Yuta Mori's sais. Briefly, during symbol comparisons, ksa
implicitly replaces a sentinel
Published in 2017, gSACA-K is another library based on the linear-time SAIS algorithm. Please read its paper for details.
Here is the timing for constructing the CHM13v2 genome on both strand (6.2 billion symbols in total) on a Xeon Gold 6130:
ksa | gsaca-k | sais-t1 | sais-t4 | sais-t8 | sais-t8b | sais-t8c | sais16-t8c | |
---|---|---|---|---|---|---|---|---|
# threads | 1 | 1 | 1 | 4 | 8 | 8 | 8 | 8 |
Elapsed (s) | 1396 | 3356 | 588 | 386 | 260 | 374 | 473 | 296 |
CPU time (s) | 1395 | 3349 | 587 | 1152 | 1439 | 1895 | 2602 | 1146 |
Peak RSS (GB) | 52.3 | 53.5 | 92.9 | 92.9 | 92.9 | 92.9 | 92.9 | 58.4 |
Some notes and observations:
-
libsais is clearly the fastest even on a single thread and we see noticeable speedup with multiple threads. A caveat is that the multi-threading performance of libsais appears to have large fluctuation. For example, the three sais-t8 were run on different nodes with the same configuration but the speed was quite different. sais-t8c and sais16-t8c were run on the same machine.
-
gSACA-K would crash if compiled with
-fopenmp
. I am not sure why. -
ksa is faster than gSACA-K and has the same memory footprint. It would be good to apply this ksa strategy to libsais to reduce its peak memory.
-
We omitted ropebwt2 and BEETL because they are slow for chromosome-long strings and we omitted grlBWT because it writes temporary files and is not designed as a library. We did not evaluate eGAP because gSACA-K appears to be faster in multiple third-party benchmarks.