Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads.
The current implementation supports constant-length reads limited to 255 bases.
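Because of the constant-length and 255-base limits, it can help to check an input file before compressing it. The following is a minimal sketch, not part of PgRC: the helper name check_fastq_reads is hypothetical, and it assumes a plain FASTQ file with exactly four lines per record and Unix line endings.

```shell
# Hypothetical helper (not part of PgRC): verify that every read in a
# four-line-per-record FASTQ file has the same length and at most 255 bases.
check_fastq_reads() {
    awk '
        NR % 4 == 2 {                       # sequence line of each record
            len = length($0)
            if (first == 0) first = len     # remember the first read length
            if (len != first) { print "variable-length reads at line " NR; bad = 1; exit }
            if (len > 255)    { print "read longer than 255 bases at line " NR; bad = 1; exit }
        }
        END {
            if (!bad && first == 0) { print "no reads found"; bad = 1 }
            if (!bad) print "ok: read length " first
            exit bad + 0                    # non-zero status when a check failed
        }
    ' "$1"
}
```

Calling check_fastq_reads in.fastq prints a one-line verdict and returns a non-zero status when the file does not fit the limits stated above.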
The following steps build a PgRC executable.
On Linux, building PgRC requires cmake version 3.4 or later (check with cmake --version):
git clone https://github.com/kowallus/PgRC.git
cd PgRC
mkdir build
cd build
cmake ..
make PgRC
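The cmake version requirement can also be checked in a script before running the build. This is a sketch under assumptions: the helper version_ge is hypothetical, it relies on GNU sort -V being available, and it assumes the first line of cmake --version has the usual form "cmake version X.Y.Z".

```shell
# version_ge A B: succeeds when version A >= version B (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="3.4"
if command -v cmake >/dev/null 2>&1; then
    # The third field of the first output line is taken as the version number.
    found="$(cmake --version | head -n 1 | awk '{print $3}')"
    if version_ge "$found" "$required"; then
        echo "cmake $found is recent enough"
    else
        echo "cmake $found is older than $required" >&2
    fi
else
    echo "cmake not found" >&2
fi
```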
PgRC [-c compressionLevel] [-i seqSrcFile [pairSrcFile]] [-t noOfThreads] [-o] [-d] archiveName
-c compression levels: 1 - fast; 2 - default; 3 - max
-t number of threads used (8 - default)
-d decompression mode
-o preserve original read order information
compression of DNA stream in order non-preserving regime (SE mode):
./PgRC -i in.fastq comp.pgrc
compression of DNA stream in order preserving regime (SE_ORD mode):
./PgRC -o -i in.fastq comp.pgrc
compression of paired-end DNA stream in order non-preserving regime (PE mode):
./PgRC -i in1.fastq in2.fastq comp.pgrc
compression of paired-end DNA stream in order preserving regime (PE_ORD mode):
./PgRC -o -i in1.fastq in2.fastq comp.pgrc
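The invocations above differ only in the -o flag and in the number of input files. A small wrapper can assemble the right command line from those two choices. This is an illustrative sketch: pgrc_cmd is a hypothetical helper, it only prints the command rather than running it, and it does not handle file names that contain spaces.

```shell
# Hypothetical wrapper (not part of PgRC): print the PgRC command line for
# one (single-end) or two (paired-end) FASTQ inputs.
# Usage: pgrc_cmd ARCHIVE KEEP_ORDER(0|1) IN1 [IN2]
pgrc_cmd() {
    archive="$1"
    keep_order="$2"
    shift 2
    if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
        echo "expected one or two FASTQ files" >&2
        return 1
    fi
    order_flag=""
    if [ "$keep_order" = "1" ]; then
        order_flag="-o "                 # preserve original read order
    fi
    echo "./PgRC ${order_flag}-i $* $archive"
}

pgrc_cmd comp.pgrc 0 in.fastq               # single-end, order not preserved
pgrc_cmd comp.pgrc 1 in1.fastq in2.fastq    # paired-end, order preserved
```

Piping the printed line to sh, or replacing echo with a direct call, would run the compression; printing first makes it easy to review the command.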