PgRC: Pseudogenome based Read Compressor

Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads.

The current implementation supports constant-length reads limited to 255 bases.
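
Because of this constraint, it can be worth checking a dataset before compressing it. The following is a minimal sketch, not part of PgRC: `check_reads` is a hypothetical helper name, and it assumes the common FASTQ layout of exactly four lines per record (no wrapped sequences) with Unix line endings.

```shell
# check_reads is a hypothetical helper, not part of PgRC.
# Assumes 4 lines per FASTQ record, so the sequence is line 2 of each record.
check_reads() {
  awk 'NR % 4 == 2 {
         n++
         len = length($0)
         if (n == 1) first = len
         if (len != first) { print "variable-length reads"; bad = 1; exit }
         if (len > 255)    { print "reads longer than 255 bases"; bad = 1; exit }
       }
       END { if (!bad && n > 0) print "ok: read length " first }' "$1"
}
```

Calling `check_reads in.fastq` prints a single line describing the first problem found, or the common read length if none is found.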

Installation on Linux

The following steps create a PgRC executable. On Linux, building PgRC requires cmake version 3.4 or later (check with cmake --version):

git clone https://github.com/kowallus/PgRC.git
cd PgRC
mkdir build
cd build
cmake ..
make PgRC
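
Before running the steps above, the cmake requirement can be checked from a script. This is only a sketch: `version_ge` is a hypothetical helper, it relies on the `-V` (version sort) option of GNU sort, and it assumes the first line of `cmake --version` has the form `cmake version X.Y.Z`.

```shell
# version_ge A B: succeeds when version A is at least version B.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v cmake >/dev/null 2>&1; then
  # Third field of the first output line is assumed to be the version number.
  installed=$(cmake --version | awk 'NR == 1 { print $3 }')
  if version_ge "$installed" 3.4; then
    echo "cmake $installed is new enough"
  else
    echo "cmake $installed is older than 3.4"
  fi
else
  echo "cmake not found"
fi
```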

Basic usage

PgRC [-c compressionLevel] [-i seqSrcFile [pairSrcFile]] [-t noOfThreads] [-o] [-d] archiveName
   
   -c compression levels: 1 - fast; 2 - default; 3 - max
   -t number of threads used (8 - default)
   -d decompression mode
   -o preserve original read order information

Compression of a single-end DNA stream without preserving read order (SE mode):

./PgRC -i in.fastq comp.pgrc

Compression of a single-end DNA stream preserving read order (SE_ORD mode):

./PgRC -o -i in.fastq comp.pgrc

Compression of a paired-end DNA stream without preserving read order (PE mode):

./PgRC -i in1.fastq in2.fastq comp.pgrc

Compression of a paired-end DNA stream preserving read order (PE_ORD mode):

./PgRC -o -i in1.fastq in2.fastq comp.pgrc
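
The remaining options from the usage line (`-c`, `-t` and `-d`) can be combined in the same way. The sketch below wraps two such invocations in shell functions. The function names and the `PGRC` variable are illustrative assumptions, the option order follows the usage line, and the names of the files produced by decompression are not covered in this section.

```shell
# Path to the executable built in the installation steps (assumed location).
PGRC=${PGRC:-./PgRC}

# Compress single-end reads at the maximum level (-c 3) using 4 threads (-t 4).
compress_max() { "$PGRC" -c 3 -i "$1" -t 4 "$2"; }

# Decompress an archive (-d); only the archive name is passed.
decompress() { "$PGRC" -d "$1"; }
```

For example, `compress_max in.fastq comp.pgrc` followed by `decompress comp.pgrc`.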

Publications

Tomasz M. Kowalski, Szymon Grabowski: PgRC: pseudogenome-based read compressor. Bioinformatics, Volume 36, Issue 7, pp. 2082–2089 (2020).

supplementary data

bioRxiv

Related projects

PgSA - Pseudogenome Suffix Array