Pil
(arrow in Swedish) is an open-source C++ library specifying a language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware. The memory-layout permits O(1)-time random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Uniquely, Pil
supports streaming construction of schema-agnostic archives.
We created Pil
to make the advantages of compressed, efficient columnar data representation with support for very efficient compression and encoding schemes with a focus on supporting genomics data. Pil
allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
When executing queries in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, and therefore saving the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown. Pil
may evaluate the predicates using three different approaches and discards non-matching records:
- Segmental statistics: relies on the minimum and maximum value statistics in the ColumnStore meta data to filter and prune data at the row group level.
- Dictionary encoding: dictionaries can filter out values that are between
min
andmax
but not in the dictionary. - Bloom filter: When there are too many distinct values, constructing dictionaries can be expensive. In cases of high-cardinality sets, we use blocked Bloom-filters for probabilistic set-membership tests.
Pillar
is the specialized implementation that can consume the majority of the incumbent genomics interchange formats including: SAM, BAM, CRAM, VCF, BCF, YON, FASTA, FASTQ, BED, GTF2, GFF3 and have native coding support for sequencing-specific range codecs (CRAM and fqzcomp), PBWT (BGT), genotype-PBWT (YON), multi-symbol-PBWT, and individual-centric WAH-bitmaps (GQT).
Type | Format(s) |
---|---|
Sequence | FASTA, 2bit-FASTA |
Annotations | BED, GTF2, GFF3 |
Read alignments | SAM, BAM, CRAM |
Variant | VCF, BCF, GQT, BGT |
ERR194146.1: Illumina HiSeq 2000 run of Coriell CEPH/UTAH 1463 sample.
Format | File size | Import time | Compression ratio | Random access |
---|---|---|---|---|
FASTQ | 2337920025 | - | 1 | No |
gzip | 775486779 | 5m25.651s | 3.014777 | No |
zstd | 783213067 | 0m32.145s | 2.985037 | No |
fqzcomp | 441223058 | 1m34.187s | 5.298726 | No |
Pil | 501741060 | 3m34.864s | 4.440460 | Yes |
Pil-65536 | 469251192 | 5m16.680s | 4.982236 | Yes |
gzip -c ERR194146.fastq > ERR194146.fastq.gz
fqzcomp -s3 -e -q2 -n2 ERR194146.fastq ERR194146.fastq.fqz
zstd -3 ERR194146.fastq -o ERR194146.fastq.zst
Pil
usedPIL_COMPRESS_RC_QUAL
andPIL_COMPRESS_RC_BASES
codecs for per-base quality scores and bases, respectively. Sequence names were tokenized by:
(colon) and partitioned intoColumnStores
without additional processing.Pil-65536
was run with the same settings but theRecordBatch
size set to 65536 instead of 8192.
NA12878J: Illumina HiSeq-X run of Coriell CEPH/UTAH 1463 sample comprising 122.6 gigabases at 30× coverage. The original dataset from the Garvan Institute of Medical Research consists of the following files:
We will use the first 50 million reads from NA12878J_HiSeqX_R1
to convert into a BAM
file using biobambam2
.
Format | File size | Import time | Compression ratio | Random access |
---|---|---|---|---|
FASTQ | 4469877114 | - | 1 | No |
FASTQ.gz | 1185209615 | 7m59.405s | 3.771381 | No |
fqzcomp | 741975278 | 2m24.190s | 6.024294 | No |
SAM | 4532377114 | - | 0.986 | No |
SAM.gz | 1187984017 | 7m48.570s | 3.762573 | No |
BAM | 1208014419 | 4m8.770s*, 4m47.160s** | 3.700185 | Partial |
CRAM | 860706428 | 1m29.555s | 5.193266 | Partial |
Pil | 801811920 | 3m59.769s | 5.57472 | Yes |
Pil-65536 | 773132410 | 3m40.953s | 5.781516 | Yes |
* SAM
->BAM
** FASTQ
-> BAM
gzip -c NA12878J_HiSeqX_R1_50mil.fastq > NA12878J_HiSeqX_R1_50mil.fastq.gz
fqzcomp -s3 -e -q2 -n2 NA12878J_HiSeqX_R1_50mil.fastq NA12878J_HiSeqX_R1_50mil.fastq.fqz
fastqtobam NA12878J_HiSeqX_R1_50mil.fastq > NA12878J_HiSeqX_R1_50mil.fastq.bam
samtools view NA12878J_HiSeqX_R1_50mil.fastq.sam -O bam > NA12878J_HiSeqX_R1_50mil.fastq.bam
samtools view NA12878J_HiSeqX_R1_50mil.fastq.bam -O cram > NA12878J_HiSeqX_R1_50mil.fastq.cram
gzip -c NA12878J_HiSeqX_R1_50mil.fastq.sam > NA12878J_HiSeqX_R1_50mil.fastq.sam.gz
Pil
usedPIL_COMPRESS_RC_QUAL
andPIL_COMPRESS_RC_BASES
codecs for per-base quality scores and bases, respectively. Sequence names were tokenized by:
(colon) and partitioned intoColumnStores
without additional processing.Pil-65536
was run with the same settings but theRecordBatch
size set to 65536 instead of 8192.
Aligned data from above resulting in 12,565,597 records.
Format | File size | Import time | Compression ratio | Random access |
---|---|---|---|---|
SAM | 5271405563 | - | 1 | No |
BAM | 1540663158 | 5m5.326s | 3.421517 | Partial |
CRAM | 534863873* | 2m5.874s | 9.855602 | Partial |
Pil | 945503012 (498481408**) | 5m22.283s | 5.575239 (10.57493**) | Yes |
Pil-65536 | 915311876 (484110493**) | 5m5.223s | 5.759136 (10.88885**) | Yes |
* CRAM
requires an external reference, in this case hg19.fa.gz
, that is 948731427 bytes.
** Running Pil
with an external reference sequence like CRAM
and dropping CIGAR fields (they can be
recomputed on-the-fly).
samtools view NA12878J_HiSeqX_R1_50mil.fastq.aligned.sam -O bam > NA12878J_HiSeqX_R1_50mil.fastq.aligned.bam
samtools view NA12878J_HiSeqX_R1_50mil.fastq.aligned.bam -O cram > NA12878J_HiSeqX_R1_50mil.fastq.aligned.cram
Pil
usedPIL_COMPRESS_RC_QUAL
andPIL_COMPRESS_RC_BASES
codecs for per-base quality scores and bases, respectively. Sequence names were tokenized by:
(colon) and partitioned intoColumnStores
without additional processing.Pil-65536
was run with the same settings but theRecordBatch
size set to 65536 instead of 8192.