Pil

Experimental repository in pre-alpha. Use with caution!

Pil (Swedish for "arrow") is an open-source C++ library specifying a language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware. The memory layout permits O(1) random access, is cache-efficient in analytics workloads, and lends itself to SIMD optimizations on modern processors. Uniquely, Pil supports streaming construction of schema-agnostic archives.
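
To illustrate why a columnar layout allows constant-time random access, here is a minimal sketch. It is not Pil's actual implementation: the `StringColumn` type and its members are hypothetical names chosen for this example. Variable-length values are stored back to back in one contiguous buffer, and an offsets array locates record i without scanning the records before it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical variable-length column (illustration only, not Pil's API).
// Values are stored contiguously in `data`; offsets[i] is where value i
// starts and offsets[i + 1] is where it ends.
struct StringColumn {
    std::vector<char> data;                  // contiguous value bytes
    std::vector<std::uint32_t> offsets{0};   // always size() + 1 entries

    void Append(const std::string& value) {
        data.insert(data.end(), value.begin(), value.end());
        offsets.push_back(static_cast<std::uint32_t>(data.size()));
    }

    std::size_t size() const { return offsets.size() - 1; }

    // O(1) lookup: two reads from `offsets`, no scan over earlier records.
    // The caller must ensure i < size().
    std::string Get(std::size_t i) const {
        return std::string(data.begin() + offsets[i],
                           data.begin() + offsets[i + 1]);
    }
};
```

Fixed-width columns are simpler still: value i sits at byte offset i times the value width, so no offsets array is needed.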

We created Pil to bring the advantages of a compressed, efficient columnar data representation to genomics data, with support for very efficient compression and encoding schemes. Pil allows compression schemes to be specified per column and is designed so that more encodings can be added as they are invented and implemented.

Predicate pushdown

When queries are executed in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, which saves the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown. Pil can evaluate predicates using three different approaches to discard non-matching records:

Segmental statistics: relies on the minimum and maximum value statistics in the ColumnStore metadata to filter and prune data at the row group level.
Dictionary encoding: dictionaries can filter out values that lie between the minimum and maximum but are not present in the dictionary.
Bloom filter: when there are too many distinct values, constructing dictionaries can be expensive. For such high-cardinality sets, we use blocked Bloom filters for probabilistic set-membership tests.
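
The three checks above can be sketched as a single "can this segment be skipped?" test for an equality predicate. This is a simplified illustration under stated assumptions, not Pil's actual code: `SegmentMeta`, `BlockedBloom`, and `CanSkipSegment` are hypothetical names, the column is assumed to hold 64-bit integers, and the hash mixer is an arbitrary choice made for the example.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical blocked Bloom filter (illustration only). Each block is
// 256 bits and a key touches exactly one block, which keeps a lookup
// within a small, cache-friendly region.
struct BlockedBloom {
    std::vector<std::array<std::uint64_t, 4>> blocks;  // zero-initialized

    // n_blocks must be greater than zero.
    explicit BlockedBloom(std::size_t n_blocks) : blocks(n_blocks) {}

    static std::uint64_t Mix(std::uint64_t x) {  // 64-bit mixing function
        x += 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

    void Insert(std::uint64_t key) {
        std::uint64_t h = Mix(key);
        auto& block = blocks[h % blocks.size()];
        for (int i = 0; i < 4; ++i) {  // set 4 bits inside the chosen block
            h = Mix(h);
            block[(h >> 6) & 3] |= 1ULL << (h & 63);
        }
    }

    // False means "definitely absent"; true means "possibly present".
    bool MayContain(std::uint64_t key) const {
        std::uint64_t h = Mix(key);
        const auto& block = blocks[h % blocks.size()];
        for (int i = 0; i < 4; ++i) {
            h = Mix(h);
            if (!(block[(h >> 6) & 3] & (1ULL << (h & 63)))) return false;
        }
        return true;
    }
};

// Hypothetical per-segment metadata for one integer column.
struct SegmentMeta {
    std::int64_t min_value;
    std::int64_t max_value;
    std::vector<std::int64_t> dictionary;  // sorted distinct values; empty if not built
    const BlockedBloom* bloom;             // null if not built
};

// Returns true if no record in the segment can satisfy `column == value`.
bool CanSkipSegment(const SegmentMeta& meta, std::int64_t value) {
    // 1) Segmental statistics: value outside [min, max].
    if (value < meta.min_value || value > meta.max_value) return true;
    // 2) Dictionary: value inside the range but not among the distinct values.
    if (!meta.dictionary.empty())
        return !std::binary_search(meta.dictionary.begin(),
                                   meta.dictionary.end(), value);
    // 3) Bloom filter: probabilistic test for high-cardinality columns.
    if (meta.bloom != nullptr)
        return !meta.bloom->MayContain(static_cast<std::uint64_t>(value));
    return false;  // nothing rules the segment out; it must be scanned
}
```

The checks are ordered from cheapest to most expensive. A Bloom filter can report false positives but never false negatives, so a segment is only skipped when the value is certainly absent.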

Subproject: Pillar

Pillar is the specialized implementation that can consume the majority of the incumbent genomics interchange formats, including SAM, BAM, CRAM, VCF, BCF, YON, FASTA, FASTQ, BED, GTF2, and GFF3. It has native support for sequencing-specific range codecs (CRAM and fqzcomp), PBWT (BGT), genotype-PBWT (YON), multi-symbol-PBWT, and individual-centric WAH-bitmaps (GQT).

Type	Format(s)
Sequence	FASTA, 2bit-FASTA
Annotations	BED, GTF2, GFF3
Read alignments	SAM, BAM, CRAM
Variant	VCF, BCF, GQT, BGT

Preliminary results

FASTQ: Unaligned readset

ERR194146.1: Illumina HiSeq 2000 run of Coriell CEPH/UTAH 1463 sample.

Format	File size (bytes)	Import time	Compression ratio	Random access
FASTQ	2337920025	-	1	No
gzip	775486779	5m25.651s	3.014777	No
zstd	783213067	0m32.145s	2.985037	No
fqzcomp	441223058	1m34.187s	5.298726	No
Pil	501741060	3m34.864s	4.440460	Yes
Pil-65536	469251192	5m16.680s	4.982236	Yes

Settings

gzip -c ERR194146.fastq > ERR194146.fastq.gz
fqzcomp -s3 -e -q2 -n2 ERR194146.fastq ERR194146.fastq.fqz
zstd -3 ERR194146.fastq -o ERR194146.fastq.zst
Pil used PIL_COMPRESS_RC_QUAL and PIL_COMPRESS_RC_BASES codecs for per-base quality scores and bases, respectively. Sequence names were tokenized by : (colon) and partitioned into ColumnStores without additional processing.
Pil-65536 was run with the same settings but the RecordBatch size set to 65536 instead of 8192.
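
The name tokenization described above can be sketched as follows. This is an assumption-laden illustration rather than Pil's actual code: `TokenizeName` and `NameColumns` are hypothetical names, and every token is kept as a string. Each colon-separated field of a read name goes to its own column, so fields that are constant or slowly varying across reads end up next to each other.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a read name on a delimiter, keeping empty fields.
std::vector<std::string> TokenizeName(const std::string& name, char delim = ':') {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = name.find(delim, start);
        if (pos == std::string::npos) {
            tokens.push_back(name.substr(start));  // last (or only) field
            break;
        }
        tokens.push_back(name.substr(start, pos - start));
        start = pos + 1;
    }
    return tokens;
}

// Hypothetical column set: columns[j][i] holds token j of record i.
struct NameColumns {
    std::vector<std::vector<std::string>> columns;
    std::size_t n_records = 0;

    void Append(const std::string& name) {
        std::vector<std::string> tokens = TokenizeName(name);
        // If a record has more fields than seen so far, add columns and
        // pad them with empty values for the earlier records.
        if (tokens.size() > columns.size())
            columns.resize(tokens.size(), std::vector<std::string>(n_records));
        // Records with fewer fields get empty values in the remaining columns.
        for (std::size_t j = 0; j < columns.size(); ++j)
            columns[j].push_back(j < tokens.size() ? tokens[j] : std::string());
        ++n_records;
    }
};
```

For a made-up name such as `RUN1:134:FLOWCELL:1:1101:1318:2000`, this yields seven columns; the first few typically repeat across neighbouring reads and therefore compress well.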

SAM/BAM/CRAM: Unaligned readset

NA12878J: Illumina HiSeq-X run of Coriell CEPH/UTAH 1463 sample comprising 122.6 gigabases at 30× coverage. The original dataset from the Garvan Institute of Medical Research consists of the following files:

We take the first 50 million reads from NA12878J_HiSeqX_R1 and convert them into a BAM file using biobambam2.

Format	File size (bytes)	Import time	Compression ratio	Random access
FASTQ	4469877114	-	1	No
FASTQ.gz	1185209615	7m59.405s	3.771381	No
fqzcomp	741975278	2m24.190s	6.024294	No
SAM	4532377114	-	0.986	No
SAM.gz	1187984017	7m48.570s	3.762573	No
BAM	1208014419	4m8.770s*, 4m47.160s**	3.700185	Partial
CRAM	860706428	1m29.555s	5.193266	Partial
Pil	801811920	3m59.769s	5.57472	Yes
Pil-65536	773132410	3m40.953s	5.781516	Yes

* SAM -> BAM
** FASTQ -> BAM

Settings

gzip -c NA12878J_HiSeqX_R1_50mil.fastq > NA12878J_HiSeqX_R1_50mil.fastq.gz
fqzcomp -s3 -e -q2 -n2 NA12878J_HiSeqX_R1_50mil.fastq NA12878J_HiSeqX_R1_50mil.fastq.fqz
fastqtobam NA12878J_HiSeqX_R1_50mil.fastq > NA12878J_HiSeqX_R1_50mil.fastq.bam
samtools view NA12878J_HiSeqX_R1_50mil.fastq.sam -O bam > NA12878J_HiSeqX_R1_50mil.fastq.bam
samtools view NA12878J_HiSeqX_R1_50mil.fastq.bam -O cram > NA12878J_HiSeqX_R1_50mil.fastq.cram
gzip -c NA12878J_HiSeqX_R1_50mil.fastq.sam > NA12878J_HiSeqX_R1_50mil.fastq.sam.gz
Pil used PIL_COMPRESS_RC_QUAL and PIL_COMPRESS_RC_BASES codecs for per-base quality scores and bases, respectively. Sequence names were tokenized by : (colon) and partitioned into ColumnStores without additional processing.
Pil-65536 was run with the same settings but the RecordBatch size set to 65536 instead of 8192.
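
To make the RecordBatch size setting concrete, the sketch below shows how a stream of records could be partitioned into fixed-size batches. It is a hedged illustration: `BatchRanges` and `BatchIndex` are hypothetical helpers, not part of Pil. In general, larger batches give each column's codec more data to model, while smaller batches mean less data has to be decoded to reach a single record.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open record ranges [first, second), one per batch.
// batch_size must be greater than zero.
std::vector<std::pair<std::size_t, std::size_t>> BatchRanges(
        std::size_t n_records, std::size_t batch_size) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < n_records; begin += batch_size)
        ranges.emplace_back(begin, std::min(begin + batch_size, n_records));
    return ranges;
}

// Index of the batch that holds a given record, needed for random access.
std::size_t BatchIndex(std::size_t record, std::size_t batch_size) {
    return record / batch_size;
}
```

With 20,000 records, a batch size of 8192 gives three batches (the last one partial), whereas 65536 gives a single batch.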

SAM/BAM/CRAM: Aligned readset

Aligning the data from above resulted in 12,565,597 records.

Format	File size (bytes)	Import time	Compression ratio	Random access
SAM	5271405563	-	1	No
BAM	1540663158	5m5.326s	3.421517	Partial
CRAM	534863873*	2m5.874s	9.855602	Partial
Pil	945503012 (498481408**)	5m22.283s	5.575239 (10.57493**)	Yes
Pil-65536	915311876 (484110493**)	5m5.223s	5.759136 (10.88885**)	Yes

* CRAM requires an external reference, in this case hg19.fa.gz, that is 948731427 bytes.
** Pil run with an external reference sequence, as CRAM does, and with CIGAR fields dropped (they can be recomputed on the fly).
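
The idea behind reference-based storage can be sketched as below. This is a deliberately simplified illustration and not the scheme Pil or CRAM actually implements: it assumes an ungapped alignment (no insertions, deletions, or clipping) that lies fully within the reference, and `Mismatch`, `EncodeAgainstReference`, and `DecodeAgainstReference` are hypothetical names. Only the positions where a read differs from the reference are stored.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One difference between a read and the reference.
struct Mismatch {
    std::uint32_t offset;  // position within the read
    char base;             // base observed in the read
};

// Assumption: ungapped alignment, with pos + read.size() <= reference.size().
std::vector<Mismatch> EncodeAgainstReference(const std::string& reference,
                                             std::size_t pos,
                                             const std::string& read) {
    std::vector<Mismatch> diffs;
    for (std::size_t i = 0; i < read.size(); ++i)
        if (read[i] != reference[pos + i])
            diffs.push_back({static_cast<std::uint32_t>(i), read[i]});
    return diffs;
}

// Rebuild the read from its position, its length, and the stored mismatches.
std::string DecodeAgainstReference(const std::string& reference,
                                   std::size_t pos, std::size_t length,
                                   const std::vector<Mismatch>& diffs) {
    std::string read = reference.substr(pos, length);
    for (const Mismatch& d : diffs) read[d.offset] = d.base;
    return read;
}
```

A read that matches the reference exactly needs no stored bases at all, which is why reference-based files are much smaller but cannot be decoded without the reference.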

Settings

samtools view NA12878J_HiSeqX_R1_50mil.fastq.aligned.sam -O bam > NA12878J_HiSeqX_R1_50mil.fastq.aligned.bam
samtools view NA12878J_HiSeqX_R1_50mil.fastq.aligned.bam -O cram > NA12878J_HiSeqX_R1_50mil.fastq.aligned.cram
Pil used PIL_COMPRESS_RC_QUAL and PIL_COMPRESS_RC_BASES codecs for per-base quality scores and bases, respectively. Sequence names were tokenized by : (colon) and partitioned into ColumnStores without additional processing.
Pil-65536 was run with the same settings but the RecordBatch size set to 65536 instead of 8192.
