read compressed alignment files? #323

cmccoy · 2013-11-19T18:22:32Z

A low-priority request/question: can pplacer and friends be made to read compressed alignment files (e.g., .gz or .bz2) natively? This could potentially save a lot of room on disk for work in progress.

matsen · 2013-11-18T19:12:15Z

Would Zip suit? I know it's not hip, even a little bit, but we are already using it.

https://github.com/matsen/pplacer/blob/dev/pplacer_src/refpkg_parse.ml

cmccoy · 2013-11-18T19:16:15Z

camlzip supports zip and gzip; we could use that without adding
dependencies.

matsen · 2013-11-18T19:23:05Z

Ah, nice. I'll have a go on this some afternoon.

nhoffman · 2013-11-18T22:30:16Z

gzip would probably be preferable for single files if available

matsen · 2013-11-19T01:02:06Z

Camlzip knows how to read bytes, characters, and sets thereof. When we read in fasta (e.g.) files, they get read in line by line and tokenized (see ppatteries for the definition of gen_parsers). It's important for the tokenize functions that things arrive a line at a time. We could read chunks of our compressed file at a time (say, 80 chars), look for newlines in them and spit out an Enum of lines as the newlines appear. Does that seem reasonable? Will that be sufficiently efficient?
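The chunk-and-split idea above can be sketched with the standard library alone. This is a minimal illustration, not pplacer's code: `lines_of_chunks` and `reader_of_string` are hypothetical names, the result is a `Seq.t` rather than a Batteries Enum, and the `read` argument is assumed to follow the same contract as `Stdlib.input` (it returns the number of bytes read, with 0 meaning end of input). A decompressing reader could be passed in its place.

```ocaml
(* Turn a chunked reader into a lazy sequence of lines.
   [read buf pos len] is an assumed reader with the contract of Stdlib.input:
   it returns the number of bytes read, and 0 at end of input. *)
let lines_of_chunks ?(chunk_size = 4096) (read : bytes -> int -> int -> int)
    : string Seq.t =
  let chunk = Bytes.create chunk_size in
  let pending = Buffer.create chunk_size in (* partial line carried between chunks *)
  let queue = Queue.create () in            (* complete lines waiting to be emitted *)
  let eof = ref false in
  let refill () =
    let n = read chunk 0 chunk_size in
    if n = 0 then begin
      eof := true;
      (* flush a final line that has no trailing newline *)
      if Buffer.length pending > 0 then begin
        Queue.add (Buffer.contents pending) queue;
        Buffer.clear pending
      end
    end else
      for i = 0 to n - 1 do
        match Bytes.get chunk i with
        | '\n' ->
          Queue.add (Buffer.contents pending) queue;
          Buffer.clear pending
        | c -> Buffer.add_char pending c
      done
  in
  let rec next () =
    if not (Queue.is_empty queue) then Seq.Cons (Queue.pop queue, next)
    else if !eof then Seq.Nil
    else begin
      refill ();
      next ()
    end
  in
  next

(* Hypothetical stand-in for a decompressing reader: serves bytes from a string. *)
let reader_of_string s =
  let pos = ref 0 in
  fun buf off len ->
    let n = min len (String.length s - !pos) in
    Bytes.blit_string s !pos buf off n;
    pos := !pos + n;
    n
```

The sequence is stateful, so it should be consumed only once. On efficiency: a chunk of a few kilobytes keeps the number of reader calls low, and the per-character loop is linear in the input size; whether that is fast enough for large alignments would need measuring.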

cmccoy · 2013-11-19T01:19:07Z

Maybe we could hook into the Batteries I/O interface? IO.create_in (for camlzip in_channel) combined with IO.lines_of would give an Enum of strings.
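A rough sketch of that suggestion, assuming recent camlzip and Batteries releases in which both `Gzip.input` and the `~input` callback of `BatIO.create_in` work on `bytes` (older releases used `string`). The function names `gzip_input` and `lines_of_gzip` are made up for illustration and are not taken from the pplacer source.

```ocaml
(* Requires the camlzip and batteries libraries. *)

(* Wrap a camlzip gzip channel as a Batteries input.
   Batteries signals end of input with BatIO.No_more_input, so the
   camlzip end-of-file conditions are translated to that exception. *)
let gzip_input (filename : string) : BatIO.input =
  let ic = Gzip.open_in filename in
  BatIO.create_in
    ~read:(fun () ->
      try Gzip.input_char ic
      with End_of_file -> raise BatIO.No_more_input)
    ~input:(fun buf pos len ->
      let n = Gzip.input ic buf pos len in
      (* Gzip.input returns 0 at end of file *)
      if n = 0 && len > 0 then raise BatIO.No_more_input else n)
    ~close:(fun () -> Gzip.close_in ic)

(* An Enum of decompressed lines, ready for a line-based tokenizer. *)
let lines_of_gzip (filename : string) : string BatEnum.t =
  BatIO.lines_of (gzip_input filename)
```

With this shape the line splitting is left to Batteries, so no hand-written newline scanning is needed.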

matsen · 2013-11-19T03:50:42Z

Excellent!

cmccoy · 2013-11-20T01:40:31Z

This is working for me on the microbiome demo - sequence files ending in .gz get decompressed with gzip.

I added .jplace.gz compression support while I was there. Doing so required changing the JSON parser from acting on raw input channels to a wrapped Batteries IO.input (see b4e10c4) or IO.output (see d4701a0).
I'm hoping that doesn't incur any serious performance overhead. Happy to test more or drop those parts.

matsen · 2013-11-20T01:40:48Z

Whiplash. Nice work, and glad to see tests.

ghost assigned matsen Nov 19, 2013

Connor McCoy added 4 commits November 19, 2013 09:22

Support gzip-compressed alignment files

bf1b6c8

e.g. `pplacer -c vaginal_16s.refpkg src/p4z1r36.fasta.gz` generates p4z1r36.jplace

Support reading compressed .jplace files

b4e10c4

Support writing compressed .jplace files

d4701a0

checked via: pplacer -p -c vaginal_16s.refpkg src/p4z1r36.fasta.gz -o test.jplace.gz

Remove duplicate function chop_suffix_if_present

1bb2665

Same as Ppatteries.safe_chop_suffix

matsen added a commit that referenced this pull request Nov 20, 2013

Merge pull request #323 from matsen/323-compressed-alignment-files

2ae1c2f

read compressed alignment files?

matsen merged commit 2ae1c2f into dev Nov 20, 2013

cmccoy deleted the 323-compressed-alignment-files branch November 20, 2013 03:39
