
read compressed alignment files? #323

Merged: 4 commits, Nov 20, 2013

Conversation

cmccoy
Collaborator

@cmccoy cmccoy commented Nov 19, 2013

A low priority request/question: can pplacer and friends be made to read compressed alignment files (eg, .gz or .bz2) natively? This could potentially save a lot of room on disk for work in progress.

@matsen
Owner

matsen commented Nov 18, 2013

Would Zip suit? I know it's not hip, even a little bit, but we are already using it.

https://github.com/matsen/pplacer/blob/dev/pplacer_src/refpkg_parse.ml

@cmccoy
Collaborator

cmccoy commented Nov 18, 2013

camlzip supports zip and gzip; we could use that without adding
dependencies.


@matsen
Owner

matsen commented Nov 18, 2013

Ah, nice. I'll have a go on this some afternoon.


Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

@nhoffman
Collaborator Author

gzip would probably be preferable for single files if available

@matsen
Owner

matsen commented Nov 19, 2013

Camlzip knows how to read bytes, characters, and sets thereof. When we read in files such as FASTA, they are read line by line and tokenized (see ppatteries for the definition of gen_parsers). The tokenize functions need their input to arrive a line at a time. We could read the compressed file in chunks (say, 80 chars), look for newlines in them, and emit an Enum of lines as the newlines appear. Does that seem reasonable? Will it be efficient enough?
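
For reference, a minimal sketch of the chunk-and-split idea, using only the OCaml standard library. It is not pplacer code: `lines_of_chunks`, `read_chunk`, and `reader_of_string` are hypothetical names, and the result is a stdlib `Seq.t` standing in for a Batteries `Enum`. The callback is assumed to fill the buffer from offset 0 and return 0 at end of input, which is the convention a decompressing reader could be adapted to.

```ocaml
(* Turn a chunk reader into a lazy, single-pass sequence of lines.
   [read_chunk buf] is a hypothetical callback: it fills [buf] from
   offset 0 and returns the number of bytes written, or 0 at end of input. *)
let lines_of_chunks ?(chunk_size = 80) (read_chunk : Bytes.t -> int)
  : string Seq.t =
  let chunk = Bytes.create chunk_size in
  let pending = Buffer.create chunk_size in (* current partial line *)
  let ready = Queue.create () in            (* complete lines not yet emitted *)
  let eof = ref false in
  let rec refill () =
    if Queue.is_empty ready && not !eof then begin
      let n = read_chunk chunk in
      if n = 0 then begin
        eof := true;
        (* a last line without a trailing newline still counts *)
        if Buffer.length pending > 0 then begin
          Queue.add (Buffer.contents pending) ready;
          Buffer.clear pending
        end
      end else begin
        for i = 0 to n - 1 do
          match Bytes.get chunk i with
          | '\n' ->
            Queue.add (Buffer.contents pending) ready;
            Buffer.clear pending
          | c -> Buffer.add_char pending c
        done;
        (* keep reading until at least one full line is available *)
        refill ()
      end
    end
  in
  let rec next () =
    refill ();
    if Queue.is_empty ready then Seq.Nil
    else Seq.Cons (Queue.pop ready, next)
  in
  next

(* In-memory stand-in for a compressed-file reader, for experimenting. *)
let reader_of_string (s : string) : Bytes.t -> int =
  let pos = ref 0 in
  fun buf ->
    let n = min (Bytes.length buf) (String.length s - !pos) in
    Bytes.blit_string s !pos buf 0 n;
    pos := !pos + n;
    n
```

The sequence is stateful, so it should be consumed once, front to back. Cost is one pass over each chunk plus one copy per line, so a larger `chunk_size` mainly reduces the number of reader calls.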

@ghost ghost assigned matsen Nov 19, 2013
@cmccoy
Collaborator

cmccoy commented Nov 19, 2013

Maybe we could hook into the Batteries I/O interface? IO.create_in (for camlzip in_channel) combined with IO.lines_of would give an Enum of strings.
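
A rough sketch of what that hookup might look like, assuming the camlzip `Gzip` module and Batteries `BatIO` are both available. The names `gzip_input` and `gzip_lines` are made up for illustration and are not taken from the pplacer source. Recent releases of both libraries use `Bytes.t` for the buffer argument; older ones used `string`, so the `~input` callback may need adjusting.

```ocaml
(* Sketch only: requires the camlzip and batteries libraries.
   Function names here are hypothetical, not pplacer's. *)

(* Wrap a gzip-compressed file as a Batteries input. *)
let gzip_input (path : string) : BatIO.input =
  let gz = Gzip.open_in path in
  BatIO.create_in
    ~read:(fun () ->
      (* Gzip signals end of data with End_of_file; BatIO expects
         No_more_input instead. *)
      try Gzip.input_char gz
      with End_of_file -> raise BatIO.No_more_input)
    ~input:(fun buf pos len ->
      let n = Gzip.input gz buf pos len in
      (* Gzip.input returns 0 at end of file *)
      if n = 0 && len > 0 then raise BatIO.No_more_input else n)
    ~close:(fun () -> Gzip.close_in gz)

(* An Enum of lines from a gzip-compressed file. *)
let gzip_lines (path : string) : string BatEnum.t =
  BatIO.lines_of (gzip_input path)
```

With something like this in place, the existing line-based tokenizers could consume `gzip_lines path` the same way they consume lines from an uncompressed file.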

@matsen
Owner

matsen commented Nov 19, 2013

Excellent!


Connor McCoy added 4 commits November 19, 2013 09:22
e.g. `pplacer -c vaginal_16s.refpkg src/p4z1r36.fasta.gz` generates
p4z1r36.jplace

checked via:

    pplacer -p -c vaginal_16s.refpkg src/p4z1r36.fasta.gz -o test.jplace.gz

Same as Ppatteries.safe_chop_suffix
@cmccoy
Collaborator

cmccoy commented Nov 20, 2013

This is working for me on the microbiome demo - sequence files ending in .gz get decompressed with gzip.

I added .jplace.gz compression support while I was there. Doing so required changing the JSON parser from acting on raw input channels to a wrapped Batteries IO.input (see b4e10c4) or IO.output (see d4701a0).
I'm hoping that doesn't incur any serious performance overhead. Happy to test more or drop those parts.
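
To illustrate the output side, here is a hedged sketch of choosing a writer by file suffix and exposing it as a Batteries output. It is not the code from the commits above: `open_out_maybe_gz` is a hypothetical name, and the sketch assumes camlzip's `Gzip` module and Batteries `BatIO` with `Bytes.t` buffers.

```ocaml
(* Sketch only: requires the camlzip and batteries libraries.
   [open_out_maybe_gz] is a hypothetical helper, not pplacer's. *)
let open_out_maybe_gz (path : string) : unit BatIO.output =
  if Filename.check_suffix path ".gz" then begin
    let gz = Gzip.open_out path in
    BatIO.create_out
      ~write:(fun c -> Gzip.output_char gz c)
      ~output:(fun buf pos len ->
        Gzip.output gz buf pos len;
        len) (* Gzip.output writes the whole range *)
      (* Deliberately a no-op: Gzip.flush finishes the compressed
         stream, so it must not be called before the final close. *)
      ~flush:(fun () -> ())
      ~close:(fun () -> Gzip.close_out gz)
  end else
    (* Plain file: wrap the ordinary channel and close it with the output. *)
    BatIO.output_channel ~cleanup:true (open_out path)
```

A caller would write through the returned value with the usual `BatIO` functions and finish with `BatIO.close_out`, which is the point at which the gzip trailer gets written.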

@matsen
Owner

matsen commented Nov 20, 2013

Whiplash. Nice work, and glad to see tests.

matsen added a commit that referenced this pull request Nov 20, 2013
@matsen matsen merged commit 2ae1c2f into dev Nov 20, 2013
@cmccoy cmccoy deleted the 323-compressed-alignment-files branch November 20, 2013 03:39