
CRAM support #546

Closed
keiranmraine opened this Issue Dec 10, 2014 · 22 comments

@keiranmraine (Contributor) commented Dec 10, 2014

Hi,

We are being actively 'encouraged' to migrate from BAM to CRAM for read storage, yet none of the popular genome browsers (e.g. Biodalliance, GBrowse, IGV, JBrowse) currently lists it as a supported format (IGV indicates it will once Picard does).

As we already use JBrowse we would be very happy to see support for this.

CRAM index files are also compressed, so they will be faster to transfer.

I noticed there wasn't a ticket for this so thought it was worth raising.

Regards,
Keiran

P.S. I can provide test files and help test if needed.

@selewis commented Dec 10, 2014

+1

Same here: CRAM.


@scottcain (Member) commented Jun 11, 2015

Hi @keiranmraine. Do you happen to know if anybody has written a JS library that supports pulling CRAM regions? If something like that exists, pulling it into JBrowse would probably not be too hard (written as someone who wouldn't be doing the work :-)

@cmdcolin (Contributor) commented Jun 11, 2015

Maybe use emscripten to compile htslib to JavaScript?

@keiranmraine (Contributor, Author) commented Jun 12, 2015

Hi Scott,

The only person I can think of who may have started this is Thomas Down, the lead on Biodalliance (who I believe shared the BAM code with Robert):

http://www.biodalliance.org/

Keiran

@keiranmraine (Contributor, Author) commented Jun 23, 2015

If any progress is made please consider me a very enthusiastic tester.

@selewis commented Jun 23, 2015

Not so far. But personally I'd like to put this very high on the list of new feature priorities.

We're going to need it yesterday.

-S


@ihh (Member) commented Jun 24, 2015

I guess we could try the emscripten route, as Colin suggests. zlib is the only htslib dependency and it has been successfully compiled to JS with emscripten.
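A rough sketch of what that build might look like (assuming the emscripten SDK is on the PATH; the exported names are real htslib entry points, but this is untested and a browser client would need a much larger binding surface):

# build htslib's static library under emscripten; USE_ZLIB pulls in
# emscripten's own zlib port, which is htslib's only dependency
git clone https://github.com/samtools/htslib && cd htslib
emmake make libhts.a CFLAGS='-O2 -s USE_ZLIB=1'

# link the static library to JavaScript, exporting a few C entry points
emcc -O2 -s USE_ZLIB=1 libhts.a -o hts.js \
    -s EXPORTED_FUNCTIONS='["_hts_open","_sam_read1","_hts_close"]'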

@selewis commented Jul 13, 2015

CRAM is high on our list. We have to keep up with the times - and clearly for very practical reasons.

How much (conservatively/realistically) effort do you think this might take?

On Mon, Jul 13, 2015 at 4:40 PM, Colin Diesh notifications@github.com wrote:

I wanted to add more interest in getting CRAM support.

We are experiencing problems with some very big BAM files (upwards of about 30GB). With these huge files, the tracks simply do not load 90% of the time. I don't know exactly where the bottleneck is, but this happens on standard Apache servers, up-to-date nginx, and WebApollo servlets, so it's not really specific to the server. The failures on the client side appear to be due to request timeouts.

Here are the statistics for our files:

BAM files
13G
19G
21G
22G
20G
35G * +
35G * +
42G +

  • * = the two different 35GB BAM files
  • + = track fails to load probably more than 50% of the time, although it sometimes does; other tracks almost never fail

If we had CRAM, the files might at least be a bit smaller :)



@cmdcolin (Contributor) commented Jul 14, 2015

I'd say in the best case it would take one to two months of futzing around in emscripten. I would rather go that route than reimplement the library, at least given the other constraints we have (i.e., working on other features in parallel). It might even be worth asking the htslib developers whether they have considered compiling with emscripten.

@keiranmraine (Contributor, Author) commented Jul 14, 2015

@cmdcolin we regularly use BAMs in excess of 100GB (human, 30x+) and we don't see timeouts. The main bottleneck we see is index retrieval, as indexes can get into the 25MB+ range quite quickly.

One thing one of our developers noticed is that our proxy may be hiding some of the lag: when it is active, all requests are actioned in parallel, but when left to the browser alone only 4(?) are triggered at a time. @mcast, can you chime in here, as I'm fuzzy on the detail and may have this wrong.

@mcast commented Jul 14, 2015


What I initially believed I observed, using Firefox's developer tools, was:

Firefox (approx. v32) limits the number of connections to each server to some small number, but is otherwise happy to make many simultaneous fetches.[1]

When configured to use a proxy, Firefox seems to delegate all(?) of this stampede-prevention to the proxy, and issues many requests at once. Our local Squid is more eager to fetch many streams at once, and I thought I had seen it open 30 parallel connections to local webservers. This is the origin of my expectation that using a proxy may increase performance for large numbers of fetches from one HTTP server.

Then I saw the developer tools claiming that over 100 connections were launched simultaneously, and returned simultaneously; but actually the view on screen was not keeping up with the timeline. I stopped trusting it as a diagnostic tool for this kind of information, and I think Chromium's similar tool may be more reliable.

What I haven't done is an objective comparison of these timelines as seen from the HTTP server; I lack a good tool for visualising the request/response objects on a scrolly, zoomy timeline.

Matthew

[1] This is the reason for (e.g.) the {s} placeholder in OpenStreetMap tile-server URLs (https://en.wikipedia.org/wiki/Leaflet_%28software%29#Use), so the client may substitute {s} from ("a", "b", "c") to get three sets of streams of traffic going.


@cmdcolin (Contributor) commented Jul 14, 2015

Thanks for the detailed response. This is somewhat analogous to what I've seen with the parallel connections being launched, and these are hard to debug. They are launched after the .bai file download, but they all return timeouts. It is interesting to see that you have gotten 100GB files to work. I might try to debug some more and see if it's possible to fix my case.

@mcast commented Jul 14, 2015


For production we have nginx with an unremarkable configuration.

For some testing I also used Starman plus https://github.com/mcast/plack-middlewares-jbrowse to implement the necessary subset of 206 Partial Content. This may prove easier to instrument? It is not a polished bit of code though.
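For reference, a quick way to check that a server honours byte-range requests, which is what JBrowse's BAM fetches rely on (the URL here is a placeholder):

# ask for the first kilobyte only; a server that supports ranges answers
# "HTTP/1.1 206 Partial Content" with a matching Content-Range header
curl -sS -D - -o /dev/null -H 'Range: bytes=0-1023' \
    http://example.org/data/reads.bam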

Matthew


@keiranmraine (Contributor, Author) commented Oct 27, 2015

We're starting to get poked with pointy sticks about switching to CRAM; is there any update on this?

I have a suggestion for a half-way house, but I would need some assistance modifying the adaptor as it's not my area of expertise.

Basically, my thought was that we could put a lightweight CGI server on the same host as the JBrowse system. The JBrowse adaptor would be mostly the existing BAM one, except that it would pass the requested coordinate range in the request instead of the byte range, and expect to receive a compressed BAM stream as before (it won't need to download the index), e.g.

# default compression
% time bash -c "curl -sS 'http://localhost:8080/test.cgi?loc=2:1..5000000;cram=test;zc=0' | pv | (samtools view -c - > tmp)"; echo -n 'Reads: '; cat tmp
3.31MiB 0:00:01 [2.84MiB/s] [ <=> ]

real    0m1.172s
user    0m0.090s
sys 0m0.040s
Reads: 68359

# fast compression
% time bash -c "curl -sS 'http://localhost:8080/test.cgi?loc=2:1..5000000;cram=test;zc=1' | pv | (samtools view -c - > tmp)"; echo -n 'Reads: '; cat tmp
 4.2MiB 0:00:00 [9.34MiB/s] [ <=> ]

real    0m0.458s
user    0m0.099s
sys 0m0.056s
Reads: 68359                             
  • loc = genomic range in the form chr:start..stop or chr:start-stop
  • zc = compression level of the returned BAM stream, 1=fast, 0/unset=default
  • cram = dataset identifier

zc is available because setting it reduces CPU load on the server, though it increases I/O.

My proof of concept is below; I was just using HTTPi for the server.

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $ok  = '200 OK';
my $bad = '400 Bad Request';
my $page_state = $ok;

my $param_arr = chk_params(CGI->new);

# HTTPi expects the script to emit its own status line
print "HTTP/1.1 $page_state\n" if($ENV{SERVER_SOFTWARE} =~ m/^HTTPi/);
if($page_state eq $ok) {
  print "Content-Type: application/x-gzip\n\n";
  $|++; # unbuffered, so the BAM stream flows as samtools produces it
  open(my $pipe, '-|', cram_to_bam($param_arr)) || die $!;
  binmode $pipe;   # binary BAM byte stream
  binmode STDOUT;
  while(<$pipe>) { print $_; }
  close $pipe || die $!;
}
else {
  print "Content-Type: text/html\n\n";
  print <<"END_MESSAGE";
<html>
  <head>
    <title>$bad</title>
  </head>
  <body>
    <h1>$bad</h1>
    <p>Expected params not found in request</p>
    <ul>
      <li>loc = genomic range in the form chr:start..stop or chr:start-stop</li>
      <li>zc = compression level of returned BAM stream, 1=fast, 0/unset=default</li>
      <li>cram = dataset identifier</li>
    </ul>
  </body>
</html>
END_MESSAGE
}

sub chk_params {
  my $q = shift;
  my @req_params;
  push @req_params, 'genome.fa';
  # '-1' selects samtools' fast compression; an empty string means default
  my $zc = $q->param('zc');
  push @req_params, (defined $zc && $zc eq '1') ? '-1' : q{};
  # only accept dataset identifiers free of shell metacharacters
  my $cram = $q->param('cram');
  push @req_params, $cram if($cram && $cram !~ m/[\$;\@!*`()]/);
  my $range = $q->param('loc');
  if(defined $range) {
    $range =~ s/[.]{2}/-/; # normalise chr:start..stop to chr:start-stop
    push @req_params, $range;
  }
  # all four elements (reference, zc flag, cram id, range) must be present
  $page_state = $bad unless(@req_params == 4);
  return \@req_params;
}

sub cram_to_bam {
  # decode the CRAM slice against the reference and re-emit it as BAM
  return sprintf 'samtools view -b -T /var/tmp/%s %s /var/tmp/%s.cram %s', @{ $_[0] };
}
@cmdcolin (Contributor) commented Nov 2, 2015

Hi @keiranmraine. That looks interesting, but it would probably still need some extra work; it probably can't be used as a drop-in solution. A custom storeClass that does not perform range requests would probably need to be made.

I am still hopeful for a client-side CRAM solution, so that on-the-fly CRAM-to-BAM conversion is not necessary!

@keiranmraine (Contributor, Author) commented Jun 2, 2017

Hi @cmdcolin. GA4GH is working heavily on a new streaming API for sequence reads. The idea is that the server-side format is irrelevant: you request SAM/BAM/CRAM at the client and receive a stream for the coordinate range requested.

The specification is here: https://samtools.github.io/hts-specs/htsget.html

I'm following up to see if there's a public implementation (or code you can run a test server with).
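A sketch of what a query might look like (the host and dataset id are placeholders; note that htsget uses 0-based, half-open coordinates):

# request reads for 2:0-5000000 as BAM from a hypothetical htsget endpoint
curl -sS 'https://htsget.example.org/reads/test?format=BAM&referenceName=2&start=0&end=5000000'

# the server replies with a JSON "ticket" whose urls[] entries are fetched
# in order and concatenated to form a valid BAM stream, e.g.
# {"htsget":{"format":"BAM","urls":[{"url":"https://..."},...]}}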

@keiranmraine (Contributor, Author) commented Jun 2, 2017

@rbuels

This comment has been minimized.

Copy link
Collaborator

rbuels commented Jan 30, 2018

@keiranmraine (Contributor, Author) commented Jan 31, 2018

If useful: human chr21 CRAM with an embedded reference: ftp://ngs.sanger.ac.uk/production/cancer/dockstore/cgpmap/insilico_21.cram

@rbuels rbuels added this to the 1.15.0 milestone Apr 17, 2018

@rbuels rbuels self-assigned this Jun 30, 2018

@GMOD GMOD deleted a comment from mcast Jul 11, 2018

@rbuels rbuels added the big task label Jul 11, 2018

@rbuels rbuels closed this Jul 18, 2018

@nathandunn (Contributor) commented Jul 18, 2018

👍

@rbuels (Collaborator) commented Jul 18, 2018

Implemented in PR #1120.
