This repository has been archived by the owner on Sep 15, 2021. It is now read-only.

Implement a read + variant streaming specific heuristic for calculating shard sizes #62

Open
dionloy opened this issue Sep 30, 2015 · 11 comments

Comments

@dionloy

dionloy commented Sep 30, 2015

Proposal:

Flags
--calculateShardSizeBasedOnCoverage (default true)
--basesPerShard (overrides if above is false)

Reads

  • get coverage stats
  • throughput ~= 1M reads / second / stream (dependent on INFO tag sizes and on whether some fields are excluded).

Variants

  • sample variant * call size; guesstimate in MB based on INFO tags, etc. (gRPC message size?).
  • throughput: ~15 MB / sec (this sounds a bit low compared to our earlier numbers, will have to double-check).
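A minimal sketch of how such a heuristic could combine these numbers (the class name, constants, and coverage inputs below are illustrative assumptions for discussion, not this project's actual API). If --calculateShardSizeBasedOnCoverage were false, --basesPerShard would simply override the computed value.

```java
/**
 * Illustrative sketch of the proposed heuristic: pick a shard size (in bases)
 * so that one stream can be consumed in roughly a target wall-clock time,
 * given coverage statistics and an assumed per-stream throughput.
 * All names and constants here are assumptions, not project API.
 */
public class ShardSizeHeuristic {

  // Assumed per-stream throughput from the proposal: ~1M reads / second / stream.
  private static final double READS_PER_SECOND_PER_STREAM = 1_000_000.0;

  /**
   * @param meanCoverage          average read depth (reads covering a base)
   * @param meanReadLength        average read length in bases
   * @param targetSecondsPerShard how long one worker should spend streaming one shard
   * @return suggested shard size in bases
   */
  public static long basesPerShard(
      double meanCoverage, double meanReadLength, double targetSecondsPerShard) {
    // Reads overlapping a window of B bases ~= coverage * B / readLength.
    double readsPerBase = meanCoverage / meanReadLength;
    double readsPerShard = READS_PER_SECOND_PER_STREAM * targetSecondsPerShard;
    return Math.max(1L, (long) (readsPerShard / readsPerBase));
  }

  public static void main(String[] args) {
    // Example: 30x WGS, 100bp reads, aim for ~60 seconds of streaming per shard.
    System.out.println(basesPerShard(30.0, 100.0, 60.0)); // ~200,000,000 bases
  }
}
```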
@deflaux
Contributor

deflaux commented May 18, 2016

Per https://cloud.google.com/genomics/reference/rpc/google.genomics.v1#google.genomics.v1.StreamReads, "strict" shard semantics are now supported server-side. It's advisable to use small shards, since the current retry semantics will no longer apply.

For overlapping semantics or custom sources (which split shards dynamically), the current sharding approach should still be used.
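For context, "small, strict shards" here could be as simple as cutting each reference into fixed-width, non-overlapping ranges before issuing StreamReads requests. The sketch below is only illustrative; the Shard holder class and the 1 Mbp default are assumptions, not classes from this repository.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of cutting a reference region into small, non-overlapping ("strict")
 * shards of a fixed width. Names and the default width are illustrative.
 */
public class StrictShards {

  public static final long DEFAULT_SHARD_BASES = 1_000_000L; // keep shards small

  public static final class Shard {
    public final String referenceName;
    public final long start; // inclusive, 0-based
    public final long end;   // exclusive

    public Shard(String referenceName, long start, long end) {
      this.referenceName = referenceName;
      this.start = start;
      this.end = end;
    }
  }

  /** Splits [start, end) on a reference into contiguous shards of at most shardBases. */
  public static List<Shard> split(String referenceName, long start, long end, long shardBases) {
    List<Shard> shards = new ArrayList<>();
    for (long s = start; s < end; s += shardBases) {
      shards.add(new Shard(referenceName, s, Math.min(s + shardBases, end)));
    }
    return shards;
  }
}
```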

@pgrosu

pgrosu commented May 18, 2016

Hi Dion (@dionloy),

What would be the average number of parallel gRPC streams one can assume at one time for a collection of data? The reason I ask is that, looking at the total read count in 1000 Genomes based on this link, it is 2.47123*(10^14):

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index

Thus, if we had 100 parallel gRPC streams at 10^6 reads per stream per second, it would take almost 29 days to process. Basically, sequencing is only going to grow, and integrative analysis across multiple datasets is usually assumed.
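For reference, the arithmetic behind the ~29-day figure, using the numbers above:

```java
public class StreamingTimeEstimate {
  public static void main(String[] args) {
    double totalReads = 2.47123e14;     // total reads in 1000 Genomes (from the index file above)
    double streams = 100;               // parallel gRPC streams
    double readsPerStreamPerSec = 1e6;  // assumed per-stream throughput
    double seconds = totalReads / (streams * readsPerStreamPerSec);
    System.out.printf("%.1f days%n", seconds / 86_400); // ~28.6 days
  }
}
```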

Thanks,
Paul

@dionloy
Author

dionloy commented May 18, 2016

I don't have exact numbers, but we were running a Dataflow over the Autism Speaks MSSNG database of around 5000 WGS with 500 workers at any given time and pulled the whole set down in about 8 hours. I'm guessing your 10^6 reads / stream / second is underestimating the throughput.

(And even at 8 hours I think we can improve the performance significantly; it was much lower than I had originally anticipated.)


@pgrosu

pgrosu commented May 19, 2016

That sounds more reasonable, but could you give it a try with 2 million gRPC connections per worker as a comparison? Basically, the analysis will change significantly if we have 10^9 gRPC connections (yes, that would be a billion connections overall). Do you think we can scale to that, given that the number of workers is limited to 1000 with 4 cores each for Dataflow?

Thanks,
Paul

@dionloy
Author

dionloy commented May 19, 2016

Hi Paul, no, we do not have the capacity to scale to a billion overall connections. On a Linux machine, I'm guessing you're theoretically restricted to ~64K connections, even assuming zero I/O contention on your own network.

@pgrosu

pgrosu commented May 19, 2016

Hi Dion,

I am assuming this would be through GCE to take advantage of the fast internal transfer rates. Since GCE instances are VMs, additional NICs can be added, as the maximum of 65,536 connections is per IP address. Then if we have 1,000 GCE instances with 16 NICs each, we can reach 1 billion connections with only 62,500 concurrent connections per IP. The alternative is to use IP aliasing to assign 16 IP addresses to a single NIC. I am assuming you can change the bandwidth limit of the NICs for a VM. What are the upper limits for VM<->GS and VM<->REST_Server transmission? Previously it sounded like we could have tens of thousands of parallel RPC calls at the same time from several tens of thousands of workers:

#9 (comment)
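Spelled out, the connection budget being proposed above (numbers taken from this comment; whether GCE actually allows this many NICs or aliased IPs per VM is an open assumption):

```java
public class ConnectionBudget {
  public static void main(String[] args) {
    long instances = 1_000;      // GCE VMs
    long nicsPerInstance = 16;   // NICs or aliased IPs per VM
    long connsPerIp = 62_500;    // concurrent connections per IP, under the ~65,536 port limit
    System.out.println(instances * nicsPerInstance * connsPerIp); // 1,000,000,000
  }
}
```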

If need be, we can work to make the transfer even faster by compressing the transmission and/or generating replicated, inverted-indexed data structures for the more frequently accessed fields in GS buckets.

All I am saying is that we can approach the same response time for next-generation disease discovery by using the same concepts that have been implemented for Google Search (i.e., find me a similar variant that has previously been used for a differently labeled disease, as new data is constantly streaming in). Otherwise we have the same tools we have had for years, just getting the same results a little faster, rather than changing the way analysis is performed to enable a dramatic shift in how Personalized Medicine analysis can be done compared to today.

Paul

@dionloy
Author

dionloy commented May 19, 2016

Indeed, the scaling should just be by machine (though I think by then you will be CPU-limited for that number of connections). On the server side we cannot support a billion connections at once (again, it comes down to the number of machines deployed), but thousands should definitely be supported (the comment you quoted was referring to REST calls, not the persistent streaming calls, which carry a higher overhead).

@pgrosu

pgrosu commented May 19, 2016

Hi Dion,

It's too bad that the server cannot handle 1 billion connections, but I think I have a way this can still be achieved. After importing the data through the Google Genomics API, you then just export the JSON-formatted pages into GS buckets. In fact, you don't even have to store them; you can have them written directly to GS storage in JSON format.

Then with 1,000 GCE instances, each having 1,000,000 simultaneous connections to the GS buckets, that would yield the 1 billion mark. What is the read/write I/O performance from a GCE instance to a GS bucket? I just benchmarked 1 million concurrent connections on my laptop using NGINX as a server - so that it is event-driven - and my CPU kept hovering at 40%. I think this is similar to what is suggested here:

https://cloud.google.com/solutions/https-load-balancing-nginx

In fact, I can make it even faster by bypassing the kernel's TCP/IP stack and reading the packets directly in user mode with a driver I would write, plus a few other optimizations that would take too long to detail here. So on a 4-core system, if I boot the kernel with the maxcpus=1 boot parameter, I can pin packet processing to another core with the taskset affinity command, leaving the remaining two cores for performing genomic analysis on the reads.

Thus, for 1000 Genomes with a total of 2.47123E+14 reads at 1 million reads/second per connection - assuming the direct GS access is at least that fast - the above setup would process all the reads in less than a second, which is the same response time as a Google Search.
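Under those (very optimistic) assumptions, the arithmetic works out as follows:

```java
public class IdealCaseEstimate {
  public static void main(String[] args) {
    double totalReads = 2.47123e14;           // 1000 Genomes total reads
    double connections = 1_000 * 1_000_000.0; // 1000 GCE instances x 1M connections each
    double readsPerConnPerSec = 1e6;          // assumed per-connection throughput
    System.out.println(totalReads / (connections * readsPerConnPerSec)); // ~0.25 seconds
  }
}
```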

Let me know what you think.

Thanks,
Paul

@dionloy
Author

dionloy commented May 19, 2016

Paul, I don't dispute that it's possible; all it takes are machines and $ to serve it, whether through Genomics, Cloud Storage, or Xyz =).

@pgrosu

pgrosu commented May 22, 2016

Hi Dion,

As they say, if you build it, they will come :) In the meantime, I will try to figure out a solution for this, and please let me know any other way I can help you out.

Thank you and have a great weekend!
Paul

@calbach
Contributor

calbach commented May 31, 2016

Looks like we will need a version bump here in order to pull in the StreamReads sharding support.
