performance claims #60
Hey Shane, thanks for opening this issue - I welcome inquiries on performance because I agree it is surprising. I'd be happy for you to explore my benchmarks and compare them for yourself. If you'd like to explore the algorithms as well and potentially benchmark the individual components, you can see those too.

**bedops version**

I did not change any settings from the default bedops that I downloaded. Version as reported by bedops:

```
citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
https://doi.org/10.1093/bioinformatics/bts277
version: 2.4.41 (typical)
authors: Shane Neph & Scott Kuehn
```

**File sizes in figures 1 and 2**

You're correct that the interval counts are not enormous for the benchmarks in figures 1 and 2 (measured at 200,000 intervals), but I wanted to select an interval-set size that is similar to the sizes I work with regularly and is also appropriate for measuring algorithmic, IO, and memory performance. gia isn't just a stream-focused tool, so at those interval sizes I am using an in-place version (where all intervals are loaded into memory first). Currently I only have a streamable intersect.

**Large inputs**

However, for the streamed function I do have (intersect), we can test your claim that the results do not generalize. In the paper I show input sizes up to 10,000,000, which I did just for practical reasons of running each command 30 or so times to get some statistics. I'll do a more thorough analysis soon with multiple runs instead of a single timestamp, but for a single run we can get an estimate of runtime and memory overhead. I'll use GNU time so we can see memory usage as well as runtime.

**100,000,000 random intervals**

```bash
# generate the random bed files
gia random -n 100000000 | gia sort -o random_large_a.bed
gia random -n 100000000 | gia sort -o random_large_b.bed
# get an estimate for bedops
time bedops -i random_large_a.bed random_large_b.bed > /dev/null
# get an estimate for streamed gia
time gia intersect -S -a random_large_a.bed -b random_large_b.bed -o /dev/null
```
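Note that the shell built-in `time` reports runtime only. To also capture the memory usage mentioned above, GNU time can be invoked explicitly - a minimal sketch, assuming it is installed at /usr/bin/time:

```bash
# GNU time's -v (verbose) report includes "Maximum resident set size",
# giving peak memory alongside wall-clock and CPU time.
/usr/bin/time -v bedops -i random_large_a.bed random_large_b.bed > /dev/null
/usr/bin/time -v gia intersect -S -a random_large_a.bed -b random_large_b.bed -o /dev/null
```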
**1,000,000,000 random intervals**

```bash
# generate the random bed files
gia random -n 1000000000 | gia sort -o random_very_large_a.bed
gia random -n 1000000000 | gia sort -o random_very_large_b.bed
# get an estimate for bedops
time bedops -i random_very_large_a.bed random_very_large_b.bed > /dev/null
# get an estimate for streamed gia
time gia intersect -S -a random_very_large_a.bed -b random_very_large_b.bed -o /dev/null
```
**10,000,000 random interval sorting**

Another function which is comparable is the sort function, since both tools must do an in-place sort.

```bash
# generate 10,000,000 random intervals
gia random -n 10000000 -o random.bed
# get an estimate for bedops
time sort-bed random.bed > /dev/null
# get an estimate for gia
time gia sort -i random.bed -o /dev/null
```
---
Thanks for the quick response.

---
Great point - here are those same timestamps with gia run first and bedops second. It seems like the trend continues to hold.

I'd definitely like to include support for an arbitrary number of inputs - I think that's a very useful feature, but I haven't had a chance yet to really take a crack at it. I would love for gia to be more of a community project, and if you're interested at all in contributing I'd be happy to accept PRs.

I agree Rust is a pretty language - the learning curve is steeper than others, but honestly I think that for anybody who has worked with systems programming languages before, the switch will be much easier. I think the language will really continue to grow in bioinformatics. There are excellent resources for getting started as well, and in my opinion the best way to learn is to just start arguing with the compiler!

**Benchmarks with swapped order**

**100,000,000 random intervals**

```bash
time gia intersect -S -a random_large_a.bed -b random_large_b.bed -o /dev/null
time bedops -i random_large_a.bed random_large_b.bed > /dev/null
```
**1,000,000,000 random intervals**

```bash
time gia intersect -S -a random_very_large_a.bed -b random_very_large_b.bed -o /dev/null
time bedops -i random_very_large_a.bed random_very_large_b.bed > /dev/null
```
**10,000,000 random interval sorting**

```bash
time gia sort -i random.bed -o /dev/null
time sort-bed random.bed > /dev/null
```
---
Nice!

---
Thanks! I'm actually not doing anything clever with my sorting algorithm. The interval container is really just a built-in vector of generic intervals. The only trick is that my generic intervals all implement the ordering traits, so I can perform an in-place sort that has already been optimized in Rust. You can see the code in bedrs and how minimal the implementation is. The benefit of using the built-in is that the sort can also be done in parallel without rewriting the underlying structure. The benchmarks I showed used a single thread, but it can be even more performant with multiple.

---
Yes, it would be great to see what improvements there will be with multiple threads. I will close this out - great job so far - I am looking forward to seeing this development continue.

---
I meant to do some testing on my own, but I may never get there. I'm one of the authors of BEDOPS. It is not easy to imagine a 6x or so improvement in runtimes, as the bedops and closest-features utilities use linear-time algorithms (or n log n for sorting).
There are a couple of things that stand out to me in the bioRxiv paper. Mainly, the timed tests take at most 1 second for the slowest tool, which indicates very, very small inputs (Figures 1 and 2). If the trend held with large inputs, that would be far more interesting and impressive. Right now, the differences might be attributable to effects that do not generalize beyond 1 second.
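For instance, per-invocation startup costs can dominate sub-second measurements. One way to check (a rough sketch; the tiny_*.bed names are just placeholders) is to time each tool on a trivially small slice of the inputs generated earlier in this thread:

```bash
# With only ~100 intervals, nearly all measured wall time is fixed
# startup cost (binary load, allocator setup) rather than interval work.
head -n 100 random_large_a.bed > tiny_a.bed
head -n 100 random_large_b.bed > tiny_b.bed
time bedops -i tiny_a.bed tiny_b.bed > /dev/null
time gia intersect -S -a tiny_a.bed -b tiny_b.bed -o /dev/null
```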
The memory overhead shown for bedops (Figure 5) makes me think that the "megarow" build of BEDOPS was used. That build is meant for very large sequencing results (nanopore and PacBio). It scales to those much larger data at the cost of some small memory overhead, but also considerable time overhead. It would be worth measuring time/memory against that larger build, but also against the more popular (and default) build of the BEDOPS utilities.
You can use the switch-BEDOPS-binary-type utility to switch between typical (default) and megarow builds of utilities in BEDOPS.
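For instance, a quick sketch of benchmarking both builds on the same input, reusing the files generated earlier in this thread (see the BEDOPS documentation for the switch options):

```bash
# Switch every installed BEDOPS utility to the megarow build, time the
# intersect, then switch back to the default (typical) build and repeat.
switch-BEDOPS-binary-type --megarow
time bedops -i random_large_a.bed random_large_b.bed > /dev/null
switch-BEDOPS-binary-type --typical
time bedops -i random_large_a.bed random_large_b.bed > /dev/null
```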