are the replicates "independent"? #4

Closed · pooranis opened this issue Oct 28, 2021 · 6 comments
@pooranis commented Oct 28, 2021

Is the computation of each replicate independent of the others? That is, could you run divnet-rs with replicates = 1 five separate times and then combine the outputs (taking care to fix up the headers, the replicates column, etc.)? I am curious whether it is possible to parallelize this way. Thanks!

@mooreryan (Owner) commented Oct 28, 2021

TL;DR: in all my tests on a single computer, parallelizing the replicates actually slows things down.

The answer is ..."sort of" lol. You can run the replicates in parallel after the initial Aitchison model fit on the original data, since that fit is used as input to every replicate. But once you have it, the replicates themselves could run in parallel.

In fact, this is what the DivNet R package does. However, on every computer I've tested, it actually slows down the overall computation. I did try parallelizing the replicate code of divnet-rs with rayon (if you're not a Rust coder, that's a work-stealing data-parallelism library), and the parallel code generally performed worse.

There is a lot of BLAS/LAPACKE work, as well as other vector operations, going on, which seems to saturate the memory bandwidth of a single computer (a possible explanation, at least).
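
One practical note: if you do experiment with running pieces of this concurrently, it's worth pinning OpenBLAS to a single thread first, so the BLAS library isn't layering its own threads on top of yours:

# Keep OpenBLAS single-threaded; any parallelism then comes only from
# running replicates concurrently.
export OPENBLAS_NUM_THREADS=1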

Now, what I didn't try, and what I think probably would help, is to do the first model fitting and write the result to disk, then provide a mini-app that just does the replicates. Then you could, for example, submit each replicate as its own job on a compute cluster. As long as they didn't all land on the same node, you should avoid the memory bandwidth problem (assuming nothing else on a node is eating it!). Which I think is more or less what you're getting at, right?

@pooranis (Author) commented Oct 29, 2021

Yes! The latter - running the replicates as separate single-threaded jobs on an HPC - is what I was asking about. So it wouldn't work without a little tinkering to get the model fitting to run separately from the bootstrap. I hadn't heard of Rust until using divnet-rs, so I probably won't tinker on my own, but you have satisfied my curiosity ;). Thanks!

@mooreryan (Owner) commented Oct 29, 2021

Hey, so I thought this was an interesting idea, so I threw together a (partial) solution during my lunch break :) You can check it out in the separate-bootstraps branch.

You will need to clone this repository and then check out that branch specifically:

git clone https://github.com/mooreryan/divnet-rs.git
cd divnet-rs
git checkout separate-bootstraps

You know how the divnet-rs book gives the following instructions for installation?

cd divnet-rs
cargo build --release

Just change the second line to

cargo build --release --all

That will build two additional binaries: fit_model and bootstrap.
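
You can double-check that they were built; cargo puts them next to the main binary:

# Both new binaries land in target/release alongside divnet-rs
ls -l target/release/fit_model target/release/bootstrap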

You use them sort of like the original divnet-rs binary, except that you need to set some environment variables to control their behavior.

You can recreate a full divnet-rs run like this:

DN_MODEL_DAT=dn_model DN_MODEL_CSV=dn_model.csv ./target/release/fit_model ./test_files/small/config_small.toml

# Each of these could be run on separate nodes, or as separate scheduled jobs (eg torque, slurm, etc).
#
# You must change the replicate, seed, and outfiles by hand!
# 
# DN_MODEL_DAT must be the same for these as for the fit_model run
DN_MODEL_DAT=dn_model DN_REPLICATE=1 DN_SEED=1 DN_BOOT_CSV=dn_boot_1.csv  ./target/release/bootstrap ./test_files/small/config_small.toml
DN_MODEL_DAT=dn_model DN_REPLICATE=2 DN_SEED=2 DN_BOOT_CSV=dn_boot_2.csv  ./target/release/bootstrap ./test_files/small/config_small.toml
DN_MODEL_DAT=dn_model DN_REPLICATE=3 DN_SEED=3 DN_BOOT_CSV=dn_boot_3.csv  ./target/release/bootstrap ./test_files/small/config_small.toml

# Concatenate the outfiles
cat dn_model.csv dn_boot_*.csv > dn_out.csv

# Clean up
rm dn_model dn_model.csv dn_boot_*.csv

Note that ./test_files/small/config_small.toml is for one of the test data sets. You can try it, but obviously for your data, you would want to use your own config :)

Now the cool thing is that each of the three (or however many replicates you want) bootstrap commands could run on its own node, or be submitted through whatever job scheduler your compute cluster uses.
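
For example, with SLURM, submission might look roughly like this (an untested sketch; dn_model would need to live on storage visible to every node):

# Untested sketch: one SLURM job per bootstrap replicate via sbatch --wrap.
# Adjust paths, partitions, etc. for your cluster.
for i in 1 2 3; do
  sbatch --job-name="dn_boot_$i" --wrap="DN_MODEL_DAT=dn_model DN_REPLICATE=$i DN_SEED=$i DN_BOOT_CSV=dn_boot_$i.csv ./target/release/bootstrap ./test_files/small/config_small.toml"
done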

I haven't tested it to see if you get a nice speedup or not on a cluster, but if you would be willing to try it out, I would be very interested in the results!

@pooranis (Author) commented Oct 30, 2021

I tested with a modest dataset I had on hand:

Summary of OTU table:

 Samples  OTUs  Total#Reads  Min#Reads  Max#Reads  Median#Reads  Avg#Reads
     207   685      6824973      10234      83107         28756   32970.88

Running 5 replicates with the 'careful' tuning parameters using plain divnet-rs with one thread:

export OPENBLAS_NUM_THREADS=1
time divnet-rs ./config_divnet_rs_abx.toml
real    23m19.236s
user    23m15.174s
sys     0m3.639s

Running fit_model by itself:

export OPENBLAS_NUM_THREADS=1
time DN_MODEL_DAT=dn_model_time DN_MODEL_CSV=dn_model_time.csv fit_model ./config_divnet_rs_abx.toml
real    3m5.013s
user    3m1.716s
sys     0m1.810s

Instead of submitting the bootstrap replicates as separate jobs, I used GNU Parallel with the options for our HPC cluster, which uses SLURM. Thinking about what you said about memory bandwidth, I tried it two different ways.
Here it is running with regular hyperthreading and 5 threads (2 per core, so 3 cores):

export OPENBLAS_NUM_THREADS=1
time parallel --ungroup -j5 DN_MODEL_DAT=dn_model 'DN_REPLICATE={}' 'DN_SEED={}' 'DN_BOOT_CSV=dn_boot_{}.csv' bootstrap ./config_divnet_rs_abx.toml ::: 1 2 3 4 5
real    6m24.941s
user    29m46.054s
sys     0m16.408s

SLURM also has an option, --hint=memory_bound, which enforces a single thread per core AND a single core per socket (plus some other affinity settings that I don't fully understand). Running the same command with 5 threads on 5 separate cores/sockets:

real    4m8.211s
user    20m14.450s
sys     0m10.895s

with the user time almost exactly equal to the serial run time minus the fit_model step!
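
In case it helps anyone reproduce this, the launch was along these lines (run_bootstraps.sh is a stand-in for the actual batch script; exact flags are site-specific):

# Same GNU Parallel command as above, wrapped in a batch script and
# submitted with one core per socket enforced via --hint=memory_bound.
sbatch --ntasks=5 --hint=memory_bound run_bootstraps.sh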

Memory usage for divnet-rs was ~650 MB, and for the parallel jobs around 3 GB. The actual numerical results are also the same as running plain divnet-rs (and regular DivNet with the same params). Obviously this should be tested with a bigger dataset, but it's potentially a nice improvement if you have access to an HPC! Thanks so much! Cheers!! 🍻

@mooreryan (Owner)

This is great stuff @pooranis, thank you!!! I will have to give it a closer look and see how we can incorporate it. If not directly in the code, then at least in the documentation! 🚀

@mooreryan (Owner)

This is now in the main branch (as of e9cfe24).

I still need to add to the docs and clean it up a bit, but it's fine for now.
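
If you built from the separate-bootstraps branch earlier, switching to the merged version should be something like this (assuming the default branch is named main):

# Grab the merged code and rebuild all three binaries
git checkout main
git pull origin main
cargo build --release --all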
