Within-chain parallelization via "reduce_sum" #892
Great to see this here. It will likely take a moment until it lands in brms given that RStan is still at 2.19. Should we maybe prepare in the meantime a short vignette which shows, with an example, how the Stan code from the current brms needs to be modified to run with CmdStan? Is that an option? As an alternative, maybe a wiki page? |
Would issue #891 help with this? :-) |
Obviously yes. |
Did the tutorial/wiki that @wds15 mentioned ever get created? |
Inspired by @wds15's talk at StanCon, I started thinking about
As far as I understand, slicing over
|
I can't offer any benchmarks on this and I would suggest just finding out... I am looking forward to these results (read: results someone else worked out). It's all about Amdahl's law, as I explained... |
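Amdahl's law mentioned above caps the achievable speedup by the serial fraction of the work. Here is a minimal illustration (Python purely for illustration; the function name and the 90% figure are hypothetical, not from the thread):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Theoretical speedup when only part of the runtime parallelizes.

    parallel_fraction: share of the work that reduce_sum can spread
    across threads (e.g. the likelihood); the remainder stays serial.
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even with 90% of the work parallelized, 8 cores give well under 8x:
print(round(amdahl_speedup(0.9, 8), 2))  # 4.71
```

This is why the thread keeps stressing measuring actual wallclock time rather than expecting linear scaling in the number of cores.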
Thanks! Let's aim high and parallelize the whole model block. Take, for example, the model block of a simple varying intercept-slope model, with priors removed (we don't want to include priors in reduce_sum, do we?):
Here is how it would look
where Above, I use @wds15 What do you think of this approach? Am I missing something important in the spec? |
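For readers unfamiliar with how `reduce_sum` partitions the work, here is a small mimic of its contract in plain Python (all names are hypothetical; the real Stan function passes 1-based indices, picks chunk sizes adaptively around the grainsize, and evaluates the chunks in parallel):

```python
def reduce_sum_mimic(partial_sum, sliced, grainsize, *shared):
    """Split `sliced` into chunks, call the partial-sum function on each
    chunk together with that chunk's start/end positions and any shared
    (unsliced) arguments, and sum the results."""
    total = 0.0
    for start in range(0, len(sliced), grainsize):
        end = min(start + grainsize, len(sliced))
        # Stan passes 1-based inclusive indices alongside the slice.
        total += partial_sum(sliced[start:end], start + 1, end, *shared)
    return total

def partial_log_lik(y_slice, start, end, mu):
    # toy "log likelihood" contribution: sum of (y - mu) over the slice
    return sum(y - mu for y in y_slice)

y = [1.0, 2.0, 3.0, 4.0, 5.0]
print(reduce_sum_mimic(partial_log_lik, y, 2, 0.5))  # 12.5
```

The key property is that the result is identical for any grainsize, because a sum can be accumulated in any chunking; only the runtime changes.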
This starts to look very good! My thoughts on this:
|
I will write a brms branch where I support just a small subset of brms models in the above described way. That should take just a few hours but would enable us to rapidly test a lot of cases (instead of doing all the hand writing). With regard to |
The However, if it is really hard to integrate that in your code gen, then just keep it as is for now (there is hope that the Stan compiler gets smarter and the problem goes away automatically). |
I see. So, in brms, I will go ahead and do the header-only-change for now, even if inefficient in some cases so that we have a version to try out. |
Oh... if that's also data... then yes, avoid the temporary re-assignments if you can. So better use instead of Then you keep this thing as data. |
This is going to be a code-generation nightmare :-D Let's see if I can find a principled way to enable this indexing on the fly. |
I usually write partial sum functions with a loop over the slice, but with two indices. One spanning |
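The two-index pattern described above can be sketched as follows (Python for illustration only; all names are made up, and Python indexing is 0-based unlike Stan's):

```python
def partial_sum_two_index(y_slice, start, end, mu_full):
    """One index walks the slice itself; the derived global index
    n = start + i addresses full-length containers that were NOT
    sliced and arrive as shared arguments."""
    total = 0.0
    for i, y in enumerate(y_slice):
        n = start + i  # global 0-based position in the full data
        total += y - mu_full[n]
    return total

y = [1.0, 2.0, 3.0, 4.0]
mu = [0.5, 0.5, 1.0, 1.0]
# Summing the partial sums over chunks reproduces the serial result:
chunks = [(y[0:2], 0, 2), (y[2:4], 2, 4)]
print(sum(partial_sum_two_index(s, a, b, mu) for s, a, b in chunks))  # 7.0
```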
Thanks. I will see what I can make possible in a first step. |
Maybe start this by taking care of Xc... which is of the order of N * number of regressors.... but Z is just of the size of N. |
Yeah, but in order to make that scale to all the complexity of brms models, we should consistently use one or the other approach all the way through and not mix things up. I will play around with it and report back once there is something that works (ish). |
Ok, one follow-up question, given that the header-change approach would be so much simpler: Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions? |
I don't know... with some luck this is even already available with the recently released stanc3 optimisations, which can be turned on. I have not been following this closely enough to know if it is there or still in the making. I would have hoped that a simple text-replacement approach would work, at least to start with. |
This is in 2.24 already. If you specify --O (not --o) it will optimize this case. Optimizations are currently experimental, but at least these optimizations work as they should (as far as we tested them, of course). |
So

```stan
functions {
  real foo(data matrix X, vector paramv) {
    matrix[2,2] Xs = X[1:2];
    return sum(Xs * paramv);
  }
}
```

without optimization:

```cpp
template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
    const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
  using local_scalar_t__ = stan::promote_args_t<T1__>;
  const static bool propto__ = true;
  (void) propto__;
  local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
  (void) DUMMY_VAR__;  // suppress unused var warning
  try {
    Eigen::Matrix<local_scalar_t__, -1, -1> Xs;
    Xs = Eigen::Matrix<local_scalar_t__, -1, -1>(2, 2);
    stan::math::fill(Xs, DUMMY_VAR__);
    current_statement__ = 1;
    assign(Xs, nil_index_list(),
           rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
           "assigning variable Xs");
    current_statement__ = 2;
    return sum(multiply(Xs, paramv));
  } catch (const std::exception& e) {
    stan::lang::rethrow_located(e, locations_array__[current_statement__]);
    // Next line prevents compiler griping about no return
    throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
  }
}
```

with optimization:

```cpp
template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
    const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
  using local_scalar_t__ = stan::promote_args_t<T1__>;
  const static bool propto__ = true;
  (void) propto__;
  local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
  (void) DUMMY_VAR__;  // suppress unused var warning
  try {
    Eigen::Matrix<double, -1, -1> lcm_sym2__;
    double lcm_sym1__;
    {
      Eigen::Matrix<double, -1, -1> Xs;
      Xs = Eigen::Matrix<double, -1, -1>(2, 2);
      stan::math::fill(Xs, std::numeric_limits<double>::quiet_NaN());
      assign(lcm_sym2__, nil_index_list(),
             rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
             "assigning variable lcm_sym2__");
      current_statement__ = 2;
      return sum(multiply(lcm_sym2__, paramv));
    }
  } catch (const std::exception& e) {
    stan::lang::rethrow_located(e, locations_array__[current_statement__]);
    // Next line prevents compiler griping about no return
    throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
  }
}
```

Note that Xs is a matrix of doubles, which is a huge win in this case. |
Thanks @rok-cesnovar ... but a requirement is that you declare things as

```stan
real foo(data matrix X, vector paramv) {
  matrix[2,2] Xs = X[1:2];
  return sum(Xs * paramv);
}
```

So the Is that an option for you @paul-buerkner? (BTW, this is super cool to know... I have been waiting for this feature for ages) |
Yes, you need the data qualifier. Otherwise stanc3 can't infer the type. In theory it probably could check with what inputs it is used and create extra signatures. But not at the moment. |
Nice! So I just need to put the |
Yes + call the cmdstan_model with an argument. I will post an example call when I am back at my laptop. |
Great, thanks @rok-cesnovar! @wds15 After some work today, I got a simple version working already, which uses the subsetting-on-the-fly approach you advocated before knowing of the new stanc optimization. I will post the version tomorrow or so, once I am convinced of it. I am no longer sure which of the two discussed versions would be easier to implement and maintain. I will play around with it more. |
If you get it to work with subsetting as suggested first, then this will be more flexible... since the |
Can you say more about this? What kind of update? Is this something we need to change in cmdstanr or is it related to how brms is interfacing with cmdstanr? I can help with this if necessary. |
@paul-buerkner I can work around the recompilation issue by either going brute-force (doing many compiles and then just subtracting the compilation time) or being clever about it by converting things directly to cmdstanr style (brms spits out the Stan code and the Stan data). The brute-force way is cleaner, while the second approach will work with what you have. Any preference? |
Updating threads and grainsize is now possible without recompilation. Here is an example:

```r
# this model just serves as an illustration
# threading may not actually speed things up here
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = negbinomial(),
            chains = 1, threads = threading(2, grainsize = 100),
            backend = "cmdstanr")
fit2 <- update(fit1, threads = threading(2, grainsize = 50))
``` |
Nice! |
Unit tests are now in place as well. After adding some illustrative examples and deciding on the default grainsize, the |
Cool beans! I will try to hurry up with my bits. |
No worries, I will not be able to work much next week anyway. |
Below is a small code snippet which can be used by users to evaluate their models. Maybe it is possible to include such code in a brms example section? The notebook can be rendered as a spin file. This creates multiple versions of the report. Parameters which can be varied are the sample size N, the number of groups G, and the likelihood (normal or Poisson).

The report creates in essence two plots: one demonstrating the slowdown due to increasing the chunking, the other demonstrating, at multiple grainsizes (2x default, default, default/2), the change in execution time. The report runs the benchmarks at 25 and 50 iterations each. The point is that the number of iterations needs to be large enough for a stable estimate, such that doubling the number of iterations should result in similar curves for a given problem - otherwise users need to increase these further.

The function The default grainsize of number of rows / 2 * number of physical cores seems to work ok from what I see so far.

BTW... at first I was mainly looking at the speedup-vs-cores plots... but then I noticed that it's a lot better to look at the absolute runtime, since the runtime at 1 core can go up with smaller grainsizes, such that higher relative speedups against it come easier, but the total wallclock time is larger. Have a look at the code and let me know what you think. I did follow the tidyverse approaches, which I hope fits into brms. |
Thank you so much @wds15! I am still on vacation but will take a more detailed look in a week or so. With regard to making this a brms example, I prefer adding as few dependencies to the package as possible just for this example. How easy is it to strip the code of its dependency on dplyr and tidyr? The other dependencies are either part of brms already, such as ggplot2, or can easily be replaced, such as the dependency on mvtnorm. Just one minor clarification: The default rule should be #rows / (2 * #cores), right? Above, you didn't specify brackets but from the context I assume that was a typo. |
Oh... I would have expected dplyr and tidyr to be already in your list of deps... but if not, then of course - let's take them out. This will make the code look quite a bit different, but fine for me. So I should use *apply stuff, I suppose. I meant #rows / (2 * #cores)... sure. cores is the number of physical cores available. Maybe we should even put there max(100, #rows / (2 * #cores)) to always have at least a grainsize of 100. |
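The default grainsize rule discussed here (rows divided by twice the number of physical cores, floored at 100) can be written as a tiny sketch. Python is used purely for illustration and the function name is made up:

```python
def default_grainsize(n_rows, n_physical_cores):
    """Grainsize heuristic from the discussion:
    max(100, #rows / (2 * #cores))."""
    return max(100, n_rows // (2 * n_physical_cores))

print(default_grainsize(10000, 4))  # 1250
print(default_grainsize(300, 4))    # 100 (the floor kicks in)
```

The floor of 100 keeps tiny chunks from being scheduled, since per-chunk overhead would otherwise dominate on small data sets.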
Thanks for the clarifications and for offering to adjust the code. I agree tidyverse makes it prettier, but for an example in the brms doc it may have just too many new dependencies. |
Here is a version which does not require anything else but just |
Thank you! It looks like this should almost be a small vignette for brms, not an example or demo. Also, at the moment, it does not seem to be self-contained, as it uses a "params" variable that seems to come out of your folder structure(?). What I am wondering is if we need this detailed vignette/demo already now or if it makes more sense to add it later once threading in brms is less of an experimental thing. If we decide to add it later, the reduce_sum feature would be ready to merge now, as everything else is in place. |
"params" is set to some default values when you run this script without anything. So it should run out of the box. I am inclined to say... "release right away" as experimental, but then I fear we are overwhelmed with Qs from users. How about I turn this into a mini-vignette for a still experimental feature "reduce_sum"? Some doc in form of a vignette should be a lot better than none. For the sake of simplicity I would reduce to just one likelihood presumably. So unless you want to release right away a new brms, I can make the next days an attempt to prepare such a mini vignette. I am definitely fine with releasing without this, but most material is already there such that it's not too much effort now to have a mini vignette (unless I overlook something here). |
I would prefer releasing with the vignette. Thank you! |
Great. Let's do that... it's probably easier if I fork brms and then make a PR against your repo (or you grant me directly access to this repo...whatever you prefer). |
If you fork the repo and then work from the reduce_sum branch, that would be perfect! brms has a couple of other Rmd vignettes from which you can see the required header structure. |
I started to write this up and I am starting to populate it with text. If you want to have a look at the flow of the document, have a look here: If you have any comments already now, let me know. I will need to find a way to provide the code I wrote, but not necessarily put all of it into the document as it is rendered. So maybe I will pull out some of the utility functions and make them "sourcable" such that users can grab them easily - let's see. |
Thanks! I think the existing text already looks quite good! I am not sure what a good approach is to sourcing code in vignettes, to be honest. Personally, I would be fine with showing large chunks of code in the vignette (for users to copy), but I understand it kind of breaks the flow a little. @jgabry and @mjskay, do you have experience or suggestions with handling lots of code in vignettes? |
Maybe I found a good solution. One can run code chunk blocks in the header of the document but not include their output at all. When you name these code chunks, it is possible with knitr to print them later on in an "Appendix" section without executing them a second time. So this allows me to avoid distracting from the reading flow and still include the code in full. Sound good? EDIT: Have a look at the updated vignette, which includes a first implementation of this with dummy code. |
Sounds good! |
It's progressing nicely: I will probably drop the normal model for the sake of simplicity. Hopefully tomorrow I'll have time to finish a first version of the text. |
I just pushed a complete first version. How to proceed? |
Nice! Can you make a PR towards brms/reduce_sum? |
Sure... I added two more bits and you got your PR. I should stop now, as the document got almost lengthy... but now people without time can grasp the most important stuff on the first page, and others can get more details by going through the entire text. |
Thank you so much! I will read through it later on and then merge it into |
Closed via #1004 |
See the blog post of Sebastian Weber (@wds15): https://statmodeling.stat.columbia.edu/2020/05/05/easy-within-chain-parallelisation-in-stan/