Within-chain parallelization via "reduce_sum" #892

Closed
paul-buerkner opened this issue May 6, 2020 · 111 comments

@paul-buerkner (Owner)

See the blog post by Sebastian Weber (@wds15): https://statmodeling.stat.columbia.edu/2020/05/05/easy-within-chain-parallelisation-in-stan/

@wds15 (Contributor) commented May 7, 2020

Great to see this here.

It will likely take a moment until it lands in brms, given that RStan is still at 2.19. Should we maybe prepare in the meantime a short vignette which shows with an example how the Stan code from the current brms needs to be modified in order to run with CmdStan? Is that an option? As an alternative, maybe a wiki page?

@paul-buerkner (Owner, Author)

Would issue #891 help with this? :-)

@wds15 (Contributor) commented May 7, 2020

Obviously yes.

paul-buerkner added this to the brms 2.13.0++ milestone Jun 7, 2020
@mike-lawrence

Did the tutorial/wiki that @wds15 mentioned ever get created?

@paul-buerkner (Owner, Author) commented Aug 12, 2020

Inspired by @wds15's talk at StanCon, I started thinking about reduce_sum in brms a little more and I would like your (read: anybody's) input.

  1. Over which variable should the automatic slicing be done? In a standard brms model, we could slice over y (the response variable) or over mu (the mean predictor term). A standard brms likelihood could look as follows:
vector[N] y;
vector[N] mu;
real sigma;
target += normal_lpdf(y | mu, sigma);

As far as I understand, slicing over mu would be computationally preferable, as y can be copied by reference even if passed to reduce_sum via an additional argument, while mu would be passed by value if it were an additional argument. Does anybody know how big an influence this decision has?

  2. I know it is preferable to parallelize as much as possible, but the computation of the predictor term mu may involve so many different variables that the housekeeping required to stuff them all into reduce_sum would be too much for brms to handle, I think. Given that I currently plan to use reduce_sum only for the likelihood call, with mu being computed outside of it serially, would reduce_sum still be worthwhile (see the sketch below)? Does any benchmark data exist for this case?
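
A minimal sketch of the plan in item 2 for the normal likelihood above (a hypothetical toy model; note that reduce_sum slices arrays, so the response is stored as real[] here):

functions {
  // partial log likelihood over a slice of the response; the serially
  // computed mu is passed in whole and indexed via start:end
  real partial_log_lik(real[] y_slice, int start, int end,
                       vector mu, real sigma) {
    return normal_lpdf(y_slice | mu[start:end], sigma);
  }
}
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  real y[N];
}
parameters {
  vector[K] b;
  real<lower=0> sigma;
}
model {
  vector[N] mu = X * b;  // predictor term, computed serially
  // grainsize 1 leaves the slice sizes to the scheduler
  target += reduce_sum(partial_log_lik, y, 1, mu, sigma);
}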

@wds15 (Contributor) commented Aug 12, 2020

  1. Yep, slicing over mu, not over y, is much better.

  2. It would still be worthwhile if the computational cost of the likelihood is very large compared to computing mu serially (likelihoods with gamma function calls, like the Poisson, should work ok, and similar ones... but probably not a normal lpdf). This is probably a good starting point. Still, is it conceivable to write the entire model block out as a function and just pass everything into it?

I can't offer any benchmarks on this and I would suggest just finding out... I am looking forward to these results (read: results someone else worked out). It's all about Amdahl's law, as I explained...
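
For reference, Amdahl's law here: if a fraction p of each gradient evaluation runs in parallel on n threads, the best possible speedup is

S(n) = 1 / ((1 - p) + p / n),  which approaches 1 / (1 - p) as n grows.

So with, say, p = 0.9 (90% of the cost inside reduce_sum), the speedup can never exceed 10x no matter how many threads are used - which is why a serially computed mu can become the bottleneck.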

@paul-buerkner (Owner, Author)

Thanks! Let's aim high and parallelize the whole model block. Take, for example, the model block of a simple varying intercept-slope model, with priors removed (we don't want to include priors in reduce_sum, do we?):

model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
  }
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
  }
}

Here is how it would look:

functions {
  real partial_log_lik(real[] Y, int start, int end, real Intercept, 
                       matrix fXc, vector b, int[] fJ_1, vector fZ_1_1, 
                       vector fZ_1_2, vector r_1_1, vector r_1_2,
                       real sigma) {
    int N = end - start + 1;
    matrix[N, cols(fXc)] Xc = fXc[start:end];
    int J_1[N] = fJ_1[start:end];
    vector[N] Z_1_1 = fZ_1_1[start:end];
    vector[N] Z_1_2 = fZ_1_2[start:end];
    real ptarget = 0;
    // initialize linear predictor term
    vector[N] mu = Intercept + Xc * b;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    ptarget += normal_lpdf(Y | mu, sigma);
    return ptarget;
  }
}
model {
  // likelihood including all constants
  if (!prior_only) {
    target += reduce_sum(partial_log_lik, Y, 1, ...);
  }
}

where ... stands for all the additional arguments passed to partial_log_lik (spelled out just below).
I do a lot of housekeeping at the start of partial_log_lik, where I subset the original vectors etc. so that they become the partial vectors etc. that we need. The full vectors to be subsetted get the prefix f, so that the subsetted vectors keep the original names. That way, only the definition header of partial_log_lik differs from the original model block's code, which should make the required changes to brms' Stan code generation much less cumbersome.
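
Spelled out for this particular model (the arguments simply follow the signature above; in the model block the full-data variables still carry their original names), the call would presumably read:

target += reduce_sum(partial_log_lik, Y, 1, Intercept, Xc, b, J_1,
                     Z_1_1, Z_1_2, r_1_1, r_1_2, sigma);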

Above, I use Y as the first argument not because it is necessarily the best variable to reduce over but because it is the only thing that is always available no matter what the model looks like.

@wds15 What do you think of this approach? Am I missing something important in the spec?

@wds15 (Contributor) commented Aug 13, 2020

This starts to look very good! My thoughts on this:

  • we should first try this out without all the code gen to see if this really gives us the cool effect we want. I would manually rewrite a generated brms model.
  • the line matrix[N, cols(fXc)] Xc = fXc[start:end]; is very expensive for the AD tape. Xc will be declared as a parameter and add significantly to the AD cost. To avoid this, you would have to avoid declaring Xc and instead write the expression with it directly... so replace vector[N] mu = Intercept + Xc * b; with vector[N] mu = Intercept + fXc[start:end] * b;
  • yes, priors should not go into the partial sum function.

@paul-buerkner (Owner, Author)

I will write a brms branch where I support just a small subset of brms models in the way described above. That should take just a few hours but would enable us to rapidly test a lot of cases (instead of doing all the hand-writing).

With regard to Xc, can you explain why this specification is a problem for the AD tape while the same for, say, Z_1_1 is not? Also, your suggestion would kill the principle of leaving the fundamental model block unchanged and just adding a definition header, so I would like to avoid it if possible.

@wds15 (Contributor) commented Aug 13, 2020

The Z_* already come in as parameters, and it looks to me as if doing this re-indexing (basically that's what it is) makes it easier for you to work within the code gen of brms... but any real container (real, vector, matrix) in a user function is always instantiated as var (even if you assign data to it). At that point Xc is written to the AD tape. If you instead do what I suggested, then Xc is never written to the AD tape, since you use it directly in a matrix-vector multiplication. Doing so keeps the matrix off the AD tape and uses a highly optimised matrix-vector product where only terms of the size of the result vector end up on the AD tape.

However, if it is really hard to integrate that into your code gen, then just keep it as is for now (there is hope that the Stan compiler gets smarter and the problem goes away automatically).
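
To make the two patterns concrete, a sketch using the names from the spec above (inside partial_log_lik; use one pattern or the other):

// pattern A: local copy -- inside a user function the local matrix is
// instantiated as var, so the whole data slice lands on the AD tape
matrix[N, cols(fXc)] Xc = fXc[start:end];
vector[N] mu = Intercept + Xc * b;

// pattern B: direct use -- the data matrix never becomes a local var;
// only the length-N result of the product ends up on the AD tape
vector[N] mu = Intercept + fXc[start:end] * b;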

@paul-buerkner (Owner, Author)

I see. So, in brms, Z_* is data as well, so I suspect the same would apply to those variables too, right? That is, ideally don't redefine them but use them directly. The problem with this would be that Z_1_1[start:end] would be repeated in every evaluation of the loop, which is probably not what we want either...

I will go ahead and do the header-only change for now, even if it is inefficient in some cases, so that we have a version to try out.

@wds15 (Contributor) commented Aug 13, 2020

Oh... if that's also data... then yes, avoid the temporary re-assignments if you can.

So instead of
Z_1_1[n]
do this:
fZ_1_1[start + n - 1]

Then you keep this thing as data.

@paul-buerkner (Owner, Author)

This is going to be a code-generation nightmare :-D Let's see if I can find a principled way to enable this indexing on the fly.

@wds15 (Contributor) commented Aug 13, 2020

I usually write partial sum functions with a loop over the slice, but with two indices: one spanning n = 1...N, where N is the slice size, and another one for the global context, n_overall = n + start - 1. With these two defined, it's often easy to write down what you want... but I understand that this may be harder for your brms code gen things.
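
Applied to the running example, and folding in the earlier advice to avoid the Xc and Z_* temporaries, the partial sum function could look like this sketch:

real partial_log_lik(real[] Y, int start, int end, real Intercept,
                     matrix fXc, vector b, int[] fJ_1, vector fZ_1_1,
                     vector fZ_1_2, vector r_1_1, vector r_1_2,
                     real sigma) {
  int N = end - start + 1;  // slice size
  // use fXc directly in the product; no local data copy on the AD tape
  vector[N] mu = Intercept + fXc[start:end] * b;
  for (n in 1:N) {
    int nn = n + start - 1;  // position in the full data set
    mu[n] += r_1_1[fJ_1[nn]] * fZ_1_1[nn] + r_1_2[fJ_1[nn]] * fZ_1_2[nn];
  }
  return normal_lpdf(Y | mu, sigma);
}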

@paul-buerkner (Owner, Author)

Thanks. I will see what I can make possible in a first step.

@wds15 (Contributor) commented Aug 13, 2020

Maybe start by taking care of Xc... which is of the order of N * number of regressors... but Z is just of the size of N.

@paul-buerkner (Owner, Author)

Yeah, but in order to make that scale to all the complexity of brms models, we should consistently use one or the other approach all the way through and not mix things up. I will play around with it and report back once there is something that works (ish).

@paul-buerkner (Owner, Author)

Ok, one follow-up question, given that the header-change approach would be so much simpler: Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions?

@wds15 (Contributor) commented Aug 13, 2020

I don't know... with some luck this is even already available with the recently released stanc3 optimisations, which can be turned on. I have not been following this closely enough to know if it is there or still in the making.

I would have hoped that a simple text-replace approach would work, to start with at least.

@rok-cesnovar (Contributor)

Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions?

This is in 2.24 already. If you specify --O (not --o), it will optimize this case. Optimizations are currently experimental, but at least these optimizations work as they should (at least as far as we have tested them, of course).

@rok-cesnovar (Contributor)

So

functions {
  real foo(data matrix X, vector paramv) {
    matrix[2, 2] Xs = X[1:2];
    return sum(Xs * paramv);
  }
}

without optimization:

template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
    const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
  using local_scalar_t__ = stan::promote_args_t<T1__>;
  const static bool propto__ = true;
  (void) propto__;
  local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
  (void) DUMMY_VAR__;  // suppress unused var warning
  
  try {
    Eigen::Matrix<local_scalar_t__, -1, -1> Xs;
    Xs = Eigen::Matrix<local_scalar_t__, -1, -1>(2, 2);
    stan::math::fill(Xs, DUMMY_VAR__);
    
    current_statement__ = 1;
    assign(Xs, nil_index_list(),
      rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
      "assigning variable Xs");
    current_statement__ = 2;
    return sum(multiply(Xs, paramv));
  } catch (const std::exception& e) {
    stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
  }
  
}

with optimization

template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
    const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
  using local_scalar_t__ = stan::promote_args_t<T1__>;
  const static bool propto__ = true;
  (void) propto__;
  local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
  (void) DUMMY_VAR__;  // suppress unused var warning
  
  try {
    Eigen::Matrix<double, -1, -1> lcm_sym2__;
    double lcm_sym1__;
    {
      Eigen::Matrix<double, -1, -1> Xs;
      Xs = Eigen::Matrix<double, -1, -1>(2, 2);
      stan::math::fill(Xs, std::numeric_limits<double>::quiet_NaN());
      
      assign(lcm_sym2__, nil_index_list(),
        rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
        "assigning variable lcm_sym2__");
      current_statement__ = 2;
      return sum(multiply(lcm_sym2__, paramv));
    }
  } catch (const std::exception& e) {
    stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
  }
  
}

Note that Xs is a matrix of doubles, which is a huge win in this case.

@wds15 (Contributor) commented Aug 13, 2020

Thanks @rok-cesnovar ... but a requirement is that you declare things as

real foo(data matrix X, vector paramv) {
  matrix[2, 2] Xs = X[1:2];
  return sum(Xs * paramv);
}

So X must be declared as data matrix, and at that point one can of course only call the function with data.

Is that an option for you @paul-buerkner ?

(BTW, this is super cool to know... I have been waiting for this feature for ages)

@rok-cesnovar (Contributor)

Yes, you need the data qualifier.

Otherwise stanc3 can't infer the type. In theory it could probably check with what inputs the function is used and create extra signatures. But not atm.

@paul-buerkner (Owner, Author)

Nice! So I just need to put the data qualifier before the variable type and things should be fine? That would be awesome!

@rok-cesnovar (Contributor)

Yes, plus calling cmdstan_model with an argument. I will post an example call when I am back at my laptop.

@paul-buerkner (Owner, Author) commented Aug 13, 2020

Great, thanks @rok-cesnovar! @wds15 After some work today, I already got a simple version working, which uses the subsetting-on-the-fly approach you advocated before knowing of the new stanc optimization. I will post the version tomorrow or so, once I am convinced of it. I am no longer sure which of the two discussed versions would be easier to implement and maintain. I will play around with it more.

@wds15 (Contributor) commented Aug 13, 2020

If you get it to work with subsetting as suggested first, then this will be more flexible... since the data declaration in the arguments makes any function less flexible, which could limit things with respect to missing-data stuff, for example.

@jgabry (Contributor) commented Sep 2, 2020

Allow to update cmdstanr models without recompilation (currently does not work as it should)

Can you say more about this? What kind of update? Is this something we need to change in cmdstanr or is it related to how brms is interfacing with cmdstanr? I can help with this if necessary.

@wds15 (Contributor) commented Sep 2, 2020

@paul-buerkner I can work around the recompilation issue by either going brute force (doing many compiles and then just taking out the compilation time) or being clever about it by converting things directly to cmdstanr style (brms spits out the Stan code and the Stan data). The brute-force way is cleaner for brms, while the second approach will work with what you have. Any preference?

@paul-buerkner (Owner, Author)

@jgabry I think it is just a brms issue. I will let you know once I have fixed the parts on the brms side. Let's see if there is still some stuff remaining afterwards.

@wds15 I prefer the brute-force way for now. I will convert the code to the new version once I have fixed the update problem.

@paul-buerkner (Owner, Author)

Updating threads and grainsize is now possible without recompilation. Here is an example:

# this model just serves as an illustration
# threading may not actually speed things up here
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = negbinomial(),
            chains = 1, threads = threading(2, grainsize = 100),
            backend = "cmdstanr")

fit2 <- update(fit1, threads = threading(2, grainsize = 50))

@jgabry (Contributor) commented Sep 3, 2020

Nice!

@paul-buerkner (Owner, Author)

Unit tests are now in place as well. After adding some illustrative examples and deciding on the default grainsize, the reduce_sum branch is ready for review/merging.

@wds15 (Contributor) commented Sep 6, 2020

Cool beans! I will try to hurry up with my bits.

@paul-buerkner (Owner, Author) commented Sep 6, 2020 via email

@wds15 (Contributor) commented Sep 8, 2020

Below is a small code snippet which users can use to evaluate their models. Maybe it is possible to include such code in a brms example section?

The notebook can be rendered as a spin file. This creates multiple versions of the report. Parameters which can be varied are the sample size N, the number of groups G, and the likelihood (normal or Poisson).

The report creates, in essence, two plots: one demonstrating the slowdown due to increasing the chunking, and one demonstrating the change in execution time at multiple grainsizes (2x default, default, default/2). The report runs the benchmarks at 25 and 50 iterations each. The point is that the number of iterations needs to be large enough for a stable estimate, such that doubling the number of iterations should result in similar curves for a given problem - otherwise users need to increase these further.

The function benchmark_threading is relatively generic and should be useful to many other users. One point one could improve in the function is to read out ESS/s for lp__ from the cmdstan output; right now I use the total runtime as a metric.

The default grainsize of number of rows / 2 * number of physical cores seems to work ok from what I see so far.

BTW... at first I was mainly looking at the speedup vs. cores plots... but then I noticed that it's a lot better to look at the absolute runtime, since the runtime at 1 core can go up with smaller grainsizes, such that higher relative speedups against that baseline get easier, but the total wallclock time is larger.

Have a look at the code and let me know what you think. I followed the tidyverse approach, which I hope fits into brms.

evaluate-tuning-20200908.zip

@paul-buerkner (Owner, Author)

Thank you so much @wds15! I am still on vacation but will take a more detailed look in a week or so. With regard to making this a brms example, I prefer adding as few dependencies to the package as possible just for this example. How easy is it to strip the code of its dependency on dplyr and tidyr? The other dependencies are either part of brms already, such as ggplot2, or can easily be replaced, such as the dependency on mvtnorm.

Just one minor clarification: The default rule should be #rows / (2 * #cores), right? Above, you didn't specify brackets, but from the context I assume that was a typo.

@wds15 (Contributor) commented Sep 12, 2020

Oh... I would have expected dplyr and tidyr to already be in your list of deps... but if not, then of course - let's take them out. This will make the code look quite a bit different, but that's fine by me. So I should use *apply stuff, I suppose.

I meant #rows / (2 * #cores), sure. Here, cores is the number of physical cores available. Maybe we should even put there

max(100, #rows / (2 * #cores))

to always have at least a grainsize of 100.
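
(For instance, 10,000 rows on a machine with 4 physical cores would give max(100, 10000 / (2 * 4)) = 1250; the floor of 100 only kicks in below 800 rows on such a machine.)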

@paul-buerkner (Owner, Author) commented Sep 13, 2020 via email

@wds15 (Contributor) commented Sep 14, 2020

Here is a version which does not require anything else but ggplot2. It's a bit more verbose to code up. Maybe this is better suited as a demo that is part of brms? Just a thought.

evaluate-tuning-base-20200914.zip

@paul-buerkner (Owner, Author)

Thank you! It looks like this should almost be a small vignette for brms, not an example or demo. Also, at the moment, it does not seem to be self-contained, as it uses a "params" variable that seems to come out of your folder structure(?). What I am wondering is whether we need this detailed vignette/demo already now or whether it makes more sense to add it later once threading in brms is less of an experimental thing. If we decide to add it later, the reduce_sum feature would be ready to merge now, as everything else is in place.

@wds15 (Contributor) commented Sep 20, 2020

"params" is set to some default values when you run this script without anything. So it should run out of the box.

I am inclined to say... "release right away" as experimental, but then I fear we will be overwhelmed with Qs from users.

How about I turn this into a mini-vignette for the still experimental "reduce_sum" feature? Some doc in the form of a vignette should be a lot better than none. For the sake of simplicity, I would presumably reduce it to just one likelihood.

So unless you want to release a new brms right away, I can make an attempt over the next days to prepare such a mini vignette.

I am definitely fine with releasing without this, but most of the material is already there, such that it's not too much effort now to have a mini vignette (unless I am overlooking something here).

@paul-buerkner (Owner, Author) commented Sep 20, 2020 via email

@wds15 (Contributor) commented Sep 20, 2020

Great. Let's do that... it's probably easier if I fork brms and then make a PR against your repo (or you grant me direct access to this repo... whatever you prefer).

@paul-buerkner (Owner, Author) commented Sep 20, 2020 via email

@wds15 (Contributor) commented Sep 21, 2020

I started to write this up and am populating it with text. If you want to have a look at the flow of the document, have a look here:

reduce_sum...wds15:reduce_sum

If you have any comments already now, let me know.

I will need to find a way to provide the code I wrote, but not necessarily put all of it into the document as it is rendered. So maybe I will pull out some of the utility functions and make them "sourcable" such that users can grab them easily - let's see.

@paul-buerkner (Owner, Author)

Thanks! I think the existing text already looks quite good! I am not sure what a good approach to sourcing code in vignettes is, to be honest. Personally, I would be fine with showing large chunks of code in the vignette (for users to copy), but I understand it kind of breaks the flow a little. @jgabry and @mjskay, do you have experience or suggestions for handling lots of code in vignettes?

@wds15 (Contributor) commented Sep 22, 2020

Maybe I found a good solution. One can run code chunks in the header of the document without including their output at all. When you name these code chunks, it is possible with knitr to print them later in an "Appendix" section without executing them a second time. This allows me to avoid distracting from the reading flow and still include the code in full. Sound good?

EDIT: Have a look at the updated vignette which includes a first implementation of this with dummy code.

@paul-buerkner (Owner, Author) commented Sep 22, 2020 via email

@wds15 (Contributor) commented Sep 23, 2020

It's progressing nicely:

brms_threading.zip

I will probably drop the normal model for the sake of simplicity.

Hopefully I will have time tomorrow to finish a first version of the text.

@wds15 (Contributor) commented Sep 24, 2020

I just pushed a complete first version. How to proceed?

@paul-buerkner (Owner, Author) commented Sep 25, 2020 via email

@wds15 (Contributor) commented Sep 25, 2020

Sure... I added two more bits and you've got your PR. I should stop now, as the document has become almost lengthy... but now people without time can grasp the most important stuff on the first page, and others can get more details by going through the entire text.

@paul-buerkner (Owner, Author)

Thank you so much! I will read through it later on and then merge it into reduce_sum.

@paul-buerkner (Owner, Author)

Closed via #1004
