# greta-dev/greta



# Add sampling #269

Open
opened this issue Mar 1, 2019 · 11 comments

Contributor

### jeffreypullin commented Mar 1, 2019

 Continued from thread on forum:
Contributor Author

### jeffreypullin commented Mar 1, 2019

Jeffrey: Okay, I've had a bit of a go at an implementation (you can see it here) but I've run into a few problems. (By the way, I'm not sure if here is the best place to have this sort of discussion, but I'm not sure where else.) The basic idea for simulating a distribution is a new method for every distribution, `tf_sample()`, which is called from the wrapper `sample()` at the distribution-node level. For example, for the normal distribution:

```r
tf_sample = function() {
  parameters <- lapply(self$parameters, function(x) x$value())
  dist <- tfp$distributions$Normal(loc = parameters$mean, scale = parameters$sd)
  tf$Session()$run(dist$sample(self$dim))
}
```

I'm not sure if there is any better infrastructure in greta to do this `tf$Session$run` idea - it does feel a bit inelegant. Does this look like a somewhat sensible idea? A big issue with this approach is how we sample distributions not implemented in tensorflow_probability, or ones for which we don't use it. Some ideas I had:

- Simulate in R. Issue - how to broadcast?
- Simulate using distribution identities - i.e. F is the ratio of chi-squared variables
- Don't simulate, and throw an error if we are told to

Do you have any ideas? All the best!
Contributor Author

### jeffreypullin commented Mar 1, 2019 • edited

Nick: Great! Probably best if you open an issue in the main GitHub repo about adding sampling, copy this stuff over, then we can discuss it there. I wasn't thinking you'd execute a tf session inside those methods, just return the tensor for the random elements (so end the function with `dist$sample(self$dim)`). Then the function that does simulation of the whole model can put that tensor wherever it needs to go. E.g. when having the variable node define itself on the dag inside the simulation function, it can use this method, and place the resulting tensor in the dag's `tf_environment` with the appropriate name (rather than collecting from the free state vector and transforming it, like it would when defining the model for inference). Whenever the tensorflow session is executed, that will then generate a different IID random draw from this distribution. For the distributions that don't have TF Probability random number generators, we'll have to code our own in tensorflow code. It's normally easiest to generate IID standard uniform variables and push them through the distribution's inverse CDF. E.g. for a logistic distribution:

```r
tf_rlogis <- function(n, location = 0, scale = 1) {
  u <- tf$random$uniform(shape(n))
  z <- log(u / (1 - u))
  z * scale + location
}

# get tensor for result
draws <- tf_rlogis(1000, 2, 1.3)

# compare with stats::rlogis
par(mfrow = c(2, 1))
sess <- tf$Session()
hist(sess$run(draws))
hist(rlogis(1000, 2, 1.3))
```
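The inverse-CDF trick above doesn't depend on tensorflow, so it can be sanity-checked in base R. A minimal sketch (assuming only base R; `qlogis()` is the closed-form logistic quantile function it should agree with):

```r
# Inverse-CDF sampling for the logistic distribution in base R:
# push IID standard uniforms through the quantile function.
set.seed(42)
u <- runif(1e5)
z <- log(u / (1 - u))   # standard logistic quantile function
draws <- z * 1.3 + 2    # scale = 1.3, location = 2

# the hand-rolled quantile agrees with the closed form
stopifnot(isTRUE(all.equal(z, qlogis(u))))
```

The same pattern is what `tf_rlogis()` implements in tensorflow code, with `tf$random$uniform()` in place of `runif()`.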
Contributor Author

### jeffreypullin commented Mar 1, 2019

Okay, that does make more sense. The functions for sampling the `dag` I've written just operate on the nodes directly, but I'm sure I can adapt them in the way you describe. Regarding the RNG generation: I'm aware of the inverse-CDF method, and I'll try to implement it where possible, but it does require that the CDF can be inverted analytically. This isn't always possible - for example, I don't think it is in the F-distribution case - so we'll have to use some other methods. A quick look through R's source shows that R samples the F distribution by sampling chi-squared distributions:

```c
#include "nmath.h"

double rf(double n1, double n2)
{
    double v1, v2;
    if (ISNAN(n1) || ISNAN(n2) || n1 <= 0. || n2 <= 0.)
        ML_ERR_return_NAN;

    v1 = R_FINITE(n1) ? (rchisq(n1) / n1) : 1;
    v2 = R_FINITE(n2) ? (rchisq(n2) / n2) : 1;
    return v1 / v2;
}
```

so that seems like a good bet in that case. I'll have a think about the many multivariate distributions, but any ideas for those would be very helpful. Thanks!
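For reference, the same chi-squared identity is easy to express in plain R (the helper name here is just for illustration, mirroring R's internal `rf` above):

```r
# An F(df1, df2) draw is the ratio of two chi-squared draws,
# each scaled by its degrees of freedom - the identity used by R's rf().
r_f_via_chisq <- function(n, df1, df2) {
  v1 <- rchisq(n, df1) / df1
  v2 <- rchisq(n, df2) / df2
  v1 / v2
}

set.seed(1)
x <- r_f_via_chisq(1e5, df1 = 5, df2 = 10)
```

In a tensorflow backend the same identity would apply, provided chi-squared (i.e. gamma) draws are available as tensors.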
Member

### goldingn commented Mar 6, 2019

Cool, yes, interesting to know about tricks like that! Sampling the lkj_distribution is a bit of a faff too; there's code in the rethinking package that we could base it on (duplicated in the greta test code here).

### cboettig commented Jul 17, 2019

 Any news on this? Still very much looking forward to simulation methods (see #286), particularly as a way of introducing / teaching `greta` (maybe this coming semester)...
Contributor Author

### jeffreypullin commented Jul 21, 2019

So I have a separate repo (i.e. not a fork of greta) which implements this here. I'd love to incorporate this into greta proper, but I haven't had much time to work on it. Also, the design I had in mind - all simulation done on the R side - is a bit different from the design @goldingn has in mind, where the simulation is done using Tensorflow. The one part of greta I really don't understand is how it pushes things to Tensorflow, so that sort of stymied me. Hopefully this will be implemented soon though!

### cboettig commented Jul 22, 2019

Thanks @jeffreypullin, really cool. Unfortunately I don't quite follow how I can simulate a time-series model in this design, since I can't even declare the model object in `greta` without providing the "data" values. E.g. let's say I want to simulate logistic growth with some Gaussian random noise:

```r
# priors
r <- uniform(0, 1)
K <- uniform(0, 10)
sigma <- uniform(0, 1)

# model
mean <- x_t + r * x_t * (1 - x_t / K)
distribution(x_t1) <- normal(mean, sigma * x_t)
m <- model(x_t, x_t1)  # ERROR, this is not valid syntax
```

(I put priors in there, but I would also like to simulate with fixed values of the parameters.) In any event, as discussed in #286, I think the trick is how to simulate `N` samples of the data (the ordered pairs `x_t` and `x_t1`), not just draws of the parameter values from the priors. To make this concrete, here's the same example in perfectly valid `nimble` code:

```r
library(nimble)

logistic <- nimbleCode({
  K ~ dunif(0, 2)
  r ~ dunif(0, 1)
  sigma ~ dunif(0, 1)
  x <- x0
  for (t in 1:(N - 1)) {
    mu[t] <- x[t] + x[t] * r * (1 - x[t] / K)
    y[t] <- x[t] * sigma
    x[t + 1] ~ dnorm(mu[t], sd = y[t])
  }
})

constants <- list(N = 1e2, x0 = 0.2)
inits <- list(r = 0.05, K = 2, sigma = 0.02)
model <- nimbleModel(logistic, constants = constants, inits = inits)

set.seed(123)
# note we define the 'data' nodes that need to be simulated
simulate(model, nodes = c("mu", "x", "y"))

# extract simulated data & plot to see results
df <- tibble(t = seq_along(model$x), x = model$x)
df %>% ggplot(aes(t, x)) + geom_point()
```

Clearly the `nimble` syntax is a bit more verbose, but it makes clear that I'm simulating the length-`N` vector `x` (and the 'internal' variables `y` and `mu`, which I think are just needed as a limitation of the syntax). Thoughts on this welcome! An implementation in the tensorflow backend would be great; I'm not sure how often the performance would be needed/better that way, but I think it would keep the internals more consistent.
I'm also happy with a pure-R implementation but I think there's still syntax things to be hammered out (or at least for me to wrap my head around!)
Member

### goldingn commented Jul 22, 2019

That code would work for a model where all the x values are observed. Because there are no unobserved x values, you can define the distribution simultaneously. But to simulate data, each x depends on the previous one, so you would need a for loop and no `distribution()` statement. Happy to help work through that; I have some tricks to make for loops more efficient in greta (though since tensorflow is designed for vectorised operations, loops will always be slow). Would you mind posting this on the forum, though? That's a better place for these conversations, and easier for others to find and read.
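To make the for-loop point concrete, here's a pure-R sketch (not greta syntax; the function name and fixed parameter values are just for illustration) of forward-simulating the logistic-growth model one step at a time:

```r
# Forward-simulate the noisy logistic growth model: each x[t + 1] is
# drawn conditional on x[t], so the recursion can't be vectorised away.
simulate_logistic <- function(N, x0, r, K, sigma) {
  x <- numeric(N)
  x[1] <- x0
  for (t in 1:(N - 1)) {
    mu <- x[t] + r * x[t] * (1 - x[t] / K)
    x[t + 1] <- rnorm(1, mean = mu, sd = sigma * x[t])
  }
  x
}

set.seed(123)
x <- simulate_logistic(N = 100, x0 = 0.2, r = 0.05, K = 2, sigma = 0.02)
```

A greta/tensorflow version would follow the same shape, with the loop building up a tensor one step at a time rather than an R vector.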
Member

### goldingn commented Jul 22, 2019

Yeah, there are two things needed for a nice greta implementation:

1. writing/exposing random generators for each of the distributions in tensorflow code
2. writing the interface and internal code so users can run `simulate()` and get results

I'll probably need to help out with the TF plumbing required for 2, but the rest should then be fairly straightforward.

### cboettig commented Jul 23, 2019

Thanks @goldingn! I've created a forum thread as you suggested, cross-linking here: https://forum.greta-stats.org/t/simulations-with-greta-including-time-series/140. Good point about the `for` loop issue, though I guess that means I'm stuck writing separate model definitions for simulating and estimating in the time-series case (since presumably it would be needlessly slow to attempt time-series estimation using the non-vectorised version). I haven't tried writing it with a `for` loop before; I think it would help to see an example. Anyway, we can continue the discussion in the forum. As always, I really appreciate all you're doing and am very excited about the potential here!
Member

### goldingn commented Aug 18, 2019

 I've opened a forum post asking for feedback on the interface to this functionality. @cboettig, if you have time please let me know what you think of the plan.