
Add sampling #269

Open
jeffreypullin opened this issue Mar 1, 2019 · 11 comments

Comments

@jeffreypullin (Contributor) commented Mar 1, 2019

Continued from thread on forum:

@jeffreypullin (Contributor, Author) commented Mar 1, 2019

Jeffrey:

Okay, I've had a bit of a go at an implementation (you can see it here), but I've run into a few problems. (By the way, I'm not sure if this is the best place for this sort of discussion, but I'm not sure where else to have it.)

The basic idea for simulating from a distribution is a new method on every distribution, tf_sample(), which is called from the wrapper sample() at the distribution-node level. For example, for the normal distribution:

tf_sample = function() {

      # fetch tensors for the current parameter values
      parameters <- lapply(self$parameters, function(x) x$value())
      dist <- tfp$distributions$Normal(loc = parameters$mean,
                                       scale = parameters$sd)
      # evaluate a draw with the node's dimension
      tf$Session()$run(dist$sample(self$dim))

    }

I'm not sure if there is any better infrastructure in greta for this tf$Session()$run idea - it does feel a bit inelegant.

Does this look like a somewhat sensible idea?

A big issue with this approach is how to sample from distributions that aren't implemented in tensorflow_probability, or ones for which we don't use it. Some ideas I had:

  • Simulate in R. Issue: how do we broadcast?
  • Simulate using distribution identities - e.g. an F variate is the ratio of two scaled chi-squared variates
  • Don't simulate, and throw an error if we're asked to
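
On the broadcasting question in the first bullet, the usual trick is to broadcast the parameter arrays to a common shape and then prepend a sample dimension. A minimal sketch in Python/NumPy for illustration (the helper name `r_sample_broadcast` is mine, not from greta):

```python
import numpy as np

def r_sample_broadcast(n, loc, scale, rng=None):
    """Draw n normal samples per element, broadcasting the parameters
    to a common shape and prepending a sample dimension."""
    rng = np.random.default_rng(7) if rng is None else rng
    loc, scale = np.broadcast_arrays(np.asarray(loc), np.asarray(scale))
    return rng.normal(loc, scale, size=(n,) + loc.shape)

draws = r_sample_broadcast(4, loc=[0.0, 10.0], scale=1.0)
print(draws.shape)  # (4, 2): scalar scale broadcast against length-2 loc
```

The same idea carries over to R with explicit recycling, or to tensorflow, where broadcasting is built into the ops.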

Do you have any ideas?

All the best!

@jeffreypullin (Contributor, Author) commented Mar 1, 2019

Nick:

Great! Probably best if you open an issue in the main GitHub repo about adding sampling, copy this stuff over, then we can discuss it there.

I wasn't thinking you'd execute a tf session inside those methods; just return the tensor for the random elements (so end the function with dist$sample(self$dim)).

Then the function that simulates the whole model can put that tensor wherever it needs to go. E.g. when having the variable node define itself on the dag inside the simulation function, it can use this method and place the resulting tensor in the dag's tf_environment under the appropriate name (rather than collecting it from the free state vector and transforming it, like it would when defining the model for inference). Whenever the tensorflow session is executed, that will then generate a different IID random draw from this distribution.

For the distributions that don't have TF probability random number generators, we'll have to write our own in tensorflow code. It's normally easiest to generate IID standard uniform variables and push them through the distribution's inverse CDF. E.g. for a logistic distribution:

tf_rlogis <- function (n, location = 0, scale = 1) {
  # IID standard uniforms
  u <- tf$random$uniform(shape(n))
  # inverse CDF of the standard logistic
  z <- log(u / (1 - u))
  z * scale + location
}

# get tensor for result
draws <- tf_rlogis(1000, 2, 1.3)

# compare with stats::rlogis
par(mfrow = c(2, 1))
sess <- tf$Session()
hist(sess$run(draws))
hist(rlogis(1000, 2, 1.3))
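
The inverse-CDF trick is also easy to sanity-check numerically outside of tensorflow. A hedged sketch in Python/NumPy (the function name is mine): push uniforms through the logistic inverse CDF and compare the sample moments with the known mean (location) and standard deviation (scale * pi / sqrt(3)):

```python
import numpy as np

def rlogis_inv_cdf(n, location=0.0, scale=1.0, rng=None):
    """Sample the logistic distribution by pushing IID standard
    uniforms through its inverse CDF."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.uniform(size=n)      # IID standard uniforms
    z = np.log(u / (1.0 - u))    # inverse CDF of the standard logistic
    return z * scale + location

draws = rlogis_inv_cdf(100_000, location=2.0, scale=1.3)
print(draws.mean(), draws.std())  # close to 2 and 1.3 * pi / sqrt(3)
```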

@jeffreypullin (Contributor, Author) commented Mar 1, 2019

Okay, that does make more sense. The functions for sampling the dag I've written operate on the nodes directly, but I'm sure I can adapt them in the way you describe.

Regarding the rng generation: I'm aware of the inverse-CDF method, and I'll try to implement it where possible, but it requires that the CDF can be inverted analytically. That isn't always possible - for example, I don't think it is for the F distribution - so we'll have to use other methods in those cases. A quick look through R's source shows that R samples the F distribution as a ratio of chi-squared draws:

#include "nmath.h"

double rf(double n1, double n2)
{
    double v1, v2;
    if (ISNAN(n1) || ISNAN(n2) || n1 <= 0. || n2 <= 0.)
	ML_ERR_return_NAN;

    v1 = R_FINITE(n1) ? (rchisq(n1) / n1) : 1;
    v2 = R_FINITE(n2) ? (rchisq(n2) / n2) : 1;
    return v1 / v2;
}

so that seems like a good bet in that case. I'll have a think about the many multivariate distributions, but any ideas for those would be very helpful.
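
The chi-squared-ratio identity is easy to verify numerically. A sketch in Python/NumPy (the helper `rf_ratio` is my name; it just mirrors the structure of R's rf() above):

```python
import numpy as np

def rf_ratio(n, df1, df2, rng=None):
    """Sample F(df1, df2) as a ratio of mean-scaled chi-squared draws,
    mirroring the structure of R's rf()."""
    rng = np.random.default_rng(1) if rng is None else rng
    v1 = rng.chisquare(df1, size=n) / df1
    v2 = rng.chisquare(df2, size=n) / df2
    return v1 / v2

draws = rf_ratio(200_000, 5.0, 20.0)
print(draws.mean())  # close to 20 / 18, the F(5, 20) mean d2 / (d2 - 2)
```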

Thanks!

@goldingn (Member) commented Mar 6, 2019

Cool, yes, interesting to know about tricks like that!

Sampling the lkj_distribution is a bit of a faff too; there's code in the rethinking package that we could base it on (duplicated in the greta test code here).

@cboettig commented Jul 17, 2019

Any news on this? I'm still very much looking forward to simulation methods (see #286), particularly as a way of introducing / teaching greta (maybe this coming semester)...

@jeffreypullin (Contributor, Author) commented Jul 21, 2019

So I have a separate repo (i.e. not a fork of greta) which implements this here. I'd love to incorporate it into greta proper, but I haven't had much time to work on it. Also, the design I had in mind - all simulation done on the R side - is a bit different from the design @goldingn has in mind, where the simulation is done using Tensorflow. The one part of greta I really don't understand is how it pushes things to Tensorflow, so that rather stymied me. Hopefully this will be implemented soon, though!

@cboettig commented Jul 22, 2019

Thanks @jeffreypullin, really cool. Unfortunately, I don't quite follow how I can simulate a time-series model in this design, since I can't even declare the model object in greta without providing the "data" values.
e.g. let's say I want to simulate logistic growth with some Gaussian random noise:

# priors
r <- uniform(0, 1)
K <- uniform(0, 10)
sigma <- uniform(0, 1)
# Model
mean <- x_t + r * x_t * (1 - x_t / K) 
distribution(x_t1) <- normal(mean, sigma * x_t)
m <- model(x_t, x_t1) # ERROR, this is not valid syntax

(I put priors in there, but I would also like to simulate with fixed parameter values. In any event, as discussed in #286, I think the trick is how to simulate N samples of the data (the ordered pairs x_t and x_t1), not just draws of the parameter values from the priors.)

To make this concrete, here's the same example in perfectly valid nimble code:

library(nimble)
library(tibble)
library(ggplot2)

logistic  <- nimbleCode({
  K ~ dunif(0,2)
  r ~ dunif(0,1)
  sigma ~ dunif(0,1)
  x[1] <- x0
  for(t in 1:(N-1)){
    mu[t] <- x[t] + x[t] * r * (1 - x[t] / K) 
    y[t] <- x[t] * sigma
    x[t+1] ~ dnorm(mu[t], sd = y[t])
  }
  
})


constants <- list(N = 1e2, x0 = 0.2)
inits <- list(r = 0.05, K = 2, sigma = 0.02)

model <- nimbleModel(logistic, constants = constants, inits = inits)

set.seed(123)
# note we define the 'data' nodes that need to be simulated 
simulate(model, nodes = c("mu", "x", "y"))

# extract simulated data & plot to see results
df <- tibble(t = seq_along(model$x), x = model$x)
df %>% ggplot(aes(t,x)) + geom_point()

Clearly the nimble syntax is a bit more verbose, but it makes it clear that I'm simulating the length-N vector x (and the 'internal' variables y and mu, which I think are only needed as a limitation of the syntax). Thoughts on this welcome!

An implementation in the tensorflow backend would be great. I'm not sure how often the performance would be needed or better that way, but I think it would keep the internals more consistent. I'm also happy with a pure-R implementation, but I think there are still syntax questions to be hammered out (or at least for me to wrap my head around!).

@goldingn (Member) commented Jul 22, 2019

That code would work for a model with all the x values observed; because there are no unobserved x values, you can define the distribution over all of them simultaneously. But to simulate data, each x depends on the previous one, so you would need a for loop, and no distribution() statement.
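
For intuition, forward simulation of the logistic-growth model above is exactly such a loop: each state is drawn conditional on the previous one. A hedged sketch in Python/NumPy (not greta code; the function name is mine):

```python
import numpy as np

def simulate_logistic(n_steps, x0, r, K, sigma, rng=None):
    """Forward-simulate x[t+1] ~ Normal(mu[t], sigma * x[t]) with
    mu[t] = x[t] + r * x[t] * (1 - x[t] / K)."""
    rng = np.random.default_rng(42) if rng is None else rng
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(n_steps - 1):
        mu = x[t] + r * x[t] * (1.0 - x[t] / K)
        x[t + 1] = rng.normal(mu, sigma * x[t])
    return x

x = simulate_logistic(100, x0=0.2, r=0.05, K=2.0, sigma=0.02)
```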

Happy to help work through that; I have some tricks to make for loops more efficient in greta (though since tensorflow is designed for vectorised operations, loops will always be slow). Would you mind posting this on the forum, though? That's a better place for these conversations, and easier for others to find and read.

@goldingn (Member) commented Jul 22, 2019

Yeah, there are two things needed for a nice greta implementation:

  1. writing/exposing random generators for each of the distributions in tensorflow code
  2. writing the interface and internal code so users can run simulate() and get results

I'll probably need to help out with the TF plumbing required for 2, but the rest should then be fairly straightforward.

@cboettig commented Jul 23, 2019

Thanks @goldingn ! I've created a forum thread as you suggested, cross-linking here: https://forum.greta-stats.org/t/simulations-with-greta-including-time-series/140.

Good point about the for-loop issue, though I guess that means I'm stuck writing separate model definitions for simulating and estimating in the time-series case (since presumably it would be needlessly slow to attempt time-series estimation using the non-vectorised version). I haven't tried writing it with a for loop before; I think it would help to see an example. Anyway, we can continue the discussion in the forum. As always, I really appreciate all you're doing, and I'm very excited about the potential here!

@goldingn (Member) commented Aug 18, 2019

I've opened a forum post asking for feedback on the interface to this functionality. @cboettig, if you have time please let me know what you think of the plan.
