
greta goes NUTS #314

Open · jeffreypullin opened this issue Aug 20, 2019 · 37 comments

@jeffreypullin (Contributor) commented Aug 20, 2019

TensorFlow Probability has just open sourced its unrolled* NUTS implementation here.

Personally, I think having a NUTS implementation would be a big thing for greta.

*I assume this means that it is very parallel(isable)

@jeffreypullin changed the title from "Greta goes NUTS" to "greta goes NUTS" on Aug 20, 2019

@goldingn (Member) commented Aug 20, 2019

Yeah, this'll be great. I have a suspicion it'll be a fair bit slower than the HMC implementation, but it will hopefully make up for that in efficiency.

We'll need to wait for the next TFP release before we can put this in a greta release or the master branch, but it should be fairly straightforward to implement in a branch for now. We just need to copy and modify the hmc_sampler class to create a nuts_sampler class, provide a nuts() constructor like the hmc() constructor, and probably modify Travis and DESCRIPTION to use the TFP nightly builds.
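
As a rough sketch of the user-facing piece (the argument names and class plumbing below are guesses, not a settled API), the constructor could simply mirror hmc() and stash its tuning parameters:

# hypothetical sketch only: a nuts() constructor mirroring hmc();
# max_doublings / epsilon / diag_sd are assumed argument names
nuts <- function (max_doublings = 5, epsilon = 0.1, diag_sd = 1) {
  obj <- list(parameters = list(max_doublings = max_doublings,
                                epsilon = epsilon,
                                diag_sd = diag_sd))
  class(obj) <- c("nuts_sampler", "sampler")
  obj
}

# intended usage, matching the existing samplers:
# draws <- mcmc(m, sampler = nuts())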

@goldingn (Member) commented Aug 21, 2019

I just pushed up a nuts branch which implements an interface to the new sampler. You can install the required nightly TF stuff with some variant of:

install_tensorflow(version = "nightly")
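
For example, to also pull the TFP nightly build (assuming the extra_packages argument covers it, and that tfp-nightly is the right pip package name):

install_tensorflow(version = "nightly", extra_packages = "tfp-nightly")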

It looks like the API might be different (or buggy), as it's erroring for me (when I run mcmc(m, nuts())) with this message:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Cannot convert a partially known TensorShape to a Tensor: (11, ?, 1)

Will try to take a look at that soon.

I haven't changed any of the Travis instructions, so I'm expecting it to fail there too.

@jeffreypullin (Contributor, Author) commented Aug 21, 2019

I’m happy to look into that error/the interface if you would like - I should have some time this weekend. (I was actually going to offer to submit a pull request to add the nuts sampler - but you beat me to that! 😊)

@goldingn (Member) commented Aug 21, 2019

Great, please do! And let me know if you need help navigating the batch dimension stuff. It isn't very intuitive.
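
(Roughly: the state tensor carries a leading chain/batch dimension, and the target log-prob function has to return one value per chain. A toy illustration, not greta code:)

library(tensorflow)
# 3 chains of a 5-parameter model: state has shape (3, 5)
state <- tf$zeros(shape(3, 5))
# the log-prob reduces over the parameter dimension only,
# leaving one value per chain (shape 3)
lp <- tf$reduce_sum(-state ^ 2, axis = 1L)
lp$shape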

@goldingn (Member) commented Aug 22, 2019

I couldn't help myself, so here's a reprex of the bug in R tensorflow.

The error doesn't occur when the initial values have known dimensions, so it looks like either some unaccounted-for broadcasting issue, or the error is an artefact of using placeholders (a TF 1 concept) rather than eager execution and tf_function. If the latter, it'll require a significant amount of work to make this work with greta (though that's something that will need doing eventually anyway).

library(tensorflow)
library(tfprobability)

# Target distribution is proportional to: `exp(-x (1 + x))`.
unnormalized_log_prob <- function (x) {
  tf$reduce_sum(-x - x ^ 2, axis = 1L)
}

run_chain <- function (kernel) {
  
  # Run the chain (with burn-in).
  tfp$mcmc$sample_chain(
    num_results = 10L,
    current_state = vals,
    kernel = kernel,
    trace_fn = function (current_state, kernel_results) {kernel_results}
  )

}

# set up kernels
hmc <- tfp$mcmc$HamiltonianMonteCarlo(
    target_log_prob_fn = unnormalized_log_prob,
    num_leapfrog_steps = 3L,
    step_size = 1
)

nuts <- tfp$mcmc$NoUTurnSampler(
  target_log_prob_fn = unnormalized_log_prob,
  max_tree_depth = 3L,
  step_size = 1
)

vals <- tf$compat$v1$placeholder(dtype = tf$float32, shape = shape(NULL, 100))

hmc_output <- run_chain(hmc)

# this errors because of the unknown dimension!
nuts_output <- run_chain(nuts)
 Error in py_call_impl(callable, dots$args, dots$keywords) : 
 ValueError: Cannot convert a partially known TensorShape to a Tensor: (4, ?, 100) 

For completeness, here's the code to execute the inference for HMC:

# defining the number of chains now, via initial values
n_chains <- 3
init <- matrix(rnorm(100 * n_chains), n_chains, 100)
feed_dict <- dict(vals = init)

sess <- tf$compat$v1$Session()
system.time(res <- sess$run(hmc_output, feed_dict = feed_dict))

@junpenglao commented Aug 22, 2019

I think this is an error with dynamic shape, which we have never tested in unrolled NUTS - I just filed a bug on the TFP side: tensorflow/probability#530. I will prepare a fix.

@goldingn (Member) commented Aug 22, 2019

Awesome, thanks @junpenglao!

@goldingn (Member) commented Aug 22, 2019

Other things we still need to do on the greta side:

  • add tests, like for hmc: here and here
  • install TF/TFP nightlies on travis for the nuts branch

A speed test of NUTS vs vanilla HMC (using the above code but with fixed shapes) shows a fairly major slowdown on a small dataset with few chains. That's to be expected, given how much more complicated the unrolled algorithm must be to make NUTS static-graph and vectorised! Though I would expect there are many cases (large numbers of chains, computationally-intensive objective functions, complex posterior geometry) where NUTS is more efficient. I'd be keen to try this method of tuning the marginal distribution of L and see how it compares with vanilla HMC and NUTS.

@junpenglao commented Aug 22, 2019

In our internal tests NUTS is at most 2x slower than HMC (note that to compare the two fairly, you need to make sure both take roughly the same number of leapfrog steps). You can narrow the gap by using a large number of chains and running under XLA.

Note that you can already do something similar to eHMC: first run NUTS, then use the histogram of leapfrog_computed (you can get it from the previous_kernel_result) as the empirical distribution of num_leapfrog in HMC (you will need to modify tfp.mcmc.hmc to accept a callable as num_leapfrog).
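
A minimal sketch of that idea on the R side (the leapfrogs vector here is just a stand-in for whatever you pull out of the traced NUTS kernel results):

# stand-in for leapfrog counts traced during a NUTS warm-up run
leapfrogs <- c(3L, 7L, 7L, 15L, 7L, 3L, 7L)

# empirical distribution of L, and a function to draw from it
emp_probs <- table(leapfrogs) / length(leapfrogs)
draw_num_leapfrog <- function () {
  sample(as.integer(names(emp_probs)), size = 1, prob = as.numeric(emp_probs))
}
draw_num_leapfrog()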

@goldingn (Member) commented Aug 22, 2019

That's really useful, thanks!

Yes, I was thinking about how to modify hmc to jitter the number of leapfrog steps. Is it also possible to do this using the trace_fn argument to sample_chain, modifying num_leapfrog before returning it for the next step?

@junpenglao commented Aug 22, 2019

Yes, but you will need to write another transition kernel that wraps around hmc, similar to https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/mcmc/simple_step_size_adaptation.py but modifying the number of leapfrog steps instead.

@junpenglao commented Aug 22, 2019

Actually, a slightly hacky way to do it is to modify hmc.py, change the num_leapfrog_steps arg to a callable num_leapfrog_steps_fn, and pass a random function to it:

# imports assumed for this snippet
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# assumes a locally modified HamiltonianMonteCarlo that accepts
# num_leapfrog_steps_fn as a callable
kernel = HamiltonianMonteCarlo(
    target_log_prob_fn=tfd.Independent(tfd.Normal(tf.zeros(10), 1.), 1).log_prob,
    num_leapfrog_steps_fn=lambda _: tfd.Categorical(
        # here we have our empirical distribution for L
        probs=np.asarray([.15, .15, .5, .15, .05])).sample() + tf.constant(3, dtype=tf.int32),
    step_size=1.)

@tf.function
def run_chain():
  # Run the chain (with burn-in).
  samples, is_accepted = tfp.mcmc.sample_chain(
      num_results=1000,
      num_burnin_steps=100,
      current_state=tf.zeros(10),
      kernel=kernel,
      trace_fn=lambda _, pkr: pkr.is_accepted)

  sample_stddev = tf.math.reduce_std(samples)
  is_accepted = tf.reduce_mean(tf.cast(is_accepted, dtype=tf.float32))
  return samples, sample_stddev, is_accepted

samples, sample_stddev, is_accepted = run_chain()

You can also do a dynamic number of steps with a fixed path length, which is PyMC3's current HMC:

path_length = 5.
kernel = HamiltonianMonteCarlo(
    target_log_prob_fn=tfd.Independent(tfd.Normal(tf.zeros(10), 1.), 1).log_prob,
    num_leapfrog_steps_fn=lambda step_size: tf.cast(path_length/step_size, tf.int32),
    step_size=1.)

@junpenglao commented Aug 23, 2019

FYI, dynamic shape should be fixed now: tensorflow/probability@23573a1

@goldingn (Member) commented Aug 27, 2019

Awesome, thanks for fixing that so quickly!

I get an error about shape invariants when using this in greta with the nuts branch:

library(greta)
x <- normal(0, 1)
m <- model(x)
draws <- mcmc(m, nuts())
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Input tensor 'mcmc_sample_chain/trace_scan/while/smart_for_loop/while/NoUTurnSampler/.one_step/mul:0' enters the loop with shape (2, ?, 1), but has shape (2, ?, ?) after one iteration. To allow the shape to vary across iterations, use the `shape_invariants` argument of tf.while_loop to specify a less-specific shape. 

But I don't get an error when doing this with the R tensorflow example above, so there's something funky going on and it may be a problem on the greta side. I should be able to take a look at it this week.

@goldingn (Member) commented Aug 27, 2019

Oh, it looks like this is an issue (in the R tensorflow code too) when the input dimensions are (?, 1), so probably still a problem with dynamic shape @junpenglao?

This code is as before, but with shape(NULL, 100) changed to shape(NULL, 1) when defining the placeholder for the parameters:

library(tensorflow)
library(tfprobability)

# Target distribution is proportional to: `exp(-x (1 + x))`.
unnormalized_log_prob <- function (x) {
  tf$reduce_sum(-x - x ^ 2, axis = 1L)
}

run_chain <- function (kernel) {
  
  # Run the chain (with burn-in).
  tfp$mcmc$sample_chain(
    num_results = 10L,
    current_state = vals,
    kernel = kernel,
    trace_fn = function (current_state, kernel_results) {kernel_results}
  )
  
}

# set up kernels
hmc <- tfp$mcmc$HamiltonianMonteCarlo(
  target_log_prob_fn = unnormalized_log_prob,
  num_leapfrog_steps = 3L,
  step_size = 1
)

nuts <- tfp$mcmc$NoUTurnSampler(
  target_log_prob_fn = unnormalized_log_prob,
  max_tree_depth = 3L,
  step_size = 1
)

vals <- tf$compat$v1$placeholder(dtype = tf$float32, shape = shape(NULL, 1))

unnormalized_log_prob(vals)
hmc_output <- run_chain(hmc)

# this errors because of the unknown dimension!
nuts_output <- run_chain(nuts)
 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Input tensor 'mcmc_sample_chain_1/trace_scan/while/smart_for_loop/while/NoUTurnSampler/.one_step/mul:0' enters the loop with shape (2, ?, 1), but has shape (2, ?, ?) after one iteration. To allow the shape to vary across iterations, use the `shape_invariants` argument of tf.while_loop to specify a less-specific shape. 

@junpenglao commented Aug 27, 2019

Thanks for reporting, I will take a look.

@goldingn (Member) commented Aug 27, 2019

For now, it looks like the nuts branch works with the nightlies, at least provided the parameter space is larger than 1!
Here's a sampler efficiency comparison on my laptop with a 1000D standard normal:

set.seed(2019-08-27)

library(greta)
x <- normal(0, 1, dim = 1000)
m <- model(x, precision = "single")

# get (minimum) effective samples per second during sampling (don't time warmup period)
efficiency <- function(sampler) {
  draws <- mcmc(m, sampler, n_samples = 1, verbose = FALSE)
  time <- system.time(
    draws <- extra_samples(draws, 1000, verbose = FALSE)
  )
  size <- coda::effectiveSize(draws)
  min(size) / time["elapsed"]
}

hmc_efficiency <- efficiency(hmc())
nuts_efficiency <- efficiency(nuts())
rwmh_efficiency <- efficiency(rwmh())
slice_efficiency <- efficiency(slice())

efficiencies <- round(c(hmc_efficiency, nuts_efficiency, rwmh_efficiency, slice_efficiency), 1)
names(efficiencies) <- c("hmc", "nuts", "rwmh", "slice")
efficiencies
  hmc  nuts  rwmh slice 
279.8 137.2   2.8   0.8 

So we're still paying enough of a cost in the NUTS code to halve the effective samples per second in this example. It could well be worth that cost in an example with more variable geometry.

@junpenglao commented Aug 27, 2019

Hmmm, I can't seem to reproduce the error in Python - is the code you are running in R spelled out correctly like this in Python?
https://colab.research.google.com/drive/16Fu1kRdf11PeRy3kqjgdMsWqDo7FQPtV

I tested in TF1 and TF2.

@junpenglao commented Aug 27, 2019

What TF version are you currently using?

@goldingn (Member) commented Aug 27, 2019

I just left the office and won't be able to have a play until tomorrow, unfortunately.

But reading from my phone, the main difference is that you are using a placeholder with a default, whereas I was running without a default.

Otherwise the only difference I see is the objective using integer 2 instead of float 2 (2 in R is 2. in Python).

@goldingn (Member) commented Aug 27, 2019

I was running with the current TF1 nightly (cpu, 1.15.?)

@junpenglao commented Aug 27, 2019

Thanks! Let me try the one without a default.

@junpenglao commented Aug 27, 2019

Yep, can confirm the bug using:

# imports assumed for this snippet
import numpy as np
import tensorflow as tf
import tensorflow.compat.v1 as tf1
import tensorflow_probability as tfp

batch_shape = 10

def unnormalized_log_prob(x):
  return tf.reduce_sum(-x-x**2, axis=-1)

for shape in [1, 100]:
#   x_init = tf1.placeholder_with_default(np.zeros((batch_shape, shape), dtype=np.float32), shape=None)
  x_init = tf1.placeholder(dtype=tf.float32, shape=tf.TensorShape([None, 1]))
  unnormalized_log_prob(x_init)

  def run_chain():
    kernel1 = tfp.mcmc.NoUTurnSampler(
        unnormalized_log_prob,
        step_size=0.3)
    trace = tfp.mcmc.sample_chain(num_results=7,
                                  current_state=x_init,
                                  kernel=kernel1,
                                  trace_fn=None)
    return trace

  if tf.executing_eagerly():
    trace = tf.function(run_chain, autograph=False)()
  else:
    with tf1.Session() as sess:
      trace = sess.run(run_chain(), feed_dict={x_init: np.zeros((batch_shape, shape), dtype=np.float32)})
  print(trace.shape)

@junpenglao commented Aug 27, 2019

OK, I see what happened - the tensors lose their shape in loop_tree_doubling of NUTS, and it could be fixed by wrapping the output of the function with tf.ensure_shape. However, I am not sure this is worth it, as tf.placeholder will be deprecated in TF2. @goldingn how difficult would it be to update your codebase from tf.placeholder to tf.placeholder_with_default? You can use a zeros tensor and give the batch shape an arbitrary number.

@skeydan commented Aug 27, 2019

Hi @goldingn I know I'm late to the party but I'm adding NUTS here: rstudio/tfprobability#75

(((just in case this makes your testing easier)))

@goldingn (Member) commented Aug 27, 2019

Yeah, it should be fine to use placeholder with default in future versions of greta. I'll implement that on the nuts branch soon to make sure. Thanks @junpenglao!

@goldingn (Member) commented Aug 27, 2019

Thanks @skeydan! Looking forward to tfprobability hitting CRAN so we can depend on it in future releases.

@skeydan commented Aug 27, 2019

We'll work on it!

@goldingn (Member) commented Aug 28, 2019

@junpenglao this works if I set the placeholder-with-default with completely undefined shape, like:

vals <- tf$compat$v1$placeholder_with_default(
  tf$zeros(shape(1, 1), dtype = tf$float32),
  shape = NULL
)

but not if I provide a default but partially-defined shape:

tf$compat$v1$placeholder_with_default(
  tf$zeros(shape(1, 1), dtype = tf$float32),
  shape = shape(NULL, 1)
)

Same error as before. Only errors when the second dimension is 1, fine if it's anything else.

Equivalent code in your Python example would be:

tf1.placeholder_with_default(np.zeros((batch_shape, shape), dtype=np.float32), shape=tf.TensorShape([None, 1]))

Using placeholder-with-default is fine in greta, but having the tensors have completely undefined shape would definitely be a problem - there are a lot of functions that operate based on the dimensions of the input.

How difficult do you think it would be to get NUTS working with partially-known shape like the other samplers?

@junpenglao commented Aug 28, 2019

I see, yep this is a valid point and it definitely needs a fix - thank you for digging in! I will send out a fix shortly :-)

@goldingn (Member) commented Aug 28, 2019

Thanks!

@junpenglao commented Aug 28, 2019

BTW, I am curious about:

Using placeholder-with-default is fine in greta, but having the tensors have completely undefined shape would definitely be a problem - there are a lot of functions that operate based on the dimensions of the input.

I would imagine that the dimensions of the input in this case are static, as they are the event_shape of the random variable - maybe it is better practice to pass the static value around instead of inferring the shape at run time? That's also the main idea behind https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/internal/prefer_static.py - basically, you try to move as many of the static values as possible outside of the graph, so that your graph is smaller (faster to compile, and potentially faster to execute as well).

tensorflow-copybara pushed a commit to tensorflow/probability that referenced this issue Aug 28, 2019
Bug fix in Unrolled NUTS to make sure it does not lose shape for event_shape=1.

This bug is discovered by greta developer (see greta-dev/greta#314)

PiperOrigin-RevId: 265875343

@goldingn (Member) commented Aug 28, 2019

Right, in greta the shapes are generally (always?) known before run time. The only thing that is dynamic is the first dimension of the variable, which we use to set either the number of chains, or the number of samples when computing posteriors over downstream variables. So all dimensions other than one are static in the graph. That unknown dimension is set in one placeholder, and then propagated through the graph.

But greta uses a bunch of R functions to build up a TF graph automatically for the user, based on their model. These R functions mostly translate existing mathematical operations in R into TensorFlow. There are quite often situations in those functions (e.g. here) where the dimensions of the tensors are needed to work out how to do the operation. But those dimensions are then static in the graph.

So we use the shapes of the tensors in several places to build the graph, but that's problematic when the placeholder has unknown shape, since everything downstream of that has unknown shape and we can't use these functions that are shape-dependent.
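
For illustration only (a made-up helper, not one of greta's actual functions), this is the kind of shape-dependent branching that falls over when the static shape is unknown:

library(tensorflow)

# made-up example of shape-dependent graph building: dim() reads the
# *static* shape, so if the placeholder's shape is completely unknown
# it comes back as NULL/NAs and the branching below can't be done
tf_sweep_means <- function (x) {
  dims <- dim(x)
  if (length(dims) == 3) {
    # has a leading batch dimension; reduce over the trailing dims
    tf$reduce_mean(x, axis = c(1L, 2L))
  } else {
    tf$reduce_mean(x)
  }
}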

@junpenglao commented Aug 29, 2019

I see. Yeah, to have it work with a None shape you'd probably need to make those functions take an additional shape argument, which could be quite annoying to specify everywhere. In any case, the PR is merged now, so you should see no errors any more.

@goldingn (Member) commented Sep 4, 2019

Thanks @junpenglao! Took me a while to get to this, but the placeholder fix works just fine.

Unfortunately the new sampler is tripping up greta's more stringent sampler tests - the variances and correlation of a bivariate normal appear to be biased. I'll open an issue on the TFP repo to see whether this is expected, or whether my code is wrong somehow.

@goldingn (Member) commented Sep 5, 2019

Thanks for the fix @junpenglao! Can confirm it's working as expected with greta's posterior tests.

So the nuts sampler looks like it's good to go, and can be merged into master as soon as NUTS is in a TensorFlow Probability release.

@junpenglao commented Sep 6, 2019

WOOHOO!!! We are preparing the 0.8 release, so 🔜
