Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing values in MVN distribution #1126

Open
tjmckinley opened this issue May 7, 2021 · 11 comments
Open

Missing values in MVN distribution #1126

tjmckinley opened this issue May 7, 2021 · 11 comments

Comments

@tjmckinley
Copy link

Hi,

Firstly, thanks so much for developing NIMBLE - it's been fantastic and we're using it for all sorts of interesting problems. However, I've run into a slight problem, which I think might be a bug.

I've been trying to fit multivariate normal models with some missing data (but only in a subset of the dimensions). In this case it seems that NIMBLE will essentially treat the entire row as missing, rather than just one element, and assigns a posterior predictive sampler accordingly. I've included a reproducible example below

## load libraries
library(nimble)
library(MASS)

## simulate some data
dataset <- mvrnorm(100, c(1, 2), diag(2))

## set some missing data
dataset[1, 1] <- NA

## set model code
code <- nimbleCode({
    for(i in 1:N) {
        y[i, ] ~ dmnorm(mu[], cov = sigma[, ])  
    }
    mu[] ~ dmnorm(mu0[], cov = sigma0[, ])  
})

## set up other components of model
consts <- list(N = nrow(dataset))
data <- list(
    y = as.matrix(dataset),
    mu0 = c(0, 0),
    sigma0 = diag(2),
    sigma = diag(2)
)

## sample initial values
initFn <- function(mu0, sigma0, y) {
    mu <- mvrnorm(1, mu0, sigma0)
    y1 <- y
    y1[!is.na(y)] <- NA
    y1[is.na(y)] <- 0
    list(mu = mu, y = y1)
}

## define the model, data, inits and constants
model <- nimbleModel(
    code = code, 
    constants = consts, 
    data = data, 
    inits = initFn(data$mu0, data$sigma0, data$y))

## compile the model
cmodel <- compileNimble(model)

## set monitors
config <- configureMCMC(cmodel, monitors = c("mu", "y"), thin = 1)

## check monitors and samplers
config$printMonitors()
config$printSamplers()

## build the model
built <- buildMCMC(config)
cbuilt <- compileNimble(built)

## run the model
cbuilt$run(niter = 10)

Here I only expect y[1, 1] to be updated and not y[1, 2]. I can overcome this (I think) by fitting the MVN as a product of the conditionals in an appropriate way, but I thought I'd flag this.

Many thanks,

TJ

@danielturek
Copy link
Member

danielturek commented May 7, 2021 via email

@tjmckinley
Copy link
Author

Thanks Daniel.

That makes sense, and your approach to model the conditionals aligns with what we were thinking of doing to overcome this in this case, so that's great!

However, there are a couple of points that may (or may not) be worth considering.

Firstly, NIMBLE doesn't throw an error or warning that it treats all dimensions as missing, so an unwary user might not realise what is happening and might think the updates are only being done in the missing dimensions. (Though of course they should check!)

Secondly, the updates for a more general MV distribution could be done using e.g. random-walk samplers or suchlike, even if the full conditionals are not known. Would this be feasible and/or useful?

Cheers,

TJ

@danielturek
Copy link
Member

Useful, for certain, but my concern is that some of the conditional distributions (up to a constant of proportionality) may be difficult to derive analytical expressions for, in the non-normal case, which are necessary for general MCMC sampling algorithms (including RW Metropolis-Hastings). But this is something worth considering, as well as the warning message you suggested. Thank you.

@tjmckinley
Copy link
Author

No worries at all. Thanks for your reply and thanks once again for developing NIMBLE.

@tjmckinley
Copy link
Author

So I've been thinking about this and I'm not quite following why analytical forms for the conditionals would be required for a RW M-H algorithm, since you could just use the ratio of the joint densities. For example, if I had a generic multivariate variable say, with joint density , then if I did a RW Metropolis proposal for , conditional on the other dimensions / parameters being fixed, then the MH acceptance probability is:

where is the proposed value. Asymmetric or MV RW proposals could be implemented in the usual way, in which case all that you would need is the joint MV density and some index vector telling you which dimensions to update accordingly.

Just a thought in case it might be a useful feature. Apologies if I'm missing something obvious.

@tjmckinley
Copy link
Author

In case it's of interest, I've attached some code below which seems to work without requiring the conditionals. I've used the standard RW and RW_block sampler code, but amended to work with MV nodes where only subsets are updated (I did remove the reflective sampling bit of the standard RW when I was testing out, sorry. Otherwise they should be very close to the originals and hence allow for adaptation etc.). The dimensions to update are passed in as control variables, hence the setup loop.

The reprex4.R file contains the run-time code, but the other two files need to be in the working directory and contain the custom samplers.

reprex4.zip

Man, I love NIMBLE! Many thanks.

@danielturek
Copy link
Member

danielturek commented May 17, 2021 via email

@tjmckinley
Copy link
Author

Hi Daniel,

No worries at all and thanks for the response. Of course, there's no pressure to adopt any of these ideas, but if they're useful to build upon then that's great.

Much appreciated,

TJ

@paciorek
Copy link
Contributor

I'm finally looking back into the question of warning users about multivariate nodes that have inconsistent data flagging amongst the scalar elements of the node. We're treating this very poorly, as isData just returns the T/F for the first element of the node. We have a comment about this in isDataFromGraphID but no warning to user. What this means is that if we have, say,

y[1:2] ~ dmnorm(mu[1:2], pr[1:2, 1:2])

and set y=c(NA,3) vs. y=c(3,NA), we get different behavior as to whether the multivariate node is considered data (no in the first case and yes in the second case).

My suggestion here is to modify isDataFromGraphID() to:

  1. Warn users if a multivariate node has mixed T/F in the model's isDataEnv, which is the canonical source of info about whether something is data.
  2. We set the result of isDataFromGraphID to always either be TRUE or FALSE in mixed T/F cases.

So we'd have to decide whether we want to always set it TRUE or always FALSE.

If we always set it to be TRUE, that will hit the user in the eyes, because the likelihood will always be NA. I don't see a way for them to then run an MCMC because of the various processing steps that check the data flag, though there might be a hack.

If we always set to FALSE, a user's MCMC will run, but it will initialize the data values that the user will have expected to stay as the values they originally gave. However, a user could set up a specialized sampler to handle things, and in that sampler they could avoid changing the actual data values. I think they'd need to set initial values for all the NA values so that our initialization doesn't overwrite the entire node. Haven't looked carefully at this. Perhaps what we might want to do is modify our initialization to not initialize such mixed nodes so that we are never opaquely overwriting elements a user expects not to be overwritten. However simulate would still overwrite, so we wouldn't be handling everything.

Given all this, I wonder if always setting to TRUE is best as it forces a user to confront the situation. Then if a user really wants to handle this case, they could instead have data that are not provided as data and do their own careful user-defined MCMC sampling.

@perrydv @danielturek, I'd like your input here, but it may be best to discuss at our next meeting.

@danielturek
Copy link
Member

I think this is worthy of group discussion at some point. I agree, it needs attention.

@paciorek
Copy link
Contributor

PR #1165 is fixing the behavior of isData.

As far as potential new functionality for such mixed cases such as handling in MCMC, that is of interest but not something the core development team will have time for in the foreseeable future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants