Missing values in MVN distribution #1126

tjmckinley · 2021-05-07T15:59:18Z

Hi,

Firstly, thanks so much for developing NIMBLE - it's been fantastic and we're using it for all sorts of interesting problems. However, I've run into a slight problem, which I think might be a bug.

I've been trying to fit multivariate normal models with some missing data (but only in a subset of the dimensions). In this case it seems that NIMBLE essentially treats the entire row as missing, rather than just one element, and assigns a posterior predictive sampler accordingly. I've included a reproducible example below:

## load libraries
library(nimble)
library(MASS)

## simulate some data
dataset <- mvrnorm(100, c(1, 2), diag(2))

## set some missing data
dataset[1, 1] <- NA

## set model code
code <- nimbleCode({
    for(i in 1:N) {
        y[i, ] ~ dmnorm(mu[], cov = sigma[, ])  
    }
    mu[] ~ dmnorm(mu0[], cov = sigma0[, ])  
})

## set up other components of model
consts <- list(N = nrow(dataset))
data <- list(
    y = as.matrix(dataset),
    mu0 = c(0, 0),
    sigma0 = diag(2),
    sigma = diag(2)
)

## sample initial values
initFn <- function(mu0, sigma0, y) {
    mu <- mvrnorm(1, mu0, sigma0)
    y1 <- y
    y1[!is.na(y)] <- NA
    y1[is.na(y)] <- 0
    list(mu = mu, y = y1)
}

## define the model, data, inits and constants
model <- nimbleModel(
    code = code, 
    constants = consts, 
    data = data, 
    inits = initFn(data$mu0, data$sigma0, data$y))

## compile the model
cmodel <- compileNimble(model)

## set monitors
config <- configureMCMC(cmodel, monitors = c("mu", "y"), thin = 1)

## check monitors and samplers
config$printMonitors()
config$printSamplers()

## build and compile the MCMC
built <- buildMCMC(config)
cbuilt <- compileNimble(built)

## run the MCMC
cbuilt$run(niter = 10)

Here I only expect y[1, 1] to be updated and not y[1, 2]. I can overcome this (I think) by fitting the MVN as a product of the conditionals in an appropriate way, but I thought I'd flag this.

Many thanks,

TJ


danielturek · 2021-05-07T16:33:04Z

Briefly - this is not a bug. NIMBLE's handling of NAs in multivariate nodes is only written to sample the entire multivariate node (one consisting entirely of NAs). There is no functionality to sample only the unobserved part of a multivariate node, conditional on the observed parts. Agreed, for the MVN distribution these conditionals are simple and well-defined. However, more generally, for other multivariate distributions, the conditional distribution of the unobserved parts of the random variate given the observed parts may not have a simple closed form. My suggestion for what you're attempting is to write the conditional MVN distribution (conditional on the observed values) for the unobserved values of the MVN variate. With this correct prior (which is the conditional distribution, given the observed values), NIMBLE will correctly update the (entirely unobserved) MVN node. Does that make sense?
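For reference, the conditional in question is the standard MVN result: splitting $y$ into missing ($m$) and observed ($o$) parts, $y_m \mid y_o \sim N\left(\mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(y_o - \mu_o),\ \Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}\right)$. A minimal base-R sketch of that calculation follows. The helper name `cond_mvn` and the example values are assumptions made for illustration; it is not a NIMBLE function.

```r
## Sketch (base R only): mean and covariance of the missing elements of an
## MVN vector, conditional on the observed elements.
## `cond_mvn` is a hypothetical helper, not part of NIMBLE.
cond_mvn <- function(y, mu, sigma) {
    miss <- is.na(y)
    ## need at least one missing and one observed element
    stopifnot(any(miss), any(!miss))
    ## regression coefficients: Sigma_mo %*% solve(Sigma_oo)
    B <- sigma[miss, !miss, drop = FALSE] %*%
        solve(sigma[!miss, !miss, drop = FALSE])
    list(
        mean = as.vector(mu[miss] + B %*% (y[!miss] - mu[!miss])),
        cov = sigma[miss, miss, drop = FALSE] -
            B %*% sigma[!miss, miss, drop = FALSE]
    )
}

## example: bivariate normal with correlation 0.5, first element missing
cond <- cond_mvn(
    y = c(NA, 3),
    mu = c(1, 2),
    sigma = matrix(c(1, 0.5, 0.5, 1), 2)
)
```

With the identity covariance used in the reproducible example above, the two dimensions are independent, so the conditional for the missing element reduces to its marginal distribution.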


tjmckinley · 2021-05-07T16:42:19Z

Thanks Daniel.

That makes sense, and your approach to model the conditionals aligns with what we were thinking of doing to overcome this in this case, so that's great!

However, there are a couple of points that may (or may not) be worth considering.

Firstly, NIMBLE doesn't throw an error or warning that it treats all dimensions as missing, so an unwary user might not realise what is happening and might think the updates are only being done in the missing dimensions. (Though of course they should check!)

Secondly, the updates for a more general MV distribution could be done using e.g. random-walk samplers or suchlike, even if the full conditionals are not known. Would this be feasible and/or useful?

Cheers,

TJ

danielturek · 2021-05-07T19:29:07Z

Useful, for certain, but my concern is that some of the conditional distributions (up to a constant of proportionality) may be difficult to derive analytical expressions for, in the non-normal case, which are necessary for general MCMC sampling algorithms (including RW Metropolis-Hastings). But this is something worth considering, as well as the warning message you suggested. Thank you.

tjmckinley · 2021-05-07T20:03:35Z

No worries at all. Thanks for your reply and thanks once again for developing NIMBLE.

tjmckinley · 2021-05-12T09:38:24Z

So I've been thinking about this and I'm not quite following why analytical forms for the conditionals would be required for a RW M-H algorithm, since you could just use the ratio of the joint densities. For example, if I had a generic multivariate variable $y = (y_1,\dots,y_K)$ say, with joint density $\pi(y_1,\dots,y_K)$ , then if I did a RW Metropolis proposal for $y_1$ , conditional on the other dimensions / parameters being fixed, then the MH acceptance probability is:

$\alpha = \min\left(1, \frac{\pi(y_1', y_2, \dots, y_K)}{\pi(y_1, y_2, \dots, y_K)}\right)$

where $y_1'$ is the proposed value. Asymmetric or MV RW proposals could be implemented in the usual way, in which case all that you would need is the joint MV density and some index vector telling you which dimensions to update accordingly.
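As a rough illustration of this idea, here is a base-R sketch of a random-walk Metropolis update applied to a subset of the elements of an MVN vector, using only the joint density and an index vector. The helper names `log_dmvn` and `rw_subset`, the fixed proposal scale and the example values are all assumptions made for illustration; this is not the NIMBLE sampler code and it does no adaptation.

```r
## Sketch (base R only): random-walk Metropolis on a subset of the elements
## of a multivariate normal vector, using only the joint density.
## `log_dmvn` and `rw_subset` are hypothetical helpers, not NIMBLE samplers.

## joint MVN log-density
log_dmvn <- function(y, mu, sigma) {
    dev <- y - mu
    log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
    -0.5 * (length(y) * log(2 * pi) + log_det + sum(dev * solve(sigma, dev)))
}

## update only the elements in `idx`; all other elements stay fixed
rw_subset <- function(y, idx, mu, sigma, niter = 5000, scale = 1) {
    out <- matrix(NA_real_, niter, length(idx))
    log_cur <- log_dmvn(y, mu, sigma)
    for (i in seq_len(niter)) {
        ## symmetric proposal, so only the joint density ratio is needed
        prop <- y
        prop[idx] <- y[idx] + rnorm(length(idx), 0, scale)
        log_prop <- log_dmvn(prop, mu, sigma)
        if (log(runif(1)) < log_prop - log_cur) {
            y <- prop
            log_cur <- log_prop
        }
        out[i, ] <- y[idx]
    }
    out
}

## example: update y[1] only, holding y[2] = 3 fixed
set.seed(1)
sigma <- matrix(c(1, 0.5, 0.5, 1), 2)
draws <- rw_subset(y = c(0, 3), idx = 1, mu = c(1, 2), sigma = sigma)
```

In this example the draws should settle around the analytical conditional for $y_1$ given $y_2 = 3$ (mean 1.5, variance 0.75), even though that conditional is never written down in the code.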

Just a thought in case it might be a useful feature. Apologies if I'm missing something obvious.

tjmckinley · 2021-05-13T16:34:41Z

In case it's of interest, I've attached some code below which seems to work without requiring the conditionals. I've used the standard RW and RW_block sampler code, but amended to work with MV nodes where only subsets are updated (I did remove the reflective sampling bit of the standard RW when I was testing out, sorry. Otherwise they should be very close to the originals and hence allow for adaptation etc.). The dimensions to update are passed in as control variables, hence the setup loop.

The reprex4.R file contains the run-time code, but the other two files need to be in the working directory and contain the custom samplers.

reprex4.zip <https://github.com/nimble-dev/nimble/files/6473796/reprex4.zip>

Man, I love NIMBLE! Many thanks.

danielturek · 2021-05-17T22:28:31Z

TJ, my apologies for the delay in responding. I took a look over your code, and it's a nice approach to sampling the missing dimensions of multivariate nodes. Thanks for making these modifications, and for sharing them. I'm planning to speak with the development team and discuss possible paths forward for incorporating this - or something similar - into the package. It won't be a quick addition, since we'll carefully consider how it fits into the overall MCMC and nimble package structure. But this is a great starting point, and we appreciate it. Thanks again. I'll keep you posted as it develops. Daniel


tjmckinley · 2021-05-18T08:00:50Z

Hi Daniel,

No worries at all and thanks for the response. Of course, there's no pressure to adopt any of these ideas, but if they're useful to build upon then that's great.

Much appreciated,

TJ

paciorek · 2021-08-20T16:41:13Z

I'm finally looking back into the question of warning users about multivariate nodes that have inconsistent data flagging amongst the scalar elements of the node. We're treating this very poorly, as isData just returns the T/F for the first element of the node. We have a comment about this in isDataFromGraphID but give no warning to the user. What this means is that if we have, say,

y[1:2] ~ dmnorm(mu[1:2], pr[1:2, 1:2])

and set y=c(NA,3) vs. y=c(3,NA), we get different behavior as to whether the multivariate node is considered data (no in the first case and yes in the second case).

My suggestion here is to modify isDataFromGraphID() to:

Warn users if a multivariate node has mixed T/F in the model's isDataEnv, which is the canonical source of info about whether something is data.
We set the result of isDataFromGraphID to always either be TRUE or FALSE in mixed T/F cases.

So we'd have to decide whether we want to always set it TRUE or always FALSE.

If we always set it to be TRUE, that will hit the user in the eyes, because the likelihood will always be NA. I don't see a way for them to then run an MCMC because of the various processing steps that check the data flag, though there might be a hack.

If we always set to FALSE, a user's MCMC will run, but it will initialize the data values that the user will have expected to stay as the values they originally gave. However, a user could set up a specialized sampler to handle things, and in that sampler they could avoid changing the actual data values. I think they'd need to set initial values for all the NA values so that our initialization doesn't overwrite the entire node. Haven't looked carefully at this. Perhaps what we might want to do is modify our initialization to not initialize such mixed nodes so that we are never opaquely overwriting elements a user expects not to be overwritten. However simulate would still overwrite, so we wouldn't be handling everything.

Given all this, I wonder if always setting to TRUE is best as it forces a user to confront the situation. Then if a user really wants to handle this case, they could instead have data that are not provided as data and do their own careful user-defined MCMC sampling.
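To make the "always TRUE, with a warning" option concrete, here is a small base-R sketch of the policy. The function `resolve_data_flag` and its arguments are hypothetical and exist only for illustration; this is not the code of isDataFromGraphID() and does not touch the model's isDataEnv.

```r
## Sketch (base R only) of the "always TRUE, with a warning" policy.
## `resolve_data_flag` is a hypothetical helper, not NIMBLE's
## isDataFromGraphID().
resolve_data_flag <- function(element_flags, node_name) {
    ## mixed case: some, but not all, scalar elements are flagged as data
    if (any(element_flags) && !all(element_flags)) {
        warning("node ", node_name,
                " mixes data and non-data elements; treating it as data")
        return(TRUE)
    }
    all(element_flags)
}

## the result no longer depends on which element is missing
flag_a <- suppressWarnings(resolve_data_flag(!is.na(c(NA, 3)), "y[1:2]"))
flag_b <- suppressWarnings(resolve_data_flag(!is.na(c(3, NA)), "y[1:2]"))
```

Under this rule both y=c(NA,3) and y=c(3,NA) give the same flag, unlike a rule that only looks at the first element.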

@perrydv @danielturek, I'd like your input here, but it may be best to discuss at our next meeting.

danielturek · 2021-09-07T19:09:37Z

I think this is worthy of group discussion at some point. I agree, it needs attention.

paciorek · 2021-09-18T18:32:05Z

PR #1165 is fixing the behavior of isData.

As for potential new functionality for such mixed cases, such as handling them in MCMC, that is of interest but not something the core development team will have time for in the foreseeable future.

paciorek mentioned this issue Sep 18, 2021

treat mixed data/non-data mv nodes as data #1165

Merged
