Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sampleMcmc() error in chol.default(W) #8

Closed
keilastark opened this issue May 30, 2019 · 11 comments
Closed

sampleMcmc() error in chol.default(W) #8

keilastark opened this issue May 30, 2019 · 11 comments
Labels
algorithm Issues with the algorithm implementation bug Something isn't working

Comments

@keilastark
Copy link

I am trying to fit a model with 4 random effects, one is spatial coordinates to be specified with the sData argument, and three are hierarchical (Quadrat, Site, and Region). The input for the spatial random effect object (sData) is a matrix with two columns (lat and long), and a row for each observation. The spatial coordinates are in conventional lat and long to four decimal places, as some points are as close as 15m apart. However each row has a distinct set of coordinates; I have tested this. The object "Pi" below describes my three hierarchical random effects.

spat <- data.frame(spat = sprintf('spatial_%.2d',1:78))
studyDesign <- cbind(Pi, spat)

rL1 = HmscRandomLevel(units = studyDesign$Quadrat)
rL2 = HmscRandomLevel(units = studyDesign$Site)
rL3 = HmscRandomLevel(units = studyDesign$Region)
rL4 = HmscRandomLevel(sData = coords[,2:3])

m <- Hmsc(Y = Y, XData = X, XFormula = ~eelgrass_lai +  dissox + salinity + ph + sstmean + nitrate + chlomean, studyDesign = studyDesign, ranLevels = list(Quadrat = rL1, Site = rL2, Region = rL3, spat = rL4))

When I use the sampleMcmc() function I get the error:

Error in chol.default(W) : the leading minor of order 2 is not positive definite

I do not see this error when I attempt to fit the exact same model but with no spatial random effect (rL4 not included). I have also tried swapping the order of the rL's when specifying my Hmsc object, but this doesn't change anything.

@gtikhonov
Copy link
Collaborator

gtikhonov commented May 30, 2019

@keilast thanks for posting this here instead of the personal email.
Indeed, swapping the order of elements in the rL argument should have no essential effect on the results. However, now since you explicitly mentioned that

The spatial coordinates are in conventional lat and long to four decimal places, as some points are as close as 15m apart.

I guess that potential source of error you encounter can be simply due to the numerical precision of the spatial covariance matrix. Thus, since you have very close locations and quite-far-away locations (guessing based on that you have Region random effect), the covariance matrix for your spatial random effect is going to be close to singular, leading to the Cholesky decomposition failure.
As a fast way to test this hypothesis, I would suggest that you try to test whether the code runs if you drop some of your spatial locations, so that none of them are closer than X meters apart (e.g. you can try with X = 0.1% of your maximum distance between locations). If that is the reason, we will give some more detailed advice on how you can handle the whole data.
One more question that would help us to pinpoint the problem - does your code throw this error before you see the "Computing chain 1" output (or any other output from the sampleMcmc function) to your console or after?

@gtikhonov gtikhonov added bug Something isn't working algorithm Issues with the algorithm implementation labels May 30, 2019
@keilastark
Copy link
Author

Yes, I suspected the same re: the precision of the spatial points. I have considered converting the coordinates to UTM, however they span two UTM zones, and I am not sure this will do anything to fix the problem.

Yes, the code throws the error instantaneously, before any "Computing chain" outputs.

Removing sites so the minimum cutoff distance is >10% of the maximum distance does not fix the problem either, yet works fine when I try to fit that reduced-site version of the model without the spatial rL.

Thank you!

@gtikhonov
Copy link
Collaborator

gtikhonov commented May 31, 2019

@keilast , so now I am almost 100% sure that your problem happens in the internal function computeDataParameters defined in computeDataParameters.R, which is called from the sampleMcmc method. Particularly the call to the chol function in line 68 of that file.

Given your description of the spatial data you input to the model fitting, there should be no problem, yet you still encounter it. Thus, I suggest that you try to execute that part of code outside of package in separate script, which would allow to explicitly check what and why goes wrong. So, just copy-paste the lines 47-75 from that file and try to run them after you have defined your model as you did earlier, but save it to hM variable:
hM <- Hmsc(Y = Y, XData = X, XFormula = ~eelgrass_lai + dissox + salinity + ph + sstmean + nitrate + chlomean, studyDesign = studyDesign, ranLevels = list(Quadrat = rL1, Site = rL2, Region = rL3, spat = rL4))

If you have the same error message, check the content of variables W, distance and alpha. Also check the ag variable and please tell its value. The W must be positive-definite matrix. If it is not and alpha is not NA, you should check the distance - this is Euclidian distance between the locations you've provided. There should be neither NA in this matrix, nor off-diagonal zero values. If there are NA - check the s variable, it has to be something wrong with the coordinates you provide (e.g. they are read as strings instead of numerical values)....

@keilastark
Copy link
Author

Hello,

So I have gone and done as you've recommended, and the same error came up.

W = square matrix of mostly NA's, except for the diagonal elements which = 1
distance = same as W
ag = 2
alpha = 0.92
s = 2 columns of NA's

Since you mentioned the coordinates may not be read properly, I saved them as a matrix (previously was a dataframe) before naming it as sData in rL4, and ran the exact same code. Now this error comes up:

Error in hM$rL[[r]]$s[levels(hM$dfPi[, r]), ] : subscript out of bounds

, but I checked hM$rL[[4]]$s give the correct spatial coordinates. I have tried transforming the coordinate data (ie. multiplying by 10 000) but that did not change the outcome.

Thanks

@gtikhonov
Copy link
Collaborator

Hello,
Thanks for the testing you've conducted!

distance = same as W

This clearly indicates that there is some issue with imputation of the coordinates. I believe that the reason is due to reading-in some non we almost figured out the issue.

I saved them as a matrix (previously was a dataframe) before naming it as sData in rL4

Are you sure that your matrix consists of numbers and not strings and its rownames are the same values as you plug to the spat <- data.frame(spat = sprintf('spatial_%.2d',1:78))
E.g. that your matrix look like this:
matrix(1:10,5,2,dimnames=list(sprintf('spatial_%.2d',1:5),NULL)),
and not like this
matrix(as.character(1:10),5,2,dimnames=list(sprintf('spatial_%.2d',1:5),NULL))
or this
matrix(1:10,5,2)

Thanks for co-operation!

@keilastark
Copy link
Author

keilastark commented Jul 8, 2019

Hello, sorry for the delayed response on this!

A colleague tinkered with my data and the function ran successfully:

spatial$longitude <- spatial$longitude-min(spatial$longitude)
studyDesign$Spatial <- factor(studyDesign$Quadrat)

#then establish random levels
rL1 = HmscRandomLevel(units = studyDesign$Quadrat)
rL2 = HmscRandomLevel(units = studyDesign$Site)
rL3 = HmscRandomLevel(units = studyDesign$Region)
rL4= HmscRandomLevel(sData = spatial)

Cheers!

@LimaRAF
Copy link

LimaRAF commented Dec 6, 2019

Dear @gtikhonov ,

I am having a similar problem trying to sampleMcmc a model 2 random effects, one that was the spatial coordinates and the other a regional categorical variable. Since I have >1000 community samples, the spatial method set to GPP, using about 117 knots. It is a hurdle model, so I am first fitting the presence/absence data and then species abundances (individuals ha-1 from 0.001 until ~2000) and relative densities (0-1 values). I log-transformed abundances and relative densities to use distr = "normal".

All my coordinates are in meters and I have jittered coordinates that were too close to each other to avoid problems related to the precision of the spatial coordinates. However, I still have distances between units that vary from 5 meters to 3e+06 meters. Anyways, not sure if this spatial precision is indeed an issue while using GPP.

Notably, the error only appeared when I increased the thin x samples combination and only for the abundance part of the model. The error appears error by the end of chain 1, sometimes of chain 2:
Error in chol.default(iU) :
the leading minor of order 19 is not positive definite

I run the tests you suggested for keilast here are the results for the computeDataParameters part of the sampleMCMC function. Here are the results:
alpha = 4257548
ag = 101
W is positive-definite and has no NAs
'distance' has no NAs but it is not positive-definite.
's' a matrix with 2 columns of numbers and no NAs, with upper off-diagonal = 8916294 and lower off-diagonal = 5766795 9000554.

In conclusion, I think the problem is probably in the small distance between some of my sites. But there may be a possible interaction between these distances and data type (problem not arises with presence/absence data, only abundances).

Any clues? You mentioned some that you had more detailed advice on how you can handle the whole data. I have spent the entire morning truing to solve this issue, so any help would be very welcome.

Thanks in advance,
Renato

@gtikhonov
Copy link
Collaborator

gtikhonov commented Dec 6, 2019

Hi @LimaRAF
Seems that your problem is quite different from the one above. And given your description of your spatial design, the numerical precision uses are very plausible (which were not the problem in the example above).

Generally, I would recommend to reconsider how you input the spatial effect to the study. However, the way you do that depends on what you want to capture with the spatial effect. One option is instead of having a spatial effect on the level of samples, you can have it on the level of "clusters of samples". and then extra no-spatial effect on the level of samples. Other option is to have a single spatial latent effect, but restrict its spatial range to somewhat small (although GPP is a poor choice for this one). The first option aims to capture long-range spatial correlation, the second - short range. Well, you can have both as well. But the key problem here is that it causes numerical instabilities if you do it in the straightforward way, so here is the idea of separating it in two levels.

@LimaRAF
Copy link

LimaRAF commented Dec 6, 2019

Dear @gtikhonov ,

Thanks for the quick reply and sorry for posting in an different issue type (do you want me to post it separatedly for the record?)

However, I am not sure I fully understood your suggestion. You are proposing to nest my samples in my regional categorical variable?

The region has a big impact on species distribution, but setting it as a fixed effects led to predictions on parts of the regions where the species actually does not occurrence or creating very sharp differences in the prediction in the transitions/contact of these regions. So the solution was to use it as an extra random effect, hoping to still capture the effect but without creating such 'weird' predictions.

Currently, I am defining my study design and random effects like this:
studyDesign = data.frame(site = as.factor(sites$SiteCode), ecoreg = as.factor(sites$ecoreg.af))
where 'SiteCode' are my samples and 'ecoreg.af' my biogeographical regions. 'sites' is my data frame

My Sdata:
xy = as.matrix(cbind(sites$x, sites$y)) # x and y are now in UTM (meters)
rownames(xy) = studyDesign[,1]

Defining the random effects and setting the spatial model:
Knots = constructKnots(xy, nKnots = 60, minKnotDist = 10000)
rL.gpp = HmscRandomLevel(sData = xy, sMethod = "GPP", sKnot = Knots)
rL.region = HmscRandomLevel(units = levels(studyDesign$ecoreg))

Are you suggesting to do:
rownames(xy) = studyDesign[,2]

and then drop rL.region from my the 'ranLevels'of my model?

Thanks again,
Renato

@gtikhonov
Copy link
Collaborator

I have opened and will reply in a separate issue.

@aeiche01
Copy link

aeiche01 commented Sep 19, 2020

For people reading this in the future: I had the same issue as keilast Error in chol.default(W) : the leading minor of order 2 is not positive definite and then ran into the same subsequent error once I put the data into a matrix: Error in hM$rL[[r]]$s[levels(hM$dfPi[, r]), ] : subscript out of bounds

I was able to fix this error by making sure that the rownames of the latlong matrix matched the spatial plots/routes/whatever the coordinates are supposed to mark in the studyDesign data frame.

My code looks like this: rownames(xy)=studyDesign$routes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
algorithm Issues with the algorithm implementation bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants