
Use of 'thin' and 'start' in computePredictedValues() #86

Closed
rburner opened this issue Mar 3, 2021 · 6 comments
Labels: bug (Something isn't working)

Comments

@rburner

rburner commented Mar 3, 2021

Hi! I have a model with many sampling units (10k+), fitted using many (probably too many) samples (1000 per chain x 6 chains).

The problem is that when I use computePredictedValues() the array gets too big (50 GB+) and crashes the HPC cluster I'm using. I thought I could reduce its size by decreasing the 'depth' of the predicted values array from 6000 (n samples) to a smaller number using the 'thin' option. But in tests with smaller models, using thin doesn't change the dimensions of the predicted values array.

The thin option does appear to affect the behavior of poolMcmcChains() within the computePredictedValues() function, but then predict() doesn't seem to notice.

Is this how the 'thin' and 'start' arguments are supposed to be used?

Of course I could refit the models with fewer samples, but I'd use a shortcut if one is available.

Thanks!!!

@jarioksa
Collaborator

jarioksa commented Mar 3, 2021

thin and start change the size of predicted results when I try:

> library(Hmsc)
> TD$m # example data in the package
Hmsc object with 50 sampling units, 4 species, 3 covariates, 3 traits and 2 random levels
Posterior MCMC sampling with 2 chains each with 100 samples, thin 1 and transient 50 
> preds <- computePredictedValues(TD$m)
> dim(preds)
[1]  50   4 200
> preds <- computePredictedValues(TD$m, thin=10)
> dim(preds)
[1] 50  4 20

The number of samples in the predicted object is defined by the postList returned by poolMcmcChains and therefore thin and start are passed only there.

There may be some deeper reason for your failure than the object size. The result is a 3-D array with dimensions sampling units (10K+) times species times samples (6000). You don't say how many species you had, but with 10000 sampling units and 6000 samples you may exceed the integer maximum with 36 species:

> .Machine$integer.max/10000/6000
[1] 35.79139

I don't know about long-integer support in the underlying code, but indexing can become a problem. Probably storage space as well.
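To see why the full array is so heavy, a back-of-the-envelope calculation helps (a Python sketch; the species count of 50 is an assumption for illustration, not a figure from the report):

```python
# Estimate the size of the 3-D predictions array:
# sampling units x species x posterior samples, stored as 8-byte doubles.
units, species, samples = 10_000, 50, 6_000  # species count is assumed
cells = units * species * samples
print(f"{cells:,} cells")                    # 3,000,000,000 cells
print(f"~{cells * 8 / 2**30:.1f} GiB")       # ~22.4 GiB

# The cell count alone already exceeds R's 32-bit integer maximum,
# which is what makes indexing such an array problematic.
print(cells > 2**31 - 1)                     # True
```

Even a modest species count puts the array into long-vector territory, so thinning one dimension is the only practical way to keep it addressable.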

I hope you don't crash your HPC cluster: it costs a lot to buy a new one.

@rburner
Author

rburner commented Mar 3, 2021

Hi @jarioksa thanks for all that info! That example works for me too. But I didn't mention that I was also using a partition (for cross-validation), and when I add that partition to the example the dimensions remain the same (50 x 4 x 200) even with thinning.

@jarioksa
Collaborator

jarioksa commented Mar 3, 2021

@rburner : this is a bug, and actually a severe one. thin being ignored is a bug, but not so bad. However, start is not ignored: it does drop some samples, but the array is then filled back to its original size by replicating the remaining predictions.
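The symptom described above resembles R's vector recycling: a shorter set of thinned predictions silently repeated to fill the full-size array. A minimal Python sketch of that failure mode (the numbers are made up for illustration):

```python
import itertools

# 20 posterior samples survive thin=10, but the result array still has
# room for the original 200 samples, so the 20 get recycled to fill it.
thinned = list(range(20))
filled = list(itertools.islice(itertools.cycle(thinned), 200))

print(len(filled))                  # 200: the array keeps its original size
print(filled[:3] == filled[20:23])  # True: predictions repeat every 20 slots
```

The danger is that the result looks well-formed: the dimensions are unchanged, and only the repeated values reveal that thinning silently failed.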

I'll have a look at this ASAP.

@jarioksa added the bug (Something isn't working) label Mar 3, 2021
@rburner
Author

rburner commented Mar 3, 2021

@jarioksa Jari, wow ok, thanks for looking into this! Let me know.

jarioksa added a commit that referenced this issue Mar 3, 2021
Function computePredictedValues always returned an array with the original
number of samples even if the user defined thin or start for a smaller
array. The real bug was that only the reduced number of samples
was calculated, but the array was filled to the original number
of samples by replicating predicted values. Discussed in issue #86
in GitHub.
@jarioksa
Collaborator

jarioksa commented Mar 3, 2021

This should be fixed now in GitHub with commit 51d14ec. Please try it and see if this is sufficient (you still have 10k+ sampling units, and multiplication is a nasty operation; multiplying three numbers is much nastier than multiplying two, so 3-D arrays can be huge even if you make one dimension shorter).

@rburner
Author

rburner commented Mar 3, 2021

@jarioksa Great Jari, thanks so much for that! I will see if I can make it work with e.g. 500 samples! Best, Ryan
