Experiences with the book, round 2 #15

LTLA · 2020-08-15T17:48:28Z

Similar to #7, but on a different dataset:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")

env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
sce.8qc$mix <- factor(sce.8qc$mix)

# Library size normalization and log-transformation.
library(scuttle)
library(scran)
sce.8qc <- logNormCounts(sce.8qc)
dec.8qc <- modelGeneVar(sce.8qc)
hvgs.8qc <- getTopHVGs(dec.8qc, n=1000)

library(scry)
sce.8qc2 <- GLMPCA(sce.8qc[hvgs.8qc,], fam="nb",
    L=20, sz=sizeFactors(sce.8qc))
## Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

I guess I could fiddle with the control parameters to sneak it through, but that is going to be beyond the ability of most users, and this dataset is pretty tame. Seems like some more care could be taken with the step sizes to guarantee convergence, or at least reduction of the deviance.

Session information

R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scry_1.1.0                  scran_1.17.15              
 [3] scuttle_0.99.12             SingleCellExperiment_1.11.6
 [5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7        
 [7] matrixStats_0.56.0          Matrix_1.2-18              
 [9] Biobase_2.49.0              GenomicRanges_1.41.6       
[11] GenomeInfoDb_1.25.10        IRanges_2.23.10            
[13] S4Vectors_0.27.12           BiocGenerics_0.35.4        
[15] BiocFileCache_1.13.1        dbplyr_1.4.4               

loaded via a namespace (and not attached):
 [1] statmod_1.4.34            tidyselect_1.1.0         
 [3] locfit_1.5-9.4            purrr_0.3.4              
 [5] BiocSingular_1.5.0        lattice_0.20-41          
 [7] vctrs_0.3.2               generics_0.0.2           
 [9] blob_1.2.1                rlang_0.4.7              
[11] pillar_1.4.6              glue_1.4.1               
[13] DBI_1.1.0                 BiocParallel_1.23.2      
[15] rappdirs_0.3.1            bit64_4.0.2              
[17] dqrng_0.2.1               GenomeInfoDbData_1.2.3   
[19] lifecycle_0.2.0           zlibbioc_1.35.0          
[21] rsvd_1.0.3                memoise_1.1.0            
[23] irlba_2.3.3               curl_4.3                 
[25] BiocNeighbors_1.7.0       Rcpp_1.0.5               
[27] edgeR_3.31.4              limma_3.45.10            
[29] XVector_0.29.3            bit_4.0.4                
[31] digest_0.6.25             bluster_0.99.1           
[33] dplyr_1.0.1               grid_4.0.0               
[35] tools_4.0.0               bitops_1.0-6             
[37] magrittr_1.5              RCurl_1.98-1.2           
[39] tibble_3.0.3              RSQLite_2.2.0            
[41] crayon_1.3.4              pkgconfig_2.0.3          
[43] MASS_7.3-51.6             ellipsis_0.3.1           
[45] DelayedMatrixStats_1.11.1 glmpca_0.2.0             
[47] assertthat_0.2.1          httr_1.4.2               
[49] R6_2.4.1                  igraph_1.2.5             
[51] compiler_4.0.0

The text was updated successfully, but these errors were encountered:

willtownes · 2020-08-17T17:42:23Z

Thanks for another interesting data experiment. I'm not able to reproduce this. Could you please provide the random seed. It may be that it was just an unlucky initialization. I acknowledge the current random initialization is not ideal, but I haven't been able to get any other initialization tried so far to produce better average case performance in terms of convergence time, numerical stability etc. It may also be helpful to look at the deviance trace plot as a diagnostic for failed convergence (plot(fit$dev,type="l")) where fit is the glmpca object in the metadata of the SingleCellExperiment. By the way, when I ran this with L=2 it took 22 sec on my laptop (it took 24 sec with L=20) and this is what the factors look like:

LTLA · 2020-08-21T06:50:50Z

Interesting. I am consistently getting that error above. If it helps, I'll set the seed, let's say to 50:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")

env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
sce.8qc$mix <- factor(sce.8qc$mix)

# Library size normalization and log-transformation.
library(scuttle)
library(scran)
set.seed(50)

sce.8qc <- logNormCounts(sce.8qc)
dec.8qc <- modelGeneVar(sce.8qc)
hvgs.8qc <- getTopHVGs(dec.8qc, n=1000)

library(scry)
sce.8qc2 <- GLMPCA(sce.8qc[hvgs.8qc,], fam="nb",
    L=20, sz=sizeFactors(sce.8qc), verbose=TRUE)
## Control parameter 'penalty' should be provided as an element of 'ctl' rather than a separate argument.
## Control parameter 'verbose' should be provided as an element of 'ctl' rather than a separate argument.
## Control parameter 'eps' is deprecated. Coercing to equivalent 'tol'. Please use 'tol' in the future as 'eps' will eventually be removed
## Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

With verbose=TRUE, it rambles on for a while:

Trying AvaGrad with learning rate: 5.12e-08
Iteration: 1 | deviance=2.01e+05 | nb_theta: 1
Iteration: 2 | deviance=2.486e+05 | nb_theta: 1.54
Iteration: 3 | deviance=2.805e+05 | nb_theta: 2
Iteration: 4 | deviance=3.018e+05 | nb_theta: 2.37
Iteration: 5 | deviance=3.163e+05 | nb_theta: 2.65
Iteration: 6 | deviance=3.262e+05 | nb_theta: 2.85
Iteration: 7 | deviance=3.33e+05 | nb_theta: 3.01
Iteration: 8 | deviance=3.378e+05 | nb_theta: 3.12
Iteration: 9 | deviance=3.411e+05 | nb_theta: 3.2
Iteration: 10 | deviance=3.435e+05 | nb_theta: 3.25
Iteration: 11 | deviance=3.451e+05 | nb_theta: 3.29
Iteration: 12 | deviance=3.463e+05 | nb_theta: 3.32
Iteration: 13 | deviance=3.471e+05 | nb_theta: 3.34
Iteration: 14 | deviance=3.477e+05 | nb_theta: 3.36
Iteration: 15 | deviance=3.481e+05 | nb_theta: 3.37
Iteration: 16 | deviance=3.484e+05 | nb_theta: 3.38
Iteration: 17 | deviance=3.486e+05 | nb_theta: 3.38
Iteration: 18 | deviance=3.487e+05 | nb_theta: 3.38
Iteration: 19 | deviance=3.488e+05 | nb_theta: 3.39
Iteration: 20 | deviance=3.489e+05 | nb_theta: 3.39
Iteration: 21 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 22 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 23 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 24 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 25 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 26 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 27 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 28 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 29 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 30 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 31 | deviance=3.491e+05 | nb_theta: 3.39
Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

It's also kind of bemusing that the deviance is steadily increasing... I would have expected it to decrease.

willtownes · 2020-08-29T02:46:27Z

I haven't forgotten about this, sorry for the delay. One clarification- do you install all the packages from bioc development or from their respective github repositories?

LTLA · 2020-08-29T05:01:59Z

BioC-devel. Of course, I could use the GitHub repos locally, but that just kicks the can down the road as the OSCA book will only build using packages from the official BioC-devel channel.

willtownes · 2020-09-10T18:46:34Z

OK I finally was able to reproduce this bug. The bug occurs for a variety of random seeds, what matters is that scry must be installed from Bioconductor development branch. When I installed scry from github, the bug does not occur for any random seed. Also, it doesn't occur if I run if I call glmpca directly from the cran package without the scry wrapper. For example, replace the last two lines in Aaron's original code with

library(glmpca) # from CRAN
m<-assay(sce.8qc[hvgs.8qc, ], "counts")
fit<-glmpca(m, L=20, fam="nb", sz=sizeFactors(sce.8qc), verbose=TRUE)

What seems to be happening is the convergence happens at about 3.5e5 deviance for all initializations, but the Bioc development version weirdly produces an initialization having a lower deviance (2.0e5). Both the scry github version and the glmpca CRAN version initialize with deviance about 7.3e5. Another wrinkle is with scry github and glmpca CRAN, the nb_theta is initialized to 100, whereas with the bioc devel version it initializes to 1.

willtownes · 2020-09-10T18:48:15Z

By the way it is possible for the deviance to increase slightly sometimes at the end of the optimization due to the momentum aspect of the optimizer. Also, the deviance is only defined conditional on a particular nb_theta value so it's possible to have a lower deviance but a worse set of latent factors. This further underscores the need to improve the estimation of nb_theta in glmpca.

willtownes · 2020-09-10T18:54:44Z

OK I have now reproduced this bug in all implementations by setting nb_theta=1. @LTLA as a quick fix please try re-running your example but set nb_theta=100. It's still a bug because it shouldn't be defaulting to such a low nb_theta initialization.

willtownes · 2020-09-10T19:07:01Z

confirmed by checking source code that bioc-devel branch has an old implementation (bioc-release is actually further along). This old default has nb_theta=1. The new version has nb_theta=100. Once we update the bioc-devel the problem should go away.

LTLA · 2020-09-11T04:23:28Z

Yes, this fixes the problem. Finishes faster too.

Bit surprised that BioC-release is further along than BioC-devel. Probably preaching to the choir here, but I'd be pretty scared to push stuff to BioC-release without it having run the BBS gauntlet in BioC-devel.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiences with the book, round 2 #15

Experiences with the book, round 2 #15

LTLA commented Aug 15, 2020

willtownes commented Aug 17, 2020

LTLA commented Aug 21, 2020

willtownes commented Aug 29, 2020

LTLA commented Aug 29, 2020

willtownes commented Sep 10, 2020

willtownes commented Sep 10, 2020

willtownes commented Sep 10, 2020

willtownes commented Sep 10, 2020

LTLA commented Sep 11, 2020

Experiences with the book, round 2 #15

Experiences with the book, round 2 #15

Comments

LTLA commented Aug 15, 2020

willtownes commented Aug 17, 2020

LTLA commented Aug 21, 2020

willtownes commented Aug 29, 2020

LTLA commented Aug 29, 2020

willtownes commented Sep 10, 2020

willtownes commented Sep 10, 2020

willtownes commented Sep 10, 2020

willtownes commented Sep 10, 2020

LTLA commented Sep 11, 2020