Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiences with the book, round 2 #15

Open
LTLA opened this issue Aug 15, 2020 · 9 comments
Open

Experiences with the book, round 2 #15

LTLA opened this issue Aug 15, 2020 · 9 comments

Comments

@LTLA
Copy link

LTLA commented Aug 15, 2020

Similar to #7, but on a different dataset:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")

env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
sce.8qc$mix <- factor(sce.8qc$mix)

# Library size normalization and log-transformation.
library(scuttle)
library(scran)
sce.8qc <- logNormCounts(sce.8qc)
dec.8qc <- modelGeneVar(sce.8qc)
hvgs.8qc <- getTopHVGs(dec.8qc, n=1000)

library(scry)
sce.8qc2 <- GLMPCA(sce.8qc[hvgs.8qc,], fam="nb",
    L=20, sz=sizeFactors(sce.8qc))
## Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

I guess I could fiddle with the control parameters to sneak it through, but that is going to be beyond the ability of most users, and this dataset is pretty tame. Seems like some more care could be taken with the step sizes to guarantee convergence, or at least reduction of the deviance.

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scry_1.1.0                  scran_1.17.15              
 [3] scuttle_0.99.12             SingleCellExperiment_1.11.6
 [5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7        
 [7] matrixStats_0.56.0          Matrix_1.2-18              
 [9] Biobase_2.49.0              GenomicRanges_1.41.6       
[11] GenomeInfoDb_1.25.10        IRanges_2.23.10            
[13] S4Vectors_0.27.12           BiocGenerics_0.35.4        
[15] BiocFileCache_1.13.1        dbplyr_1.4.4               

loaded via a namespace (and not attached):
 [1] statmod_1.4.34            tidyselect_1.1.0         
 [3] locfit_1.5-9.4            purrr_0.3.4              
 [5] BiocSingular_1.5.0        lattice_0.20-41          
 [7] vctrs_0.3.2               generics_0.0.2           
 [9] blob_1.2.1                rlang_0.4.7              
[11] pillar_1.4.6              glue_1.4.1               
[13] DBI_1.1.0                 BiocParallel_1.23.2      
[15] rappdirs_0.3.1            bit64_4.0.2              
[17] dqrng_0.2.1               GenomeInfoDbData_1.2.3   
[19] lifecycle_0.2.0           zlibbioc_1.35.0          
[21] rsvd_1.0.3                memoise_1.1.0            
[23] irlba_2.3.3               curl_4.3                 
[25] BiocNeighbors_1.7.0       Rcpp_1.0.5               
[27] edgeR_3.31.4              limma_3.45.10            
[29] XVector_0.29.3            bit_4.0.4                
[31] digest_0.6.25             bluster_0.99.1           
[33] dplyr_1.0.1               grid_4.0.0               
[35] tools_4.0.0               bitops_1.0-6             
[37] magrittr_1.5              RCurl_1.98-1.2           
[39] tibble_3.0.3              RSQLite_2.2.0            
[41] crayon_1.3.4              pkgconfig_2.0.3          
[43] MASS_7.3-51.6             ellipsis_0.3.1           
[45] DelayedMatrixStats_1.11.1 glmpca_0.2.0             
[47] assertthat_0.2.1          httr_1.4.2               
[49] R6_2.4.1                  igraph_1.2.5             
[51] compiler_4.0.0           
@willtownes
Copy link
Collaborator

Thanks for another interesting data experiment. I'm not able to reproduce this. Could you please provide the random seed. It may be that it was just an unlucky initialization. I acknowledge the current random initialization is not ideal, but I haven't been able to get any other initialization tried so far to produce better average case performance in terms of convergence time, numerical stability etc. It may also be helpful to look at the deviance trace plot as a diagnostic for failed convergence (plot(fit$dev,type="l")) where fit is the glmpca object in the metadata of the SingleCellExperiment. By the way, when I ran this with L=2 it took 22 sec on my laptop (it took 24 sec with L=20) and this is what the factors look like:

image

@LTLA
Copy link
Author

LTLA commented Aug 21, 2020

Interesting. I am consistently getting that error above. If it helps, I'll set the seed, let's say to 50:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")

env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
sce.8qc$mix <- factor(sce.8qc$mix)

# Library size normalization and log-transformation.
library(scuttle)
library(scran)
set.seed(50)

sce.8qc <- logNormCounts(sce.8qc)
dec.8qc <- modelGeneVar(sce.8qc)
hvgs.8qc <- getTopHVGs(dec.8qc, n=1000)

library(scry)
sce.8qc2 <- GLMPCA(sce.8qc[hvgs.8qc,], fam="nb",
    L=20, sz=sizeFactors(sce.8qc), verbose=TRUE)
## Control parameter 'penalty' should be provided as an element of 'ctl' rather than a separate argument.
## Control parameter 'verbose' should be provided as an element of 'ctl' rather than a separate argument.
## Control parameter 'eps' is deprecated. Coercing to equivalent 'tol'. Please use 'tol' in the future as 'eps' will eventually be removed
## Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

With verbose=TRUE, it rambles on for a while:

Trying AvaGrad with learning rate: 5.12e-08
Iteration: 1 | deviance=2.01e+05 | nb_theta: 1
Iteration: 2 | deviance=2.486e+05 | nb_theta: 1.54
Iteration: 3 | deviance=2.805e+05 | nb_theta: 2
Iteration: 4 | deviance=3.018e+05 | nb_theta: 2.37
Iteration: 5 | deviance=3.163e+05 | nb_theta: 2.65
Iteration: 6 | deviance=3.262e+05 | nb_theta: 2.85
Iteration: 7 | deviance=3.33e+05 | nb_theta: 3.01
Iteration: 8 | deviance=3.378e+05 | nb_theta: 3.12
Iteration: 9 | deviance=3.411e+05 | nb_theta: 3.2
Iteration: 10 | deviance=3.435e+05 | nb_theta: 3.25
Iteration: 11 | deviance=3.451e+05 | nb_theta: 3.29
Iteration: 12 | deviance=3.463e+05 | nb_theta: 3.32
Iteration: 13 | deviance=3.471e+05 | nb_theta: 3.34
Iteration: 14 | deviance=3.477e+05 | nb_theta: 3.36
Iteration: 15 | deviance=3.481e+05 | nb_theta: 3.37
Iteration: 16 | deviance=3.484e+05 | nb_theta: 3.38
Iteration: 17 | deviance=3.486e+05 | nb_theta: 3.38
Iteration: 18 | deviance=3.487e+05 | nb_theta: 3.38
Iteration: 19 | deviance=3.488e+05 | nb_theta: 3.39
Iteration: 20 | deviance=3.489e+05 | nb_theta: 3.39
Iteration: 21 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 22 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 23 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 24 | deviance=3.49e+05 | nb_theta: 3.39
Iteration: 25 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 26 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 27 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 28 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 29 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 30 | deviance=3.491e+05 | nb_theta: 3.39
Iteration: 31 | deviance=3.491e+05 | nb_theta: 3.39
Error: Poor model fit (final deviance higher than initial deviance). Try modifying control parameters to improve optimization.

It's also kind of bemusing that the deviance is steadily increasing... I would have expected it to decrease.

@willtownes
Copy link
Collaborator

I haven't forgotten about this, sorry for the delay. One clarification- do you install all the packages from bioc development or from their respective github repositories?

@LTLA
Copy link
Author

LTLA commented Aug 29, 2020

BioC-devel. Of course, I could use the GitHub repos locally, but that just kicks the can down the road as the OSCA book will only build using packages from the official BioC-devel channel.

@willtownes
Copy link
Collaborator

OK I finally was able to reproduce this bug. The bug occurs for a variety of random seeds, what matters is that scry must be installed from Bioconductor development branch. When I installed scry from github, the bug does not occur for any random seed. Also, it doesn't occur if I run if I call glmpca directly from the cran package without the scry wrapper. For example, replace the last two lines in Aaron's original code with

library(glmpca) # from CRAN
m<-assay(sce.8qc[hvgs.8qc, ], "counts")
fit<-glmpca(m, L=20, fam="nb", sz=sizeFactors(sce.8qc), verbose=TRUE)

What seems to be happening is the convergence happens at about 3.5e5 deviance for all initializations, but the Bioc development version weirdly produces an initialization having a lower deviance (2.0e5). Both the scry github version and the glmpca CRAN version initialize with deviance about 7.3e5. Another wrinkle is with scry github and glmpca CRAN, the nb_theta is initialized to 100, whereas with the bioc devel version it initializes to 1.

@willtownes
Copy link
Collaborator

By the way it is possible for the deviance to increase slightly sometimes at the end of the optimization due to the momentum aspect of the optimizer. Also, the deviance is only defined conditional on a particular nb_theta value so it's possible to have a lower deviance but a worse set of latent factors. This further underscores the need to improve the estimation of nb_theta in glmpca.

@willtownes
Copy link
Collaborator

OK I have now reproduced this bug in all implementations by setting nb_theta=1. @LTLA as a quick fix please try re-running your example but set nb_theta=100. It's still a bug because it shouldn't be defaulting to such a low nb_theta initialization.

@willtownes
Copy link
Collaborator

confirmed by checking source code that bioc-devel branch has an old implementation (bioc-release is actually further along). This old default has nb_theta=1. The new version has nb_theta=100. Once we update the bioc-devel the problem should go away.

@LTLA
Copy link
Author

LTLA commented Sep 11, 2020

Yes, this fixes the problem. Finishes faster too.

Bit surprised that BioC-release is further along than BioC-devel. Probably preaching to the choir here, but I'd be pretty scared to push stuff to BioC-release without it having run the BBS gauntlet in BioC-devel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants