umap_transform causes R Studio to abort (R encountered a fatal error.) #102

Closed
ChVav opened this issue Sep 25, 2022 · 4 comments
Labels
bug Something isn't working

ChVav commented Sep 25, 2022

Hi!

My RStudio session crashes when I try to use umap_transform. No further error messages are given.
I tested uwot 0.1.11 and 0.1.14, and exactly the same happens with both.

Many thanks!

Example code:

library(uwot)
train <- iris[1:100,]
test <- iris[101:150,]

set.seed(42)
train_umap <- umap(train, n_components = 50, ret_model=TRUE, y=train$Petal.Length)
set.seed(42)
test_umap <- umap_transform(test,train_umap)
library(uwot)
sessionInfo()

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] uwot_0.1.14 Matrix_1.4-1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 umap_0.2.9.0 RSpectra_0.16-1 compiler_4.2.0 pillar_1.7.0 tools_4.2.0 digest_0.6.29
[8] jsonlite_1.8.0 evaluate_0.15 lifecycle_1.0.1 tibble_3.1.7 lattice_0.20-45 pkgconfig_2.0.3 png_0.1-7
[15] rlang_1.0.4 DBI_1.1.2 cli_3.3.0 rstudioapi_0.13 yaml_2.3.5 xfun_0.30 fastmap_1.1.0
[22] dplyr_1.0.9 knitr_1.39 generics_0.1.2 vctrs_0.4.1 askpass_1.1 tidyselect_1.1.2 grid_4.2.0
[29] reticulate_1.26 glue_1.6.2 R6_2.5.1 fansi_1.0.3 rmarkdown_2.14 purrr_0.3.4 magrittr_2.0.3
[36] htmltools_0.5.2 ellipsis_0.3.2 assertthat_0.2.1 utf8_1.2.2 openssl_2.0.0 crayon_1.5.1

jlmelville added the bug label Sep 25, 2022
jlmelville (Owner) commented

Hello, thanks for the report and the reproducible example. Tracking down what's happened is going to take me longer than the time I have for today, but I see that this example has also revealed some other problems that would probably cause issues even if the underlying memory error was fixed.

In the code you provide, the embedded coordinates from the initial run of umap contain NaN. umap_transform ought to check for this and give an error (bug number 1). umap should also check that the initial data doesn't contain NA, especially if it is responsible for generating the data (bug number 2).
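
The missing check described above can be approximated on the user side until it exists in the package. The following is a minimal sketch in base R, assuming the model returned by umap with ret_model=TRUE stores its coordinates in an `embedding` element; `assert_finite_embedding` is a hypothetical helper name, not a uwot function, and the two models are hand-made stand-ins rather than real umap output.

```r
# Hypothetical helper (not part of uwot): stop with a readable error if the
# model's embedding contains NA, NaN or Inf, instead of passing it on to
# umap_transform().
assert_finite_embedding <- function(model) {
  emb <- model$embedding
  if (is.null(emb)) {
    stop("model has no 'embedding' element; was ret_model = TRUE used?")
  }
  n_bad <- sum(!is.finite(emb))
  if (n_bad > 0) {
    stop(sprintf("embedding contains %d non-finite value(s)", n_bad))
  }
  invisible(TRUE)
}

# Stand-ins for a real model, so the sketch runs without uwot installed.
good_model <- list(embedding = matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2))
bad_model  <- list(embedding = matrix(c(0.1, NaN, 0.3, 0.4), nrow = 2))

assert_finite_embedding(good_model)

# Capture the error message rather than halting the script.
bad_result <- tryCatch(
  assert_finite_embedding(bad_model),
  error = function(e) conditionMessage(e)
)
print(bad_result)
```

In a real session the helper would be called on the model (here `train_umap`) before `umap_transform`.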

Those NaNs occur because you have set n_components=50 but the initial dimensionality of the iris dataset is only 4. I don't recommend trying to generate an embedding where n_components is greater than the dimensionality of the dataset. Again, the umap function should check for this and prevent it from occurring (bug number 3, and we haven't even got to the real problem yet).
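
A pre-flight check along these lines can be sketched in base R. `check_n_components` is a hypothetical helper, not a uwot function, and it assumes that only the numeric columns of a data frame count towards the input dimensionality (so the `Species` factor in iris is ignored).

```r
# Hypothetical helper (not part of uwot): refuse an n_components larger than
# the number of usable input dimensions.
check_n_components <- function(X, n_components) {
  if (is.data.frame(X)) {
    # Assumption: non-numeric columns do not contribute dimensions.
    n_dims <- sum(vapply(X, is.numeric, logical(1)))
  } else {
    n_dims <- ncol(X)
  }
  if (n_components > n_dims) {
    stop(sprintf(
      "n_components (%d) exceeds the %d numeric column(s) of the input",
      as.integer(n_components), as.integer(n_dims)
    ))
  }
  invisible(n_dims)
}

train <- iris[1:100, ]

# Passes: 2 components from 4 numeric columns.
ok_dims <- check_n_components(train, 2)

# Fails: 50 components from 4 numeric columns. Capture the message only.
msg <- tryCatch(
  check_n_components(train, 50),
  error = function(e) conditionMessage(e)
)
print(msg)
```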

I should check at this point, @ChVav: did you mean to use n_components = 50 in this example, or did you mean n_neighbors = 50? The latter makes more sense for iris, but I understand that you may have been using a different dataset that can't be shared for reproducibility purposes. If you did mean to use n_components = 50 with a dataset of similarly low dimensionality to iris, then please be aware that even after I fix the bug that is causing the crash, this is unlikely to ever work: both the spectral and PCA-based initializations will give NA after the first 4 components, so you will need to set init="rand" in umap or pass a user-defined initialization. And that's assuming I can be persuaded not to treat setting n_components higher than the number of columns in an input dataframe or matrix as an error.
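
For the user-defined initialization route, one possible sketch is to build a matrix with one row per observation and one column per component and pass it as `init`. `make_random_init` is a hypothetical helper, and the small Gaussian scale is an arbitrary assumption, not a recommendation from this thread.

```r
# Hypothetical helper (not part of uwot): random starting coordinates with
# one row per observation and one column per output dimension.
make_random_init <- function(n_obs, n_components, sd = 1e-4) {
  matrix(rnorm(n_obs * n_components, sd = sd),
         nrow = n_obs, ncol = n_components)
}

set.seed(42)
init_coords <- make_random_init(nrow(iris), 50)
dim(init_coords)

# Intended use, left commented out so the sketch runs without uwot and
# without touching the crash discussed in this issue:
# library(uwot)
# emb <- umap(iris, n_components = 50, init = init_coords)
```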

ChVav (Author) commented Sep 26, 2022

Hi, thanks for answering so fast. This is very helpful.

Silly me, yes: my actual training/test set has >800,000 variables, so in the end I mean to grid search over n_components to decide how far to reduce the dimensions. For the iris dataset, n_components = 50 of course does not make sense, my bad.
Unsupervised clustering with the umap package at least worked fine for my full dataset, testing up to 250 components.
I check for remaining NAs after imputation, so this is not the issue on my actual data.

I am testing my whole scheme with supervised dimension reduction on a subset of 5000 columns and, based on your suggestion of an underlying memory issue, found that the code works for n_components = {2,3,10} but not for n_components = 25.
So thanks, at least I understand now why this crash was happening and can try and compute all this on a server.

Thank you!

jlmelville (Owner) commented

@ChVav the problem should now be fixed on the master branch of this repo. In the released versions, the crash is triggered whenever n_components > n_neighbors. This is a serious enough bug to merit a new release on CRAN, but unfortunately I won't have much time to do this for a while. Also, I am not sure of a workaround. My apologies.
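
The n_components > n_neighbors condition can be turned into a simple screen for a grid of candidate values. This is a sketch, not a workaround: `transform_is_safe` is a hypothetical helper, and the default of n_neighbors = 15 is assumed to match the umap default used in the example code.

```r
# Hypothetical helper (not part of uwot): TRUE when the combination avoids
# the crash condition described above (n_components > n_neighbors).
transform_is_safe <- function(n_components, n_neighbors = 15) {
  n_components <= n_neighbors
}

# Candidate values taken from the earlier comment in this thread.
candidates <- c(2, 3, 10, 25)
ok <- vapply(candidates, transform_is_safe, logical(1))
print(data.frame(n_components = candidates, safe = ok))
```

Under that assumed default, only n_components = 25 is flagged, which is consistent with the values reported to work and to fail earlier in the thread.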

I would like to keep this issue open until I also fix the bugs around checking for NA in initial data and warning when n_components is probably set too high.

ChVav (Author) commented Sep 27, 2022

Yes, I just tested your version on the master branch, and it also works for n_components > n_neighbors.

Many thanks for the help! :)
