
Reproducibility issue with the same data and OS #110

Closed
SuhasSrinivasan opened this issue Jun 2, 2023 · 17 comments
Labels
enhancement New feature or request


@SuhasSrinivasan

SuhasSrinivasan commented Jun 2, 2023

Hi jlmelville, thank you for developing and maintaining this package.

I am seeing a severe issue: the global structure of the 2D embedding changes drastically between consecutive runs, even though the data, the umap parameters, and the machine are all held constant.

The local relationships among the points within each cluster are similar, but they are difficult to gauge immediately because of the large change in global structure.

Additionally, the issue is not resolved by any of the many combinations of umap parameters outlined in Issues #46 and #55.

Parameters checked include the following:

set.seed(123)
init = spectral, spca and pca
approx_pow = T and F
n_threads = NULL and 1
n_sgd_threads = 0 and 1
metric = euclidean and manhattan
scale = Z, colrange, range and maxabs
learning_rate = 0.25, 0.5 and 1

Below is the initial part of the R session information.

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

To determine whether the issue is specific to my environment, I tried to reproduce it with umap::umap instead of uwot::umap.
I was not able to: the umap::umap embedding maintains its local and global structure across consecutive runs.

Please let me know if there are any workarounds, troubleshooting steps or if you need additional information.
Thank you!

@jlmelville
Owner

That sounds bad. Unfortunately, my ability to diagnose the problem is going to be limited without access to data that reproduces the issue, but I understand if you are unable to share it. So here are some things to try that might narrow it down a bit:

  • does this happen with other datasets or just the one you are currently analyzing? For example, are results with iris reproducible?
  • set verbose = TRUE and have a look at the output (feel free to paste it here if that will help). Maybe there is a warning or some other info that could give a clue.
  • set n_epochs = 0. This will mean only the initialization takes place, and none of the optimization. Try it with init = "spectral" and init = "spca" and repeat runs. If the output is consistent between runs for a given value of init, then we know the issue is with the optimization step.
  • calculate nearest neighbors externally to uwot and pass it in. Here's an example with the iris dataset using the internal Annoy routines:
    iris_nn_annoy <- uwot:::annoy_nn(uwot:::x2m(iris), k = 15)
    set.seed(123)
    iris_res <- umap(iris, nn_method = iris_nn_annoy, verbose = TRUE)
    Repeatedly running set.seed(123); iris_res <- umap(iris, nn_method = iris_nn_annoy, verbose = TRUE) should give reproducible output for iris_res. Does that work with your data? If not, the problem may lie in the calibration of input distances to affinities, or in the optimization. If you do get reproducible results, then it points to an issue with the nearest neighbor search.

If you are able to try any of this let me know of any findings.
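As a sketch, the n_epochs = 0 check suggested above could look like this (using iris and the spectral initialization; all.equal() compares the two embeddings):

```r
library(uwot)

# Initialization only: n_epochs = 0 skips the stochastic gradient descent,
# so any run-to-run differences must come from the initialization itself.
init1 <- umap(iris, n_epochs = 0, init = "spectral")
init2 <- umap(iris, n_epochs = 0, init = "spectral")

# TRUE here means the initialization is deterministic on this machine;
# differences point at the spectral decomposition (e.g. IRLBA/BLAS)
all.equal(init1, init2)
```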

@SuhasSrinivasan
Author

Hi jlmelville,

Thank you for quickly reviewing this issue!

Sorry, I should have tried it with the iris dataset and posted that to begin with.
The issue is indeed reproducible with iris as well.

Here is the execution with default parameters and below are Runs 1.1 and 1.2 respectively.

> umap_res = data.frame(uwot::umap(iris, verbose = T))
22:31:59 UMAP embedding parameters a = 1.896 b = 0.8006
22:31:59 Converting dataframe to numerical matrix
22:31:59 Read 150 rows and found 4 numeric columns
22:31:59 Using FNN for neighbor search, n_neighbors = 15
22:31:59 Commencing smooth kNN distance calibration using 6 threads with target n_neighbors = 15
22:32:00 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
22:32:00 Using 'svd' for PCA
22:32:00 PCA: 2 components explained 97.77% variance
22:32:00 Scaling init to sdev = 1
22:32:00 Commencing optimization for 500 epochs, with 2694 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
22:32:00 Optimization finished

Run1.1
Run1.2

For iris, this issue goes away when n_epochs = 0: the embedding structure is then identical across runs for both init = 'spectral' and init = 'spca'.

Unfortunately, the issue still exists for my data with n_epochs = 0; the execution log is below.

> umap_res = data.frame(uwot::umap(df, verbose = T, n_epochs = 0))
22:48:19 UMAP embedding parameters a = 1.896 b = 0.8006
22:48:19 Converting dataframe to numerical matrix
22:48:19 Read 30 rows and found 19414 numeric columns
22:48:19 Using FNN for neighbor search, n_neighbors = 15
22:48:19 Commencing smooth kNN distance calibration using 6 threads with target n_neighbors = 15
22:48:20 Initializing from normalized Laplacian + noise (using irlba)

This time I noticed a rotation about the central axis of the plot, with the local cluster structure staying consistent.
Strangely, two consecutive runs can be the same, but the third might change.
Or the first and third runs are the same, with the second one flipped; see Runs 2.1, 2.2 and 2.3.

Run2.1
Run2.2
Run2.3

Let me know if you need more information from me.
Thank you again!

@LTLA
Contributor

LTLA commented Jun 2, 2023

Given that it seems like an initialization-related problem, I wonder whether this is related to some issues with IRLBA in combination with certain non-default BLAS/LAPACK libraries, see bwlewis/irlba#14. Smells like some non-determinism related to parallelization, though I'm not too familiar with the Accelerate library.

@SuhasSrinivasan
Author

Thank you.
Regarding parallelization, I tried the following as well, but it did not help:

umap_res = data.frame(uwot::umap(df, verbose = T, n_epochs = 0, n_threads = 1, n_sgd_threads = 1))

@jlmelville
Owner

Thank you @SuhasSrinivasan for the details. Some observations:

First, I'm not sure about the iris result. Can you clarify what commands you are running? Scenario 1:

set.seed(123); plot(umap(iris))
set.seed(123); plot(umap(iris))

These should give identical results. If they don't, then there is something very strange with your installation of R and I am not sure it's a problem with uwot.

Scenario 2:

set.seed(123);
plot(umap(iris))
plot(umap(iris))

These will not give identical results by design. UMAP uses random numbers as part of the stochastic gradient descent. It will give different results for each run. For reproducibility you will need to fix the seed and reset it (via set.seed) before running umap every single time. If scenario 2 is what you are seeing, then there is (almost certainly) no bug.
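As a sketch, the two scenarios can be distinguished with identical() (run order matters, because each umap call advances the RNG state):

```r
library(uwot)

# Scenario 1: reset the seed before every call
set.seed(123); res1 <- umap(iris)
set.seed(123); res2 <- umap(iris)
identical(res1, res2)  # reproducible: same RNG state before each call

# Scenario 2: one seed, two calls
set.seed(123)
res3 <- umap(iris)  # this call consumes random numbers...
res4 <- umap(iris)  # ...so this one starts from a different RNG state
identical(res3, res4)  # not reproducible, by design
```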

With that out of the way, if you are seeing this problem with init = "spectral" then as @LTLA suggests, this could be a problem with IRLBA. Some things to try:

  1. If you have any ability to try a different BLAS library, that is worth a go. Unfortunately, I don't know how to do this on the Mac. Maybe the linux instructions by Carlos Santillan are of use to you.
  2. I know things have gone horribly wrong for me on R using Intel's MKL with multiple threads. So if the Accelerate BLAS is multi-threaded and there is a way to set an environment variable to only use one thread, try that and see if it helps.
  3. If you are able to install RSpectra then init = "spectral" will use that package and not IRLBA (or BLAS as far as I know). That might fix the issue.

As an aside, it's not impossible for small numerical differences to cause sign flips in the spectral or PCA initialization. I'm afraid I don't do anything to standardize that (and in fact the latest R release has changed some SVD routines, which can cause this), so the initialized output should only be expected to be the same up to a reflection. I am slightly surprised that it's triggered for the same input, but that could be a random number issue (it's been a while since I looked at IRLBA, so I don't actually know if it uses a randomized algorithm). Otherwise, it is suggestive of some oddities in the BLAS library.

Finally, although this is separate from the initialization problem, I notice you only have 30 observations in your dataset. The default n_neighbors = 15 might be a bit large for that and you may not see much clustering.

@jlmelville
Owner

Thank you. Regarding parallelization, I tried the following too, but it did not help

umap_res = data.frame(uwot::umap(df, verbose = T, n_epochs = 0, n_threads = 1, n_sgd_threads = 1))

If there is some parallelization happening in the BLAS library, you cannot control that through uwot's interface, because none of that information is passed to IRLBA. As far as I know, IRLBA doesn't expose any ability to control the parallelization. If your BLAS library is multi-threaded, then you will need to consult its documentation to see if there is a way to control that. For example, for Intel's MKL version of BLAS, you can set an environment variable before you start R.
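For Apple's Accelerate BLAS, the thread count can reportedly be capped with the VECLIB_MAXIMUM_THREADS environment variable, or from within R via the third-party RhpcBLASctl package; both are assumptions to verify against your own setup, not something uwot controls:

```r
# Set before R starts, e.g. in the shell: export VECLIB_MAXIMUM_THREADS=1
# Or, from within R, using the RhpcBLASctl package:
# install.packages("RhpcBLASctl")
RhpcBLASctl::blas_set_num_threads(1)   # cap BLAS at one thread
RhpcBLASctl::blas_get_num_procs()      # check what BLAS now reports
```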

@SuhasSrinivasan
Author

SuhasSrinivasan commented Jun 2, 2023

set.seed(123) was executed once during initialization of the script, but not before consecutive runs.
On checking .Random.seed I now see that it changes after each umap call.
Sorry for not knowing about this, and sorry for the false alarm.
Coming from umap-learn, I did not realize that once the seed was set, the RNG state could still change.

I will close this issue, but may I submit a feature request for a random_state parameter similar to the implementation below, so that users are aware of setting this value for reproducibility?

https://umap-learn.readthedocs.io/en/latest/reproducibility.html

@jlmelville
Owner

Thank you for digging deeper @SuhasSrinivasan.

I'm not against the idea of a random_state parameter but as far as I know it's not really the R way to pass around random_state as done in Python scikit-learn-style packages. Calling set.seed is the equivalent. There's nothing I can do internally that wouldn't boil down to doing that. Even then, I don't see anything conceptually different between the following code:

set.seed(123)
rnorm(10)
rnorm(10)

and

set.seed(123)
head(umap(iris))
head(umap(iris))

I don't think most people would expect a pseudo-random number generator like rnorm to return the same values on each call without calling set.seed each time. It's the same for the output of umap. If someone can point me to some well-used R libraries that do provide a random_state-esque parameter (not ones that just pass straight through to Python via reticulate), I am happy to change my mind on this.

That said, I think there may be a real problem here in that .Random.seed is changing between runs, which I can reproduce. I think this is a bug (although honestly I didn't even know .Random.seed existed until just now so thank you for pointing me to that). The question is what is causing that vector to change -- I am not explicitly running set.seed internally and it only happens if n_epochs is > 0, which means it's something in the C++ code. So either something I am doing inadvertently or something I didn't understand about a library I am using (dqrng or Rcpp probably).
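A minimal way to observe which stage consumes RNG state is to compare .Random.seed before and after each kind of call (a sketch):

```r
library(uwot)

set.seed(123)
before <- .Random.seed
res0 <- umap(iris, n_epochs = 0)   # initialization only
identical(before, .Random.seed)    # does init alone advance the RNG?

set.seed(123)
before <- .Random.seed
res <- umap(iris)                  # full run with optimization
identical(before, .Random.seed)    # FALSE here implicates the optimizer
```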

Let's leave this issue open for now.

@SuhasSrinivasan
Author

@jlmelville thank you so much for investigating this further and for understanding my limited experience with this.

I noticed another behavior which may be relevant: it could be a combination of set.seed() and another specific compute step.
When using init = 'pca' and pca_method = 'svd', the results are stable across consecutive runs for my data without setting a seed (unexpected?), but not for iris.

> .Random.seed[c(1, 100, 200, 300, 400, 500, 600)]
[1]       10403 -1980858986    12825002 -1630469076  -461320936   228756363  1916908538

> umap_res = data.frame(uwot::umap(df, init = 'pca', verbose = T, pca_method = 'svd'))
10:48:19 UMAP embedding parameters a = 1.896 b = 0.8006
10:48:19 Converting dataframe to numerical matrix
10:48:19 Read 30 rows and found 19414 numeric columns
10:48:19 Using FNN for neighbor search, n_neighbors = 15
10:48:19 Commencing smooth kNN distance calibration using 6 threads with target n_neighbors = 15
10:48:19 Using 'svd' for PCA
10:48:19 PCA: 2 components explained 64.53% variance
10:48:19 Commencing optimization for 500 epochs, with 600 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
10:48:20 Optimization finished

> .Random.seed[c(1, 100, 200, 300, 400, 500, 600)]
[1]       10403 -1980858986    12825002 -1630469076  -461320936   228756363  1916908538

> head(umap_res)
                      X1          X2
Airway.B1      -80973.57 -162809.844
Airway.B2     -198197.30  -98631.000
Astrocytes.B1  -99855.22  121753.836
Astrocytes.B2 -273940.22  103751.938
Bladder.B1    -196972.81   14210.054
Bladder.B2    -161322.26    9845.696

> umap_res = data.frame(uwot::umap(df, init = 'pca', verbose = T, pca_method = 'svd'))
10:48:35 UMAP embedding parameters a = 1.896 b = 0.8006
10:48:35 Converting dataframe to numerical matrix
10:48:36 Read 30 rows and found 19414 numeric columns
10:48:36 Using FNN for neighbor search, n_neighbors = 15
10:48:36 Commencing smooth kNN distance calibration using 6 threads with target n_neighbors = 15
10:48:36 Using 'svd' for PCA
10:48:36 PCA: 2 components explained 64.53% variance
10:48:36 Commencing optimization for 500 epochs, with 600 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
10:48:36 Optimization finished

> .Random.seed[c(1, 100, 200, 300, 400, 500, 600)]
[1]      10403 1554347705 2052661169  325463611 1006457283 -159943609 1217185183

> head(umap_res)
                      X1          X2
Airway.B1      -80973.57 -162809.844
Airway.B2     -198197.30  -98631.000
Astrocytes.B1  -99855.22  121753.836
Astrocytes.B2 -273940.22  103751.938
Bladder.B1    -196972.81   14210.054
Bladder.B2    -161322.26    9845.696

Iris

> .Random.seed[c(1, 100, 200, 300, 400, 500, 600)]
[1]      10403 1554347705 2052661169  325463611 1006457283 -159943609 1217185183

> umap_res = data.frame(uwot::umap(iris, init = 'pca', verbose = T, pca_method = 'svd'))
10:50:19 UMAP embedding parameters a = 1.896 b = 0.8006
10:50:19 Converting dataframe to numerical matrix
10:50:19 Read 150 rows and found 4 numeric columns
10:50:19 Using FNN for neighbor search, n_neighbors = 15
10:50:19 Commencing smooth kNN distance calibration using 6 threads with target n_neighbors = 15
10:50:19 Using 'svd' for PCA
10:50:19 PCA: 2 components explained 97.77% variance
10:50:19 Commencing optimization for 500 epochs, with 2694 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
10:50:19 Optimization finished

> .Random.seed[c(1, 100, 200, 300, 400, 500, 600)]
[1]       10403 -1222614363  1066156676  -302350240   419159221  -273515205 -1322681970

> head(umap_res)
         X1        X2
1 -14.96021 -6.359911
2 -13.65837 -4.851874
3 -14.25497 -4.871355
4 -13.96090 -5.064689
5 -14.86268 -6.224919
6 -15.02521 -7.292343

> umap_res = data.frame(uwot::umap(iris, init = 'pca', verbose = T, pca_method = 'svd'))
10:50:30 UMAP embedding parameters a = 1.896 b = 0.8006
10:50:30 Converting dataframe to numerical matrix
10:50:30 Read 150 rows and found 4 numeric columns
10:50:30 Using FNN for neighbor search, n_neighbors = 15
10:50:30 Commencing smooth kNN distance calibration using 6 threads with target n_neighbors = 15
10:50:31 Using 'svd' for PCA
10:50:31 PCA: 2 components explained 97.77% variance
10:50:31 Commencing optimization for 500 epochs, with 2694 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
10:50:31 Optimization finished

> .Random.seed[c(1, 100, 200, 300, 400, 500, 600)]
[1]      10403   -1150057 -386797537 -960462357  993904053 1467933228  -54714025

> head(umap_res)
          X1         X2
1 -10.057139 -0.3562483
2  -8.396251 -0.8626499
3  -8.598498 -1.3419981
4  -8.413740 -1.1751511
5  -9.951806 -0.4802876
6 -10.638429 -1.2009307

@jlmelville
Owner

Ok, I've dug into this a bit more and I don't think there is a bug. .Random.seed is just the internal state of the R PRNG. If you generate enough random numbers it will change, e.g.:

> set.seed(123)
> head(.Random.seed)
[1]      10403        624 -983674937  643431772 1162448557 -959247990
> res <- rnorm(10000)
> head(.Random.seed)
[1]      10403         32 1064722786  -45720197  139516934 1211617164

So I don't think there are any bugs here. As long as you run in single-threaded mode, then calling set.seed() before each run of umap should give reproducible results.

I appreciate that needing to call an external function like set.seed could be inconvenient or impossible in some uses of UMAP, so I will consider adding a seed parameter that can be used for reproducibility, although it will be unset by default. Thank you for the feedback and assistance @SuhasSrinivasan.

@SuhasSrinivasan
Author

Thank you for the clarification, and I regret any inconvenience.

Any insight as to why, in some scenarios, the result is stable across consecutive runs even without set.seed()? That seems unexpected.

@jlmelville
Owner

Any insight as to why in some scenarios the result is stable for consecutive runs, when not using set.seed(), which is unexpected?

Not really. Can you give an example of what combination of parameters causes this and what you mean by "stable"? Exactly the same output coordinates? Similar interpoint distances?

@SuhasSrinivasan
Author

Exactly the same output coordinates

#110 (comment)

@jlmelville
Owner

Ah yes, sorry I didn't spot that sooner.

The absolute values of the output coordinates are very large. That's presumably because of the scaling of the input data. If you do PCA on the input (init = "pca"), then the initial coordinates also take very large values. That means they have very large interpoint distances, $d_{ij}$, which lead to tiny gradients. So what is happening in your example with your data is that after initialization, the optimization step does effectively nothing, so the difference in RNG state doesn't matter. I predict that if you compare those coordinates with the same run with n_epochs = 0, you will get the same (or very similar) coordinates. For iris, the scale of the coordinates is much smaller, which is why they change under the same settings.

The way to remedy this is to either use init = "spca" or to set init_sdev directly. Try init_sdev = 1.0 or init_sdev = 10.0.

FWIW, this problem isn't specific to uwot: in fact none of the t-SNE/LargeVis/UMAP-like dimensionality reduction methods are scale-invariant. Most of them will scale the input data and the output coordinates in a way to guarantee sufficiently large gradients for some kind of optimization to occur. uwot's default settings also account for this, but if you stray off the beaten path then under the "it's your funeral" principle, all bets are off. Perhaps I should add a check to look at the scale of the initial coordinates and warn if very little optimization can occur to avoid this potential confusion.
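The suggested remedy can be sketched as follows (df stands for the original data frame from earlier in the thread):

```r
library(uwot)

# Raw PCA init inherits the large scale of the input: huge interpoint
# distances, tiny gradients, and an optimizer that effectively does nothing.
set.seed(123)
res_raw <- umap(df, init = "pca")

# Scaling the initial coordinates restores gradients of a usable size.
set.seed(123)
res_scaled <- umap(df, init = "pca", init_sdev = 1)

# init = "spca" also applies scaling to the PCA coordinates
set.seed(123)
res_spca <- umap(df, init = "spca")
```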

@SuhasSrinivasan
Author

Thank you for the very helpful explanation of the optimization step!
Should I close this issue for now, or will it be used to track the seed parameter and the scale checks and logging?

@jlmelville
Owner

Leave it open for now please @SuhasSrinivasan, thanks.

@jlmelville
Owner

In the master branch there is now a seed parameter which can be set to try to enforce reproducibility. There is also a warning if any column of the initial coordinates has a standard deviation greater than 100.

A corner case for reproducibility: if you have a small dataset (fewer than 4,096 observations), then results will not be reproducible between running umap with ret_model = TRUE and with ret_model = FALSE, even with the same value of seed. This is because setting ret_model = TRUE forces the use of Annoy for the nearest neighbor calculations, which uses the random number generator. The workaround is to set nn_method = "annoy" in both cases, but it's a bit surprising to the user that an apparently unrelated parameter can cause this change. I suppose the fix would be to call set.seed twice, once before any nearest neighbor calculations and once before the optimization, but that would make the behavior of setting seed = 42 different from just calling set.seed. I am settling for documenting this in the ret_model description.
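The workaround looks like the following sketch (assuming the returned model stores the coordinates in $embedding, per the uwot documentation):

```r
library(uwot)

# Force Annoy in both calls so the RNG is consumed the same way,
# whether or not a model is returned
set.seed(42)
res_plain <- umap(iris, nn_method = "annoy")

set.seed(42)
res_model <- umap(iris, nn_method = "annoy", ret_model = TRUE)

all.equal(res_plain, res_model$embedding)  # should now agree
```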

I am going to close this issue, but feel free to re-open.
