umap non determinism - intended? #27

Closed
allenqm opened this issue Nov 26, 2017 · 11 comments

Comments

@allenqm

allenqm commented Nov 26, 2017

I was testing it out and noticed that setting the random seed doesn't stop the embedding from changing between runs.

Is non-determinism part of the design (as with t-SNE)? Is there a way to replicate prior results?

@lmcinnes
Owner

That would definitely be a bug. I believe I had the seed working such that it eliminated randomness: the latest version of UMAP has a random_state parameter you can set on initialisation. You can set it to a number or a numpy RandomState; either way, you should be able to fix it to reproduce results.

Just setting numpy's random seed is not going to be enough because of interactions with numba and the fact that UMAP uses its own internal PRNG for speed. Can you clarify under what conditions you aren't getting repeatability?
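The point about an internal PRNG can be illustrated with a small analogy. This is a sketch using only Python's standard library, not UMAP's actual code; `draw_with_private_rng` is a hypothetical stand-in for any component that owns its own generator. It shows why seeding a global generator leaves such a component unaffected, while passing the seed to the component itself (the analogue of random_state) pins its output.

```python
import random


def draw_with_private_rng(seed=None):
    """Draw three numbers from a generator owned by the component itself.

    Hypothetical stand-in for a library that keeps its own PRNG.
    """
    rng = random.Random(seed)  # private generator, independent of the global one
    return [rng.random() for _ in range(3)]


# Seeding the global generator does not touch the private one,
# so two unseeded draws differ even though the global seed is identical.
random.seed(42)
unseeded_a = draw_with_private_rng()
random.seed(42)
unseeded_b = draw_with_private_rng()

# Passing the seed to the component itself makes the draws repeatable.
seeded_a = draw_with_private_rng(seed=42)
seeded_b = draw_with_private_rng(seed=42)
```

Here `seeded_a` and `seeded_b` are identical, while `unseeded_a` and `unseeded_b` differ despite the global seed being reset before each call.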

@allenqm
Author

allenqm commented Nov 30, 2017

Hi, thanks for responding. In this case I'm using the random_state parameter and setting it to 42:

import umap

# wordvectors is an n_samples x n_dims numpy array of word vectors
embedding = umap.UMAP(n_neighbors=15,
                      min_dist=0.1,
                      n_components=2,
                      random_state=42,
                      metric='correlation', verbose=3).fit_transform(wordvectors)

I then plot the embedding like so:
pandas.DataFrame(embedding).plot(kind='scatter', x=0, y=1, alpha=0.05)

The graph I get is different each time. I tried swapping the axes, but that doesn't explain the differences.

fyi I just pip installed umap.

Any thoughts would be helpful. Thanks!
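Comparing scatter plots by eye can hide or exaggerate differences, so a direct numeric comparison of repeated runs may be a more reliable check. The following is a minimal sketch using only the standard library; `embeddings_match` and `is_reproducible` are hypothetical helper names, and `fit` is assumed to be any callable that takes the data and returns a 2D sequence of coordinates.

```python
import math


def embeddings_match(a, b, tol=0.0):
    """Return True if two embeddings have the same shape and matching coordinates."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            # With tol=0.0 this reduces to an exact equality check.
            if not math.isclose(x, y, rel_tol=0.0, abs_tol=tol):
                return False
    return True


def is_reproducible(fit, data, runs=3, tol=0.0):
    """Call fit(data) several times and compare every result with the first."""
    first = fit(data)
    return all(embeddings_match(first, fit(data), tol) for _ in range(runs - 1))
```

With the call from this comment, usage might look like `is_reproducible(lambda X: umap.UMAP(random_state=42).fit_transform(X), wordvectors)`, assuming `umap` is imported and `wordvectors` is defined as above.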

@lmcinnes
Owner

lmcinnes commented Dec 1, 2017

Okay, that's definitely disconcerting, because I worked through getting random_state to work properly (which turned out to be frustratingly non-trivial), and for at least the test dataset I was working with it produced perfectly consistent results when fixed. I'll try a few other datasets to verify that it is indeed working for me at least, and then perhaps we can start trying to track down why it isn't working for you. Which Python version are you using? That's potentially one reason for issues...

@vseledkin

The nondeterminism probably comes from an unstable result of the metric_nn_descent function. I observe that some rows of the returned knn_indices and knn_dists are not sorted by kNN distance (this may be a serious bug, not sure).
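The observation above could be checked with a few lines of plain Python. This is only a sketch: `unsorted_rows` is a hypothetical helper, and it assumes `knn_dists` is a 2D array-like with one row of neighbour distances per sample, as the variable name in this comment suggests.

```python
def unsorted_rows(knn_dists):
    """Return indices of rows whose distances are not in non-decreasing order."""
    bad = []
    for i, row in enumerate(knn_dists):
        values = list(row)
        # A row is out of order if any distance is smaller than its predecessor.
        if any(later < earlier for earlier, later in zip(values, values[1:])):
            bad.append(i)
    return bad
```

An empty result means every row is sorted by distance; any returned index points at a row worth inspecting.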

@lmcinnes
Owner

lmcinnes commented Jun 8, 2018

That is a bug that was caught and should be fixed in more recent versions. It should either be in the current master or will appear in version 0.3.

@vmarkovtsev

This is still happening for me on 0.3.8. However, @warenlg found that fixing the numpy seed makes it fully deterministic: numpy.random.seed(42)

@lmcinnes
Owner

lmcinnes commented Apr 3, 2019

So this is somewhat disconcerting, and is definitely on my list of things to fix. I'm honestly not quite sure where or how this is happening.

@ericloud

ericloud commented Sep 3, 2019

This is still happening for me on 0.3.10.
I tried several ways of fixing the numpy seed, as proposed by @vmarkovtsev, but it doesn't work for me.
Is it possible to have more details? An example would be great.

@sleighsoft
Collaborator

@ericloud Can you provide example code of what you did in order for others to reproduce the issue?

@ericloud

ericloud commented Sep 4, 2019

Eureka!
random_state works perfectly fine on my side.
The problem was in the input matrix, where the features were randomly ordered.

# This conversion returns unique ids, but not in a deterministic order
ids_selected = list(set(ids_selected))

mat = mat.loc[ids_selected]

# One way to fix it: sort the index so the order is stable
mat.sort_index(inplace=True)

embedding = umap.UMAP(
    n_neighbors=10,
    random_state=42
).fit_transform(mat.T)

Thanks.

@sleighsoft
Collaborator

Glad you resolved it. See here for details on Python Data Structures https://docs.python.org/3/tutorial/datastructures.html#sets.
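A set has no defined order, and for strings the iteration order can change between interpreter runs because string hashing is randomized by default. Two common ways to deduplicate with a stable order are sketched below; the helper names and the `gene_*` ids are made up for illustration.

```python
def sorted_unique(ids):
    """Deduplicate and return the ids in sorted order."""
    return sorted(set(ids))


def first_seen_unique(ids):
    """Deduplicate while keeping the order of first appearance.

    Relies on dicts preserving insertion order (Python 3.7+).
    """
    return list(dict.fromkeys(ids))


# Hypothetical ids with duplicates.
ids = ["gene_c", "gene_a", "gene_b", "gene_a", "gene_c"]

stable_sorted = sorted_unique(ids)          # ['gene_a', 'gene_b', 'gene_c']
stable_first_seen = first_seen_unique(ids)  # ['gene_c', 'gene_a', 'gene_b']
```

Either result can be passed to `mat.loc[...]` in place of `list(set(ids_selected))`, so the matrix handed to UMAP has the same ordering on every run.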
