Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spectral initialization for 1D line #360

Closed
dkobak opened this issue Feb 13, 2020 · 9 comments
Closed

Spectral initialization for 1D line #360

dkobak opened this issue Feb 13, 2020 · 9 comments

Comments

@dkobak
Copy link

dkobak commented Feb 13, 2020

Hi Leland, I was playing around with embedding a 1D line and noticed that the spectral initialization does not behave like I think it should.

Here is a reproducible example:

n = 10000
X = np.zeros((n,3))
X[:,0] = np.arange(n)

from sklearn.manifold import SpectralEmbedding
S = SpectralEmbedding(n_components=2, n_neighbors=15).fit_transform(X)
Z = UMAP().fit_transform(X)

plt.figure(figsize=(6,3))
plt.subplot(121)
plt.scatter(Z[:,0], Z[:,1], s=1, c=np.arange(n))
plt.subplot(122)
plt.scatter(S[:,0], S[:,1], s=1, c=np.arange(n))

index

The spectral embedding looks like a parabola, without any overlaps. But the UMAP result looks like a parabola folded in two. I ran it with small n_epochs and then it's even clearer that it is initialized with the parabola folded in two.

Why would that be?

@lmcinnes
Copy link
Owner

lmcinnes commented Feb 13, 2020 via email

@dkobak
Copy link
Author

dkobak commented Feb 13, 2020

That's what I am suspecting. Is this the relevant function? https://github.com/lmcinnes/umap/blob/master/umap/spectral.py#L213

@jlmelville
Copy link
Collaborator

@dkobak, yes that would be the function. For the example data you provide, it's using the scipy.sparse.linalg.eigsh code path.

This is probably an initialization issue (and specifically probably a convergence of the eigenvector calculation): the spectral initialization in uwot, which uses the same graph Laplacian but uses RSpectra to find the eigenvectors, correctly initializes the data to a parabola. Conversely, the Laplacian Eigenmap initialization does a pretty bad job, even though the only difference is the choice of the graph Laplacian.

@dkobak
Copy link
Author

dkobak commented Feb 22, 2020

Hi @jlmelville.

This is probably an initialization issue (and specifically probably a convergence of the eigenvector calculation): the spectral initialization in uwot, which uses the same graph Laplacian but uses RSpectra to find the eigenvectors, correctly initializes the data to a parabola.

Do you think it's possible? scipy.sparse.linalg.eigsh uses ARPACK to compute the eigenvectors and as far as I know Spectra is a C++ reimplementation of ARPACK routines. ARPACK is very well tested, so I'd be surprised if it returns nonconverged results. Also, for this 1D line example, the Laplacian should be so well behaved that I wouldn't expect any tricky convergence problems.

To be honest, I rather suspected some bug/problem in the UMAP code around the eigsh. But I didn't investigate it.

Conversely, the Laplacian Eigenmap initialization does a pretty bad job, even though the only difference is the choice of the graph Laplacian.

Not sure what you mean here. My code snippet above uses https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html which is Laplacian Eigenmaps, and it yields a parabola shape.

@jlmelville
Copy link
Collaborator

There are a whole bunch of convergence options that are exposed in the routines we are discussing, so it seems feasible that the speed-vs-accuracy trade-off is set incorrectly for some datasets.

In my Laplacian Eigenmaps example, I had tol = 1e-4 and I get a bad result. If I change tol = 1e-6 and set maxitr = 5000 (up from 1000), I get a nice parabola like you do.

@dkobak
Copy link
Author

dkobak commented Feb 23, 2020

Hmm. Sklearn SpectralEmbedding uses ARPACK eigsh by default, with tol=0.0 (which also is default in eigsh): https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/manifold/_spectral_embedding.py#L179

Leland's code calls eigsh with tol=1e-4: https://github.com/lmcinnes/umap/blob/master/umap/spectral.py#L267

One could change the params of the eigsh call in spectral.py and see if it makes the difference in my toy example.

@jlmelville
Copy link
Collaborator

Based on my experiences with RSpectra in uwot, I am sure that decreasing the tol parameter would fix the issue.

@dkobak
Copy link
Author

dkobak commented Mar 11, 2020

@jlmelville Interestingly, adding a small amount of Gaussian noise to the simulated data makes the problem go away. I wonder if it somehow can make eigsh converge faster.

@dkobak
Copy link
Author

dkobak commented Mar 16, 2020

Based on my experiences with RSpectra in uwot, I am sure that decreasing the tol parameter would fix the issue.

You are right. Pavlin and me now found out the same in the linked thread at openTSNE. Decreasing the tolerance to zero does indeed fix this issue can make the runtime a lot slower. So it's not clear that it would be advantageous by default. I guess I will close this issue then.

@dkobak dkobak closed this as completed Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants