Spectral initialization for 1D line #360

dkobak · 2020-02-13T09:37:18Z

Hi Leland, I was playing around with embedding a 1D line and noticed that the spectral initialization does not behave like I think it should.

Here is a reproducible example:

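import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from umap import UMAP

# 10,000 equally spaced points on a straight line, embedded in 3D
n = 10000
X = np.zeros((n, 3))
X[:, 0] = np.arange(n)

S = SpectralEmbedding(n_components=2, n_neighbors=15).fit_transform(X)
Z = UMAP().fit_transform(X)

# Left panel: UMAP result; right panel: sklearn spectral embedding
plt.figure(figsize=(6, 3))
plt.subplot(121)
plt.scatter(Z[:, 0], Z[:, 1], s=1, c=np.arange(n))
plt.subplot(122)
plt.scatter(S[:, 0], S[:, 1], s=1, c=np.arange(n))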

The spectral embedding looks like a parabola, without any overlaps. But the UMAP result looks like a parabola folded in two. I ran it with small n_epochs and then it's even clearer that it is initialized with the parabola folded in two.

Why would that be?
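For reference, here is a small sketch of why a parabola without overlaps is the expected shape. It makes a simplifying assumption that is not in the snippet above: each point is connected only to its two immediate neighbours (a path graph) instead of its 15 nearest neighbours, and the unnormalized Laplacian is used. The helper name path_laplacian is made up for illustration. For such a chain the first two non-constant eigenvectors are proportional to cos(theta) and cos(2*theta) = 2*cos(theta)**2 - 1, so the second coordinate is a quadratic function of the first.

```python
import numpy as np

def path_laplacian(n):
    """Unnormalized graph Laplacian L = D - A of a chain of n points (illustrative helper)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

n = 200
# eigh returns eigenvalues in ascending order, eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(path_laplacian(n))

# Column 0 is the constant vector (eigenvalue 0); the next two columns are
# the coordinates a 2-component spectral embedding would use.
u1, u2 = eigenvectors[:, 1], eigenvectors[:, 2]

# If u2 is a quadratic function of u1, a degree-2 fit reproduces it exactly.
coeffs = np.polyfit(u1, u2, deg=2)
max_fit_error = np.max(np.abs(np.polyval(coeffs, u1) - u2))

# u1 changes monotonically along the chain, so the curve does not fold back.
steps = np.diff(u1)
is_monotone = bool(np.all(steps > 0) or np.all(steps < 0))

print("max deviation from a parabola:", max_fit_error)
print("first coordinate monotone along the line:", is_monotone)
```

A folded parabola would mean the first coordinate is not monotone along the line, which is one way to check an initialization numerically instead of by eye.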

lmcinnes · 2020-02-13T14:57:22Z

An error in the UMAP version of spectral embedding perhaps?

dkobak · 2020-02-13T16:06:41Z

That's what I am suspecting. Is this the relevant function? https://github.com/lmcinnes/umap/blob/master/umap/spectral.py#L213

jlmelville · 2020-02-22T06:45:23Z

@dkobak, yes that would be the function. For the example data you provide, it's using the scipy.sparse.linalg.eigsh code path.

This is probably an initialization issue (and specifically probably a convergence of the eigenvector calculation): the spectral initialization in uwot, which uses the same graph Laplacian but uses RSpectra to find the eigenvectors, correctly initializes the data to a parabola. Conversely, the Laplacian Eigenmap initialization does a pretty bad job, even though the only difference is the choice of the graph Laplacian.

dkobak · 2020-02-22T22:56:45Z

Hi @jlmelville.

> This is probably an initialization issue (and specifically probably a convergence of the eigenvector calculation): the spectral initialization in uwot, which uses the same graph Laplacian but uses RSpectra to find the eigenvectors, correctly initializes the data to a parabola.

Do you think it's possible? scipy.sparse.linalg.eigsh uses ARPACK to compute the eigenvectors and as far as I know Spectra is a C++ reimplementation of ARPACK routines. ARPACK is very well tested, so I'd be surprised if it returns nonconverged results. Also, for this 1D line example, the Laplacian should be so well behaved that I wouldn't expect any tricky convergence problems.

To be honest, I rather suspected some bug/problem in the UMAP code around the eigsh. But I didn't investigate it.

> Conversely, the Laplacian Eigenmap initialization does a pretty bad job, even though the only difference is the choice of the graph Laplacian.

Not sure what you mean here. My code snippet above uses https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html which is Laplacian Eigenmaps, and it yields a parabola shape.

jlmelville · 2020-02-22T23:14:22Z

There are a whole bunch of convergence options that are exposed in the routines we are discussing, so it seems feasible that the speed-vs-accuracy trade-off is set incorrectly for some datasets.

In my Laplacian Eigenmaps example, I had tol = 1e-4 and I get a bad result. If I change tol = 1e-6 and set maxitr = 5000 (up from 1000), I get a nice parabola like you do.

dkobak · 2020-02-23T09:01:42Z

Hmm. Sklearn SpectralEmbedding uses ARPACK eigsh by default, with tol=0.0 (which is also the default in eigsh): https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/manifold/_spectral_embedding.py#L179

Leland's code calls eigsh with tol=1e-4: https://github.com/lmcinnes/umap/blob/master/umap/spectral.py#L267

One could change the parameters of the eigsh call in spectral.py and see whether it makes a difference in my toy example.
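A rough sketch of how such a check might look outside of UMAP, using scipy only. Everything except the tol argument and which="SM" style of call is an assumption made for illustration: the graph is a plain chain rather than UMAP's fuzzy kNN graph, the Laplacian is the symmetric normalized one, and the helper names (chain_normalized_laplacian, smallest_eigenpairs, residual_norms) are hypothetical. The residual norm ||L v - lambda v|| is used as a direct measure of how well each returned eigenvector has converged.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def chain_normalized_laplacian(n):
    """Symmetric normalized Laplacian I - D^-1/2 A D^-1/2 of an n-point chain."""
    off = np.ones(n - 1)
    A = sp.diags([off, off], offsets=[-1, 1], format="csr")
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return (sp.identity(n, format="csr") - d_inv_sqrt @ A @ d_inv_sqrt).tocsr()

def smallest_eigenpairs(L, k, tol):
    """k smallest-magnitude eigenpairs via ARPACK; tol=0.0 means machine precision."""
    vals, vecs = eigsh(L, k=k, which="SM", tol=tol)
    order = np.argsort(vals)
    return vals[order], vecs[:, order]

def residual_norms(L, vals, vecs):
    """Per-eigenpair residual ||L v - lambda v||; near zero when converged."""
    return np.linalg.norm(L @ vecs - vecs * vals, axis=0)

# Small problem so the tight-tolerance solve finishes quickly.
L = chain_normalized_laplacian(100)
vals_tight, vecs_tight = smallest_eigenpairs(L, k=3, tol=0.0)
res_tight = residual_norms(L, vals_tight, vecs_tight)
print("residuals at tol=0.0:", res_tight)
```

To compare settings, one could call smallest_eigenpairs again with tol=1e-4 on a larger chain and look at how the residuals and the runtime change; what that shows will depend on the graph and its size.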

jlmelville · 2020-02-23T18:27:35Z

Based on my experiences with RSpectra in uwot, I am sure that decreasing the tol parameter would fix the issue.

dkobak · 2020-03-11T10:23:44Z

@jlmelville Interestingly, adding a small amount of Gaussian noise to the simulated data makes the problem go away. I wonder if it somehow can make eigsh converge faster.

dkobak · 2020-03-16T23:17:17Z

> Based on my experiences with RSpectra in uwot, I am sure that decreasing the tol parameter would fix the issue.

You are right. Pavlin and I have now found the same in the linked thread at openTSNE. Decreasing the tolerance to zero does indeed fix this issue, but it can make the runtime a lot slower. So it's not clear that it would be advantageous by default. I guess I will close this issue then.

dkobak mentioned this issue Mar 16, 2020: Add spectral initialization using diffusion maps pavlin-policar/openTSNE#115 (merged)

dkobak closed this as completed Mar 16, 2020
