Converging to a single point #32
That's an interesting and disconcerting phenomenon. It isn't immediately clear to me what is causing it. My speculation is that the issue is "noise": points that are sufficiently far from everything else that UMAP ends up trying to spread them all apart from one another, with the result that any points that are close together get packed into the point at the center so as to stay far from the scattered points around the outside. If this speculation is correct, I would expect the central dense cluster to have significant further substructure if you zoomed in on only it and ignored the outlying points. As to how to remedy this: assuming my speculation is correct (it may not be), increasing the neighborhood size should help.
I ran it again with a larger neighborhood size; here is the result. The central dense cluster does appear to have more substructure than is visible from a distance. And, interestingly enough, running UMAP again on just the sparse cloud surrounding the dense cluster might reveal some other structure? Here's a subset of 250K x 128 points: https://drive.google.com/open?id=18tEzVM7nQ3KZhJNH6HuvEHL9rrDMmGAC (122 MB). This should be enough to show the effect.
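Zooming in on only the central cluster amounts to selecting the rows whose 2D embedding falls near the centroid and re-running UMAP on the corresponding high-dimensional rows. A minimal sketch with synthetic data (the blob-plus-ring layout and the radius cutoff are illustrative assumptions, not the real face embeddings):

```python
import numpy as np

# Toy stand-in for a 2D UMAP embedding: a tight central blob plus a sparse ring.
rng = np.random.default_rng(0)
center = rng.normal(0.0, 0.5, size=(1000, 2))  # dense cluster near the origin
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
ring = 10.0 * np.column_stack([np.cos(angles), np.sin(angles)])  # sparse outliers
embedding = np.vstack([center, ring])

# Keep the points within an (assumed) radius of the embedding's centroid;
# the original high-dimensional rows at these indices could then be fed
# back into UMAP for a second, zoomed-in embedding.
centroid = embedding.mean(axis=0)
radius = 5.0  # hypothetical cutoff, chosen by eye
mask = np.linalg.norm(embedding - centroid, axis=1) < radius
central_indices = np.flatnonzero(mask)

print(central_indices.size)  # number of points in the dense core
```

The same mask can equally be inverted (`~mask`) to re-embed only the sparse surrounding cloud.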
I would guess you might need quite a large neighborhood size.
Sorry, I just realized I went in exactly the wrong direction with these values :) I'll re-run it.

edit: Also, fwiw, here's the code I'm using to render things quickly:

```python
import numpy as np

def draw_embedding(embedding, size=(1024, 1024), face_color=255, stroke_color=0):
    # Blank grayscale canvas filled with the background color.
    canvas = np.empty(size, dtype=np.uint8)
    canvas.fill(face_color)
    # Rescale the 2D embedding so it exactly spans the canvas.
    emax = embedding.max(axis=0)
    emin = embedding.min(axis=0)
    erange = emax - emin
    scale = np.subtract(canvas.shape[:2], 1) / erange
    indices = ((embedding - emin) * scale).astype(np.int32)
    # Plot each point as a single pixel.
    canvas[indices[:, 0], indices[:, 1]] = stroke_color
    return canvas
```
So I've been playing with this a little, and the obvious potential issue (the simplicial set skeleton having lots of tiny connected components) turns out not to be the case. Something very odd is going on. Increasing the neighborhood size does seem to help somewhat.
Thanks for the update. After more exploration I am more convinced that this is actually a structure of the data itself (a scattering of points that are all relatively different from one another, plus a more interesting manifold that is essentially equidistant from all the "noise") rather than a "bug", but I do agree that this is not a helpful presentation. What would you like to see in this circumstance, however? I think the results with the larger neighborhood size are a step in the right direction.
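The "noise that is essentially equidistant from everything" intuition is easy to check numerically: pairwise distances between i.i.d. points concentrate sharply in high dimension. A minimal sketch with synthetic Gaussian data (not the actual face embeddings from this thread):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128                                  # same dimensionality as the face vectors
x = rng.standard_normal((300, d))        # pure i.i.d. Gaussian "noise"

# Pairwise Euclidean distances via the Gram-matrix identity
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq = (x ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
iu = np.triu_indices(len(x), k=1)
pairwise = np.sqrt(np.maximum(d2[iu], 0.0))

# In high dimension these distances concentrate near sqrt(2 * d), so every
# noise point really is roughly equidistant from every other point.
print(pairwise.mean())                   # close to sqrt(2 * 128), about 16
print(pairwise.std() / pairwise.mean())  # small relative spread
```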
An alternative possibility occurred to me: it could be the approximate nearest neighbor search failing in enough cases, and that may be what the cloud around the outside is. That's a little harder to look into, but I'll see if I can at least find out whether that's true this evening. If that is the case then it is certainly fixable, as it is a bug, although exactly how to fix it will be an interesting question.

Edit: I thought about this some more and it seems like a likely candidate, as I am pretty sure it would produce the behaviour we are seeing here. As for a fix, I have some initial heuristics that should work, and hopefully I can refine them into something sensible that would do the job well. Definitely some work required though.
I can confirm that the approximate nearest neighbor search is not working as well as would be desirable, and, importantly, the distribution of precision is quite wide, which leads me to believe that this is indeed the source of the issue. I still have to figure out the "right" way to fix this.

Edit: Making progress on this. I think I can have an "interim" solution soon, and hopefully a more robust solution not too long after that. Sorry for the lack of visible progress, but I am now convinced that this is an implementation-related bug rather than anything fundamental to the algorithm, so it's just a matter of figuring out how best to dig myself out of that particular implementation issue.
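One way to quantify "how wide the distribution of precision is": compare an approximate neighbor result against brute-force ground truth and look at per-point recall. The sketch below fakes the approximation by searching only a random subsample (a stand-in for a real approximate index, which is an assumption on my part):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((400, 16))
k = 10

# Exact k-nearest neighbors by brute force (squared distances suffice for ranking).
sq = (x ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
np.fill_diagonal(d2, np.inf)             # a point is not its own neighbor
exact = np.argsort(d2, axis=1)[:, :k]

# Stand-in "approximate" result: exact search over a random 80% subsample,
# mimicking an index that misses some true neighbors.
keep = rng.random(len(x)) < 0.8
d2_sub = d2.copy()
d2_sub[:, ~keep] = np.inf
approx = np.argsort(d2_sub, axis=1)[:, :k]

# Per-point recall: what fraction of the true k neighbors were found.
recall = np.array([len(set(e) & set(a)) / k for e, a in zip(exact, approx)])
print(recall.mean(), recall.std())       # both the average and the spread matter
```

A high mean with a wide spread is exactly the situation described above: most points are fine, but an unlucky minority get bad neighbor lists and end up misplaced.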
I'm seeing a possible precision-related issue in some of my tests, but it goes away when I change the random_state seed. I'll work on getting some examples... |
So the good news is that I made some progress figuring out how to improve the nearest neighbor issues. The current approach would cause some performance regressions, so I just need to tweak things a little more so that it works well in cases like this without losing (too much) performance in general. The bad news is that it didn't actually "fix" the problem, which tips me back toward it being something structural in the data. I will have to play more to see if I can find a better way to give a nicer presentation.
Alright, I have an appropriate solution that should work with the current code! The nearest neighbor approximation does need to be fixed, but that is not so much the problem here, because this is "structurally true" of the data. What we actually want is for the effective repulsive forces between data points to be dampened (since the repulsion is what is actually causing the packing). Fortunately there is already a parameter for this. As my colleague pointed out, if you have a manifold plus noise in the full-dimensional ambient space (as opposed to noise off the manifold), then this is exactly what you would expect to happen, and the only reasonable way to combat it is to reduce how hard we push the noise points away from everything else.
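To see why dampening repulsion prevents the packing, here is a deliberately simplified 1D force simulation (not UMAP's actual optimizer; `gamma` merely stands in for the repulsion-dampening parameter, and all the dynamics are invented for illustration):

```python
import numpy as np

def cluster_spread(gamma, steps=500, lr=0.2):
    """Final spread of a small 1D cluster squeezed by two fixed noise points."""
    pts = np.linspace(-1.0, 1.0, 11)   # the cluster we care about
    noise = np.array([-10.0, 10.0])    # far-away "noise" points, held fixed
    for _ in range(steps):
        force = np.zeros_like(pts)
        for n in noise:
            # Push away from each noise point with a 1/r falloff; because the
            # nearer noise point repels harder, the net effect squeezes the
            # cluster toward its center.
            force += gamma / (pts - n)
        pts = pts + lr * force
    return pts.max() - pts.min()

spread_strong = cluster_spread(gamma=1.0)  # full-strength repulsion
spread_weak = cluster_spread(gamma=0.1)    # dampened repulsion
print(spread_strong, spread_weak)          # weaker repulsion, less packing
```

The cluster starts with spread 2.0; strong repulsion from the surrounding noise packs it down hard, while the dampened run keeps most of its original spread.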
Here is the same data with the repulsion dampened. I feel like this is (hopefully) the solution you were looking for. Clearly some more documentation on the parameters, and what to tweak under different circumstances, is needed. Let me know if this is sufficient in terms of what you were looking for, or if you had a different sort of result in mind.
with more data, i was hoping for more resolution and data points in these smaller clusters. and it happens up to a point, but once there are enough points these clusters turn into these "spiking" structures that shoot out. my ideal embedding would avoid those star-like spikes. but i need to look at the actual data closer and see if those small clusters are getting turned into spikes because they exist on a 1d manifold, or if it's just a "bug" and they really should be represented as a small cluster. going to close this though, since it solves my original issue of everything collapsing to a point. thanks so much for all your help and involvement in developing this tool :) |
For reference, I have reproduced similar issues on another dataset, again at around the same amount of data. That seems a little suspicious to me, so I will continue digging. Sorry that I still don't have any good answers, but it is hard to understand exactly what is happening, let alone what the correct fix is.
I have finally found and fixed the issue that was causing this: it was a (subtle) code bug in the SGD optimization. Moving to a different approach for the SGD optimization phase made this evident and resolved the issue. The latest master branch (v0.2.0+) should give better embeddings, particularly of larger datasets. For the data sample you provided I got the following:
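For context on what the "SGD optimization phase" does: each step pulls an embedded point toward one of its graph neighbors and pushes it away from a few randomly sampled points (negative sampling). A schematic sketch with a toy ring graph and invented update rules, not UMAP's actual gradient expressions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_points = 100
emb = rng.standard_normal((n_points, 2))  # low-dimensional layout being optimized
# Toy nearest-neighbor graph: each point connected to its successor in a ring.
edges = [(i, (i + 1) % n_points) for i in range(n_points)]

lr = 0.1
for _ in range(50):                       # epochs
    for i, j in edges:
        # Attraction: move i a little toward its graph neighbor j.
        emb[i] += lr * (emb[j] - emb[i])
        # Repulsion via negative sampling: push i away from random points.
        for neg in rng.integers(0, n_points, size=2):
            diff = emb[i] - emb[neg]
            dist2 = (diff * diff).sum() + 1e-3  # epsilon avoids divide-by-zero
            emb[i] += lr * diff / dist2
```

Subtle mistakes in exactly this kind of loop (index mixups, sign errors, bad epsilon handling) are easy to make and hard to spot, which is consistent with the bug only becoming evident after restructuring the optimizer.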
Why not create a repository of datasets?
I'm using UMAP to embed a bunch of 128-dimensional face embeddings generated by a neural net.
As I increase the number of embeddings (I have 3M total), the output from UMAP converges to a single point in the center, surrounded by a sparse cloud. How can I fix this? Here are some examples, from fewer samples to more: n = 73728, 114688, 172032, 196608, 245760.