
RecursionError #99

Closed
sametdumankaya opened this issue Aug 2, 2018 · 23 comments · Fixed by lmcinnes/pynndescent#24

Comments

@sametdumankaya

Hi, thanks for the great package.

I have a dataset with 200,000 rows and 15 columns. I tried to apply UMAP as follows:

embedding = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation').fit_transform(data)

After about 10 seconds, I got the following exceptions:

  • RecursionError: maximum recursion depth exceeded while calling a Python object
  • return make_angular_tree(data, indices, rng_state, leaf_size)
    SystemError: CPUDispatcher(<function angular_random_projection_split at 0x000001C8260D6378>) returned a result with an error set

I set the system recursion limit to 10000 as below and tried again, but Python then exited with a code like -143537645, indicating an error exit.

sys.setrecursionlimit(10000)

Is there any solution, workaround or anything I can do for this problem?

Thanks.

@lmcinnes
Owner

lmcinnes commented Aug 2, 2018 via email

@diego-vicente

I just ran into the same issue, and removing duplicate points did indeed solve the problem. I used a simple workaround, but I am not sure whether it is justified or whether it affects the method: can we safely remove the duplicates, embed the unique points, and then reconstruct the original array by re-duplicating the embeddings?

My intuition about the technique is that if N points are exactly the same, we could safely work with only one of them and map the result back to all of them, but I would like confirmation from someone who actually understands the math involved. If that assumption is correct, I can start working on a pull request adding the fix.
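The dedup-then-reduplicate workaround described above can be sketched like this; the helper name is hypothetical, and `embed_fn` stands in for something like `umap.UMAP(...).fit_transform`:

```python
import numpy as np

def embed_with_dedup(data, embed_fn):
    """Embed only the unique rows, then copy each unique row's embedding
    back to every duplicate of that row (hypothetical helper)."""
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    # inverse[i] is the index of data[i]'s representative in unique_rows,
    # so indexing with it re-duplicates the embeddings in the original order.
    return embed_fn(unique_rows)[np.ravel(inverse)]

# Usage with UMAP would look like:
# embedding = embed_with_dedup(data, umap.UMAP(n_neighbors=5).fit_transform)
```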

@lmcinnes
Owner

lmcinnes commented Aug 27, 2018 via email

@madkoppa

madkoppa commented Sep 4, 2018

I'm also having this problem. Adding an uncomfortably large amount of noise as a hack also works for me.
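The noise hack mentioned in this thread amounts to jittering the data so no two rows are exactly identical. A minimal sketch, with a guessed noise scale that may need tuning per dataset (too much noise distorts the embedding):

```python
import numpy as np

def jitter(data, scale=1e-4, seed=0):
    """Add tiny Gaussian noise so exact duplicate rows become distinct.
    The scale is a guess; larger values degrade the embedding quality."""
    rng = np.random.default_rng(seed)
    return data + rng.normal(scale=scale, size=data.shape)
```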

@lmcinnes
Owner

lmcinnes commented Sep 4, 2018 via email

@madkoppa

madkoppa commented Sep 5, 2018

Just an update: it seems that when it complains about recursion depth but does not crash out, the result gets funky, which is apparent for visualisation purposes. I'm trying to track down which vectors are the problem, since the amount of noise I need to add for it to work ruins the end result too.

@lmcinnes
Owner

lmcinnes commented Sep 5, 2018 via email

@yizhouyan

I encountered the same problem with 17.5M data points.

I used Euclidean distance and scaled the data during preprocessing. Here is the code:

X_scaled = preprocessing.scale(X)
fit = umap.UMAP(n_neighbors=20, min_dist=0.01, metric='euclidean', init='random', n_epochs=500)

The same parameter settings work on a 100K-point sample of the data.

Does anyone have any ideas about this?

Thanks!

@lmcinnes
Owner

lmcinnes commented Nov 19, 2018 via email

@dsaxton

dsaxton commented Nov 20, 2018

I was receiving the same error and adding a small amount of noise did help. Thanks!

@scharron

@lmcinnes I attached a 5000x100 numpy array that always makes UMAP crash.
I'm reproducing the error with:
umap = UMAP(metric="cosine", n_components=2, min_dist=0.3).fit_transform(data)

data.npy.zip

@lmcinnes
Owner

Thanks @scharron, reproducing examples are very useful. I'll try to look at this when I get a little more time.

@thommiano

Experiencing the same issue. Removing duplicates solved my problem, and was fine for my visualization purposes.

@lucidyan

Same issue here

@lmcinnes
Owner

I think some fixes to other issues may actually resolve this, so I'll try to roll out a patch release in the next few days that will hopefully solve this.

@mkarbo

mkarbo commented Apr 10, 2019

Did you roll out your patch yet?

@lmcinnes
Owner

lmcinnes commented Apr 10, 2019 via email

@jlmelville
Collaborator

I have confirmed that @scharron's attached data causes a failure on the latest UMAP with metric="cosine". I don't know if it would eventually reach the recursion limit and raise an error, as I always terminate the calculation after about 30 minutes of nothing happening.

The culprit seems to be the @numba.njit(fastmath=True) decorating angular_random_projection_split at:

umap/umap/rp_tree.py

Lines 28 to 29 in 6a32f62

@numba.njit(fastmath=True)
def angular_random_projection_split(data, indices, rng_state):

Set fastmath=False and everything proceeds as normal (except slower). I have confirmed that this also affects the current master of https://github.com/lmcinnes/pynndescent.

I suppose this could also affect the Euclidean and sparse versions of the projection split, just waiting for a dataset to trigger them.

The downside of fastmath=False is a noticeable slowdown of between 65% and 120% for the nearest neighbor search with metric="euclidean" on the datasets I looked at (MNIST-sized and bigger in both rows and dimensionality), although most were at the lower end of that range. For metric="cosine" the slowdown is less severe, more like 20%, although I haven't tested that very much.

According to the numba docs, it's possible to opt in to a subset of the fast math settings, so that might be worth looking into.

@lmcinnes
Owner

Thanks @jlmelville! I will have to look into the details of what subset of fastmath I can opt in to, and what the actual offending aspect is. Presumably the issue is in the generation of the hyperplane (resulting in a plane that does not split the data at all, which then gets generated repeatedly). Perhaps an alternative would be a bail-out failsafe if no points are separated by a given node of the tree, since this does seem very much to be an odd corner case.

@jlmelville
Collaborator

I looked at trying out some subsets of fastmath, but they always caused the infinite loop.

Looking at this in more detail, the problem seems to arise from checking float values for equality with 0. There's a lot of that in rp_tree.py, but this line in particular seemed to cause the issue:

if margin == 0:

If that changes to if abs(margin) < EPS for some suitable EPS (1e-8?), the tree building succeeds, but I think we would want to change all such comparisons.
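A minimal sketch of the proposed comparison fix, with a hypothetical `choose_side` helper and the EPS value suggested above (the right tolerance is a judgment call); ties within the tolerance get a random side, which is what the original `margin == 0` branch intended:

```python
EPS = 1e-8  # tolerance floated in the thread; may need tuning

def choose_side(margin, coin_flip):
    """Treat a near-zero margin as a tie and assign the point to a random
    side, instead of testing margin == 0 exactly (hypothetical helper).
    coin_flip is a callable returning a float in [0, 1)."""
    if abs(margin) < EPS:
        return 0 if coin_flip() < 0.5 else 1  # tie: pick a side at random
    return 0 if margin > 0 else 1
```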

Alternatively, is it the case that make_angular_tree should never create nodes with zero membership, i.e. in:

umap/umap/rp_tree.py

Lines 465 to 466 in 6a32f62

left_node = make_angular_tree(data, left_indices, rng_state, leaf_size)
right_node = make_angular_tree(data, right_indices, rng_state, leaf_size)

left_node.shape[0] and right_node.shape[0] should always be > 0? The infinite loop seen with @scharron's data is due to one of those nodes having size 0.

You can prevent the loop by bailing out and raising a RecursionError, but that's obviously a drastic step: I think it causes the entire RP forest routine to abort, so you start the nearest neighbor descent with a random initialization? Would it be ok to just arbitrarily split the remaining points into two equally sized nodes at that point? This would allow the node sizes to eventually reach leaf_size and the algorithm to proceed even when the data set contains more than leaf_size exact duplicates, although I am not yet familiar enough with the implementation details to know whether that causes problems down the line. I think it would only affect the neighbor list for those points as it gets passed into the nearest neighbor descent routine. If so, that would definitely be a better initialization strategy than the RecursionError path.
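The arbitrary-split fallback suggested above could look something like this sketch (the function name and signature are hypothetical, not the actual rp_tree.py code):

```python
import numpy as np

def forced_split(indices, rng):
    """Fallback for a degenerate split: when the hyperplane fails to
    separate any points, partition the node's indices into two near-equal
    halves at random so the recursion can still bottom out at leaf_size."""
    shuffled = np.array(indices).copy()
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```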

I was hoping to try out some of these solutions and submit a PR through pynndescent first, but I am getting failing unit tests on that project at the moment. @lmcinnes, I am happy to work on a PR here if you have a sense of which of these strategies (or both) to pursue.

@lmcinnes
Owner

Yes, the tree should never create nodes of size zero, so that is certainly part of the problem. I guess the question is where the best place to catch this is. Your observation that the margin == 0 line is a source of the issue points at potentially the right place. Here is my suspicion as to what is happening:

The data contains a number of points that, while distinct and non-zero, are all scalar multiples of each other. Thus they all lie on a line through the origin, and an angular hyperplane cannot possibly separate them. The margin == 0 case was supposed to catch this: if points were identically aligned with the hyperplane (as would be the case here, since the generated hyperplane would be identically zero if the node contained only such points), they would produce a zero margin and we could randomly assign them one way or the other. I suspect fastmath is producing enough rounding error that, despite the dot product being taken with the all-zero vector, all the margins land slightly on one side or the other of zero, and the end result is infinite recursion.
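The degenerate-hyperplane scenario described above can be illustrated directly. This is a simplified sketch of how an angular split builds its hyperplane (roughly, the difference of two normalized points), not the actual rp_tree.py code:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 4.0 * a  # distinct, non-zero, but the same direction from the origin

# Normalizing removes the scale difference, so the hyperplane vector
# comes out exactly zero for points that are scalar multiples of each other.
hyperplane = a / np.linalg.norm(a) - b / np.linalg.norm(b)

# Every margin is then a dot product with this all-zero vector, i.e. zero,
# which the margin == 0 branch was meant to catch.
margin = np.dot(np.array([5.0, -1.0, 2.0]), hyperplane)
```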

Now, there are a few ways to fix this: we can fix the margin comparison, as you have done, to handle float issues better (and make the similar required changes to the other splitting routines); we can go after the root of the problem (a bad hyperplane, all zeros in every case I believe) and fall back to a random split when it occurs; or we can take the other option you suggest and catch splits that leave zero points on one side, inserting a forced random partition into two nodes there.

The first way should work well enough and seems mostly clean, aside from the catch that we'll very occasionally mis-assign points that are very close to the hyperplane margin. The second approach gets to the heart of the actual issue, but still requires detecting an all-zero hyperplane (up to float tolerances). The third option is probably the most robust, in that if there are other reasons the split can go astray (and there may be) it will still catch them; it is a more awkward fix to implement, however.

To be honest I'm happiest with option 1 (fix the margin comparison). It will get the job done with the least amount of fuss. I'll look into the issues with pynndescent; the tests are all quite new, so it could well be errors in the testing more than anything.

@radames

radames commented Apr 28, 2019

The latest fixes merged on master seem to fix the problem in my tests on ~100k points (n_neighbors=20, metric='cosine', min_dist=0.4, init='random', verbose=2)!

@lmcinnes
Owner

lmcinnes commented Apr 28, 2019 via email
