
A challenging dataset for pynndescent #18

Open · jlmelville opened this issue Apr 13, 2019 · 10 comments
@jlmelville (Collaborator)
I have been running uwot and UMAP over the same datasets to reassure myself that you get equivalent results. That's true for most of the datasets I've looked at, with the exception of a transcriptomics dataset that's part of the openTSNE documentation, macosko2015. The differences seem to be due to nearest neighbor results.

The needed data can be downloaded via:
https://github.com/hemberg-lab/scRNA.seq.datasets/blob/master/bash/macosko.sh

and then you can follow the code given at:
https://github.com/pavlin-policar/openTSNE/blob/master/examples/prepare_macosko_2015.ipynb

although you also need a few functions from:
https://github.com/pavlin-policar/openTSNE/blob/master/examples/utils.py

I stopped processing after the log transform and the selection of the 3,000 most active genes, leaving a dataset of shape (44808, 3000); i.e. I didn't do any Z-scaling or PCA.

This dataset seems remarkably challenging for pynndescent. For all the other datasets I've looked at (including a similar RNA-seq dataset), pynndescent gives an accuracy of at least 90% with metric="euclidean", based on comparison against the exact nearest neighbors calculated with either FNN (in R) or sklearn.neighbors.NearestNeighbors. For macosko2015, the accuracy is only around 40%.
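For concreteness, "accuracy" here means the mean per-point overlap between the approximate and exact 15-neighbor lists. A minimal sketch of the check, assuming pynndescent's NNDescent class and its neighbor_graph attribute (adjust if the API differs in your version):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from pynndescent import NNDescent

def nn_accuracy(X, k=15):
    # Exact k-nearest neighbors (each point is its own first neighbor).
    exact_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(
        X, return_distance=False)
    # Approximate neighbors from pynndescent; neighbor_graph holds
    # (indices, distances) for the training data.
    approx_idx, _ = NNDescent(X, n_neighbors=k, metric="euclidean").neighbor_graph
    # Mean fraction of exact neighbors recovered per point.
    hits = sum(len(np.intersect1d(a, e)) for a, e in zip(approx_idx, exact_idx))
    return hits / (k * X.shape[0])
```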

The underlying issue may be the RP-tree phase: it yields an accuracy of only 6%, so to end up as high as 40% is a testament to the utility of nearest neighbor descent. Again, this is anomalous compared to other datasets: the next lowest accuracy I saw with the RP-tree-only approach is 30%, and typically it's in the 60-80% range.

Unfortunately, I've not been able to improve matters by increasing n_trees. If I go up to n_trees=250, then in combination with the typical nearest neighbor descent phase, I get an accuracy of about 55%. Much beyond that, I start getting out-of-memory errors. On the nearest neighbor descent side, it usually converges after no more than 5 iterations anyway, so to make much progress I'd have to start looking at modifying rho and max_candidates, which I haven't got round to.

It's not just pynndescent that has trouble: Annoy does even worse. Also, carrying out PCA to reduce the dimensionality, as done automatically by lots of t-SNE packages (and a highly recommended option in uwot to stop Annoy running slowly), is truly ruinous (11% accuracy!). For most other datasets, surprisingly low accuracies in the nearest neighbor results (say, in the region of 70-80%) don't lead to big differences in the visualization, but this is one dataset where you do see a difference. And this also affects t-SNE results, even with the larger number of nearest neighbors that are calculated with its default perplexity of 50.
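To be clear about what's being measured there: the neighbors found after PCA reduction are scored against the exact neighbors in the original space. A rough sketch using exact search in the reduced space, which isolates the effect of the projection itself:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def pca_nn_accuracy(X, n_components=100, k=15):
    # Exact neighbors in the original space are the ground truth.
    exact_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(
        X, return_distance=False)
    # Exact neighbors recomputed after PCA reduction.
    X_red = PCA(n_components=n_components).fit_transform(X)
    red_idx = NearestNeighbors(n_neighbors=k).fit(X_red).kneighbors(
        X_red, return_distance=False)
    hits = sum(len(np.intersect1d(r, e)) for r, e in zip(red_idx, exact_idx))
    return hits / (k * X.shape[0])
```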

I do understand the argument that PCA can act to denoise the data, and the PCA-based results actually look nicer to my eye than the results with exact nearest neighbors, but in this case even 100 components only extract ~40% of the variance in the data. Maybe these RNA-seq datasets really do have a lot of noise distributed in a way that PCA is good at removing. Nonetheless, this one dataset is obviously anomalous compared to the others, so I think that explanation would be awfully convenient for uwot!

I realize that this has got very long. I am going to add a page to uwot's documentation on this, but I am raising the issue here in case it's of general interest as a test case, and in case there is an obvious fix via the settings of pynndescent (I don't think any of the parameters are exposed in UMAP anyway?).

Otherwise, I would simply be very interested to know whether these results can be replicated by anyone else. I have repeated them several times, but as usual, the risk of me having messed up somewhere is very high. I'm happy to provide any chunks of R or Python I've used for this if that would be useful.

@lmcinnes (Owner)
Thanks for this. Challenging datasets are always interesting, and I will have to explore exactly what makes nearest neighbors so hard with this data. It does seem remarkable that it should perform quite so poorly, and the problem is clearly not restricted to RP-trees and NN-Descent, since Annoy seems to have similar issues.

@atarashansky commented Oct 28, 2019
Typically with scRNA-seq datasets, there can be many points that are similarly close to a particular cell. For example, if you have a homogeneous population of 100 cells and you're choosing 15 nearest neighbors, then there could be a great deal of interchangeability.

If you plot the distribution of exact distances, do you see a bimodal distribution with a small number of neighbors with relatively small distances? Or do you see something less-than-bimodal, with a large number of neighbors with relatively small distances?

I expect the latter to be true. In my experience, having low accuracy compared to exact nearest neighbors is not really a deal breaker since pynndescent will still pick neighbors that are relatively close. (To begin with, there's nothing inherently biological about picking the exact 15 cells with the largest correlation values if there are many other cells with similar-ish similarities).

Here's my recommendation: Try doing exact nearest neighbors to get a graph and do Louvain clustering to assign clusters. Do the same with pynndescent. Compare the cluster assignments using something like sklearn.metrics.adjusted_rand_score. What's the clustering similarity like? Even if the neighbors aren't the same, you might find that the cluster assignments are extremely similar.
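A rough sketch of that comparison, assuming the python-louvain package (imported as community) for the clustering and networkx for the graph construction:

```python
import networkx as nx
import community  # the python-louvain package
from sklearn.metrics import adjusted_rand_score

def knn_graph(indices):
    # Build an undirected graph from an (n_samples, k) neighbor index array.
    g = nx.Graph()
    for i, nbrs in enumerate(indices):
        g.add_edges_from((i, int(j)) for j in nbrs if int(j) != i)
    return g

def clustering_similarity(exact_idx, approx_idx):
    # Louvain-cluster both k-NN graphs and compare the label assignments.
    exact_labels = community.best_partition(knn_graph(exact_idx))
    approx_labels = community.best_partition(knn_graph(approx_idx))
    nodes = sorted(exact_labels)
    return adjusted_rand_score([exact_labels[i] for i in nodes],
                               [approx_labels[i] for i in nodes])
```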

@jlmelville (Collaborator, Author)
I appreciate the comments and suggestions @atarashansky. I don't currently have the bandwidth to do any clustering on the dataset, but it's a good idea. I've not looked at the distribution of all the neighbor distances, but I have for the exact 15-nearest-neighbor and 150-nearest-neighbor distances. Here's a histogram of the 150-nearest-neighbor distances:

[Figure: histogram of the exact 150-nearest-neighbor distances for macosko2015]

In general, I agree that the approximate nearest neighbors should do just as good a job as the exact nearest neighbors. But the UMAP plots do come out different (see some of the plots at https://jlmelville.github.io/uwot/pycompare#macosko2015 and then further down at the "What about t-SNE?" section).

@atarashansky
How does pynndescent compare to the exact distances if you use correlation distance as opposed to (what seems to be) euclidean distance? Similar error rate?

@jlmelville (Collaborator, Author)
Unfortunately, I did the majority of these calculations and data-processing in R, where I don't have access to nearest neighbor routines with correlation distance, so I'll have to get back to you at an unspecified time in the future on that one.

@jlmelville (Collaborator, Author)
NN-Descent on High-Dimensional Data (further expanded in The Influence of Hubness on NN-Descent) makes the case that NN-Descent does badly on datasets with lots of "hubs": nodes that show up in the reverse neighbor lists of many other nodes. These consistently appear in the candidate lists of other nodes, and non-hub nodes are less likely to get locally joined to decent candidates.

I calculated the "hubness" (the ratio of the reverse neighbor list size to the forward neighbor list size) on this dataset for the exact k = 15 neighbors, and it definitely suffers from hubness: the maximum hubness is over 700 (i.e. one point shows up in the 15-nearest-neighbor lists of over 10,000 other nodes, nearly a quarter of the dataset). The second-most hubby (hubful?) dataset I've seen is CIFAR-10, which has a maximum hubness of a mere 140. Datasets like MNIST, FMNIST and KMNIST (which NND does well on) have a maximum hubness in the 5-20 range.
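For reference, this hubness measure is just the in-degree of each node in the directed k-NN graph divided by k; a minimal sketch:

```python
import numpy as np

def hubness(knn_indices):
    # knn_indices: (n_samples, k) array of nearest-neighbor indices.
    n, k = knn_indices.shape
    # In-degree: how many k-neighbor lists each node appears in.
    in_degree = np.bincount(knn_indices.ravel(), minlength=n)
    # Ratio of reverse neighbor list size to forward list size (k).
    return in_degree / k
```

The maximum hubness figures quoted above are then just hubness(knn_indices).max() for each dataset.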

I consider this case closed except for actually fixing the problem. The linked papers above make some suggestions.

@atarashansky
Good find! Incorporating in-degree scores into the NN-descent algorithm does not seem like it should be too challenging. I'd be interested in submitting a PR tackling this issue. I'll keep you posted!

@lmcinnes (Owner)
Thanks for the references. I recall we had some concerns about hubness, but never looked into it much. I will have to look through the referenced paper to see if there are practicable solutions to the problem.

Also of note for making NN-descent fail: lots of ties in distances. If there are many points equally close to a given point, it can be hard to "descend" down the apparently flat slope of neighbor improvements. We have encountered a few datasets where this is the case, often with ties at 0 due to significant numbers of duplicates, or ties at 1.0 for a bounded distance like cosine or jaccard. Either case certainly negatively impacts performance.

@jlmelville (Collaborator, Author)
From the linked papers, the two strategies that stood out are:

  • increase the number of neighbors internally and scale down the sample rate to try to equalize the total computation time.
  • calculate hubness of the current graph at each iteration and use that to replace hub nodes with random selections.

I have no reason to believe these don't work, but they are fairly brute-force approaches. The suggested implementation for estimating hubness (tracking which nodes are added to and deleted from the graph) seems a bit fiddly to add to pynndescent (tracking deletions in particular seems like it would complicate things).

I did some preliminary calculations on various datasets and you can spot hub nodes fairly well after the first iteration as long as max_candidates is sufficiently high (20 worked; 5 seems too low). I haven’t got any further than that yet though.
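To illustrate the kind of spotting I mean (the cutoff factor here is hypothetical, not something I've tuned): flag nodes whose in-degree in the working graph after one iteration is far above k.

```python
import numpy as np

def flag_hubs(current_indices, factor=10.0):
    # current_indices: (n_samples, k) neighbor lists from the graph after
    # one NN-descent iteration. A node is flagged as a hub if its in-degree
    # exceeds factor * k; `factor` is a hypothetical, untuned cutoff.
    n, k = current_indices.shape
    in_degree = np.bincount(current_indices.ravel(), minlength=n)
    return np.flatnonzero(in_degree > factor * k)
```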

@lmcinnes (Owner)
That sounds good -- if we can spot hub nodes (and max_candidates should be higher than 5 almost always) then some mitigation might be easily tractable. I look forward to anything further you find on this.
