-
Notifications
You must be signed in to change notification settings - Fork 14
Pynndescent knn #830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pynndescent knn #830
Conversation
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
* stdout/err stored on fail Co-authored-by: Dries Schaumont <DriesSchaumont@users.noreply.github.com> * changelog entry added + minor fixes --------- Co-authored-by: Dries Schaumont <DriesSchaumont@users.noreply.github.com>
* cellranger mkgtf component working and tested * removed comments * changelog entry added * test unique attribute in result * multiple attribute par added * removed unused packages * use pytest, multiple attributes tested --------- Co-authored-by: DriesSchaumont <5946712+DriesSchaumont@users.noreply.github.com>
* CI - Build: Fix second occurance of namespace separator * Update 10x and Illumina related components. * Add tests for fixed RNA * Remove -k option for pytest * Undo accidental removal of resource * Trigger ci * Increase memory a bit * Increase timeout a little bit * Update CHANGELOG
* add no-lane-splitting param * add no-lane-splitting param * add more unit tests * update changelog * remove unused file * fix typo changelog * undo blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dorien-er ! The code looks good to me, but there is a substantial overlap with the labels_transfer/knn
component. These are core differences:
- The latter uses a reference dataset in
.h5ad
format, assuming that it comes from a publication with scRNA-seq data. You use.h5mu
, which is less common in papers but is extensively used in open pipelines. - In a new component, neighbors are weighted uniformly by default, with an option to make it dependent on the distance. In the existing component, weight decreases exponentially with a distance normalized among nearest neighbors.
- Here, probabilities of labels are stored, while the existing component uses uncertainties. As one can be obtained from another by subtracting from 1, this is not a big difference, but I would suggest a better consistency.
In my opinion, this approach is valuable, and we should include it in the pipeline. However, I would suggest extending the labels_transfer/knn
component instead of creating a new one. Lmk if I can help :)
Thanks a lot for the feedback! I agree there is a lot of overlap between the components - I currently implemented it as a separate component rather than updating the existing one, to avoid disrupting existing workflows because there are some breaking changes. Ideally, we can deprecate one of the two components (now or at a later time point) and combine the functionality of both into a single component, so we avoid having all the duplicate code. @rcannood and @VladimirShitov wdyt? Worth already extending the existing KNN component now instead of implementing parallel ones? We'd also need to update the |
I merged @dorien-er 's implementation with the previous one. The older Would appreciate a review @DriesSchaumont, @rcannood |
I opened a new PR making XGBoost compatible with the new annotation workflow: #858 . It also deletes the outdated We can first merge it to this branch or do it separately |
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
def distances_to_affinities(distances): | ||
stds = np.std(distances, axis=1) | ||
stds = (2.0 / stds) ** 2 | ||
stds = stds.reshape(-1, 1) | ||
distances_tilda = np.exp(-np.true_divide(distances, stds)) | ||
|
||
return distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VladimirShitov I'm noticing that with the current test data, this is returning all NaN, resulting in all NaN probabilities, because of zero devision (also the case in the previous KNN component). It might be due to rounding/floating point precision. Do you have time to look into this? If not, can you provide me with a reference on which this functionality is based?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I didn't notice that! The function should work, I successfully used it in several projects as well as other people. It was suggested in the scArches paper (see the "Cell type annotation" chapter in the methods), and is common approach in atlassing. Human Lung Cell Atlas, for example, leverages the same approach. Maybe something is wrong with the test data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks like it's happening in the return statement distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True)
: In this case, there are many rows where the sum of the distances_tilda is zero (the standard deviations are very low, causing np.exp(-np.true_divide(distances, stds))
to approach zero).
I've made an update to the normalization: if the sum of a row of the distance tilda equals 0, set the normalized values of that row all to one.
LMKWYT
Suggestion: the KNNClassifier also accepts a callable for the weights kwarg, so I just pushed a change where the gaussian weights calculation is directly passed to the classifier. It simplifies the code quite a bit, but also ensures that the sum of probabilities across detected neighbours equals 1. @VladimirShitov and @DriesSchaumont lmkwyt! |
Thank you, @dorien-er ! Looks much clearer. Are the results identical to the previous? |
Yes! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (if the tests succeed)!
Changelog
Added a component
src/labels_transfer/pynndescent_knn
. This component has a lot of overlap withsrc/labels_transfer/knn
, but some changes were made such that the component is compatible with the cell type annotation workflow that is in progress. For now it is implemented as a separate component for backwards compatibility, but eventually we can combine the two components and deprecate one.Major changes include:
Issue ticket number and link
Closes #xxxx (Replace xxxx with the GitHub issue number)
Checklist before requesting a review
I have performed a self-review of my code
Conforms to the Contributor's guide
Check the correct box. Does this PR contain:
Proposed changes are described in the CHANGELOG.md
CI tests succeed!