Pynndescent knn #830

dorien-er · 2024-07-08T15:38:05Z

Changelog

Added a component src/labels_transfer/pynndescent_knn. This component has a lot of overlap with src/labels_transfer/knn, but some changes were made such that the component is compatible with the cell type annotation workflow that is in progress. For now it is implemented as a separate component for backwards compatibility, but eventually we can combine the two components and deprecate one.

Major changes include:

* Accept a reference h5mu file as input, so that reference files created by OP can be passed directly to the component.
* Save probabilities rather than uncertainties in the output h5mu; these can be used to calculate the majority vote.
* Use the scikit-learn KNN classifier instead of manual calculations, which allows multiple distance functions.
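
As an illustration of the last point, here is a minimal sketch of scikit-learn's `KNeighborsClassifier` returning per-class probabilities under different distance functions. The toy embedding, the labels and the choice of metrics are assumptions made for this example, and the component's PyNNDescent-based neighbor search is not shown.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy reference embedding with known labels (illustrative values only).
reference_embedding = np.array(
    [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 1.0], [0.1, 0.9]]
)
reference_labels = np.array(["T cell"] * 3 + ["B cell"] * 3)

# Two unlabeled query cells in the same embedding space.
query_embedding = np.array([[1.0, 0.15], [0.15, 1.0]])

results = {}
for metric in ("euclidean", "manhattan", "cosine"):
    classifier = KNeighborsClassifier(n_neighbors=3, metric=metric)
    classifier.fit(reference_embedding, reference_labels)
    # predict_proba returns one column per class, ordered as classifier.classes_.
    probabilities = classifier.predict_proba(query_embedding)
    predictions = classifier.classes_[probabilities.argmax(axis=1)]
    results[metric] = (predictions, probabilities)

for metric, (predictions, probabilities) in results.items():
    print(metric, predictions, probabilities.max(axis=1))
```

Each row of `probabilities` sums to 1, so the predicted label is simply the column with the largest value.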

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* stdout/err stored on fail Co-authored-by: Dries Schaumont <DriesSchaumont@users.noreply.github.com> * changelog entry added + minor fixes --------- Co-authored-by: Dries Schaumont <DriesSchaumont@users.noreply.github.com>

* cellranger mkgtf component working and tested * removed comments * changelog entry added * test unique attribute in result * multiple attribute par added * removed unused packages * use pytest, multiple attributes tested --------- Co-authored-by: DriesSchaumont <5946712+DriesSchaumont@users.noreply.github.com>

* CI - Build: Fix second occurance of namespace separator * Update 10x and Illumina related components. * Add tests for fixed RNA * Remove -k option for pytest * Undo accidental removal of resource * Trigger ci * Increase memory a bit * Increase timeout a little bit * Update CHANGELOG

VladimirShitov

Thank you @dorien-er! The code looks good to me, but there is a substantial overlap with the labels_transfer/knn component. These are the core differences:

* The existing component takes a reference dataset in .h5ad format, on the assumption that it comes from a publication with scRNA-seq data. This one uses .h5mu, which is less common in papers but is used extensively in open pipelines.
* In the new component, neighbors are weighted uniformly by default, with an option to make the weights depend on distance. In the existing component, the weight decreases exponentially with distance, normalized among the nearest neighbors.
* Here, label probabilities are stored, while the existing component stores uncertainties. Since one can be obtained from the other by subtracting it from 1, this is not a big difference, but I would suggest keeping the two consistent.
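
A small sketch of the last two differences, under stated assumptions: the helper `vote_probabilities`, the toy labels and distances, and the `exp(-d)` kernel are all hypothetical and are not the formulas used by either component.

```python
import numpy as np


def vote_probabilities(neighbor_labels, weights, classes):
    """Return the weighted share of votes for each class (sums to 1)."""
    neighbor_labels = np.asarray(neighbor_labels)
    weights = np.asarray(weights, dtype=float)
    votes = np.array([weights[neighbor_labels == c].sum() for c in classes])
    return votes / votes.sum()


classes = ["B cell", "T cell"]
neighbor_labels = ["T cell", "T cell", "B cell"]
neighbor_distances = np.array([0.1, 0.2, 0.9])

# Uniform weighting: every neighbor counts equally.
uniform_probs = vote_probabilities(neighbor_labels, np.ones(3), classes)

# Distance-dependent weighting; exp(-d) is a hypothetical kernel chosen for
# illustration, not the formula used by either component.
decay_probs = vote_probabilities(
    neighbor_labels, np.exp(-neighbor_distances), classes
)

# Probability and uncertainty of the predicted label are complements.
predicted_prob = decay_probs.max()
uncertainty = 1.0 - predicted_prob
```

With uniform weights the two close "T cell" neighbors get two thirds of the vote; the distance-dependent kernel raises that share because the single "B cell" neighbor is much farther away.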

In my opinion, this approach is valuable, and we should include it in the pipeline. However, I would suggest extending the labels_transfer/knn component instead of creating a new one. Lmk if I can help :)

src/labels_transfer/api/common_arguments_2.yaml

src/labels_transfer/pynndescent_knn/script.py

dorien-er · 2024-07-16T13:10:57Z

Thanks a lot for the feedback!

I agree there is a lot of overlap between the components - I currently implemented it as a separate component rather than updating the existing one, to avoid disrupting existing workflows because there are some breaking changes. Ideally, we can deprecate one of the two components (now or at a later time point) and combine the functionality of both into a single component, so we avoid having all the duplicate code.

@rcannood and @VladimirShitov wdyt? Worth already extending the existing KNN component now instead of implementing parallel ones? We'd also need to update the src/labels_transfer/xgboost component in that case, because it relies on the same common arguments

src/labels_transfer/pynndescent_knn/script.py

VladimirShitov · 2024-08-06T16:38:40Z

I merged @dorien-er 's implementation with the previous one. The older knn component can now be deleted. Would be great to standardize the xgboost component with the new annotation format as well. We can then clean up the code a bit more.

Would appreciate a review @DriesSchaumont, @rcannood

VladimirShitov · 2024-08-14T10:12:32Z

I opened a new PR making XGBoost compatible with the new annotation workflow: #858. It also deletes the outdated knn component and fixes the tests here.

We can first merge it into this branch or handle it separately.

.gitignore

src/labels_transfer/pynndescent_knn/script.py

dorien-er · 2024-08-28T15:47:00Z

src/labels_transfer/pynndescent_knn/script.py

+def distances_to_affinities(distances):
+    stds = np.std(distances, axis=1)
+    stds = (2.0 / stds) ** 2
+    stds = stds.reshape(-1, 1)
+    distances_tilda = np.exp(-np.true_divide(distances, stds))
+
+    return distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True)


@VladimirShitov I'm noticing that with the current test data, this returns all NaN, which results in all-NaN probabilities, because of a division by zero (this is also the case in the previous KNN component). It might be due to rounding or floating-point precision. Do you have time to look into this? If not, can you point me to a reference on which this functionality is based?

VladimirShitov · 2024-08-29

Interesting, I didn't notice that! The function should work: I have used it successfully in several projects, as have other people. It was suggested in the scArches paper (see the "Cell type annotation" chapter in the methods) and is a common approach in atlasing. The Human Lung Cell Atlas, for example, uses the same approach. Maybe something is wrong with the test data?

dorien-er · 2024-08-30

So it looks like it happens in the return statement, distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True). In this case there are many rows where the sum of distances_tilda is zero (the stds values are very low, causing np.exp(-np.true_divide(distances, stds)) to approach zero).
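
A minimal reproduction sketch, assuming made-up distances rather than the actual test data. In this example the affected row is the one with large, widely spread distances: its rescaled `stds` term `(2.0 / std) ** 2` becomes very small, every exponential underflows to exactly 0.0, and the normalization then divides 0 by 0.

```python
import numpy as np


def distances_to_affinities(distances):
    # Same steps as the function quoted in this thread.
    stds = np.std(distances, axis=1)
    stds = (2.0 / stds) ** 2
    stds = stds.reshape(-1, 1)
    distances_tilda = np.exp(-np.true_divide(distances, stds))
    return distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True)


# Hypothetical distances: row 0 is large and widely spread, row 1 is small.
distances = np.array([[100.0, 200.0, 300.0], [0.1, 0.2, 0.3]])

# Silence the expected "invalid value" warning from the 0 / 0 division.
with np.errstate(invalid="ignore"):
    affinities = distances_to_affinities(distances)

print(affinities)
```

Row 0 comes out as all NaN, while row 1 is finite and sums to 1.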

I've updated the normalization: if the sum of a row of distances_tilda equals 0, all normalized values in that row are set to one.

LMKWYT
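
One way such a guarded normalization could look, as a sketch only: the helper name `normalize_rows` and the toy input are assumptions, and the code pushed to the branch may differ.

```python
import numpy as np


def normalize_rows(distances_tilda):
    """Divide each row by its sum; rows that sum to 0 are set to all ones."""
    row_sums = distances_tilda.sum(axis=1, keepdims=True)
    normalized = np.ones_like(distances_tilda, dtype=float)
    # Only divide where the row sum is non-zero; other rows keep the value 1.
    np.divide(distances_tilda, row_sums, out=normalized, where=row_sums != 0)
    return normalized


weights = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 2.0]])
normalized = normalize_rows(weights)
print(normalized)
```

A row of ones gives every neighbor of that cell the same weight, so no NaN is produced for it.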

dorien-er · 2024-08-28T15:49:16Z

Suggestion: scikit-learn's KNeighborsClassifier also accepts a callable for the weights kwarg, so I just pushed a change where the Gaussian weights calculation is passed directly to the classifier. This simplifies the code quite a bit and also ensures that the probabilities across the detected neighbours sum to 1. @VladimirShitov and @DriesSchaumont lmkwyt!
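
A minimal sketch of passing a callable as `weights`, under stated assumptions: `gaussian_weights` is a hypothetical Gaussian-style kernel and not the component's exact formula, and the embedding and labels are toy values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def gaussian_weights(distances):
    """Hypothetical Gaussian-style kernel, not the component's exact formula.

    scikit-learn passes an array of shape (n_queries, n_neighbors) and expects
    an array of weights with the same shape.
    """
    scale = distances.mean(axis=1, keepdims=True)
    scale = np.maximum(scale, 1e-12)  # guard against all-zero distances
    return np.exp(-((distances / scale) ** 2))


reference_embedding = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
reference_labels = np.array(["A", "A", "A", "B", "B", "B"])
query_embedding = np.array([[0.2, 0.2], [5.5, 5.5]])

classifier = KNeighborsClassifier(n_neighbors=4, weights=gaussian_weights)
classifier.fit(reference_embedding, reference_labels)

probabilities = classifier.predict_proba(query_embedding)
predictions = classifier.predict(query_embedding)
print(predictions, probabilities)
```

Because the classifier normalizes the weighted votes itself, each row of `probabilities` sums to 1 without a separate normalization step.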

VladimirShitov · 2024-08-29T11:39:29Z

Thank you, @dorien-er ! Looks much clearer. Are the results identical to the previous?

dorien-er · 2024-08-30T07:18:51Z

> Thank you, @dorien-er ! Looks much clearer. Are the results identical to the previous?

Yes!

DriesSchaumont

LGTM (if the tests succeed)!

VladimirShitov and others added 17 commits March 25, 2024 11:35

Add ATAC demux (#726)

9ccc4a3

Remove muon as test dependency for concatenate_h5mu. (#773)

41b60be

scGPT binning component (#765)

7ec3ba4

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into develop

39dedf2

Add no-lane-splitting parameter to BCL Convert (#804)

3eebd2b

* add no-lane-splitting param * add no-lane-splitting param * add more unit tests * update changelog * remove unused file * fix typo changelog * undo blank line

initial script

3dfedcc

update params

1579728

add unit tests pynndescent knn

826fe2c

merge main

42ddce4

undo unrequired changes

737a83d

undo unrequired changes

60d00ec

update changelog

21e3eea

update for github runners

55637cf

add common params and utilities

a97ea36

dorien-er marked this pull request as ready for review July 9, 2024 08:50

dorien-er requested review from rcannood and VladimirShitov July 9, 2024 08:51

VladimirShitov requested changes Jul 15, 2024

VladimirShitov reviewed Aug 6, 2024

VladimirShitov and others added 7 commits August 6, 2024 18:11

Remove unnecessary file

03aee13

Combine with old code, calculate neighbors only once

dd29f35

Name weight option "gaussian"

920ccfc

Optimize prediction when all neighbors have the same class

027f935

Improve logging

d0d4c2a

Test different weights

8fe28e1

Merge branch 'main' into pynndescent-knn

68420c9

VladimirShitov and others added 4 commits August 13, 2024 10:17

Update numba to work with python 3.12

5430d6f

Use the same number of neighbors for classifier as in index

004efcf

Remove unicode characters

551232f

Merge remote-tracking branch 'origin/main' into pynndescent-knn

7400134

dorien-er requested a review from DriesSchaumont August 16, 2024 09:58

VladimirShitov and others added 2 commits August 28, 2024 09:58

Make XGBoost component compatible with new annotation workflow (#858)

c6b796b

Merge branch 'main' into pynndescent-knn

531634d

DriesSchaumont requested changes Aug 28, 2024

VladimirShitov and others added 3 commits August 28, 2024 12:16

Remove bin/

a12fbb2

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

address pr comments, implement gaussian callable for sklearn

8c3f7e2

fix unit test, clean up

0655729

dorien-er commented Aug 28, 2024

dorien-er added 2 commits August 28, 2024 18:00

fix unit test

1c3c0fd

update anndata version

42b4c4b

dorien-er added 2 commits August 30, 2024 08:18

adjust normalization distances to avoid zero devision

9f44b39

fit pynndescent transformer only once on query data

830662c

dorien-er added 2 commits August 30, 2024 09:26

Merge remote-tracking branch 'origin/main' into pynndescent-knn

68b810e

updat to viash 9

bfc1588

DriesSchaumont approved these changes Aug 30, 2024

DriesSchaumont merged commit 7a90f3a into main Aug 30, 2024
4 checks passed

DriesSchaumont deleted the pynndescent-knn branch August 30, 2024 08:41

DriesSchaumont mentioned this pull request Oct 17, 2024

update knn component to accept pre-calculated distances #890

Merged

10 tasks
