Skip to content

Conversation

dorien-er
Copy link
Contributor

@dorien-er dorien-er commented Jul 8, 2024

Changelog

Added a component src/labels_transfer/pynndescent_knn. This component has a lot of overlap with src/labels_transfer/knn, but some changes were made such that the component is compatible with the cell type annotation workflow that is in progress. For now it is implemented as a separate component for backwards compatibility, but eventually we can combine the two components and deprecate one.

Major changes include:

  • Accept a reference h5mu file as input, such that reference files created by OP can be provided directly to the component
  • Save probabilities rather than uncertainties in the output h5mu, which can be used to calculate the majority vote
  • Implement the scikit-learn knn classifier (rather than manual calculations), allowing for multiple distance functions

Issue ticket number and link

Closes #xxxx (Replace xxxx with the GitHub issue number)

Checklist before requesting a review

  • I have performed a self-review of my code

  • Conforms to the Contributor's guide

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Documentation
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI tests succeed!

VladimirShitov and others added 17 commits March 25, 2024 11:35
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
* stdout/err stored on fail

Co-authored-by: Dries Schaumont <DriesSchaumont@users.noreply.github.com>

* changelog entry added + minor fixes

---------

Co-authored-by: Dries Schaumont <DriesSchaumont@users.noreply.github.com>
* cellranger mkgtf component working and tested

* removed comments

* changelog entry added

* test unique attribute in result

* multiple attribute par added

* removed unused packages

* use pytest, multiple attributes tested

---------

Co-authored-by: DriesSchaumont <5946712+DriesSchaumont@users.noreply.github.com>
* CI - Build: Fix second occurance of namespace separator

* Update 10x and Illumina related components.

* Add tests for fixed RNA

* Remove -k option for pytest

* Undo accidental removal of resource

* Trigger ci

* Increase memory a bit

* Increase timeout a little bit

* Update CHANGELOG
* add no-lane-splitting param

* add no-lane-splitting param

* add more unit tests

* update changelog

* remove unused file

* fix typo changelog

* undo blank line
@dorien-er dorien-er marked this pull request as ready for review July 9, 2024 08:50
Copy link
Collaborator

@VladimirShitov VladimirShitov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dorien-er ! The code looks good to me, but there is a substantial overlap with the labels_transfer/knn component. These are core differences:

  1. The latter uses a reference dataset in .h5ad format, assuming that it comes from a publication with scRNA-seq data. You use .h5mu, which is less common in papers but is extensively used in open pipelines.
  2. In a new component, neighbors are weighted uniformly by default, with an option to make it dependent on the distance. In the existing component, weight decreases exponentially with a distance normalized among nearest neighbors.
  3. Here, probabilities of labels are stored, while the existing component uses uncertainties. As one can be obtained from another by subtracting from 1, this is not a big difference, but I would suggest a better consistency.

In my opinion, this approach is valuable, and we should include it in the pipeline. However, I would suggest extending the labels_transfer/knn component instead of creating a new one. Lmk if I can help :)

@dorien-er
Copy link
Contributor Author

Thanks a lot for the feedback!

I agree there is a lot of overlap between the components - I currently implemented it as a separate component rather than updating the existing one, to avoid disrupting existing workflows because there are some breaking changes. Ideally, we can deprecate one of the two components (now or at a later time point) and combine the functionality of both into a single component, so we avoid having all the duplicate code.

@rcannood and @VladimirShitov wdyt? Worth already extending the existing KNN component now instead of implementing parallel ones? We'd also need to update the src/labels_transfer/xgboost component in that case, because it relies on the same common arguments

@VladimirShitov
Copy link
Collaborator

I merged @dorien-er 's implementation with the previous one. The older knn component can now be deleted. Would be great to standardize the xgboost component with the new annotation format as well. We can then clean up the code a bit more.

Would appreciate a review @DriesSchaumont, @rcannood

@VladimirShitov
Copy link
Collaborator

VladimirShitov commented Aug 14, 2024

I opened a new PR making XGBoost compatible with the new annotation workflow: #858 . It also deletes the outdated knn component and fixes tests here.

We can first merge it to this branch or do it separately

VladimirShitov and others added 3 commits August 28, 2024 12:16
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
Comment on lines 48 to 54
def distances_to_affinities(distances):
stds = np.std(distances, axis=1)
stds = (2.0 / stds) ** 2
stds = stds.reshape(-1, 1)
distances_tilda = np.exp(-np.true_divide(distances, stds))

return distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True)
Copy link
Contributor Author

@dorien-er dorien-er Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VladimirShitov I'm noticing that with the current test data, this is returning all NaN, resulting in all NaN probabilities, because of zero devision (also the case in the previous KNN component). It might be due to rounding/floating point precision. Do you have time to look into this? If not, can you provide me with a reference on which this functionality is based?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I didn't notice that! The function should work, I successfully used it in several projects as well as other people. It was suggested in the scArches paper (see the "Cell type annotation" chapter in the methods), and is common approach in atlassing. Human Lung Cell Atlas, for example, leverages the same approach. Maybe something is wrong with the test data?

Copy link
Contributor Author

@dorien-er dorien-er Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like it's happening in the return statement distances_tilda / np.sum(distances_tilda, axis=1, keepdims=True): In this case, there are many rows where the sum of the distances_tilda is zero (the standard deviations are very low, causing np.exp(-np.true_divide(distances, stds)) to approach zero).

I've made an update to the normalization: if the sum of a row of the distance tilda equals 0, set the normalized values of that row all to one.

LMKWYT

@dorien-er
Copy link
Contributor Author

Suggestion: the KNNClassifier also accepts a callable for the weights kwarg, so I just pushed a change where the gaussian weights calculation is directly passed to the classifier. It simplifies the code quite a bit, but also ensures that the sum of probabilities across detected neighbours equals 1. @VladimirShitov and @DriesSchaumont lmkwyt!

@VladimirShitov
Copy link
Collaborator

Thank you, @dorien-er ! Looks much clearer. Are the results identical to the previous?

@dorien-er
Copy link
Contributor Author

Thank you, @dorien-er ! Looks much clearer. Are the results identical to the previous?

Yes!

Copy link
Member

@DriesSchaumont DriesSchaumont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (if the tests succeed)!

@DriesSchaumont DriesSchaumont merged commit 7a90f3a into main Aug 30, 2024
4 checks passed
@DriesSchaumont DriesSchaumont deleted the pynndescent-knn branch August 30, 2024 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants